Try to model the core's behavior, which accesses the cache at word granularity. This also simplifies the coalescer design: the coalescer no longer needs to uncoalesce a response data chunk into single bytes, so it needs fewer muxes.
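As a rough software illustration of why word granularity needs less selection logic, the sketch below splits a coalesced response chunk back into per-lane words. It is a behavioral model only, not the actual hardware. The 4-byte word size, the 32-byte chunk, and the names `uncoalesce_words` and `mux_inputs_per_lane` are assumptions made for this example and do not come from the design itself.

```python
from typing import List

WORD_BYTES = 4  # assumed word size; not stated in the original description


def uncoalesce_words(chunk: bytes, offsets: List[int]) -> List[bytes]:
    """Return one word per lane, given each lane's byte offset into the chunk."""
    words = []
    for off in offsets:
        # Word-granular access: every lane offset must be word-aligned.
        if off % WORD_BYTES != 0:
            raise ValueError(f"offset {off} is not word-aligned")
        if off < 0 or off + WORD_BYTES > len(chunk):
            raise ValueError(f"offset {off} is outside the chunk")
        words.append(chunk[off:off + WORD_BYTES])
    return words


def mux_inputs_per_lane(chunk_bytes: int, granularity: int) -> int:
    """Count the candidate positions a lane's select logic chooses from."""
    return chunk_bytes // granularity


chunk = bytes(range(32))  # hypothetical 32-byte coalesced response
lanes = uncoalesce_words(chunk, [0, 4, 8, 28])
```

Under these assumptions each lane selects among 8 word positions (`mux_inputs_per_lane(32, 4)`) rather than 32 byte positions (`mux_inputs_per_lane(32, 1)`), which is the intuition behind the smaller mux count.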