Try to model the behavior of the core, which accesses the cache at word
granularity. This also simplifies the coalescer design: the coalescer no longer
needs to uncoalesce response data chunks into single bytes, so it needs fewer
muxes.
The queue enabled shifting of its registers whenever deq.ready was 1,
even when the queue was empty. This caused `wen` to block writing
enq.bits to every entry in the queue. Fixed by setting `shift` to 0
when the queue is empty.
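A minimal behavioral sketch of the failure mode, in C++ rather than Chisel.
The actual `wen` logic is not shown here, so the write-slot calculation below
is an assumption: it supposes that a shifting cycle writes one slot below the
tail, which matches no entry when the queue is empty. `ShiftQueue`, `step`,
and `gate_shift_on_empty` are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Cycle-level model of a shift-register queue. entries[0] is the head;
// a shift moves every entry one slot toward the head.
template <std::size_t Depth>
struct ShiftQueue {
    std::array<uint32_t, Depth> entries{};
    std::size_t count = 0;

    // Advance one clock cycle. Returns true if enq_bits was stored.
    bool step(bool enq_valid, uint32_t enq_bits, bool deq_ready,
              bool gate_shift_on_empty) {
        const bool empty = (count == 0);
        const bool full = (count == Depth);
        // Buggy behavior: shift follows deq_ready alone.
        // Fixed behavior: shift is forced to 0 while the queue is empty.
        const bool shift =
            gate_shift_on_empty ? (deq_ready && !empty) : deq_ready;
        const bool enq_fire = enq_valid && (!full || shift);

        // Assumed write-enable scheme: while shifting, the new element lands
        // one slot below the tail. On an empty queue that slot is -1, so no
        // per-entry write enable is asserted and the element is lost.
        const long write_slot = static_cast<long>(count) - (shift ? 1 : 0);

        std::array<uint32_t, Depth> next = entries;
        if (shift) {
            for (std::size_t i = 0; i + 1 < Depth; ++i) next[i] = entries[i + 1];
        }
        bool wrote = false;
        for (std::size_t i = 0; i < Depth; ++i) {
            const bool wen = enq_fire && (static_cast<long>(i) == write_slot);
            if (wen) {
                next[i] = enq_bits;
                wrote = true;
            }
        }
        entries = next;

        const bool deq_fire = shift && !empty;
        count = count + (wrote ? 1 : 0) - (deq_fire ? 1 : 0);
        return wrote;
    }
};
```

With `gate_shift_on_empty` set to false, an enqueue into an empty queue while
deq.ready is 1 is dropped in this model; with it set to true, the element
lands in entry 0.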
Inside the DPI code, keep a vector of unique_ptrs that act as handles to the
different trace logger instances. Each logger instance belongs to a single
instance of the Verilog module, and several of these Verilog modules may be
instantiated in the Chisel module (see simReq and simResp in MemTraceLogger).
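The handle scheme could look roughly like the sketch below. The names
`TraceLogger`, `trace_logger_init`, and `trace_logger_log` are hypothetical,
as is the choice to buffer lines in memory instead of writing a file; only the
vector-of-unique_ptrs idea comes from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical logger that keeps its lines in memory.
class TraceLogger {
public:
    explicit TraceLogger(std::string name) : name_(std::move(name)) {}
    void log(uint64_t cycle, uint64_t address) {
        lines_.push_back(name_ + " " + std::to_string(cycle) + " " +
                         std::to_string(address));
    }
    const std::vector<std::string>& lines() const { return lines_; }

private:
    std::string name_;
    std::vector<std::string> lines_;
};

// One entry per Verilog module instance; the vector index is the handle.
static std::vector<std::unique_ptr<TraceLogger>> loggers;

// Meant to be called once per Verilog module instance, which then stores
// the returned handle and passes it back on every later call.
extern "C" int trace_logger_init(const char* name) {
    loggers.push_back(std::make_unique<TraceLogger>(name));
    return static_cast<int>(loggers.size()) - 1;
}

extern "C" void trace_logger_log(int handle, uint64_t cycle, uint64_t address) {
    assert(handle >= 0 && static_cast<std::size_t>(handle) < loggers.size());
    loggers[static_cast<std::size_t>(handle)]->log(cycle, address);
}
```

Because each module instance only ever holds an integer, two instances such
as simReq and simResp cannot write into each other's logger as long as they
keep their own handle.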
* TileLink doesn't alter the `address` field from what we originally used in the
Get/Put call.
* Same goes for the `data` field.
* The only thing TL generates by itself is `mask`. This means we have to align
data to the beatBytes boundary ourselves when Putting, and also take the right
sublanes using the mask when Getting.
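The lane arithmetic implied by the points above can be sketched in C++ as
follows. This assumes a beat of 8 bytes and naturally aligned accesses no
larger than one beat; the function names are hypothetical and are not part of
any TileLink library.

```cpp
#include <cstdint>

constexpr unsigned kBeatBytes = 8;  // assumed beat size

// Byte-lane mask for an access of size_bytes (a power of two, at most
// kBeatBytes) at a naturally aligned address.
inline uint8_t lane_mask(uint64_t address, unsigned size_bytes) {
    const unsigned offset = static_cast<unsigned>(address % kBeatBytes);
    const unsigned ones = (1u << size_bytes) - 1u;
    return static_cast<uint8_t>(ones << offset);
}

// Put: shift the payload into the byte lanes selected by the address.
inline uint64_t align_put_data(uint64_t address, uint64_t payload) {
    const unsigned offset = static_cast<unsigned>(address % kBeatBytes);
    return payload << (8 * offset);
}

// Get: pull the payload back out of a full response beat.
inline uint64_t extract_get_data(uint64_t address, unsigned size_bytes,
                                 uint64_t beat) {
    const unsigned offset = static_cast<unsigned>(address % kBeatBytes);
    const uint64_t shifted = beat >> (8 * offset);
    if (size_bytes >= kBeatBytes) return shifted;
    return shifted & ((uint64_t{1} << (8 * size_bytes)) - 1);
}
```

For example, a 4-byte Put to address 0x14 uses lanes 4 to 7, so the payload
has to be shifted left by 32 bits before it is handed to the Put call.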
TODO: since TileLink rounds all addresses down to a multiple of its beat
size (8 in the current code), we can't directly compare the memory trace
input to its output. Need to take masks into account.
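One possible shape for such a mask-aware comparison, sketched in C++. The
record layouts `TraceIn` and `TraceOut` are hypothetical, and the sketch
assumes that the output address is rounded down to the beat, that the mask is
a contiguous run of lanes, and that data sits in the masked lanes.

```cpp
#include <cstdint>

struct TraceIn  { uint64_t address; unsigned size_bytes; uint64_t data; };
struct TraceOut { uint64_t address; uint8_t mask; uint64_t data; };

// Index of the lowest set lane; mask must be non-zero.
inline unsigned lowest_lane(uint8_t mask) {
    unsigned lane = 0;
    while (((mask >> lane) & 1u) == 0) ++lane;
    return lane;
}

// Number of active byte lanes in the mask.
inline unsigned lane_count(uint8_t mask) {
    unsigned n = 0;
    for (unsigned i = 0; i < 8; ++i) n += (mask >> i) & 1u;
    return n;
}

// True if the output record describes the same access as the input record
// once the byte offset is recovered from the mask.
inline bool same_access(const TraceIn& in, const TraceOut& out) {
    if (out.mask == 0) return false;
    const unsigned lane = lowest_lane(out.mask);
    const unsigned bytes = lane_count(out.mask);
    const uint64_t payload_mask =
        bytes >= 8 ? ~uint64_t{0} : ((uint64_t{1} << (8 * bytes)) - 1);
    const uint64_t payload = (out.data >> (8 * lane)) & payload_mask;
    return out.address + lane == in.address && bytes == in.size_bytes &&
           payload == (in.data & payload_mask);
}
```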
Instead of making MemTraceLogger a TL slave, make it an IdentityNode
that simply snoops on the TL edges and generates logs.
We can attach a TLRAM downstream to actually get responses back,
rather than having MemTraceLogger simply absorb all requests.
When the invalidate signal is asserted for the queue head, that head
should be gone by the next cycle, regardless of what deq.ready was
in the previous cycle.
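The intended behavior can be stated as a small cycle-level model in C++.
This is a sketch only: `InvalidatingQueue` and its `step` signature are
hypothetical, and the model is unbounded for brevity, unlike the hardware
queue.

```cpp
#include <cstdint>
#include <deque>

struct InvalidatingQueue {
    std::deque<uint32_t> entries;  // front() is the queue head

    // Advance one clock cycle; all inputs are sampled in the current cycle.
    void step(bool enq_valid, uint32_t enq_bits, bool deq_ready,
              bool invalidate_head) {
        // The head leaves on a normal dequeue or on an invalidate. The
        // invalidate path does not look at deq_ready at all.
        const bool drop_head =
            !entries.empty() && (deq_ready || invalidate_head);
        if (drop_head) entries.pop_front();
        if (enq_valid) entries.push_back(enq_bits);
    }
};
```

After a cycle with `invalidate_head` set and `deq_ready` clear, the old head
is no longer visible at the front of the queue.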