This is already done in the higher-level thread_block_gemm function, and flash also has its own explicit DMA sync code. Also, executing it twice sometimes causes vx_bar to be moved into a branch, which we really want to avoid.