& split single-time GEMM into a separate function.
thread_block_gemm is meant to be reusable, so it should not assume the layout of the kernel arg struct. threadblock_dim_y was also ambiguous, and its name did not match how it was used: it held the number of warps that participate in a barrier, not the y dimension of the threadblock.
These changes make thread_block_gemm a reusable library function that other kernels can call for GEMM operations.