This makes thread_block_gemm a reusable library function that other kernels can call for GEMM operations.