wu_arch_hgemm
Tensor-warp HGEMM smoke test for the Wu split scalar/tensor warp configuration with the 4-lane Blackwell tensor-core path.
Scalar warp 0 initializes the shared-memory B operand, spawns only the warps
selected by the tensor warp mask, waits for tensor warps
NUM_SCALAR_WARPS..NUM_WARPS-1 to finish, and reports completion through tohost.
Tensor warps execute the Blackwell custom HGEMM instruction sequence using
16-byte fragments and then stop themselves.
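
The tensor warp mask described above can be sketched in C as follows. This is a minimal illustration, not the test's actual source: it assumes that bit i of the mask selects warp i, that there are at most 32 warps, and that NUM_SCALAR_WARPS and NUM_WARPS are 4 and 8. The helper names warp_range_mask and tensor_warp_mask are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only; the real test would take
 * these from its build configuration. */
#define NUM_SCALAR_WARPS 4u
#define NUM_WARPS        8u

/* Hypothetical helper: returns a mask with one bit set for every warp
 * index in [first, end). Assumes bit i selects warp i and end <= 32. */
static uint32_t warp_range_mask(unsigned first, unsigned end)
{
    assert(first <= end && end <= 32u);

    /* Bits below `end`; handled separately at 32 because shifting a
     * 32-bit value by 32 is undefined behavior in C. */
    uint32_t below_end = (end == 32u) ? 0xFFFFFFFFu : ((1u << end) - 1u);
    uint32_t below_first = (first == 32u) ? 0xFFFFFFFFu : ((1u << first) - 1u);

    /* Keep the bits below `end` that are not also below `first`. */
    return below_end & ~below_first;
}

/* Hypothetical helper: mask covering tensor warps
 * NUM_SCALAR_WARPS..NUM_WARPS-1, which scalar warp 0 would spawn. */
static uint32_t tensor_warp_mask(void)
{
    return warp_range_mask(NUM_SCALAR_WARPS, NUM_WARPS);
}
```

With the assumed values, tensor_warp_mask() sets only bits 4 through 7, so the scalar warps are left out of the spawn and only the tensor warps start the HGEMM sequence.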