Enables a fairer comparison between the core-coupled tensor core and the
Hopper tensor core. The latter benefits from coalesced, full-throughput
move-out to GMEM because it does not use the 1x2 interleaved register
mapping. This means the result matrix is stored swizzled in GMEM,
without breaking correctness.
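
One way to see why a swizzled result can still be correct is that the
swizzle only permutes where each element lands, so a reader applying the
same mapping recovers the original matrix. The sketch below is
illustrative only: `swizzle_col`, `store_swizzled`, and `load_unswizzled`
are hypothetical names, and the column mapping (de-interleaving 1x2
column pairs within a row) is an assumption, not the actual layout
produced by this change.

```python
def swizzle_col(c, n):
    """Hypothetical mapping (an assumption, not the real hardware layout):
    de-interleave 1x2 column pairs within a row of even width n."""
    pair, lane = divmod(c, 2)
    return lane * (n // 2) + pair


def store_swizzled(mat):
    """Write each element of mat to its swizzled column, as a stand-in
    for storing registers to GMEM without un-interleaving them first."""
    n = len(mat[0])
    out = [[None] * n for _ in mat]
    for r, row in enumerate(mat):
        for c, v in enumerate(row):
            out[r][swizzle_col(c, n)] = v
    return out


def load_unswizzled(buf):
    """Read the buffer back through the same mapping to recover the
    logical row-major matrix."""
    n = len(buf[0])
    return [[row[swizzle_col(c, n)] for c in range(n)] for row in buf]


# A small 2x8 "result matrix" with distinct entries.
result = [[r * 8 + c for c in range(8)] for r in range(2)]
gmem = store_swizzled(result)

# The stored layout differs from row-major, but no element is lost
# and the round trip is exact.
assert gmem != result
assert load_unswizzled(gmem) == result
```

Because the mapping is a bijection on column indices, any check that
reads the buffer through the same mapping sees the same values as a
row-major store would have produced.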