This is useful when you want the tensor core to write its output to multiple accumulator registers, e.g. when computing an outer product within the register file (RF).
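
As a rough illustration of the outer-product pattern, the pure-Python sketch below models register-resident fragments as small nested lists. It is not tied to any particular library: `TILE`, `matmul_tile`, and `outer_product_step` are hypothetical names, and `matmul_tile` merely stands in for a single tensor core multiply-accumulate instruction. The point it shows is that M fragments of A and N fragments of B feed M x N separate accumulators, so each loaded operand is reused several times without leaving the RF.

```python
# Hypothetical model: each "fragment" is a small tile assumed to live in registers.
TILE = 2  # illustrative tile edge only; real tensor core shapes are larger


def matmul_tile(a, b, acc):
    """Compute acc += a @ b on TILE x TILE lists (stand-in for one MMA instruction)."""
    for i in range(TILE):
        for j in range(TILE):
            acc[i][j] += sum(a[i][k] * b[k][j] for k in range(TILE))


def outer_product_step(a_frags, b_frags, accs):
    """Pair every A fragment with every B fragment, one accumulator per pair."""
    for m, a in enumerate(a_frags):
        for n, b in enumerate(b_frags):
            matmul_tile(a, b, accs[m][n])


# Two A fragments and two B fragments give a 2 x 2 grid of accumulators.
a_frags = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
b_frags = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]
accs = [[[[0] * TILE for _ in range(TILE)] for _ in b_frags] for _ in a_frags]

outer_product_step(a_frags, b_frags, accs)
```

Here four multiply-accumulate operations are issued from only four loaded fragments, whereas a single-accumulator scheme would need to load a fresh pair of operands for each one.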