The kernel is software-pipelined around two GEMMs and a softmax, so a tile takes two loop iterations to fully complete.
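To make the two-iteration latency concrete, here is a minimal pure-Python sketch of one way such a pipeline could be scheduled. It is an illustration under assumptions, not the kernel's actual code: the split into "GEMM 0 for tile i" and "softmax plus GEMM 1 for tile i-1", the attention-style data flow (scores = Q K^T, output = softmax(scores) V), and every name below are hypothetical. The stages run sequentially here; on a GPU they would overlap.

```python
import math

# Hypothetical sketch: stage split and names are assumptions, not the kernel's code.

def matmul(a, b):
    """Plain nested-list matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(s):
    """Numerically stable row-wise softmax."""
    out = []
    for row in s:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        total = sum(e)
        out.append([x / total for x in e])
    return out

def attention_reference(q_tiles, k, v):
    """Unpipelined baseline: each tile is finished before the next one starts."""
    kt = transpose(k)
    return [matmul(softmax_rows(matmul(q, kt)), v) for q in q_tiles]

def attention_pipelined(q_tiles, k, v):
    """Two-stage software pipeline.

    Iteration i issues GEMM 0 for tile i and finishes tile i-1
    (softmax + GEMM 1), so each tile spans two iterations.
    """
    kt = transpose(k)
    outputs = []
    schedule = []      # (iteration, stage, tile) records, for inspection
    in_flight = None   # GEMM 0 result carried over from the previous iteration
    for i in range(len(q_tiles) + 1):   # one extra iteration drains the pipeline
        next_scores = None
        if i < len(q_tiles):
            next_scores = matmul(q_tiles[i], kt)        # GEMM 0, tile i
            schedule.append((i, "gemm0", i))
        if in_flight is not None:
            probs = softmax_rows(in_flight)             # softmax, tile i-1
            outputs.append(matmul(probs, v))            # GEMM 1, tile i-1
            schedule.append((i, "softmax+gemm1", i - 1))
        in_flight = next_scores
    return outputs, schedule
```

In this sketch, N tiles take N + 1 iterations: the first iteration only fills the pipeline and the last only drains it. The `schedule` list shows that tile t starts in iteration t and completes in iteration t + 1.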