TMC Operand Debug Notes
This note records a failure pattern found while debugging
case12_3_scalar_tmem_lane_store.
Symptom
The simulator keeps producing trace output, but the kernel makes no forward progress:
- a tensor worker repeatedly polls a ready flag such as g_case_mem[0];
- the poll load always returns the old value;
- the scalar warp has already committed the work that precedes the ready flag store;
- the expected ready flag store never appears in the LSU trace.
For case12_3_scalar_tmem_lane_store, the tensor worker was looping at
PC=0x80000034..0x80000048, repeatedly reading g_case_mem[0] at
0x20000430 as 0. The scalar warp committed the scalar TMEM store and the
following TMC, but never issued the next ready flag store at PC=0x8000015c.
Root Cause Pattern
Vortex scalar registers are lane-local. A value computed while only lane 0 is active is only known to be valid in lane 0. If that register is later consumed by an instruction while multiple lanes are active, the inactive lanes may hold stale or zero values.
This is especially easy to miss around TMC:
# Bad when xN was defined while only lane 0 was active.
vx_tmc all_lanes
...
vx_tmc xN
The failing case12_3 sequence passed 1u as a C operand, and the compiler
materialized it in a register before the switch to all lanes. The runtime trace
showed the source register for the final TMC(1) as:
rs1_data={0x0, 0x0, 0x0, 0x1}
After that TMC, the scalar warp did not fetch the ready flag store.
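The lane-local behaviour can be reproduced with a small model. This is a sketch only: the Warp class, the four-lane width, and the choice to model inactive lanes as zero (the hardware may instead hold stale values) are assumptions for illustration, and the model does not attempt to describe how vx_tmc itself consumes the per-lane source.

```python
NUM_LANES = 4  # assumption: four lanes, matching the four-entry rs1_data trace


class Warp:
    """Toy model of per-lane registers written under a thread mask."""

    def __init__(self):
        self.tmask = 0b0001          # start with only lane 0 active
        self.regs = {}               # register name -> per-lane values

    def li(self, reg, imm):
        # Only active lanes receive the immediate; inactive lanes keep
        # whatever they had (modelled here as 0).
        lanes = self.regs.setdefault(reg, [0] * NUM_LANES)
        for lane in range(NUM_LANES):
            if (self.tmask >> lane) & 1:
                lanes[lane] = imm

    def tmc(self, mask):
        self.tmask = mask

    def read(self, reg):
        # Per-lane view, lane 0 first.
        return list(self.regs.get(reg, [0] * NUM_LANES))


# Bad: the register is defined while only lane 0 is active.
bad_warp = Warp()
bad_warp.li("xN", 1)
bad_warp.tmc(0b1111)
bad = bad_warp.read("xN")

# Good: the register is defined inside the all-lane region.
good_warp = Warp()
good_warp.tmc(0b1111)
good_warp.li("t2", 1)
good = good_warp.read("t2")

print(bad, good)
```

Here bad is [1, 0, 0, 0] with lane 0 first, which corresponds to the failing trace if the trace prints the highest lane first (an assumption about the trace format).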
Correct Pattern
When a value is consumed under an all-lane mask, define that value under the same all-lane mask unless the instruction semantics explicitly use only lane 0.
For switching back to lane 0 from an all-lane region, keep the immediate
materialization and the TMC adjacent inside the all-lane region:
vx_tmc all_lanes
...
fence rw, rw
li t2, 1
vx_tmc t2
The expected trace at the final TMC is:
rs1_data={0x1, 0x1, 0x1, 0x1}
The library helper vx_tmc_one() follows this pattern because it emits
li a0, 1 and vx_tmc a0 in the same volatile asm block. It is safe when
called while all lanes are active.
Fast Log Checks
Use the simulation log to distinguish a simulator stall from a kernel-level wait loop:
stat -c '%s %y' chipyard/sims/verilator/output/chipyard.harness.TestHarness.VirgoBlackwellConfig/kernel.radiance.log
tail -n 80 chipyard/sims/verilator/output/chipyard.harness.TestHarness.VirgoBlackwellConfig/kernel.radiance.log
If the file is still growing but the tail repeatedly shows the same PC range, the simulator is alive and the kernel is probably spinning.
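The "same PC range" check can be automated over the tail of the log. The sketch below assumes only that trace lines contain tokens of the form PC=0x... (as the rg pattern in this note suggests); the function names, the example line layout, and the thresholds are hypothetical.

```python
import re

PC_RE = re.compile(r"PC=0x([0-9a-fA-F]+)")


def pc_summary(lines):
    """Return min/max/distinct/sample counts of PCs seen in the lines."""
    pcs = [int(m.group(1), 16) for line in lines for m in PC_RE.finditer(line)]
    if not pcs:
        return None
    return {
        "min": min(pcs),
        "max": max(pcs),
        "distinct": len(set(pcs)),
        "samples": len(pcs),
    }


def looks_like_spin(lines, max_span=0x40, min_samples=20):
    """Heuristic: many samples confined to a narrow PC window."""
    s = pc_summary(lines)
    return (
        s is not None
        and s["samples"] >= min_samples
        and s["max"] - s["min"] <= max_span
    )


# Hypothetical tail: a worker looping over PC=0x80000034..0x80000048.
example_tail = [
    f"wid=1, PC=0x{pc:08x}"
    for _ in range(5)
    for pc in range(0x80000034, 0x8000004C, 4)
]
summary = pc_summary(example_tail)
spinning = looks_like_spin(example_tail)
print(hex(summary["min"]), hex(summary["max"]), spinning)
```

Feeding it the last few hundred lines of kernel.radiance.log gives a quick yes/no before reading the trace by hand. The thresholds are guesses and should be tuned to the loop size.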
For a ready flag handoff, check both the polling load and the producer store:
rg -n "0x20000430|0x9900|wid=0, PC=0x8000014|wid=0, PC=0x8000015" \
chipyard/sims/verilator/output/chipyard.harness.TestHarness.VirgoBlackwellConfig/kernel.radiance.log
Interpretation:
- load repeatedly returns 0 and no later 0x9900 store exists: producer did not reach the ready flag store;
- ready flag store exists but consumer still reads old data: investigate memory ordering or address aliasing;
- producer stops immediately after a TMC: inspect that TMC source register across all lanes.
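The three interpretations can be written as a small decision helper. It is a sketch: the function and its boolean inputs are hypothetical, and the booleans are meant to be filled in by hand from the rg output, not parsed from the log.

```python
def diagnose(poll_returns_old, store_seen, stopped_after_tmc):
    """Map the three log observations to the follow-up actions listed above."""
    findings = []
    if poll_returns_old and not store_seen:
        findings.append("producer did not reach the ready flag store")
    if poll_returns_old and store_seen:
        findings.append("investigate memory ordering or address aliasing")
    if stopped_after_tmc:
        findings.append("inspect the TMC source register across all lanes")
    return findings


# Observations reported for case12_3: stale poll, no store, stop after TMC.
case12_3 = diagnose(poll_returns_old=True, store_seen=False, stopped_after_tmc=True)
print(case12_3)
```

More than one finding can apply at once. In case12_3 the missing store and the stop after TMC were both visible, and the second observation pointed at the cause.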
Dump Checks
Check that the final lane-mask narrowing defines 1 immediately before TMC
inside the all-lane region:
rg -n "li\\s+.*1|vx_tmc" \
kernels/wu_arch_cases/case12_3_scalar_tmem_lane_store/kernel.radiance.dump
The fixed case12_3 dump has:
li t2, 1
vx_tmc t2
auipc a3, 1
sw a5, ...
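The adjacency check can also be scripted instead of read off the rg output. This sketch assumes a dump with one instruction per line in objdump-like text; the function name and the regular expressions are illustrative and may need adjusting to the real dump layout.

```python
import re

LI_RE = re.compile(r"\bli\s+(\w+),\s*(-?\w+)")
TMC_RE = re.compile(r"\bvx_tmc\s+(\w+)")


def tmc_sources(dump_lines):
    """For each vx_tmc, report (register, immediate) where the immediate is
    taken from an `li` of the same register on the previous non-blank line,
    or None when no such adjacent `li` exists."""
    results = []
    prev = None
    for line in dump_lines:
        text = line.strip()
        if not text:
            continue
        tmc = TMC_RE.search(text)
        if tmc:
            reg = tmc.group(1)
            li = LI_RE.search(prev) if prev else None
            imm = li.group(2) if li and li.group(1) == reg else None
            results.append((reg, imm))
        prev = text
    return results


# Hypothetical dump excerpt: two adjacent pairs and one separated pair.
example_dump = [
    "li a5, 15",
    "vx_tmc a5",
    "...",
    "li t2, 1",
    "vx_tmc t2",
    "auipc a3, 1",
    "li a0, 1",
    "fence rw, rw",
    "vx_tmc a0",
]
sources = tmc_sources(example_dump)
print(sources)
```

A vx_tmc reported with None is not necessarily wrong, but it is the one to inspect: its source register was defined somewhere else, possibly under a different lane mask.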
Case12-Case15 Audit
The following cases were checked after fixing case12_3:
- case12_flash_pv_accum
- case12_2_flash_pv_p_probe
- case13_flash_pv_two_warps
- case14_flash_pv_k64
- case15_flash_softmax_pv_stage
They switch to all lanes with vx_tmc(wu_bw_all_lanes_mask()) and switch back
with vx_tmc_one(). Their dumps show the safe adjacent sequence:
li a0, 1
vx_tmc a0
No analogous source change is needed for those cases.
Scalar TMEM Fill Lane Coverage
wu_bw_fill_tmem_tile() must run with all fragment lanes active. A scalar TMEM
store writes each active lane's source word into the matching TMEM fragment word:
TMEM[addr].word[lane] = rs2_data[lane]
Therefore a fill loop executed with tmask=0001 only initializes word 0 of
each 16-byte fragment. Word 1 and later can retain old TMEM data. In
case12_1_scalar_tmem_cb_probe, this showed up as a normal completion path
followed by verification failure 0x14; g_aux[0] was 1, and g_aux[1]
held the stale copied-back word.
The correct pattern is:
vx_tmc(wu_bw_all_lanes_mask());
wu_bw_fill_tmem_tile(wu_bw_tmem_a_byte_base(0), WU_BW_FP16_ONE_PACKED);
vx_tmc_one();
The dump should show the fill loop bracketed by all-lane and lane-0 masks:
li a5, 15
vx_tmc a5
...
vx_cmov zero, a0, a5, a2
...
li a0, 1
vx_tmc a0
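The lane coverage rule can be checked with a toy model of the store semantics given above. This is a sketch under stated assumptions: the helper names are hypothetical, the fragment is modelled as four 32-bit words (16 bytes), and the packed FP16 value 0x3C003C00 is a guess at what WU_BW_FP16_ONE_PACKED holds.

```python
WORDS_PER_FRAGMENT = 4      # 16-byte fragment, 4-byte words
STALE = 0xDEADBEEF          # stand-in for old TMEM contents
ONE_PACKED = 0x3C003C00     # assumption: two FP16 1.0 values packed in a word


def scalar_tmem_store(tmem, addr, rs2_data, tmask):
    """TMEM[addr].word[lane] = rs2_data[lane], for active lanes only."""
    for lane in range(WORDS_PER_FRAGMENT):
        if (tmask >> lane) & 1:
            tmem[addr][lane] = rs2_data[lane]


def fill_tile(tmem, value, tmask):
    """Store `value` to every fragment under the given lane mask."""
    for addr in range(len(tmem)):
        scalar_tmem_store(tmem, addr, [value] * WORDS_PER_FRAGMENT, tmask)


# Fill under tmask=0001: only word 0 of each fragment is initialized.
lane0_only = [[STALE] * WORDS_PER_FRAGMENT for _ in range(2)]
fill_tile(lane0_only, ONE_PACKED, 0b0001)

# Fill under tmask=1111: every word is initialized.
all_lanes = [[STALE] * WORDS_PER_FRAGMENT for _ in range(2)]
fill_tile(all_lanes, ONE_PACKED, 0b1111)

print([hex(w) for w in lane0_only[0]])
print([hex(w) for w in all_lanes[0]])
```

In the lane-0-only run, words 1 to 3 keep the stale value, which matches the case12_1 symptom of a correct first word followed by a stale copied-back word.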