wu-arch/kernels

Zhongdi LUO d6fbd447c3 Add Wu TMEM FlashAttention validation cases

2026-06-24 06:26:30 +00:00

TMC Operand Debug Notes

This note records a failure pattern found while debugging case12_3_scalar_tmem_lane_store.

Symptom

The simulator keeps producing trace output, but the kernel makes no forward progress:

a tensor worker repeatedly polls a ready flag such as g_case_mem[0];
the poll load always returns the old value;
the scalar warp has already committed the work before the ready flag store;
the expected ready flag store never appears in the LSU trace.

For case12_3_scalar_tmem_lane_store, the tensor worker was looping at PC=0x80000034..0x80000048, repeatedly reading g_case_mem[0] at 0x20000430 as 0. The scalar warp committed the scalar TMEM store and the following TMC, but never issued the next ready flag store at PC=0x8000015c.

Root Cause Pattern

Vortex scalar registers are lane-local. A value computed while only lane 0 is active is only known to be valid in lane 0. If that register is later consumed by an instruction while multiple lanes are active, the lanes that were inactive when the value was defined may hold stale or zero values.

This is especially easy to miss around TMC:

# Bad when xN was defined while only lane 0 was active.
vx_tmc all_lanes
...
vx_tmc xN

The failed case12_3 sequence passed 1u as a C operand, and the compiler materialized it in a register before the switch to all lanes. The runtime trace showed the source register for the final TMC(1) as:

rs1_data={0x0, 0x0, 0x0, 0x1}

After that TMC, the scalar warp did not fetch the ready flag store.
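The root cause can be reproduced in a small host-side model. The sketch below is not Vortex or simulator code: lane_reg_t, model_li() and model_operand_is_uniform() are hypothetical names, and the lane count of 4 is an assumption taken from the four-entry rs1_data trace. It only illustrates that a register written under a lane-0 mask is not uniform when read under an all-lane mask.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_LANES 4 /* assumption: four lanes, as in the 4-entry rs1_data trace */

/* Hypothetical model of one lane-local register: one copy per lane. */
typedef struct {
    uint32_t lane[NUM_LANES];
} lane_reg_t;

/* Models `li reg, value` under thread mask `tmask`:
 * only lanes whose mask bit is set are written. */
static void model_li(lane_reg_t *reg, uint32_t tmask, uint32_t value) {
    assert(tmask < (1u << NUM_LANES));
    for (int i = 0; i < NUM_LANES; ++i) {
        if (tmask & (1u << i)) {
            reg->lane[i] = value;
        }
    }
}

/* Returns 1 if every lane active under `tmask` holds `expected`, else 0. */
static int model_operand_is_uniform(const lane_reg_t *reg, uint32_t tmask,
                                    uint32_t expected) {
    for (int i = 0; i < NUM_LANES; ++i) {
        if ((tmask & (1u << i)) && reg->lane[i] != expected) {
            return 0;
        }
    }
    return 1;
}
```

Calling model_li() with mask 0x1 and then checking uniformity under mask 0xF fails in this model, because one lane holds 1 and the other three still hold their old value. That is the same shape as the failing trace.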

Correct Pattern

When a value is consumed under an all-lane mask, define that value under the same all-lane mask unless the instruction semantics explicitly use only lane 0.

For switching back to lane 0 from an all-lane region, keep the immediate materialization and the TMC adjacent inside the all-lane region:

vx_tmc all_lanes
...
fence rw, rw
li t2, 1
vx_tmc t2

The expected trace at the final TMC is:

rs1_data={0x1, 0x1, 0x1, 0x1}

The library helper vx_tmc_one() follows this pattern because it emits li a0, 1 and vx_tmc a0 in the same volatile asm block. It is safe when called while all lanes are active.

Fast Log Checks

Use the simulation log to distinguish a simulator stall from a kernel-level wait loop:

stat -c '%s %y' chipyard/sims/verilator/output/chipyard.harness.TestHarness.VirgoBlackwellConfig/kernel.radiance.log
tail -n 80 chipyard/sims/verilator/output/chipyard.harness.TestHarness.VirgoBlackwellConfig/kernel.radiance.log

If the file is still growing but the tail repeatedly shows the same PC range, the simulator is alive and the kernel is probably spinning.
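The "same PC range" check can also be automated over the tail lines. The following is a minimal sketch that assumes each relevant log line contains a PC=0x... token, as the rg patterns in this note suggest. The rest of the line format is not assumed, and parse_pc() and tail_is_spinning() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Extracts the hex value after the first "PC=0x" in `line`.
 * Returns 1 on success, 0 if the line has no parsable PC token. */
static int parse_pc(const char *line, uint32_t *pc_out) {
    const char *p = strstr(line, "PC=0x");
    if (p == NULL) {
        return 0;
    }
    const char *digits = p + 5; /* skip "PC=0x" */
    char *end = NULL;
    unsigned long value = strtoul(digits, &end, 16);
    if (end == digits) {
        return 0; /* no hex digits followed the prefix */
    }
    *pc_out = (uint32_t)value;
    return 1;
}

/* Returns 1 if at least one PC was found and every PC found lies in
 * [lo, hi]; lines without a PC token are ignored. */
static int tail_is_spinning(const char *const *lines, size_t count,
                            uint32_t lo, uint32_t hi) {
    assert(lo <= hi);
    size_t seen = 0;
    for (size_t i = 0; i < count; ++i) {
        uint32_t pc;
        if (!parse_pc(lines[i], &pc)) {
            continue;
        }
        if (pc < lo || pc > hi) {
            return 0;
        }
        ++seen;
    }
    return seen > 0;
}
```

For case12_3 the range would be 0x80000034 to 0x80000048. A positive result only says that the sampled tail stays inside the range. It does not say why the kernel is waiting.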

For a ready flag handoff, check both the polling load and the producer store:

rg -n "0x20000430|0x9900|wid=0, PC=0x8000014|wid=0, PC=0x8000015" \
  chipyard/sims/verilator/output/chipyard.harness.TestHarness.VirgoBlackwellConfig/kernel.radiance.log

Interpretation:

load repeatedly returns 0 and no later 0x9900 store exists: producer did not reach the ready flag store;
ready flag store exists but consumer still reads old data: investigate memory ordering or address aliasing;
producer stops immediately after a TMC: inspect that TMC source register across all lanes.
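The three interpretations above can be written as a small triage function. This is only a sketch of the decision order used in this note. handoff_obs_t, handoff_diag_t and diagnose_handoff() are hypothetical names, and the caller is assumed to fill in the observations by hand from the log.

```c
#include <assert.h>
#include <stddef.h>

/* Observations taken manually from the simulation log (hypothetical struct). */
typedef struct {
    int ready_store_seen;           /* producer's ready flag store is in the LSU trace */
    int consumer_reads_old;         /* polling load keeps returning the old value */
    int producer_stopped_after_tmc; /* producer's last committed instruction is a TMC */
} handoff_obs_t;

typedef enum {
    DIAG_NO_STALL_PATTERN,       /* none of the patterns in this note apply */
    DIAG_INSPECT_TMC_OPERAND,    /* check the TMC source register across all lanes */
    DIAG_PRODUCER_NOT_REACHED,   /* producer never reached the ready flag store */
    DIAG_ORDERING_OR_ALIASING    /* store exists but consumer still sees old data */
} handoff_diag_t;

/* Applies the checks from most specific to least specific. */
static handoff_diag_t diagnose_handoff(const handoff_obs_t *obs) {
    assert(obs != NULL);
    if (!obs->consumer_reads_old) {
        return DIAG_NO_STALL_PATTERN;
    }
    if (!obs->ready_store_seen) {
        return obs->producer_stopped_after_tmc ? DIAG_INSPECT_TMC_OPERAND
                                               : DIAG_PRODUCER_NOT_REACHED;
    }
    return DIAG_ORDERING_OR_ALIASING;
}
```

For case12_3 the observations were: the consumer read 0 repeatedly, no ready flag store appeared, and the producer stopped right after a TMC. In this sketch that maps to DIAG_INSPECT_TMC_OPERAND.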

Dump Checks

Check that the final lane-mask narrowing defines 1 immediately before TMC inside the all-lane region:

rg -n "li\\s+.*1|vx_tmc" \
  kernels/wu_arch_cases/case12_3_scalar_tmem_lane_store/kernel.radiance.dump

The fixed case12_3 dump has:

li      t2, 1
vx_tmc  t2
auipc   a3, 1
sw      a5, ...

Case12-Case15 Audit

The following cases were checked after fixing case12_3:

case12_flash_pv_accum
case12_2_flash_pv_p_probe
case13_flash_pv_two_warps
case14_flash_pv_k64
case15_flash_softmax_pv_stage

They switch to all lanes with vx_tmc(wu_bw_all_lanes_mask()) and switch back with vx_tmc_one(). Their dumps show the safe adjacent sequence:

li      a0, 1
vx_tmc  a0

No analogous source change is needed for those cases.

Scalar TMEM Fill Lane Coverage

wu_bw_fill_tmem_tile() must run with all fragment lanes active. A scalar TMEM store writes each active lane's source word into the matching TMEM fragment word:

TMEM[addr].word[lane] = rs2_data[lane]

Therefore a fill loop executed with tmask=0001 initializes only word 0 of each 16-byte fragment. Words 1 through 3 can retain old TMEM data. In case12_1_scalar_tmem_cb_probe, this showed up as a normal completion path followed by verification failure 0x14; g_aux[0] was 1, and g_aux[1] held the stale copied-back word.

The correct pattern is:

vx_tmc(wu_bw_all_lanes_mask());
wu_bw_fill_tmem_tile(wu_bw_tmem_a_byte_base(0), WU_BW_FP16_ONE_PACKED);
vx_tmc_one();

The dump should show the fill loop bracketed by all-lane and lane-0 masks:

li      a5, 15
vx_tmc  a5
...
vx_cmov zero, a0, a5, a2
...
li      a0, 1
vx_tmc  a0
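The lane-coverage rule above can be checked against a host-side model of the store semantics. This is a sketch only, not the library implementation: tmem_fragment_t, model_scalar_tmem_store() and model_fill_fragment() are hypothetical names, and four 32-bit words per fragment follows from the 16-byte fragment size.

```c
#include <assert.h>
#include <stdint.h>

#define WORDS_PER_FRAGMENT 4 /* 16-byte fragment, one 32-bit word per lane */

/* Hypothetical model of one TMEM fragment. */
typedef struct {
    uint32_t word[WORDS_PER_FRAGMENT];
} tmem_fragment_t;

/* Models TMEM[addr].word[lane] = rs2_data[lane] for each active lane. */
static void model_scalar_tmem_store(tmem_fragment_t *frag, uint32_t tmask,
                                    const uint32_t *rs2_data) {
    assert(tmask < (1u << WORDS_PER_FRAGMENT));
    for (int lane = 0; lane < WORDS_PER_FRAGMENT; ++lane) {
        if (tmask & (1u << lane)) {
            frag->word[lane] = rs2_data[lane];
        }
    }
}

/* Fills one fragment with `value` under `tmask`.
 * Returns the number of words the store left untouched. */
static int model_fill_fragment(tmem_fragment_t *frag, uint32_t tmask,
                               uint32_t value) {
    uint32_t rs2_data[WORDS_PER_FRAGMENT];
    int untouched = 0;
    for (int lane = 0; lane < WORDS_PER_FRAGMENT; ++lane) {
        rs2_data[lane] = value;
        if (!(tmask & (1u << lane))) {
            ++untouched;
        }
    }
    model_scalar_tmem_store(frag, tmask, rs2_data);
    return untouched;
}
```

With tmask=0x1 the model leaves three words untouched, so any stale data in them survives the fill. With tmask=0xF every word is written.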
