Wu Architecture Staged Cases
This directory contains small bare-metal kernels for incremental Wu architecture testing. The original kernels/wu_arch kernel is useful as an integrated test, but it combines scalar spawning, tensor spawning, barriers, tensor control, and memory behavior in one large workload. These cases isolate those surfaces so failures can be reproduced faster under Verilator.
Case List
case00_boot_scalar: minimal scalar boot, status writes, and pass marker.case01_scalar_spawn: scalar warp spawning without tensor warps or barriers.case02_tensor_spawn_stop: tensor warp spawn, marker store, and stop.case03_dual_fetch_issue: scalar and tensor warps active together to exercise split scheduling and issue.case04_scalar_barrier: scalar-domain barrier release.case05_tensor_barrier: tensor-domain barrier through tensor control.case06_masked_barrier: explicit mixedBAR_MASKwith scalar warp 0 and tensor warps.case07_tensor_csr_tmc: tensor CSR/TMC path without barrier behavior.case08_tensor_lsu_optional: tensor LSU store/load marker path; keep last because memory interaction is broader and slower.case09_scalar_tmem_ldst: scalar warp direct TMEM store/load path for the banked TMEM softmax mechanism.case10_tensor_scalar_tmem_handoff: tensor BWGMMA result in TMEM C observed by scalar TMEM loads.case11_scalar_tmem_softmax_stage: scalar TMEM transform written back for tensor-side copy-out.case12_flash_pv_accum: one tensor warp consumes scalar-writtenPin TMEM A forO = O + P @ V.case12_1_scalar_tmem_cb_probe: scalar-written TMEM A rows copied back by tensortcgen05_cb.case12_2_flash_pv_p_probe: case12 P-write diagnostic using the same scalar fill path.case12_3_scalar_tmem_lane_store: scalar TMEM store lane-coalesced fragment write diagnostic.case13_flash_pv_two_warps: both tensor warps consume scalar-writtenPtiles for two row blocks.case14_flash_pv_k64: two consecutiveK=32BWGMMA steps accumulate into one PV output.case15_flash_softmax_pv_stage: scalar reads TMEM C, writes softmax-likeP, and tensor consumes it in PV.case16_flash_full_pipeline: compactQK -> scalar softmax handoff -> PVend-to-end FlashAttention-style pipeline.case17_flash_exp_softmax_probe: scalar non-uniforme^xsoftmax probe for generalized FlashAttention.case18_scalar_fexp: scalarFEXP.Snumerical probe.case20_flash_bwd_fused: FlashAttention backward-style fused 5xMMA plus scalar softmax/dsoftmax handoff.case21_moe_gating: scalarsoftmax -> Top-K -> scatterMoE gating pipeline.case22_gemm_silu: tensor GEMM followed by scalar SiLU activation.case23_softmax_only: scalar-only stable softmax probe.case24_flash_sw_pipeline: four-iteration ping-pong FlashAttention-style software pipeline.
Each case has its own README.md describing the test objective, RTL surface, and expected pass marker.
Debug Notes
TMC_DEBUG_NOTES.md: reusable notes for diagnosing lane-mask/TMC operand bugs, including thecase12_3_scalar_tmem_lane_storefailure where a lane0-only register value was consumed under an all-lane mask.
Build
Use the suite Makefile from this directory:
make smoke -j4 LLVM_VORTEX=/home/hexu/dse/wu/virgo-artifact-full/toolchain/llvm-vortex2 RISCV_TOOLCHAIN_PATH=/home/hexu/dse/wu/virgo-artifact-full/chipyard/.conda-env/riscv-tools RISCV_PREFIX=riscv64-unknown-elf
make barriers -j4 LLVM_VORTEX=/home/hexu/dse/wu/virgo-artifact-full/toolchain/llvm-vortex2 RISCV_TOOLCHAIN_PATH=/home/hexu/dse/wu/virgo-artifact-full/chipyard/.conda-env/riscv-tools RISCV_PREFIX=riscv64-unknown-elf
make full -j4 LLVM_VORTEX=/home/hexu/dse/wu/virgo-artifact-full/toolchain/llvm-vortex2 RISCV_TOOLCHAIN_PATH=/home/hexu/dse/wu/virgo-artifact-full/chipyard/.conda-env/riscv-tools RISCV_PREFIX=riscv64-unknown-elf
smoke builds the boot/spawn/dual-issue cases. barriers builds the barrier-focused cases. full builds all cases.
Verilator Run Notes
For RTL simulation, use the same simulator setup as the main Virgo artifact, but run these ELFs one at a time:
VM_PARALLEL_BUILDS=1LOADMEM=1, soSimDRAM::memory_init()preloads the ELF instead of relying on slow runtime SimTSI writes.CCACHE_DIR=/tmp/ccachewhen ccache is enabled in the sandbox.- Use
/home/hexu/dse/firtool-1.62.0for firtool and/usr/local/bin/verilatorfor Verilator. - Keep system
gcc/g++onPATH; do not use thegcc/g++injected bychipyard/env.sh. - For generated Verilator C++ compilation, prefer
-O0 -fno-inlineto reduce compile time.
Cleanup
make clean-all
This removes kernel ELF/dump outputs and the generated placeholder input blobs in each case directory.