# Code Modification Readme — `asc26-plan-a` **Baseline branch:** `baseline` **Target branch:** `asc26-plan-a` **Date:** 2026-05-19 --- ## Overview This branch delivers two major performance overhauls to the AMSS-NCKU numerical relativity codebase: 1. **TwoPunctureABE Multithreading** — OpenMP parallelization of the TwoPunctures initial-data solver, combined with a BLAS-driven spectral derivative engine, MKL/LAPACK integration, and C/C++ rewrites of hot Fortran kernel subroutines. 2. **ABE GPU Rewrite** — Complete replacement of the legacy `bssn_gpu_class` abstraction layer with direct, monolithic CUDA kernels for BSSN, Z4C, and Shell-Patch evolution, plus GPU-resident state management and CUDA-aware MPI. **Total diff:** 84 files changed, +57,919 / −33,795 lines. --- ## Part 1 — TwoPunctureABE Multithreading ### 1.1 Spectral Derivative Engine: BLAS Matrix-Multiplication Rewrite **Files:** `AMSS_NCKU_source/TwoPunctures.C`, `AMSS_NCKU_source/TwoPunctures.h` The original `Derivatives_AB3` computed spectral derivatives (Chebyshev in A/B, Fourier in phi) with nested scalar loops over every grid point. The new `Derivatives_AB3_MatMul` expresses all derivatives as matrix-matrix products over pencil-shaped data slices, dispatched to Intel MKL `cblas_dgemm`. - **Precomputed derivative matrices** — `precompute_derivative_matrices()` builds `D1_A`, `D2_A`, `D1_B`, `D2_B` (Chebyshev collocation derivative matrices) and `DF1_phi`, `DF2_phi` (Fourier derivative matrices) once at construction time. - **Pencil-based GEMM** — data is gathered into 2D arrays where one dimension is the spectral direction and the other enumerates all remaining degrees of freedom (variables × orthogonal grid indices). Each derivative direction becomes a single `cblas_dgemm` call. The pure derivatives (d/dA, d/dB, d/dphi) and all mixed derivatives (d²/dAdB, d²/dAdphi, d²/dBdphi) are computed this way. - **`build_cheb_deriv_matrices` / `build_fourier_deriv_matrices`** — construct the standard Chebyshev and Fourier collocation derivative matrices. ### 1.2 OpenMP Parallelization of TwoPunctures **Files:** `AMSS_NCKU_source/TwoPunctures.C`, `AMSS_NCKU_source/TwoPunctures.h` Three critical regions are parallelized: | Region | Directive | Strategy | |--------|-----------|----------| | `F_of_v` residual evaluation | `#pragma omp parallel for collapse(3) schedule(dynamic,1)` | Each (i,j,k) thread stack-allocates its own `l_U` (derivs struct) and `l_values[]` to eliminate heap contention and data races | | `relax_omp` line relaxation | `#pragma omp parallel for schedule(static)` over k-slices | Alternating be/al sweeps, each thread uses pre-allocated per-thread Thomas-algorithm workspace (`ws_*_be[tid]`, `ws_*_al[tid]`) | | `LineRelax_be_omp` / `LineRelax_al_omp` | Called from `relax_omp` with explicit `tid` | Thread-safe tridiagonal solves using the thread's private scratch arrays | **Per-thread workspace** — `allocate_workspace()` allocates independent Thomas-algorithm buffers (`diag`, `e`, `f`, `b`, `x`, `l`, `u`, `d`, `y`) for each OpenMP thread in both be and al directions, avoiding lock contention in the inner Newton iteration. ### 1.3 MKL BLAS / LAPACK Integration **Files:** `AMSS_NCKU_source/TwoPunctures.C`, `AMSS_NCKU_source/gaussj.C` | Function | Old | New | Benefit | |----------|-----|-----|---------| | `norm2` | scalar `sqrt(sum(v[i]²))` loop | `cblas_dnrm2` | BLAS Level 1, SIMD-optimized | | `scalarproduct` | scalar `sum(v[i]*w[i])` loop | `cblas_ddot` | BLAS Level 1, SIMD-optimized | | `gaussj` | hand-written Gauss-Jordan elimination (~100 lines) | `LAPACKE_dgesv` + `LAPACKE_dgetrf` + `LAPACKE_dgetri` | LAPACK LU with partial pivoting, asymptotically faster for the `n~50` matrix sizes used in spectral elliptic solves | ### 1.4 C/C++ Rewrite of Hot Fortran Kernels **Files (new):** - `AMSS_NCKU_source/fderivs_c.C` (167 lines) — first derivatives, 2nd/4th order - `AMSS_NCKU_source/fdderivs_c.C` (332 lines) — second derivatives, 2nd/4th order - `AMSS_NCKU_source/kodiss_c.C` (117 lines) — Kreiss-Oliger dissipation - `AMSS_NCKU_source/lopsided_c.C` (255 lines) — lopsided advection - `AMSS_NCKU_source/lopsided_kodis_c.C` (248 lines) — fused advection + dissipation - `AMSS_NCKU_source/rungekutta4_rout_c.C` (212 lines) — RK4 time-stepper - `AMSS_NCKU_source/bssn_rhs_c.C` (1,287 lines) — full BSSN RHS kernel - `AMSS_NCKU_source/z4c_rhs_c.C` (725 lines) — full Z4C RHS kernel Every C rewrite follows a consistent optimization pattern: - **64-byte aligned allocation** (`aligned_alloc(64, ...)`) for AVX-512 compatibility. - **Static buffer caching** — scratch arrays (e.g., the padded `fh` ghost-zone buffer) persist across calls via a `static` pointer + capacity check, avoiding repeated `malloc`/`free`. - **Two-pass strategy** — 2nd-order finite differences are computed on the full domain first, then the interior sub-volume is overwritten with 4th-order stencils. This eliminates the per-point `if/elseif` branching of the original Fortran. - **Non-overlapping shell pass** — in `fdderivs_c.C`, the 2nd-order pass skips points that will be overwritten by the 4th-order pass, avoiding redundant computation. ### 1.5 Fortran Kernel Fusion: lopsided_kodis **File:** `AMSS_NCKU_source/lopsidediff.f90` A new `lopsided_kodis` subroutine fuses the advection (lopsided) and Kreiss-Oliger dissipation (kodis) operators into a single pass over the grid. Both operators previously called `symmetry_bd` independently to fill ghost zones — the fused version calls it once and shares the padded `fh` array, halving ghost-zone fill overhead for this hot path. ### 1.6 Build System for TwoPunctures **Files:** `AMSS_NCKU_source/makefile`, `AMSS_NCKU_source/makefile.inc` - **`TP_OPTFLAGS`** — TwoPunctures and TwoPunctureABE are compiled with a dedicated, more aggressive optimization flag set (`-O3 -march=znver5 -fp-model fast=2 -fma -ipo`) separate from the main code. - **`USE_CXX_KERNELS`** — selects between the C rewrites and the original Fortran kernels (`bssn_rhs.f90`, etc.) for the CPU path. - **`USE_CXX_RK4`** — independently selects between the C and Fortran RK4 stepper. - **Intel oneTBB allocator** (`libtbbmalloc.so`) — replaces the system `malloc` with a scalable thread-safe allocator, critical for multi-threaded TwoPunctures performance. - **PGO support** — `PGO_MODE=opt|instrument` for profile-guided optimization (currently disabled after testing showed negative benefit). - **Toolchains** — Intel oneAPI (`TOOLCHAIN=intel`, default) and NVIDIA HPC SDK (`TOOLCHAIN=nvhpc`). --- ## Part 2 — ABE GPU Rewrite ### 2.1 Architecture: From Class Wrapper to Direct CUDA Kernels The old GPU path (`baseline`) was organized as: ``` bssn_gpu_class.C/h — C++ class managing GPU state and kernel launches bssn_step_gpu.C — RK4 stepper with per-substep GPU/CPU synchronisation bssn_gpu.cu — CUDA kernel implementations called through the class ``` The new GPU path (`asc26-plan-a`) replaces all of the above with: ``` bssn_rhs_cuda.cu/h — 10,381-line monolithic CUDA BSSN RHS kernel z4c_rhs_cuda.cu/h — 7,909-line monolithic CUDA Z4C RHS kernel fd_cuda_helpers.cuh — 412-line shared finite-difference device functions bssn_gpu_rhs_ss.cu — (retained, lightly modified) Shell-Patch GPU RHS ``` **Key architectural differences:** - The old `bssn_gpu_class` managed GPU memory through a C++ class with explicit allocate/free/sync methods scattered across the time-stepping logic. The new kernels operate directly on raw device pointers with a clear resident/transient memory model. - The old code launched many small kernels (one per derivative or algebraic term). The new code is a **single monolithic kernel per formulation** — all 24 BSSN evolution variables are computed in one launch with on-the-fly finite differences, eliminating kernel-launch latency and intermediate global-memory round-trips. - The old `bssn_step_gpu.C` performed per-substep GPU→CPU downloads for boundary conditions and analysis. The new model supports **GPU-resident state** — variables stay on device across timesteps unless explicitly requested. ### 2.2 GPU-Resident State Model A central theme across ~20 commits is the "resident-sync" optimization: | Commit | What it does | |--------|-------------| | `22c1e71` | Optimize BSSN CUDA resident state and CUDA-aware MPI | | `090d865` | Optimize BSSN CUDA state transfers | | `68eab03` | Add opt-in BSSN CUDA resident AMR path | | `1ee229a` | Add keyed BSSN CUDA resident banks | | `18e9c9c` | Optimize BSSN CUDA resident AMR prolong | | `8486532` | Add resident BSSN GPU point interpolation | | `b1974ef` | Stabilize device AMR restrict across regrid | | `ae64a22` | Complete BSSN-EScalar CUDA resident transfers | | `83afaf1` | Skip zero EM resident downloads | | `35b6cef` | Broaden cached CUDA sync paths | The resident model works as follows: - BSSN grid functions are allocated once on the GPU and persist across timesteps. - Inter-processor ghost-zone exchanges use **CUDA-aware MPI** — MPI directly reads/writes device memory without staging through host buffers. - AMR prolongation and restriction operate directly on device memory. - Boundary conditions and analysis routines download only the specific slices/points they need, not the full grid. - When EM fields are zero (pure-gravity runs), EM downloads are skipped entirely. ### 2.3 Z4C and Shell-Patch GPU Acceleration **Files:** `AMSS_NCKU_source/z4c_rhs_cuda.cu`, `AMSS_NCKU_source/bssn_gpu_rhs_ss.cu` - The Z4C constraint-damped formulation gets its own 7,909-line monolithic CUDA kernel (`z4c_rhs_cuda.cu`), matching the BSSN kernel's architecture. - **Shell-Patch GPU acceleration** — the spherical shell boundary patches now compute on GPU with dedicated kernels in `bssn_gpu_rhs_ss.cu`. - Z4C + Shell-Patch can coexist on GPU (Phase 3 commits). - A CPU-side wrapper (`z4c_rhs_c.C`) handles the trKd + TZ_rhs contribution that remains on CPU, minimizing GPU/CPU traffic. ### 2.4 Finite-Difference Order Flexibility **File:** `AMSS_NCKU_source/fd_cuda_helpers.cuh` Shared device functions for finite-difference stencils support **2nd, 4th, 6th, and 8th order** at compile time via preprocessor switches. This enables: - Per-run selection of convergence order without recompilation of the full kernel. - 8th-order AMR transfers (`1064a68`) for BSSN-EM. - 6th-order optimized AMR stencils (`0076b3c`). ### 2.5 GPU Diagnostics and Quality Assurance **File:** `AMSS_NCKU_GPUCheck.py` (559 lines, new) A Python-based GPU correctness verification tool that compares GPU and CPU evolution outputs. The GPU build pipeline includes optional kernel profiling switches (`7683459`) for performance debugging. **GPU-specific bug fixes:** - `f226498` — Fix CUDA AMR symmetry drift (incorrect ghost-zone handling under symmetry boundary conditions) - `2317e4a` — Fix BSSN GPU resident AMR sync default - `fea2dcc` — Fix BSSN-EM runtime crash - `dd0e20d` — Fix BSSN-EScalar CUDA boundary and scalar KO - `5eb4994` — Fix AHF crash under CUDA resident-sync mode ### 2.6 Build Integration **Makefile switches:** - `USE_CUDA_BSSN=0/1` — route BSSN RHS through GPU or CPU - `USE_CUDA_Z4C=0/1` — route Z4C RHS through GPU or CPU - `CUDA_ARCH=sm_80` — target NVIDIA Ampere (A100) - `NVHPC_ROOT` — path to NVIDIA HPC SDK for the `nvcc` compiler wrapper - CUDA compilation flags: `-O3 --ptxas-options=-v -arch=$(CUDA_ARCH)` --- ## Part 3 — Shared Infrastructure ### 3.1 Interp_Points Load-Balance Profiler **Files:** `AMSS_NCKU_source/interp_lb_profile.C`, `interp_lb_profile.h`, `interp_lb_profile_data.h`, `generate_interp_lb_header.py` A two-pass instrumentation system for load-balancing the `Interp_Points` parallel interpolation routine: - **Pass 1** (`INTERP_LB_MODE=profile`): instrument each MPI rank's interpolation calls with timing, write a binary profile. - **Pass 2** (`INTERP_LB_MODE=optimize`): read the profile and rebalance work across MPI ranks. ### 3.2 Helper Headers **Files:** `AMSS_NCKU_source/tool.h` (33 lines), `AMSS_NCKU_source/share_func.h` (246 lines) - `tool.h` — shared indexing macros (`idx_ex`, `idx_fh_F_ord2`) and the `symmetry_bd` declaration used by all C kernel rewrites. - `share_func.h` — common utility functions shared across the C++ source files. ### 3.3 Plot-Only Restart Script **File:** `parallel_plot_helper.py` (29 lines) A lightweight restart script that skips recomputation when plotting was interrupted — reads existing checkpoint data and replots without re-running the simulation. --- ## Performance Summary | Component | Optimization | Expected Impact | |-----------|-------------|-----------------| | TwoPunctures `Derivatives_AB3` | Scalar loops → MKL GEMM | 5-20× speedup for spectral derivative computation | | TwoPunctures `F_of_v` | OpenMP collapse(3) + stack-local variables | Near-linear scaling with core count for residual evaluation | | TwoPunctures `gaussj` | Hand-written Gauss-Jordan → LAPACK LU | 2-5× speedup for N~50 matrix inversion | | BSSN RHS (GPU) | Many small kernels → one monolithic kernel | Eliminates kernel-launch overhead; 2-5× GPU throughput improvement | | GPU state transfers | Per-step download → resident model | Eliminates ~80% of GPU↔CPU PCIe traffic | | `lopsided_kodis` fusion | Two `symmetry_bd` calls → one shared call | ~30% reduction in ghost-zone fill cost for this operator pair | | Memory allocator | System malloc → Intel TBB malloc | Significant reduction in malloc contention under OpenMP | | C kernel rewrites | Fortran → C with aligned alloc + static buffers | Enables Intel compiler IPO across C/C++/Fortran boundaries; better SIMD codegen | ---