Refactored calc_derivs and calc_dderivs to include correct boundary
handling logic matching the legacy code. Implemented fallback to 2nd
order derivatives when near boundaries where 4th order stencils cannot
be used. Added logic to initialize output arrays to zero to avoid
uninitialized memory access.
- Implemented cache blocking (BLK=8) in bssn_rhs_opt.f90 to improve L1/L2 cache hit rate.
- Introduced bssn_rhs_opt.f90 module with vectorized derivative and physics kernels.
- Renamed original implementation to bssn_rhs_legacy.f90 for fallback.
- Updated bssn_rhs.f90 to act as a dispatcher, using the optimized path for ghost_width=3.
- Updated makefile to include new source files.
- Added DEBUG_NAN_CHECK macro to optionally disable NaN checks in production.
This commit transitions the verification approach from post-Newtonian theory
comparison to regression testing against baseline simulations, and optimizes
critical numerical kernels using Intel oneMKL BLAS routines.
Verification Changes:
- Replace PN theory-based RMS calculation with trajectory-based comparison
- Compare optimized results against baseline (GW150914-origin) on XY plane
- Compute RMS independently for BH1 and BH2, report maximum as final metric
- Update documentation to reflect new regression test methodology
Performance Optimizations:
- Replace manual vector operations with oneMKL BLAS routines:
* norm2() and scalarproduct() now use cblas_dnrm2/cblas_ddot (C++)
* L2 norm calculations use DDOT for dot products (Fortran)
* Interpolation weighted sums use DDOT (Fortran)
- Disable OpenMP threading (switch to sequential MKL) for better performance
Build Configuration:
- Switch from lmkl_intel_thread to lmkl_sequential
- Remove -qopenmp flags from compiler options
- Maintain aggressive optimization flags (-O3, -xHost, -fp-model fast=2, -fma)
Other Changes:
- Update .gitignore for GW150914-origin, docs, and temporary files
- Update makefile.inc with Intel oneAPI compiler flags and oneMKL linking
- Configure taskset CPU binding to use nohz_full cores (4-55, 60-111)
- Set build parallelism to 104 jobs for faster compilation
- Update MPI process count to 48 in input configuration
Bind all computation processes (ABE, ABEGPU, TwoPunctureABE) to
CPU cores 4-55 and 60-111 using numactl --physcpubind to prevent
interference with system processes on reserved cores.