In the two Step() functions that handle both the Patch and the Shell Patch,
defer the Patch error check until the Shell Patch computation completes,
then perform a single combined MPI_Allreduce instead of two separate ones.
This eliminates 4 MPI_Allreduce calls per timestep (2 per Step function,
one each in the Predictor and Corrector phases). The optimization is
mathematically equivalent: in normal (NaN-free) execution the behavior is
identical; on error, both Patch and Shell data are still dumped before MPI_Abort.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This should reduce pressure on the memory allocator and indirectly improve caching behavior.
Co-authored-by: copilot-swe-agent[bot] <198982749+copilot@users.noreply.github.com>
Pre-allocate workspace buffers as class members to remove ~8M malloc/free
pairs per Newton iteration from LineRelax, ThomasAlgorithm, JFD_times_dv,
J_times_dv, chebft_Zeros, fourft, Derivatives_AB3, and F_of_v.
Rewrite ThomasAlgorithm to operate in-place on input arrays.
Results are bit-identical; no algorithmic changes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Changes:
- polint: Rewrite the Neville algorithm from array-slice operations to
a scalar loop. Mathematically identical; avoids the temporary array
allocations for the den(1:n-m) slices. (Credit: yx-fmisc branch)
- polin2: Swap interpolation order so the inner loop accesses ya(:,j)
(contiguous in Fortran's column-major layout) instead of ya(i,:)
(strided). The 1-D passes of a tensor-product interpolation can be
applied in either order; all call sites pass identical coordinate
arrays for all dimensions.
- polin3: Swap interpolation order to process contiguous first
dimension first: ya(:,j,k) -> yatmp(:,k) -> ymtmp(:).
Same commutativity argument as polin2.
Compile-time safety switch:
-DPOLINT_LEGACY_ORDER restores original dimension ordering
Default (no flag): uses optimized contiguous-memory ordering
Usage:
# Production (optimized order):
make clean && make -j ABE
# Fallback if results differ (original order):
Add -DPOLINT_LEGACY_ORDER to f90appflags in makefile.inc
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap the NaN sanity check (21 sum() full-array traversals per RHS call)
with #ifdef DEBUG so it is compiled out in production builds.
This eliminates 84 redundant full-array scans per timestep (21 per RHS
call × 4 RK4 substages) that serve no purpose when input data is valid.
Usage:
- Production build (default): NaN check is disabled, no changes needed.
- Debug build: add -DDEBUG to f90appflags in makefile.inc, e.g.
f90appflags = -O3 ... -DDEBUG -fpp ...
to re-enable the NaN sanity check.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- makefile.inc: add -ipo (interprocedural optimization) and
-align array64byte (64-byte array alignment for vectorization)
- fmisc.f90: remove redundant funcc=0.d0 zeroing from symmetry_bd,
symmetry_tbd, and symmetry_stbd (on the order of 328 full-array
memsets eliminated per timestep)
- enforce_algebra.f90: rewrite enforce_ag and enforce_ga as point-wise
loops, replacing 12 stack-allocated 3D temporary arrays with scalar
locals for better cache locality
All changes are mathematically equivalent — no algorithmic modifications.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit transitions the verification approach from post-Newtonian theory
comparison to regression testing against baseline simulations, and optimizes
critical numerical kernels using Intel oneMKL BLAS routines.
Verification Changes:
- Replace PN theory-based RMS calculation with trajectory-based comparison
- Compare optimized results against the baseline (GW150914-origin) in the XY plane
- Compute RMS independently for BH1 and BH2, report maximum as final metric
- Update documentation to reflect new regression test methodology
Performance Optimizations:
- Replace manual vector operations with oneMKL BLAS routines:
* norm2() and scalarproduct() now use cblas_dnrm2/cblas_ddot (C++)
* L2 norm calculations use DDOT for dot products (Fortran)
* Interpolation weighted sums use DDOT (Fortran)
- Disable OpenMP threading (switch to sequential MKL) for better performance
Build Configuration:
- Switch from lmkl_intel_thread to lmkl_sequential
- Remove -qopenmp flags from compiler options
- Maintain aggressive optimization flags (-O3, -xHost, -fp-model fast=2, -fma)
Other Changes:
- Update .gitignore for GW150914-origin, docs, and temporary files
- Update makefile.inc with Intel oneAPI compiler flags and oneMKL linking
- Configure taskset CPU binding to use nohz_full cores (4-55, 60-111)
- Set build parallelism to 104 jobs for faster compilation
- Update MPI process count to 48 in input configuration
Bind all computation processes (ABE, ABEGPU, TwoPunctureABE) to
CPU cores 4-55 and 60-111 using numactl --physcpubind to prevent
interference with system processes on reserved cores.
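A launch line matching this binding might look like the following; the mpirun flags and the ./ABE path are illustrative assumptions, and only the core list comes from the commit.

```shell
# Bind each rank to the non-reserved cores (4-55, 60-111), leaving
# cores 0-3 and 56-59 free for system processes:
mpirun -np 48 numactl --physcpubind=4-55,60-111 ./ABE
```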