The 8 nested while(Ppc){while(Pp){OutBdLow2Hi(single,single,...)}}
loops across RestrictProlong (3 overloads) and ProlongRestrict each
produced N_c × N_f separate transfer() → MPI_Waitall barriers.
Replace them with the existing batch overload OutBdLow2Hi(MyList<Patch>*,...),
which merges all patch pairs into a single transfer() call with one
MPI_Waitall.
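Illustrative shape of the change (a sketch; the real Parallel
signatures carry more arguments):

    // Before: one transfer() + MPI_Waitall per coarse/fine patch pair.
    while (Ppc) {                          // N_c coarse patches
      while (Pp) {                         // N_f fine patches
        OutBdLow2Hi(Ppc->data, Pp->data);  // internal transfer() + Waitall
        Pp = Pp->next;
      }
      Ppc = Ppc->next; Pp = fine_head;     // fine_head: illustrative reset
    }

    // After: the batch overload packs every pair into one transfer(),
    // hence one MPI_Waitall.
    OutBdLow2Hi(coarse_list, fine_list);   // MyList<Patch>* arguments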
Also add Restrict_cached, OutBdLow2Hi_cached, and OutBdLow2Himix_cached
to Parallel (unused for now; kept as infrastructure for future use).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The data_packer(NULL, ...) calls that compute send/recv buffer lengths
traverse all grid segments × variables × nprocs on every Sync_start
invocation, even though lengths never change once the cache is built.
Add a lengths_valid flag to SyncCache so these length computations are
done once and their results reused on subsequent calls (Sync_start runs
4× per RK4 step).
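A minimal sketch of the guard, assuming the SyncCache struct from the
earlier Sync caching work (field and helper names are illustrative):

    void Parallel::Sync_start(SyncCache &c /*, ... */) {
      if (!c.lengths_valid) {
        // data_packer(NULL, ...) pass: counts bytes only, copies nothing.
        compute_buffer_lengths(c);   // fills c.send_len[], c.recv_len[]
        c.lengths_valid = true;      // cleared again on regrid
      }
      // Every later call sizes its buffers from the cached lengths.
    }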
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of broadcasting all interpolated point data to every MPI rank,
the new overload sends each point only to the one rank that needs it
for integration, reducing communication volume by a factor of ~nprocs.
The consumer rank is computed deterministically using the same Nmin/Nmax
work distribution formula used by surface_integral callers. Two active
call sites (surf_Wave and surf_MassPAng with MPI_COMM_WORLD) now use
the new overload. Other callers (ShellPatch, Comm_here variants, etc.)
remain unchanged.
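A sketch of the deterministic owner computation, assuming the common
block distribution Nmin(r) = r*NN/nprocs, Nmax(r) = (r+1)*NN/nprocs
(the authoritative formula is whatever the surface_integral callers use):

    int consumer_rank(int ipoint, int NN, int nprocs) {
      for (int r = 0; r < nprocs; ++r) {
        int Nmin = (r * NN) / nprocs;
        int Nmax = ((r + 1) * NN) / nprocs;  // exclusive upper bound
        if (ipoint >= Nmin && ipoint < Nmax) return r;
      }
      return nprocs - 1;  // not reached for 0 <= ipoint < NN
    }

Because every rank evaluates the same formula, sender and receiver
agree on the destination without any extra negotiation round.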
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The two MPI_Allreduce calls (data + weight) were the #1 hotspot at 38.5%
of CPU time. Since all ranks traverse the same block list and agree on
point ownership, we replace the global reduction with a targeted
MPI_Bcast from each owner rank. This also eliminates the weight array
and its Allreduce entirely, removes redundant heap allocations (shellf,
weight, DH, llb, uub), and writes interpolation results directly into
the output buffer.
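A sketch of the replacement pattern (owner/offset bookkeeping names are
illustrative; interpolate_block is a hypothetical stand-in):

    // All ranks walk the same block list, so all agree on owner[b].
    for (int b = 0; b < nblocks; ++b) {
      if (owner[b] == myrank)
        interpolate_block(b, out + offset[b]);  // write straight into out
      MPI_Bcast(out + offset[b], count[b], MPI_DOUBLE,
                owner[b], MPI_COMM_WORLD);      // only the owner sends
    }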
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Cache Sync in RestrictProlong: replace 11 basic Parallel::Sync() calls
  with Parallel::Sync_cached() across RestrictProlong, RestrictProlong_aux,
  and ProlongRestrict to avoid rebuilding grid segment lists on every call.
- Merge paired MPI_Allreduce in surface_integral: combine 9 pairs of
  consecutive RP/IP Allreduce calls into single calls with count=2*NN
  (see the sketch after this list).
- Merge scalar MPI_Allreduce in surf_MassPAng: combine 3 groups of 7
  scalar Allreduce calls (mass + angular/linear momentum) into single
  calls with count=7.
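Shape of one RP/IP merge (buffer handling illustrative):

    // Before: two collectives per mode.
    // MPI_Allreduce(MPI_IN_PLACE, RP, NN, MPI_DOUBLE, MPI_SUM, comm);
    // MPI_Allreduce(MPI_IN_PLACE, IP, NN, MPI_DOUBLE, MPI_SUM, comm);

    // After: pack both halves and reduce once with count = 2*NN.
    memcpy(buf,      RP, NN * sizeof(double));
    memcpy(buf + NN, IP, NN * sizeof(double));
    MPI_Allreduce(MPI_IN_PLACE, buf, 2 * NN, MPI_DOUBLE, MPI_SUM, comm);
    memcpy(RP, buf,      NN * sizeof(double));
    memcpy(IP, buf + NN, NN * sizeof(double));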
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 1: Merge N+1 transfer() calls into a single transfer() per
Sync(PatchList), reducing N+1 MPI_Waitall barriers to 1 via a new
Sync_merged() that collects all intra-patch and inter-patch grid
segment lists into combined per-rank arrays.
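Sketch of the Phase 1 collection step (append_patch_segments is a
hypothetical helper; the real segment types live in Parallel):

    void Parallel::Sync_merged(MyList<Patch> *PL /*, ... */) {
      for (MyList<Patch> *p = PL; p; p = p->next)
        append_patch_segments(p->data);   // intra- and inter-patch lists
      transfer(/* combined per-rank arrays */);  // one MPI_Waitall inside
    }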
Phase 2: Cache grid segment lists and reuse grow-only communication
buffers across RK4 substeps via SyncCache struct. Caches are per-level
and per-variable-list (predictor/corrector), invalidated on regrid.
Eliminates redundant build_ghost_gsl/build_owned_gsl0/build_gstl
rebuilds and malloc/free cycles between regrids.
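Rough shape of a cache entry (member names illustrative):

    struct SyncCache {
      GridSegList *ghost_gsl, *owned_gsl0, *gstl;  // built once per regrid
      char *sendbuf, *recvbuf;   // grow-only: enlarged, never freed early
      size_t sendcap, recvcap;
      bool valid;                // cleared by the regrid path
    };
    // Keyed per (refinement level, variable list), so the predictor and
    // corrector substeps of RK4 each hit their own warm entry.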
Phase 3: Split Sync into async Sync_start/Sync_finish to overlap
Cartesian ghost zone exchange (MPI_Isend/Irecv) with Shell patch
synchronization. Uses MPI tag 2 to avoid conflicts with SH->Synch(),
which uses transfer() with tag 1.
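Resulting call shape in the Sync path:

    Sync_start(cache, vars);   // posts MPI_Isend/MPI_Irecv with tag 2
    SH->Synch(vars);           // shell-patch sync; its transfer() uses tag 1
    Sync_finish(cache, vars);  // MPI_Waitall, then unpack ghost zones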
Also updates makefile.inc paths and flags for local build environment.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace all 8 blocking MPI_Allreduce error-check calls with MPI_Iallreduce,
overlapping the reduction with subsequent Parallel::Sync/SH->Synch operations.
MPI_Wait is called after Sync completes to retrieve the error result.
This hides the Allreduce latency (46.5% of CPU time) behind the ghost zone
exchange communication that must happen anyway. Safe because Sync only
copies existing data into ghost zones, and the error check and abort
happen before any further computation uses the synced data.
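The pattern, sketched (flag names and the reduction op are illustrative):

    int err_local = check_for_nan(/* ... */), err_global = 0;
    MPI_Request req;
    MPI_Iallreduce(&err_local, &err_global, 1, MPI_INT, MPI_MAX,
                   MPI_COMM_WORLD, &req);
    Parallel::Sync(/* ... */);            // ghost exchange runs meanwhile
    SH->Synch(/* ... */);
    MPI_Wait(&req, MPI_STATUS_IGNORE);    // reduction has overlapped Sync
    if (err_global) { dump_data(); MPI_Abort(MPI_COMM_WORLD, 1); }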
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In the two Step() functions that handle both Patch and Shell Patch,
defer the Patch error check until after Shell Patch computation completes,
then perform a single combined MPI_Allreduce instead of two separate ones.
This eliminates 4 MPI_Allreduce calls per timestep (2 per Step function,
Predictor + Corrector phases each). The optimization preserves behavior
exactly: in normal execution (no NaN) results are identical; on error,
both Patch and Shell data are dumped before MPI_Abort.
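Sketch of the combined check (names and reduction op illustrative):

    double err[2] = { patch_err_local, shell_err_local };
    MPI_Allreduce(MPI_IN_PLACE, err, 2, MPI_DOUBLE, MPI_MAX, comm);
    if (err[0] > 0.0 || err[1] > 0.0) {
      dump_patch_data();                 // both dumps precede the abort
      dump_shell_data();
      MPI_Abort(MPI_COMM_WORLD, 1);
    }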
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This should reduce pressure on the memory allocator, indirectly
improving cache behavior.
Co-authored-by: copilot-swe-agent[bot] <198982749+copilot@users.noreply.github.com>
Pre-allocate workspace buffers as class members to remove ~8M malloc/free
pairs per Newton iteration from LineRelax, ThomasAlgorithm, JFD_times_dv,
J_times_dv, chebft_Zeros, fourft, Derivatives_AB3, and F_of_v.
Rewrite ThomasAlgorithm to operate in-place on input arrays.
Results are bit-identical; no algorithmic changes.
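For reference, the in-place pattern looks like this in C++ (the
project's ThomasAlgorithm may order its coefficient arrays differently):

    // a: sub-, b: main, c: super-diagonal; d: RHS on entry, solution
    // on exit. The forward sweep overwrites b and d; no scratch arrays.
    void thomas_inplace(int n, const double *a, double *b,
                        const double *c, double *d) {
      for (int i = 1; i < n; ++i) {        // forward elimination
        double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
      }
      d[n - 1] /= b[n - 1];                // back substitution
      for (int i = n - 2; i >= 0; --i)
        d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
    }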
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Changes:
- polint: Rewrite the Neville algorithm from array-slice operations to a
  scalar loop (C++ sketch after this list). Mathematically identical;
  avoids temporary array allocations for the den(1:n-m) slices.
  (Credit: yx-fmisc branch)
- polin2: Swap the interpolation order so the inner loop accesses ya(:,j)
  (contiguous in Fortran column-major) instead of ya(i,:) (strided).
  Tensor-product interpolation is commutative; all call sites pass
  identical coordinate arrays for all dimensions.
- polin3: Swap the interpolation order to process the contiguous first
  dimension first: ya(:,j,k) -> yatmp(:,k) -> ymtmp(:).
  Same commutativity argument as polin2.
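For reference, the scalar-loop Neville scheme as a 0-based C++
transliteration (the actual change is in the Fortran polint; bounds
checks omitted):

    #include <cmath>
    void polint_scalar(const double *xa, const double *ya, int n,
                       double x, double &y, double &dy) {
      double c[16], d[16];                 // sketch assumes n <= 16
      int ns = 0;
      double dif = std::fabs(x - xa[0]);
      for (int i = 0; i < n; ++i) {
        double dift = std::fabs(x - xa[i]);
        if (dift < dif) { ns = i; dif = dift; }
        c[i] = d[i] = ya[i];
      }
      y = ya[ns--];
      for (int m = 1; m < n; ++m) {
        for (int i = 0; i < n - m; ++i) {  // scalar loop, no den(1:n-m) temp
          double ho = xa[i] - x, hp = xa[i + m] - x;
          double den = (c[i + 1] - d[i]) / (ho - hp);  // xa assumed distinct
          d[i] = hp * den;
          c[i] = ho * den;
        }
        dy = (2 * (ns + 1) < n - m) ? c[ns + 1] : d[ns--];
        y += dy;
      }
    }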
Compile-time safety switch:
-DPOLINT_LEGACY_ORDER restores original dimension ordering
Default (no flag): uses optimized contiguous-memory ordering
Usage:
# Production (optimized order):
make clean && make -j ABE
# Fallback if results differ (original order):
Add -DPOLINT_LEGACY_ORDER to f90appflags in makefile.inc
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap the NaN sanity check (21 full-array sum() traversals per RHS call)
in #ifdef DEBUG so it is compiled out of production builds.
This eliminates 84 redundant full-array scans per timestep (21 per RHS
call × 4 RK4 substages) that serve no purpose when the input data is valid.
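Shape of the guard (C-preprocessor syntax; the Fortran RHS source uses
the same directives via -fpp, and the check body here is a hypothetical
stand-in):

    #ifdef DEBUG
      // the 21 full-array traversals checking every RHS field for NaN
      if (fields_contain_nan(rhs, nfields)) abort_with_dump();
    #endif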
Usage:
- Production build (default): NaN check is disabled, no changes needed.
- Debug build: add -DDEBUG to f90appflags in makefile.inc, e.g.
f90appflags = -O3 ... -DDEBUG -fpp ...
to re-enable the NaN sanity check.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- makefile.inc: add -ipo (interprocedural optimization) and
-align array64byte (64-byte array alignment for vectorization)
- fmisc.f90: remove redundant funcc=0.d0 zeroing from symmetry_bd,
  symmetry_tbd, and symmetry_stbd (~328 full-array memsets eliminated
  per timestep)
- enforce_algebra.f90: rewrite enforce_ag and enforce_ga as point-wise
  loops (sketched below), replacing 12 stack-allocated 3D temporary
  arrays with scalar locals for better cache locality
All changes are mathematically equivalent; no algorithmic modifications.
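Shape of the enforce_ag/enforce_ga rewrite, shown in C++ rather than
the actual Fortran (idx() and constrain() are hypothetical stand-ins):

    for (int k = 0; k < nz; ++k)
      for (int j = 0; j < ny; ++j)
        for (int i = 0; i < nx; ++i) {
          // Load once into a scalar local, apply the algebraic
          // constraint, store once: no 3D temporaries materialize.
          double g = gxx[idx(i, j, k)];
          gxx[idx(i, j, k)] = constrain(g /* , ... */);
        }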
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit transitions the verification approach from post-Newtonian theory
comparison to regression testing against baseline simulations, and optimizes
critical numerical kernels using Intel oneMKL BLAS routines.
Verification Changes:
- Replace PN theory-based RMS calculation with trajectory-based comparison
- Compare optimized results against baseline (GW150914-origin) on XY plane
- Compute RMS independently for BH1 and BH2 and report the maximum as
  the final metric (formula sketched below)
- Update documentation to reflect new regression test methodology
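As a formula, assuming pointwise XY residuals against the baseline
trajectory (written \hat{x}, \hat{y}):

    RMS_i = \sqrt{ \frac{1}{N} \sum_{k=1}^{N} \left[
              (x_k^{(i)} - \hat{x}_k^{(i)})^2
            + (y_k^{(i)} - \hat{y}_k^{(i)})^2 \right] }, \quad i \in \{1,2\}
    \text{metric} = \max(RMS_1, RMS_2)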
Performance Optimizations:
- Replace manual vector operations with oneMKL BLAS routines (sketch
  after this list):
  * norm2() and scalarproduct() now use cblas_dnrm2/cblas_ddot (C++)
  * L2 norm calculations use DDOT for dot products (Fortran)
  * Interpolation weighted sums use DDOT (Fortran)
- Disable OpenMP threading (switch to sequential MKL) for better performance
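Sketch of the C++ substitution (the project's norm2/scalarproduct
signatures are assumptions):

    #include <mkl.h>
    double norm2(const double *v, int n) {
      return cblas_dnrm2(n, v, 1);        // was a hand-rolled sqrt-of-sum
    }
    double scalarproduct(const double *a, const double *b, int n) {
      return cblas_ddot(n, a, 1, b, 1);   // was a manual accumulation loop
    }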
Build Configuration:
- Switch from lmkl_intel_thread to lmkl_sequential
- Remove -qopenmp flags from compiler options
- Maintain aggressive optimization flags (-O3, -xHost, -fp-model fast=2, -fma)
Other Changes:
- Update .gitignore for GW150914-origin, docs, and temporary files
- Update makefile.inc with Intel oneAPI compiler flags and oneMKL linking
- Configure taskset CPU binding to use nohz_full cores (4-55, 60-111)
- Set build parallelism to 104 jobs for faster compilation
- Update MPI process count to 48 in input configuration
Bind all computation processes (ABE, ABEGPU, TwoPunctureABE) to
CPU cores 4-55 and 60-111 using numactl --physcpubind to prevent
interference with system processes on reserved cores.
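A representative launch line (launcher, binary name, and input file are
placeholders):

    mpirun -np 48 numactl --physcpubind=4-55,60-111 ./ABE input.par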