clock() measures total CPU time across all threads, not wall-clock
time. With the new OpenMP parallel regions in bssn_rhs_c.C, clock()
sums CPU time from all OpenMP threads, producing inflated timing that
scales with thread count rather than reflecting actual elapsed time.
MPI_Wtime() returns wall-clock seconds, giving accurate timing
regardless of the number of OpenMP threads running inside the
measured interval.
Co-authored-by: ianchb <i@4t.pw>
Apply Intel Advisor optimization recommendations:
- Add FORCEINLINE to polint for better inlining
- Add SIMD VECTORLENGTHFOR and UNROLL directives to fderivs,
fdderivs, symmetry_bd, and kodis functions
This improves vectorization efficiency of finite difference
computations.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The 8 nested while(Ppc){while(Pp){OutBdLow2Hi(single,single,...)}}
loops across RestrictProlong (3 overloads) and ProlongRestrict each
produced N_c × N_f separate transfer() → MPI_Waitall barriers.
Replace with the existing batch OutBdLow2Hi(MyList<Patch>*,...) which
merges all patch pairs into a single transfer() call with 1 MPI_Waitall.
Also add Restrict_cached, OutBdLow2Hi_cached, OutBdLow2Himix_cached
to Parallel (unused for now — kept as infrastructure for future use).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The data_packer(NULL, ...) calls that compute send/recv buffer lengths
traverse all grid segments × variables × nprocs on every Sync_start
invocation, even though lengths never change once the cache is built.
Add a lengths_valid flag to SyncCache so these length computations are
done once and reused on subsequent calls (4× per RK4 step).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of broadcasting all interpolated point data to every MPI rank,
the new overload sends each point only to the one rank that needs it
for integration, reducing communication volume by ~nprocs times.
The consumer rank is computed deterministically using the same Nmin/Nmax
work distribution formula used by surface_integral callers. Two active
call sites (surf_Wave and surf_MassPAng with MPI_COMM_WORLD) now use
the new overload. Other callers (ShellPatch, Comm_here variants, etc.)
remain unchanged.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The two MPI_Allreduce calls (data + weight) were the #1 hotspot at 38.5%
CPU time. Since all ranks traverse the same block list and agree on point
ownership, we replace the global reduction with targeted MPI_Bcast from
each owner rank. This also eliminates the weight array/Allreduce entirely,
removes redundant heap allocations (shellf, weight, DH, llb, uub), and
writes interpolation results directly into the output buffer.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cache Sync in RestrictProlong: replace 11 basic Parallel::Sync() calls
with Parallel::Sync_cached() across RestrictProlong, RestrictProlong_aux,
and ProlongRestrict to avoid rebuilding grid segment lists every call.
Merge paired MPI_Allreduce in surface_integral: combine 9 pairs of
consecutive RP/IP Allreduce calls into single calls with count=2*NN.
Merge scalar MPI_Allreduce in surf_MassPAng: combine 3 groups of 7
scalar Allreduce calls (mass + angular/linear momentum) into single
calls with count=7.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>