Commit Graph

10 Commits

Author SHA1 Message Date
f7ada421cf skip redundant MPI ghost cell syncs for stages 0, 1 & 2
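BSSN performs four MPI ghost zone syncs per RK4 time step:
After stage 0 (predictor): Parallel::Sync(SynchList_pre)
After stage 1 (corrector 1): Parallel::Sync(SynchList_cor)
After stage 2 (corrector 2): Parallel::Sync(SynchList_cor)
After stage 3 (corrector 3): Parallel::Sync(SynchList_cor) ← required (provides ghost zones for the next step)

bssnEM_class.C and Z4c_class.C have the same structure and were changed in the same commit.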
2026-02-26 16:16:33 +08:00
f5a63f1e42 Revert "Fix timing: replace clock() with MPI_Wtime() for wall-clock measurement"
This reverts commit 09b937c022.
2026-02-25 22:21:43 +08:00
copilot-swe-agent[bot]
09b937c022 Fix timing: replace clock() with MPI_Wtime() for wall-clock measurement
clock() measures total CPU time across all threads, not wall-clock
time. With the new OpenMP parallel regions in bssn_rhs_c.C, clock()
sums CPU time from all OpenMP threads, producing inflated timing that
scales with thread count rather than reflecting actual elapsed time.

MPI_Wtime() returns wall-clock seconds, giving accurate timing
regardless of the number of OpenMP threads running inside the
measured interval.

Co-authored-by: ianchb <i@4t.pw>
2026-02-25 22:21:19 +08:00
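
The distinction that 09b937c022 relies on, CPU time versus elapsed time, can be illustrated without MPI or OpenMP. The sketch below is not code from the repository: `Timing` and `time_sleep` are hypothetical names, and `std::chrono::steady_clock` stands in for `MPI_Wtime()`. It assumes a POSIX-like platform where `clock()` reports processor time. A sleeping thread consumes almost no CPU, so here `clock()` under-reports the interval; with several busy OpenMP threads the same mismatch goes the other way and `clock()` over-reports.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>

// Hypothetical helper type: the same interval measured two ways.
struct Timing {
    double cpu_seconds;   // from std::clock(): processor time used by the process
    double wall_seconds;  // from steady_clock: elapsed time (stand-in for MPI_Wtime)
};

// Measure a sleep of the given length with both clocks.
Timing time_sleep(int milliseconds) {
    assert(milliseconds >= 0);
    const std::clock_t c0 = std::clock();
    const auto w0 = std::chrono::steady_clock::now();

    // The thread is idle here, so it accumulates wall time but very little CPU time.
    std::this_thread::sleep_for(std::chrono::milliseconds(milliseconds));

    const std::clock_t c1 = std::clock();
    const auto w1 = std::chrono::steady_clock::now();

    Timing t;
    t.cpu_seconds = static_cast<double>(c1 - c0) / CLOCKS_PER_SEC;
    t.wall_seconds = std::chrono::duration<double>(w1 - w0).count();
    return t;
}
```

Calling `time_sleep(100)` should give a `wall_seconds` of roughly 0.1 and a much smaller `cpu_seconds`; the exact values depend on the machine.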
CGH0S7
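efc8bf29ea Invalidate sync caches on demand: Regrid_Onelevel now returns bool
Change the return type of cgh::Regrid_Onelevel from void to bool:
it returns true when the grid actually moved and false otherwise.
Callers invalidate sync_cache_* only when it returns true, which avoids
the redundant cost of unconditionally invalidating the caches of every level after each RecursiveStep.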

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-02-25 16:00:26 +08:00
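
The on-demand invalidation described in efc8bf29ea can be sketched as follows. This is a minimal illustration, not the repository's code: `SyncCacheStub`, `regrid_one_level`, `step`, and `count_rebuilds` are hypothetical stand-ins, and the "grid" is reduced to a single integer origin so that "moved" is easy to detect.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a per-level sync cache; only validity is modelled.
struct SyncCacheStub {
    bool valid = false;
    int rebuilds = 0;  // how many times the cached lists had to be rebuilt

    void ensure_built() {
        if (!valid) { ++rebuilds; valid = true; }
    }
    void invalidate() { valid = false; }
};

// Hypothetical stand-in for cgh::Regrid_Onelevel: true only if the grid moved.
bool regrid_one_level(int current_origin, int requested_origin) {
    return current_origin != requested_origin;
}

// One step: sync through the cache, then regrid and invalidate only on a real move.
void step(SyncCacheStub& cache, int& origin, int requested_origin) {
    assert(requested_origin >= 0);
    cache.ensure_built();
    if (regrid_one_level(origin, requested_origin)) {
        origin = requested_origin;
        cache.invalidate();  // the next step rebuilds the cached lists
    }
}

// Run a sequence of steps and report how many cache rebuilds were needed.
int count_rebuilds(const std::vector<int>& requested_origins) {
    SyncCacheStub cache;
    int origin = 0;
    for (int r : requested_origins) step(cache, origin, r);
    return cache.rebuilds;
}
```

With unconditional invalidation every step would rebuild the cache. In this sketch a rebuild happens only on the first step and on the step following an actual move.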
5c1790277b Replace nested OutBdLow2Hi loops with batch calls in RestrictProlong
The 8 nested while(Ppc){while(Pp){OutBdLow2Hi(single,single,...)}}
loops across RestrictProlong (3 overloads) and ProlongRestrict each
produced N_c × N_f separate transfer() → MPI_Waitall barriers.
Replace with the existing batch OutBdLow2Hi(MyList<Patch>*,...) which
merges all patch pairs into a single transfer() call with 1 MPI_Waitall.

Also add Restrict_cached, OutBdLow2Hi_cached, OutBdLow2Himix_cached
to Parallel (unused for now — kept as infrastructure for future use).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-11 16:09:08 +08:00
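
The effect of the batching in 5c1790277b on the number of wait barriers can be sketched with a mock. Nothing below is taken from the repository: `transfer` here is a hypothetical counter that stands in for the real `transfer()` and its trailing `MPI_Waitall`, and patches are represented by plain indices.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using PatchPair = std::pair<int, int>;  // (coarse patch index, fine patch index)

struct TransferStats {
    int waitall_calls = 0;        // one per transfer() call
    std::size_t pairs_moved = 0;  // total patch pairs handled
};

// Mock of transfer(): handles a list of patch pairs, then waits once.
void transfer(const std::vector<PatchPair>& pairs, TransferStats& stats) {
    stats.pairs_moved += pairs.size();
    ++stats.waitall_calls;
}

// Nested-loop pattern: one transfer(), and so one wait, per patch pair.
TransferStats nested(int n_coarse, int n_fine) {
    assert(n_coarse >= 0 && n_fine >= 0);
    TransferStats stats;
    for (int c = 0; c < n_coarse; ++c) {
        for (int f = 0; f < n_fine; ++f) {
            std::vector<PatchPair> one;
            one.push_back(PatchPair(c, f));
            transfer(one, stats);
        }
    }
    return stats;
}

// Batched pattern: collect every pair first, then issue a single transfer().
TransferStats batched(int n_coarse, int n_fine) {
    assert(n_coarse >= 0 && n_fine >= 0);
    TransferStats stats;
    std::vector<PatchPair> all;
    for (int c = 0; c < n_coarse; ++c)
        for (int f = 0; f < n_fine; ++f)
            all.push_back(PatchPair(c, f));
    transfer(all, stats);
    return stats;
}
```

For 3 coarse and 4 fine patches, `nested` waits 12 times and `batched` waits once, while both handle the same 12 pairs.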
738498cb28 Optimize MPI communication in RestrictProlong and surface_integral
Cache Sync in RestrictProlong: replace 11 basic Parallel::Sync() calls
with Parallel::Sync_cached() across RestrictProlong, RestrictProlong_aux,
and ProlongRestrict to avoid rebuilding grid segment lists every call.

Merge paired MPI_Allreduce in surface_integral: combine 9 pairs of
consecutive RP/IP Allreduce calls into single calls with count=2*NN.

Merge scalar MPI_Allreduce in surf_MassPAng: combine 3 groups of 7
scalar Allreduce calls (mass + angular/linear momentum) into single
calls with count=7.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 22:07:12 +08:00
42b9cf1ad9 Optimize MPI Sync with merged transfers, caching, and async overlap
Phase 1: Merge N+1 transfer() calls into a single transfer() per
Sync(PatchList), reducing N+1 MPI_Waitall barriers to 1 via new
Sync_merged() that collects all intra-patch and inter-patch grid
segment lists into combined per-rank arrays.

Phase 2: Cache grid segment lists and reuse grow-only communication
buffers across RK4 substeps via SyncCache struct. Caches are per-level
and per-variable-list (predictor/corrector), invalidated on regrid.
Eliminates redundant build_ghost_gsl/build_owned_gsl0/build_gstl
rebuilds and malloc/free cycles between regrids.

Phase 3: Split Sync into async Sync_start/Sync_finish to overlap
Cartesian ghost zone exchange (MPI_Isend/Irecv) with Shell patch
synchronization. Uses MPI tag 2 to avoid conflicts with SH->Synch()
which uses transfer() with tag 1.

Also updates makefile.inc paths and flags for local build environment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 21:03:37 +08:00
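
Phase 2 of 42b9cf1ad9 combines two ideas: cached segment lists that are rebuilt only after invalidation, and communication buffers that only ever grow. A minimal sketch of that combination is shown below. `SyncCacheSketch` and its members are hypothetical and do not reproduce the actual `SyncCache` struct. Keeping the buffer allocated across invalidation is an assumption of this sketch; the commit message does not say which behavior the real code has.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a per-level, per-variable-list sync cache.
struct SyncCacheSketch {
    bool valid = false;
    std::vector<int> segment_list;  // stand-in for the cached grid segment lists
    std::vector<double> send_buf;   // grow-only communication buffer
    int list_builds = 0;            // how often the segment list was rebuilt
    int allocations = 0;            // how often the buffer had to grow

    // Called before each sync: rebuild lists only if invalid, grow buffer only if too small.
    void prepare(std::size_t needed_doubles) {
        if (!valid) {
            segment_list.assign(1, 0);  // placeholder for the real list construction
            ++list_builds;
            valid = true;
        }
        if (send_buf.size() < needed_doubles) {
            send_buf.resize(needed_doubles);
            ++allocations;
        }
    }

    // Called on regrid. Assumption: lists are dropped, the buffer stays allocated.
    void invalidate() {
        valid = false;
        segment_list.clear();
    }
};
```

Across the substeps of one time step, `prepare` then costs one list build and at most one allocation. A regrid forces a new list build but, under the stated assumption, no new allocation unless the required size has grown.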
e9d321fd00 Convert MPI_Allreduce error checks to non-blocking MPI_Iallreduce overlapped with Sync
Replace all 8 blocking MPI_Allreduce error-check calls with MPI_Iallreduce,
overlapping the reduction with subsequent Parallel::Sync/SH->Synch operations.
MPI_Wait is called after Sync completes to retrieve the error result.

This hides the Allreduce latency (46.5% of CPU time) behind the ghost zone
exchange communication that must happen anyway. Safe because Sync only copies
existing data to ghost zones and the error check + abort happens before any
further computation uses the synced data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 12:39:29 +08:00
ed1d86ade9 Merge paired MPI_Allreduce error checks to reduce global sync barriers
In the two Step() functions that handle both Patch and Shell Patch,
defer the Patch error check until after Shell Patch computation completes,
then perform a single combined MPI_Allreduce instead of two separate ones.
This eliminates 4 MPI_Allreduce calls per timestep (2 per Step function,
Predictor + Corrector phases each). The optimization is mathematically
equivalent: in normal execution (no NaN) behavior is identical; on error,
both Patch and Shell data are dumped before MPI_Abort.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 12:12:16 +08:00
f2fc9af70e asc26 amss-ncku initialized
2026-01-13 15:01:15 +08:00