Commit Graph

43 Commits

Author SHA1 Message Date
fbb2ed112d Fix Compute_Constraint/analysis to use CPU Fortran for shell RHS
Limit GPU shell RHS redirection to Step and SHStep only via #define/#undef.
Compute_Constraint, Interp_Constraint, and Constraint_Out continue using
the CPU Fortran path to avoid GPU alloc-per-call overhead during
initialization and analysis phases.

Also: wrap compare_result_gpu in #ifdef RESULT_CHECK to avoid link error.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 19:25:45 +08:00
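
A minimal sketch of how a #define/#undef pair can confine a redirection to selected call sites, as the commit above describes. All function names here (rhs_cpu, rhs_gpu, compute_constraint, step, constraint_out) are hypothetical stand-ins, not the project's actual symbols.

```cpp
// Hypothetical stand-ins for the CPU Fortran RHS and the GPU wrapper.
inline int rhs_cpu(int x) { return x + 1; }
inline int rhs_gpu(int x) { return x + 100; }

// Analysis-phase code placed before the macro keeps the CPU path.
inline int compute_constraint(int x) { return rhs_cpu(x); }

#define rhs_cpu rhs_gpu  // redirect only the call sites that follow
inline int step(int x) { return rhs_cpu(x); }  // expands to rhs_gpu(x)
#undef rhs_cpu           // end of the redirected region

// Code after the #undef calls the real rhs_cpu again.
inline int constraint_out(int x) { return rhs_cpu(x); }
```

Because the macro is active only between the two directives, the evolution routine picks up the GPU path while the surrounding analysis routines are untouched.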
bd4ce3fbf3 GPU-accelerate Shell-Patch BSSN evolution
Phase 1: Enable GPU resident state for Cartesian patches in Shell mode.
- Remove WithShell guard from bssn_cuda_use_resident_sync().
- Add GPU-to-CPU state sync before shell CPU consumers (SHStep,
  CS_Inter, inline shell RHS blocks).

Phase 2: GPU-accelerate BSSN Shell Patch RHS.
- Create bssn_gpu.h with RHS_SS_PARA macro and gpu_rhs_ss declaration.
- Fix compilation bugs in legacy bssn_gpu_rhs_ss.cu (deprecated
  cudaThreadSynchronize, tmp_con2 redeclaration, ijkmin3_h typo,
  CUDA_SAFE_CALL, missing compare_result guard).
- Add bssn_gpu_rhs_ss.o to CFILES_CUDA_BSSN with build rule.
- Write cuda_compute_rhs_bssn_ss() wrapper bridging Fortran and GPU
  parameter conventions, redirect all shell RHS call sites via #define.

Verified: 30-step Shell-Patch GPU run completes without errors/NaN.
Step wall time ~4.4s (step_fn ~2.0s + RP ~0.68s + constraint ~0.70s).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 18:50:10 +08:00
5eb49949d9 Fix AHF crash under CUDA resident-sync mode
Download BSSN StateList from GPU to CPU before AHFinderDirect_find_horizons
so that AH_Interp_Points reads valid field data instead of stale CPU arrays.
The resident-sync path keeps canonical state on GPU; without this download the
Newton iteration diverges and probes outside the computational domain.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 16:11:56 +08:00
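
The stale-host problem fixed above can be illustrated with a small model, assuming the canonical state lives on the device and the host copy is refreshed only on request. The type ResidentField and the function cpu_read are hypothetical, and the "device" is a plain std::vector so the sketch runs without CUDA.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of resident-sync state. `device` stands in for GPU
// memory; a real code would use cudaMemcpy in sync_to_host().
struct ResidentField {
    std::vector<double> host;
    std::vector<double> device;
    bool host_stale = false;  // true when only `device` holds current data

    explicit ResidentField(std::size_t n) : host(n, 0.0), device(n, 0.0) {}

    // A GPU step updates the device copy only and marks the host copy stale.
    void gpu_step(double increment) {
        for (double &v : device) v += increment;
        host_stale = true;
    }

    // Download the state before any CPU consumer reads it.
    void sync_to_host() {
        if (host_stale) {
            host = device;
            host_stale = false;
        }
    }
};

// Hypothetical CPU consumer (stand-in for a horizon-finder interpolation).
inline double cpu_read(ResidentField &f, std::size_t i) {
    f.sync_to_host();
    return f.host[i];
}
```

Without the sync_to_host() call, cpu_read would return the pre-step value, which is the kind of stale data the commit message blames for the diverging Newton iteration.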
fea2dcc0d5 Fix BSSN-EM runtime crash 2026-05-07 16:47:55 +08:00
96829d0441 Optimize Z4C GPU runtime defaults 2026-05-07 15:37:09 +08:00
cb911dec06 Add EM GPU fast paths and defaults 2026-05-07 12:18:56 +08:00
ae64a22178 Complete BSSN-EScalar CUDA resident transfers 2026-05-05 23:57:42 +08:00
85fe29cc2e Optimize BSSN-EScalar CUDA path 2026-05-05 10:47:46 +08:00
b1974ef146 Stabilize device AMR restrict across regrid 2026-04-30 20:01:18 +08:00
6835608f92 Add configurable analysis MAP cadence 2026-04-30 19:10:12 +08:00
da4d56ccf7 Optimize BSSN surface interpolation fast path 2026-04-30 18:25:21 +08:00
8486532920 Add resident BSSN GPU point interpolation 2026-04-30 11:39:15 +08:00
1ee229a91f Add keyed BSSN CUDA resident banks 2026-04-29 19:44:19 +08:00
68eab03bac Add opt-in BSSN CUDA resident AMR path 2026-04-29 19:15:37 +08:00
c768e1220b Also disable cached sync for Z4C 2026-04-25 10:25:54 +08:00
02f149e2e3 Disable cached sync for BSSN-EScalar 2026-04-25 10:17:47 +08:00
422e8ec4dc Fallback BSSN-EScalar restrict/prolong path 2026-04-25 10:10:34 +08:00
f521a97563 Fix ABE CPU version build error 2026-04-25 09:39:49 +08:00
768345954f Add optional BSSN kernel profiling switches
(cherry picked from commit 9c31384b2f)
2026-04-25 08:39:43 +08:00
6410c62e3e Add fine-grained step timing and trim BH RHS overhead
(cherry picked from commit 968522995b)
2026-04-25 08:37:19 +08:00
11977eb82f Merge wave and mass extraction interpolation
(cherry picked from commit f3988ac8ca)
2026-04-25 08:25:34 +08:00
c589097618 Reuse mass integrand across detector radii
(cherry picked from commit 4b10519876)
2026-04-25 08:24:11 +08:00
b713e5a9be Batch constraint norm reductions
(cherry picked from commit 3a58273501)
2026-04-25 08:22:00 +08:00
0396701572 Optimize constraint refresh after regrid
(cherry picked from commit 5c65cea2f0)
2026-04-25 08:18:51 +08:00
bb20c9a876 Fix ADM Constraint Violation Analysis 2026-04-15 19:19:16 +08:00
8fe60ea703 Add zero matter handling and interpolation for resident state in CUDA BSSN 2026-04-15 00:25:53 +08:00
f9119e8a2a Add resident-GA mode switch and simplify sync logic 2026-04-14 21:09:27 +08:00
e952ee8e91 Batch GA/BH subset sync with indexed GPU pack/unpack buffers 2026-04-13 20:40:09 +08:00
c5d1268dd1 Batch patch-boundary copy and gate CPU BC in GPU substeps 2026-04-13 11:52:17 +08:00
1b3c0b80d2 Refactor CUDA step buffers to remove loop-time allocations 2026-04-13 10:33:03 +08:00
636e35bfd8 Add direct CUDA resident-state sync path and profiling hooks 2026-04-13 00:57:05 +08:00
4fa12a2009 Integrate CUDA support into RK4 substep execution 2026-04-12 22:11:44 +08:00
aaf7bf0a26 Merge remote-tracking branch 'origin/main' 2026-04-12 20:55:42 +08:00
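b35e1b289f Add a switch to disable memory-usage statistics printing 2026-03-03 16:17:47 +08:00
4b9de28feb Make the coarse-level Sync_cached in the Restrict/Prolong chain optional (skipped by default)
OutBdLow2Hi_cached reads the coarse owned region (not the coarse ghost/buffer zones).
To restore the old behavior, define RP_SYNC_COARSE_AFTER_RESTRICT=1 at compile time.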
2026-03-03 14:25:27 +08:00
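4012e9d068 perf(RestrictProlong): replace non-cached calls with Restrict_cached/OutBdLow2Hi_cached; switch Sync_finish to progressive unpacking
- In RestrictProlong/RestrictProlong_aux, replace Restrict() and OutBdLow2Hi() with their _cached versions,
  which reuse the gridseg lists and MPI buffers instead of reallocating them on every call
- Add two sets of per-level caches: sync_cache_restrict and sync_cache_outbd
- Change Sync_finish from MPI_Waitall to MPI_Waitsome with progressive unpacking to reduce tail latency
- Extend AsyncSyncState with req_node/req_is_recv/pending_recv fields to support progressive unpacking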

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-02 20:48:38 +08:00
CGH0S7
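efc8bf29ea Invalidate sync caches on demand: Regrid_Onelevel now returns bool
Change the return type of cgh::Regrid_Onelevel from void to bool:
it returns true when the grid actually moved and false otherwise.
Callers invalidate sync_cache_* only when it returns true, avoiding the
redundant cost of unconditionally invalidating every level's caches at the end of each RecursiveStep.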

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-02-25 16:00:26 +08:00
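
The on-demand invalidation described above can be sketched as follows, assuming a regrid routine that reports whether anything moved. LevelCache, regrid_one_level, use_cache, and run_steps are hypothetical names for illustration, not the project's code.

```cpp
#include <vector>

// Hypothetical per-level cache: `rebuilds` counts how often it was rebuilt.
struct LevelCache {
    bool valid = false;
    int rebuilds = 0;
};

// Hypothetical regrid: returns true only if the grid actually moved.
inline bool regrid_one_level(int &grid_origin, int target_origin) {
    if (grid_origin == target_origin) return false;
    grid_origin = target_origin;
    return true;
}

// Rebuild the cache lazily, only when it has been invalidated.
inline void use_cache(LevelCache &c) {
    if (!c.valid) {
        ++c.rebuilds;
        c.valid = true;
    }
}

// Run one step per target origin and return the number of cache rebuilds.
inline int run_steps(const std::vector<int> &targets) {
    int origin = 0;
    LevelCache cache;
    for (int t : targets) {
        use_cache(cache);
        if (regrid_one_level(origin, t)) cache.valid = false;  // on demand
    }
    return cache.rebuilds;
}
```

In this model the cache is rebuilt once at startup and once after each real grid move, rather than once per step.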
5c1790277b Replace nested OutBdLow2Hi loops with batch calls in RestrictProlong
The 8 nested while(Ppc){while(Pp){OutBdLow2Hi(single,single,...)}}
loops across RestrictProlong (3 overloads) and ProlongRestrict each
produced N_c × N_f separate transfer() → MPI_Waitall barriers.
Replace with the existing batch OutBdLow2Hi(MyList<Patch>*,...) which
merges all patch pairs into a single transfer() call with 1 MPI_Waitall.

Also add Restrict_cached, OutBdLow2Hi_cached, OutBdLow2Himix_cached
to Parallel (unused for now — kept as infrastructure for future use).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-11 16:09:08 +08:00
738498cb28 Optimize MPI communication in RestrictProlong and surface_integral
Cache Sync in RestrictProlong: replace 11 basic Parallel::Sync() calls
with Parallel::Sync_cached() across RestrictProlong, RestrictProlong_aux,
and ProlongRestrict to avoid rebuilding grid segment lists every call.

Merge paired MPI_Allreduce in surface_integral: combine 9 pairs of
consecutive RP/IP Allreduce calls into single calls with count=2*NN.

Merge scalar MPI_Allreduce in surf_MassPAng: combine 3 groups of 7
scalar Allreduce calls (mass + angular/linear momentum) into single
calls with count=7.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 22:07:12 +08:00
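
A rough illustration of merging scalar reductions into one call with count=7, as in the surf_MassPAng change above. The Reducer type is a hypothetical single-process stand-in that only counts calls; a real MPI code would invoke MPI_Allreduce at that point.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a global sum reduction. On a single process the
// sum is just a copy; `calls` counts how many reductions were issued.
struct Reducer {
    int calls = 0;
    void allreduce_sum(const double *in, double *out, std::size_t count) {
        ++calls;
        for (std::size_t i = 0; i < count; ++i) out[i] = in[i];
    }
};

// Before: one reduction per scalar (seven global synchronizations).
inline std::array<double, 7> reduce_separately(Reducer &r,
                                               const std::array<double, 7> &local) {
    std::array<double, 7> global{};
    for (std::size_t i = 0; i < local.size(); ++i)
        r.allreduce_sum(&local[i], &global[i], 1);
    return global;
}

// After: pack the scalars into one buffer and reduce once with count=7.
inline std::array<double, 7> reduce_merged(Reducer &r,
                                           const std::array<double, 7> &local) {
    std::array<double, 7> global{};
    r.allreduce_sum(local.data(), global.data(), local.size());
    return global;
}
```

Both variants produce the same values; the merged form issues one reduction instead of seven, which is the saving the commit targets.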
42b9cf1ad9 Optimize MPI Sync with merged transfers, caching, and async overlap
Phase 1: Merge N+1 transfer() calls into a single transfer() per
Sync(PatchList), reducing N+1 MPI_Waitall barriers to 1 via new
Sync_merged() that collects all intra-patch and inter-patch grid
segment lists into combined per-rank arrays.

Phase 2: Cache grid segment lists and reuse grow-only communication
buffers across RK4 substeps via SyncCache struct. Caches are per-level
and per-variable-list (predictor/corrector), invalidated on regrid.
Eliminates redundant build_ghost_gsl/build_owned_gsl0/build_gstl
rebuilds and malloc/free cycles between regrids.

Phase 3: Split Sync into async Sync_start/Sync_finish to overlap
Cartesian ghost zone exchange (MPI_Isend/Irecv) with Shell patch
synchronization. Uses MPI tag 2 to avoid conflicts with SH->Synch()
which uses transfer() with tag 1.

Also updates makefile.inc paths and flags for local build environment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 21:03:37 +08:00
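
The Phase 2 idea above (cached segment lists plus grow-only buffers, invalidated on regrid) might look roughly like this. SyncCacheSketch and its members are hypothetical and much simpler than the SyncCache struct the commit refers to.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of a per-level sync cache.
struct SyncCacheSketch {
    bool valid = false;
    std::vector<int> segments;   // stand-in for cached grid segment lists
    std::vector<double> buffer;  // grow-only communication buffer
    int list_builds = 0;         // how often the segment list was rebuilt
    int buffer_growths = 0;      // how often the buffer had to grow

    // Called on regrid: drop the segment list but keep the buffer storage.
    void invalidate() { valid = false; }

    // Called every substep: rebuild lists only if invalid, and grow the
    // buffer only if the request exceeds its current size.
    double *prepare(std::size_t n_segments, std::size_t n_values) {
        if (!valid) {
            segments.assign(n_segments, 0);
            ++list_builds;
            valid = true;
        }
        if (buffer.size() < n_values) {
            buffer.resize(n_values);
            ++buffer_growths;
        }
        return buffer.data();
    }
};
```

Across the four RK4 substeps between regrids, prepare() then builds the list once and allocates at most once; a regrid forces a list rebuild but reuses the existing buffer unless more space is needed.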
e9d321fd00 Convert MPI_Allreduce error checks to non-blocking MPI_Iallreduce overlapped with Sync
Replace all 8 blocking MPI_Allreduce error-check calls with MPI_Iallreduce,
overlapping the reduction with subsequent Parallel::Sync/SH->Synch operations.
MPI_Wait is called after Sync completes to retrieve the error result.

This hides the Allreduce latency (46.5% of CPU time) behind the ghost zone
exchange communication that must happen anyway. Safe because Sync only copies
existing data to ghost zones and the error check + abort happens before any
further computation uses the synced data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 12:39:29 +08:00
ed1d86ade9 Merge paired MPI_Allreduce error checks to reduce global sync barriers
In the two Step() functions that handle both Patch and Shell Patch,
defer the Patch error check until after Shell Patch computation completes,
then perform a single combined MPI_Allreduce instead of two separate ones.
This eliminates 4 MPI_Allreduce calls per timestep (2 per Step function,
Predictor + Corrector phases each). The optimization is mathematically
equivalent: in normal execution (no NaN) behavior is identical; on error,
both Patch and Shell data are dumped before MPI_Abort.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 12:12:16 +08:00
f2fc9af70e asc26 amss-ncku initialized 2026-01-13 15:01:15 +08:00