AMSS-NCKU

64-BitBrainstorm_2026/AMSS-NCKU

Author	SHA1	Message	Date
CGH0S7	fbb2ed112d	Fix: Compute_Constraint/analysis use CPU Fortran for shell RHS. Limit GPU shell RHS redirection to Step and SHStep only via #define/#undef. Compute_Constraint, Interp_Constraint, and Constraint_Out continue using the CPU Fortran path to avoid GPU alloc-per-call overhead during initialization and analysis phases. Also: wrap compare_result_gpu in #ifdef RESULT_CHECK to avoid link error. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-09 19:25:45 +08:00
CGH0S7	bd4ce3fbf3	GPU-accelerate Shell-Patch BSSN evolution. Phase 1: Enable GPU resident state for Cartesian patches in Shell mode. - Remove WithShell guard from bssn_cuda_use_resident_sync(). - Add GPU-to-CPU state sync before shell CPU consumers (SHStep, CS_Inter, inline shell RHS blocks). Phase 2: GPU-accelerate BSSN Shell Patch RHS. - Create bssn_gpu.h with RHS_SS_PARA macro and gpu_rhs_ss declaration. - Fix compilation bugs in legacy bssn_gpu_rhs_ss.cu (deprecated cudaThreadSynchronize, tmp_con2 redeclaration, ijkmin3_h typo, CUDA_SAFE_CALL, missing compare_result guard). - Add bssn_gpu_rhs_ss.o to CFILES_CUDA_BSSN with build rule. - Write cuda_compute_rhs_bssn_ss() wrapper bridging Fortran and GPU parameter conventions, redirect all shell RHS call sites via #define. Verified: 30-step Shell-Patch GPU run completes without errors/NaN. Step wall time ~4.4s (step_fn ~2.0s + RP ~0.68s + constraint ~0.70s). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-09 18:50:10 +08:00
CGH0S7	5eb49949d9	Fix AHF crash under CUDA resident-sync mode. Download BSSN StateList from GPU to CPU before AHFinderDirect_find_horizons so that AH_Interp_Points reads valid field data instead of stale CPU arrays. The resident-sync path keeps canonical state on GPU; without this download the Newton iteration diverges and probes outside the computational domain. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-09 16:11:56 +08:00
CGH0S7	fea2dcc0d5	Fix BSSN-EM runtime crash	2026-05-07 16:47:55 +08:00
CGH0S7	96829d0441	Optimize Z4C GPU runtime defaults	2026-05-07 15:37:09 +08:00
CGH0S7	cb911dec06	Add EM GPU fast paths and defaults	2026-05-07 12:18:56 +08:00
CGH0S7	ae64a22178	Complete BSSN-EScalar CUDA resident transfers	2026-05-05 23:57:42 +08:00
CGH0S7	85fe29cc2e	Optimize BSSN-EScalar CUDA path	2026-05-05 10:47:46 +08:00
CGH0S7	b1974ef146	Stabilize device AMR restrict across regrid	2026-04-30 20:01:18 +08:00
CGH0S7	6835608f92	Add configurable analysis MAP cadence	2026-04-30 19:10:12 +08:00
CGH0S7	da4d56ccf7	Optimize BSSN surface interpolation fast path	2026-04-30 18:25:21 +08:00
CGH0S7	8486532920	Add resident BSSN GPU point interpolation	2026-04-30 11:39:15 +08:00
CGH0S7	1ee229a91f	Add keyed BSSN CUDA resident banks	2026-04-29 19:44:19 +08:00
CGH0S7	68eab03bac	Add opt-in BSSN CUDA resident AMR path	2026-04-29 19:15:37 +08:00
ianchb	c768e1220b	Also disable cached sync for Z4C	2026-04-25 10:25:54 +08:00
CGH0S7	02f149e2e3	Disable cached sync for BSSN-EScalar	2026-04-25 10:17:47 +08:00
CGH0S7	422e8ec4dc	Fallback BSSN-EScalar restrict/prolong path	2026-04-25 10:10:34 +08:00
ianchb	f521a97563	Fix ABE CPU version build error	2026-04-25 09:39:49 +08:00
CGH0S7	768345954f	Add optional BSSN kernel profiling switches (cherry picked from commit `9c31384b2f`)	2026-04-25 08:39:43 +08:00
CGH0S7	6410c62e3e	Add fine-grained step timing and trim BH RHS overhead (cherry picked from commit `968522995b`)	2026-04-25 08:37:19 +08:00
CGH0S7	11977eb82f	Merge wave and mass extraction interpolation (cherry picked from commit `f3988ac8ca`)	2026-04-25 08:25:34 +08:00
CGH0S7	c589097618	Reuse mass integrand across detector radii (cherry picked from commit `4b10519876`)	2026-04-25 08:24:11 +08:00
CGH0S7	b713e5a9be	Batch constraint norm reductions (cherry picked from commit `3a58273501`)	2026-04-25 08:22:00 +08:00
CGH0S7	0396701572	Optimize constraint refresh after regrid (cherry picked from commit `5c65cea2f0`)	2026-04-25 08:18:51 +08:00
ianchb	bb20c9a876	Fix ADM Constraint Violation Analysis	2026-04-15 19:19:16 +08:00
ianchb	8fe60ea703	Add zero matter handling and interpolation for resident state in CUDA BSSN	2026-04-15 00:25:53 +08:00
ianchb	f9119e8a2a	Add resident-GA mode switch and simplify sync logic	2026-04-14 21:09:27 +08:00
ianchb	e952ee8e91	Batch GA/BH subset sync with indexed GPU pack/unpack buffers	2026-04-13 20:40:09 +08:00
ianchb	c5d1268dd1	Batch patch-boundary copy and gate CPU BC in GPU substeps	2026-04-13 11:52:17 +08:00
ianchb	1b3c0b80d2	Refactor CUDA step buffers to remove loop-time allocations	2026-04-13 10:33:03 +08:00
ianchb	636e35bfd8	Add direct CUDA resident-state sync path and profiling hooks	2026-04-13 00:57:05 +08:00
ianchb	4fa12a2009	Integrate CUDA support into RK4 substep execution	2026-04-12 22:11:44 +08:00
ianchb	aaf7bf0a26	Merge remote-tracking branch 'origin/main'	2026-04-12 20:55:42 +08:00
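CGH0S7	b35e1b289f	Add a switch to disable printing of memory statistics	2026-03-03 16:17:47 +08:00
ianchb	4b9de28feb	Make the coarse-level Sync_cached in the Restrict/Prolong path optional (skipped by default). OutBdLow2Hi_cached reads the coarse owned region (not the coarse ghost/buffer zones). To restore the old behavior, define RP_SYNC_COARSE_AFTER_RESTRICT=1 at compile time.	2026-03-03 14:25:27 +08:00
CGH0S7	4012e9d068	perf(RestrictProlong): replace the non-cached calls with Restrict_cached/OutBdLow2Hi_cached and switch Sync_finish to progressive unpacking. - Replace Restrict() and OutBdLow2Hi() in RestrictProlong/RestrictProlong_aux with their _cached versions, reusing the gridseg lists and MPI buffers to avoid reallocating on every call. - Add two sets of per-level caches, sync_cache_restrict and sync_cache_outbd. - Change Sync_finish from MPI_Waitall to progressive unpacking with MPI_Waitsome, reducing tail latency. - Extend AsyncSyncState with the req_node/req_is_recv/pending_recv fields to support progressive unpacking. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-02 20:48:38 +08:00
CGH0S7	efc8bf29ea	Invalidate sync caches on demand: Regrid_Onelevel now returns bool. Change the return type of cgh::Regrid_Onelevel from void to bool; it returns true when the grid actually moved and false otherwise. Callers invalidate sync_cache_* only when it returns true, avoiding the redundant cost of unconditionally invalidating the caches of all levels at the end of every RecursiveStep. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>	2026-02-25 16:00:26 +08:00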
CGH0S7	5c1790277b	Replace nested OutBdLow2Hi loops with batch calls in RestrictProlong. The 8 nested while(Ppc){while(Pp){OutBdLow2Hi(single,single,...)}} loops across RestrictProlong (3 overloads) and ProlongRestrict each produced N_c × N_f separate transfer() → MPI_Waitall barriers. Replace with the existing batch OutBdLow2Hi(MyList<Patch>*,...) which merges all patch pairs into a single transfer() call with 1 MPI_Waitall. Also add Restrict_cached, OutBdLow2Hi_cached, OutBdLow2Himix_cached to Parallel (unused for now; kept as infrastructure for future use). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-02-11 16:09:08 +08:00
CGH0S7	738498cb28	Optimize MPI communication in RestrictProlong and surface_integral. Cache Sync in RestrictProlong: replace 11 basic Parallel::Sync() calls with Parallel::Sync_cached() across RestrictProlong, RestrictProlong_aux, and ProlongRestrict to avoid rebuilding grid segment lists every call. Merge paired MPI_Allreduce in surface_integral: combine 9 pairs of consecutive RP/IP Allreduce calls into single calls with count=2*NN. Merge scalar MPI_Allreduce in surf_MassPAng: combine 3 groups of 7 scalar Allreduce calls (mass + angular/linear momentum) into single calls with count=7. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 22:07:12 +08:00
CGH0S7	42b9cf1ad9	Optimize MPI Sync with merged transfers, caching, and async overlap. Phase 1: Merge N+1 transfer() calls into a single transfer() per Sync(PatchList), reducing N+1 MPI_Waitall barriers to 1 via new Sync_merged() that collects all intra-patch and inter-patch grid segment lists into combined per-rank arrays. Phase 2: Cache grid segment lists and reuse grow-only communication buffers across RK4 substeps via SyncCache struct. Caches are per-level and per-variable-list (predictor/corrector), invalidated on regrid. Eliminates redundant build_ghost_gsl/build_owned_gsl0/build_gstl rebuilds and malloc/free cycles between regrids. Phase 3: Split Sync into async Sync_start/Sync_finish to overlap Cartesian ghost zone exchange (MPI_Isend/Irecv) with Shell patch synchronization. Uses MPI tag 2 to avoid conflicts with SH->Synch() which uses transfer() with tag 1. Also updates makefile.inc paths and flags for local build environment. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 21:03:37 +08:00
CGH0S7	e9d321fd00	Convert MPI_Allreduce error checks to non-blocking MPI_Iallreduce overlapped with Sync. Replace all 8 blocking MPI_Allreduce error-check calls with MPI_Iallreduce, overlapping the reduction with subsequent Parallel::Sync/SH->Synch operations. MPI_Wait is called after Sync completes to retrieve the error result. This hides the Allreduce latency (46.5% of CPU time) behind the ghost zone exchange communication that must happen anyway. Safe because Sync only copies existing data to ghost zones and the error check + abort happens before any further computation uses the synced data. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 12:39:29 +08:00
CGH0S7	ed1d86ade9	Merge paired MPI_Allreduce error checks to reduce global sync barriers. In the two Step() functions that handle both Patch and Shell Patch, defer the Patch error check until after Shell Patch computation completes, then perform a single combined MPI_Allreduce instead of two separate ones. This eliminates 4 MPI_Allreduce calls per timestep (2 per Step function, Predictor + Corrector phases each). The optimization is mathematically equivalent: in normal execution (no NaN) behavior is identical; on error, both Patch and Shell data are dumped before MPI_Abort. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 12:12:16 +08:00
CGH0S7	f2fc9af70e	asc26 amss-ncku initialized	2026-01-13 15:01:15 +08:00

43 Commits
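
Commits 42b9cf1ad9 and efc8bf29ea describe per-level sync caches with grow-only buffers that are invalidated only when a regrid actually moves the grid. The C++ sketch below illustrates that pattern in isolation. It is not the repository's code: LevelSyncCache, SyncCaches, regrid_one_level, and step are hypothetical stand-ins, the "grid" is reduced to a single integer origin, and no MPI is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one per-level cache; the real SyncCache
// layout is not shown in the commit log.
struct LevelSyncCache {
    bool valid = false;          // are the cached segment lists usable?
    int rebuilds = 0;            // how many times the lists were rebuilt
    std::vector<double> buffer;  // grow-only communication buffer
};

class SyncCaches {
public:
    explicit SyncCaches(std::size_t levels) : caches_(levels) {}

    // Rebuild the cached lists only when invalid, and enlarge the
    // buffer only when the requested size exceeds its current size.
    void sync(std::size_t level, std::size_t needed) {
        assert(level < caches_.size());
        LevelSyncCache &c = caches_[level];
        if (!c.valid) {
            ++c.rebuilds;  // stands in for rebuilding grid segment lists
            c.valid = true;
        }
        if (c.buffer.size() < needed) c.buffer.resize(needed);
    }

    // Mark one level stale; its buffer memory is kept for reuse.
    void invalidate(std::size_t level) {
        assert(level < caches_.size());
        caches_[level].valid = false;
    }

    int rebuilds(std::size_t level) const { return caches_.at(level).rebuilds; }
    std::size_t buffer_size(std::size_t level) const {
        return caches_.at(level).buffer.size();
    }

private:
    std::vector<LevelSyncCache> caches_;
};

// Hypothetical regrid: returns true only if the grid origin changed.
bool regrid_one_level(int &origin, int target) {
    if (origin == target) return false;
    origin = target;
    return true;
}

// One step on a level: sync first, then invalidate on demand.
void step(SyncCaches &caches, std::size_t level, int &origin, int target,
          std::size_t needed) {
    caches.sync(level, needed);
    if (regrid_one_level(origin, target)) caches.invalidate(level);
}
```

In this sketch, repeated calls to step with an unchanged target reuse the cached state, and a rebuild happens only on the first sync after the origin moves. Invalidation leaves the buffer allocated, so a later sync pays for the rebuild but not for a fresh allocation unless it needs more space.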