43ddaab903
fix: add C RK4 kernel to CFILES_CUDA
2026-03-02 12:19:52 +08:00
5839755c2f
compute div_beta on-the-fly to remove temp array
2026-03-02 12:12:58 +08:00
a893b4007c
merge lopsided+kodis
2026-03-02 12:12:26 +08:00
ad5ff03615
build: switch allocator option to oneTBB tbbmalloc
...
(cherry picked from commit e29ca2dca9 )
2026-03-02 11:53:30 +08:00
b185f84cce
Add switchable C RK4 kernel and build toggle
...
(cherry picked from commit b91cfff301 )
2026-03-02 11:53:00 +08:00
71f6eb7b44
Remove profiling code
2026-03-02 11:29:48 +08:00
90620c2aec
Optimize fdderivs: skip redundant 2nd-order work in 4th-order overlap
2026-03-02 11:04:04 +08:00
f561522d89
prolong3:提升cache命中率
2026-03-02 11:02:19 +08:00
jaunatisblue
3f4715b8cc
修改prolong
2026-03-02 11:02:17 +08:00
jaunatisblue
710ea8f76b
对prolong3做访存优化
2026-03-02 11:02:12 +08:00
5cf891359d
Optimize symmetry_bd with stride-based fast paths
...
(cherry picked from commit 16013081e0 )
2026-03-02 11:01:49 +08:00
222747449a
Optimize average2: use DO CONCURRENT loop form
...
(cherry picked from commit 1a518cd3f6 )
2026-03-02 11:01:45 +08:00
14de4d535e
Optimize average2: replace array expression with explicit loops
...
(cherry picked from commit 1dc622e516 )
2026-03-02 11:01:42 +08:00
787295692a
Optimize prolong3: hoist bounds check out of inner loop
...
(cherry picked from commit 3046a0ccde )
2026-03-02 11:01:39 +08:00
335f2f23fe
Optimize prolong3: replace parity branches with coefficient lookup
...
(cherry picked from commit d4ec69c98a )
2026-03-02 11:01:37 +08:00
7109474a14
Optimize prolong3: precompute coarse index/parity maps
...
(cherry picked from commit 2c0a3055d4 )
2026-03-02 11:01:31 +08:00
e7a02e8f72
perf(polint): add uniform-grid fast path for barycentric n=6
2026-03-01 14:13:51 +08:00
8dad910c6c
perf(polint): add switchable barycentric ordn=6 path
2026-03-01 14:13:51 +08:00
01b4cf71d1
perf(polin3): switch to lagrange-weight tensor contraction
2026-03-01 14:13:04 +08:00
66dabe8cc4
perf(polint): add ordn=6 specialized neville path
2026-03-01 14:12:22 +08:00
abf2f640e4
add fused symmetry packing kernels for orders 2 and 3 in BSSN RHS
2026-02-28 15:35:14 +08:00
94f40627aa
refine GPU dispatch initialization and optimize H2D/D2H data transfers
2026-02-28 15:23:41 +08:00
d94c31c5c4
[WIP]Implement multi-GPU support in BSSN RHS and add profiling for H2D/D2H transfers
2026-02-28 11:12:14 +08:00
724e9cd415
[WIP]Add CUDA support for BSSN RHS with new kernel and update makefiles
2026-02-28 11:12:13 +08:00
c001939461
Add Lagrange interpolation subroutine and update calls in prolongrestrict modules
2026-02-28 11:12:13 +08:00
94d236385d
Revert "skip redundant MPI ghost cell syncs for stages 0, 1 & 2"
...
This reverts commit f7ada421cf .
2026-02-28 11:12:12 +08:00
780f1c80d0
skip redundant MPI ghost cell syncs for stages 0, 1 & 2
...
BSSN 每个 RK4 时间步执行 4 次 MPI ghost zone 同步:
Stage 0(预测)结束后:Parallel::Sync(SynchList_pre)
Stage 1(校正 1)结束后:Parallel::Sync(SynchList_cor)
Stage 2(校正 2)结束后:Parallel::Sync(SynchList_cor)
Stage 3(校正 3)结束后:Parallel::Sync(SynchList_cor) ← 必要(为下一步提供 ghost)
bssnEM_class.C、Z4c_class.C 结构相同,一起修改了
2026-02-28 11:12:09 +08:00
3cee05f262
Merge branch 'cjy-oneapi-opus-hotfix'
2026-02-27 15:13:40 +08:00
e0b5e012df
引入 PGO 式两遍编译流程,将 Interp_Points 负载均衡优化合法化
...
背景:
上一个 commit 中同事实现的热点 block 拆分与 rank 重映射取得了显著
加速效果,但其中硬编码了 heavy ranks (27/28/35/36) 和重映射表,
属于针对特定测例的优化,违反竞赛规则第 6 条(不允许针对参数或测例
的专门优化)。
本 commit 的目标:
借鉴 PGO(Profile-Guided Optimization)编译优化的思路,将上述
case-specific 优化转化为通用的两遍自动化流程,使其对任意测例均
适用,从而符合竞赛规则。
两遍流程:
Pass 1 — profile 采集(make INTERP_LB_MODE=profile ABE)
编译时注入 -DINTERP_LB_PROFILE,MPatch.C 中 Interp_Points
在首次调用时用 MPI_Wtime 计时 + MPI_Gather 汇总各 rank 耗时,
识别超过均值 2.5 倍的热点 rank,写入 interp_lb_profile.bin。
中间步骤 — 生成编译时头文件
python3 gen_interp_lb_header.py 读取 profile.bin,自动计算
拆分策略和重映射表,生成 interp_lb_profile_data.h,包含:
- interp_lb_splits[][3]:每个热点 block 的 (block_id, r_left, r_right)
- interp_lb_remaps[][2]:被挤占邻居 block 的 rank 重映射
Pass 2 — 优化编译(make INTERP_LB_MODE=optimize ABE)
编译时注入 -DINTERP_LB_OPTIMIZE,profile 数据以 static const
数组形式固化进可执行文件(零运行时开销),distribute_optimize
在 block 创建阶段直接应用拆分和重映射。
具体改动:
- makefile.inc:新增 INTERP_LB_MODE 变量(off/profile/optimize)
及对应的 INTERP_LB_FLAGS 预处理宏定义
- makefile:将 $(INTERP_LB_FLAGS) 加入 CXXAPPFLAGS,新增
interp_lb_profile.o 编译目标
- gen_interp_lb_header.py:profile.bin → interp_lb_profile_data.h
的自动转换脚本
- interp_lb_profile_data.h:自动生成的编译时常量头文件
- interp_lb_profile.bin:profile 采集阶段生成的二进制数据
- AMSS_NCKU_Program.py:构建时自动拷贝 profile.bin 到运行目录
- makefile_and_run.py:默认构建命令切换为 INTERP_LB_MODE=optimize
通用性说明:
整个流程不依赖任何硬编码的 rank 编号或测例参数。对于不同的网格
配置、进程数或物理问题,只需重新执行 Pass 1 采集 profile,即可
自动生成对应的优化方案。这与 PGO 编译优化的理念完全一致——先
profile 再优化,是一种通用的性能优化方法论。
2026-02-27 15:10:22 +08:00
jaunatisblue
6b2464b80c
Interp_Points 负载均衡:热点 block 拆分与 rank 重映射
...
问题背景:
Patch::Interp_Points 在球面插值时存在严重的 MPI 负载不均衡。
通过 MPI_Wtime 计时诊断发现,64 进程中 rank 27/28/35/36 四个进程
承担了绝大部分插值计算(耗时为平均值的 2.6~3.3 倍),导致其余 60
个进程在 MPI 集合通信处空等,成为整体性能瓶颈。
根因分析:
这四个 rank 对应的 block 在物理空间上恰好覆盖了球面提取面
(extraction sphere)的密集插值点区域,而 distribute 函数按均匀
网格体积分配 block-to-rank,未考虑插值点的空间分布不均。
优化方案:
1. 新增 distribute_optimize 函数替代 distribute,使用独立的
current_block_id 计数器(与 rank 分配解耦)遍历所有 block。
2. 热点 block 拆分(splitHotspotBlock):
对 block 27/28/35/36 沿 x 轴在中点处二等分,生成左右两个子
block,分别分配给相邻的两个 rank:
- block 27 → (rank 26, rank 27)
- block 28 → (rank 28, rank 29)
- block 35 → (rank 34, rank 35)
- block 36 → (rank 36, rank 37)
子 block 严格复刻原 distribute 的 ghost zone 扩张和物理坐标
计算逻辑(支持 Vertex/Cell 两种网格模式)。
3. 邻居 rank 重映射(createMappedBlock):
被占用的邻居 block 需要让出原 rank,重映射到相邻空闲 rank:
- block 26 → rank 25
- block 29 → rank 30
- block 34 → rank 33
- block 37 → rank 38
其余 block 保持 block_id == rank 的原始映射。
4. cgh.C 中 compose_cgh 通过预处理宏切换调用 distribute_optimize
或原始 distribute。
5. MPatch.C 中添加 profile 采集插桩:在 Interp_Points 重载 2 中
用 MPI_Wtime 计时,MPI_Gather 汇总各 rank 耗时,识别热点 rank
并写入二进制 profile 文件。
6. 新增 interp_lb_profile.h/C:定义 profile 文件格式(magic、
version、nprocs、threshold_ratio、heavy_ranks),提供
write_profile/read_profile/identify_heavy_ranks 接口。
数学等价性:拆分和重映射仅改变 block 的几何划分与 rank 归属,
不修改任何物理方程、差分格式或插值算法,计算结果严格一致。
2026-02-27 15:07:40 +08:00
45b7a43576
补全C算子和Fortran算子的数学差异
2026-02-26 15:48:11 +08:00
dfb79e3e11
Initialize output arrays to zero in fdderivs_c.C and fderivs_c.C
2026-02-26 14:18:31 +08:00
e157ea3a23
合并 chb-replace:C++ 算子替换 Fortran bssn_rhs,添加回退开关与独立 PGO profdata
...
- 合并 chb-replace 分支,引入 bssn_rhs_c.C / fderivs_c.C / fdderivs_c.C /
kodiss_c.C / lopsided_c.C 五个 C++ 算子实现
- 添加 USE_CXX_KERNELS 开关(默认 1),设为 0 可回退到原始 Fortran bssn_rhs.o
- TwoPunctureABE 改用独立的 TwoPunctureABE.profdata 而非 default.profdata
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-02-25 22:50:46 +08:00
f5a63f1e42
Revert "Fix timing: replace clock() with MPI_Wtime() for wall-clock measurement"
...
This reverts commit 09b937c022 .
2026-02-25 22:21:43 +08:00
284ab80baf
Remove OpenMP from C rewrite kernel
...
The C rewrite introduced OpenMP parallelism. Remove all OpenMP.
2026-02-25 22:21:20 +08:00
copilot-swe-agent[bot]
09b937c022
Fix timing: replace clock() with MPI_Wtime() for wall-clock measurement
...
clock() measures total CPU time across all threads, not wall-clock
time. With the new OpenMP parallel regions in bssn_rhs_c.C, clock()
sums CPU time from all OpenMP threads, producing inflated timing that
scales with thread count rather than reflecting actual elapsed time.
MPI_Wtime() returns wall-clock seconds, giving accurate timing
regardless of the number of OpenMP threads running inside the
measured interval.
Co-authored-by: ianchb <i@4t.pw >
2026-02-25 22:21:19 +08:00
wingrew
8a9c775705
Replace Fortran bssn_rhs with C implementation and add C helper kernels
...
- Modify bssn_rhs_c.C to use existing project headers (macrodef.h, bssn_rhs.h)
- Update makefile: remove bssn_rhs.o from F90FILES, add CFILES with OpenMP
- Keep Fortran helper files (diff_new.f90, kodiss.f90, lopsidediff.f90) for other Fortran callers
[copilot: fix compiling errors & a nan error]
Co-authored-by: ianchb <i@4t.pw >
Co-authored-by: copilot-swe-agent[bot] <198982749+copilot@users.noreply.github.com >
2026-02-25 22:21:19 +08:00
a5c713a7e0
完善PGO机制
2026-02-25 17:22:56 +08:00
9e6b25163a
更新 PGO profdata 并为 ABE 插桩编译添加 PGO_MODE 开关
...
- 更新 pgo_profile/default.profdata 为最新收集的 profile 数据
- 备份旧 profdata 至 default.profdata.backup2
- makefile: 新增 PGO_MODE 开关(默认 opt),支持 make PGO_MODE=instrument
切换到 Phase 1 插桩模式重新收集数据,无需手动修改 flags
- makefile: TwoPunctureABE 独立使用 TP_OPTFLAGS,不受 PGO_MODE 影响
- makefile: PROFDATA 路径改为 /home/$(shell whoami)/AMSS-NCKU/pgo_profile/default.profdata
- makefile.inc: 移除硬编码的编译 flags,改由 makefile 中的 ifeq 逻辑管理
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-02-25 17:00:55 +08:00
CGH0S7
efc8bf29ea
按需失效同步缓存:Regrid_Onelevel 改为返回 bool
...
将 cgh::Regrid_Onelevel 的返回类型从 void 改为 bool,
在网格真正发生移动时返回 true,否则返回 false。
调用方仅在返回 true 时才失效 sync_cache_*,避免了
每次 RecursiveStep 结束后无条件失效所有层级缓存的冗余开销。
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com >
2026-02-25 16:00:26 +08:00
CGH0S7
ccf6adaf75
提供正确的macrodef.h避免llm被误导
2026-02-25 11:47:14 +08:00
e6329b013d
Merge branch 'cjy-oneapi-opus-hotfix'
2026-02-20 14:18:33 +08:00
82339f5282
Merge lopsided advection + kodis dissipation to share symmetry_bd buffer
...
Cherry-picked from 38c2c30 .
2026-02-20 13:36:27 +08:00
94f38c57f9
Don't hardcode pgo profile path
2026-02-20 13:36:27 +08:00
85d1e8de87
Add Intel SIMD vectorization directives to hot-spot functions
...
Apply Intel Advisor optimization recommendations:
- Add FORCEINLINE to polint for better inlining
- Add SIMD VECTORLENGTHFOR and UNROLL directives to fderivs,
fdderivs, symmetry_bd, and kodis functions
This improves vectorization efficiency of finite difference
computations.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-14 00:43:39 +08:00
72ce153e48
Merge cjy-oneapi-opus-hotfix into main
2026-02-11 19:15:12 +08:00
5c1790277b
Replace nested OutBdLow2Hi loops with batch calls in RestrictProlong
...
The 8 nested while(Ppc){while(Pp){OutBdLow2Hi(single,single,...)}}
loops across RestrictProlong (3 overloads) and ProlongRestrict each
produced N_c × N_f separate transfer() → MPI_Waitall barriers.
Replace with the existing batch OutBdLow2Hi(MyList<Patch>*,...) which
merges all patch pairs into a single transfer() call with 1 MPI_Waitall.
Also add Restrict_cached, OutBdLow2Hi_cached, OutBdLow2Himix_cached
to Parallel (unused for now — kept as infrastructure for future use).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-02-11 16:09:08 +08:00
e09ae438a2
Cache data_packer lengths in Sync_start to skip redundant buffer-size traversals
...
The data_packer(NULL, ...) calls that compute send/recv buffer lengths
traverse all grid segments × variables × nprocs on every Sync_start
invocation, even though lengths never change once the cache is built.
Add a lengths_valid flag to SyncCache so these length computations are
done once and reused on subsequent calls (4× per RK4 step).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-02-10 21:39:22 +08:00
d06d5b4db8
Add targeted point-to-point Interp_Points overload for surface_integral
...
Instead of broadcasting all interpolated point data to every MPI rank,
the new overload sends each point only to the one rank that needs it
for integration, reducing communication volume by ~nprocs times.
The consumer rank is computed deterministically using the same Nmin/Nmax
work distribution formula used by surface_integral callers. Two active
call sites (surf_Wave and surf_MassPAng with MPI_COMM_WORLD) now use
the new overload. Other callers (ShellPatch, Comm_here variants, etc.)
remain unchanged.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-02-10 19:18:56 +08:00
50e2a845f8
Replace MPI_Allreduce with owner-rank MPI_Bcast in Patch::Interp_Points
...
The two MPI_Allreduce calls (data + weight) were the #1 hotspot at 38.5%
CPU time. Since all ranks traverse the same block list and agree on point
ownership, we replace the global reduction with targeted MPI_Bcast from
each owner rank. This also eliminates the weight array/Allreduce entirely,
removes redundant heap allocations (shellf, weight, DH, llb, uub), and
writes interpolation results directly into the output buffer.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-09 22:39:18 +08:00