1a518cd3f6
Optimize average2: use DO CONCURRENT loop form
2026-03-01 00:41:32 +08:00
1dc622e516
Optimize average2: replace array expression with explicit loops
2026-03-01 00:33:01 +08:00
3046a0ccde
Optimize prolong3: hoist bounds check out of inner loop
2026-03-01 00:17:30 +08:00
d4ec69c98a
Optimize prolong3: replace parity branches with coefficient lookup
2026-02-28 23:59:57 +08:00
2c0a3055d4
Optimize prolong3: precompute coarse index/parity maps
2026-02-28 23:53:30 +08:00
1eba73acbe
Disable core pinning first; speed comparison shows: no core pinning + SCX beats core pinning + SCX
2026-02-28 23:27:44 +08:00
b91cfff301
Add switchable C RK4 kernel and build toggle
2026-02-28 21:12:19 +08:00
e29ca2dca9
build: switch allocator option to oneTBB tbbmalloc
2026-02-28 17:16:00 +08:00
6493101ca0
bssn_rhs_c: recompute contracted Gamma terms to remove temp arrays
2026-02-28 16:34:23 +08:00
169986cde1
bssn_rhs_c: compute div_beta on-the-fly to remove temp array
2026-02-28 16:25:57 +08:00
1fbc213888
bssn_rhs_c: remove gxx/gyy/gzz temporaries in favor of dxx/dyy/dzz+1
2026-02-28 15:50:52 +08:00
6024708a48
derivs_c: split low/high stencil regions to reduce branch overhead
2026-02-28 15:42:31 +08:00
bc457d981e
bssn_rhs_c: merge lopsided+kodis with shared symmetry buffer
2026-02-28 15:23:01 +08:00
51dead090e
bssn_rhs_c: fuse the two final-RHS loops into one, passing the fij intermediates via local variables (Modify 6)
...
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-28 13:49:45 +08:00
34d6922a66
fdderivs_c: zero only the boundary faces instead of the full array, reducing wasted memory writes
...
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-28 13:20:06 +08:00
8010ad27ed
kodiss_c: tighten loop bounds to eliminate useless boundary iterations and branch checks
...
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-28 13:04:21 +08:00
38e691f013
bssn_rhs_c: fuse the Christoffel-correction and trK_rhs loops into one (Modify 5)
...
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-28 12:57:07 +08:00
808387aa11
bssn_rhs_c: fuse the fxx/Gamxa and Gamma_rhs_part2 loops into one (Modify 4)
...
fxx/fxy/fxz and Gamxa/ya/za are kept in local scalars and reused directly in Gamma_rhs part 2, reducing array reads and writes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-28 11:14:35 +08:00
c2b676abf2
bssn_rhs_c: fuse the A^{ij} index-raising and Gamma_rhs_part1 loops into one (Modify 3)
...
The six A^{ij} components are kept in local scalars and reused directly in the Gamma_rhs computation, eliminating extra reads of the Rxx..Ryz arrays
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-28 11:02:27 +08:00
2c60533501
bssn_rhs_c: fuse the inverse-metric, Gamma-constraint, and Christoffel loops into one (Modify 2)
...
Inverse-metric results are kept in local scalars and reused directly, reducing repeated reads of the gupxx..gupzz arrays; about 0.01 s faster per step
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-28 10:57:40 +08:00
318b5254cc
Update the verification script per the organizing committee's email: additionally check that the RMS of the 3D vector and of each of its three components is below 1.0%
2026-02-27 17:38:21 +08:00
3cee05f262
Merge branch 'cjy-oneapi-opus-hotfix'
2026-02-27 15:13:40 +08:00
e0b5e012df
Introduce a PGO-style two-pass build flow to legitimize the Interp_Points load-balancing optimization
...
Background:
The hotspot-block splitting and rank remapping implemented by a colleague
in the previous commit achieved a significant speedup, but it hardcoded
the heavy ranks (27/28/35/36) and the remapping table. That is an
optimization targeted at a specific test case, violating competition
rule 6 (no optimizations specific to parameters or test cases).
Goal of this commit:
Following the idea of PGO (Profile-Guided Optimization), turn the
case-specific optimization above into a generic, automated two-pass
flow that applies to any test case and thus complies with the
competition rules.
Two-pass flow:
Pass 1: profile collection (make INTERP_LB_MODE=profile ABE)
The build injects -DINTERP_LB_PROFILE. On its first call, Interp_Points
in MPatch.C times itself with MPI_Wtime, aggregates per-rank timings
with MPI_Gather, identifies hotspot ranks exceeding 2.5x the mean, and
writes interp_lb_profile.bin.
Intermediate step: generate a compile-time header
python3 gen_interp_lb_header.py reads profile.bin, automatically
computes the split strategy and remapping table, and generates
interp_lb_profile_data.h, containing:
- interp_lb_splits[][3]: (block_id, r_left, r_right) for each hotspot block
- interp_lb_remaps[][2]: rank remapping for the displaced neighbor blocks
Pass 2: optimized build (make INTERP_LB_MODE=optimize ABE)
The build injects -DINTERP_LB_OPTIMIZE; the profile data is baked into
the executable as static const arrays (zero runtime overhead), and
distribute_optimize applies the splits and remappings directly at
block-creation time.
Changes:
- makefile.inc: add the INTERP_LB_MODE variable (off/profile/optimize)
and the corresponding INTERP_LB_FLAGS preprocessor definitions
- makefile: add $(INTERP_LB_FLAGS) to CXXAPPFLAGS; add the
interp_lb_profile.o build target
- gen_interp_lb_header.py: script that converts profile.bin into
interp_lb_profile_data.h
- interp_lb_profile_data.h: auto-generated compile-time constant header
- interp_lb_profile.bin: binary data produced by the profiling pass
- AMSS_NCKU_Program.py: automatically copy profile.bin into the run
directory at build time
- makefile_and_run.py: switch the default build command to
INTERP_LB_MODE=optimize
Generality:
The flow depends on no hardcoded rank numbers or test-case parameters.
For a different grid configuration, process count, or physical problem,
rerunning Pass 1 to collect a profile automatically produces the
corresponding optimization. This matches the PGO philosophy exactly:
profile first, then optimize, a general-purpose performance methodology.
2026-02-27 15:10:22 +08:00
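The hotspot-selection step of Pass 1 can be sketched in plain Python. The real code gathers per-rank MPI_Wtime readings with MPI_Gather; here that is replaced by a plain list of per-rank seconds, and the 2.5x threshold is the one named in the commit message (identify_heavy_ranks is the interface name from interp_lb_profile.h; the rest is illustrative):

```python
def identify_heavy_ranks(times, threshold_ratio=2.5):
    """Return ranks whose interpolation time exceeds threshold_ratio
    times the mean, mirroring the Pass 1 selection described above."""
    mean = sum(times) / len(times)
    return [rank for rank, t in enumerate(times) if t > threshold_ratio * mean]

# 64 ranks, four of them heavy (the shape of the profiled run)
times = [1.0] * 64
for r in (27, 28, 35, 36):
    times[r] = 3.0
print(identify_heavy_ranks(times))  # -> [27, 28, 35, 36]
```

With a mean of 1.125 s the threshold is 2.8125 s, so exactly the four 3.0 s ranks are flagged.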
jaunatisblue
6b2464b80c
Interp_Points load balancing: hotspot block splitting and rank remapping
...
Problem:
Patch::Interp_Points suffers severe MPI load imbalance during spherical
interpolation. MPI_Wtime instrumentation shows that, out of 64
processes, ranks 27/28/35/36 carry the vast majority of the
interpolation work (2.6-3.3x the mean time), leaving the other 60
processes idling at MPI collectives and making this the overall
performance bottleneck.
Root cause:
The blocks owned by these four ranks happen to cover the dense
interpolation-point region of the extraction sphere, while the
distribute function assigns blocks to ranks by uniform grid volume,
ignoring the uneven spatial distribution of interpolation points.
Optimization:
1. Add a distribute_optimize function to replace distribute, using a
separate current_block_id counter (decoupled from rank assignment)
to iterate over all blocks.
2. Hotspot block splitting (splitHotspotBlock):
bisect blocks 27/28/35/36 at the midpoint along the x axis into left
and right sub-blocks, each assigned to one of two adjacent ranks:
- block 27 → (rank 26, rank 27)
- block 28 → (rank 28, rank 29)
- block 35 → (rank 34, rank 35)
- block 36 → (rank 36, rank 37)
The sub-blocks exactly replicate the original distribute's ghost-zone
expansion and physical-coordinate logic (supporting both Vertex and
Cell grid modes).
3. Neighbor rank remapping (createMappedBlock):
neighbor blocks whose ranks were taken over must give them up and are
remapped to adjacent free ranks:
- block 26 → rank 25
- block 29 → rank 30
- block 34 → rank 33
- block 37 → rank 38
All other blocks keep the original block_id == rank mapping.
4. compose_cgh in cgh.C switches between distribute_optimize and the
original distribute via a preprocessor macro.
5. Add profiling instrumentation in MPatch.C: overload 2 of
Interp_Points times itself with MPI_Wtime, aggregates per-rank
timings with MPI_Gather, identifies hotspot ranks, and writes a
binary profile file.
6. Add interp_lb_profile.h/C: define the profile file format (magic,
version, nprocs, threshold_ratio, heavy_ranks) and provide the
write_profile/read_profile/identify_heavy_ranks interface.
Mathematical equivalence: splitting and remapping only change block
geometry and rank ownership; no physical equations, difference schemes,
or interpolation algorithms are modified, so results are strictly
identical.
2026-02-27 15:07:40 +08:00
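The two geometric moves above, bisecting a hot block at the x midpoint and remapping displaced neighbors, can be sketched as follows. The split function is a simplified stand-in for splitHotspotBlock (ghost-zone expansion and Vertex/Cell handling are omitted); the remap table is the one from the commit message:

```python
def split_block_x(xlo, xhi, rank_left, rank_right):
    """Bisect a block at the x midpoint into two sub-blocks,
    each tagged with its target rank (simplified sketch)."""
    xmid = 0.5 * (xlo + xhi)
    return [((xlo, xmid), rank_left), ((xmid, xhi), rank_right)]

# Displaced neighbors yield their rank; everyone else keeps block_id == rank.
REMAP = {26: 25, 29: 30, 34: 33, 37: 38}

def target_rank(block_id):
    return REMAP.get(block_id, block_id)

# block 27 split between ranks 26 and 27 (coordinates are illustrative)
print(split_block_x(-4.0, 4.0, 26, 27))  # -> [((-4.0, 0.0), 26), ((0.0, 4.0), 27)]
print(target_rank(26), target_rank(30))  # -> 25 30
```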
9c33e16571
Add PGO files for the C kernels
2026-02-27 11:30:36 +08:00
45b7a43576
Reconcile the mathematical differences between the C and Fortran kernels
2026-02-26 15:48:11 +08:00
dfb79e3e11
Initialize output arrays to zero in fdderivs_c.C and fderivs_c.C
2026-02-26 14:18:31 +08:00
d2c2214fa1
Add TwoPunctureABE-specific PGO instrumentation files
2026-02-25 23:06:17 +08:00
e157ea3a23
Merge chb-replace: C++ kernels replace Fortran bssn_rhs; add fallback switch and separate PGO profdata
...
- Merge the chb-replace branch, introducing five C++ kernel
implementations: bssn_rhs_c.C / fderivs_c.C / fdderivs_c.C /
kodiss_c.C / lopsided_c.C
- Add a USE_CXX_KERNELS switch (default 1); set it to 0 to fall back to
the original Fortran bssn_rhs.o
- TwoPunctureABE now uses its own TwoPunctureABE.profdata instead of
default.profdata
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 22:50:46 +08:00
f5a63f1e42
Revert "Fix timing: replace clock() with MPI_Wtime() for wall-clock measurement"
...
This reverts commit 09b937c022.
2026-02-25 22:21:43 +08:00
284ab80baf
Remove OpenMP from C rewrite kernel
...
The C rewrite introduced OpenMP parallelism. Remove all OpenMP.
2026-02-25 22:21:20 +08:00
copilot-swe-agent[bot]
09b937c022
Fix timing: replace clock() with MPI_Wtime() for wall-clock measurement
...
clock() measures total CPU time across all threads, not wall-clock
time. With the new OpenMP parallel regions in bssn_rhs_c.C, clock()
sums CPU time from all OpenMP threads, producing inflated timing that
scales with thread count rather than reflecting actual elapsed time.
MPI_Wtime() returns wall-clock seconds, giving accurate timing
regardless of the number of OpenMP threads running inside the
measured interval.
Co-authored-by: ianchb <i@4t.pw>
2026-02-25 22:21:19 +08:00
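The CPU-time vs wall-clock distinction the commit describes has a direct Python analogue: time.process_time() behaves like clock() (CPU time of the process), while time.perf_counter() behaves like MPI_Wtime() (wall clock). The thread-summing inflation itself needs real parallelism, but the divergence of the two clocks already shows up over an interval with no CPU work:

```python
import time

# process_time(): CPU time only, the analogue of clock()
# perf_counter(): wall-clock time, the analogue of MPI_Wtime()
cpu0, wall0 = time.process_time(), time.perf_counter()
time.sleep(0.2)  # 0.2 s of real time, essentially no CPU work
cpu1, wall1 = time.process_time(), time.perf_counter()

print(f"cpu  elapsed: {cpu1 - cpu0:.3f} s")   # near zero
print(f"wall elapsed: {wall1 - wall0:.3f} s") # about 0.2 s
```

With OpenMP threads the mismatch runs the other way: CPU time is summed over all threads, so clock()-based timing grows with the thread count while wall-clock time does not.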
wingrew
8a9c775705
Replace Fortran bssn_rhs with C implementation and add C helper kernels
...
- Modify bssn_rhs_c.C to use existing project headers (macrodef.h, bssn_rhs.h)
- Update makefile: remove bssn_rhs.o from F90FILES, add CFILES with OpenMP
- Keep Fortran helper files (diff_new.f90, kodiss.f90, lopsidediff.f90) for other Fortran callers
[copilot: fix compiling errors & a nan error]
Co-authored-by: ianchb <i@4t.pw>
Co-authored-by: copilot-swe-agent[bot] <198982749+copilot@users.noreply.github.com>
2026-02-25 22:21:19 +08:00
d942122043
Update PGO files
2026-02-25 18:25:20 +08:00
a5c713a7e0
Improve the PGO mechanism
2026-02-25 17:22:56 +08:00
9e6b25163a
Update PGO profdata and add a PGO_MODE switch for instrumented ABE builds
...
- Update pgo_profile/default.profdata to the latest collected profile data
- Back up the old profdata as default.profdata.backup2
- makefile: add a PGO_MODE switch (default opt); make PGO_MODE=instrument
switches to the Phase 1 instrumentation mode to re-collect data,
with no manual flag editing needed
- makefile: TwoPunctureABE uses its own TP_OPTFLAGS, unaffected by PGO_MODE
- makefile: change the PROFDATA path to /home/$(shell whoami)/AMSS-NCKU/pgo_profile/default.profdata
- makefile.inc: remove the hardcoded compile flags, now managed by the
ifeq logic in the makefile
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 17:00:55 +08:00
CGH0S7
efc8bf29ea
Invalidate sync caches on demand: Regrid_Onelevel now returns bool
...
Change the return type of cgh::Regrid_Onelevel from void to bool,
returning true when the grid actually moves and false otherwise.
Callers invalidate sync_cache_* only on a true return, avoiding the
redundant cost of unconditionally invalidating the caches of all
levels after every RecursiveStep.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-02-25 16:00:26 +08:00
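The caller-side pattern can be sketched in a few lines; a minimal sketch with illustrative names, standing in for the real sync_cache_* invalidation:

```python
class Level:
    """Stand-in for one refinement level with a sync cache flag."""
    def __init__(self):
        self.sync_cache_valid = True

def after_recursive_step(grid_moved, levels):
    """Invalidate the per-level sync caches only when Regrid_Onelevel
    reported a real grid move (the bool return described above)."""
    if grid_moved:
        for lv in levels:
            lv.sync_cache_valid = False

levels = [Level() for _ in range(3)]
after_recursive_step(False, levels)  # grid did not move: caches survive
print(all(lv.sync_cache_valid for lv in levels))  # -> True
```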
CGH0S7
ccf6adaf75
Provide the correct macrodef.h to keep the LLM from being misled
2026-02-25 11:47:14 +08:00
CGH0S7
e2bc472845
Optimize core-pinning logic: replace hardcoding with automatic detection
2026-02-25 10:59:32 +08:00
e6329b013d
Merge branch 'cjy-oneapi-opus-hotfix'
2026-02-20 14:18:33 +08:00
82339f5282
Merge lopsided advection + kodis dissipation to share symmetry_bd buffer
...
Cherry-picked from 38c2c30.
2026-02-20 13:36:27 +08:00
94f38c57f9
Don't hardcode pgo profile path
2026-02-20 13:36:27 +08:00
85d1e8de87
Add Intel SIMD vectorization directives to hot-spot functions
...
Apply Intel Advisor optimization recommendations:
- Add FORCEINLINE to polint for better inlining
- Add SIMD VECTORLENGTHFOR and UNROLL directives to fderivs,
fdderivs, symmetry_bd, and kodis functions
This improves vectorization efficiency of finite difference
computations.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 00:43:39 +08:00
2791d2e225
Merge pull request 'PGO updated' (#1) from cjy-oneapi-opus-hotfix into main
...
Reviewed-on: #1
2026-02-11 19:17:35 +08:00
72ce153e48
Merge cjy-oneapi-opus-hotfix into main
2026-02-11 19:15:12 +08:00
5b7e05cd32
PGO updated
2026-02-11 18:26:30 +08:00
85afe00fc5
Merge plotting optimizations from chb-copilot-test
...
- Implement multiprocessing-based parallel plotting
- Add parallel_plot_helper.py for concurrent plot task execution
- Use matplotlib 'Agg' backend for multiprocessing safety
- Set OMP_NUM_THREADS=1 to prevent BLAS thread explosion
- Use subprocess for binary data plots to avoid thread conflicts
- Add fork bomb protection in main program
This merge only includes plotting improvements and excludes
MPI communication changes to preserve existing optimizations.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-11 16:19:17 +08:00
5c1790277b
Replace nested OutBdLow2Hi loops with batch calls in RestrictProlong
...
The 8 nested while(Ppc){while(Pp){OutBdLow2Hi(single,single,...)}}
loops across RestrictProlong (3 overloads) and ProlongRestrict each
produced N_c × N_f separate transfer() → MPI_Waitall barriers.
Replace with the existing batch OutBdLow2Hi(MyList<Patch>*,...) which
merges all patch pairs into a single transfer() call with 1 MPI_Waitall.
Also add Restrict_cached, OutBdLow2Hi_cached, OutBdLow2Himix_cached
to Parallel (unused for now — kept as infrastructure for future use).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-11 16:09:08 +08:00
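The barrier-count arithmetic behind this change is easy to see in a sketch: a stand-in transfer() counts its own invocations, each of which represents one MPI_Waitall in the real code (patch counts are illustrative):

```python
def transfer(pairs):
    """Stand-in for one transfer() call ending in a single MPI_Waitall."""
    transfer.calls += 1
transfer.calls = 0

coarse, fine = range(4), range(2)  # N_c = 4, N_f = 2 patches

# Before: one call per (coarse, fine) pair -> N_c * N_f waits
for pc in coarse:
    for pf in fine:
        transfer([(pc, pf)])
before = transfer.calls

# After: one batch call merging all pairs -> 1 wait
transfer.calls = 0
transfer([(pc, pf) for pc in coarse for pf in fine])
after = transfer.calls

print(before, after)  # -> 8 1
```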
e09ae438a2
Cache data_packer lengths in Sync_start to skip redundant buffer-size traversals
...
The data_packer(NULL, ...) calls that compute send/recv buffer lengths
traverse all grid segments × variables × nprocs on every Sync_start
invocation, even though lengths never change once the cache is built.
Add a lengths_valid flag to SyncCache so these length computations are
done once and reused on subsequent calls (4× per RK4 step).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-10 21:39:22 +08:00
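The lengths_valid pattern is a compute-once cache; a minimal sketch, where the real traversal over grid segments x variables x nprocs is replaced by a trivial stand-in computation:

```python
class SyncCache:
    """Sketch of the lengths_valid flag from the commit above:
    compute buffer lengths once, reuse on every later Sync_start."""
    def __init__(self):
        self.lengths_valid = False
        self.lengths = None
        self.computations = 0  # counts the expensive traversals

    def get_lengths(self, segments):
        if not self.lengths_valid:
            self.computations += 1            # the costly full traversal
            self.lengths = [len(s) for s in segments]
            self.lengths_valid = True
        return self.lengths                   # cheap reuse thereafter

cache = SyncCache()
segs = ["abc", "de"]
for _ in range(4):  # 4 Sync_start invocations per RK4 step
    cache.get_lengths(segs)
print(cache.computations)  # -> 1
```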
d06d5b4db8
Add targeted point-to-point Interp_Points overload for surface_integral
...
Instead of broadcasting all interpolated point data to every MPI rank,
the new overload sends each point only to the one rank that needs it
for integration, reducing communication volume by ~nprocs times.
The consumer rank is computed deterministically using the same Nmin/Nmax
work distribution formula used by surface_integral callers. Two active
call sites (surf_Wave and surf_MassPAng with MPI_COMM_WORLD) now use
the new overload. Other callers (ShellPatch, Comm_here variants, etc.)
remain unchanged.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-10 19:18:56 +08:00
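Computing the consumer rank deterministically requires both sender and receiver to evaluate the same work-distribution formula. The actual Nmin/Nmax formula lives in the surface_integral callers and is not reproduced in the commit message; the sketch below substitutes a common even-split formula purely as an illustrative stand-in:

```python
def consumer_rank(i, npoints, nprocs):
    """Hypothetical even work split: rank r owns points [Nmin, Nmax).
    Stand-in for the real Nmin/Nmax formula in surface_integral."""
    for r in range(nprocs):
        nmin = r * npoints // nprocs
        nmax = (r + 1) * npoints // nprocs
        if nmin <= i < nmax:
            return r

# Each point maps to exactly one consumer, so a sender can target that
# one rank instead of broadcasting to all nprocs ranks.
print([consumer_rank(i, 10, 4) for i in range(10)])  # -> [0, 0, 1, 1, 1, 2, 2, 3, 3, 3]
```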