b185f84cce
Add switchable C RK4 kernel and build toggle
...
(cherry picked from commit b91cfff301)
2026-03-02 11:53:00 +08:00
d94c31c5c4
[WIP] Implement multi-GPU support in BSSN RHS and add profiling for H2D/D2H transfers
2026-02-28 11:12:14 +08:00
724e9cd415
[WIP] Add CUDA support for BSSN RHS with new kernel and update makefiles
2026-02-28 11:12:13 +08:00
3cee05f262
Merge branch 'cjy-oneapi-opus-hotfix'
2026-02-27 15:13:40 +08:00
e0b5e012df
Introduce a PGO-style two-pass build flow to make the Interp_Points load-balancing optimization rule-compliant
...
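Background:
The hot-block splitting and rank remapping that a colleague implemented in the
previous commit gave a significant speedup, but it hardcoded the heavy ranks
(27/28/35/36) and the remapping table. That makes it an optimization targeted
at one specific test case, which violates rule 6 of the competition rules
(no optimizations specialized for particular parameters or test cases).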
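Goal of this commit:
Borrow the idea behind PGO (Profile-Guided Optimization) to turn the
case-specific optimization above into a general, automated two-pass flow
that applies to any test case and therefore complies with the competition rules.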
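Two-pass flow:
Pass 1: profile collection (make INTERP_LB_MODE=profile ABE)
Compile with -DINTERP_LB_PROFILE. On its first call, Interp_Points in MPatch.C
times itself with MPI_Wtime and collects the per-rank times with MPI_Gather,
identifies hot ranks whose time exceeds 2.5x the mean, and writes interp_lb_profile.bin.
Intermediate step: generate the compile-time header
python3 gen_interp_lb_header.py reads profile.bin, automatically computes
the splitting strategy and the remapping table, and generates interp_lb_profile_data.h, which contains:
- interp_lb_splits[][3]: (block_id, r_left, r_right) for each hot block
- interp_lb_remaps[][2]: rank remapping for the neighbor blocks that get displaced
Pass 2: optimized build (make INTERP_LB_MODE=optimize ABE)
Compile with -DINTERP_LB_OPTIMIZE. The profile data is baked into the
executable as static const arrays (zero runtime overhead), and distribute_optimize
applies the splits and remaps directly during block creation.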
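The hot-rank test in Pass 1 can be sketched as follows. This is a minimal illustration in Python, not the C++/MPI code in MPatch.C: `find_hot_ranks` and its parameters are hypothetical names, and the list `times` stands in for the per-rank timings that MPI_Gather would deliver to the root rank.

```python
def find_hot_ranks(times, factor=2.5):
    """Return the ranks whose time exceeds `factor` times the mean time.

    `times[i]` is the wall time measured on rank i (hypothetical input;
    the real code gathers these values with MPI_Gather).
    """
    if not times:
        return []
    mean = sum(times) / len(times)
    return [rank for rank, t in enumerate(times) if t > factor * mean]


if __name__ == "__main__":
    # Made-up timings for 8 ranks; rank 5 is far above the others.
    sample = [1.0, 1.1, 0.9, 1.0, 1.2, 9.0, 1.0, 0.8]
    print(find_hot_ranks(sample))
```

Note that a hot rank also raises the mean it is compared against, so with very few ranks a single outlier may need to be quite large before it crosses the 2.5x threshold.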
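Changes:
- makefile.inc: add the INTERP_LB_MODE variable (off/profile/optimize)
and the corresponding INTERP_LB_FLAGS preprocessor macro definitions
- makefile: add $(INTERP_LB_FLAGS) to CXXAPPFLAGS and add the
interp_lb_profile.o build target
- gen_interp_lb_header.py: script that automatically converts profile.bin
into interp_lb_profile_data.h
- interp_lb_profile_data.h: auto-generated header of compile-time constants
- interp_lb_profile.bin: binary data produced by the profile collection pass
- AMSS_NCKU_Program.py: automatically copy profile.bin to the run directory at build time
- makefile_and_run.py: switch the default build command to INTERP_LB_MODE=optimize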
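As a rough sketch of the header-generation step, the function below renders two tables as static const C arrays. It does not reproduce gen_interp_lb_header.py: the binary layout of profile.bin is not described here, so the tables are passed in directly, and `render_header`, the `int` element type, and the `*_count` constants are all assumptions made for illustration.

```python
def render_header(splits, remaps):
    """Render C source text declaring both tables as static const arrays.

    splits: list of (block_id, r_left, r_right) tuples
    remaps: list of 2-int rows (column meaning is not specified here)
    The `int` type and the *_count names are assumptions.
    """
    lines = ["/* Auto-generated; do not edit by hand. */", "#pragma once", ""]
    tables = (("interp_lb_splits", 3, splits), ("interp_lb_remaps", 2, remaps))
    for name, width, table in tables:
        for row in table:
            if len(row) != width:
                raise ValueError(f"{name}: expected {width} values per row")
        lines.append(f"static const int {name}_count = {len(table)};")
        # An empty initializer is not valid for an unsized array, so an
        # empty table gets one placeholder row; *_count stays 0.
        body = table or [(-1,) * width]
        lines.append(f"static const int {name}[][{width}] = {{")
        lines.append(",\n".join(
            "    {" + ", ".join(str(int(v)) for v in row) + "}" for row in body))
        lines.append("};")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up values, not taken from any real profile.
    print(render_header([(3, 10, 11)], [(4, 12)]))
```

Emitting an explicit count next to each array lets the consuming code loop over the table without relying on `sizeof`, which matters when the placeholder row is present.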
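Generality:
The whole flow does not depend on any hardcoded rank numbers or test-case
parameters. For a different grid configuration, process count, or physical
problem, rerunning Pass 1 to collect a profile is enough to generate the
matching optimization plan automatically. This is the same idea as PGO:
profile first, then optimize, which is a general-purpose performance optimization methodology.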
2026-02-27 15:10:22 +08:00
e157ea3a23
Merge chb-replace: replace Fortran bssn_rhs with C++ kernels, add a fallback switch and a separate PGO profdata
...
- Merge the chb-replace branch, bringing in five C++ kernel implementations:
bssn_rhs_c.C / fderivs_c.C / fdderivs_c.C / kodiss_c.C / lopsided_c.C
- Add the USE_CXX_KERNELS switch (default 1); set it to 0 to fall back to the original Fortran bssn_rhs.o
- TwoPunctureABE now uses its own TwoPunctureABE.profdata instead of default.profdata
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 22:50:46 +08:00
wingrew
8a9c775705
Replace Fortran bssn_rhs with C implementation and add C helper kernels
...
- Modify bssn_rhs_c.C to use existing project headers (macrodef.h, bssn_rhs.h)
- Update makefile: remove bssn_rhs.o from F90FILES, add CFILES with OpenMP
- Keep Fortran helper files (diff_new.f90, kodiss.f90, lopsidediff.f90) for other Fortran callers
[copilot: fix compile errors and a NaN error]
Co-authored-by: ianchb <i@4t.pw>
Co-authored-by: copilot-swe-agent[bot] <198982749+copilot@users.noreply.github.com>
2026-02-25 22:21:19 +08:00
9e6b25163a
Update PGO profdata and add a PGO_MODE switch for the instrumented ABE build
...
- Update pgo_profile/default.profdata with the most recently collected profile data
- Back up the old profdata to default.profdata.backup2
- makefile: add the PGO_MODE switch (default opt); make PGO_MODE=instrument
switches to Phase 1 instrumentation mode to re-collect data, with no need to edit flags by hand
- makefile: TwoPunctureABE uses its own TP_OPTFLAGS and is not affected by PGO_MODE
- makefile: change the PROFDATA path to /home/$(shell whoami)/AMSS-NCKU/pgo_profile/default.profdata
- makefile.inc: remove the hardcoded compile flags; they are now managed by the ifeq logic in the makefile
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 17:00:55 +08:00
e6329b013d
Merge branch 'cjy-oneapi-opus-hotfix'
2026-02-20 14:18:33 +08:00
94f38c57f9
Don't hardcode pgo profile path
2026-02-20 13:36:27 +08:00
72ce153e48
Merge cjy-oneapi-opus-hotfix into main
2026-02-11 19:15:12 +08:00
471baa5065
PGO supported
2026-02-09 10:59:26 +08:00
b8e41b2b39
Only enable OpenMP for TwoPunctures
2026-02-08 13:00:37 +08:00
f345b0e520
Performance optimization for the TwoPunctures module
...
* Re-enabled OpenMP.
1. Batch spectral derivatives (Chebyshev & Fourier) via precomputed matrices:
Chebyshev/Fourier transforms and derivatives are precomputed as explicit physical-space operator matrices.
A batched DGEMM then applies them to entire tensor fields at once; the result is mathematically identical to the original per-line transforms but much faster.
2. Gauss-Seidel relaxation & tridiagonal solver workspace reuse:
Per-thread reusable workspaces replace per-call heap new/delete in all tridiagonal and relaxation routines.
3. Efficient OpenMP multithreading throughout relaxation/deriv:
relax_omp and friends parallelize over grouped lines/planes, maximizing threading efficiency and memory independence.
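Item 1 can be illustrated with a small sketch: build a differentiation matrix once, then differentiate many lines with a single matrix product. This is only an illustration under stated assumptions. It uses Chebyshev-Gauss-Lobatto nodes, which may differ from the grid TwoPunctures actually uses, and a plain-Python `matmul` stands in for DGEMM; `cheb_diff_matrix` and `matmul` are hypothetical names.

```python
import math


def cheb_diff_matrix(n):
    """Return (nodes, D) for Chebyshev-Gauss-Lobatto collocation, n+1 nodes.

    D maps function values at the nodes to derivative values at the nodes.
    """
    x = [math.cos(math.pi * j / n) for j in range(n + 1)]
    c = [(2.0 if j in (0, n) else 1.0) * (-1.0) ** j for j in range(n + 1)]
    d = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(n + 1):
            if i != j:
                d[i][j] = (c[i] / c[j]) / (x[i] - x[j])
        # Diagonal from the negative row sum: the derivative of a
        # constant must vanish, so every row sums to zero.
        d[i][i] = -sum(d[i])
    return x, d


def matmul(a, b):
    """Plain-Python matrix product; a BLAS DGEMM would do this step."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


if __name__ == "__main__":
    x, D = cheb_diff_matrix(8)
    # Two "lines" stored as columns: f(x) = x^2 and g(x) = x^3.
    F = [[xi ** 2, xi ** 3] for xi in x]
    dF = matmul(D, F)  # one product differentiates every column
    err = max(max(abs(dF[i][0] - 2 * x[i]), abs(dF[i][1] - 3 * x[i] ** 2))
              for i in range(len(x)))
    print("max error:", err)
```

Because the operator is a fixed matrix, the per-line work collapses into one dense product over all columns, which is the shape of problem DGEMM is built for.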
Co-authored-by: copilot-swe-agent[bot] <198982749+copilot@users.noreply.github.com>
2026-02-07 14:48:47 +08:00
6738854a9d
Compiler-level and hot-path optimizations for GW150914
...
- makefile.inc: add -ipo (interprocedural optimization) and
-align array64byte (64-byte array alignment for vectorization)
- fmisc.f90: remove redundant funcc=0.d0 zeroing from symmetry_bd,
symmetry_tbd, symmetry_stbd (~328+ full-array memsets eliminated
per timestep)
- enforce_algebra.f90: rewrite enforce_ag and enforce_ga as point-wise
loops, replacing 12 stack-allocated 3D temporary arrays with scalar
locals for better cache locality
All changes are mathematically equivalent; there are no algorithmic modifications.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 17:13:39 +08:00
CGH0S7
79af79d471
baseline updated
2026-02-05 19:53:55 +08:00
26c81d8e81
makefile updated
2026-01-19 23:53:16 +08:00
CGH0S7
9deeda9831
Refactor verification method and optimize numerical kernels with oneMKL BLAS
...
This commit transitions the verification approach from post-Newtonian theory
comparison to regression testing against baseline simulations, and optimizes
critical numerical kernels using Intel oneMKL BLAS routines.
Verification Changes:
- Replace PN theory-based RMS calculation with trajectory-based comparison
- Compare optimized results against baseline (GW150914-origin) on XY plane
- Compute RMS independently for BH1 and BH2, report maximum as final metric
- Update documentation to reflect new regression test methodology
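A minimal sketch of the regression metric described above. The exact RMS definition is not spelled out in this message, so the code assumes a plain RMS of the XY-plane distance between baseline and optimized trajectories sampled at the same time steps; `trajectory_rms` and `regression_metric` are hypothetical names.

```python
import math


def trajectory_rms(baseline, candidate):
    """RMS of the XY-plane distance between two equally sampled trajectories.

    Each argument is a list of (x, y) points taken at the same time steps.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("trajectories must be non-empty and equally long")
    sq = [(bx - cx) ** 2 + (by - cy) ** 2
          for (bx, by), (cx, cy) in zip(baseline, candidate)]
    return math.sqrt(sum(sq) / len(sq))


def regression_metric(base_bh1, cand_bh1, base_bh2, cand_bh2):
    """Final metric: the larger of the two per-black-hole RMS values."""
    return max(trajectory_rms(base_bh1, cand_bh1),
               trajectory_rms(base_bh2, cand_bh2))


if __name__ == "__main__":
    # Made-up points: BH1 matches the baseline, BH2 is offset by (3, 4).
    bh1 = [(0.0, 0.0), (1.0, 0.0)]
    bh2_base = [(0.0, 0.0), (0.0, 0.0)]
    bh2_cand = [(3.0, 4.0), (3.0, 4.0)]
    print(regression_metric(bh1, bh1, bh2_base, bh2_cand))
```

Reporting the maximum means a regression in either black hole's trajectory is enough to fail the check.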
Performance Optimizations:
- Replace manual vector operations with oneMKL BLAS routines:
* norm2() and scalarproduct() now use cblas_dnrm2/cblas_ddot (C++)
* L2 norm calculations use DDOT for dot products (Fortran)
* Interpolation weighted sums use DDOT (Fortran)
- Disable OpenMP threading (switch to sequential MKL) for better performance
Build Configuration:
- Switch from lmkl_intel_thread to lmkl_sequential
- Remove -qopenmp flags from compiler options
- Maintain aggressive optimization flags (-O3, -xHost, -fp-model fast=2, -fma)
Other Changes:
- Update .gitignore for GW150914-origin, docs, and temporary files
2026-01-18 14:25:21 +08:00
CGH0S7
3a7bce3af2
Update Intel oneAPI configuration and CPU binding settings
...
- Update makefile.inc with Intel oneAPI compiler flags and oneMKL linking
- Configure taskset CPU binding to use nohz_full cores (4-55, 60-111)
- Set build parallelism to 104 jobs for faster compilation
- Update MPI process count to 48 in input configuration
2026-01-17 20:41:02 +08:00
CGH0S7
cb252f5ea2
Optimize numerical algorithms with Intel oneMKL
...
- FFT.f90: Replace hand-written Cooley-Tukey FFT with oneMKL DFTI
- ilucg.f90: Replace manual dot product loop with BLAS DDOT
- gaussj.C: Replace Gauss-Jordan elimination with LAPACK dgesv/dgetri
- makefile.inc: Add MKL include paths and library linking
All optimizations maintain mathematical equivalence and numerical precision.
2026-01-16 10:58:11 +08:00
CGH0S7
57a7376044
Switch compiler toolchain from GCC to Intel oneAPI
...
- makefile.inc: Replace GCC compilers with Intel oneAPI
- C/C++: gcc/g++ -> icx/icpx
- Fortran: gfortran -> ifx
- MPI linker: mpic++ -> mpiicpx
- Update LDLIBS and compiler flags accordingly
- macrodef.h: Fix include path (microdef.fh -> macrodef.fh)
Requires: source /home/intel/oneapi/setvars.sh before build
2026-01-15 16:32:12 +08:00
cd5ceaa15f
main branch updated
2026-01-14 08:55:53 +08:00
f2fc9af70e
asc26 amss-ncku initialized
2026-01-13 15:01:15 +08:00