Commit Graph

  • 8c1f4d8108 Migrate loop fusion and temporary elimination from the C kernel main CGH0S7 2026-03-03 15:57:10 +08:00
  • d310ef918b bssn_rhs(fortran): migrate C kernel loop-fusion optimizations CGH0S7 2026-03-03 15:41:26 +08:00
  • b35e1b289f Add a switch to disable memory-statistics printing CGH0S7 2026-03-03 15:15:06 +08:00
  • 05851b2c59 Disable static load balancing CGH0S7 2026-03-03 12:36:19 +08:00
  • 3b39583d67 fix(bssn_rhs) ianchb 2026-03-03 16:00:45 +08:00
  • 9c44d1c885 fix(bssn_rhs) chb-parallel ianchb 2026-03-03 16:00:45 +08:00
  • f1fe9fd443 Migrate loop fusion and temporary elimination from the C kernel cjy-dystopia CGH0S7 2026-03-03 15:57:10 +08:00
  • 7bb9042b18 bssn_rhs(fortran): migrate C kernel loop-fusion optimizations CGH0S7 2026-03-03 15:41:26 +08:00
  • 9991b7f41e Disable the C-rewritten kernels CGH0S7 2026-03-03 15:28:09 +08:00
  • 57abf12bbd Fix C derivative kernels to match Fortran ghost_width=3 stencil gating CGH0S7 2026-03-03 15:22:01 +08:00
  • 51efc47c1b Add a switch to disable memory-statistics printing CGH0S7 2026-03-03 15:15:06 +08:00
  • 4b9de28feb Make the coarse-level Sync_cached in the Restrict/Prolong path optional (skipped by default) ianchb 2026-03-03 14:25:27 +08:00
  • 4eb5dc4ddb Remove a duplicated first-derivative computation of chi ianchb 2026-03-02 20:16:16 +08:00
  • 234c4f7344 Disable static load balancing CGH0S7 2026-03-03 12:36:19 +08:00
  • 688bdb6708 Merge pull request 'cjy-dystopia' (#3) from cjy-dystopia into main gh0s7 2026-03-02 21:36:26 +08:00
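  • 12e1f63d50 prolong3: reduce redundant computation in the Z-pass yx-prolong jaunatisblue 2026-03-02 21:20:49 +08:00
  • 5070134857 perf(transfer_cached): move the per-call new/delete'd req_node/req_is_recv/completed arrays into SyncCache for reuse CGH0S7 2026-03-02 21:14:35 +08:00
  • 4012e9d068 perf(RestrictProlong): replace the non-cached versions with Restrict_cached/OutBdLow2Hi_cached; Sync_finish now unpacks progressively CGH0S7 2026-03-02 20:48:38 +08:00
  • 43975017eb prolong3 now computes the actual stencil window first; it falls back to the full-domain symmetry_bd only when the window touches a symmetry boundary, and otherwise copies only the required window. restrict3 likewise uses a window check and, when no boundary is touched, fills only the required ii/jj/kk window. chb-rebase-wip ianchb 2026-03-02 17:17:16 +08:00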
  • 485667ef4c perf(restrict3): shrink X-pass ii sweep to required overlap window - compute fi_min/fi_max from output i-range and derive ii_lo/ii_hi - replace full ii sweep (-1:extf(1)) with windowed sweep in Z/Y precompute passes - keep stencil math unchanged; add bounds sanity check for ii window ianchb 2026-03-02 16:08:13 +08:00
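  • 2a977ce82e perf(MPatch): use a spatial bin index to speed up block-ownership lookup in Interp_Points ianchb 2026-03-02 15:54:37 +08:00
  • b3c367f15b prolong3 now computes the actual stencil window first; it falls back to the full-domain symmetry_bd only when the window touches a symmetry boundary, and otherwise copies only the required window. restrict3 likewise uses a window check and, when no boundary is touched, fills only the required ii/jj/kk window. ianchb 2026-03-02 17:17:16 +08:00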
  • e73911f292 perf(restrict3): shrink X-pass ii sweep to required overlap window - compute fi_min/fi_max from output i-range and derive ii_lo/ii_hi - replace full ii sweep (-1:extf(1)) with windowed sweep in Z/Y precompute passes - keep stencil math unchanged; add bounds sanity check for ii window ianchb 2026-03-02 16:08:13 +08:00
  • 7543d3e8c7 perf(MPatch): use a spatial bin index to speed up block-ownership lookup in Interp_Points ianchb 2026-03-02 15:54:37 +08:00
  • 42c69fab24 refactor(Parallel): streamline MPI communication by consolidating request handling and memory management ianchb 2026-03-01 16:20:51 +08:00
  • 95220a05c8 optimize fdderivs core-region branch elimination for ghost_width=3 CGH0S7 2026-03-02 17:33:26 +08:00
  • 160e2a0369 fix prolong/restrict index bounds after cherry-pick 12e1f63 CGH0S7 2026-03-02 13:59:47 +08:00
  • 01410de05a refactor(Parallel): streamline MPI communication by consolidating request handling and memory management ianchb 2026-03-01 16:20:51 +08:00
  • 83c826eb49 prolong3: reduce redundant computation in the Z-pass jaunatisblue 2026-03-02 21:20:49 +08:00
  • 466b084a58 fix prolong/restrict index bounds after cherry-pick 12e1f63 CGH0S7 2026-03-02 13:59:47 +08:00
  • 61ccef9f97 prolong3: reduce redundant computation in the Z-pass jaunatisblue 2026-03-02 21:20:49 +08:00
  • 43ddaab903 fix: add C RK4 kernel to CFILES_CUDA ianchb 2026-03-02 12:19:52 +08:00
  • 5839755c2f compute div_beta on-the-fly to remove temp array ianchb 2026-03-02 12:12:58 +08:00
  • a893b4007c merge lopsided+kodis ianchb 2026-03-02 12:12:26 +08:00
  • ad5ff03615 build: switch allocator option to oneTBB tbbmalloc CGH0S7 2026-02-28 17:16:00 +08:00
  • b4bc0ef269 Disable core pinning for now; benchmarks showed unpinned + SCX is faster than pinned + SCX CGH0S7 2026-02-28 23:27:44 +08:00
  • b185f84cce Add switchable C RK4 kernel and build toggle CGH0S7 2026-02-28 21:12:19 +08:00
  • 71f6eb7b44 Remove profiling code ianchb 2026-03-02 11:29:48 +08:00
  • 90620c2aec Optimize fdderivs: skip redundant 2nd-order work in 4th-order overlap CGH0S7 2026-03-02 03:21:21 +08:00
  • f561522d89 prolong3: improve cache hit rate jaunatisblue 2026-03-02 10:31:46 +08:00
  • 3f4715b8cc Modify prolong jaunatisblue 2026-03-02 02:01:07 +08:00
  • 710ea8f76b Optimize memory access in prolong3 jaunatisblue 2026-03-02 01:16:10 +08:00
  • 5cf891359d Optimize symmetry_bd with stride-based fast paths CGH0S7 2026-03-01 15:50:56 +08:00
  • 222747449a Optimize average2: use DO CONCURRENT loop form CGH0S7 2026-03-01 00:41:32 +08:00
  • 14de4d535e Optimize average2: replace array expression with explicit loops CGH0S7 2026-03-01 00:33:01 +08:00
  • 787295692a Optimize prolong3: hoist bounds check out of inner loop CGH0S7 2026-03-01 00:17:30 +08:00
  • 335f2f23fe Optimize prolong3: replace parity branches with coefficient lookup CGH0S7 2026-02-28 23:59:57 +08:00
  • 7109474a14 Optimize prolong3: precompute coarse index/parity maps CGH0S7 2026-02-28 23:53:30 +08:00
  • 47f91ff46f prolong3: improve cache hit rate jaunatisblue 2026-03-02 10:31:46 +08:00
  • e11363e06e Optimize fdderivs: skip redundant 2nd-order work in 4th-order overlap hxh-omp CGH0S7 2026-03-02 03:21:21 +08:00
  • f70e90f694 prolong3: improve cache hit rate jaunatisblue 2026-03-02 10:31:46 +08:00
  • 75dd5353b0 Modify prolong jaunatisblue 2026-03-02 02:01:07 +08:00
  • 23a82d063b Optimize memory access in prolong3 jaunatisblue 2026-03-02 01:16:10 +08:00
  • 672b7ebee2 Modify prolong jaunatisblue 2026-03-02 02:01:07 +08:00
  • 63bf180159 Optimize memory access in prolong3 jaunatisblue 2026-03-02 01:16:10 +08:00
  • 524d1d1512 Merge pull request 'cjy-dystopia' (#2) from cjy-dystopia into main gh0s7 2026-03-01 19:22:09 +08:00
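  • 44efb2e08c Final preliminary-round version v1.0.0: confirmed that PGO and the original load-balancing scheme hurt performance in the current version, so both have been reverted CGH0S7 2026-03-01 18:04:25 +08:00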
  • 16013081e0 Optimize symmetry_bd with stride-based fast paths CGH0S7 2026-03-01 15:50:56 +08:00
  • e7a02e8f72 perf(polint): add uniform-grid fast path for barycentric n=6 CGH0S7 2026-03-01 13:26:39 +08:00
  • 8dad910c6c perf(polint): add switchable barycentric ordn=6 path CGH0S7 2026-03-01 13:20:46 +08:00
  • 01b4cf71d1 perf(polin3): switch to lagrange-weight tensor contraction CGH0S7 2026-03-01 13:04:33 +08:00
  • 66dabe8cc4 perf(polint): add ordn=6 specialized neville path CGH0S7 2026-03-01 12:39:53 +08:00
  • 03416a7b28 perf(polint): add uniform-grid fast path for barycentric n=6 CGH0S7 2026-03-01 13:26:39 +08:00
  • cca3c16c2b perf(polint): add switchable barycentric ordn=6 path CGH0S7 2026-03-01 13:20:46 +08:00
  • e5231849ee perf(polin3): switch to lagrange-weight tensor contraction CGH0S7 2026-03-01 13:04:33 +08:00
  • a766e49ff0 perf(polint): add ordn=6 specialized neville path CGH0S7 2026-03-01 12:39:53 +08:00
  • 19b0e79692 Boss Huang's jaw-dropping rewrite hxh-new wingrew 2026-03-01 05:48:40 +08:00
  • 1a518cd3f6 Optimize average2: use DO CONCURRENT loop form CGH0S7 2026-03-01 00:41:32 +08:00
  • 1dc622e516 Optimize average2: replace array expression with explicit loops CGH0S7 2026-03-01 00:33:01 +08:00
  • 3046a0ccde Optimize prolong3: hoist bounds check out of inner loop CGH0S7 2026-03-01 00:17:30 +08:00
  • d4ec69c98a Optimize prolong3: replace parity branches with coefficient lookup CGH0S7 2026-02-28 23:59:57 +08:00
  • 2c0a3055d4 Optimize prolong3: precompute coarse index/parity maps CGH0S7 2026-02-28 23:53:30 +08:00
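  • 1eba73acbe Disable core pinning for now; benchmarks showed unpinned + SCX is faster than pinned + SCX CGH0S7 2026-02-28 23:27:44 +08:00
  • 588fb675a0 Tried splitting into 4 blocks but results were poor; switching to memory-access optimization yx_new_split jaunatisblue 2026-02-28 21:17:02 +08:00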
  • b91cfff301 Add switchable C RK4 kernel and build toggle CGH0S7 2026-02-28 21:12:19 +08:00
  • e29ca2dca9 build: switch allocator option to oneTBB tbbmalloc CGH0S7 2026-02-28 17:16:00 +08:00
  • 6493101ca0 bssn_rhs_c: recompute contracted Gamma terms to remove temp arrays CGH0S7 2026-02-28 16:34:23 +08:00
  • 169986cde1 bssn_rhs_c: compute div_beta on-the-fly to remove temp array CGH0S7 2026-02-28 16:25:57 +08:00
  • 1fbc213888 bssn_rhs_c: remove gxx/gyy/gzz temporaries in favor of dxx/dyy/dzz+1 CGH0S7 2026-02-28 15:50:52 +08:00
  • 6024708a48 derivs_c: split low/high stencil regions to reduce branch overhead CGH0S7 2026-02-28 15:42:31 +08:00
  • abf2f640e4 add fused symmetry packing kernels for orders 2 and 3 in BSSN RHS ianchb 2026-02-28 15:35:14 +08:00
  • 94f40627aa refine GPU dispatch initialization and optimize H2D/D2H data transfers ianchb 2026-02-28 15:23:41 +08:00
  • bc457d981e bssn_rhs_c: merge lopsided+kodis with shared symmetry buffer CGH0S7 2026-02-28 15:23:01 +08:00
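  • 51dead090e bssn_rhs_c: fuse the two final RHS loops into one, passing fij intermediates through local variables (Modify 6) CGH0S7 2026-02-28 13:49:45 +08:00
  • 34d6922a66 fdderivs_c: zero only the boundary faces instead of the whole array, cutting unnecessary memory writes CGH0S7 2026-02-28 13:20:06 +08:00
  • 8010ad27ed kodiss_c: tighten loop bounds to eliminate useless boundary iterations and branch checks CGH0S7 2026-02-28 13:04:21 +08:00
  • 38e691f013 bssn_rhs_c: fuse the Christoffel correction and trK_rhs loops into one (Modify 5) CGH0S7 2026-02-28 12:57:07 +08:00
  • 808387aa11 bssn_rhs_c: fuse the fxx/Gamxa and Gamma_rhs_part2 loops into one (Modify 4) CGH0S7 2026-02-28 11:14:35 +08:00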
  • d94c31c5c4 [WIP]Implement multi-GPU support in BSSN RHS and add profiling for H2D/D2H transfers ianchb 2026-02-28 01:21:45 +08:00
  • 724e9cd415 [WIP]Add CUDA support for BSSN RHS with new kernel and update makefiles ianchb 2026-02-27 21:46:43 +08:00
  • c001939461 Add Lagrange interpolation subroutine and update calls in prolongrestrict modules ianchb 2026-02-27 12:46:39 +08:00
  • 94d236385d Revert "skip redundant MPI ghost cell syncs for stages 0, 1 & 2" ianchb 2026-02-26 20:56:41 +08:00
  • 780f1c80d0 skip redundant MPI ghost cell syncs for stages 0, 1 & 2 ianchb 2026-02-26 16:16:33 +08:00
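  • c2b676abf2 bssn_rhs_c: fuse the A^{ij} index-raising and Gamma_rhs_part1 loops into one (Modify 3) CGH0S7 2026-02-28 11:02:27 +08:00
  • 2c60533501 bssn_rhs_c: fuse the inverse-metric, Gamma-constraint, and Christoffel loops into one (Modify 2) CGH0S7 2026-02-28 10:57:40 +08:00
  • aabe74c098 A brief attempt at a 4-way split that ended in failure jaunatisblue 2026-02-28 08:23:30 +08:00
  • 318b5254cc Update the validation script per the organizing committee's email: additionally require RMS error below 1.0% for the 3D vector and for each of its three components CGH0S7 2026-02-27 17:38:21 +08:00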
  • 3cee05f262 Merge branch 'cjy-oneapi-opus-hotfix' CGH0S7 2026-02-27 15:13:40 +08:00
  • e0b5e012df Introduce a PGO-style two-pass build to make the Interp_Points load-balancing optimization legitimate cjy-oneapi-opus-hotfix CGH0S7 2026-02-27 15:10:22 +08:00
  • 6b2464b80c Interp_Points load balancing: split hot blocks and remap ranks jaunatisblue 2026-02-27 15:07:40 +08:00