AMSS-NCKU

64-BitBrainstorm_2026/AMSS-NCKU

Author	SHA1	Message	Date
CGH0S7	5070134857	perf(transfer_cached): move the per-call new/delete arrays req_node/req_is_recv/completed into SyncCache for reuse. This avoids allocating and freeing 3 temporary arrays on every transfer_cached call and reduces heap-operation overhead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-02 21:14:35 +08:00
CGH0S7	4012e9d068	perf(RestrictProlong): replace the non-cached versions with Restrict_cached/OutBdLow2Hi_cached and change Sync_finish to progressive unpacking - In RestrictProlong/RestrictProlong_aux, Restrict() and OutBdLow2Hi() are replaced by their _cached versions, which reuse the gridseg lists and MPI buffers instead of reallocating them on every call - Add two sets of per-level caches, sync_cache_restrict/sync_cache_outbd - Sync_finish switches from MPI_Waitall to MPI_Waitsome with progressive unpacking, lowering tail latency - AsyncSyncState gains the fields req_node/req_is_recv/pending_recv to support progressive unpacking Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-02 20:48:38 +08:00
ianchb	42c69fab24	refactor(Parallel): streamline MPI communication by consolidating request handling and memory management	2026-03-02 17:37:13 +08:00
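jaunatisblue	6b2464b80c	Interp_Points load balancing: hotspot block splitting and rank remapping. Background: Patch::Interp_Points suffers from severe MPI load imbalance during spherical-surface interpolation. Timing with MPI_Wtime showed that, of 64 processes, ranks 27/28/35/36 carry most of the interpolation work (2.6 to 3.3 times the average time), leaving the other 60 processes idle at MPI collective communication and making this the overall performance bottleneck. Root cause: the blocks owned by these four ranks happen to cover, in physical space, the region where the interpolation points of the extraction sphere are densest, while the distribute function assigns blocks to ranks by uniform grid volume and ignores the uneven spatial distribution of interpolation points. Optimization: 1. Add a distribute_optimize function to replace distribute; it iterates over all blocks with an independent current_block_id counter (decoupled from rank assignment). 2. Hotspot block splitting (splitHotspotBlock): blocks 27/28/35/36 are bisected at the midpoint along the x axis into left and right sub-blocks, which are assigned to two adjacent ranks: - block 27 → (rank 26, rank 27) - block 28 → (rank 28, rank 29) - block 35 → (rank 34, rank 35) - block 36 → (rank 36, rank 37) The sub-blocks strictly replicate the ghost zone expansion and physical coordinate computation of the original distribute (both Vertex and Cell grid modes are supported). 3. Neighbor rank remapping (createMappedBlock): neighbor blocks whose ranks are taken must give up their original rank and are remapped to an adjacent free rank: - block 26 → rank 25 - block 29 → rank 30 - block 34 → rank 33 - block 37 → rank 38 All other blocks keep the original block_id == rank mapping. 4. In cgh.C, compose_cgh switches between distribute_optimize and the original distribute via a preprocessor macro. 5. Add profile-collection instrumentation in MPatch.C: overload 2 of Interp_Points is timed with MPI_Wtime, per-rank times are collected with MPI_Gather, and hotspot ranks are identified and written to a binary profile file. 6. Add interp_lb_profile.h/C: defines the profile file format (magic, version, nprocs, threshold_ratio, heavy_ranks) and provides the write_profile/read_profile/identify_heavy_ranks interfaces. Mathematical equivalence: splitting and remapping only change the geometric partitioning of blocks and their rank ownership; no physical equation, finite-difference scheme, or interpolation algorithm is modified, so the computed results are strictly identical.	2026-02-27 15:07:40 +08:00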
CGH0S7	5c1790277b	Replace nested OutBdLow2Hi loops with batch calls in RestrictProlong The 8 nested while(Ppc){while(Pp){OutBdLow2Hi(single,single,...)}} loops across RestrictProlong (3 overloads) and ProlongRestrict each produced N_c × N_f separate transfer() → MPI_Waitall barriers. Replace with the existing batch OutBdLow2Hi(MyList<Patch>*,...) which merges all patch pairs into a single transfer() call with 1 MPI_Waitall. Also add Restrict_cached, OutBdLow2Hi_cached, OutBdLow2Himix_cached to Parallel (unused for now — kept as infrastructure for future use). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-02-11 16:09:08 +08:00
CGH0S7	e09ae438a2	Cache data_packer lengths in Sync_start to skip redundant buffer-size traversals The data_packer(NULL, ...) calls that compute send/recv buffer lengths traverse all grid segments × variables × nprocs on every Sync_start invocation, even though lengths never change once the cache is built. Add a lengths_valid flag to SyncCache so these length computations are done once and reused on subsequent calls (4× per RK4 step). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-02-10 21:39:22 +08:00
CGH0S7	42b9cf1ad9	Optimize MPI Sync with merged transfers, caching, and async overlap Phase 1: Merge N+1 transfer() calls into a single transfer() per Sync(PatchList), reducing N+1 MPI_Waitall barriers to 1 via new Sync_merged() that collects all intra-patch and inter-patch grid segment lists into combined per-rank arrays. Phase 2: Cache grid segment lists and reuse grow-only communication buffers across RK4 substeps via SyncCache struct. Caches are per-level and per-variable-list (predictor/corrector), invalidated on regrid. Eliminates redundant build_ghost_gsl/build_owned_gsl0/build_gstl rebuilds and malloc/free cycles between regrids. Phase 3: Split Sync into async Sync_start/Sync_finish to overlap Cartesian ghost zone exchange (MPI_Isend/Irecv) with Shell patch synchronization. Uses MPI tag 2 to avoid conflicts with SH->Synch() which uses transfer() with tag 1. Also updates makefile.inc paths and flags for local build environment. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 21:03:37 +08:00
CGH0S7	f2fc9af70e	asc26 amss-ncku initialized	2026-01-13 15:01:15 +08:00
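
Commit 6b2464b80c names an identify_heavy_ranks interface and a threshold_ratio field but does not show either. The sketch below is one plausible reading, assuming a rank counts as "heavy" when its measured time exceeds threshold_ratio times the mean over all ranks. The signature and the comparison rule are assumptions for illustration, not the repository's code.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical signature: the commit names identify_heavy_ranks but does not
// show its parameters. rank_times[r] is the time measured on rank r.
// Returns the ranks whose time exceeds threshold_ratio * mean(rank_times).
std::vector<int> identify_heavy_ranks(const std::vector<double>& rank_times,
                                      double threshold_ratio)
{
    assert(threshold_ratio > 0.0);  // a non-positive ratio would flag every rank

    std::vector<int> heavy;
    if (rank_times.empty()) return heavy;

    const double mean =
        std::accumulate(rank_times.begin(), rank_times.end(), 0.0) /
        static_cast<double>(rank_times.size());

    for (std::size_t r = 0; r < rank_times.size(); ++r) {
        if (rank_times[r] > threshold_ratio * mean)
            heavy.push_back(static_cast<int>(r));
    }
    return heavy;
}
```

In the instrumented code the input vector would be the result of gathering the per-rank timings on one rank; here it is simply passed in so the selection rule can be read in isolation.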

8 Commits
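
Commits 42b9cf1ad9 and e09ae438a2 describe grow-only communication buffers and a lengths_valid flag that skips repeated buffer-length computation until the cache is invalidated. The following is a minimal, MPI-free sketch of that idea. SyncCacheSketch, acquire_send_buffer, and the length_computations counter are hypothetical names introduced here; the real SyncCache layout is not shown in the log.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for the SyncCache described in the commits.
struct SyncCacheSketch {
    std::vector<double> send_buf;  // grow-only: never shrunk between regrids
    std::size_t send_len = 0;      // cached number of elements to send
    bool lengths_valid = false;    // true once send_len has been computed
    int length_computations = 0;   // instrumentation for this sketch only

    void invalidate() { lengths_valid = false; }  // e.g. called on regrid
};

// compute_len stands in for the expensive length traversal that the commit
// says is performed by data_packer(NULL, ...); it runs only when the cached
// length is not valid.
template <class ComputeLen>
double* acquire_send_buffer(SyncCacheSketch& cache, ComputeLen compute_len)
{
    if (!cache.lengths_valid) {
        cache.send_len = compute_len();
        cache.lengths_valid = true;
        ++cache.length_computations;
    }
    // Grow the buffer if needed, but never release memory here.
    if (cache.send_buf.size() < cache.send_len)
        cache.send_buf.resize(cache.send_len);

    assert(cache.send_buf.size() >= cache.send_len);
    return cache.send_buf.data();
}
```

Calling acquire_send_buffer four times in a row, as for the four Sync_start calls per RK4 step mentioned in e09ae438a2, runs compute_len once. After invalidate() the length is recomputed, while the buffer keeps its earlier capacity.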