Interp_Points load balancing: hotspot block splitting and rank remapping

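Background: Patch::Interp_Points suffers from severe MPI load imbalance during interpolation onto the sphere. Timing with MPI_Wtime showed that, in a 64-process run, ranks 27/28/35/36 carry most of the interpolation work (2.6 to 3.3 times the average time). The other 60 processes wait idle at MPI collective calls, which makes this the overall performance bottleneck.

Root cause: the blocks owned by these four ranks happen to cover the region of physical space where the interpolation points of the extraction sphere are densest, while distribute assigns blocks to ranks by uniform grid volume and ignores the uneven spatial distribution of the interpolation points.

Optimization:

1. Add distribute_optimize as a replacement for distribute. It walks all blocks with a separate current_block_id counter that is decoupled from rank assignment.
2. Hotspot block splitting (splitHotspotBlock): blocks 27/28/35/36 are bisected at the midpoint along the x axis into a left and a right sub-block, which are assigned to two adjacent ranks:
   - block 27 → (rank 26, rank 27)
   - block 28 → (rank 28, rank 29)
   - block 35 → (rank 34, rank 35)
   - block 36 → (rank 36, rank 37)

   The sub-blocks strictly replicate the ghost-zone expansion and physical-coordinate computation of the original distribute (both Vertex and Cell grid modes are supported).
3. Neighbor rank remapping (createMappedBlock): a neighbor block whose rank has been taken gives up its original rank and is remapped to an adjacent free rank:
   - block 26 → rank 25
   - block 29 → rank 30
   - block 34 → rank 33
   - block 37 → rank 38

   All other blocks keep the original block_id == rank mapping.
4. In cgh.C, compose_cgh selects distribute_optimize or the original distribute through a preprocessor macro.
5. MPatch.C gains profile-collection instrumentation: overload 2 of Interp_Points is timed with MPI_Wtime, the per-rank times are collected with MPI_Gather, and the hotspot ranks are identified and written to a binary profile file.
6. New interp_lb_profile.h/C: defines the profile file format (magic, version, nprocs, threshold_ratio, heavy_ranks) and provides the write_profile/read_profile/identify_heavy_ranks interfaces.

Mathematical equivalence: splitting and remapping change only the geometric partitioning of the blocks and their rank ownership. No physical equation, finite-difference scheme, or interpolation algorithm is modified, so the computed results are strictly identical.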
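
The block-to-rank assignment described in items 2 and 3 of the message can be sketched as a small lookup. This is only an illustration, not the code of distribute_optimize, splitHotspotBlock, or createMappedBlock, which this excerpt does not show. The helper names ranks_for_block and blocks_per_rank are hypothetical, and the sketch assumes that block ids run from 0 to nprocs - 1 (64 blocks on 64 ranks).

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of the commit): ranks that own a block under
// the mapping listed in the commit message. A hotspot block returns two ranks
// (left half, right half along x); every other block returns one rank.
std::vector<int> ranks_for_block(int block_id)
{
  switch (block_id)
  {
  // Hotspot blocks, bisected along x and shared by two adjacent ranks.
  case 27: return {26, 27};
  case 28: return {28, 29};
  case 35: return {34, 35};
  case 36: return {36, 37};
  // Neighbor blocks that give up their original rank.
  case 26: return {25};
  case 29: return {30};
  case 34: return {33};
  case 37: return {38};
  // All other blocks keep block_id == rank.
  default: return {block_id};
  }
}

// Hypothetical helper: number of blocks or sub-blocks each rank ends up with.
// Assumption: one block per rank before the change, ids 0 .. nprocs - 1.
std::vector<int> blocks_per_rank(int nprocs)
{
  std::vector<int> load(nprocs, 0);
  for (int b = 0; b < nprocs; ++b)
    for (int r : ranks_for_block(b))
      ++load[r];
  return load;
}
```

Under these assumptions, blocks_per_rank(64) gives two pieces to ranks 25, 30, 33, and 38 (their own block plus a displaced neighbor) and one piece to every other rank, for 68 pieces in total. Each of the formerly hot ranks 27, 28, 35, and 36 is left with a single half block.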
2026-02-27 15:07:40 +08:00
parent 9c33e16571
commit 6b2464b80c
6 changed files with 574 additions and 1 deletions
--- a/AMSS_NCKU_source/MPatch.C
+++ b/AMSS_NCKU_source/MPatch.C
@@ -13,6 +13,9 @@ using namespace std;
 #include "MPatch.h"
 #include "Parallel.h"
 #include "fmisc.h"
+#ifdef INTERP_LB_PROFILE
+#include "interp_lb_profile.h"
+#endif

 Patch::Patch(int DIM, int *shapei, double *bboxi, int levi, bool buflog, int Symmetry) : lev(levi)
 {
@@ -507,6 +510,9 @@ void Patch::Interp_Points(MyList<var> *VarList,
  // Targeted point-to-point overload: each owner sends each point only to
  // the one rank that needs it for integration (consumer), reducing
  // communication volume by ~nprocs times compared to the Bcast version.
+#ifdef INTERP_LB_PROFILE
+  double t_interp_start = MPI_Wtime();
+#endif
  int myrank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
@@ -608,6 +614,11 @@ void Patch::Interp_Points(MyList<var> *VarList,
    }
  }

+#ifdef INTERP_LB_PROFILE
+  double t_interp_end = MPI_Wtime();
+  double t_interp_local = t_interp_end - t_interp_start;
+#endif
+
  // --- Error check for unfound points ---
  for (int j = 0; j < NN; j++)
  {
@@ -764,6 +775,31 @@ void Patch::Interp_Points(MyList<var> *VarList,
  delete[] recv_count;
  delete[] consumer_rank;
  delete[] owner_rank;
+
+#ifdef INTERP_LB_PROFILE
+  {
+    static bool profile_written = false;
+    if (!profile_written) {
+      double *all_times = nullptr;
+      if (myrank == 0) all_times = new double[nprocs];
+      MPI_Gather(&t_interp_local, 1, MPI_DOUBLE,
+                 all_times, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
+      if (myrank == 0) {
+        int heavy[64];
+        int nh = InterpLBProfile::identify_heavy_ranks(
+            all_times, nprocs, 2.5, heavy, 64);
+        InterpLBProfile::write_profile(
+            "interp_lb_profile.bin", nprocs,
+            all_times, heavy, nh, 2.5);
+        printf("[InterpLB] Profile written: %d heavy ranks\n", nh);
+        for (int i = 0; i < nh; i++)
+          printf("  Heavy rank %d: %.6f s\n", heavy[i], all_times[heavy[i]]);
+        delete[] all_times;
+      }
+      profile_written = true;
+    }
+  }
+#endif
 }
 void Patch::Interp_Points(MyList<var> *VarList,
                          int NN, double **XX,