Compare commits

...

24 Commits

Author SHA1 Message Date
e0b5e012df 引入 PGO 式两遍编译流程,将 Interp_Points 负载均衡优化合法化
背景:
上一个 commit 中同事实现的热点 block 拆分与 rank 重映射取得了显著
加速效果,但其中硬编码了 heavy ranks (27/28/35/36) 和重映射表,
属于针对特定测例的优化,违反竞赛规则第 6 条(不允许针对参数或测例
的专门优化)。

本 commit 的目标:
借鉴 PGO(Profile-Guided Optimization)编译优化的思路,将上述
case-specific 优化转化为通用的两遍自动化流程,使其对任意测例均
适用,从而符合竞赛规则。

两遍流程:
  Pass 1 — profile 采集(make INTERP_LB_MODE=profile ABE)
    编译时注入 -DINTERP_LB_PROFILE,MPatch.C 中 Interp_Points
    在首次调用时用 MPI_Wtime 计时 + MPI_Gather 汇总各 rank 耗时,
    识别超过均值 2.5 倍的热点 rank,写入 interp_lb_profile.bin。

  中间步骤 — 生成编译时头文件
    python3 gen_interp_lb_header.py 读取 profile.bin,自动计算
    拆分策略和重映射表,生成 interp_lb_profile_data.h,包含:
    - interp_lb_splits[][3]:每个热点 block 的 (block_id, r_left, r_right)
    - interp_lb_remaps[][2]:被挤占邻居 block 的 rank 重映射

  Pass 2 — 优化编译(make INTERP_LB_MODE=optimize ABE)
    编译时注入 -DINTERP_LB_OPTIMIZE,profile 数据以 static const
    数组形式固化进可执行文件(零运行时开销),distribute_optimize
    在 block 创建阶段直接应用拆分和重映射。

具体改动:
- makefile.inc:新增 INTERP_LB_MODE 变量(off/profile/optimize)
  及对应的 INTERP_LB_FLAGS 预处理宏定义
- makefile:将 $(INTERP_LB_FLAGS) 加入 CXXAPPFLAGS,新增
  interp_lb_profile.o 编译目标
- gen_interp_lb_header.py:profile.bin → interp_lb_profile_data.h
  的自动转换脚本
- interp_lb_profile_data.h:自动生成的编译时常量头文件
- interp_lb_profile.bin:profile 采集阶段生成的二进制数据
- AMSS_NCKU_Program.py:构建时自动拷贝 profile.bin 到运行目录
- makefile_and_run.py:默认构建命令切换为 INTERP_LB_MODE=optimize

通用性说明:
整个流程不依赖任何硬编码的 rank 编号或测例参数。对于不同的网格
配置、进程数或物理问题,只需重新执行 Pass 1 采集 profile,即可
自动生成对应的优化方案。这与 PGO 编译优化的理念完全一致——先
profile 再优化,是一种通用的性能优化方法论。
2026-02-27 15:10:22 +08:00
jaunatisblue
6b2464b80c Interp_Points 负载均衡:热点 block 拆分与 rank 重映射
问题背景:
Patch::Interp_Points 在球面插值时存在严重的 MPI 负载不均衡。
通过 MPI_Wtime 计时诊断发现,64 进程中 rank 27/28/35/36 四个进程
承担了绝大部分插值计算(耗时为平均值的 2.6~3.3 倍),导致其余 60
个进程在 MPI 集合通信处空等,成为整体性能瓶颈。

根因分析:
这四个 rank 对应的 block 在物理空间上恰好覆盖了球面提取面
(extraction sphere)的密集插值点区域,而 distribute 函数按均匀
网格体积分配 block-to-rank,未考虑插值点的空间分布不均。

优化方案:
1. 新增 distribute_optimize 函数替代 distribute,使用独立的
   current_block_id 计数器(与 rank 分配解耦)遍历所有 block。

2. 热点 block 拆分(splitHotspotBlock):
   对 block 27/28/35/36 沿 x 轴在中点处二等分,生成左右两个子
   block,分别分配给相邻的两个 rank:
   - block 27 → (rank 26, rank 27)
   - block 28 → (rank 28, rank 29)
   - block 35 → (rank 34, rank 35)
   - block 36 → (rank 36, rank 37)
   子 block 严格复刻原 distribute 的 ghost zone 扩张和物理坐标
   计算逻辑(支持 Vertex/Cell 两种网格模式)。

3. 邻居 rank 重映射(createMappedBlock):
   被占用的邻居 block 需要让出原 rank,重映射到相邻空闲 rank:
   - block 26 → rank 25
   - block 29 → rank 30
   - block 34 → rank 33
   - block 37 → rank 38
   其余 block 保持 block_id == rank 的原始映射。

4. cgh.C 中 compose_cgh 通过预处理宏切换调用 distribute_optimize
   或原始 distribute。

5. MPatch.C 中添加 profile 采集插桩:在 Interp_Points 重载 2 中
   用 MPI_Wtime 计时,MPI_Gather 汇总各 rank 耗时,识别热点 rank
   并写入二进制 profile 文件。

6. 新增 interp_lb_profile.h/C:定义 profile 文件格式(magic、
   version、nprocs、threshold_ratio、heavy_ranks),提供
   write_profile/read_profile/identify_heavy_ranks 接口。

数学等价性:拆分和重映射仅改变 block 的几何划分与 rank 归属,
不修改任何物理方程、差分格式或插值算法,计算结果严格一致。
2026-02-27 15:07:40 +08:00
9c33e16571 增加C算子PGO文件 2026-02-27 11:30:36 +08:00
45b7a43576 补全C算子和Fortran算子的数学差异 2026-02-26 15:48:11 +08:00
dfb79e3e11 Initialize output arrays to zero in fdderivs_c.C and fderivs_c.C 2026-02-26 14:18:31 +08:00
d2c2214fa1 补充TwoPunctureABE专用PGO插桩文件 2026-02-25 23:06:17 +08:00
e157ea3a23 合并 chb-replace:C++ 算子替换 Fortran bssn_rhs,添加回退开关与独立 PGO profdata
- 合并 chb-replace 分支,引入 bssn_rhs_c.C / fderivs_c.C / fdderivs_c.C /
  kodiss_c.C / lopsided_c.C 五个 C++ 算子实现
- 添加 USE_CXX_KERNELS 开关(默认 1),设为 0 可回退到原始 Fortran bssn_rhs.o
- TwoPunctureABE 改用独立的 TwoPunctureABE.profdata 而非 default.profdata

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 22:50:46 +08:00
f5a63f1e42 Revert "Fix timing: replace clock() with MPI_Wtime() for wall-clock measurement"
This reverts commit 09b937c022.
2026-02-25 22:21:43 +08:00
284ab80baf Remove OpenMP from C rewrite kernel
The C rewrite introduced OpenMP parallelism. Remove all OpenMP.
2026-02-25 22:21:20 +08:00
copilot-swe-agent[bot]
09b937c022 Fix timing: replace clock() with MPI_Wtime() for wall-clock measurement
clock() measures total CPU time across all threads, not wall-clock
time. With the new OpenMP parallel regions in bssn_rhs_c.C, clock()
sums CPU time from all OpenMP threads, producing inflated timing that
scales with thread count rather than reflecting actual elapsed time.

MPI_Wtime() returns wall-clock seconds, giving accurate timing
regardless of the number of OpenMP threads running inside the
measured interval.

Co-authored-by: ianchb <i@4t.pw>
2026-02-25 22:21:19 +08:00
wingrew
8a9c775705 Replace Fortran bssn_rhs with C implementation and add C helper kernels
- Modify bssn_rhs_c.C to use existing project headers (macrodef.h, bssn_rhs.h)
- Update makefile: remove bssn_rhs.o from F90FILES, add CFILES with OpenMP
- Keep Fortran helper files (diff_new.f90, kodiss.f90, lopsidediff.f90) for other Fortran callers

[copilot: fix compiling errors & a nan error]

Co-authored-by: ianchb <i@4t.pw>
Co-authored-by: copilot-swe-agent[bot] <198982749+copilot@users.noreply.github.com>
2026-02-25 22:21:19 +08:00
d942122043 更新PGO文件 2026-02-25 18:25:20 +08:00
a5c713a7e0 完善PGO机制 2026-02-25 17:22:56 +08:00
9e6b25163a 更新 PGO profdata 并为 ABE 插桩编译添加 PGO_MODE 开关
- 更新 pgo_profile/default.profdata 为最新收集的 profile 数据
- 备份旧 profdata 至 default.profdata.backup2
- makefile: 新增 PGO_MODE 开关(默认 opt),支持 make PGO_MODE=instrument
  切换到 Phase 1 插桩模式重新收集数据,无需手动修改 flags
- makefile: TwoPunctureABE 独立使用 TP_OPTFLAGS,不受 PGO_MODE 影响
- makefile: PROFDATA 路径改为 /home/$(shell whoami)/AMSS-NCKU/pgo_profile/default.profdata
- makefile.inc: 移除硬编码的编译 flags,改由 makefile 中的 ifeq 逻辑管理

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 17:00:55 +08:00
CGH0S7
efc8bf29ea 按需失效同步缓存:Regrid_Onelevel 改为返回 bool
将 cgh::Regrid_Onelevel 的返回类型从 void 改为 bool,
在网格真正发生移动时返回 true,否则返回 false。
调用方仅在返回 true 时才失效 sync_cache_*,避免了
每次 RecursiveStep 结束后无条件失效所有层级缓存的冗余开销。

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-02-25 16:00:26 +08:00
CGH0S7
ccf6adaf75 提供正确的macrodef.h避免llm被误导 2026-02-25 11:47:14 +08:00
CGH0S7
e2bc472845 优化绑核逻辑,取消硬编码改为智能识别 2026-02-25 10:59:32 +08:00
82339f5282 Merge lopsided advection + kodis dissipation to share symmetry_bd buffer
Cherry-picked from 38c2c30.
2026-02-20 13:36:27 +08:00
94f38c57f9 Don't hardcode pgo profile path 2026-02-20 13:36:27 +08:00
85d1e8de87 Add Intel SIMD vectorization directives to hot-spot functions
Apply Intel Advisor optimization recommendations:
- Add FORCEINLINE to polint for better inlining
- Add SIMD VECTORLENGTHFOR and UNROLL directives to fderivs,
  fdderivs, symmetry_bd, and kodis functions

This improves vectorization efficiency of finite difference
computations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 00:43:39 +08:00
5b7e05cd32 PGO updated 2026-02-11 18:26:30 +08:00
85afe00fc5 Merge plotting optimizations from chb-copilot-test
- Implement multiprocessing-based parallel plotting
- Add parallel_plot_helper.py for concurrent plot task execution
- Use matplotlib 'Agg' backend for multiprocessing safety
- Set OMP_NUM_THREADS=1 to prevent BLAS thread explosion
- Use subprocess for binary data plots to avoid thread conflicts
- Add fork bomb protection in main program

This merge only includes plotting improvements and excludes
MPI communication changes to preserve existing optimizations.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-11 16:19:17 +08:00
5c1790277b Replace nested OutBdLow2Hi loops with batch calls in RestrictProlong
The 8 nested while(Ppc){while(Pp){OutBdLow2Hi(single,single,...)}}
loops across RestrictProlong (3 overloads) and ProlongRestrict each
produced N_c × N_f separate transfer() → MPI_Waitall barriers.
Replace with the existing batch OutBdLow2Hi(MyList<Patch>*,...) which
merges all patch pairs into a single transfer() call with 1 MPI_Waitall.

Also add Restrict_cached, OutBdLow2Hi_cached, OutBdLow2Himix_cached
to Parallel (unused for now — kept as infrastructure for future use).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-11 16:09:08 +08:00
e09ae438a2 Cache data_packer lengths in Sync_start to skip redundant buffer-size traversals
The data_packer(NULL, ...) calls that compute send/recv buffer lengths
traverse all grid segments × variables × nprocs on every Sync_start
invocation, even though lengths never change once the cache is built.
Add a lengths_valid flag to SyncCache so these length computations are
done once and reused on subsequent calls (4× per RK4 step).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-02-10 21:39:22 +08:00
42 changed files with 3884 additions and 442 deletions

View File

@@ -8,6 +8,14 @@
## ##
################################################################## ##################################################################
## Guard against re-execution by multiprocessing child processes.
## Without this, using 'spawn' or 'forkserver' context would cause every
## worker to re-run the entire script, spawning exponentially more
## workers (fork bomb).
if __name__ != '__main__':
import sys as _sys
_sys.exit(0)
################################################################## ##################################################################
@@ -262,6 +270,12 @@ if not os.path.exists( ABE_file ):
## Copy the executable ABE (or ABEGPU) into the run directory ## Copy the executable ABE (or ABEGPU) into the run directory
shutil.copy2(ABE_file, output_directory) shutil.copy2(ABE_file, output_directory)
## Copy interp load balance profile if present (for optimize pass)
interp_lb_profile = os.path.join(AMSS_NCKU_source_copy, "interp_lb_profile.bin")
if os.path.exists(interp_lb_profile):
shutil.copy2(interp_lb_profile, output_directory)
print( " Copied interp_lb_profile.bin to run directory " )
########################### ###########################
## If the initial-data method is TwoPuncture, copy the TwoPunctureABE executable to the run directory ## If the initial-data method is TwoPuncture, copy the TwoPunctureABE executable to the run directory
@@ -424,26 +438,31 @@ print(
import plot_xiaoqu import plot_xiaoqu
import plot_GW_strain_amplitude_xiaoqu import plot_GW_strain_amplitude_xiaoqu
from parallel_plot_helper import run_plot_tasks_parallel
plot_tasks = []
## Plot black hole trajectory ## Plot black hole trajectory
plot_xiaoqu.generate_puncture_orbit_plot( binary_results_directory, figure_directory ) plot_tasks.append( ( plot_xiaoqu.generate_puncture_orbit_plot, (binary_results_directory, figure_directory) ) )
plot_xiaoqu.generate_puncture_orbit_plot3D( binary_results_directory, figure_directory ) plot_tasks.append( ( plot_xiaoqu.generate_puncture_orbit_plot3D, (binary_results_directory, figure_directory) ) )
## Plot black hole separation vs. time ## Plot black hole separation vs. time
plot_xiaoqu.generate_puncture_distence_plot( binary_results_directory, figure_directory ) plot_tasks.append( ( plot_xiaoqu.generate_puncture_distence_plot, (binary_results_directory, figure_directory) ) )
## Plot gravitational waveforms (psi4 and strain amplitude) ## Plot gravitational waveforms (psi4 and strain amplitude)
for i in range(input_data.Detector_Number): for i in range(input_data.Detector_Number):
plot_xiaoqu.generate_gravitational_wave_psi4_plot( binary_results_directory, figure_directory, i ) plot_tasks.append( ( plot_xiaoqu.generate_gravitational_wave_psi4_plot, (binary_results_directory, figure_directory, i) ) )
plot_GW_strain_amplitude_xiaoqu.generate_gravitational_wave_amplitude_plot( binary_results_directory, figure_directory, i ) plot_tasks.append( ( plot_GW_strain_amplitude_xiaoqu.generate_gravitational_wave_amplitude_plot, (binary_results_directory, figure_directory, i) ) )
## Plot ADM mass evolution ## Plot ADM mass evolution
for i in range(input_data.Detector_Number): for i in range(input_data.Detector_Number):
plot_xiaoqu.generate_ADMmass_plot( binary_results_directory, figure_directory, i ) plot_tasks.append( ( plot_xiaoqu.generate_ADMmass_plot, (binary_results_directory, figure_directory, i) ) )
## Plot Hamiltonian constraint violation over time ## Plot Hamiltonian constraint violation over time
for i in range(input_data.grid_level): for i in range(input_data.grid_level):
plot_xiaoqu.generate_constraint_check_plot( binary_results_directory, figure_directory, i ) plot_tasks.append( ( plot_xiaoqu.generate_constraint_check_plot, (binary_results_directory, figure_directory, i) ) )
run_plot_tasks_parallel(plot_tasks)
## Plot stored binary data ## Plot stored binary data
plot_xiaoqu.generate_binary_data_plot( binary_results_directory, figure_directory ) plot_xiaoqu.generate_binary_data_plot( binary_results_directory, figure_directory )

View File

@@ -13,6 +13,9 @@ using namespace std;
#include "MPatch.h" #include "MPatch.h"
#include "Parallel.h" #include "Parallel.h"
#include "fmisc.h" #include "fmisc.h"
#ifdef INTERP_LB_PROFILE
#include "interp_lb_profile.h"
#endif
Patch::Patch(int DIM, int *shapei, double *bboxi, int levi, bool buflog, int Symmetry) : lev(levi) Patch::Patch(int DIM, int *shapei, double *bboxi, int levi, bool buflog, int Symmetry) : lev(levi)
{ {
@@ -507,6 +510,9 @@ void Patch::Interp_Points(MyList<var> *VarList,
// Targeted point-to-point overload: each owner sends each point only to // Targeted point-to-point overload: each owner sends each point only to
// the one rank that needs it for integration (consumer), reducing // the one rank that needs it for integration (consumer), reducing
// communication volume by ~nprocs times compared to the Bcast version. // communication volume by ~nprocs times compared to the Bcast version.
#ifdef INTERP_LB_PROFILE
double t_interp_start = MPI_Wtime();
#endif
int myrank, nprocs; int myrank, nprocs;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank); MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs); MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
@@ -608,6 +614,11 @@ void Patch::Interp_Points(MyList<var> *VarList,
} }
} }
#ifdef INTERP_LB_PROFILE
double t_interp_end = MPI_Wtime();
double t_interp_local = t_interp_end - t_interp_start;
#endif
// --- Error check for unfound points --- // --- Error check for unfound points ---
for (int j = 0; j < NN; j++) for (int j = 0; j < NN; j++)
{ {
@@ -764,6 +775,31 @@ void Patch::Interp_Points(MyList<var> *VarList,
delete[] recv_count; delete[] recv_count;
delete[] consumer_rank; delete[] consumer_rank;
delete[] owner_rank; delete[] owner_rank;
#ifdef INTERP_LB_PROFILE
{
static bool profile_written = false;
if (!profile_written) {
double *all_times = nullptr;
if (myrank == 0) all_times = new double[nprocs];
MPI_Gather(&t_interp_local, 1, MPI_DOUBLE,
all_times, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
if (myrank == 0) {
int heavy[64];
int nh = InterpLBProfile::identify_heavy_ranks(
all_times, nprocs, 2.5, heavy, 64);
InterpLBProfile::write_profile(
"interp_lb_profile.bin", nprocs,
all_times, heavy, nh, 2.5);
printf("[InterpLB] Profile written: %d heavy ranks\n", nh);
for (int i = 0; i < nh; i++)
printf(" Heavy rank %d: %.6f s\n", heavy[i], all_times[heavy[i]]);
delete[] all_times;
}
profile_written = true;
}
}
#endif
} }
void Patch::Interp_Points(MyList<var> *VarList, void Patch::Interp_Points(MyList<var> *VarList,
int NN, double **XX, int NN, double **XX,

View File

@@ -462,7 +462,7 @@ MyList<Block> *Parallel::distribute(MyList<Patch> *PatchLIST, int cpusize, int i
} }
} }
#else #else
ng = ng0 = new Block(dim, shape_here, bbox_here, n_rank++, ingfsi, fngfsi, PP->lev); // delete through KillBlocks ng = ng0 = new Block(dim, shape_here, bbox_here, n_rank++, ingfsi, fngfsi, PP->lev);
// ng->checkBlock(); // ng->checkBlock();
if (BlL) if (BlL)
BlL->insert(ng); BlL->insert(ng);
@@ -500,6 +500,384 @@ MyList<Block> *Parallel::distribute(MyList<Patch> *PatchLIST, int cpusize, int i
return BlL; return BlL;
} }
#ifdef INTERP_LB_OPTIMIZE
#include "interp_lb_profile_data.h"
MyList<Block> *Parallel::distribute_optimize(MyList<Patch> *PatchLIST, int cpusize, int ingfsi, int fngfsi,
bool periodic, int nodes)
{
#ifdef USE_GPU_DIVIDE
double cpu_part, gpu_part;
map<string, double>::iterator iter;
iter = parameters::dou_par.find("cpu part");
if (iter != parameters::dou_par.end())
{
cpu_part = iter->second;
}
else
{
int myrank;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
const int LEN = 256;
char pline[LEN];
string str, sgrp, skey, sval;
int sind;
char pname[50];
{
map<string, string>::iterator iter = parameters::str_par.find("inputpar");
if (iter != parameters::str_par.end())
strcpy(pname, (iter->second).c_str());
else { cout << "Error inputpar" << endl; exit(0); }
}
ifstream inf(pname, ifstream::in);
if (!inf.good() && myrank == 0)
{ cout << "Can not open parameter file " << pname << endl; MPI_Abort(MPI_COMM_WORLD, 1); }
for (int i = 1; inf.good(); i++)
{
inf.getline(pline, LEN); str = pline;
int status = misc::parse_parts(str, sgrp, skey, sval, sind);
if (status == -1) { cout << "error reading parameter file " << pname << " in line " << i << endl; MPI_Abort(MPI_COMM_WORLD, 1); }
else if (status == 0) continue;
if (sgrp == "ABE") { if (skey == "cpu part") cpu_part = atof(sval.c_str()); }
}
inf.close();
parameters::dou_par.insert(map<string, double>::value_type("cpu part", cpu_part));
}
iter = parameters::dou_par.find("gpu part");
if (iter != parameters::dou_par.end())
{
gpu_part = iter->second;
}
else
{
int myrank;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
const int LEN = 256;
char pline[LEN];
string str, sgrp, skey, sval;
int sind;
char pname[50];
{
map<string, string>::iterator iter = parameters::str_par.find("inputpar");
if (iter != parameters::str_par.end())
strcpy(pname, (iter->second).c_str());
else { cout << "Error inputpar" << endl; exit(0); }
}
ifstream inf(pname, ifstream::in);
if (!inf.good() && myrank == 0)
{ cout << "Can not open parameter file " << pname << endl; MPI_Abort(MPI_COMM_WORLD, 1); }
for (int i = 1; inf.good(); i++)
{
inf.getline(pline, LEN); str = pline;
int status = misc::parse_parts(str, sgrp, skey, sval, sind);
if (status == -1) { cout << "error reading parameter file " << pname << " in line " << i << endl; MPI_Abort(MPI_COMM_WORLD, 1); }
else if (status == 0) continue;
if (sgrp == "ABE") { if (skey == "gpu part") gpu_part = atof(sval.c_str()); }
}
inf.close();
parameters::dou_par.insert(map<string, double>::value_type("gpu part", gpu_part));
}
if (nodes == 0) nodes = cpusize / 2;
#else
if (nodes == 0) nodes = cpusize;
#endif
if (dim != 3)
{
cout << "distrivute: now we only support 3-dimension" << endl;
MPI_Abort(MPI_COMM_WORLD, 1);
}
MyList<Block> *BlL = 0;
int split_size, min_size, block_size = 0;
int min_width = 2 * Mymax(ghost_width, buffer_width);
int nxyz[dim], mmin_width[dim], min_shape[dim];
MyList<Patch> *PLi = PatchLIST;
for (int i = 0; i < dim; i++)
min_shape[i] = PLi->data->shape[i];
int lev = PLi->data->lev;
PLi = PLi->next;
while (PLi)
{
Patch *PP = PLi->data;
for (int i = 0; i < dim; i++)
min_shape[i] = Mymin(min_shape[i], PP->shape[i]);
if (lev != PLi->data->lev)
cout << "Parallel::distribute CAUSTION: meet Patches for different level: " << lev << " and " << PLi->data->lev << endl;
PLi = PLi->next;
}
for (int i = 0; i < dim; i++)
mmin_width[i] = Mymin(min_width, min_shape[i]);
min_size = mmin_width[0];
for (int i = 1; i < dim; i++)
min_size = min_size * mmin_width[i];
PLi = PatchLIST;
while (PLi)
{
Patch *PP = PLi->data;
int bs = PP->shape[0];
for (int i = 1; i < dim; i++)
bs = bs * PP->shape[i];
block_size = block_size + bs;
PLi = PLi->next;
}
split_size = Mymax(min_size, block_size / nodes);
split_size = Mymax(1, split_size);
int n_rank = 0;
PLi = PatchLIST;
int reacpu = 0;
int current_block_id = 0;
while (PLi) {
Block *ng0, *ng;
bool first_block_in_patch = true;
Patch *PP = PLi->data;
reacpu += partition3(nxyz, split_size, mmin_width, nodes, PP->shape);
for (int i = 0; i < nxyz[0]; i++)
for (int j = 0; j < nxyz[1]; j++)
for (int k = 0; k < nxyz[2]; k++)
{
int ibbox_here[6], shape_here[3];
double bbox_here[6], dd;
Block *current_ng_start = nullptr;
bool is_heavy = false;
int r_l = -1, r_r = -1;
if (cpusize == INTERP_LB_NPROCS) {
for (int si = 0; si < INTERP_LB_NUM_HEAVY; si++) {
if (current_block_id == interp_lb_splits[si][0]) {
is_heavy = true;
r_l = interp_lb_splits[si][1];
r_r = interp_lb_splits[si][2];
break;
}
}
}
if (is_heavy)
{
int ib0 = (PP->shape[0] * i) / nxyz[0];
int ib3 = (PP->shape[0] * (i + 1)) / nxyz[0] - 1;
int jb1 = (PP->shape[1] * j) / nxyz[1];
int jb4 = (PP->shape[1] * (j + 1)) / nxyz[1] - 1;
int kb2 = (PP->shape[2] * k) / nxyz[2];
int kb5 = (PP->shape[2] * (k + 1)) / nxyz[2] - 1;
Block *split_first_block = nullptr;
Block *split_last_block = nullptr;
splitHotspotBlock(BlL, dim, ib0, ib3, jb1, jb4, kb2, kb5,
PP, r_l, r_r, ingfsi, fngfsi, periodic,
split_first_block, split_last_block);
current_ng_start = split_first_block;
ng = split_last_block;
}
else
{
ibbox_here[0] = (PP->shape[0] * i) / nxyz[0];
ibbox_here[3] = (PP->shape[0] * (i + 1)) / nxyz[0] - 1;
ibbox_here[1] = (PP->shape[1] * j) / nxyz[1];
ibbox_here[4] = (PP->shape[1] * (j + 1)) / nxyz[1] - 1;
ibbox_here[2] = (PP->shape[2] * k) / nxyz[2];
ibbox_here[5] = (PP->shape[2] * (k + 1)) / nxyz[2] - 1;
if (periodic) {
for(int d=0; d<3; d++) {
ibbox_here[d] -= ghost_width;
ibbox_here[d+3] += ghost_width;
}
} else {
ibbox_here[0] = Mymax(0, ibbox_here[0] - ghost_width);
ibbox_here[3] = Mymin(PP->shape[0] - 1, ibbox_here[3] + ghost_width);
ibbox_here[1] = Mymax(0, ibbox_here[1] - ghost_width);
ibbox_here[4] = Mymin(PP->shape[1] - 1, ibbox_here[4] + ghost_width);
ibbox_here[2] = Mymax(0, ibbox_here[2] - ghost_width);
ibbox_here[5] = Mymin(PP->shape[2] - 1, ibbox_here[5] + ghost_width);
}
for(int d=0; d<3; d++) shape_here[d] = ibbox_here[d+3] - ibbox_here[d] + 1;
#ifdef Vertex
#ifdef Cell
#error Both Cell and Vertex are defined
#endif
dd = (PP->bbox[3] - PP->bbox[0]) / (PP->shape[0] - 1);
bbox_here[0] = PP->bbox[0] + ibbox_here[0] * dd;
bbox_here[3] = PP->bbox[0] + ibbox_here[3] * dd;
dd = (PP->bbox[4] - PP->bbox[1]) / (PP->shape[1] - 1);
bbox_here[1] = PP->bbox[1] + ibbox_here[1] * dd;
bbox_here[4] = PP->bbox[1] + ibbox_here[4] * dd;
dd = (PP->bbox[5] - PP->bbox[2]) / (PP->shape[2] - 1);
bbox_here[2] = PP->bbox[2] + ibbox_here[2] * dd;
bbox_here[5] = PP->bbox[2] + ibbox_here[5] * dd;
#else
#ifdef Cell
dd = (PP->bbox[3] - PP->bbox[0]) / PP->shape[0];
bbox_here[0] = PP->bbox[0] + (ibbox_here[0]) * dd;
bbox_here[3] = PP->bbox[0] + (ibbox_here[3] + 1) * dd;
dd = (PP->bbox[4] - PP->bbox[1]) / PP->shape[1];
bbox_here[1] = PP->bbox[1] + (ibbox_here[1]) * dd;
bbox_here[4] = PP->bbox[1] + (ibbox_here[4] + 1) * dd;
dd = (PP->bbox[5] - PP->bbox[2]) / PP->shape[2];
bbox_here[2] = PP->bbox[2] + (ibbox_here[2]) * dd;
bbox_here[5] = PP->bbox[2] + (ibbox_here[5] + 1) * dd;
#else
#error Not define Vertex nor Cell
#endif
#endif
ng = createMappedBlock(BlL, dim, shape_here, bbox_here,
current_block_id, ingfsi, fngfsi, PP->lev);
current_ng_start = ng;
}
if (first_block_in_patch) {
ng0 = current_ng_start;
MyList<Block> *Bp_start = BlL;
while (Bp_start && Bp_start->data != ng0) Bp_start = Bp_start->next;
PP->blb = Bp_start;
first_block_in_patch = false;
}
current_block_id++;
}
{
MyList<Block> *Bp_end = BlL;
while (Bp_end && Bp_end->data != ng) Bp_end = Bp_end->next;
PP->ble = Bp_end;
}
PLi = PLi->next;
}
if (reacpu < nodes * 2 / 3)
{
int myrank;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0)
cout << "Parallel::distribute CAUSTION: level#" << lev << " uses essencially " << reacpu << " processors vs " << nodes << " nodes run, your scientific computation scale is not as large as you estimate." << endl;
}
return BlL;
}
Block* Parallel::splitHotspotBlock(MyList<Block>* &BlL, int _dim,
int ib0_orig, int ib3_orig,
int jb1_orig, int jb4_orig,
int kb2_orig, int kb5_orig,
Patch* PP, int r_left, int r_right,
int ingfsi, int fngfsi, bool periodic,
Block* &split_first_block, Block* &split_last_block)
{
int mid = (ib0_orig + ib3_orig) / 2;
int indices_L[6] = {ib0_orig, jb1_orig, kb2_orig, mid, jb4_orig, kb5_orig};
int indices_R[6] = {mid + 1, jb1_orig, kb2_orig, ib3_orig, jb4_orig, kb5_orig};
auto createSubBlock = [&](int* ib_raw, int target_rank) {
int ib_final[6];
int sh_here[3];
double bb_here[6], dd;
if (periodic) {
ib_final[0] = ib_raw[0] - ghost_width;
ib_final[3] = ib_raw[3] + ghost_width;
ib_final[1] = ib_raw[1] - ghost_width;
ib_final[4] = ib_raw[4] + ghost_width;
ib_final[2] = ib_raw[2] - ghost_width;
ib_final[5] = ib_raw[5] + ghost_width;
} else {
ib_final[0] = Mymax(0, ib_raw[0] - ghost_width);
ib_final[3] = Mymin(PP->shape[0] - 1, ib_raw[3] + ghost_width);
ib_final[1] = Mymax(0, ib_raw[1] - ghost_width);
ib_final[4] = Mymin(PP->shape[1] - 1, ib_raw[4] + ghost_width);
ib_final[2] = Mymax(0, ib_raw[2] - ghost_width);
ib_final[5] = Mymin(PP->shape[2] - 1, ib_raw[5] + ghost_width);
}
sh_here[0] = ib_final[3] - ib_final[0] + 1;
sh_here[1] = ib_final[4] - ib_final[1] + 1;
sh_here[2] = ib_final[5] - ib_final[2] + 1;
#ifdef Vertex
dd = (PP->bbox[3] - PP->bbox[0]) / (PP->shape[0] - 1);
bb_here[0] = PP->bbox[0] + ib_final[0] * dd;
bb_here[3] = PP->bbox[0] + ib_final[3] * dd;
dd = (PP->bbox[4] - PP->bbox[1]) / (PP->shape[1] - 1);
bb_here[1] = PP->bbox[1] + ib_final[1] * dd;
bb_here[4] = PP->bbox[1] + ib_final[4] * dd;
dd = (PP->bbox[5] - PP->bbox[2]) / (PP->shape[2] - 1);
bb_here[2] = PP->bbox[2] + ib_final[2] * dd;
bb_here[5] = PP->bbox[2] + ib_final[5] * dd;
#else
#ifdef Cell
dd = (PP->bbox[3] - PP->bbox[0]) / PP->shape[0];
bb_here[0] = PP->bbox[0] + ib_final[0] * dd;
bb_here[3] = PP->bbox[0] + (ib_final[3] + 1) * dd;
dd = (PP->bbox[4] - PP->bbox[1]) / PP->shape[1];
bb_here[1] = PP->bbox[1] + ib_final[1] * dd;
bb_here[4] = PP->bbox[1] + (ib_final[4] + 1) * dd;
dd = (PP->bbox[5] - PP->bbox[2]) / PP->shape[2];
bb_here[2] = PP->bbox[2] + ib_final[2] * dd;
bb_here[5] = PP->bbox[2] + (ib_final[5] + 1) * dd;
#endif
#endif
Block* Bg = new Block(dim, sh_here, bb_here, target_rank, ingfsi, fngfsi, PP->lev);
if (BlL) BlL->insert(Bg);
else BlL = new MyList<Block>(Bg);
return Bg;
};
split_first_block = createSubBlock(indices_L, r_left);
split_last_block = createSubBlock(indices_R, r_right);
return split_last_block;
}
Block* Parallel::createMappedBlock(MyList<Block>* &BlL, int _dim, int* shape, double* bbox,
int block_id, int ingfsi, int fngfsi, int lev)
{
int target_rank = block_id;
if (INTERP_LB_NPROCS > 0) {
for (int ri = 0; ri < interp_lb_num_remaps; ri++) {
if (block_id == interp_lb_remaps[ri][0]) {
target_rank = interp_lb_remaps[ri][1];
break;
}
}
}
Block* ng = new Block(dim, shape, bbox, target_rank, ingfsi, fngfsi, lev);
if (BlL) BlL->insert(ng);
else BlL = new MyList<Block>(ng);
return ng;
}
#else
// When INTERP_LB_OPTIMIZE is not defined, distribute_optimize falls back to distribute
MyList<Block> *Parallel::distribute_optimize(MyList<Patch> *PatchLIST, int cpusize, int ingfsi, int fngfsi,
bool periodic, int nodes)
{
return distribute(PatchLIST, cpusize, ingfsi, fngfsi, periodic, nodes);
}
Block* Parallel::splitHotspotBlock(MyList<Block>* &BlL, int _dim,
int ib0_orig, int ib3_orig,
int jb1_orig, int jb4_orig,
int kb2_orig, int kb5_orig,
Patch* PP, int r_left, int r_right,
int ingfsi, int fngfsi, bool periodic,
Block* &split_first_block, Block* &split_last_block)
{ return nullptr; }
Block* Parallel::createMappedBlock(MyList<Block>* &BlL, int _dim, int* shape, double* bbox,
int block_id, int ingfsi, int fngfsi, int lev)
{ return nullptr; }
#endif
#elif (PSTR == 1 || PSTR == 2 || PSTR == 3) #elif (PSTR == 1 || PSTR == 2 || PSTR == 3)
MyList<Block> *Parallel::distribute(MyList<Patch> *PatchLIST, int cpusize, int ingfsi, int fngfsi, MyList<Block> *Parallel::distribute(MyList<Patch> *PatchLIST, int cpusize, int ingfsi, int fngfsi,
bool periodic, int start_rank, int end_rank, int nodes) bool periodic, int start_rank, int end_rank, int nodes)
@@ -3853,7 +4231,8 @@ void Parallel::Sync_merged(MyList<Patch> *PatL, MyList<var> *VarList, int Symmet
Parallel::SyncCache::SyncCache() Parallel::SyncCache::SyncCache()
: valid(false), cpusize(0), combined_src(0), combined_dst(0), : valid(false), cpusize(0), combined_src(0), combined_dst(0),
send_lengths(0), recv_lengths(0), send_bufs(0), recv_bufs(0), send_lengths(0), recv_lengths(0), send_bufs(0), recv_bufs(0),
send_buf_caps(0), recv_buf_caps(0), reqs(0), stats(0), max_reqs(0) send_buf_caps(0), recv_buf_caps(0), reqs(0), stats(0), max_reqs(0),
lengths_valid(false)
{ {
} }
// SyncCache invalidate: free grid segment lists but keep buffers // SyncCache invalidate: free grid segment lists but keep buffers
@@ -3871,6 +4250,7 @@ void Parallel::SyncCache::invalidate()
send_lengths[i] = recv_lengths[i] = 0; send_lengths[i] = recv_lengths[i] = 0;
} }
valid = false; valid = false;
lengths_valid = false;
} }
// SyncCache destroy: free everything // SyncCache destroy: free everything
void Parallel::SyncCache::destroy() void Parallel::SyncCache::destroy()
@@ -4172,8 +4552,13 @@ void Parallel::Sync_start(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetr
{ {
if (node == myrank) if (node == myrank)
{ {
int length = data_packer(0, src[myrank], dst[myrank], node, PACK, VarList, VarList, Symmetry); int length;
cache.recv_lengths[node] = length; if (!cache.lengths_valid) {
length = data_packer(0, src[myrank], dst[myrank], node, PACK, VarList, VarList, Symmetry);
cache.recv_lengths[node] = length;
} else {
length = cache.recv_lengths[node];
}
if (length > 0) if (length > 0)
{ {
if (length > cache.recv_buf_caps[node]) if (length > cache.recv_buf_caps[node])
@@ -4187,8 +4572,13 @@ void Parallel::Sync_start(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetr
} }
else else
{ {
int slength = data_packer(0, src[myrank], dst[myrank], node, PACK, VarList, VarList, Symmetry); int slength;
cache.send_lengths[node] = slength; if (!cache.lengths_valid) {
slength = data_packer(0, src[myrank], dst[myrank], node, PACK, VarList, VarList, Symmetry);
cache.send_lengths[node] = slength;
} else {
slength = cache.send_lengths[node];
}
if (slength > 0) if (slength > 0)
{ {
if (slength > cache.send_buf_caps[node]) if (slength > cache.send_buf_caps[node])
@@ -4200,8 +4590,13 @@ void Parallel::Sync_start(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetr
data_packer(cache.send_bufs[node], src[myrank], dst[myrank], node, PACK, VarList, VarList, Symmetry); data_packer(cache.send_bufs[node], src[myrank], dst[myrank], node, PACK, VarList, VarList, Symmetry);
MPI_Isend((void *)cache.send_bufs[node], slength, MPI_DOUBLE, node, 2, MPI_COMM_WORLD, cache.reqs + state.req_no++); MPI_Isend((void *)cache.send_bufs[node], slength, MPI_DOUBLE, node, 2, MPI_COMM_WORLD, cache.reqs + state.req_no++);
} }
int rlength = data_packer(0, src[node], dst[node], node, UNPACK, VarList, VarList, Symmetry); int rlength;
cache.recv_lengths[node] = rlength; if (!cache.lengths_valid) {
rlength = data_packer(0, src[node], dst[node], node, UNPACK, VarList, VarList, Symmetry);
cache.recv_lengths[node] = rlength;
} else {
rlength = cache.recv_lengths[node];
}
if (rlength > 0) if (rlength > 0)
{ {
if (rlength > cache.recv_buf_caps[node]) if (rlength > cache.recv_buf_caps[node])
@@ -4214,6 +4609,7 @@ void Parallel::Sync_start(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetr
} }
} }
} }
cache.lengths_valid = true;
} }
// Sync_finish: wait for async MPI operations and unpack // Sync_finish: wait for async MPI operations and unpack
void Parallel::Sync_finish(SyncCache &cache, AsyncSyncState &state, void Parallel::Sync_finish(SyncCache &cache, AsyncSyncState &state,
@@ -5268,6 +5664,203 @@ void Parallel::OutBdLow2Himix(MyList<Patch> *PatcL, MyList<Patch> *PatfL,
delete[] transfer_src; delete[] transfer_src;
delete[] transfer_dst; delete[] transfer_dst;
} }
// Restrict_cached: cache grid segment lists, reuse buffers via transfer_cached
void Parallel::Restrict_cached(MyList<Patch> *PatcL, MyList<Patch> *PatfL,
MyList<var> *VarList1, MyList<var> *VarList2,
int Symmetry, SyncCache &cache)
{
if (!cache.valid)
{
int cpusize;
MPI_Comm_size(MPI_COMM_WORLD, &cpusize);
cache.cpusize = cpusize;
if (!cache.combined_src)
{
cache.combined_src = new MyList<Parallel::gridseg> *[cpusize];
cache.combined_dst = new MyList<Parallel::gridseg> *[cpusize];
cache.send_lengths = new int[cpusize];
cache.recv_lengths = new int[cpusize];
cache.send_bufs = new double *[cpusize];
cache.recv_bufs = new double *[cpusize];
cache.send_buf_caps = new int[cpusize];
cache.recv_buf_caps = new int[cpusize];
for (int i = 0; i < cpusize; i++)
{
cache.send_bufs[i] = cache.recv_bufs[i] = 0;
cache.send_buf_caps[i] = cache.recv_buf_caps[i] = 0;
}
cache.max_reqs = 2 * cpusize;
cache.reqs = new MPI_Request[cache.max_reqs];
cache.stats = new MPI_Status[cache.max_reqs];
}
MyList<Parallel::gridseg> *dst = build_complete_gsl(PatcL);
for (int node = 0; node < cpusize; node++)
{
MyList<Parallel::gridseg> *src_owned = build_owned_gsl(PatfL, node, 2, Symmetry);
build_gstl(src_owned, dst, &cache.combined_src[node], &cache.combined_dst[node]);
if (src_owned) src_owned->destroyList();
}
if (dst) dst->destroyList();
cache.valid = true;
}
transfer_cached(cache.combined_src, cache.combined_dst, VarList1, VarList2, Symmetry, cache);
}
// OutBdLow2Hi_cached: cache grid segment lists, reuse buffers via transfer_cached
void Parallel::OutBdLow2Hi_cached(MyList<Patch> *PatcL, MyList<Patch> *PatfL,
MyList<var> *VarList1, MyList<var> *VarList2,
int Symmetry, SyncCache &cache)
{
if (!cache.valid)
{
int cpusize;
MPI_Comm_size(MPI_COMM_WORLD, &cpusize);
cache.cpusize = cpusize;
if (!cache.combined_src)
{
cache.combined_src = new MyList<Parallel::gridseg> *[cpusize];
cache.combined_dst = new MyList<Parallel::gridseg> *[cpusize];
cache.send_lengths = new int[cpusize];
cache.recv_lengths = new int[cpusize];
cache.send_bufs = new double *[cpusize];
cache.recv_bufs = new double *[cpusize];
cache.send_buf_caps = new int[cpusize];
cache.recv_buf_caps = new int[cpusize];
for (int i = 0; i < cpusize; i++)
{
cache.send_bufs[i] = cache.recv_bufs[i] = 0;
cache.send_buf_caps[i] = cache.recv_buf_caps[i] = 0;
}
cache.max_reqs = 2 * cpusize;
cache.reqs = new MPI_Request[cache.max_reqs];
cache.stats = new MPI_Status[cache.max_reqs];
}
MyList<Parallel::gridseg> *dst = build_buffer_gsl(PatfL);
for (int node = 0; node < cpusize; node++)
{
MyList<Parallel::gridseg> *src_owned = build_owned_gsl(PatcL, node, 4, Symmetry);
build_gstl(src_owned, dst, &cache.combined_src[node], &cache.combined_dst[node]);
if (src_owned) src_owned->destroyList();
}
if (dst) dst->destroyList();
cache.valid = true;
}
transfer_cached(cache.combined_src, cache.combined_dst, VarList1, VarList2, Symmetry, cache);
}
// OutBdLow2Himix_cached: same as OutBdLow2Hi_cached but uses transfermix for unpacking
void Parallel::OutBdLow2Himix_cached(MyList<Patch> *PatcL, MyList<Patch> *PatfL,
MyList<var> *VarList1, MyList<var> *VarList2,
int Symmetry, SyncCache &cache)
{
if (!cache.valid)
{
int cpusize;
MPI_Comm_size(MPI_COMM_WORLD, &cpusize);
cache.cpusize = cpusize;
if (!cache.combined_src)
{
cache.combined_src = new MyList<Parallel::gridseg> *[cpusize];
cache.combined_dst = new MyList<Parallel::gridseg> *[cpusize];
cache.send_lengths = new int[cpusize];
cache.recv_lengths = new int[cpusize];
cache.send_bufs = new double *[cpusize];
cache.recv_bufs = new double *[cpusize];
cache.send_buf_caps = new int[cpusize];
cache.recv_buf_caps = new int[cpusize];
for (int i = 0; i < cpusize; i++)
{
cache.send_bufs[i] = cache.recv_bufs[i] = 0;
cache.send_buf_caps[i] = cache.recv_buf_caps[i] = 0;
}
cache.max_reqs = 2 * cpusize;
cache.reqs = new MPI_Request[cache.max_reqs];
cache.stats = new MPI_Status[cache.max_reqs];
}
MyList<Parallel::gridseg> *dst = build_buffer_gsl(PatfL);
for (int node = 0; node < cpusize; node++)
{
MyList<Parallel::gridseg> *src_owned = build_owned_gsl(PatcL, node, 4, Symmetry);
build_gstl(src_owned, dst, &cache.combined_src[node], &cache.combined_dst[node]);
if (src_owned) src_owned->destroyList();
}
if (dst) dst->destroyList();
cache.valid = true;
}
// Use transfermix instead of transfer for mix-mode interpolation
int myrank;
MPI_Comm_size(MPI_COMM_WORLD, &cache.cpusize);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
int cpusize = cache.cpusize;
int req_no = 0;
for (int node = 0; node < cpusize; node++)
{
if (node == myrank)
{
int length = data_packermix(0, cache.combined_src[myrank], cache.combined_dst[myrank], node, PACK, VarList1, VarList2, Symmetry);
cache.recv_lengths[node] = length;
if (length > 0)
{
if (length > cache.recv_buf_caps[node])
{
if (cache.recv_bufs[node]) delete[] cache.recv_bufs[node];
cache.recv_bufs[node] = new double[length];
cache.recv_buf_caps[node] = length;
}
data_packermix(cache.recv_bufs[node], cache.combined_src[myrank], cache.combined_dst[myrank], node, PACK, VarList1, VarList2, Symmetry);
}
}
else
{
int slength = data_packermix(0, cache.combined_src[myrank], cache.combined_dst[myrank], node, PACK, VarList1, VarList2, Symmetry);
cache.send_lengths[node] = slength;
if (slength > 0)
{
if (slength > cache.send_buf_caps[node])
{
if (cache.send_bufs[node]) delete[] cache.send_bufs[node];
cache.send_bufs[node] = new double[slength];
cache.send_buf_caps[node] = slength;
}
data_packermix(cache.send_bufs[node], cache.combined_src[myrank], cache.combined_dst[myrank], node, PACK, VarList1, VarList2, Symmetry);
MPI_Isend((void *)cache.send_bufs[node], slength, MPI_DOUBLE, node, 1, MPI_COMM_WORLD, cache.reqs + req_no++);
}
int rlength = data_packermix(0, cache.combined_src[node], cache.combined_dst[node], node, UNPACK, VarList1, VarList2, Symmetry);
cache.recv_lengths[node] = rlength;
if (rlength > 0)
{
if (rlength > cache.recv_buf_caps[node])
{
if (cache.recv_bufs[node]) delete[] cache.recv_bufs[node];
cache.recv_bufs[node] = new double[rlength];
cache.recv_buf_caps[node] = rlength;
}
MPI_Irecv((void *)cache.recv_bufs[node], rlength, MPI_DOUBLE, node, 1, MPI_COMM_WORLD, cache.reqs + req_no++);
}
}
}
MPI_Waitall(req_no, cache.reqs, cache.stats);
for (int node = 0; node < cpusize; node++)
if (cache.recv_bufs[node] && cache.recv_lengths[node] > 0)
data_packermix(cache.recv_bufs[node], cache.combined_src[node], cache.combined_dst[node], node, UNPACK, VarList1, VarList2, Symmetry);
}
// collect all buffer grid segments or blocks for given patch // collect all buffer grid segments or blocks for given patch
MyList<Parallel::gridseg> *Parallel::build_buffer_gsl(Patch *Pat) MyList<Parallel::gridseg> *Parallel::build_buffer_gsl(Patch *Pat)
{ {

View File

@@ -32,6 +32,16 @@ namespace Parallel
int partition2(int *nxy, int split_size, int *min_width, int cpusize, int *shape); // special for 2 diemnsions int partition2(int *nxy, int split_size, int *min_width, int cpusize, int *shape); // special for 2 diemnsions
int partition3(int *nxyz, int split_size, int *min_width, int cpusize, int *shape); int partition3(int *nxyz, int split_size, int *min_width, int cpusize, int *shape);
MyList<Block> *distribute(MyList<Patch> *PatchLIST, int cpusize, int ingfsi, int fngfs, bool periodic, int nodes = 0); // produce corresponding Blocks MyList<Block> *distribute(MyList<Patch> *PatchLIST, int cpusize, int ingfsi, int fngfs, bool periodic, int nodes = 0); // produce corresponding Blocks
MyList<Block> *distribute_optimize(MyList<Patch> *PatchLIST, int cpusize, int ingfsi, int fngfs, bool periodic, int nodes = 0);
Block* splitHotspotBlock(MyList<Block>* &BlL, int _dim,
int ib0_orig, int ib3_orig,
int jb1_orig, int jb4_orig,
int kb2_orig, int kb5_orig,
Patch* PP, int r_left, int r_right,
int ingfsi, int fngfsi, bool periodic,
Block* &split_first_block, Block* &split_last_block);
Block* createMappedBlock(MyList<Block>* &BlL, int _dim, int* shape, double* bbox,
int block_id, int ingfsi, int fngfsi, int lev);
void KillBlocks(MyList<Patch> *PatchLIST); void KillBlocks(MyList<Patch> *PatchLIST);
void setfunction(MyList<Block> *BlL, var *vn, double func(double x, double y, double z)); void setfunction(MyList<Block> *BlL, var *vn, double func(double x, double y, double z));
@@ -97,6 +107,7 @@ namespace Parallel
MPI_Request *reqs; MPI_Request *reqs;
MPI_Status *stats; MPI_Status *stats;
int max_reqs; int max_reqs;
bool lengths_valid;
SyncCache(); SyncCache();
void invalidate(); void invalidate();
void destroy(); void destroy();
@@ -129,6 +140,15 @@ namespace Parallel
void OutBdLow2Himix(MyList<Patch> *PatcL, MyList<Patch> *PatfL, void OutBdLow2Himix(MyList<Patch> *PatcL, MyList<Patch> *PatfL,
MyList<var> *VarList1 /* source */, MyList<var> *VarList2 /* target */, MyList<var> *VarList1 /* source */, MyList<var> *VarList2 /* target */,
int Symmetry); int Symmetry);
void Restrict_cached(MyList<Patch> *PatcL, MyList<Patch> *PatfL,
MyList<var> *VarList1, MyList<var> *VarList2,
int Symmetry, SyncCache &cache);
void OutBdLow2Hi_cached(MyList<Patch> *PatcL, MyList<Patch> *PatfL,
MyList<var> *VarList1, MyList<var> *VarList2,
int Symmetry, SyncCache &cache);
void OutBdLow2Himix_cached(MyList<Patch> *PatcL, MyList<Patch> *PatfL,
MyList<var> *VarList1, MyList<var> *VarList2,
int Symmetry, SyncCache &cache);
void Prolong(Patch *Patc, Patch *Patf, void Prolong(Patch *Patc, Patch *Patf,
MyList<var> *VarList1 /* source */, MyList<var> *VarList2 /* target */, MyList<var> *VarList1 /* source */, MyList<var> *VarList2 /* target */,
int Symmetry); int Symmetry);

View File

@@ -2426,9 +2426,9 @@ void bssn_class::RecursiveStep(int lev)
#endif #endif
#if (REGLEV == 0) #if (REGLEV == 0)
GH->Regrid_Onelevel(lev, Symmetry, BH_num, Porgbr, Porg0, if (GH->Regrid_Onelevel(lev, Symmetry, BH_num, Porgbr, Porg0,
SynchList_cor, OldStateList, StateList, SynchList_pre, SynchList_cor, OldStateList, StateList, SynchList_pre,
fgt(PhysTime - dT_lev, StartTime, dT_lev / 2), ErrorMonitor); fgt(PhysTime - dT_lev, StartTime, dT_lev / 2), ErrorMonitor))
for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); } for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); }
#endif #endif
} }
@@ -2605,9 +2605,9 @@ void bssn_class::ParallelStep()
delete[] tporg; delete[] tporg;
delete[] tporgo; delete[] tporgo;
#if (REGLEV == 0) #if (REGLEV == 0)
GH->Regrid_Onelevel(GH->mylev, Symmetry, BH_num, Porgbr, Porg0, if (GH->Regrid_Onelevel(GH->mylev, Symmetry, BH_num, Porgbr, Porg0,
SynchList_cor, OldStateList, StateList, SynchList_pre, SynchList_cor, OldStateList, StateList, SynchList_pre,
fgt(PhysTime - dT_lev, StartTime, dT_lev / 2), ErrorMonitor); fgt(PhysTime - dT_lev, StartTime, dT_lev / 2), ErrorMonitor))
for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); } for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); }
#endif #endif
} }
@@ -2772,9 +2772,9 @@ void bssn_class::ParallelStep()
if (lev + 1 >= GH->movls) if (lev + 1 >= GH->movls)
{ {
// GH->Regrid_Onelevel_aux(lev,Symmetry,BH_num,Porgbr,Porg0, // GH->Regrid_Onelevel_aux(lev,Symmetry,BH_num,Porgbr,Porg0,
GH->Regrid_Onelevel(lev + 1, Symmetry, BH_num, Porgbr, Porg0, if (GH->Regrid_Onelevel(lev + 1, Symmetry, BH_num, Porgbr, Porg0,
SynchList_cor, OldStateList, StateList, SynchList_pre, SynchList_cor, OldStateList, StateList, SynchList_pre,
fgt(PhysTime - dT_levp1, StartTime, dT_levp1 / 2), ErrorMonitor); fgt(PhysTime - dT_levp1, StartTime, dT_levp1 / 2), ErrorMonitor))
for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); } for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); }
// a_stream.clear(); // a_stream.clear();
@@ -2787,9 +2787,9 @@ void bssn_class::ParallelStep()
// for this level // for this level
if (YN == 1) if (YN == 1)
{ {
GH->Regrid_Onelevel(lev, Symmetry, BH_num, Porgbr, Porg0, if (GH->Regrid_Onelevel(lev, Symmetry, BH_num, Porgbr, Porg0,
SynchList_cor, OldStateList, StateList, SynchList_pre, SynchList_cor, OldStateList, StateList, SynchList_pre,
fgt(PhysTime - dT_lev, StartTime, dT_lev / 2), ErrorMonitor); fgt(PhysTime - dT_lev, StartTime, dT_lev / 2), ErrorMonitor))
for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); } for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); }
// a_stream.clear(); // a_stream.clear();
@@ -2806,9 +2806,9 @@ void bssn_class::ParallelStep()
if (YN == 1) if (YN == 1)
{ {
// GH->Regrid_Onelevel_aux(lev-2,Symmetry,BH_num,Porgbr,Porg0, // GH->Regrid_Onelevel_aux(lev-2,Symmetry,BH_num,Porgbr,Porg0,
GH->Regrid_Onelevel(lev - 1, Symmetry, BH_num, Porgbr, Porg0, if (GH->Regrid_Onelevel(lev - 1, Symmetry, BH_num, Porgbr, Porg0,
SynchList_cor, OldStateList, StateList, SynchList_pre, SynchList_cor, OldStateList, StateList, SynchList_pre,
fgt(PhysTime - dT_lev, StartTime, dT_levm1 / 2), ErrorMonitor); fgt(PhysTime - dT_lev, StartTime, dT_levm1 / 2), ErrorMonitor))
for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); } for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); }
// a_stream.clear(); // a_stream.clear();
@@ -2822,9 +2822,9 @@ void bssn_class::ParallelStep()
if (i % 4 == 3) if (i % 4 == 3)
{ {
// GH->Regrid_Onelevel_aux(lev-2,Symmetry,BH_num,Porgbr,Porg0, // GH->Regrid_Onelevel_aux(lev-2,Symmetry,BH_num,Porgbr,Porg0,
GH->Regrid_Onelevel(lev - 1, Symmetry, BH_num, Porgbr, Porg0, if (GH->Regrid_Onelevel(lev - 1, Symmetry, BH_num, Porgbr, Porg0,
SynchList_cor, OldStateList, StateList, SynchList_pre, SynchList_cor, OldStateList, StateList, SynchList_pre,
fgt(PhysTime - dT_lev, StartTime, dT_levm1 / 2), ErrorMonitor); fgt(PhysTime - dT_lev, StartTime, dT_levm1 / 2), ErrorMonitor))
for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); } for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); }
// a_stream.clear(); // a_stream.clear();
@@ -5819,21 +5819,11 @@ void bssn_class::RestrictProlong(int lev, int YN, bool BB,
#endif #endif
#if (RPB == 0) #if (RPB == 0)
Ppc = GH->PatL[lev - 1];
while (Ppc)
{
Pp = GH->PatL[lev];
while (Pp)
{
#if (MIXOUTB == 0) #if (MIXOUTB == 0)
Parallel::OutBdLow2Hi(Ppc->data, Pp->data, SynchList_pre, SL, Symmetry); Parallel::OutBdLow2Hi(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SL, Symmetry);
#elif (MIXOUTB == 1) #elif (MIXOUTB == 1)
Parallel::OutBdLow2Himix(Ppc->data, Pp->data, SynchList_pre, SL, Symmetry); Parallel::OutBdLow2Himix(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SL, Symmetry);
#endif #endif
Pp = Pp->next;
}
Ppc = Ppc->next;
}
#elif (RPB == 1) #elif (RPB == 1)
// Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],SynchList_pre,SL,Symmetry); // Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],SynchList_pre,SL,Symmetry);
Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SL, GH->bdsul[lev], Symmetry); Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SL, GH->bdsul[lev], Symmetry);
@@ -5880,21 +5870,11 @@ void bssn_class::RestrictProlong(int lev, int YN, bool BB,
#endif #endif
#if (RPB == 0) #if (RPB == 0)
Ppc = GH->PatL[lev - 1];
while (Ppc)
{
Pp = GH->PatL[lev];
while (Pp)
{
#if (MIXOUTB == 0) #if (MIXOUTB == 0)
Parallel::OutBdLow2Hi(Ppc->data, Pp->data, SL, SL, Symmetry); Parallel::OutBdLow2Hi(GH->PatL[lev - 1], GH->PatL[lev], SL, SL, Symmetry);
#elif (MIXOUTB == 1) #elif (MIXOUTB == 1)
Parallel::OutBdLow2Himix(Ppc->data, Pp->data, SL, SL, Symmetry); Parallel::OutBdLow2Himix(GH->PatL[lev - 1], GH->PatL[lev], SL, SL, Symmetry);
#endif #endif
Pp = Pp->next;
}
Ppc = Ppc->next;
}
#elif (RPB == 1) #elif (RPB == 1)
// Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],SL,SL,Symmetry); // Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],SL,SL,Symmetry);
Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], SL, SL, GH->bdsul[lev], Symmetry); Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], SL, SL, GH->bdsul[lev], Symmetry);
@@ -5969,21 +5949,11 @@ void bssn_class::RestrictProlong_aux(int lev, int YN, bool BB,
Parallel::Sync_cached(GH->PatL[lev - 1], SynchList_pre, Symmetry, sync_cache_rp_coarse[lev]); Parallel::Sync_cached(GH->PatL[lev - 1], SynchList_pre, Symmetry, sync_cache_rp_coarse[lev]);
#if (RPB == 0) #if (RPB == 0)
Ppc = GH->PatL[lev - 1];
while (Ppc)
{
Pp = GH->PatL[lev];
while (Pp)
{
#if (MIXOUTB == 0) #if (MIXOUTB == 0)
Parallel::OutBdLow2Hi(Ppc->data, Pp->data, SynchList_pre, SL, Symmetry); Parallel::OutBdLow2Hi(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SL, Symmetry);
#elif (MIXOUTB == 1) #elif (MIXOUTB == 1)
Parallel::OutBdLow2Himix(Ppc->data, Pp->data, SynchList_pre, SL, Symmetry); Parallel::OutBdLow2Himix(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SL, Symmetry);
#endif #endif
Pp = Pp->next;
}
Ppc = Ppc->next;
}
#elif (RPB == 1) #elif (RPB == 1)
// Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],SynchList_pre,SL,Symmetry); // Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],SynchList_pre,SL,Symmetry);
Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SL, GH->bdsul[lev], Symmetry); Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SL, GH->bdsul[lev], Symmetry);
@@ -6001,21 +5971,11 @@ void bssn_class::RestrictProlong_aux(int lev, int YN, bool BB,
Parallel::Sync_cached(GH->PatL[lev - 1], SL, Symmetry, sync_cache_rp_coarse[lev]); Parallel::Sync_cached(GH->PatL[lev - 1], SL, Symmetry, sync_cache_rp_coarse[lev]);
#if (RPB == 0) #if (RPB == 0)
Ppc = GH->PatL[lev - 1];
while (Ppc)
{
Pp = GH->PatL[lev];
while (Pp)
{
#if (MIXOUTB == 0) #if (MIXOUTB == 0)
Parallel::OutBdLow2Hi(Ppc->data, Pp->data, SL, SL, Symmetry); Parallel::OutBdLow2Hi(GH->PatL[lev - 1], GH->PatL[lev], SL, SL, Symmetry);
#elif (MIXOUTB == 1) #elif (MIXOUTB == 1)
Parallel::OutBdLow2Himix(Ppc->data, Pp->data, SL, SL, Symmetry); Parallel::OutBdLow2Himix(GH->PatL[lev - 1], GH->PatL[lev], SL, SL, Symmetry);
#endif #endif
Pp = Pp->next;
}
Ppc = Ppc->next;
}
#elif (RPB == 1) #elif (RPB == 1)
// Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],SL,SL,Symmetry); // Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],SL,SL,Symmetry);
Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], SL, SL, GH->bdsul[lev], Symmetry); Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], SL, SL, GH->bdsul[lev], Symmetry);
@@ -6076,21 +6036,11 @@ void bssn_class::RestrictProlong(int lev, int YN, bool BB)
Parallel::Sync_cached(GH->PatL[lev - 1], SynchList_pre, Symmetry, sync_cache_rp_coarse[lev]); Parallel::Sync_cached(GH->PatL[lev - 1], SynchList_pre, Symmetry, sync_cache_rp_coarse[lev]);
#if (RPB == 0) #if (RPB == 0)
Ppc = GH->PatL[lev - 1];
while (Ppc)
{
Pp = GH->PatL[lev];
while (Pp)
{
#if (MIXOUTB == 0) #if (MIXOUTB == 0)
Parallel::OutBdLow2Hi(Ppc->data, Pp->data, SynchList_pre, SynchList_cor, Symmetry); Parallel::OutBdLow2Hi(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SynchList_cor, Symmetry);
#elif (MIXOUTB == 1) #elif (MIXOUTB == 1)
Parallel::OutBdLow2Himix(Ppc->data, Pp->data, SynchList_pre, SynchList_cor, Symmetry); Parallel::OutBdLow2Himix(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SynchList_cor, Symmetry);
#endif #endif
Pp = Pp->next;
}
Ppc = Ppc->next;
}
#elif (RPB == 1) #elif (RPB == 1)
// Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],SynchList_pre,SynchList_cor,Symmetry); // Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],SynchList_pre,SynchList_cor,Symmetry);
Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SynchList_cor, GH->bdsul[lev], Symmetry); Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SynchList_cor, GH->bdsul[lev], Symmetry);
@@ -6110,21 +6060,11 @@ void bssn_class::RestrictProlong(int lev, int YN, bool BB)
Parallel::Sync_cached(GH->PatL[lev - 1], StateList, Symmetry, sync_cache_rp_coarse[lev]); Parallel::Sync_cached(GH->PatL[lev - 1], StateList, Symmetry, sync_cache_rp_coarse[lev]);
#if (RPB == 0) #if (RPB == 0)
Ppc = GH->PatL[lev - 1];
while (Ppc)
{
Pp = GH->PatL[lev];
while (Pp)
{
#if (MIXOUTB == 0) #if (MIXOUTB == 0)
Parallel::OutBdLow2Hi(Ppc->data, Pp->data, StateList, SynchList_cor, Symmetry); Parallel::OutBdLow2Hi(GH->PatL[lev - 1], GH->PatL[lev], StateList, SynchList_cor, Symmetry);
#elif (MIXOUTB == 1) #elif (MIXOUTB == 1)
Parallel::OutBdLow2Himix(Ppc->data, Pp->data, StateList, SynchList_cor, Symmetry); Parallel::OutBdLow2Himix(GH->PatL[lev - 1], GH->PatL[lev], StateList, SynchList_cor, Symmetry);
#endif #endif
Pp = Pp->next;
}
Ppc = Ppc->next;
}
#elif (RPB == 1) #elif (RPB == 1)
// Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],StateList,SynchList_cor,Symmetry); // Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],StateList,SynchList_cor,Symmetry);
Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], StateList, SynchList_cor, GH->bdsul[lev], Symmetry); Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], StateList, SynchList_cor, GH->bdsul[lev], Symmetry);
@@ -6161,21 +6101,11 @@ void bssn_class::ProlongRestrict(int lev, int YN, bool BB)
} }
#if (RPB == 0) #if (RPB == 0)
Ppc = GH->PatL[lev - 1];
while (Ppc)
{
Pp = GH->PatL[lev];
while (Pp)
{
#if (MIXOUTB == 0) #if (MIXOUTB == 0)
Parallel::OutBdLow2Hi(Ppc->data, Pp->data, SynchList_pre, SynchList_cor, Symmetry); Parallel::OutBdLow2Hi(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SynchList_cor, Symmetry);
#elif (MIXOUTB == 1) #elif (MIXOUTB == 1)
Parallel::OutBdLow2Himix(Ppc->data, Pp->data, SynchList_pre, SynchList_cor, Symmetry); Parallel::OutBdLow2Himix(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SynchList_cor, Symmetry);
#endif #endif
Pp = Pp->next;
}
Ppc = Ppc->next;
}
#elif (RPB == 1) #elif (RPB == 1)
// Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],SynchList_pre,SynchList_cor,Symmetry); // Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],SynchList_pre,SynchList_cor,Symmetry);
Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SynchList_cor, GH->bdsul[lev], Symmetry); Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SynchList_cor, GH->bdsul[lev], Symmetry);
@@ -6184,21 +6114,11 @@ void bssn_class::ProlongRestrict(int lev, int YN, bool BB)
else // no time refinement levels and for all same time levels else // no time refinement levels and for all same time levels
{ {
#if (RPB == 0) #if (RPB == 0)
Ppc = GH->PatL[lev - 1];
while (Ppc)
{
Pp = GH->PatL[lev];
while (Pp)
{
#if (MIXOUTB == 0) #if (MIXOUTB == 0)
Parallel::OutBdLow2Hi(Ppc->data, Pp->data, StateList, SynchList_cor, Symmetry); Parallel::OutBdLow2Hi(GH->PatL[lev - 1], GH->PatL[lev], StateList, SynchList_cor, Symmetry);
#elif (MIXOUTB == 1) #elif (MIXOUTB == 1)
Parallel::OutBdLow2Himix(Ppc->data, Pp->data, StateList, SynchList_cor, Symmetry); Parallel::OutBdLow2Himix(GH->PatL[lev - 1], GH->PatL[lev], StateList, SynchList_cor, Symmetry);
#endif #endif
Pp = Pp->next;
}
Ppc = Ppc->next;
}
#elif (RPB == 1) #elif (RPB == 1)
// Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],StateList,SynchList_cor,Symmetry); // Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],StateList,SynchList_cor,Symmetry);
Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], StateList, SynchList_cor, GH->bdsul[lev], Symmetry); Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], StateList, SynchList_cor, GH->bdsul[lev], Symmetry);

View File

@@ -945,103 +945,60 @@
SSA(2)=SYM SSA(2)=SYM
SSA(3)=ANTI SSA(3)=ANTI
!!!!!!!!!advection term part !!!!!!!!!advection term + Kreiss-Oliger dissipation (merged for cache efficiency)
! lopsided_kodis shares the symmetry_bd buffer between advection and
! dissipation, eliminating redundant full-grid copies. For metric variables
! gxx/gyy/gzz (=dxx/dyy/dzz+1): kodis stencil coefficients sum to zero,
! so the constant offset has no effect on dissipation.
call lopsided(ex,X,Y,Z,gxx,gxx_rhs,betax,betay,betaz,Symmetry,SSS) call lopsided_kodis(ex,X,Y,Z,gxx,gxx_rhs,betax,betay,betaz,Symmetry,SSS,eps)
call lopsided(ex,X,Y,Z,gxy,gxy_rhs,betax,betay,betaz,Symmetry,AAS) call lopsided_kodis(ex,X,Y,Z,gxy,gxy_rhs,betax,betay,betaz,Symmetry,AAS,eps)
call lopsided(ex,X,Y,Z,gxz,gxz_rhs,betax,betay,betaz,Symmetry,ASA) call lopsided_kodis(ex,X,Y,Z,gxz,gxz_rhs,betax,betay,betaz,Symmetry,ASA,eps)
call lopsided(ex,X,Y,Z,gyy,gyy_rhs,betax,betay,betaz,Symmetry,SSS) call lopsided_kodis(ex,X,Y,Z,gyy,gyy_rhs,betax,betay,betaz,Symmetry,SSS,eps)
call lopsided(ex,X,Y,Z,gyz,gyz_rhs,betax,betay,betaz,Symmetry,SAA) call lopsided_kodis(ex,X,Y,Z,gyz,gyz_rhs,betax,betay,betaz,Symmetry,SAA,eps)
call lopsided(ex,X,Y,Z,gzz,gzz_rhs,betax,betay,betaz,Symmetry,SSS) call lopsided_kodis(ex,X,Y,Z,gzz,gzz_rhs,betax,betay,betaz,Symmetry,SSS,eps)
call lopsided(ex,X,Y,Z,Axx,Axx_rhs,betax,betay,betaz,Symmetry,SSS) call lopsided_kodis(ex,X,Y,Z,Axx,Axx_rhs,betax,betay,betaz,Symmetry,SSS,eps)
call lopsided(ex,X,Y,Z,Axy,Axy_rhs,betax,betay,betaz,Symmetry,AAS) call lopsided_kodis(ex,X,Y,Z,Axy,Axy_rhs,betax,betay,betaz,Symmetry,AAS,eps)
call lopsided(ex,X,Y,Z,Axz,Axz_rhs,betax,betay,betaz,Symmetry,ASA) call lopsided_kodis(ex,X,Y,Z,Axz,Axz_rhs,betax,betay,betaz,Symmetry,ASA,eps)
call lopsided(ex,X,Y,Z,Ayy,Ayy_rhs,betax,betay,betaz,Symmetry,SSS) call lopsided_kodis(ex,X,Y,Z,Ayy,Ayy_rhs,betax,betay,betaz,Symmetry,SSS,eps)
call lopsided(ex,X,Y,Z,Ayz,Ayz_rhs,betax,betay,betaz,Symmetry,SAA) call lopsided_kodis(ex,X,Y,Z,Ayz,Ayz_rhs,betax,betay,betaz,Symmetry,SAA,eps)
call lopsided(ex,X,Y,Z,Azz,Azz_rhs,betax,betay,betaz,Symmetry,SSS) call lopsided_kodis(ex,X,Y,Z,Azz,Azz_rhs,betax,betay,betaz,Symmetry,SSS,eps)
call lopsided(ex,X,Y,Z,chi,chi_rhs,betax,betay,betaz,Symmetry,SSS) call lopsided_kodis(ex,X,Y,Z,chi,chi_rhs,betax,betay,betaz,Symmetry,SSS,eps)
call lopsided(ex,X,Y,Z,trK,trK_rhs,betax,betay,betaz,Symmetry,SSS) call lopsided_kodis(ex,X,Y,Z,trK,trK_rhs,betax,betay,betaz,Symmetry,SSS,eps)
call lopsided(ex,X,Y,Z,Gamx,Gamx_rhs,betax,betay,betaz,Symmetry,ASS) call lopsided_kodis(ex,X,Y,Z,Gamx,Gamx_rhs,betax,betay,betaz,Symmetry,ASS,eps)
call lopsided(ex,X,Y,Z,Gamy,Gamy_rhs,betax,betay,betaz,Symmetry,SAS) call lopsided_kodis(ex,X,Y,Z,Gamy,Gamy_rhs,betax,betay,betaz,Symmetry,SAS,eps)
call lopsided(ex,X,Y,Z,Gamz,Gamz_rhs,betax,betay,betaz,Symmetry,SSA) call lopsided_kodis(ex,X,Y,Z,Gamz,Gamz_rhs,betax,betay,betaz,Symmetry,SSA,eps)
!!
#if 1
!! bam does not apply dissipation on gauge variables
call lopsided_kodis(ex,X,Y,Z,Lap,Lap_rhs,betax,betay,betaz,Symmetry,SSS,eps)
#if (GAUGE == 0 || GAUGE == 1 || GAUGE == 2 || GAUGE == 3 || GAUGE == 4 || GAUGE == 5 || GAUGE == 6 || GAUGE == 7)
call lopsided_kodis(ex,X,Y,Z,betax,betax_rhs,betax,betay,betaz,Symmetry,ASS,eps)
call lopsided_kodis(ex,X,Y,Z,betay,betay_rhs,betax,betay,betaz,Symmetry,SAS,eps)
call lopsided_kodis(ex,X,Y,Z,betaz,betaz_rhs,betax,betay,betaz,Symmetry,SSA,eps)
#endif
#if (GAUGE == 0 || GAUGE == 2 || GAUGE == 3 || GAUGE == 6 || GAUGE == 7)
call lopsided_kodis(ex,X,Y,Z,dtSfx,dtSfx_rhs,betax,betay,betaz,Symmetry,ASS,eps)
call lopsided_kodis(ex,X,Y,Z,dtSfy,dtSfy_rhs,betax,betay,betaz,Symmetry,SAS,eps)
call lopsided_kodis(ex,X,Y,Z,dtSfz,dtSfz_rhs,betax,betay,betaz,Symmetry,SSA,eps)
#endif
#else
! No dissipation on gauge variables (advection only)
call lopsided(ex,X,Y,Z,Lap,Lap_rhs,betax,betay,betaz,Symmetry,SSS) call lopsided(ex,X,Y,Z,Lap,Lap_rhs,betax,betay,betaz,Symmetry,SSS)
#if (GAUGE == 0 || GAUGE == 1 || GAUGE == 2 || GAUGE == 3 || GAUGE == 4 || GAUGE == 5 || GAUGE == 6 || GAUGE == 7) #if (GAUGE == 0 || GAUGE == 1 || GAUGE == 2 || GAUGE == 3 || GAUGE == 4 || GAUGE == 5 || GAUGE == 6 || GAUGE == 7)
call lopsided(ex,X,Y,Z,betax,betax_rhs,betax,betay,betaz,Symmetry,ASS) call lopsided(ex,X,Y,Z,betax,betax_rhs,betax,betay,betaz,Symmetry,ASS)
call lopsided(ex,X,Y,Z,betay,betay_rhs,betax,betay,betaz,Symmetry,SAS) call lopsided(ex,X,Y,Z,betay,betay_rhs,betax,betay,betaz,Symmetry,SAS)
call lopsided(ex,X,Y,Z,betaz,betaz_rhs,betax,betay,betaz,Symmetry,SSA) call lopsided(ex,X,Y,Z,betaz,betaz_rhs,betax,betay,betaz,Symmetry,SSA)
#endif #endif
#if (GAUGE == 0 || GAUGE == 2 || GAUGE == 3 || GAUGE == 6 || GAUGE == 7) #if (GAUGE == 0 || GAUGE == 2 || GAUGE == 3 || GAUGE == 6 || GAUGE == 7)
call lopsided(ex,X,Y,Z,dtSfx,dtSfx_rhs,betax,betay,betaz,Symmetry,ASS) call lopsided(ex,X,Y,Z,dtSfx,dtSfx_rhs,betax,betay,betaz,Symmetry,ASS)
call lopsided(ex,X,Y,Z,dtSfy,dtSfy_rhs,betax,betay,betaz,Symmetry,SAS) call lopsided(ex,X,Y,Z,dtSfy,dtSfy_rhs,betax,betay,betaz,Symmetry,SAS)
call lopsided(ex,X,Y,Z,dtSfz,dtSfz_rhs,betax,betay,betaz,Symmetry,SSA) call lopsided(ex,X,Y,Z,dtSfz,dtSfz_rhs,betax,betay,betaz,Symmetry,SSA)
#endif #endif
if(eps>0)then
! usual Kreiss-Oliger dissipation
call kodis(ex,X,Y,Z,chi,chi_rhs,SSS,Symmetry,eps)
call kodis(ex,X,Y,Z,trK,trK_rhs,SSS,Symmetry,eps)
call kodis(ex,X,Y,Z,dxx,gxx_rhs,SSS,Symmetry,eps)
call kodis(ex,X,Y,Z,gxy,gxy_rhs,AAS,Symmetry,eps)
call kodis(ex,X,Y,Z,gxz,gxz_rhs,ASA,Symmetry,eps)
call kodis(ex,X,Y,Z,dyy,gyy_rhs,SSS,Symmetry,eps)
call kodis(ex,X,Y,Z,gyz,gyz_rhs,SAA,Symmetry,eps)
call kodis(ex,X,Y,Z,dzz,gzz_rhs,SSS,Symmetry,eps)
#if 0
#define i 42
#define j 40
#define k 40
if(Lev == 1)then
write(*,*) X(i),Y(j),Z(k)
write(*,*) "before",Axx_rhs(i,j,k)
endif
#undef i
#undef j
#undef k
!!stop
#endif #endif
call kodis(ex,X,Y,Z,Axx,Axx_rhs,SSS,Symmetry,eps)
#if 0
#define i 42
#define j 40
#define k 40
if(Lev == 1)then
write(*,*) X(i),Y(j),Z(k)
write(*,*) "after",Axx_rhs(i,j,k)
endif
#undef i
#undef j
#undef k
!!stop
#endif
call kodis(ex,X,Y,Z,Axy,Axy_rhs,AAS,Symmetry,eps)
call kodis(ex,X,Y,Z,Axz,Axz_rhs,ASA,Symmetry,eps)
call kodis(ex,X,Y,Z,Ayy,Ayy_rhs,SSS,Symmetry,eps)
call kodis(ex,X,Y,Z,Ayz,Ayz_rhs,SAA,Symmetry,eps)
call kodis(ex,X,Y,Z,Azz,Azz_rhs,SSS,Symmetry,eps)
call kodis(ex,X,Y,Z,Gamx,Gamx_rhs,ASS,Symmetry,eps)
call kodis(ex,X,Y,Z,Gamy,Gamy_rhs,SAS,Symmetry,eps)
call kodis(ex,X,Y,Z,Gamz,Gamz_rhs,SSA,Symmetry,eps)
#if 1
!! bam does not apply dissipation on gauge variables
call kodis(ex,X,Y,Z,Lap,Lap_rhs,SSS,Symmetry,eps)
call kodis(ex,X,Y,Z,betax,betax_rhs,ASS,Symmetry,eps)
call kodis(ex,X,Y,Z,betay,betay_rhs,SAS,Symmetry,eps)
call kodis(ex,X,Y,Z,betaz,betaz_rhs,SSA,Symmetry,eps)
#if (GAUGE == 0 || GAUGE == 2 || GAUGE == 3 || GAUGE == 6 || GAUGE == 7)
call kodis(ex,X,Y,Z,dtSfx,dtSfx_rhs,ASS,Symmetry,eps)
call kodis(ex,X,Y,Z,dtSfy,dtSfy_rhs,SAS,Symmetry,eps)
call kodis(ex,X,Y,Z,dtSfz,dtSfz_rhs,SSA,Symmetry,eps)
#endif
#endif
endif
if(co == 0)then if(co == 0)then
! ham_Res = trR + 2/3 * K^2 - A_ij * A^ij - 16 * PI * rho ! ham_Res = trR + 2/3 * K^2 - A_ij * A^ij - 16 * PI * rho

File diff suppressed because it is too large Load Diff

View File

@@ -130,7 +130,11 @@ void cgh::compose_cgh(int nprocs)
for (int lev = 0; lev < levels; lev++) for (int lev = 0; lev < levels; lev++)
{ {
checkPatchList(PatL[lev], false); checkPatchList(PatL[lev], false);
#ifdef INTERP_LB_OPTIMIZE
Parallel::distribute_optimize(PatL[lev], nprocs, ingfs, fngfs, false);
#else
Parallel::distribute(PatL[lev], nprocs, ingfs, fngfs, false); Parallel::distribute(PatL[lev], nprocs, ingfs, fngfs, false);
#endif
#if (RPB == 1) #if (RPB == 1)
// we need distributed box of PatL[lev] and PatL[lev-1] // we need distributed box of PatL[lev] and PatL[lev-1]
if (lev > 0) if (lev > 0)
@@ -1301,13 +1305,13 @@ bool cgh::Interp_One_Point(MyList<var> *VarList,
} }
void cgh::Regrid_Onelevel(int lev, int Symmetry, int BH_num, double **Porgbr, double **Porg0, bool cgh::Regrid_Onelevel(int lev, int Symmetry, int BH_num, double **Porgbr, double **Porg0,
MyList<var> *OldList, MyList<var> *StateList, MyList<var> *OldList, MyList<var> *StateList,
MyList<var> *FutureList, MyList<var> *tmList, bool BB, MyList<var> *FutureList, MyList<var> *tmList, bool BB,
monitor *ErrorMonitor) monitor *ErrorMonitor)
{ {
if (lev < movls) if (lev < movls)
return; return false;
#if (0) #if (0)
// #if (PSTR == 1 || PSTR == 2) // #if (PSTR == 1 || PSTR == 2)
@@ -1396,7 +1400,7 @@ void cgh::Regrid_Onelevel(int lev, int Symmetry, int BH_num, double **Porgbr, do
for (bhi = 0; bhi < BH_num; bhi++) for (bhi = 0; bhi < BH_num; bhi++)
delete[] tmpPorg[bhi]; delete[] tmpPorg[bhi];
delete[] tmpPorg; delete[] tmpPorg;
return; return false;
} }
// x direction // x direction
rr = (Porg0[bhi][0] - handle[lev][grd][0]) / dX; rr = (Porg0[bhi][0] - handle[lev][grd][0]) / dX;
@@ -1500,6 +1504,7 @@ void cgh::Regrid_Onelevel(int lev, int Symmetry, int BH_num, double **Porgbr, do
for (int bhi = 0; bhi < BH_num; bhi++) for (int bhi = 0; bhi < BH_num; bhi++)
delete[] tmpPorg[bhi]; delete[] tmpPorg[bhi];
delete[] tmpPorg; delete[] tmpPorg;
return tot_flag;
} }

View File

@@ -74,7 +74,7 @@ public:
MyList<var> *OldList, MyList<var> *StateList, MyList<var> *OldList, MyList<var> *StateList,
MyList<var> *FutureList, MyList<var> *tmList, MyList<var> *FutureList, MyList<var> *tmList,
int Symmetry, bool BB); int Symmetry, bool BB);
void Regrid_Onelevel(int lev, int Symmetry, int BH_num, double **Porgbr, double **Porg0, bool Regrid_Onelevel(int lev, int Symmetry, int BH_num, double **Porgbr, double **Porg0,
MyList<var> *OldList, MyList<var> *StateList, MyList<var> *OldList, MyList<var> *StateList,
MyList<var> *FutureList, MyList<var> *tmList, bool BB, MyList<var> *FutureList, MyList<var> *tmList, bool BB,
monitor *ErrorMonitor); monitor *ErrorMonitor);

View File

@@ -69,6 +69,8 @@
fy = ZEO fy = ZEO
fz = ZEO fz = ZEO
!DIR$ SIMD VECTORLENGTHFOR(KNOWN_INTEGER=8)
!DIR$ UNROLL PARTIAL(4)
do k=1,ex(3)-1 do k=1,ex(3)-1
do j=1,ex(2)-1 do j=1,ex(2)-1
do i=1,ex(1)-1 do i=1,ex(1)-1
@@ -371,6 +373,8 @@
fxz = ZEO fxz = ZEO
fyz = ZEO fyz = ZEO
!DIR$ SIMD VECTORLENGTHFOR(KNOWN_INTEGER=8)
!DIR$ UNROLL PARTIAL(4)
do k=1,ex(3)-1 do k=1,ex(3)-1
do j=1,ex(2)-1 do j=1,ex(2)-1
do i=1,ex(1)-1 do i=1,ex(1)-1

View File

@@ -0,0 +1,268 @@
#include "tool.h"
void fdderivs(const int ex[3],
const double *f,
double *fxx, double *fxy, double *fxz,
double *fyy, double *fyz, double *fzz,
const double *X, const double *Y, const double *Z,
double SYM1, double SYM2, double SYM3,
int Symmetry, int onoff)
{
(void)onoff;
const int NO_SYMM = 0, EQ_SYMM = 1;
const double ZEO = 0.0, ONE = 1.0, TWO = 2.0;
const double F1o4 = 2.5e-1; // 1/4
const double F8 = 8.0;
const double F16 = 16.0;
const double F30 = 30.0;
const double F1o12 = ONE / 12.0;
const double F1o144 = ONE / 144.0;
const int ex1 = ex[0], ex2 = ex[1], ex3 = ex[2];
const double dX = X[1] - X[0];
const double dY = Y[1] - Y[0];
const double dZ = Z[1] - Z[0];
const int imaxF = ex1;
const int jmaxF = ex2;
const int kmaxF = ex3;
int iminF = 1, jminF = 1, kminF = 1;
if (Symmetry > NO_SYMM && fabs(Z[0]) < dZ) kminF = -1;
if (Symmetry > EQ_SYMM && fabs(X[0]) < dX) iminF = -1;
if (Symmetry > EQ_SYMM && fabs(Y[0]) < dY) jminF = -1;
const double SoA[3] = { SYM1, SYM2, SYM3 };
/* fh: (ex1+2)*(ex2+2)*(ex3+2) because ord=2 */
const size_t nx = (size_t)ex1 + 2;
const size_t ny = (size_t)ex2 + 2;
const size_t nz = (size_t)ex3 + 2;
const size_t fh_size = nx * ny * nz;
static double *fh = NULL;
static size_t cap = 0;
if (fh_size > cap) {
free(fh);
fh = (double*)aligned_alloc(64, fh_size * sizeof(double));
cap = fh_size;
}
// double *fh = (double*)malloc(fh_size * sizeof(double));
if (!fh) return;
symmetry_bd(2, ex, f, fh, SoA);
/* 系数:按 Fortran 原式 */
const double Sdxdx = ONE / (dX * dX);
const double Sdydy = ONE / (dY * dY);
const double Sdzdz = ONE / (dZ * dZ);
const double Fdxdx = F1o12 / (dX * dX);
const double Fdydy = F1o12 / (dY * dY);
const double Fdzdz = F1o12 / (dZ * dZ);
const double Sdxdy = F1o4 / (dX * dY);
const double Sdxdz = F1o4 / (dX * dZ);
const double Sdydz = F1o4 / (dY * dZ);
const double Fdxdy = F1o144 / (dX * dY);
const double Fdxdz = F1o144 / (dX * dZ);
const double Fdydz = F1o144 / (dY * dZ);
/* 输出清零fxx,fyy,fzz,fxy,fxz,fyz = 0 */
const size_t all = (size_t)ex1 * (size_t)ex2 * (size_t)ex3;
for (size_t p = 0; p < all; ++p) {
fxx[p] = ZEO; fyy[p] = ZEO; fzz[p] = ZEO;
fxy[p] = ZEO; fxz[p] = ZEO; fyz[p] = ZEO;
}
/*
* Fortran:
* do k=1,ex3-1
* do j=1,ex2-1
* do i=1,ex1-1
*/
for (int k0 = 0; k0 <= ex3 - 2; ++k0) {
const int kF = k0 + 1;
for (int j0 = 0; j0 <= ex2 - 2; ++j0) {
const int jF = j0 + 1;
for (int i0 = 0; i0 <= ex1 - 2; ++i0) {
const int iF = i0 + 1;
const size_t p = idx_ex(i0, j0, k0, ex);
/* 高阶分支i±2,j±2,k±2 都在范围内 */
if ((iF + 2) <= imaxF && (iF - 2) >= iminF &&
(jF + 2) <= jmaxF && (jF - 2) >= jminF &&
(kF + 2) <= kmaxF && (kF - 2) >= kminF)
{
fxx[p] = Fdxdx * (
-fh[idx_fh_F_ord2(iF - 2, jF, kF, ex)] +
F16 * fh[idx_fh_F_ord2(iF - 1, jF, kF, ex)] -
F30 * fh[idx_fh_F_ord2(iF, jF, kF, ex)] -
fh[idx_fh_F_ord2(iF + 2, jF, kF, ex)] +
F16 * fh[idx_fh_F_ord2(iF + 1, jF, kF, ex)]
);
fyy[p] = Fdydy * (
-fh[idx_fh_F_ord2(iF, jF - 2, kF, ex)] +
F16 * fh[idx_fh_F_ord2(iF, jF - 1, kF, ex)] -
F30 * fh[idx_fh_F_ord2(iF, jF, kF, ex)] -
fh[idx_fh_F_ord2(iF, jF + 2, kF, ex)] +
F16 * fh[idx_fh_F_ord2(iF, jF + 1, kF, ex)]
);
fzz[p] = Fdzdz * (
-fh[idx_fh_F_ord2(iF, jF, kF - 2, ex)] +
F16 * fh[idx_fh_F_ord2(iF, jF, kF - 1, ex)] -
F30 * fh[idx_fh_F_ord2(iF, jF, kF, ex)] -
fh[idx_fh_F_ord2(iF, jF, kF + 2, ex)] +
F16 * fh[idx_fh_F_ord2(iF, jF, kF + 1, ex)]
);
/* fxy 高阶:完全照搬 Fortran 的括号结构 */
{
const double t_jm2 =
( fh[idx_fh_F_ord2(iF - 2, jF - 2, kF, ex)]
-F8*fh[idx_fh_F_ord2(iF - 1, jF - 2, kF, ex)]
+F8*fh[idx_fh_F_ord2(iF + 1, jF - 2, kF, ex)]
- fh[idx_fh_F_ord2(iF + 2, jF - 2, kF, ex)] );
const double t_jm1 =
( fh[idx_fh_F_ord2(iF - 2, jF - 1, kF, ex)]
-F8*fh[idx_fh_F_ord2(iF - 1, jF - 1, kF, ex)]
+F8*fh[idx_fh_F_ord2(iF + 1, jF - 1, kF, ex)]
- fh[idx_fh_F_ord2(iF + 2, jF - 1, kF, ex)] );
const double t_jp1 =
( fh[idx_fh_F_ord2(iF - 2, jF + 1, kF, ex)]
-F8*fh[idx_fh_F_ord2(iF - 1, jF + 1, kF, ex)]
+F8*fh[idx_fh_F_ord2(iF + 1, jF + 1, kF, ex)]
- fh[idx_fh_F_ord2(iF + 2, jF + 1, kF, ex)] );
const double t_jp2 =
( fh[idx_fh_F_ord2(iF - 2, jF + 2, kF, ex)]
-F8*fh[idx_fh_F_ord2(iF - 1, jF + 2, kF, ex)]
+F8*fh[idx_fh_F_ord2(iF + 1, jF + 2, kF, ex)]
- fh[idx_fh_F_ord2(iF + 2, jF + 2, kF, ex)] );
fxy[p] = Fdxdy * ( t_jm2 - F8 * t_jm1 + F8 * t_jp1 - t_jp2 );
}
/* fxz 高阶 */
{
const double t_km2 =
( fh[idx_fh_F_ord2(iF - 2, jF, kF - 2, ex)]
-F8*fh[idx_fh_F_ord2(iF - 1, jF, kF - 2, ex)]
+F8*fh[idx_fh_F_ord2(iF + 1, jF, kF - 2, ex)]
- fh[idx_fh_F_ord2(iF + 2, jF, kF - 2, ex)] );
const double t_km1 =
( fh[idx_fh_F_ord2(iF - 2, jF, kF - 1, ex)]
-F8*fh[idx_fh_F_ord2(iF - 1, jF, kF - 1, ex)]
+F8*fh[idx_fh_F_ord2(iF + 1, jF, kF - 1, ex)]
- fh[idx_fh_F_ord2(iF + 2, jF, kF - 1, ex)] );
const double t_kp1 =
( fh[idx_fh_F_ord2(iF - 2, jF, kF + 1, ex)]
-F8*fh[idx_fh_F_ord2(iF - 1, jF, kF + 1, ex)]
+F8*fh[idx_fh_F_ord2(iF + 1, jF, kF + 1, ex)]
- fh[idx_fh_F_ord2(iF + 2, jF, kF + 1, ex)] );
const double t_kp2 =
( fh[idx_fh_F_ord2(iF - 2, jF, kF + 2, ex)]
-F8*fh[idx_fh_F_ord2(iF - 1, jF, kF + 2, ex)]
+F8*fh[idx_fh_F_ord2(iF + 1, jF, kF + 2, ex)]
- fh[idx_fh_F_ord2(iF + 2, jF, kF + 2, ex)] );
fxz[p] = Fdxdz * ( t_km2 - F8 * t_km1 + F8 * t_kp1 - t_kp2 );
}
/* fyz 高阶 */
{
const double t_km2 =
( fh[idx_fh_F_ord2(iF, jF - 2, kF - 2, ex)]
-F8*fh[idx_fh_F_ord2(iF, jF - 1, kF - 2, ex)]
+F8*fh[idx_fh_F_ord2(iF, jF + 1, kF - 2, ex)]
- fh[idx_fh_F_ord2(iF, jF + 2, kF - 2, ex)] );
const double t_km1 =
( fh[idx_fh_F_ord2(iF, jF - 2, kF - 1, ex)]
-F8*fh[idx_fh_F_ord2(iF, jF - 1, kF - 1, ex)]
+F8*fh[idx_fh_F_ord2(iF, jF + 1, kF - 1, ex)]
- fh[idx_fh_F_ord2(iF, jF + 2, kF - 1, ex)] );
const double t_kp1 =
( fh[idx_fh_F_ord2(iF, jF - 2, kF + 1, ex)]
-F8*fh[idx_fh_F_ord2(iF, jF - 1, kF + 1, ex)]
+F8*fh[idx_fh_F_ord2(iF, jF + 1, kF + 1, ex)]
- fh[idx_fh_F_ord2(iF, jF + 2, kF + 1, ex)] );
const double t_kp2 =
( fh[idx_fh_F_ord2(iF, jF - 2, kF + 2, ex)]
-F8*fh[idx_fh_F_ord2(iF, jF - 1, kF + 2, ex)]
+F8*fh[idx_fh_F_ord2(iF, jF + 1, kF + 2, ex)]
- fh[idx_fh_F_ord2(iF, jF + 2, kF + 2, ex)] );
fyz[p] = Fdydz * ( t_km2 - F8 * t_km1 + F8 * t_kp1 - t_kp2 );
}
}
/* 二阶分支i±1,j±1,k±1 在范围内 */
else if ((iF + 1) <= imaxF && (iF - 1) >= iminF &&
(jF + 1) <= jmaxF && (jF - 1) >= jminF &&
(kF + 1) <= kmaxF && (kF - 1) >= kminF)
{
fxx[p] = Sdxdx * (
fh[idx_fh_F_ord2(iF - 1, jF, kF, ex)] -
TWO * fh[idx_fh_F_ord2(iF, jF, kF, ex)] +
fh[idx_fh_F_ord2(iF + 1, jF, kF, ex)]
);
fyy[p] = Sdydy * (
fh[idx_fh_F_ord2(iF, jF - 1, kF, ex)] -
TWO * fh[idx_fh_F_ord2(iF, jF, kF, ex)] +
fh[idx_fh_F_ord2(iF, jF + 1, kF, ex)]
);
fzz[p] = Sdzdz * (
fh[idx_fh_F_ord2(iF, jF, kF - 1, ex)] -
TWO * fh[idx_fh_F_ord2(iF, jF, kF, ex)] +
fh[idx_fh_F_ord2(iF, jF, kF + 1, ex)]
);
fxy[p] = Sdxdy * (
fh[idx_fh_F_ord2(iF - 1, jF - 1, kF, ex)] -
fh[idx_fh_F_ord2(iF + 1, jF - 1, kF, ex)] -
fh[idx_fh_F_ord2(iF - 1, jF + 1, kF, ex)] +
fh[idx_fh_F_ord2(iF + 1, jF + 1, kF, ex)]
);
fxz[p] = Sdxdz * (
fh[idx_fh_F_ord2(iF - 1, jF, kF - 1, ex)] -
fh[idx_fh_F_ord2(iF + 1, jF, kF - 1, ex)] -
fh[idx_fh_F_ord2(iF - 1, jF, kF + 1, ex)] +
fh[idx_fh_F_ord2(iF + 1, jF, kF + 1, ex)]
);
fyz[p] = Sdydz * (
fh[idx_fh_F_ord2(iF, jF - 1, kF - 1, ex)] -
fh[idx_fh_F_ord2(iF, jF + 1, kF - 1, ex)] -
fh[idx_fh_F_ord2(iF, jF - 1, kF + 1, ex)] +
fh[idx_fh_F_ord2(iF, jF + 1, kF + 1, ex)]
);
}else{
fxx[p] = 0.0;
fyy[p] = 0.0;
fzz[p] = 0.0;
fxy[p] = 0.0;
fxz[p] = 0.0;
fyz[p] = 0.0;
}
}
}
}
// free(fh);
}

View File

@@ -0,0 +1,150 @@
#include "tool.h"
/*
* C 版 fderivs
*
* Fortran:
* subroutine fderivs(ex,f,fx,fy,fz,X,Y,Z,SYM1,SYM2,SYM3,symmetry,onoff)
*
* 约定:
* f, fx, fy, fz: ex1*ex2*ex3按 idx_ex 布局
* X: ex1, Y: ex2, Z: ex3
*/
void fderivs(const int ex[3],
const double *f,
double *fx, double *fy, double *fz,
const double *X, const double *Y, const double *Z,
double SYM1, double SYM2, double SYM3,
int Symmetry, int onoff)
{
(void)onoff; // Fortran 里没用到
const double ZEO = 0.0, ONE = 1.0;
const double TWO = 2.0, EIT = 8.0;
const double F12 = 12.0;
const int NO_SYMM = 0, EQ_SYMM = 1; // OCTANT=2 在本子程序里不直接用
const int ex1 = ex[0], ex2 = ex[1], ex3 = ex[2];
// dX = X(2)-X(1) -> C: X[1]-X[0]
const double dX = X[1] - X[0];
const double dY = Y[1] - Y[0];
const double dZ = Z[1] - Z[0];
// Fortran 1-based bounds
const int imaxF = ex1;
const int jmaxF = ex2;
const int kmaxF = ex3;
int iminF = 1, jminF = 1, kminF = 1;
if (Symmetry > NO_SYMM && fabs(Z[0]) < dZ) kminF = -1;
if (Symmetry > EQ_SYMM && fabs(X[0]) < dX) iminF = -1;
if (Symmetry > EQ_SYMM && fabs(Y[0]) < dY) jminF = -1;
// SoA(1:3) = SYM1,SYM2,SYM3
const double SoA[3] = { SYM1, SYM2, SYM3 };
// fh: (ex1+2)*(ex2+2)*(ex3+2) because ord=2
const size_t nx = (size_t)ex1 + 2;
const size_t ny = (size_t)ex2 + 2;
const size_t nz = (size_t)ex3 + 2;
const size_t fh_size = nx * ny * nz;
static double *fh = NULL;
static size_t cap = 0;
if (fh_size > cap) {
free(fh);
fh = (double*)aligned_alloc(64, fh_size * sizeof(double));
cap = fh_size;
}
// double *fh = (double*)malloc(fh_size * sizeof(double));
if (!fh) return;
// call symmetry_bd(2,ex,f,fh,SoA)
symmetry_bd(2, ex, f, fh, SoA);
const double d12dx = ONE / F12 / dX;
const double d12dy = ONE / F12 / dY;
const double d12dz = ONE / F12 / dZ;
const double d2dx = ONE / TWO / dX;
const double d2dy = ONE / TWO / dY;
const double d2dz = ONE / TWO / dZ;
// fx = fy = fz = 0
const size_t all = (size_t)ex1 * (size_t)ex2 * (size_t)ex3;
for (size_t p = 0; p < all; ++p) {
fx[p] = ZEO;
fy[p] = ZEO;
fz[p] = ZEO;
}
/*
* Fortran loops:
* do k=1,ex3-1
* do j=1,ex2-1
* do i=1,ex1-1
*
* C: k0=0..ex3-2, j0=0..ex2-2, i0=0..ex1-2
*/
for (int k0 = 0; k0 <= ex3 - 2; ++k0) {
const int kF = k0 + 1;
for (int j0 = 0; j0 <= ex2 - 2; ++j0) {
const int jF = j0 + 1;
for (int i0 = 0; i0 <= ex1 - 2; ++i0) {
const int iF = i0 + 1;
const size_t p = idx_ex(i0, j0, k0, ex);
// if(i+2 <= imax .and. i-2 >= imin ... ) (全是 Fortran 索引)
if ((iF + 2) <= imaxF && (iF - 2) >= iminF &&
(jF + 2) <= jmaxF && (jF - 2) >= jminF &&
(kF + 2) <= kmaxF && (kF - 2) >= kminF)
{
fx[p] = d12dx * (
fh[idx_fh_F_ord2(iF - 2, jF, kF, ex)] -
EIT * fh[idx_fh_F_ord2(iF - 1, jF, kF, ex)] +
EIT * fh[idx_fh_F_ord2(iF + 1, jF, kF, ex)] -
fh[idx_fh_F_ord2(iF + 2, jF, kF, ex)]
);
fy[p] = d12dy * (
fh[idx_fh_F_ord2(iF, jF - 2, kF, ex)] -
EIT * fh[idx_fh_F_ord2(iF, jF - 1, kF, ex)] +
EIT * fh[idx_fh_F_ord2(iF, jF + 1, kF, ex)] -
fh[idx_fh_F_ord2(iF, jF + 2, kF, ex)]
);
fz[p] = d12dz * (
fh[idx_fh_F_ord2(iF, jF, kF - 2, ex)] -
EIT * fh[idx_fh_F_ord2(iF, jF, kF - 1, ex)] +
EIT * fh[idx_fh_F_ord2(iF, jF, kF + 1, ex)] -
fh[idx_fh_F_ord2(iF, jF, kF + 2, ex)]
);
}
// elseif(i+1 <= imax .and. i-1 >= imin ...)
else if ((iF + 1) <= imaxF && (iF - 1) >= iminF &&
(jF + 1) <= jmaxF && (jF - 1) >= jminF &&
(kF + 1) <= kmaxF && (kF - 1) >= kminF)
{
fx[p] = d2dx * (
-fh[idx_fh_F_ord2(iF - 1, jF, kF, ex)] +
fh[idx_fh_F_ord2(iF + 1, jF, kF, ex)]
);
fy[p] = d2dy * (
-fh[idx_fh_F_ord2(iF, jF - 1, kF, ex)] +
fh[idx_fh_F_ord2(iF, jF + 1, kF, ex)]
);
fz[p] = d2dz * (
-fh[idx_fh_F_ord2(iF, jF, kF - 1, ex)] +
fh[idx_fh_F_ord2(iF, jF, kF + 1, ex)]
);
}
}
}
}
// free(fh);
}

View File

@@ -883,13 +883,17 @@ subroutine symmetry_bd(ord,extc,func,funcc,SoA)
integer::i integer::i
!DIR$ SIMD VECTORLENGTHFOR(KNOWN_INTEGER=8)
funcc(1:extc(1),1:extc(2),1:extc(3)) = func funcc(1:extc(1),1:extc(2),1:extc(3)) = func
!DIR$ SIMD VECTORLENGTHFOR(KNOWN_INTEGER=8)
do i=0,ord-1 do i=0,ord-1
funcc(-i,1:extc(2),1:extc(3)) = funcc(i+1,1:extc(2),1:extc(3))*SoA(1) funcc(-i,1:extc(2),1:extc(3)) = funcc(i+1,1:extc(2),1:extc(3))*SoA(1)
enddo enddo
!DIR$ SIMD VECTORLENGTHFOR(KNOWN_INTEGER=8)
do i=0,ord-1 do i=0,ord-1
funcc(:,-i,1:extc(3)) = funcc(:,i+1,1:extc(3))*SoA(2) funcc(:,-i,1:extc(3)) = funcc(:,i+1,1:extc(3))*SoA(2)
enddo enddo
!DIR$ SIMD VECTORLENGTHFOR(KNOWN_INTEGER=8)
do i=0,ord-1 do i=0,ord-1
funcc(:,:,-i) = funcc(:,:,i+1)*SoA(3) funcc(:,:,-i) = funcc(:,:,i+1)*SoA(3)
enddo enddo
@@ -1112,6 +1116,7 @@ end subroutine d2dump
! Lagrangian polynomial interpolation ! Lagrangian polynomial interpolation
!------------------------------------------------------------------------------ !------------------------------------------------------------------------------
!DIR$ ATTRIBUTES FORCEINLINE :: polint
subroutine polint(xa, ya, x, y, dy, ordn) subroutine polint(xa, ya, x, y, dy, ordn)
implicit none implicit none

View File

@@ -0,0 +1,107 @@
#include "interp_lb_profile.h"
#include <cstdio>
#include <cstring>
#include <algorithm>
namespace InterpLBProfile {
bool write_profile(const char *filepath, int nprocs,
const double *rank_times,
const int *heavy_ranks, int num_heavy,
double threshold_ratio)
{
FILE *fp = fopen(filepath, "wb");
if (!fp) return false;
ProfileHeader hdr;
hdr.magic = MAGIC;
hdr.version = VERSION;
hdr.nprocs = nprocs;
hdr.num_heavy = num_heavy;
hdr.threshold_ratio = threshold_ratio;
fwrite(&hdr, sizeof(hdr), 1, fp);
fwrite(rank_times, sizeof(double), nprocs, fp);
fwrite(heavy_ranks, sizeof(int), num_heavy, fp);
fclose(fp);
return true;
}
bool read_profile(const char *filepath, int current_nprocs,
int *heavy_ranks, int &num_heavy,
double *rank_times, MPI_Comm comm)
{
int myrank;
MPI_Comm_rank(comm, &myrank);
int valid = 0;
ProfileHeader hdr;
memset(&hdr, 0, sizeof(hdr));
if (myrank == 0) {
FILE *fp = fopen(filepath, "rb");
if (fp) {
if (fread(&hdr, sizeof(hdr), 1, fp) == 1 &&
hdr.magic == MAGIC && hdr.version == VERSION &&
hdr.nprocs == current_nprocs)
{
if (fread(rank_times, sizeof(double), current_nprocs, fp)
== (size_t)current_nprocs &&
fread(heavy_ranks, sizeof(int), hdr.num_heavy, fp)
== (size_t)hdr.num_heavy)
{
num_heavy = hdr.num_heavy;
valid = 1;
}
} else if (fp) {
printf("[InterpLB] Profile rejected: magic=0x%X version=%u "
"nprocs=%d (current=%d)\n",
hdr.magic, hdr.version, hdr.nprocs, current_nprocs);
}
fclose(fp);
}
}
MPI_Bcast(&valid, 1, MPI_INT, 0, comm);
if (!valid) return false;
MPI_Bcast(&num_heavy, 1, MPI_INT, 0, comm);
MPI_Bcast(heavy_ranks, num_heavy, MPI_INT, 0, comm);
MPI_Bcast(rank_times, current_nprocs, MPI_DOUBLE, 0, comm);
return true;
}
int identify_heavy_ranks(const double *rank_times, int nprocs,
double threshold_ratio,
int *heavy_ranks, int max_heavy)
{
double sum = 0;
for (int i = 0; i < nprocs; i++) sum += rank_times[i];
double mean = sum / nprocs;
double threshold = threshold_ratio * mean;
// Collect candidates
struct RankTime { int rank; double time; };
RankTime *candidates = new RankTime[nprocs];
int ncand = 0;
for (int i = 0; i < nprocs; i++) {
if (rank_times[i] > threshold)
candidates[ncand++] = {i, rank_times[i]};
}
// Sort descending by time
std::sort(candidates, candidates + ncand,
[](const RankTime &a, const RankTime &b) {
return a.time > b.time;
});
int count = (ncand < max_heavy) ? ncand : max_heavy;
for (int i = 0; i < count; i++)
heavy_ranks[i] = candidates[i].rank;
delete[] candidates;
return count;
}
} // namespace InterpLBProfile

Binary file not shown.

View File

@@ -0,0 +1,38 @@
#ifndef INTERP_LB_PROFILE_H
#define INTERP_LB_PROFILE_H
#include <mpi.h>
namespace InterpLBProfile {
static const unsigned int MAGIC = 0x494C4250; // "ILBP"
static const unsigned int VERSION = 1;
struct ProfileHeader {
unsigned int magic;
unsigned int version;
int nprocs;
int num_heavy;
double threshold_ratio;
};
// Write profile file (rank 0 only)
bool write_profile(const char *filepath, int nprocs,
const double *rank_times,
const int *heavy_ranks, int num_heavy,
double threshold_ratio);
// Read profile file (rank 0 reads, then broadcasts to all)
// Returns true if file found and valid for current nprocs
bool read_profile(const char *filepath, int current_nprocs,
int *heavy_ranks, int &num_heavy,
double *rank_times, MPI_Comm comm);
// Identify heavy ranks: those with time > threshold_ratio * mean
int identify_heavy_ranks(const double *rank_times, int nprocs,
double threshold_ratio,
int *heavy_ranks, int max_heavy);
} // namespace InterpLBProfile
#endif /* INTERP_LB_PROFILE_H */

View File

@@ -0,0 +1,27 @@
/* Auto-generated from interp_lb_profile.bin — do not edit */
#ifndef INTERP_LB_PROFILE_DATA_H
#define INTERP_LB_PROFILE_DATA_H
#define INTERP_LB_NPROCS 64
#define INTERP_LB_NUM_HEAVY 4
static const int interp_lb_heavy_blocks[4] = {27, 35, 28, 36};
/* Split table: {block_id, r_left, r_right} */
static const int interp_lb_splits[4][3] = {
{27, 26, 27},
{35, 34, 35},
{28, 28, 29},
{36, 36, 37},
};
/* Rank remap for displaced neighbor blocks */
static const int interp_lb_num_remaps = 4;
static const int interp_lb_remaps[][2] = {
{26, 25},
{29, 30},
{34, 33},
{37, 38},
};
#endif /* INTERP_LB_PROFILE_DATA_H */

View File

@@ -65,6 +65,8 @@ real*8,intent(in) :: eps
! dx^4 ! dx^4
! note the sign (-1)^r-1, now r=2 ! note the sign (-1)^r-1, now r=2
!DIR$ SIMD VECTORLENGTHFOR(KNOWN_INTEGER=8)
!DIR$ UNROLL PARTIAL(4)
do k=1,ex(3) do k=1,ex(3)
do j=1,ex(2) do j=1,ex(2)
do i=1,ex(1) do i=1,ex(1)

109
AMSS_NCKU_source/kodiss_c.C Normal file
View File

@@ -0,0 +1,109 @@
#include "tool.h"
/*
* C 版 kodis
*
* Fortran signature:
* subroutine kodis(ex,X,Y,Z,f,f_rhs,SoA,Symmetry,eps)
*
* 约定:
* X: ex1, Y: ex2, Z: ex3
* f, f_rhs: ex1*ex2*ex3 按 idx_ex 布局
* SoA[3]
* eps: double
*/
void kodis(const int ex[3],
const double *X, const double *Y, const double *Z,
const double *f, double *f_rhs,
const double SoA[3],
int Symmetry, double eps)
{
const double ONE = 1.0, SIX = 6.0, FIT = 15.0, TWT = 20.0;
const double cof = 64.0; // 2^6
const int NO_SYMM = 0, OCTANT = 2;
const int ex1 = ex[0], ex2 = ex[1], ex3 = ex[2];
// Fortran: dX = X(2)-X(1) -> C: X[1]-X[0]
const double dX = X[1] - X[0];
const double dY = Y[1] - Y[0];
const double dZ = Z[1] - Z[0];
(void)ONE; // ONE 在原 Fortran 里只是参数,这里不一定用得上
// Fortran: imax=ex(1) 等是 1-based 上界
const int imaxF = ex1;
const int jmaxF = ex2;
const int kmaxF = ex3;
// Fortran: imin=jmin=kmin=1某些对称情况变 -2
int iminF = 1, jminF = 1, kminF = 1;
if (Symmetry > NO_SYMM && fabs(Z[0]) < dZ) kminF = -2;
if (Symmetry == OCTANT && fabs(X[0]) < dX) iminF = -2;
if (Symmetry == OCTANT && fabs(Y[0]) < dY) jminF = -2;
// 分配 fh大小 (ex1+3)*(ex2+3)*(ex3+3),对应 ord=3
const size_t nx = (size_t)ex1 + 3;
const size_t ny = (size_t)ex2 + 3;
const size_t nz = (size_t)ex3 + 3;
const size_t fh_size = nx * ny * nz;
double *fh = (double*)malloc(fh_size * sizeof(double));
if (!fh) return;
// Fortran: call symmetry_bd(3,ex,f,fh,SoA)
symmetry_bd(3, ex, f, fh, SoA);
/*
* Fortran loops:
* do k=1,ex3
* do j=1,ex2
* do i=1,ex1
*
* C: k0=0..ex3-1, j0=0..ex2-1, i0=0..ex1-1
* 并定义 Fortran index: iF=i0+1, ...
*/
for (int k0 = 0; k0 < ex3; ++k0) {
const int kF = k0 + 1;
for (int j0 = 0; j0 < ex2; ++j0) {
const int jF = j0 + 1;
for (int i0 = 0; i0 < ex1; ++i0) {
const int iF = i0 + 1;
// Fortran if 条件:
// i-3 >= imin .and. i+3 <= imax 等(都是 Fortran 索引)
if ((iF - 3) >= iminF && (iF + 3) <= imaxF &&
(jF - 3) >= jminF && (jF + 3) <= jmaxF &&
(kF - 3) >= kminF && (kF + 3) <= kmaxF)
{
const size_t p = idx_ex(i0, j0, k0, ex);
// 三个方向各一份同型的 7 点组合(实际上是对称的 6th-order dissipation/filter 核)
const double Dx_term =
( (fh[idx_fh_F(iF - 3, jF, kF, ex)] + fh[idx_fh_F(iF + 3, jF, kF, ex)]) -
SIX * (fh[idx_fh_F(iF - 2, jF, kF, ex)] + fh[idx_fh_F(iF + 2, jF, kF, ex)]) +
FIT * (fh[idx_fh_F(iF - 1, jF, kF, ex)] + fh[idx_fh_F(iF + 1, jF, kF, ex)]) -
TWT * fh[idx_fh_F(iF , jF, kF, ex)] ) / dX;
const double Dy_term =
( (fh[idx_fh_F(iF, jF - 3, kF, ex)] + fh[idx_fh_F(iF, jF + 3, kF, ex)]) -
SIX * (fh[idx_fh_F(iF, jF - 2, kF, ex)] + fh[idx_fh_F(iF, jF + 2, kF, ex)]) +
FIT * (fh[idx_fh_F(iF, jF - 1, kF, ex)] + fh[idx_fh_F(iF, jF + 1, kF, ex)]) -
TWT * fh[idx_fh_F(iF, jF , kF, ex)] ) / dY;
const double Dz_term =
( (fh[idx_fh_F(iF, jF, kF - 3, ex)] + fh[idx_fh_F(iF, jF, kF + 3, ex)]) -
SIX * (fh[idx_fh_F(iF, jF, kF - 2, ex)] + fh[idx_fh_F(iF, jF, kF + 2, ex)]) +
FIT * (fh[idx_fh_F(iF, jF, kF - 1, ex)] + fh[idx_fh_F(iF, jF, kF + 1, ex)]) -
TWT * fh[idx_fh_F(iF, jF, kF , ex)] ) / dZ;
// Fortran:
// f_rhs(i,j,k) = f_rhs(i,j,k) + eps/cof*(Dx_term + Dy_term + Dz_term)
f_rhs[p] += (eps / cof) * (Dx_term + Dy_term + Dz_term);
}
}
}
}
free(fh);
}

View File

@@ -0,0 +1,255 @@
#include "tool.h"
/*
* 你需要提供 symmetry_bd 的 C 版本(或 Fortran 绑到 C 的接口)。
* Fortran: call symmetry_bd(3,ex,f,fh,SoA)
*
* 约定:
* nghost = 3
* ex[3] = {ex1,ex2,ex3}
* f = 原始网格 (ex1*ex2*ex3)
* fh = 扩展网格 ((ex1+3)*(ex2+3)*(ex3+3)),对应 Fortran 的 (-2:ex1, ...)
* SoA[3] = 输入参数
*/
void lopsided(const int ex[3],
const double *X, const double *Y, const double *Z,
const double *f, double *f_rhs,
const double *Sfx, const double *Sfy, const double *Sfz,
int Symmetry, const double SoA[3])
{
const double ZEO = 0.0, ONE = 1.0, F3 = 3.0;
const double TWO = 2.0, F6 = 6.0, F18 = 18.0;
const double F12 = 12.0, F10 = 10.0, EIT = 8.0;
const int NO_SYMM = 0, EQ_SYMM = 1, OCTANT = 2;
(void)OCTANT; // 这里和 Fortran 一样只是定义了不用也没关系
const int ex1 = ex[0], ex2 = ex[1], ex3 = ex[2];
// 对应 Fortran: dX = X(2)-X(1) Fortran 1-based
// C: X[1]-X[0]
const double dX = X[1] - X[0];
const double dY = Y[1] - Y[0];
const double dZ = Z[1] - Z[0];
const double d12dx = ONE / F12 / dX;
const double d12dy = ONE / F12 / dY;
const double d12dz = ONE / F12 / dZ;
// Fortran 里算了 d2dx/d2dy/d2dz 但本 subroutine 里没用到(保持一致也算出来)
const double d2dx = ONE / TWO / dX;
const double d2dy = ONE / TWO / dY;
const double d2dz = ONE / TWO / dZ;
(void)d2dx; (void)d2dy; (void)d2dz;
// Fortran:
// imax = ex(1); jmax = ex(2); kmax = ex(3)
const int imaxF = ex1;
const int jmaxF = ex2;
const int kmaxF = ex3;
// Fortran:
// imin=jmin=kmin=1; 若满足对称条件则设为 -2
int iminF = 1, jminF = 1, kminF = 1;
if (Symmetry > NO_SYMM && fabs(Z[0]) < dZ) kminF = -2;
if (Symmetry > EQ_SYMM && fabs(X[0]) < dX) iminF = -2;
if (Symmetry > EQ_SYMM && fabs(Y[0]) < dY) jminF = -2;
// 分配 fh大小 (ex1+3)*(ex2+3)*(ex3+3)
const size_t nx = (size_t)ex1 + 3;
const size_t ny = (size_t)ex2 + 3;
const size_t nz = (size_t)ex3 + 3;
const size_t fh_size = nx * ny * nz;
double *fh = (double*)malloc(fh_size * sizeof(double));
if (!fh) return; // 内存不足:直接返回(你也可以改成 abort/报错)
// Fortran: call symmetry_bd(3,ex,f,fh,SoA)
symmetry_bd(3, ex, f, fh, SoA);
/*
* Fortran 主循环:
* do k=1,ex(3)-1
* do j=1,ex(2)-1
* do i=1,ex(1)-1
*
* 转成 C 0-based
* k0 = 0..ex3-2, j0 = 0..ex2-2, i0 = 0..ex1-2
*
* 并且 Fortran 里的 i/j/k 在 fh 访问时,仍然是 Fortran 索引值:
* iF=i0+1, jF=j0+1, kF=k0+1
*/
for (int k0 = 0; k0 <= ex3 - 2; ++k0) {
const int kF = k0 + 1;
for (int j0 = 0; j0 <= ex2 - 2; ++j0) {
const int jF = j0 + 1;
for (int i0 = 0; i0 <= ex1 - 2; ++i0) {
const int iF = i0 + 1;
const size_t p = idx_ex(i0, j0, k0, ex);
// ---------------- x direction ----------------
const double sfx = Sfx[p];
if (sfx > ZEO) {
// Fortran: if(i+3 <= imax)
// iF+3 <= ex1 <=> i0+4 <= ex1 <=> i0 <= ex1-4
if (i0 <= ex1 - 4) {
f_rhs[p] += sfx * d12dx *
(-F3 * fh[idx_fh_F(iF - 1, jF, kF, ex)]
-F10 * fh[idx_fh_F(iF , jF, kF, ex)]
+F18 * fh[idx_fh_F(iF + 1, jF, kF, ex)]
-F6 * fh[idx_fh_F(iF + 2, jF, kF, ex)]
+ fh[idx_fh_F(iF + 3, jF, kF, ex)]);
}
// elseif(i+2 <= imax) <=> i0 <= ex1-3
else if (i0 <= ex1 - 3) {
f_rhs[p] += sfx * d12dx *
( fh[idx_fh_F(iF - 2, jF, kF, ex)]
-EIT * fh[idx_fh_F(iF - 1, jF, kF, ex)]
+EIT * fh[idx_fh_F(iF + 1, jF, kF, ex)]
- fh[idx_fh_F(iF + 2, jF, kF, ex)]);
}
// elseif(i+1 <= imax) <=> i0 <= ex1-2循环里总成立
else if (i0 <= ex1 - 2) {
f_rhs[p] -= sfx * d12dx *
(-F3 * fh[idx_fh_F(iF + 1, jF, kF, ex)]
-F10 * fh[idx_fh_F(iF , jF, kF, ex)]
+F18 * fh[idx_fh_F(iF - 1, jF, kF, ex)]
-F6 * fh[idx_fh_F(iF - 2, jF, kF, ex)]
+ fh[idx_fh_F(iF - 3, jF, kF, ex)]);
}
} else if (sfx < ZEO) {
// Fortran: if(i-3 >= imin)
// (iF-3) >= iminF <=> (i0-2) >= iminF
if ((i0 - 2) >= iminF) {
f_rhs[p] -= sfx * d12dx *
(-F3 * fh[idx_fh_F(iF + 1, jF, kF, ex)]
-F10 * fh[idx_fh_F(iF , jF, kF, ex)]
+F18 * fh[idx_fh_F(iF - 1, jF, kF, ex)]
-F6 * fh[idx_fh_F(iF - 2, jF, kF, ex)]
+ fh[idx_fh_F(iF - 3, jF, kF, ex)]);
}
// elseif(i-2 >= imin) <=> (i0-1) >= iminF
else if ((i0 - 1) >= iminF) {
f_rhs[p] += sfx * d12dx *
( fh[idx_fh_F(iF - 2, jF, kF, ex)]
-EIT * fh[idx_fh_F(iF - 1, jF, kF, ex)]
+EIT * fh[idx_fh_F(iF + 1, jF, kF, ex)]
- fh[idx_fh_F(iF + 2, jF, kF, ex)]);
}
// elseif(i-1 >= imin) <=> i0 >= iminF
else if (i0 >= iminF) {
f_rhs[p] += sfx * d12dx *
(-F3 * fh[idx_fh_F(iF - 1, jF, kF, ex)]
-F10 * fh[idx_fh_F(iF , jF, kF, ex)]
+F18 * fh[idx_fh_F(iF + 1, jF, kF, ex)]
-F6 * fh[idx_fh_F(iF + 2, jF, kF, ex)]
+ fh[idx_fh_F(iF + 3, jF, kF, ex)]);
}
}
// ---------------- y direction ----------------
const double sfy = Sfy[p];
if (sfy > ZEO) {
// jF+3 <= ex2 <=> j0+4 <= ex2 <=> j0 <= ex2-4
if (j0 <= ex2 - 4) {
f_rhs[p] += sfy * d12dy *
(-F3 * fh[idx_fh_F(iF, jF - 1, kF, ex)]
-F10 * fh[idx_fh_F(iF, jF , kF, ex)]
+F18 * fh[idx_fh_F(iF, jF + 1, kF, ex)]
-F6 * fh[idx_fh_F(iF, jF + 2, kF, ex)]
+ fh[idx_fh_F(iF, jF + 3, kF, ex)]);
} else if (j0 <= ex2 - 3) {
f_rhs[p] += sfy * d12dy *
( fh[idx_fh_F(iF, jF - 2, kF, ex)]
-EIT * fh[idx_fh_F(iF, jF - 1, kF, ex)]
+EIT * fh[idx_fh_F(iF, jF + 1, kF, ex)]
- fh[idx_fh_F(iF, jF + 2, kF, ex)]);
} else if (j0 <= ex2 - 2) {
f_rhs[p] -= sfy * d12dy *
(-F3 * fh[idx_fh_F(iF, jF + 1, kF, ex)]
-F10 * fh[idx_fh_F(iF, jF , kF, ex)]
+F18 * fh[idx_fh_F(iF, jF - 1, kF, ex)]
-F6 * fh[idx_fh_F(iF, jF - 2, kF, ex)]
+ fh[idx_fh_F(iF, jF - 3, kF, ex)]);
}
} else if (sfy < ZEO) {
if ((j0 - 2) >= jminF) {
f_rhs[p] -= sfy * d12dy *
(-F3 * fh[idx_fh_F(iF, jF + 1, kF, ex)]
-F10 * fh[idx_fh_F(iF, jF , kF, ex)]
+F18 * fh[idx_fh_F(iF, jF - 1, kF, ex)]
-F6 * fh[idx_fh_F(iF, jF - 2, kF, ex)]
+ fh[idx_fh_F(iF, jF - 3, kF, ex)]);
} else if ((j0 - 1) >= jminF) {
f_rhs[p] += sfy * d12dy *
( fh[idx_fh_F(iF, jF - 2, kF, ex)]
-EIT * fh[idx_fh_F(iF, jF - 1, kF, ex)]
+EIT * fh[idx_fh_F(iF, jF + 1, kF, ex)]
- fh[idx_fh_F(iF, jF + 2, kF, ex)]);
} else if (j0 >= jminF) {
f_rhs[p] += sfy * d12dy *
(-F3 * fh[idx_fh_F(iF, jF - 1, kF, ex)]
-F10 * fh[idx_fh_F(iF, jF , kF, ex)]
+F18 * fh[idx_fh_F(iF, jF + 1, kF, ex)]
-F6 * fh[idx_fh_F(iF, jF + 2, kF, ex)]
+ fh[idx_fh_F(iF, jF + 3, kF, ex)]);
}
}
// ---------------- z direction ----------------
const double sfz = Sfz[p];
if (sfz > ZEO) {
if (k0 <= ex3 - 4) {
f_rhs[p] += sfz * d12dz *
(-F3 * fh[idx_fh_F(iF, jF, kF - 1, ex)]
-F10 * fh[idx_fh_F(iF, jF, kF , ex)]
+F18 * fh[idx_fh_F(iF, jF, kF + 1, ex)]
-F6 * fh[idx_fh_F(iF, jF, kF + 2, ex)]
+ fh[idx_fh_F(iF, jF, kF + 3, ex)]);
} else if (k0 <= ex3 - 3) {
f_rhs[p] += sfz * d12dz *
( fh[idx_fh_F(iF, jF, kF - 2, ex)]
-EIT * fh[idx_fh_F(iF, jF, kF - 1, ex)]
+EIT * fh[idx_fh_F(iF, jF, kF + 1, ex)]
- fh[idx_fh_F(iF, jF, kF + 2, ex)]);
} else if (k0 <= ex3 - 2) {
f_rhs[p] -= sfz * d12dz *
(-F3 * fh[idx_fh_F(iF, jF, kF + 1, ex)]
-F10 * fh[idx_fh_F(iF, jF, kF , ex)]
+F18 * fh[idx_fh_F(iF, jF, kF - 1, ex)]
-F6 * fh[idx_fh_F(iF, jF, kF - 2, ex)]
+ fh[idx_fh_F(iF, jF, kF - 3, ex)]);
}
} else if (sfz < ZEO) {
if ((k0 - 2) >= kminF) {
f_rhs[p] -= sfz * d12dz *
(-F3 * fh[idx_fh_F(iF, jF, kF + 1, ex)]
-F10 * fh[idx_fh_F(iF, jF, kF , ex)]
+F18 * fh[idx_fh_F(iF, jF, kF - 1, ex)]
-F6 * fh[idx_fh_F(iF, jF, kF - 2, ex)]
+ fh[idx_fh_F(iF, jF, kF - 3, ex)]);
} else if ((k0 - 1) >= kminF) {
f_rhs[p] += sfz * d12dz *
( fh[idx_fh_F(iF, jF, kF - 2, ex)]
-EIT * fh[idx_fh_F(iF, jF, kF - 1, ex)]
+EIT * fh[idx_fh_F(iF, jF, kF + 1, ex)]
- fh[idx_fh_F(iF, jF, kF + 2, ex)]);
} else if (k0 >= kminF) {
f_rhs[p] += sfz * d12dz *
(-F3 * fh[idx_fh_F(iF, jF, kF - 1, ex)]
-F10 * fh[idx_fh_F(iF, jF, kF , ex)]
+F18 * fh[idx_fh_F(iF, jF, kF + 1, ex)]
-F6 * fh[idx_fh_F(iF, jF, kF + 2, ex)]
+ fh[idx_fh_F(iF, jF, kF + 3, ex)]);
}
}
}
}
}
free(fh);
}

View File

@@ -487,6 +487,201 @@ subroutine lopsided(ex,X,Y,Z,f,f_rhs,Sfx,Sfy,Sfz,Symmetry,SoA)
end subroutine lopsided end subroutine lopsided
!-----------------------------------------------------------------------------
! Combined advection (lopsided) + Kreiss-Oliger dissipation (kodis)
! Shares the symmetry_bd buffer fh, eliminating one full-grid copy per call.
! Mathematically identical to calling lopsided then kodis separately.
!-----------------------------------------------------------------------------
subroutine lopsided_kodis(ex,X,Y,Z,f,f_rhs,Sfx,Sfy,Sfz,Symmetry,SoA,eps)
implicit none
!~~~~~~> Input parameters:
integer, intent(in) :: ex(1:3),Symmetry
real*8, intent(in) :: X(1:ex(1)),Y(1:ex(2)),Z(1:ex(3))
real*8,dimension(ex(1),ex(2),ex(3)),intent(in) :: f,Sfx,Sfy,Sfz
real*8,dimension(ex(1),ex(2),ex(3)),intent(inout):: f_rhs
real*8,dimension(3),intent(in) ::SoA
real*8,intent(in) :: eps
!~~~~~~> local variables:
! note index -2,-1,0, so we have 3 extra points
real*8,dimension(-2:ex(1),-2:ex(2),-2:ex(3)) :: fh
integer :: imin,jmin,kmin,imax,jmax,kmax,i,j,k
real*8 :: dX,dY,dZ
real*8 :: d12dx,d12dy,d12dz,d2dx,d2dy,d2dz
real*8, parameter :: ZEO=0.d0,ONE=1.d0, F3=3.d0
real*8, parameter :: TWO=2.d0,F6=6.0d0,F18=1.8d1
real*8, parameter :: F12=1.2d1, F10=1.d1,EIT=8.d0
integer, parameter :: NO_SYMM = 0, EQ_SYMM = 1, OCTANT = 2
! kodis parameters
real*8, parameter :: SIX=6.d0,FIT=1.5d1,TWT=2.d1
real*8, parameter :: cof=6.4d1 ! 2^6
dX = X(2)-X(1)
dY = Y(2)-Y(1)
dZ = Z(2)-Z(1)
d12dx = ONE/F12/dX
d12dy = ONE/F12/dY
d12dz = ONE/F12/dZ
d2dx = ONE/TWO/dX
d2dy = ONE/TWO/dY
d2dz = ONE/TWO/dZ
imax = ex(1)
jmax = ex(2)
kmax = ex(3)
imin = 1
jmin = 1
kmin = 1
if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin = -2
if(Symmetry > EQ_SYMM .and. dabs(X(1)) < dX) imin = -2
if(Symmetry > EQ_SYMM .and. dabs(Y(1)) < dY) jmin = -2
! Single symmetry_bd call shared by both advection and dissipation
call symmetry_bd(3,ex,f,fh,SoA)
! ---- Advection (lopsided) loop ----
! upper bound set ex-1 only for efficiency,
! the loop body will set ex 0 also
do k=1,ex(3)-1
do j=1,ex(2)-1
do i=1,ex(1)-1
! x direction
if(Sfx(i,j,k) > ZEO)then
if(i+3 <= imax)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfx(i,j,k)*d12dx*(-F3*fh(i-1,j,k)-F10*fh(i,j,k)+F18*fh(i+1,j,k) &
-F6*fh(i+2,j,k)+ fh(i+3,j,k))
elseif(i+2 <= imax)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfx(i,j,k)*d12dx*(fh(i-2,j,k)-EIT*fh(i-1,j,k)+EIT*fh(i+1,j,k)-fh(i+2,j,k))
elseif(i+1 <= imax)then
f_rhs(i,j,k)=f_rhs(i,j,k)- &
Sfx(i,j,k)*d12dx*(-F3*fh(i+1,j,k)-F10*fh(i,j,k)+F18*fh(i-1,j,k) &
-F6*fh(i-2,j,k)+ fh(i-3,j,k))
endif
elseif(Sfx(i,j,k) < ZEO)then
if(i-3 >= imin)then
f_rhs(i,j,k)=f_rhs(i,j,k)- &
Sfx(i,j,k)*d12dx*(-F3*fh(i+1,j,k)-F10*fh(i,j,k)+F18*fh(i-1,j,k) &
-F6*fh(i-2,j,k)+ fh(i-3,j,k))
elseif(i-2 >= imin)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfx(i,j,k)*d12dx*(fh(i-2,j,k)-EIT*fh(i-1,j,k)+EIT*fh(i+1,j,k)-fh(i+2,j,k))
elseif(i-1 >= imin)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfx(i,j,k)*d12dx*(-F3*fh(i-1,j,k)-F10*fh(i,j,k)+F18*fh(i+1,j,k) &
-F6*fh(i+2,j,k)+ fh(i+3,j,k))
endif
endif
! y direction
if(Sfy(i,j,k) > ZEO)then
if(j+3 <= jmax)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfy(i,j,k)*d12dy*(-F3*fh(i,j-1,k)-F10*fh(i,j,k)+F18*fh(i,j+1,k) &
-F6*fh(i,j+2,k)+ fh(i,j+3,k))
elseif(j+2 <= jmax)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfy(i,j,k)*d12dy*(fh(i,j-2,k)-EIT*fh(i,j-1,k)+EIT*fh(i,j+1,k)-fh(i,j+2,k))
elseif(j+1 <= jmax)then
f_rhs(i,j,k)=f_rhs(i,j,k)- &
Sfy(i,j,k)*d12dy*(-F3*fh(i,j+1,k)-F10*fh(i,j,k)+F18*fh(i,j-1,k) &
-F6*fh(i,j-2,k)+ fh(i,j-3,k))
endif
elseif(Sfy(i,j,k) < ZEO)then
if(j-3 >= jmin)then
f_rhs(i,j,k)=f_rhs(i,j,k)- &
Sfy(i,j,k)*d12dy*(-F3*fh(i,j+1,k)-F10*fh(i,j,k)+F18*fh(i,j-1,k) &
-F6*fh(i,j-2,k)+ fh(i,j-3,k))
elseif(j-2 >= jmin)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfy(i,j,k)*d12dy*(fh(i,j-2,k)-EIT*fh(i,j-1,k)+EIT*fh(i,j+1,k)-fh(i,j+2,k))
elseif(j-1 >= jmin)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfy(i,j,k)*d12dy*(-F3*fh(i,j-1,k)-F10*fh(i,j,k)+F18*fh(i,j+1,k) &
-F6*fh(i,j+2,k)+ fh(i,j+3,k))
endif
endif
! z direction
if(Sfz(i,j,k) > ZEO)then
if(k+3 <= kmax)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k-1)-F10*fh(i,j,k)+F18*fh(i,j,k+1) &
-F6*fh(i,j,k+2)+ fh(i,j,k+3))
elseif(k+2 <= kmax)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfz(i,j,k)*d12dz*(fh(i,j,k-2)-EIT*fh(i,j,k-1)+EIT*fh(i,j,k+1)-fh(i,j,k+2))
elseif(k+1 <= kmax)then
f_rhs(i,j,k)=f_rhs(i,j,k)- &
Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k+1)-F10*fh(i,j,k)+F18*fh(i,j,k-1) &
-F6*fh(i,j,k-2)+ fh(i,j,k-3))
endif
elseif(Sfz(i,j,k) < ZEO)then
if(k-3 >= kmin)then
f_rhs(i,j,k)=f_rhs(i,j,k)- &
Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k+1)-F10*fh(i,j,k)+F18*fh(i,j,k-1) &
-F6*fh(i,j,k-2)+ fh(i,j,k-3))
elseif(k-2 >= kmin)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfz(i,j,k)*d12dz*(fh(i,j,k-2)-EIT*fh(i,j,k-1)+EIT*fh(i,j,k+1)-fh(i,j,k+2))
elseif(k-1 >= kmin)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k-1)-F10*fh(i,j,k)+F18*fh(i,j,k+1) &
-F6*fh(i,j,k+2)+ fh(i,j,k+3))
endif
endif
enddo
enddo
enddo
! ---- Dissipation (kodis) loop ----
if(eps > ZEO) then
do k=1,ex(3)
do j=1,ex(2)
do i=1,ex(1)
if(i-3 >= imin .and. i+3 <= imax .and. &
j-3 >= jmin .and. j+3 <= jmax .and. &
k-3 >= kmin .and. k+3 <= kmax) then
f_rhs(i,j,k) = f_rhs(i,j,k) + eps/cof *( ( &
(fh(i-3,j,k)+fh(i+3,j,k)) - &
SIX*(fh(i-2,j,k)+fh(i+2,j,k)) + &
FIT*(fh(i-1,j,k)+fh(i+1,j,k)) - &
TWT* fh(i,j,k) )/dX + &
( &
(fh(i,j-3,k)+fh(i,j+3,k)) - &
SIX*(fh(i,j-2,k)+fh(i,j+2,k)) + &
FIT*(fh(i,j-1,k)+fh(i,j+1,k)) - &
TWT* fh(i,j,k) )/dY + &
( &
(fh(i,j,k-3)+fh(i,j,k+3)) - &
SIX*(fh(i,j,k-2)+fh(i,j,k+2)) + &
FIT*(fh(i,j,k-1)+fh(i,j,k+1)) - &
TWT* fh(i,j,k) )/dZ )
endif
enddo
enddo
enddo
endif
return
end subroutine lopsided_kodis
#elif (ghost_width == 4) #elif (ghost_width == 4)
! sixth order code ! sixth order code
! Compute advection terms in right hand sides of field equations ! Compute advection terms in right hand sides of field equations

View File

@@ -1,83 +1,77 @@
#if 0
note here
v:r; u: phi; w: theta
tetradtype 0
v^a = (x,y,z)
orthonormal order: v,u,w
m = (phi - i theta)/sqrt(2) following Frans, Eq.(8) of PRD 75, 124018(2007)
tetradtype 1
orthonormal order: w,u,v
m = (theta + i phi)/sqrt(2) following Sperhake, Eq.(3.2) of PRD 85, 124062(2012)
tetradtype 2
v_a = (x,y,z)
orthonormal order: v,u,w
m = (phi - i theta)/sqrt(2) following Frans, Eq.(8) of PRD 75, 124018(2007)
#endif
#define tetradtype 2 #define tetradtype 2
#if 0
note here
Cell center or Vertex center
#endif
#define Cell #define Cell
#if 0
note here
2nd order: 2
4th order: 3
6th order: 4
8th order: 5
#endif
#define ghost_width 3 #define ghost_width 3
#if 0
note here
use shell or not
#endif
#define WithShell
#if 0
note here
use constraint preserving boundary condition or not
only affect Z4c
#endif
#define CPBC
#if 0 #define GAUGE 0
note here
Gauge condition type
0: B^i gauge
1: David's puncture gauge
2: MB B^i gauge
3: RIT B^i gauge
4: MB beta gauge (beta gauge not means Eq.(3) of PRD 84, 124006)
5: RIT beta gauge (beta gauge not means Eq.(3) of PRD 84, 124006)
6: MGB1 B^i gauge
7: MGB2 B^i gauge
#endif
#define GAUGE 2
#if 0
buffer points for CPBC boundary
#endif
#define CPBC_ghost_width (ghost_width) #define CPBC_ghost_width (ghost_width)
#if 0
using BSSN variable for constraint violation and psi4 calculation: 0
using ADM variable for constraint violation and psi4 calculation: 1
#endif
#define ABV 0 #define ABV 0
#if 0
Type of Potential and Scalar Distribution in F(R) Scalar-Tensor Theory
1: Case C of 1112.3928, V=0
2: shell with a2^2*phi0/(1+a2^2), f(R) = R+a2*R^2 induced V
3: ground state of Schrodinger-Newton system, f(R) = R+a2*R^2 induced V
4: a2 = oo and phi(r) = phi0 * 0.5 * ( tanh((r+r0)/sigma) - tanh((r-r0)/sigma) )
5: shell with phi(r) = phi0*Exp(-(r-r0)**2/sigma), V = 0
#endif
#define EScalar_CC 2 #define EScalar_CC 2
#if 0
define tetradtype
v:r; u: phi; w: theta
tetradtype 0
v^a = (x,y,z)
orthonormal order: v,u,w
m = (phi - i theta)/sqrt(2) following Frans, Eq.(8) of PRD 75, 124018(2007)
tetradtype 1
orthonormal order: w,u,v
m = (theta + i phi)/sqrt(2) following Sperhake, Eq.(3.2) of PRD 85, 124062(2012)
tetradtype 2
v_a = (x,y,z)
orthonormal order: v,u,w
m = (phi - i theta)/sqrt(2) following Frans, Eq.(8) of PRD 75, 124018(2007)
define Cell or Vertex
Cell center or Vertex center
define ghost_width
2nd order: 2
4th order: 3
6th order: 4
8th order: 5
define WithShell
use shell or not
define CPBC
use constraint preserving boundary condition or not
only affect Z4c
CPBC only supports WithShell
define GAUGE
0: B^i gauge
1: David puncture gauge
2: MB B^i gauge
3: RIT B^i gauge
4: MB beta gauge (beta gauge not means Eq.(3) of PRD 84, 124006)
5: RIT beta gauge (beta gauge not means Eq.(3) of PRD 84, 124006)
6: MGB1 B^i gauge
7: MGB2 B^i gauge
define CPBC_ghost_width (ghost_width)
buffer points for CPBC boundary
define ABV
0: using BSSN variable for constraint violation and psi4 calculation
1: using ADM variable for constraint violation and psi4 calculation
define EScalar_CC
Type of Potential and Scalar Distribution in F(R) Scalar-Tensor Theory
1: Case C of 1112.3928, V=0
2: shell with phi(r) = phi0 * a2^2/(1+a2^2), f(R) = R+a2*R^2 induced V
3: ground state of Schrodinger-Newton system, f(R) = R+a2*R^2 induced V
4: a2 = +oo and phi(r) = phi0 * 0.5 * ( tanh((r+r0)/sigma) - tanh((r-r0)/sigma) )
5: shell with phi(r) = phi0 * Exp(-(r-r0)**2/sigma), V = 0
#endif

View File

@@ -6,95 +6,127 @@
// application parameters // application parameters
/// ****
// sommerfeld boundary type
// 0: bam, 1: shibata
#define SommerType 0 #define SommerType 0
/// ****
// for Using Gauss-Legendre quadrature in theta direction
#define GaussInt #define GaussInt
/// **** #define ABEtype 0
// 0: BSSN vacuum
// 1: coupled to scalar field
// 2: Z4c vacuum
// 3: coupled to Maxwell field
//
#define ABEtype 2
/// ****
// using Apparent Horizon Finder
//#define With_AHF //#define With_AHF
/// ****
// Psi4 calculation method
// 0: EB method
// 1: 4-D method
//
#define Psi4type 0 #define Psi4type 0
/// ****
// for Using point psi4 or not
//#define Point_Psi4 //#define Point_Psi4
/// ****
// RestrictProlong in Step (0) or after Step (1)
#define RPS 1 #define RPS 1
/// ****
// Enforce algebra constraint
// for every RK4 sub step: 0
// only when iter_count == 3: 1
// after routine Step: 2
#define AGM 0 #define AGM 0
/// ****
// Restrict Prolong using BAM style 1 or old style 0
#define RPB 0 #define RPB 0
/// ****
// 1: move Analysis out ot 4 sub steps and treat PBH with Euler method
#define MAPBH 1 #define MAPBH 1
/// ****
// parallel structure, 0: level by level, 1: considering all levels, 2: as 1 but reverse the CPU order, 3: Frank's scheme
#define PSTR 0 #define PSTR 0
/// ****
// regrid for every level or for all levels at a time
// 0: for every level; 1: for all
#define REGLEV 0 #define REGLEV 0
/// ****
// use gpu or not
//#define USE_GPU //#define USE_GPU
/// ****
// use checkpoint for every process
//#define CHECKDETAIL //#define CHECKDETAIL
/// ****
// use FakeCheckPrepare to write CheckPoint
//#define FAKECHECK //#define FAKECHECK
//
// define SommerType
// sommerfeld boundary type
// 0: bam
// 1: shibata
//
// define GaussInt
// for Using Gauss-Legendre quadrature in theta direction
//
// define ABEtype
// 0: BSSN vacuum
// 1: coupled to scalar field
// 2: Z4c vacuum
// 3: coupled to Maxwell field
//
// define With_AHF
// using Apparent Horizon Finder
//
// define Psi4type
// Psi4 calculation method
// 0: EB method
// 1: 4-D method
//
// define Point_Psi4
// for Using point psi4 or not
//
// define RPS
// RestrictProlong in Step (0) or after Step (1)
//
// define AGM
// Enforce algebra constraint
// for every RK4 sub step: 0
// only when iter_count == 3: 1
// after routine Step: 2
//
// define RPB
// Restrict Prolong using BAM style 1 or old style 0
//
// define MAPBH
// 1: move Analysis out ot 4 sub steps and treat PBH with Euler method
//
// define PSTR
// parallel structure
// 0: level by level
// 1: considering all levels
// 2: as 1 but reverse the CPU order
// 3: Frank's scheme
//
// define REGLEV
// regrid for every level or for all levels at a time
// 0: for every level;
// 1: for all
//
// define USE_GPU
// use gpu or not
//
// define CHECKDETAIL
// use checkpoint for every process
//
// define FAKECHECK
// use FakeCheckPrepare to write CheckPoint
//
////================================================================ ////================================================================
// some basic parameters for numerical calculation // some basic parameters for numerical calculation
////================================================================
#define dim 3 #define dim 3
//#define Cell or Vertex in "microdef.fh" //#define Cell or Vertex in "macrodef.fh"
// ******
// buffer point number for mesh refinement interface
#define buffer_width 6 #define buffer_width 6
// ******
// buffer point number shell-box interface, on shell
#define SC_width buffer_width #define SC_width buffer_width
// buffer point number shell-box interface, on box
#define CS_width (2*buffer_width) #define CS_width (2*buffer_width)
//
// define Cell or Vertex in "macrodef.fh"
//
// define buffer_width
// buffer point number for mesh refinement interface
//
// define SC_width buffer_width
// buffer point number shell-box interface, on shell
//
// define CS_width
// buffer point number shell-box interface, on box
//
#if(buffer_width < ghost_width) #if(buffer_width < ghost_width)
#error we always assume buffer_width>ghost_width # error we always assume buffer_width>ghost_width
#endif #endif
#define PACK 1 #define PACK 1
@@ -110,3 +142,4 @@
#define TINY 1e-10 #define TINY 1e-10
#endif /* MICRODEF_H */ #endif /* MICRODEF_H */

View File

@@ -2,6 +2,27 @@
include makefile.inc include makefile.inc
## ABE build flags selected by PGO_MODE (set in makefile.inc, default: opt)
## make -> opt (PGO-guided, maximum performance)
## make PGO_MODE=instrument -> instrument (Phase 1: collect fresh profile data)
PROFDATA = /home/$(shell whoami)/AMSS-NCKU/pgo_profile/default.profdata
ifeq ($(PGO_MODE),instrument)
## Phase 1: instrumentation — omit -ipo/-fp-model fast=2 for faster build and numerical stability
CXXAPPFLAGS = -O3 -xHost -fma -fprofile-instr-generate -ipo \
-Dfortran3 -Dnewc -I${MKLROOT}/include $(INTERP_LB_FLAGS)
f90appflags = -O3 -xHost -fma -fprofile-instr-generate -ipo \
-align array64byte -fpp -I${MKLROOT}/include
else
## opt (default): maximum performance with PGO profile data
CXXAPPFLAGS = -O3 -xHost -fp-model fast=2 -fma -ipo \
-fprofile-instr-use=$(PROFDATA) \
-Dfortran3 -Dnewc -I${MKLROOT}/include $(INTERP_LB_FLAGS)
f90appflags = -O3 -xHost -fp-model fast=2 -fma -ipo \
-fprofile-instr-use=$(PROFDATA) \
-align array64byte -fpp -I${MKLROOT}/include
endif
.SUFFIXES: .o .f90 .C .for .cu .SUFFIXES: .o .f90 .C .for .cu
.f90.o: .f90.o:
@@ -16,19 +37,54 @@ include makefile.inc
.cu.o: .cu.o:
$(Cu) $(CUDA_APP_FLAGS) -c $< -o $@ $(CUDA_LIB_PATH) $(Cu) $(CUDA_APP_FLAGS) -c $< -o $@ $(CUDA_LIB_PATH)
# C rewrite of BSSN RHS kernel and helpers
bssn_rhs_c.o: bssn_rhs_c.C
${CXX} $(CXXAPPFLAGS) -c $< $(filein) -o $@
fderivs_c.o: fderivs_c.C
${CXX} $(CXXAPPFLAGS) -c $< $(filein) -o $@
fdderivs_c.o: fdderivs_c.C
${CXX} $(CXXAPPFLAGS) -c $< $(filein) -o $@
kodiss_c.o: kodiss_c.C
${CXX} $(CXXAPPFLAGS) -c $< $(filein) -o $@
lopsided_c.o: lopsided_c.C
${CXX} $(CXXAPPFLAGS) -c $< $(filein) -o $@
interp_lb_profile.o: interp_lb_profile.C interp_lb_profile.h
${CXX} $(CXXAPPFLAGS) -c $< $(filein) -o $@
## TwoPunctureABE uses fixed optimal flags with its own PGO profile, independent of CXXAPPFLAGS
TP_PROFDATA = /home/$(shell whoami)/AMSS-NCKU/pgo_profile/TwoPunctureABE.profdata
TP_OPTFLAGS = -O3 -xHost -fp-model fast=2 -fma -ipo \
-fprofile-instr-use=$(TP_PROFDATA) \
-Dfortran3 -Dnewc -I${MKLROOT}/include
TwoPunctures.o: TwoPunctures.C TwoPunctures.o: TwoPunctures.C
${CXX} $(CXXAPPFLAGS) -qopenmp -c $< -o $@ ${CXX} $(TP_OPTFLAGS) -qopenmp -c $< -o $@
TwoPunctureABE.o: TwoPunctureABE.C TwoPunctureABE.o: TwoPunctureABE.C
${CXX} $(CXXAPPFLAGS) -qopenmp -c $< -o $@ ${CXX} $(TP_OPTFLAGS) -qopenmp -c $< -o $@
# Input files # Input files
## Kernel implementation switch (set USE_CXX_KERNELS=0 to fall back to Fortran)
ifeq ($(USE_CXX_KERNELS),0)
# Fortran mode: no C rewrite files; bssn_rhs.o is included via F90FILES below
CFILES =
else
# C++ mode (default): C rewrite of bssn_rhs and helper kernels
CFILES = bssn_rhs_c.o fderivs_c.o fdderivs_c.o kodiss_c.o lopsided_c.o
endif
C++FILES = ABE.o Ansorg.o Block.o misc.o monitor.o Parallel.o MPatch.o var.o\ C++FILES = ABE.o Ansorg.o Block.o misc.o monitor.o Parallel.o MPatch.o var.o\
cgh.o bssn_class.o surface_integral.o ShellPatch.o\ cgh.o bssn_class.o surface_integral.o ShellPatch.o\
bssnEScalar_class.o perf.o Z4c_class.o NullShellPatch.o\ bssnEScalar_class.o perf.o Z4c_class.o NullShellPatch.o\
bssnEM_class.o cpbc_util.o z4c_rhs_point.o checkpoint.o\ bssnEM_class.o cpbc_util.o z4c_rhs_point.o checkpoint.o\
Parallel_bam.o scalar_class.o transpbh.o NullShellPatch2.o\ Parallel_bam.o scalar_class.o transpbh.o NullShellPatch2.o\
NullShellPatch2_Evo.o writefile_f.o NullShellPatch2_Evo.o writefile_f.o interp_lb_profile.o
C++FILES_GPU = ABE.o Ansorg.o Block.o misc.o monitor.o Parallel.o MPatch.o var.o\ C++FILES_GPU = ABE.o Ansorg.o Block.o misc.o monitor.o Parallel.o MPatch.o var.o\
cgh.o surface_integral.o ShellPatch.o\ cgh.o surface_integral.o ShellPatch.o\
@@ -38,9 +94,9 @@ C++FILES_GPU = ABE.o Ansorg.o Block.o misc.o monitor.o Parallel.o MPatch.o var.o
NullShellPatch2_Evo.o \ NullShellPatch2_Evo.o \
bssn_gpu_class.o bssn_step_gpu.o bssn_macro.o writefile_f.o bssn_gpu_class.o bssn_step_gpu.o bssn_macro.o writefile_f.o
F90FILES = enforce_algebra.o fmisc.o initial_puncture.o prolongrestrict.o\ F90FILES_BASE = enforce_algebra.o fmisc.o initial_puncture.o prolongrestrict.o\
prolongrestrict_cell.o prolongrestrict_vertex.o\ prolongrestrict_cell.o prolongrestrict_vertex.o\
rungekutta4_rout.o bssn_rhs.o diff_new.o kodiss.o kodiss_sh.o\ rungekutta4_rout.o diff_new.o kodiss.o kodiss_sh.o\
lopsidediff.o sommerfeld_rout.o getnp4.o diff_new_sh.o\ lopsidediff.o sommerfeld_rout.o getnp4.o diff_new_sh.o\
shellfunctions.o bssn_rhs_ss.o Set_Rho_ADM.o\ shellfunctions.o bssn_rhs_ss.o Set_Rho_ADM.o\
getnp4EScalar.o bssnEScalar_rhs.o bssn_constraint.o ricci_gamma.o\ getnp4EScalar.o bssnEScalar_rhs.o bssn_constraint.o ricci_gamma.o\
@@ -51,6 +107,14 @@ F90FILES = enforce_algebra.o fmisc.o initial_puncture.o prolongrestrict.o\
scalar_rhs.o initial_scalar.o NullEvol2.o initial_null2.o\ scalar_rhs.o initial_scalar.o NullEvol2.o initial_null2.o\
NullNews2.o tool_f.o NullNews2.o tool_f.o
ifeq ($(USE_CXX_KERNELS),0)
# Fortran mode: include original bssn_rhs.o
F90FILES = $(F90FILES_BASE) bssn_rhs.o
else
# C++ mode (default): bssn_rhs.o replaced by C++ kernel
F90FILES = $(F90FILES_BASE)
endif
F77FILES = zbesh.o F77FILES = zbesh.o
AHFDOBJS = expansion.o expansion_Jacobian.o patch.o coords.o patch_info.o patch_interp.o patch_system.o \ AHFDOBJS = expansion.o expansion_Jacobian.o patch.o coords.o patch_info.o patch_interp.o patch_system.o \
@@ -63,7 +127,7 @@ TwoPunctureFILES = TwoPunctureABE.o TwoPunctures.o
CUDAFILES = bssn_gpu.o bssn_gpu_rhs_ss.o CUDAFILES = bssn_gpu.o bssn_gpu_rhs_ss.o
# file dependences # file dependences
$(C++FILES) $(C++FILESGPU) $(F90FILES) $(AHFDOBJS) $(CUDAFILES): macrodef.fh $(C++FILES) $(C++FILES_GPU) $(F90FILES) $(CFILES) $(AHFDOBJS) $(CUDAFILES): macrodef.fh
$(C++FILES): Block.h enforce_algebra.h fmisc.h initial_puncture.h macrodef.h\ $(C++FILES): Block.h enforce_algebra.h fmisc.h initial_puncture.h macrodef.h\
misc.h monitor.h MyList.h Parallel.h MPatch.h prolongrestrict.h\ misc.h monitor.h MyList.h Parallel.h MPatch.h prolongrestrict.h\
@@ -86,7 +150,7 @@ $(C++FILES_GPU): Block.h enforce_algebra.h fmisc.h initial_puncture.h macrodef.h
$(AHFDOBJS): cctk.h cctk_Config.h cctk_Types.h cctk_Constants.h myglobal.h $(AHFDOBJS): cctk.h cctk_Config.h cctk_Types.h cctk_Constants.h myglobal.h
$(C++FILES) $(C++FILES_GPU) $(AHFDOBJS) $(CUDAFILES): macrodef.h $(C++FILES) $(C++FILES_GPU) $(CFILES) $(AHFDOBJS) $(CUDAFILES): macrodef.h
TwoPunctureFILES: TwoPunctures.h TwoPunctureFILES: TwoPunctures.h
@@ -95,14 +159,14 @@ $(CUDAFILES): bssn_gpu.h gpu_mem.h gpu_rhsSS_mem.h
misc.o : zbesh.o misc.o : zbesh.o
# projects # projects
ABE: $(C++FILES) $(F90FILES) $(F77FILES) $(AHFDOBJS) ABE: $(C++FILES) $(CFILES) $(F90FILES) $(F77FILES) $(AHFDOBJS)
$(CLINKER) $(CXXAPPFLAGS) -o $@ $(C++FILES) $(F90FILES) $(F77FILES) $(AHFDOBJS) $(LDLIBS) $(CLINKER) $(CXXAPPFLAGS) -o $@ $(C++FILES) $(CFILES) $(F90FILES) $(F77FILES) $(AHFDOBJS) $(LDLIBS)
ABEGPU: $(C++FILES_GPU) $(F90FILES) $(F77FILES) $(AHFDOBJS) $(CUDAFILES) ABEGPU: $(C++FILES_GPU) $(CFILES) $(F90FILES) $(F77FILES) $(AHFDOBJS) $(CUDAFILES)
$(CLINKER) $(CXXAPPFLAGS) -o $@ $(C++FILES_GPU) $(F90FILES) $(F77FILES) $(AHFDOBJS) $(CUDAFILES) $(LDLIBS) $(CLINKER) $(CXXAPPFLAGS) -o $@ $(C++FILES_GPU) $(CFILES) $(F90FILES) $(F77FILES) $(AHFDOBJS) $(CUDAFILES) $(LDLIBS)
TwoPunctureABE: $(TwoPunctureFILES) TwoPunctureABE: $(TwoPunctureFILES)
$(CLINKER) $(CXXAPPFLAGS) -qopenmp -o $@ $(TwoPunctureFILES) $(LDLIBS) $(CLINKER) $(TP_OPTFLAGS) -qopenmp -o $@ $(TwoPunctureFILES) $(LDLIBS)
clean: clean:
rm *.o ABE ABEGPU TwoPunctureABE make.log -f rm *.o ABE ABEGPU TwoPunctureABE make.log -f

View File

@@ -8,18 +8,31 @@ filein = -I/usr/include/ -I${MKLROOT}/include
## Using sequential MKL (OpenMP disabled for better single-threaded performance) ## Using sequential MKL (OpenMP disabled for better single-threaded performance)
## Added -lifcore for Intel Fortran runtime and -limf for Intel math library ## Added -lifcore for Intel Fortran runtime and -limf for Intel math library
LDLIBS = -L${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lifcore -limf -lpthread -lm -ldl LDLIBS = -L${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lifcore -limf -lpthread -lm -ldl -liomp5
## Aggressive optimization flags + PGO Phase 2 (profile-guided optimization) ## PGO build mode switch (ABE only; TwoPunctureABE always uses opt flags)
## -fprofile-instr-use: use collected profile data to guide optimization decisions ## opt : (default) maximum performance with PGO profile-guided optimization
## (branch prediction, basic block layout, inlining, loop unrolling) ## instrument : PGO Phase 1 instrumentation to collect fresh profile data
PROFDATA = /home/amss/AMSS-NCKU/pgo_profile/default.profdata PGO_MODE ?= opt
CXXAPPFLAGS = -O3 -xHost -fp-model fast=2 -fma -ipo \
-fprofile-instr-use=$(PROFDATA) \ ## Interp_Points load balance profiling mode
-Dfortran3 -Dnewc -I${MKLROOT}/include ## off : (default) no load balance instrumentation
f90appflags = -O3 -xHost -fp-model fast=2 -fma -ipo \ ## profile : Pass 1 — instrument Interp_Points to collect timing profile
-fprofile-instr-use=$(PROFDATA) \ ## optimize : Pass 2 — read profile and apply block rebalancing
-align array64byte -fpp -I${MKLROOT}/include INTERP_LB_MODE ?= off
ifeq ($(INTERP_LB_MODE),profile)
INTERP_LB_FLAGS = -DINTERP_LB_PROFILE
else ifeq ($(INTERP_LB_MODE),optimize)
INTERP_LB_FLAGS = -DINTERP_LB_OPTIMIZE
else
INTERP_LB_FLAGS =
endif
## Kernel implementation switch
## 1 (default) : use C++ rewrite of bssn_rhs and helper kernels (faster)
## 0 : fall back to original Fortran kernels
USE_CXX_KERNELS ?= 1
f90 = ifx f90 = ifx
f77 = ifx f77 = ifx
CXX = icpx CXX = icpx

View File

@@ -0,0 +1,146 @@
#ifndef SHARE_FUNC_H
#define SHARE_FUNC_H
#include <stdlib.h>
#include <stddef.h>
#include <math.h>
#include <stdio.h>
/* 主网格0-based -> 1D */
static inline size_t idx_ex(int i0, int j0, int k0, const int ex[3]) {
const int ex1 = ex[0], ex2 = ex[1];
return (size_t)i0 + (size_t)j0 * (size_t)ex1 + (size_t)k0 * (size_t)ex1 * (size_t)ex2;
}
/*
* fh 对应 Fortran: fh(-1:ex1, -1:ex2, -1:ex3)
* ord=2 => shift=1
* iF/jF/kF 为 Fortran 索引(可为 -1,0,1..ex
*/
static inline size_t idx_fh_F_ord2(int iF, int jF, int kF, const int ex[3]) {
const int shift = 1;
const int nx = ex[0] + 2; // ex1 + ord
const int ny = ex[1] + 2;
const int ii = iF + shift; // 0..ex1+1
const int jj = jF + shift; // 0..ex2+1
const int kk = kF + shift; // 0..ex3+1
return (size_t)ii + (size_t)jj * (size_t)nx + (size_t)kk * (size_t)nx * (size_t)ny;
}
/*
* fh 对应 Fortran: fh(-2:ex1, -2:ex2, -2:ex3)
* ord=3 => shift=2
* iF/jF/kF 是 Fortran 索引(可为负)
*/
static inline size_t idx_fh_F(int iF, int jF, int kF, const int ex[3]) {
const int shift = 2; // ord=3 -> -2..ex
const int nx = ex[0] + 3; // ex1 + ord
const int ny = ex[1] + 3;
const int ii = iF + shift; // 0..ex1+2
const int jj = jF + shift; // 0..ex2+2
const int kk = kF + shift; // 0..ex3+2
return (size_t)ii + (size_t)jj * (size_t)nx + (size_t)kk * (size_t)nx * (size_t)ny;
}
/*
* func: (1..extc1, 1..extc2, 1..extc3) 1-based in Fortran
* funcc: (-ord+1..extc1, -ord+1..extc2, -ord+1..extc3) in Fortran
*
* C 里我们把:
* func 视为 0-based: i0=0..extc1-1, j0=0..extc2-1, k0=0..extc3-1
* funcc 用“平移下标”存为一维数组:
* iF in [-ord+1..extc1] -> ii = iF + (ord-1) in [0..extc1+ord-1]
* 总长度 nx = extc1 + ord
* 同理 ny = extc2 + ord, nz = extc3 + ord
*/
static inline size_t idx_func0(int i0, int j0, int k0, const int extc[3]) {
const int nx = extc[0], ny = extc[1];
return (size_t)i0 + (size_t)j0 * (size_t)nx + (size_t)k0 * (size_t)nx * (size_t)ny;
}
static inline size_t idx_funcc_F(int iF, int jF, int kF, int ord, const int extc[3]) {
const int shift = ord - 1; // iF = -shift .. extc1
const int nx = extc[0] + ord; // [-shift..extc1] 共 extc1+ord 个
const int ny = extc[1] + ord;
const int ii = iF + shift; // 0..extc1+shift
const int jj = jF + shift; // 0..extc2+shift
const int kk = kF + shift; // 0..extc3+shift
return (size_t)ii + (size_t)jj * (size_t)nx + (size_t)kk * (size_t)nx * (size_t)ny;
}
/*
* 等价于 Fortran:
* funcc(1:extc1,1:extc2,1:extc3)=func
* do i=0,ord-1
* funcc(-i,1:extc2,1:extc3) = funcc(i+1,1:extc2,1:extc3)*SoA(1)
* enddo
* do i=0,ord-1
* funcc(:,-i,1:extc3) = funcc(:,i+1,1:extc3)*SoA(2)
* enddo
* do i=0,ord-1
* funcc(:,:,-i) = funcc(:,:,i+1)*SoA(3)
* enddo
*/
static inline void symmetry_bd(int ord,
const int extc[3],
const double *func,
double *funcc,
const double SoA[3])
{
const int extc1 = extc[0], extc2 = extc[1], extc3 = extc[2];
// 1) funcc(1:extc1,1:extc2,1:extc3) = func
// Fortran 的 (iF=1..extc1) 对应 C 的 func(i0=0..extc1-1)
for (int k0 = 0; k0 < extc3; ++k0) {
for (int j0 = 0; j0 < extc2; ++j0) {
for (int i0 = 0; i0 < extc1; ++i0) {
const int iF = i0 + 1, jF = j0 + 1, kF = k0 + 1;
funcc[idx_funcc_F(iF, jF, kF, ord, extc)] = func[idx_func0(i0, j0, k0, extc)];
}
}
}
// 2) do i=0..ord-1: funcc(-i, 1:extc2, 1:extc3) = funcc(i+1, ...)*SoA(1)
for (int ii = 0; ii <= ord - 1; ++ii) {
const int iF_dst = -ii; // 0, -1, -2, ...
const int iF_src = ii + 1; // 1, 2, 3, ...
for (int kF = 1; kF <= extc3; ++kF) {
for (int jF = 1; jF <= extc2; ++jF) {
funcc[idx_funcc_F(iF_dst, jF, kF, ord, extc)] =
funcc[idx_funcc_F(iF_src, jF, kF, ord, extc)] * SoA[0];
}
}
}
// 3) do i=0..ord-1: funcc(:,-i, 1:extc3) = funcc(:, i+1, 1:extc3)*SoA(2)
// 注意 Fortran 这里的 ":" 表示 iF 从 (-ord+1..extc1) 全覆盖
for (int jj = 0; jj <= ord - 1; ++jj) {
const int jF_dst = -jj;
const int jF_src = jj + 1;
for (int kF = 1; kF <= extc3; ++kF) {
for (int iF = -ord + 1; iF <= extc1; ++iF) {
funcc[idx_funcc_F(iF, jF_dst, kF, ord, extc)] =
funcc[idx_funcc_F(iF, jF_src, kF, ord, extc)] * SoA[1];
}
}
}
// 4) do i=0..ord-1: funcc(:,:,-i) = funcc(:,:, i+1)*SoA(3)
for (int kk = 0; kk <= ord - 1; ++kk) {
const int kF_dst = -kk;
const int kF_src = kk + 1;
for (int jF = -ord + 1; jF <= extc2; ++jF) {
for (int iF = -ord + 1; iF <= extc1; ++iF) {
funcc[idx_funcc_F(iF, jF, kF_dst, ord, extc)] =
funcc[idx_funcc_F(iF, jF, kF_src, ord, extc)] * SoA[2];
}
}
}
}
#endif

27
AMSS_NCKU_source/tool.h Normal file
View File

@@ -0,0 +1,27 @@
#include "share_func.h"
void fdderivs(const int ex[3],
const double *f,
double *fxx, double *fxy, double *fxz,
double *fyy, double *fyz, double *fzz,
const double *X, const double *Y, const double *Z,
double SYM1, double SYM2, double SYM3,
int Symmetry, int onoff);
void fderivs(const int ex[3],
const double *f,
double *fx, double *fy, double *fz,
const double *X, const double *Y, const double *Z,
double SYM1, double SYM2, double SYM3,
int Symmetry, int onoff);
void kodis(const int ex[3],
const double *X, const double *Y, const double *Z,
const double *f, double *f_rhs,
const double SoA[3],
int Symmetry, double eps);
void lopsided(const int ex[3],
const double *X, const double *Y, const double *Z,
const double *f, double *f_rhs,
const double *Sfx, const double *Sfy, const double *Sfz,
int Symmetry, const double SoA[3]);

View File

@@ -0,0 +1,72 @@
#!/usr/bin/env python3
"""Convert interp_lb_profile.bin to a C header for compile-time embedding."""
import struct, sys
if len(sys.argv) < 3:
print(f"Usage: {sys.argv[0]} <profile.bin> <output.h>")
sys.exit(1)
with open(sys.argv[1], 'rb') as f:
magic, version, nprocs, num_heavy = struct.unpack('IIii', f.read(16))
threshold = struct.unpack('d', f.read(8))[0]
times = list(struct.unpack(f'{nprocs}d', f.read(nprocs * 8)))
heavy = list(struct.unpack(f'{num_heavy}i', f.read(num_heavy * 4)))
# For each heavy rank, compute split: left half -> lighter neighbor, right half -> heavy rank
# (or vice versa depending on which neighbor is lighter)
splits = []
for hr in heavy:
prev_t = times[hr - 1] if hr > 0 else 1e30
next_t = times[hr + 1] if hr < nprocs - 1 else 1e30
if prev_t <= next_t:
splits.append((hr, hr - 1, hr)) # (block_id, r_left, r_right)
else:
splits.append((hr, hr, hr + 1))
# Also remap the displaced neighbor blocks
remaps = {}
for hr, r_l, r_r in splits:
if r_l != hr:
# We took r_l's slot, so remap block r_l to its other neighbor
displaced = r_l
if displaced > 0 and displaced - 1 not in [s[0] for s in splits]:
remaps[displaced] = displaced - 1
elif displaced < nprocs - 1:
remaps[displaced] = displaced + 1
else:
displaced = r_r
if displaced < nprocs - 1 and displaced + 1 not in [s[0] for s in splits]:
remaps[displaced] = displaced + 1
elif displaced > 0:
remaps[displaced] = displaced - 1
with open(sys.argv[2], 'w') as out:
out.write("/* Auto-generated from interp_lb_profile.bin — do not edit */\n")
out.write("#ifndef INTERP_LB_PROFILE_DATA_H\n")
out.write("#define INTERP_LB_PROFILE_DATA_H\n\n")
out.write(f"#define INTERP_LB_NPROCS {nprocs}\n")
out.write(f"#define INTERP_LB_NUM_HEAVY {num_heavy}\n\n")
out.write(f"static const int interp_lb_heavy_blocks[{num_heavy}] = {{")
out.write(", ".join(str(h) for h in heavy))
out.write("};\n\n")
out.write("/* Split table: {block_id, r_left, r_right} */\n")
out.write(f"static const int interp_lb_splits[{num_heavy}][3] = {{\n")
for bid, rl, rr in splits:
out.write(f" {{{bid}, {rl}, {rr}}},\n")
out.write("};\n\n")
out.write("/* Rank remap for displaced neighbor blocks */\n")
out.write(f"static const int interp_lb_num_remaps = {len(remaps)};\n")
out.write(f"static const int interp_lb_remaps[][2] = {{\n")
for src, dst in sorted(remaps.items()):
out.write(f" {{{src}, {dst}}},\n")
if not remaps:
out.write(" {-1, -1},\n")
out.write("};\n\n")
out.write("#endif /* INTERP_LB_PROFILE_DATA_H */\n")
print(f"Generated {sys.argv[2]}:")
print(f" {num_heavy} heavy blocks to split: {heavy}")
for bid, rl, rr in splits:
print(f" block {bid}: split -> rank {rl} (left), rank {rr} (right)")
for src, dst in sorted(remaps.items()):
print(f" block {src}: remap -> rank {dst}")

View File

@@ -11,17 +11,46 @@
import AMSS_NCKU_Input as input_data import AMSS_NCKU_Input as input_data
import subprocess import subprocess
import time import time
## CPU core binding configuration using taskset
## taskset ensures all child processes inherit the CPU affinity mask
## This forces make and all compiler processes to use only nohz_full cores (4-55, 60-111)
## Format: taskset -c 4-55,60-111 ensures processes only run on these cores
#NUMACTL_CPU_BIND = "taskset -c 0-111"
NUMACTL_CPU_BIND = "taskset -c 16-47,64-95"
## Build parallelism configuration
## Use nohz_full cores (4-55, 60-111) for compilation: 52 + 52 = 104 cores def get_last_n_cores_per_socket(n=32):
## Set make -j to utilize available cores for faster builds """
BUILD_JOBS = 96 Read CPU topology via lscpu and return a taskset -c string
selecting the last `n` cores of each NUMA node (socket).
Example: 2 sockets x 56 cores each, n=32 -> node0: 24-55, node1: 80-111
-> "taskset -c 24-55,80-111"
"""
result = subprocess.run(["lscpu", "--parse=NODE,CPU"], capture_output=True, text=True)
# Build a dict: node_id -> sorted list of CPU ids
node_cpus = {}
for line in result.stdout.splitlines():
if line.startswith("#") or not line.strip():
continue
parts = line.split(",")
if len(parts) < 2:
continue
node_id, cpu_id = int(parts[0]), int(parts[1])
node_cpus.setdefault(node_id, []).append(cpu_id)
segments = []
for node_id in sorted(node_cpus):
cpus = sorted(node_cpus[node_id])
selected = cpus[-n:] # last n cores of this socket
segments.append(f"{selected[0]}-{selected[-1]}")
cpu_str = ",".join(segments)
total = len(segments) * n
print(f" CPU binding: taskset -c {cpu_str} ({total} cores, last {n} per socket)")
return f"taskset -c {cpu_str}"
## CPU core binding: dynamically select the last 32 cores of each socket (64 cores total)
NUMACTL_CPU_BIND = get_last_n_cores_per_socket(n=32)
## Build parallelism: match the number of bound cores
BUILD_JOBS = 64
################################################################## ##################################################################
@@ -40,7 +69,7 @@ def makefile_ABE():
## Build command with CPU binding to nohz_full cores ## Build command with CPU binding to nohz_full cores
if (input_data.GPU_Calculation == "no"): if (input_data.GPU_Calculation == "no"):
makefile_command = f"{NUMACTL_CPU_BIND} make -j{BUILD_JOBS} ABE" makefile_command = f"{NUMACTL_CPU_BIND} make -j{BUILD_JOBS} INTERP_LB_MODE=optimize ABE"
elif (input_data.GPU_Calculation == "yes"): elif (input_data.GPU_Calculation == "yes"):
makefile_command = f"{NUMACTL_CPU_BIND} make -j{BUILD_JOBS} ABEGPU" makefile_command = f"{NUMACTL_CPU_BIND} make -j{BUILD_JOBS} ABEGPU"
else: else:

29
parallel_plot_helper.py Normal file
View File

@@ -0,0 +1,29 @@
import multiprocessing
def run_plot_task(task):
"""Execute a single plotting task.
Parameters
----------
task : tuple
A tuple of (function, args_tuple) where function is a callable
plotting function and args_tuple contains its arguments.
"""
func, args = task
return func(*args)
def run_plot_tasks_parallel(plot_tasks):
"""Execute a list of independent plotting tasks in parallel.
Uses the 'fork' context to create worker processes so that the main
script is NOT re-imported/re-executed in child processes.
Parameters
----------
plot_tasks : list of tuples
Each element is (function, args_tuple).
"""
ctx = multiprocessing.get_context('fork')
with ctx.Pool() as pool:
pool.map(run_plot_task, plot_tasks)

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

View File

@@ -11,6 +11,8 @@
import numpy ## numpy for array operations import numpy ## numpy for array operations
import scipy ## scipy for interpolation and signal processing import scipy ## scipy for interpolation and signal processing
import math import math
import matplotlib
matplotlib.use('Agg') ## use non-interactive backend for multiprocessing safety
import matplotlib.pyplot as plt ## matplotlib for plotting import matplotlib.pyplot as plt ## matplotlib for plotting
import os ## os for system/file operations import os ## os for system/file operations

View File

@@ -8,16 +8,23 @@
## ##
################################################# #################################################
## Restrict OpenMP to one thread per process so that running
## many workers in parallel does not create an O(workers * BLAS_threads)
## thread explosion. The variable MUST be set before numpy/scipy
## are imported, because the BLAS library reads them only at load time.
import os
os.environ.setdefault("OMP_NUM_THREADS", "1")
import numpy import numpy
import scipy import scipy
import matplotlib
matplotlib.use('Agg') ## use non-interactive backend for multiprocessing safety
import matplotlib.pyplot as plt import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm from matplotlib.colors import LogNorm
from mpl_toolkits.mplot3d import Axes3D from mpl_toolkits.mplot3d import Axes3D
## import torch ## import torch
import AMSS_NCKU_Input as input_data import AMSS_NCKU_Input as input_data
import os
######################################################################################### #########################################################################################
@@ -192,3 +199,19 @@ def get_data_xy( Rmin, Rmax, n, data0, time, figure_title, figure_outdir ):
#################################################################################### ####################################################################################
####################################################################################
## Allow this module to be run as a standalone script so that each
## binary-data plot can be executed in a fresh subprocess whose BLAS
## environment variables (set above) take effect before numpy loads.
##
## Usage: python3 plot_binary_data.py <filename> <binary_outdir> <figure_outdir>
####################################################################################
if __name__ == '__main__':
import sys
if len(sys.argv) != 4:
print(f"Usage: {sys.argv[0]} <filename> <binary_outdir> <figure_outdir>")
sys.exit(1)
plot_binary_data(sys.argv[1], sys.argv[2], sys.argv[3])

View File

@@ -8,6 +8,8 @@
################################################# #################################################
import numpy ## numpy for array operations import numpy ## numpy for array operations
import matplotlib
matplotlib.use('Agg') ## use non-interactive backend for multiprocessing safety
import matplotlib.pyplot as plt ## matplotlib for plotting import matplotlib.pyplot as plt ## matplotlib for plotting
from mpl_toolkits.mplot3d import Axes3D ## needed for 3D plots from mpl_toolkits.mplot3d import Axes3D ## needed for 3D plots
import glob import glob
@@ -15,6 +17,9 @@ import os ## operating system utilities
import plot_binary_data import plot_binary_data
import AMSS_NCKU_Input as input_data import AMSS_NCKU_Input as input_data
import subprocess
import sys
import multiprocessing
# plt.rcParams['text.usetex'] = True ## enable LaTeX fonts in plots # plt.rcParams['text.usetex'] = True ## enable LaTeX fonts in plots
@@ -50,10 +55,40 @@ def generate_binary_data_plot( binary_outdir, figure_outdir ):
file_list.append(x) file_list.append(x)
print(x) print(x)
## Plot each file in the list ## Plot each file in parallel using subprocesses.
## Each subprocess is a fresh Python process where the BLAS thread-count
## environment variables (set at the top of plot_binary_data.py) take
## effect before numpy is imported. This avoids the thread explosion
## that occurs when multiprocessing.Pool with 'fork' context inherits
## already-initialized multi-threaded BLAS from the parent.
script = os.path.join( os.path.dirname(__file__), "plot_binary_data.py" )
max_workers = min( multiprocessing.cpu_count(), len(file_list) ) if file_list else 0
running = []
failed = []
for filename in file_list: for filename in file_list:
print(filename) print(filename)
plot_binary_data.plot_binary_data(filename, binary_outdir, figure_outdir) proc = subprocess.Popen(
[sys.executable, script, filename, binary_outdir, figure_outdir],
)
running.append( (proc, filename) )
## Keep at most max_workers subprocesses active at a time
if len(running) >= max_workers:
p, fn = running.pop(0)
p.wait()
if p.returncode != 0:
failed.append(fn)
## Wait for all remaining subprocesses to finish
for p, fn in running:
p.wait()
if p.returncode != 0:
failed.append(fn)
if failed:
print( " WARNING: the following binary data plots failed:" )
for fn in failed:
print( " ", fn )
print( ) print( )
print( " Binary Data Plot Has been Finished " ) print( " Binary Data Plot Has been Finished " )