Compare commits

..

29 Commits

Author SHA1 Message Date
5d8dfaf679 Add plot-only restart script to skip recomputation when plotting is interrupted
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-12 15:01:25 +08:00
24f4a45097 Fix macrodef.h include and clean up stale z4c_gpu_rhs_ss.cu
Include macrodef.h (not macrodef.fh) in gpu_rhsSS_mem.h and
bssn_gpu.h so that ABEtype is visible to #if guards in CUDA files.
Remove the separate z4c_gpu_rhs_ss.cu (merged into bssn_gpu_rhs_ss.cu).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 20:02:35 +08:00
f16469ea77 Simplify Z4C Shell GPU: CPU-side trKd+TZ_rhs wrapper
Replace the duplicated z4c_gpu_rhs_ss.cu with a lightweight
gpu_rhs_z4c_ss wrapper inside bssn_gpu_rhs_ss.cu (guarded by
#if ABEtype==2). The wrapper:
1. Builds trKd = trK + 2*TZ on host and passes it to gpu_rhs_ss
2. After BSSN GPU returns, computes TZ_rhs = alpn1*Hcon/2 and
   applies kappa1/kappa2 constraint damping on CPU

This avoids duplicate kernel definitions (linker errors) and
keeps all shell GPU code in a single file. The CPU-side Z4C
corrections are O(100K) operations — negligible vs GPU RHS time.

Also remove the separate z4c_gpu_rhs_ss.cu and its build rule.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 16:05:56 +08:00
f754aa1ec2 Add Z4C Shell-Patch GPU acceleration (Phase 3 complete)
Create z4c_gpu_rhs_ss.cu (reusing BSSN shell FD/chain-rule kernels):
- Uploads trKd = trK + 2*TZ to GPU so existing BSSN algebraic kernels
  compute correct Z4C physical equations without modification
- New kern_z4c_post applies TZ_rhs = alpn1 * Hcon / 2, kappa1/kappa2
  constraint damping, TZ advection (lopsided), and dissipation (kodis)
- Adds TZ/TZ_rhs to Meta struct, alloc/upload/download/free lifecycle

Add cuda_compute_rhs_z4c_ss() wrapper in Z4c_class.C matching the
Fortran f_compute_rhs_Z4c_ss signature, with #define redirection for
Step/SHStep call sites and #undef before analysis functions.

Add z4c_gpu_rhs_ss.o to ABE_CUDA_CFILES and build rule in makefile.
Add kappa1_c/kappa2_c constants to gpu_rhsSS_mem.h.

Build verified with USE_CUDA_Z4C=1 + WithShell — compiles and links
cleanly. All three Shell GPU files now coexist: bssn_gpu_rhs_ss.o
(BSSN), z4c_gpu_rhs_ss.o (Z4C), both sharing FD/chain-rule kernels.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 13:52:48 +08:00
c4194214c6 Enable Z4C + Shell-Patch GPU coexistence (Phase 3)
Remove the compile-time #error that blocked USE_CUDA_Z4C + WithShell.
Add GPU-to-CPU state sync at the start of both Z4C Step functions
(non-CPBC and CPBC) so shell CPU consumers read valid field data
after Cartesian GPU RHS with resident state.

Move bssn_cuda_use_resident_sync and bssn_cuda_download_level_state
_if_present from anonymous namespace to file scope in bssn_class.C
so derived classes (Z4C) can call them. Declare both in
bssn_rhs_cuda.h. Include bssn_rhs_cuda.h in Z4c_class.C.

Z4C shell RHS remains on CPU (Fortran Z4c_rhs_ss.f90) pending
future GPU kernel implementation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 12:08:02 +08:00
0ca86afd41 Use static OpenMP schedule in ShellPatch::setupintintstuff
Static scheduling has lower overhead than guided for uniform workloads
(grid points all have equal computational cost).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 02:23:07 +08:00
f5bf3ab252 Add thread-safe ShellPatch::setupintintstuff with OpenMP
Split prolongpointstru into search-only (prolongpointstru_search) and
append-only (prolongpointstru_append) functions. The search is read-only
and thread-safe; each thread builds private linked lists via
prolongpointstru_append, merged after the parallel loop.

This eliminates critical-section contention and delivers ~2.2x speedup:
setupintintstuff: 511s -> 252s, total init: 592s -> 267s.

Also add -qopenmp to ShellPatch.o compilation via makefile override rule
and <omp.h> include with _OPENMP guards + fallback stubs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 02:10:20 +08:00
d0d3f965a6 Add diagnostic timing to Shell-Patch initialization
Print MPI_Wtime breakdown of Initialize() shell setup steps and
Read_Ansorg::Compute_Constraint duration. Reveals that
ShellPatch::setupintintstuff() takes ~511s of the ~590s startup.

The function builds interpolation tables by searching every shell
grid point against all Cartesian patches — thread-safe OpenMP
parallelization is blocked by shared linked-list mutations in
prolongpointstru(), which would need a search/append split first.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 21:51:07 +08:00
fbb2ed112d Fix Compile_Constraint/analysis use CPU Fortran for shell RHS
Limit GPU shell RHS redirection to Step and SHStep only via #define/#undef.
Compute_Constraint, Interp_Constraint, and Constraint_Out continue using
the CPU Fortran path to avoid GPU alloc-per-call overhead during
initialization and analysis phases.

Also: wrap compare_result_gpu in #ifdef RESULT_CHECK to avoid link error.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 19:25:45 +08:00
bd4ce3fbf3 GPU-accelerate Shell-Patch BSSN evolution
Phase 1: Enable GPU resident state for Cartesian patches in Shell mode.
- Remove WithShell guard from bssn_cuda_use_resident_sync().
- Add GPU-to-CPU state sync before shell CPU consumers (SHStep,
  CS_Inter, inline shell RHS blocks).

Phase 2: GPU-accelerate BSSN Shell Patch RHS.
- Create bssn_gpu.h with RHS_SS_PARA macro and gpu_rhs_ss declaration.
- Fix compilation bugs in legacy bssn_gpu_rhs_ss.cu (deprecated
  cudaThreadSynchronize, tmp_con2 redeclaration, ijkmin3_h typo,
  CUDA_SAFE_CALL, missing compare_result guard).
- Add bssn_gpu_rhs_ss.o to CFILES_CUDA_BSSN with build rule.
- Write cuda_compute_rhs_bssn_ss() wrapper bridging Fortran and GPU
  parameter conventions, redirect all shell RHS call sites via #define.

Verified: 30-step Shell-Patch GPU run completes without errors/NaN.
Step wall time ~4.4s (step_fn ~2.0s + RP ~0.68s + constraint ~0.70s).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 18:50:10 +08:00
5eb49949d9 Fix AHF crash under CUDA resident-sync mode
Download BSSN StateList from GPU to CPU before AHFinderDirect_find_horizons
so that AH_Interp_Points reads valid field data instead of stale CPU arrays.
The resident-sync path keeps canonical state on GPU; without this download the
Newton iteration diverges and probes outside the computational domain.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 16:11:56 +08:00
39450228f5 Accelerate Shell-Patch interpolation fast paths 2026-05-08 13:26:16 +08:00
063f28b3b4 Add Shell-Patch GPU runtime fast paths 2026-05-08 09:26:36 +08:00
1064a68d16 Optimize BSSN-EM 8th-order AMR transfers 2026-05-07 21:38:16 +08:00
dcc83bafcb Support 2nd and 8th order CUDA AMR paths 2026-05-07 20:31:26 +08:00
c4d8d41b25 Cover Z4C CUDA AMR restrict prolong 2026-05-07 19:49:09 +08:00
0076b3ca18 Optimize 6th-order CUDA AMR stencils 2026-05-07 19:22:37 +08:00
9ff2f065be Apply BSSN AMR sync default to EScalar 2026-05-07 17:12:33 +08:00
2317e4abde Fix BSSN GPU resident AMR sync default 2026-05-07 17:11:09 +08:00
fea2dcc0d5 Fix BSSN-EM runtime crash 2026-05-07 16:47:55 +08:00
5525465cad Support CUDA finite-difference order selection 2026-05-07 16:28:02 +08:00
96829d0441 Optimize Z4C GPU runtime defaults 2026-05-07 15:37:09 +08:00
83afaf19ce Skip zero EM resident downloads 2026-05-07 13:04:46 +08:00
cb911dec06 Add EM GPU fast paths and defaults 2026-05-07 12:18:56 +08:00
dd0e20d8c7 Fix BSSN-EScalar CUDA boundary and scalar KO 2026-05-06 15:44:35 +08:00
ffa0d801ed Default Python GPU runner to EScalar fast path 2026-05-06 00:12:46 +08:00
ae64a22178 Complete BSSN-EScalar CUDA resident transfers 2026-05-05 23:57:42 +08:00
85fe29cc2e Optimize BSSN-EScalar CUDA path 2026-05-05 10:47:46 +08:00
06f62dee36 Switch back to Intel toolchain as the default option
Seems that Intel MPI also supports CUDA-aware by setting I_MPI_OFFLOAD to 1. Besides, I_MPI_OFFLOAD_IPC=0 is needed to avoid segfaults.
2026-05-01 21:59:13 +08:00
30 changed files with 12429 additions and 10257 deletions

View File

@@ -31,7 +31,7 @@ GPU_Part = 0.0
## Setting the physical system and numerical method
Symmetry = "equatorial-symmetry" ## Symmetry of System: choose equatorial-symmetry、no-symmetry、octant-symmetry
Equation_Class = "BSSN-EScalar" ## Evolution Equation: choose "BSSN", "BSSN-EScalar", "BSSN-EM", "Z4C"
Equation_Class = "BSSN" ## Evolution Equation: choose "BSSN", "BSSN-EScalar", "BSSN-EM", "Z4C"
## If "BSSN-EScalar" is chosen, it is necessary to set other parameters below
Initial_Data_Method = "Ansorg-TwoPuncture" ## initial data method: choose "Ansorg-TwoPuncture", "Lousto-Analytical", "Cao-Analytical", "KerrSchild-Analytical"
Time_Evolution_Method = "runge-kutta-45" ## time evolution method: choose "runge-kutta-45"
@@ -158,7 +158,7 @@ Detector_Rmax = 160.0 ## farest dector distance
## Setting the apprent horizon
AHF_Find = "no" ## whether to find the apparent horizon: choose "yes" or "no"
AHF_Find = "yes" ## whether to find the apparent horizon: choose "yes" or "no"
AHF_Find_Every = 24
AHF_Dump_Time = 20.0

100
AMSS_NCKU_Program_Plot.py Normal file
View File

@@ -0,0 +1,100 @@
##################################################################
##
## AMSS-NCKU Plot-Only Restart Script
## Author: Xiaoqu / Claude
## 2026/05/12
##
## This script checks for existing output data from AMSS_NCKU_Program.py.
## If data exists, it skips all computation and goes directly to plotting,
## saving time when plotting was interrupted.
## If no data is found, it exits with a message.
##
##################################################################
## Guard against re-execution by multiprocessing child processes.
if __name__ != '__main__':
import sys as _sys
_sys.exit(0)
import os
import sys
import AMSS_NCKU_Input as input_data
##################################################################
## Construct paths from input configuration
File_directory = os.path.join(input_data.File_directory)
output_directory = os.path.join(File_directory, "AMSS_NCKU_output")
binary_results_directory = os.path.join(output_directory, input_data.Output_directory)
figure_directory = os.path.join(File_directory, "figure")
##################################################################
## Check whether the required output data files exist
required_files = [
os.path.join(binary_results_directory, "bssn_BH.dat"),
os.path.join(binary_results_directory, "bssn_ADMQs.dat"),
os.path.join(binary_results_directory, "bssn_psi4.dat"),
os.path.join(binary_results_directory, "bssn_constraint.dat"),
]
missing_files = [f for f in required_files if not os.path.exists(f)]
if missing_files:
print(" No existing AMSS_NCKU_Program.py output data found. ")
print(" The following required files are missing: ")
for f in missing_files:
print(f" {f}")
print()
print(" Please run AMSS_NCKU_Program.py first to generate the simulation data. ")
print(" Exiting. ")
sys.exit(1)
print(" Found existing AMSS_NCKU_Program.py output data. " )
print(" Skipping all computation and going directly to plotting. " )
print()
## Ensure the figure directory exists (it should, but be safe)
os.makedirs(figure_directory, exist_ok=True)
##################################################################
## Plot the AMSS-NCKU program results
import plot_xiaoqu
import plot_GW_strain_amplitude_xiaoqu
from parallel_plot_helper import run_plot_tasks_parallel
plot_tasks = []
## Plot black hole trajectory
plot_tasks.append((plot_xiaoqu.generate_puncture_orbit_plot, (binary_results_directory, figure_directory)))
plot_tasks.append((plot_xiaoqu.generate_puncture_orbit_plot3D, (binary_results_directory, figure_directory)))
## Plot black hole separation vs. time
plot_tasks.append((plot_xiaoqu.generate_puncture_distence_plot, (binary_results_directory, figure_directory)))
## Plot gravitational waveforms (psi4 and strain amplitude)
for i in range(input_data.Detector_Number):
plot_tasks.append((plot_xiaoqu.generate_gravitational_wave_psi4_plot, (binary_results_directory, figure_directory, i)))
plot_tasks.append((plot_GW_strain_amplitude_xiaoqu.generate_gravitational_wave_amplitude_plot, (binary_results_directory, figure_directory, i)))
## Plot ADM mass evolution
for i in range(input_data.Detector_Number):
plot_tasks.append((plot_xiaoqu.generate_ADMmass_plot, (binary_results_directory, figure_directory, i)))
## Plot Hamiltonian constraint violation over time
for i in range(input_data.grid_level):
plot_tasks.append((plot_xiaoqu.generate_constraint_check_plot, (binary_results_directory, figure_directory, i)))
run_plot_tasks_parallel(plot_tasks)
## Plot stored binary data (runs serially, not in the parallel pool)
plot_xiaoqu.generate_binary_data_plot(binary_results_directory, figure_directory)
print()
print(" Plotting completed successfully. ")
print()

View File

@@ -198,16 +198,16 @@ int main(int argc, char *argv[])
if (myrank == 0)
{
string out_dir;
char filename[50];
string filename;
map<string, string>::iterator iter;
iter = parameters::str_par.find("output dir");
if (iter != parameters::str_par.end())
{
out_dir = iter->second;
}
sprintf(filename, "%s/setting.par", out_dir.c_str());
filename = out_dir + "/setting.par";
ofstream setfile;
setfile.open(filename, ios::trunc);
setfile.open(filename.c_str(), ios::trunc);
if (!setfile.good())
{
@@ -484,6 +484,10 @@ int main(int argc, char *argv[])
cout << endl;
}
// Let the process teardown reclaim the simulation object. Some derived
// equation classes keep MPI/CUDA-backed state whose destructor ordering
// is fragile at program shutdown.
if (getenv("AMSS_DELETE_ADM_ON_EXIT"))
delete ADM;
//=======================caculation done=============================================================

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@@ -102,6 +102,16 @@ public:
//-1: means no dumy dimension at all; 0: means rho; 1: means sigma
};
// Thread-safe search result (no pointers to shared mutable state)
struct PointSearchResult
{
bool found;
Block *Bg;
double gx, gy, gz; // global Cartesian coordinates
double lx, ly, lz; // local coordinates within the found block
int ssst; // source shell-patch type (-1 = Cartesian)
};
int myrank;
int shape[dim]; // for (rho, sigma, R), for rho and sigma means number of points for every pi/2
double Rrange[2]; // for Rmin and Rmax
@@ -175,6 +185,12 @@ public:
MyList<Patch> *Pp, double CDH[dim], MyList<pointstru> *pss);
bool prolongpointstru(MyList<pointstru> *&psul, bool ssyn, int tsst, MyList<ss_patch> *sPp, double DH[dim],
MyList<Patch> *Pp, double CDH[dim], double x, double y, double z, int Symmetry, int rank_in);
// Read-only point search — thread-safe (no shared mutable state modified)
PointSearchResult prolongpointstru_search(bool ssyn, int tsst, MyList<ss_patch> *sPp, double DH[dim],
MyList<Patch> *Pp, double CDH[dim], double x, double y, double z,
int Symmetry, int rank_in);
// Append a search result to a linked list — use inside omp critical section
void prolongpointstru_append(MyList<pointstru> *&psul, const PointSearchResult &sr, int tsst);
void setupintintstuff(int cpusize, MyList<Patch> *CPatL, int Symmetry);
void intertransfer(MyList<pointstru> **src, MyList<pointstru> **dst,
MyList<var> *VarList1 /* source */, MyList<var> *VarList2 /*target */,

View File

@@ -2,8 +2,8 @@
#ifdef newc
#include <sstream>
#include <cstdio>
#include <cstdlib>
#include <map>
#include <string>
using namespace std;
#else
#include <stdio.h>
@@ -36,6 +36,12 @@ using namespace std;
#if USE_CUDA_Z4C && (ABEtype == 2)
#include "z4c_rhs_cuda.h"
#endif
#if USE_CUDA_BSSN
#include "bssn_rhs_cuda.h"
#ifdef WithShell
#include "bssn_gpu.h"
#endif
#endif
#ifdef With_AHF
#include "derivatives.h"
@@ -46,6 +52,81 @@ using namespace std;
// Define Z4c_class
#if USE_CUDA_Z4C && (ABEtype == 2) && defined(WithShell)
// GPU-accelerated Z4C shell RHS: same parameter signature as f_compute_rhs_Z4c_ss.
// Internally calls gpu_rhs_z4c_ss which modifies trK→trKd before upload,
// runs BSSN algebraic kernels, then applies Z4C post-processing (TZ_rhs, damping).
extern "C" {
static int cuda_compute_rhs_z4c_ss(
int *ex, double &T, double *crho, double *sigma, double *R,
double *X, double *Y, double *Z,
double *drhodx, double *drhody, double *drhodz,
double *dsigmadx, double *dsigmady, double *dsigmadz,
double *dRdx, double *dRdy, double *dRdz,
double *drhodxx, double *drhodxy, double *drhodxz, double *drhodyy, double *drhodyz, double *drhodzz,
double *dsigmadxx, double *dsigmadxy, double *dsigmadxz, double *dsigmadyy, double *dsigmadyz, double *dsigmadzz,
double *dRdxx, double *dRdxy, double *dRdxz, double *dRdyy, double *dRdyz, double *dRdzz,
double *chi, double *trK,
double *gxx, double *gxy, double *gxz, double *gyy, double *gyz, double *gzz,
double *Axx, double *Axy, double *Axz, double *Ayy, double *Ayz, double *Azz,
double *Gamx, double *Gamy, double *Gamz,
double *Lap, double *betax, double *betay, double *betaz,
double *dtSfx, double *dtSfy, double *dtSfz,
double *TZ,
double *chi_rhs, double *trK_rhs,
double *gxx_rhs, double *gxy_rhs, double *gxz_rhs, double *gyy_rhs, double *gyz_rhs, double *gzz_rhs,
double *Axx_rhs, double *Axy_rhs, double *Axz_rhs, double *Ayy_rhs, double *Ayz_rhs, double *Azz_rhs,
double *Gamx_rhs, double *Gamy_rhs, double *Gamz_rhs,
double *Lap_rhs, double *betax_rhs, double *betay_rhs, double *betaz_rhs,
double *dtSfx_rhs, double *dtSfy_rhs, double *dtSfz_rhs,
double *TZ_rhs,
double *rho_mat, double *Sx, double *Sy, double *Sz,
double *Sxx, double *Sxy, double *Sxz, double *Syy, double *Syz, double *Szz,
double *Gamxxx, double *Gamxxy, double *Gamxxz, double *Gamxyy, double *Gamxyz, double *Gamxzz,
double *Gamyxx, double *Gamyxy, double *Gamyxz, double *Gamyyy, double *Gamyyz, double *Gamyzz,
double *Gamzxx, double *Gamzxy, double *Gamzxz, double *Gamzyy, double *Gamzyz, double *Gamzzz,
double *Rxx, double *Rxy, double *Rxz, double *Ryy, double *Ryz, double *Rzz,
double *ham_Res, double *movx_Res, double *movy_Res, double *movz_Res,
double *Gmx_Res, double *Gmy_Res, double *Gmz_Res,
int &Symmetry, int &Lev, double &eps, int &sst, int &co)
{
return gpu_rhs_z4c_ss(0, 0, // calledby=ABE_main, mpi_rank=device_0
ex, T, crho, sigma, R, X, Y, Z,
drhodx, drhody, drhodz,
dsigmadx, dsigmady, dsigmadz,
dRdx, dRdy, dRdz,
drhodxx, drhodxy, drhodxz, drhodyy, drhodyz, drhodzz,
dsigmadxx, dsigmadxy, dsigmadxz, dsigmadyy, dsigmadyz, dsigmadzz,
dRdxx, dRdxy, dRdxz, dRdyy, dRdyz, dRdzz,
chi, trK,
gxx, gxy, gxz, gyy, gyz, gzz,
Axx, Axy, Axz, Ayy, Ayz, Azz,
Gamx, Gamy, Gamz,
Lap, betax, betay, betaz,
dtSfx, dtSfy, dtSfz,
TZ,
chi_rhs, trK_rhs,
gxx_rhs, gxy_rhs, gxz_rhs, gyy_rhs, gyz_rhs, gzz_rhs,
Axx_rhs, Axy_rhs, Axz_rhs, Ayy_rhs, Ayz_rhs, Azz_rhs,
Gamx_rhs, Gamy_rhs, Gamz_rhs,
Lap_rhs, betax_rhs, betay_rhs, betaz_rhs,
dtSfx_rhs, dtSfy_rhs, dtSfz_rhs,
TZ_rhs,
rho_mat, Sx, Sy, Sz,
Sxx, Sxy, Sxz, Syy, Syz, Szz,
Gamxxx, Gamxxy, Gamxxz, Gamxyy, Gamxyz, Gamxzz,
Gamyxx, Gamyxy, Gamyxz, Gamyyy, Gamyyz, Gamyzz,
Gamzxx, Gamzxy, Gamzxz, Gamzyy, Gamzyz, Gamzzz,
Rxx, Rxy, Rxz, Ryy, Ryz, Rzz,
ham_Res, movx_Res, movy_Res, movz_Res,
Gmx_Res, Gmy_Res, Gmz_Res,
Symmetry, Lev, eps, sst, co);
}
}
// Redirect all Z4C shell RHS calls in Step/SHStep to GPU
#define f_compute_rhs_Z4c_ss cuda_compute_rhs_z4c_ss
#endif
// This class inherits some members and methods from the parent `bssn_class` and modifies others.
// The modified members and methods are defined below (and in the header Z4c_class.h).
// The remaining members/methods are inherited from `bssn_class` (declared in bssn_class.h).
@@ -127,8 +208,6 @@ void Z4c_class::Initialize()
CheckPoint->readcheck_sh(SH, myrank);
#endif
Initialize_Level_Runtime();
double h = GH->PatL[0]->data->blb->data->getdX(0);
for (int i = 1; i < dim; i++)
h = Mymin(h, GH->PatL[0]->data->blb->data->getdX(i));
@@ -143,6 +222,13 @@ void Z4c_class::Initialize()
PhysTime = StartTime;
Setup_Black_Hole_position();
}
sync_cache_pre = new Parallel::SyncCache[GH->levels];
sync_cache_cor = new Parallel::SyncCache[GH->levels];
sync_cache_rp_coarse = new Parallel::SyncCache[GH->levels];
sync_cache_rp_fine = new Parallel::SyncCache[GH->levels];
sync_cache_restrict = new Parallel::SyncCache[GH->levels];
sync_cache_outbd = new Parallel::SyncCache[GH->levels];
}
//================================================================================================
@@ -182,9 +268,6 @@ Z4c_class::~Z4c_class()
// for sommerfeld boundary
#if USE_CUDA_Z4C && (ABEtype == 2)
#ifdef WithShell
#error "USE_CUDA_Z4C resident path currently supports Cartesian non-shell Z4C only"
#endif
#if (MRBD == 2)
#error "USE_CUDA_Z4C resident path does not support MRBD == 2"
#endif
@@ -216,35 +299,6 @@ bool fill_z4c_cuda_views(Block *cg, MyList<var> *vars,
return idx == Z4C_CUDA_STATE_COUNT && vars == 0;
}
bool z4c_cuda_keep_resident_after_step(int lev, int trfls_in, int analysis_lev)
{
static int keep_all_levels = -1;
if (keep_all_levels < 0)
{
const char *env = getenv("AMSS_CUDA_KEEP_ALL_LEVELS");
keep_all_levels = (env && atoi(env) != 0) ? 1 : 0;
}
static int enabled = -1;
if (enabled < 0)
{
const char *env = getenv("AMSS_CUDA_Z4C_KEEP_RESIDENT_AFTER_STEP");
if (env)
enabled = (atoi(env) != 0) ? 1 : 0;
else
{
env = getenv("AMSS_CUDA_KEEP_RESIDENT_AFTER_STEP");
enabled = (env && atoi(env) != 0) ? 1 : 0;
}
}
if (!enabled)
return false;
if (lev == analysis_lev)
return false;
if (keep_all_levels)
return true;
return lev < trfls_in;
}
void z4c_cuda_download_level_state(MyList<Patch> *PatL, MyList<var> *vars, int myrank, bool release_ctx)
{
MyList<Patch> *Pp = PatL;
@@ -388,57 +442,41 @@ bool z4c_cuda_interp_bh_point_resident(MyList<Patch> *PatL,
if (z4c_cuda_has_resident_state(block) &&
block->shape[0] >= ordn && block->shape[1] >= ordn && block->shape[2] >= ordn)
{
const int sx = ordn;
const int sy = ordn;
const int sz = ordn;
const int region_all = sx * sy * sz;
const int i0 = z4c_cuda_interp_tile_start(block->X[0], block->shape[0], x, DH[0], ordn);
const int j0 = z4c_cuda_interp_tile_start(block->X[1], block->shape[1], y, DH[1], ordn);
const int k0 = z4c_cuda_interp_tile_start(block->X[2], block->shape[2], z, DH[2], ordn);
double *packed_fields = new double[3 * region_all];
var *vars[3] = {forx, fory, forz};
static int use_device_bh_interp = -1;
if (use_device_bh_interp < 0)
{
const char *env = getenv("AMSS_CUDA_Z4C_BH_INTERP_DEVICE");
use_device_bh_interp = (env && atoi(env) != 0) ? 1 : 0;
}
bool used_device_interp = false;
if (use_device_bh_interp)
{
double soa3[9];
for (int f = 0; f < 3; f++)
{
soa3[3 * f + 0] = vars[f]->SoA[0];
soa3[3 * f + 1] = vars[f]->SoA[1];
soa3[3 * f + 2] = vars[f]->SoA[2];
}
used_device_interp =
(z4c_cuda_interp_state_point3(block, block->shape,
k_z4c_cuda_bh_state_indices[0],
k_z4c_cuda_bh_state_indices[1],
k_z4c_cuda_bh_state_indices[2],
block->X[0][0], block->X[1][0], block->X[2][0],
DH[0], DH[1], DH[2],
x, y, z,
interp_ordn, interp_sym,
soa3, shellf) == 0);
}
if (!used_device_interp)
if (z4c_cuda_pack_state_region_to_host_buffer(block,
k_z4c_cuda_bh_state_indices[f],
packed_fields + f * region_all,
block->shape,
i0, j0, k0,
sx, sy, sz) != 0)
{
double *shift_views[3] = {
block->fgfs[forx->sgfn],
block->fgfs[fory->sgfn],
block->fgfs[forz->sgfn]};
if (z4c_cuda_download_state_subset(block, block->shape, 3,
k_z4c_cuda_bh_state_indices,
shift_views) != 0)
{
cout << "CUDA Z4C BH shift download failed" << endl;
delete[] packed_fields;
cout << "CUDA Z4C BH tile download failed" << endl;
MPI_Abort(MPI_COMM_WORLD, 1);
}
f_global_interp(block->shape, block->X[0], block->X[1], block->X[2],
block->fgfs[forx->sgfn], shellf[0],
x, y, z, interp_ordn, forx->SoA, interp_sym);
f_global_interp(block->shape, block->X[0], block->X[1], block->X[2],
block->fgfs[fory->sgfn], shellf[1],
x, y, z, interp_ordn, fory->SoA, interp_sym);
f_global_interp(block->shape, block->X[0], block->X[1], block->X[2],
block->fgfs[forz->sgfn], shellf[2],
x, y, z, interp_ordn, forz->SoA, interp_sym);
int tile_shape[3] = {sx, sy, sz};
f_global_interp(tile_shape,
block->X[0] + i0,
block->X[1] + j0,
block->X[2] + k0,
packed_fields + f * region_all,
shellf[f],
x, y, z,
interp_ordn,
vars[f]->SoA,
interp_sym);
}
delete[] packed_fields;
}
else
{
@@ -500,115 +538,15 @@ bool z4c_cuda_compute_porg_rhs_resident(cgh *GH,
return true;
}
bool z4c_cuda_download_bh_shift_level(MyList<Patch> *PatL,
int myrank,
var *forx, var *fory, var *forz)
bool z4c_cuda_resident_step_enabled()
{
MyList<Patch> *Pp = PatL;
while (Pp)
static int enabled = -1;
if (enabled < 0)
{
MyList<Block> *BP = Pp->data->blb;
while (BP)
{
Block *cg = BP->data;
if (myrank == cg->rank && z4c_cuda_has_resident_state(cg))
{
double *fields[3] = {
cg->fgfs[forx->sgfn],
cg->fgfs[fory->sgfn],
cg->fgfs[forz->sgfn]};
if (z4c_cuda_download_state_subset(cg, cg->shape, 3,
k_z4c_cuda_bh_state_indices,
fields))
return false;
const char *env = getenv("AMSS_Z4C_CUDA_RESIDENT");
enabled = (env && atoi(env) != 0) ? 1 : 0;
}
if (BP == Pp->data->ble)
break;
BP = BP->next;
}
Pp = Pp->next;
}
return true;
}
bool z4c_cuda_refresh_constraint_level(MyList<Patch> *PatL,
int myrank,
var *Cons_Ham, var *Cons_Px,
var *Cons_Py, var *Cons_Pz,
var *Cons_Gx, var *Cons_Gy,
var *Cons_Gz, var *TZ0,
int Symmetry, int lev, double eps)
{
bool all_resident = true;
const int tz_index = 24;
MyList<Patch> *Pp = PatL;
while (Pp)
{
MyList<Block> *BP = Pp->data->blb;
while (BP)
{
Block *cg = BP->data;
if (myrank == cg->rank)
{
if (!z4c_cuda_has_resident_state(cg))
{
all_resident = false;
}
else
{
double *constraints[7] = {
cg->fgfs[Cons_Ham->sgfn], cg->fgfs[Cons_Px->sgfn],
cg->fgfs[Cons_Py->sgfn], cg->fgfs[Cons_Pz->sgfn],
cg->fgfs[Cons_Gx->sgfn], cg->fgfs[Cons_Gy->sgfn],
cg->fgfs[Cons_Gz->sgfn]};
double *tz_out[1] = {cg->fgfs[TZ0->sgfn]};
int co = 0;
if (z4c_cuda_compute_constraints_resident(cg, cg->shape,
cg->X[0], cg->X[1], cg->X[2],
Symmetry, eps, co,
constraints) ||
z4c_cuda_download_state_subset(cg, cg->shape, 1, &tz_index, tz_out))
{
cout << "CUDA Z4C resident constraint refresh failed" << endl;
MPI_Abort(MPI_COMM_WORLD, 1);
}
}
}
if (BP == Pp->data->ble)
break;
BP = BP->next;
}
Pp = Pp->next;
}
return all_resident;
}
long long &z4c_constraint_output_counter()
{
static long long counter = 0;
return counter;
}
int z4c_constraint_output_every()
{
static int every = -1;
if (every < 0)
{
const char *env = getenv("AMSS_CUDA_Z4C_CONSTRAINT_EVERY");
every = (env && atoi(env) > 0) ? atoi(env) : 1;
}
return every;
}
bool z4c_constraint_output_due_now()
{
const int every = z4c_constraint_output_every();
return every <= 1 || (z4c_constraint_output_counter() % every) == 0;
}
void z4c_constraint_output_advance()
{
z4c_constraint_output_counter()++;
return enabled != 0;
}
} // namespace
@@ -629,34 +567,14 @@ void Z4c_class::Step(int lev, int YN)
int iter_count = 0;
int pre = 0, cor = 1;
int ERROR = 0;
const double dT_mon = dT * pow(0.5, Mymax(0, trfls));
const bool need_constraint_after_step =
(LastConsOut + dT_mon >= AnasTime) && z4c_constraint_output_due_now();
if (BH_num > 0 && lev == GH->levels - 1)
#ifdef WithShell
if (bssn_cuda_use_resident_sync(lev))
{
if (!z4c_cuda_download_bh_shift_level(GH->PatL[lev], myrank, Sfx0, Sfy0, Sfz0))
{
if (myrank == 0 && ErrorMonitor->outfile)
ErrorMonitor->outfile << "CUDA Z4C failed to download predictor black-hole shift at t = "
<< PhysTime << endl;
MPI_Abort(MPI_COMM_WORLD, 1);
}
compute_Porg_rhs(Porg0, Porg_rhs, Sfx0, Sfy0, Sfz0, lev);
for (int ithBH = 0; ithBH < BH_num; ithBH++)
{
f_rungekutta4_scalar(dT_lev, Porg0[ithBH][0], Porg[ithBH][0], Porg_rhs[ithBH][0], iter_count);
f_rungekutta4_scalar(dT_lev, Porg0[ithBH][1], Porg[ithBH][1], Porg_rhs[ithBH][1], iter_count);
f_rungekutta4_scalar(dT_lev, Porg0[ithBH][2], Porg[ithBH][2], Porg_rhs[ithBH][2], iter_count);
if (Symmetry > 0)
Porg[ithBH][2] = fabs(Porg[ithBH][2]);
if (Symmetry == 2)
{
Porg[ithBH][0] = fabs(Porg[ithBH][0]);
Porg[ithBH][1] = fabs(Porg[ithBH][1]);
}
}
for (int dl = 0; dl < GH->levels; dl++)
bssn_cuda_download_level_state_if_present(GH->PatL[dl], StateList, myrank);
}
#endif
MyList<Patch> *Pp = GH->PatL[lev];
while (Pp)
@@ -685,7 +603,7 @@ void Z4c_class::Step(int lev, int YN)
#elif (MRBD == 1)
apply_bam_bc = 1;
#endif
int keep_resident_state = 1;
int keep_resident_state = z4c_cuda_resident_step_enabled() ? 1 : 0;
int apply_enforce_ga = 0;
#if (AGM == 0)
apply_enforce_ga = 1;
@@ -724,10 +642,24 @@ void Z4c_class::Step(int lev, int YN)
MPI_Abort(MPI_COMM_WORLD, 1);
}
Parallel::Sync_cached(GH->PatL[lev], SynchList_pre, Symmetry, sync_cache_pre[lev]);
if (BH_num > 0 && lev == GH->levels - 1)
{
Parallel::AsyncSyncState async_pre;
Parallel::Sync_start(GH->PatL[lev], SynchList_pre, Symmetry, sync_cache_pre[lev], async_pre);
Parallel::Sync_finish(sync_cache_pre[lev], async_pre, SynchList_pre, Symmetry);
compute_Porg_rhs(Porg0, Porg_rhs, Sfx0, Sfy0, Sfz0, lev);
for (int ithBH = 0; ithBH < BH_num; ithBH++)
{
f_rungekutta4_scalar(dT_lev, Porg0[ithBH][0], Porg[ithBH][0], Porg_rhs[ithBH][0], iter_count);
f_rungekutta4_scalar(dT_lev, Porg0[ithBH][1], Porg[ithBH][1], Porg_rhs[ithBH][1], iter_count);
f_rungekutta4_scalar(dT_lev, Porg0[ithBH][2], Porg[ithBH][2], Porg_rhs[ithBH][2], iter_count);
if (Symmetry > 0)
Porg[ithBH][2] = fabs(Porg[ithBH][2]);
if (Symmetry == 2)
{
Porg[ithBH][0] = fabs(Porg[ithBH][0]);
Porg[ithBH][1] = fabs(Porg[ithBH][1]);
}
}
}
if ((lev == a_lev) && (LastAnas + dT_lev >= AnasTime))
@@ -766,7 +698,7 @@ void Z4c_class::Step(int lev, int YN)
#elif (MRBD == 1)
apply_bam_bc = 1;
#endif
int keep_resident_state = 1;
int keep_resident_state = z4c_cuda_resident_step_enabled() ? 1 : 0;
int apply_enforce_ga = 0;
#if (AGM == 0)
apply_enforce_ga = 1;
@@ -787,25 +719,6 @@ void Z4c_class::Step(int lev, int YN)
<< cg->bbox[2] << ":" << cg->bbox[5] << ")" << endl;
ERROR = 1;
}
if (!ERROR && iter_count == 3 && need_constraint_after_step)
{
double *constraints[7] = {
cg->fgfs[Cons_Ham->sgfn], cg->fgfs[Cons_Px->sgfn],
cg->fgfs[Cons_Py->sgfn], cg->fgfs[Cons_Pz->sgfn],
cg->fgfs[Cons_Gx->sgfn], cg->fgfs[Cons_Gy->sgfn],
cg->fgfs[Cons_Gz->sgfn]};
double *tz_out[1] = {cg->fgfs[TZ0->sgfn]};
const int tz_index = 24;
if (z4c_cuda_download_constraint_outputs(cg->shape, constraints) ||
z4c_cuda_download_state_subset(cg, cg->shape, 1, &tz_index, tz_out))
{
cout << "CUDA Z4C constraint download failed in domain: ("
<< cg->bbox[0] << ":" << cg->bbox[3] << ","
<< cg->bbox[1] << ":" << cg->bbox[4] << ","
<< cg->bbox[2] << ":" << cg->bbox[5] << ")" << endl;
ERROR = 1;
}
}
}
if (BP == Pp->data->ble)
break;
@@ -827,13 +740,11 @@ void Z4c_class::Step(int lev, int YN)
MPI_Abort(MPI_COMM_WORLD, 1);
}
{
Parallel::AsyncSyncState async_cor;
Parallel::Sync_start(GH->PatL[lev], SynchList_cor, Symmetry, sync_cache_cor[lev], async_cor);
Parallel::Sync_finish(sync_cache_cor[lev], async_cor, SynchList_cor, Symmetry);
}
Parallel::Sync_cached(GH->PatL[lev], SynchList_cor, Symmetry, sync_cache_cor[lev]);
if (BH_num > 0 && lev == GH->levels - 1)
{
if (z4c_cuda_resident_step_enabled())
{
if (!z4c_cuda_compute_porg_rhs_resident(GH, lev, myrank, BH_num,
Porg, Porg1,
@@ -844,6 +755,11 @@ void Z4c_class::Step(int lev, int YN)
<< PhysTime << endl;
MPI_Abort(MPI_COMM_WORLD, 1);
}
}
else
{
compute_Porg_rhs(Porg, Porg1, Sfx, Sfy, Sfz, lev);
}
for (int ithBH = 0; ithBH < BH_num; ithBH++)
{
f_rungekutta4_scalar(dT_lev, Porg0[ithBH][0], Porg1[ithBH][0], Porg_rhs[ithBH][0], iter_count);
@@ -887,13 +803,7 @@ void Z4c_class::Step(int lev, int YN)
}
}
{
const bool keep_resident = z4c_cuda_keep_resident_after_step(lev, trfls, a_lev);
const bool need_host_after_step =
((lev == a_lev) && (LastAnas + dT_lev >= AnasTime));
if (!keep_resident || need_host_after_step)
z4c_cuda_download_level_state(GH->PatL[lev], SynchList_cor, myrank, !keep_resident);
}
z4c_cuda_download_level_state(GH->PatL[lev], SynchList_cor, myrank, false);
#if (RPS == 0)
RestrictProlong(lev, YN, BB);
@@ -1091,6 +1001,13 @@ void Z4c_class::Step(int lev, int YN)
}
#ifdef WithShell
#if USE_CUDA_Z4C
if (bssn_cuda_use_resident_sync(lev))
{
for (int dl = 0; dl < GH->levels; dl++)
bssn_cuda_download_level_state_if_present(GH->PatL[dl], StateList, myrank);
}
#endif
// evolve Shell Patches
if (lev == 0)
{
@@ -1798,9 +1715,7 @@ void Z4c_class::Step(int lev, int YN)
}
#else
// for constraint preserving boundary (CPBC)
#if USE_CUDA_Z4C && (ABEtype == 2)
#error "USE_CUDA_Z4C resident path does not support CPBC"
#endif
// Note: CPBC path uses CPU Fortran RHS; GPU resident sync is a no-op here.
#ifndef WithShell
#error "CPBC only supports Shell"
#endif
@@ -1830,6 +1745,14 @@ void Z4c_class::Step(int lev, int YN)
int pre = 0, cor = 1;
int ERROR = 0;
#if USE_CUDA_Z4C && defined(WithShell)
if (bssn_cuda_use_resident_sync(lev))
{
for (int dl = 0; dl < GH->levels; dl++)
bssn_cuda_download_level_state_if_present(GH->PatL[dl], StateList, myrank);
}
#endif
MyList<ss_patch> *sPp;
// Predictor
MyList<Patch> *Pp = GH->PatL[lev];
@@ -3160,6 +3083,11 @@ void Z4c_class::Check_extrop()
//================================================================================================
#if USE_CUDA_Z4C && (ABEtype == 2) && defined(WithShell)
#undef f_compute_rhs_Z4c_ss
#define f_compute_rhs_Z4c_ss compute_rhs_z4c_ss_
#endif
// this member function is used to compute and output constraint violation
//================================================================================================
@@ -3172,13 +3100,7 @@ void Z4c_class::Constraint_Out()
if (LastConsOut >= AnasTime)
// Constraint violation
{
#if USE_CUDA_Z4C && (ABEtype == 2)
bool cuda_constraints_ready = true;
#else
const bool cuda_constraints_ready = false;
#endif
// recompute least the constraint data lost for moved new grid
if (!cuda_constraints_ready)
for (int lev = 0; lev < GH->levels; lev++)
{
// make sure the data consistent for higher levels
@@ -3441,11 +3363,12 @@ void Z4c_class::Interp_Constraint()
}
ofstream outfile;
char filename[50];
sprintf(filename, "%s/interp_constraint_%05d.dat", ErrorMonitor->out_dir.c_str(), int(PhysTime / dT + 0.5));
char suffix[64];
sprintf(suffix, "/interp_constraint_%05d.dat", int(PhysTime / dT + 0.5));
string filename = ErrorMonitor->out_dir + suffix;
// 0.5 for round off
outfile.open(filename);
outfile.open(filename.c_str());
outfile << "# corrdinate, H_Res, Px_Res, Py_Res, Pz_Res, Gx_Res, Gy_Res, Gz_Res, ...." << endl;
for (int i = 0; i < n; i++)
{

View File

@@ -2,7 +2,9 @@
#ifdef newc
#include <sstream>
#include <cstdio>
#include <cstdlib>
#include <map>
#include <string>
using namespace std;
#else
#include <stdio.h>
@@ -10,6 +12,7 @@ using namespace std;
#endif
#include <time.h>
#include <cstring>
#include "macrodef.h"
#include "misc.h"
@@ -18,10 +21,10 @@ using namespace std;
#include "Parallel.h"
#include "bssnEM_class.h"
#include "bssn_rhs.h"
#include "empart.h"
#if USE_CUDA_BSSN
#include "bssn_rhs_cuda.h"
#endif
#include "empart.h"
#include "initial_puncture.h"
#include "initial_maxwell.h"
#include "enforce_algebra.h"
@@ -39,13 +42,212 @@ using namespace std;
//================================================================================================
#if USE_CUDA_BSSN
namespace {
namespace
{
MyList<var> *advance_var_list(MyList<var> *vars, int count)
{
while (vars && count > 0)
{
vars = vars->next;
--count;
}
return vars;
}
bool fill_bssn_cuda_views_prefix(Block *cg, MyList<var> *vars,
bool bssn_em_step_timing_enabled()
{
static int enabled = -1;
if (enabled < 0)
{
const char *env = getenv("AMSS_EM_STEP_TIMING");
enabled = (env && atoi(env) != 0) ? 1 : 0;
}
return enabled != 0;
}
bool bssn_em_step_timing_all_levels_enabled()
{
static int enabled = -1;
if (enabled < 0)
{
const char *env = getenv("AMSS_EM_STEP_TIMING_ALL_LEVELS");
enabled = (env && atoi(env) != 0) ? 1 : 0;
}
return enabled != 0;
}
#if USE_CUDA_BSSN
bool bssn_em_zero_analysis_fastpath_enabled()
{
static int enabled = -1;
if (enabled < 0)
{
const char *env = getenv("AMSS_EM_ZERO_ANALYSIS_FASTPATH");
enabled = (!env || atoi(env) != 0) ? 1 : 0;
}
return enabled != 0;
}
bool bssn_em_zero_resident_download_fastpath_enabled()
{
static int enabled = -1;
if (enabled < 0)
{
const char *env = getenv("AMSS_EM_ZERO_RESIDENT_DOWNLOAD_FASTPATH");
enabled = (!env || atoi(env) != 0) ? 1 : 0;
}
return enabled != 0;
}
bool bssn_em_resident_zero_fastpath_ready(MyList<Patch> *PatL,
#ifdef WithShell
ShellPatch *shell,
#else
ShellPatch * /*shell*/,
#endif
int rank)
{
int local_ok = 1;
int local_seen = 0;
MyList<Patch> *Pp = PatL;
while (Pp)
{
MyList<Block> *BP = Pp->data->blb;
while (BP)
{
Block *cg = BP->data;
if (rank == cg->rank)
{
local_seen = 1;
if (!bssn_em_cuda_resident_zero_fast_state(cg))
local_ok = 0;
}
if (BP == Pp->data->ble)
break;
BP = BP->next;
}
Pp = Pp->next;
}
#ifdef WithShell
if (shell && shell->PatL)
{
MyList<ss_patch> *SP = shell->PatL;
while (SP)
{
MyList<Block> *BP = SP->data->blb;
while (BP)
{
Block *cg = BP->data;
if (rank == cg->rank)
{
local_seen = 1;
if (!bssn_em_cuda_resident_zero_fast_state(cg))
local_ok = 0;
}
if (BP == SP->data->ble)
break;
BP = BP->next;
}
SP = SP->next;
}
}
#endif
int global_ok = 0;
int global_seen = 0;
MPI_Allreduce(&local_ok, &global_ok, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
MPI_Allreduce(&local_seen, &global_seen, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
return global_seen && global_ok;
}
bool bssn_em_analysis_zero_fastpath_ready(MyList<Patch> *PatL,
#ifdef WithShell
ShellPatch *shell,
#else
ShellPatch *shell,
#endif
int rank)
{
if (!bssn_em_zero_analysis_fastpath_enabled())
return false;
return bssn_em_resident_zero_fastpath_ready(PatL, shell, rank);
}
void zero_em_analysis_outputs(MyList<Patch> *PatL,
#ifdef WithShell
ShellPatch *shell,
#else
ShellPatch * /*shell*/,
#endif
int rank,
var *Rphi2_var, var *Iphi2_var,
var *Rphi1_var, var *Iphi1_var)
{
MyList<Patch> *Pp = PatL;
while (Pp)
{
MyList<Block> *BP = Pp->data->blb;
while (BP)
{
Block *cg = BP->data;
if (rank == cg->rank)
{
const size_t all = (size_t)cg->shape[0] * cg->shape[1] * cg->shape[2];
std::memset(cg->fgfs[Rphi2_var->sgfn], 0, all * sizeof(double));
std::memset(cg->fgfs[Iphi2_var->sgfn], 0, all * sizeof(double));
std::memset(cg->fgfs[Rphi1_var->sgfn], 0, all * sizeof(double));
std::memset(cg->fgfs[Iphi1_var->sgfn], 0, all * sizeof(double));
}
if (BP == Pp->data->ble)
break;
BP = BP->next;
}
Pp = Pp->next;
}
#ifdef WithShell
if (shell && shell->PatL)
{
MyList<ss_patch> *SP = shell->PatL;
while (SP)
{
MyList<Block> *BP = SP->data->blb;
while (BP)
{
Block *cg = BP->data;
if (rank == cg->rank)
{
const size_t all = (size_t)cg->shape[0] * cg->shape[1] * cg->shape[2];
std::memset(cg->fgfs[Rphi2_var->sgfn], 0, all * sizeof(double));
std::memset(cg->fgfs[Iphi2_var->sgfn], 0, all * sizeof(double));
std::memset(cg->fgfs[Rphi1_var->sgfn], 0, all * sizeof(double));
std::memset(cg->fgfs[Iphi1_var->sgfn], 0, all * sizeof(double));
}
if (BP == SP->data->ble)
break;
BP = BP->next;
}
SP = SP->next;
}
}
#endif
}
#endif
int bssn_em_step_timing_every()
{
static int every = -1;
if (every < 0)
{
const char *env = getenv("AMSS_EM_STEP_TIMING_EVERY");
every = (env && atoi(env) > 0) ? atoi(env) : 1;
}
return every;
}
#if USE_CUDA_BSSN
bool fill_bssn_em_bssn_cuda_views(Block *cg, MyList<var> *vars,
double **host_views,
double *propspeeds = nullptr,
double *soa_flat = nullptr)
double *propspeeds = 0,
double *soa_flat = 0)
{
int idx = 0;
while (vars && idx < BSSN_CUDA_STATE_COUNT)
@@ -65,79 +267,161 @@ bool fill_bssn_cuda_views_prefix(Block *cg, MyList<var> *vars,
return idx == BSSN_CUDA_STATE_COUNT;
}
void skip_bssn_cuda_prefix(MyList<var> *&a, MyList<var> *&b, MyList<var> *&c)
bool fill_bssn_em_cuda_views(Block *cg, MyList<var> *vars,
double **host_views,
double *propspeeds = 0,
double *soa_flat = 0)
{
for (int i = 0; i < BSSN_CUDA_STATE_COUNT && a && b && c; ++i)
int idx = 0;
while (vars && idx < BSSN_EM_CUDA_STATE_COUNT)
{
a = a->next;
b = b->next;
c = c->next;
host_views[idx] = cg->fgfs[vars->data->sgfn];
if (propspeeds)
propspeeds[idx] = vars->data->propspeed;
if (soa_flat)
{
soa_flat[3 * idx + 0] = vars->data->SoA[0];
soa_flat[3 * idx + 1] = vars->data->SoA[1];
soa_flat[3 * idx + 2] = vars->data->SoA[2];
}
vars = vars->next;
++idx;
}
return idx == BSSN_EM_CUDA_STATE_COUNT && vars == 0;
}
void skip_bssn_cuda_prefix(MyList<var> *&a, MyList<var> *&b,
MyList<var> *&c, MyList<var> *&d)
void fill_bssn_em_fixed_source_cuda_views(Block *cg, double **sources,
var *Jx, var *Jy, var *Jz, var *qchar)
{
for (int i = 0; i < BSSN_CUDA_STATE_COUNT && a && b && c && d; ++i)
{
a = a->next;
b = b->next;
c = c->next;
d = d->next;
}
sources[0] = cg->fgfs[Jx->sgfn];
sources[1] = cg->fgfs[Jy->sgfn];
sources[2] = cg->fgfs[Jz->sgfn];
sources[3] = cg->fgfs[qchar->sgfn];
}
int run_bssn_em_cuda_substep(Block *cg,
MyList<var> *state_in_list,
MyList<var> *state_out_list,
Patch *patch,
double &dT_lev,
double &TRK4,
int &iter_count,
int &Symmetry,
int lev,
double &ndeps,
int &co,
double &chitiny,
void fill_bssn_em_matter_cuda_views(Block *cg, double **matter,
var *rho, var *Sx, var *Sy, var *Sz,
var *Sxx, var *Sxy, var *Sxz,
var *Syy, var *Syz, var *Szz)
{
double *state_in[BSSN_CUDA_STATE_COUNT];
double *state_out[BSSN_CUDA_STATE_COUNT];
double *matter[BSSN_CUDA_MATTER_COUNT] = {
cg->fgfs[rho->sgfn], cg->fgfs[Sx->sgfn], cg->fgfs[Sy->sgfn], cg->fgfs[Sz->sgfn],
cg->fgfs[Sxx->sgfn], cg->fgfs[Sxy->sgfn], cg->fgfs[Sxz->sgfn],
cg->fgfs[Syy->sgfn], cg->fgfs[Syz->sgfn], cg->fgfs[Szz->sgfn]};
double propspeed[BSSN_CUDA_STATE_COUNT];
double soa_flat[3 * BSSN_CUDA_STATE_COUNT];
if (!fill_bssn_cuda_views_prefix(cg, state_in_list, state_in, propspeed, soa_flat) ||
!fill_bssn_cuda_views_prefix(cg, state_out_list, state_out))
return 1;
int apply_bam_bc = 0;
#if (SommerType == 0)
#ifndef WithShell
apply_bam_bc = (lev == 0) ? 1 : 0;
#endif
#endif
int use_zero_matter = 0;
int keep_resident_state = 0;
int apply_enforce_ga = 0;
return bssn_cuda_rk4_substep(cg,
cg->shape, cg->X[0], cg->X[1], cg->X[2],
state_in, state_out, matter,
propspeed, soa_flat, patch->bbox,
dT_lev, TRK4, iter_count, apply_bam_bc,
Symmetry, lev, ndeps, co,
use_zero_matter,
keep_resident_state, apply_enforce_ga, chitiny);
matter[0] = cg->fgfs[rho->sgfn];
matter[1] = cg->fgfs[Sx->sgfn];
matter[2] = cg->fgfs[Sy->sgfn];
matter[3] = cg->fgfs[Sz->sgfn];
matter[4] = cg->fgfs[Sxx->sgfn];
matter[5] = cg->fgfs[Sxy->sgfn];
matter[6] = cg->fgfs[Sxz->sgfn];
matter[7] = cg->fgfs[Syy->sgfn];
matter[8] = cg->fgfs[Syz->sgfn];
matter[9] = cg->fgfs[Szz->sgfn];
}
bool bssn_em_cuda_use_resident_sync(int lev)
{
#ifdef WithShell
(void)lev;
return false;
#else
return true;
#endif
}
bool bssn_em_cuda_keep_resident_after_step(int lev, int trfls_in, int analysis_lev)
{
static int keep_all_levels = -1;
if (keep_all_levels < 0)
{
const char *env = getenv("AMSS_CUDA_EM_KEEP_ALL_LEVELS");
keep_all_levels = (env && atoi(env) != 0) ? 1 : 0;
}
static int enabled = -1;
if (enabled < 0)
{
const char *env = getenv("AMSS_CUDA_EM_KEEP_RESIDENT_AFTER_STEP");
if (!env)
env = getenv("AMSS_CUDA_KEEP_RESIDENT_AFTER_STEP");
enabled = (env && atoi(env) != 0) ? 1 : 0;
}
if (!enabled)
return false;
if (lev == analysis_lev)
return false;
if (keep_all_levels)
return true;
return lev < trfls_in;
}
void bssn_em_cuda_download_level_state(MyList<Patch> *PatL, MyList<var> *vars,
int myrank, bool release_ctx)
{
MyList<Patch> *Pp = PatL;
while (Pp)
{
MyList<Block> *BP = Pp->data->blb;
while (BP)
{
Block *cg = BP->data;
if (myrank == cg->rank && bssn_cuda_has_resident_state(cg))
{
double *state_out[BSSN_EM_CUDA_STATE_COUNT];
if (!fill_bssn_em_cuda_views(cg, vars, state_out))
{
cout << "CUDA BSSN-EM resident state list mismatch during download" << endl;
MPI_Abort(MPI_COMM_WORLD, 1);
}
if (bssn_cuda_download_resident_state_count_if_present(cg, cg->shape,
state_out,
BSSN_EM_CUDA_STATE_COUNT))
{
cout << "CUDA BSSN-EM resident state download failed" << endl;
MPI_Abort(MPI_COMM_WORLD, 1);
}
if (release_ctx)
bssn_cuda_release_step_ctx(cg);
}
if (BP == Pp->data->ble)
break;
BP = BP->next;
}
Pp = Pp->next;
}
}
void bssn_em_cuda_keep_only_level_state(MyList<Patch> *PatL, MyList<var> *vars,
int myrank)
{
MyList<Patch> *Pp = PatL;
while (Pp)
{
MyList<Block> *BP = Pp->data->blb;
while (BP)
{
Block *cg = BP->data;
if (myrank == cg->rank && bssn_cuda_has_resident_state(cg))
{
double *state_key[BSSN_EM_CUDA_STATE_COUNT];
if (!fill_bssn_em_cuda_views(cg, vars, state_key))
{
cout << "CUDA BSSN-EM resident state list mismatch during prune" << endl;
MPI_Abort(MPI_COMM_WORLD, 1);
}
if (bssn_cuda_keep_only_resident_state_count(cg, cg->shape,
state_key,
BSSN_EM_CUDA_STATE_COUNT))
{
cout << "CUDA BSSN-EM keep-only resident state failed" << endl;
MPI_Abort(MPI_COMM_WORLD, 1);
}
}
if (BP == Pp->data->ble)
break;
BP = BP->next;
}
Pp = Pp->next;
}
}
#endif
//================================================================================================
}
// Define bssnEM_class
@@ -347,8 +631,6 @@ void bssnEM_class::Initialize()
CheckPoint->readcheck_sh(SH, myrank);
#endif
Initialize_Level_Runtime();
double h = GH->PatL[0]->data->blb->data->getdX(0);
for (int i = 1; i < dim; i++)
h = Mymin(h, GH->PatL[0]->data->blb->data->getdX(i));
@@ -363,6 +645,13 @@ void bssnEM_class::Initialize()
PhysTime = StartTime;
Setup_Black_Hole_position();
}
sync_cache_pre = new Parallel::SyncCache[GH->levels];
sync_cache_cor = new Parallel::SyncCache[GH->levels];
sync_cache_rp_coarse = new Parallel::SyncCache[GH->levels];
sync_cache_rp_fine = new Parallel::SyncCache[GH->levels];
sync_cache_restrict = new Parallel::SyncCache[GH->levels];
sync_cache_outbd = new Parallel::SyncCache[GH->levels];
}
//================================================================================================
@@ -938,9 +1227,25 @@ void bssnEM_class::Step(int lev, int YN)
int iter_count = 0; // count RK4 substeps
int pre = 0, cor = 1;
int ERROR = 0;
#if USE_CUDA_BSSN
const bool use_cuda_resident_sync = bssn_em_cuda_use_resident_sync(lev);
#endif
const bool em_step_timing = bssn_em_step_timing_enabled();
const double em_step_t0 = em_step_timing ? MPI_Wtime() : 0.0;
double em_t0 = 0.0;
double em_t_predictor = 0.0;
double em_t_predictor_sync = 0.0;
double em_t_corrector = 0.0;
double em_t_corrector_sync = 0.0;
double em_t_analysis = 0.0;
double em_t_bh = 0.0;
double em_t_swap = 0.0;
double em_t_resident = 0.0;
double em_t_rp = 0.0;
MyList<ss_patch> *sPp;
// Predictor
em_t0 = em_step_timing ? MPI_Wtime() : 0.0;
MyList<Patch> *Pp = GH->PatL[lev];
while (Pp)
{
@@ -950,16 +1255,20 @@ void bssnEM_class::Step(int lev, int YN)
Block *cg = BP->data;
if (myrank == cg->rank)
{
#if !USE_CUDA_BSSN
#if (AGM == 0)
f_enforce_ga(cg->shape,
cg->fgfs[gxx0->sgfn], cg->fgfs[gxy0->sgfn], cg->fgfs[gxz0->sgfn],
cg->fgfs[gyy0->sgfn], cg->fgfs[gyz0->sgfn], cg->fgfs[gzz0->sgfn],
cg->fgfs[Axx0->sgfn], cg->fgfs[Axy0->sgfn], cg->fgfs[Axz0->sgfn],
cg->fgfs[Ayy0->sgfn], cg->fgfs[Ayz0->sgfn], cg->fgfs[Azz0->sgfn]);
#endif
#endif
int em_rhs_error = 0;
bool used_gpu_substep = false;
if (
#if !USE_CUDA_BSSN
em_rhs_error =
f_compute_rhs_empart(cg->shape, cg->X[0], cg->X[1], cg->X[2],
cg->fgfs[phi0->sgfn],
cg->fgfs[gxx0->sgfn], cg->fgfs[gxy0->sgfn], cg->fgfs[gxz0->sgfn],
@@ -979,17 +1288,52 @@ void bssnEM_class::Step(int lev, int YN)
cg->fgfs[Sx->sgfn], cg->fgfs[Sy->sgfn], cg->fgfs[Sz->sgfn],
cg->fgfs[Sxx->sgfn], cg->fgfs[Sxy->sgfn], cg->fgfs[Sxz->sgfn],
cg->fgfs[Syy->sgfn], cg->fgfs[Syz->sgfn], cg->fgfs[Szz->sgfn],
Symmetry, lev, ndeps) ||
#if USE_CUDA_BSSN
((used_gpu_substep =
(run_bssn_em_cuda_substep(cg, StateList, SynchList_pre, Pp->data,
dT_lev, TRK4, iter_count, Symmetry, lev,
ndeps, pre, chitiny,
rho, Sx, Sy, Sz, Sxx, Sxy, Sxz, Syy, Syz, Szz) == 0))
? 0
: 1) ||
Symmetry, lev, ndeps);
#endif
(!used_gpu_substep && f_compute_rhs_bssn(cg->shape, TRK4, cg->X[0], cg->X[1], cg->X[2],
#if USE_CUDA_BSSN
if (!em_rhs_error)
{
double *state_in[BSSN_EM_CUDA_STATE_COUNT];
double *state_out[BSSN_EM_CUDA_STATE_COUNT];
double *sources[BSSN_EM_CUDA_SOURCE_COUNT];
double propspeed[BSSN_EM_CUDA_STATE_COUNT];
double soa_flat[3 * BSSN_EM_CUDA_STATE_COUNT];
if (!fill_bssn_em_cuda_views(cg, StateList, state_in, propspeed, soa_flat) ||
!fill_bssn_em_cuda_views(cg, SynchList_pre, state_out))
{
cout << "CUDA BSSN-EM state list mismatch on predictor step" << endl;
MPI_Abort(MPI_COMM_WORLD, 1);
}
fill_bssn_em_fixed_source_cuda_views(cg, sources, Jx, Jy, Jz, qchar);
int apply_bam_bc = 0;
#if (SommerType == 0)
#ifndef WithShell
apply_bam_bc = (lev == 0) ? 1 : 0;
#endif
#endif
int apply_enforce_ga = 0;
#if (AGM == 0)
apply_enforce_ga = 1;
#endif
int keep_resident_state = use_cuda_resident_sync ? 1 : 0;
if (bssn_em_cuda_rk4_substep(cg,
cg->shape, cg->X[0], cg->X[1], cg->X[2],
state_in, state_out, sources,
propspeed, soa_flat, Pp->data->bbox,
dT_lev, TRK4, iter_count, apply_bam_bc,
Symmetry, lev, ndeps, pre,
keep_resident_state, apply_enforce_ga, chitiny))
{
ERROR = 1;
}
used_gpu_substep = true;
}
#endif
if (em_rhs_error ||
(!used_gpu_substep &&
f_compute_rhs_bssn(cg->shape, TRK4, cg->X[0], cg->X[1], cg->X[2],
cg->fgfs[phi0->sgfn], cg->fgfs[trK0->sgfn],
cg->fgfs[gxx0->sgfn], cg->fgfs[gxy0->sgfn], cg->fgfs[gxz0->sgfn],
cg->fgfs[gyy0->sgfn], cg->fgfs[gyz0->sgfn], cg->fgfs[gzz0->sgfn],
@@ -1031,14 +1375,12 @@ void bssnEM_class::Step(int lev, int YN)
ERROR = 1;
}
if (!used_gpu_substep)
{
// rk4 substep and boundary
{
MyList<var> *varl0 = StateList, *varl = SynchList_pre, *varlrhs = RHSList;
// we do not check the correspondence here
#if USE_CUDA_BSSN
if (used_gpu_substep)
skip_bssn_cuda_prefix(varl0, varl, varlrhs);
#endif
while (varl0)
{
@@ -1077,12 +1419,15 @@ void bssnEM_class::Step(int lev, int YN)
}
f_lowerboundset(cg->shape, cg->fgfs[phi->sgfn], chitiny);
}
}
if (BP == Pp->data->ble)
break;
BP = BP->next;
}
Pp = Pp->next;
}
if (em_step_timing)
em_t_predictor += MPI_Wtime() - em_t0;
// check error information
{
int erh = ERROR;
@@ -1340,7 +1685,11 @@ void bssnEM_class::Step(int lev, int YN)
}
#endif
if (em_step_timing)
em_t0 = MPI_Wtime();
Parallel::Sync_cached(GH->PatL[lev], SynchList_pre, Symmetry, sync_cache_pre[lev]);
if (em_step_timing)
em_t_predictor_sync += MPI_Wtime() - em_t0;
#ifdef WithShell
if (lev == 0)
@@ -1363,6 +1712,8 @@ void bssnEM_class::Step(int lev, int YN)
// for black hole position
if (BH_num > 0 && lev == GH->levels - 1)
{
if (em_step_timing)
em_t0 = MPI_Wtime();
compute_Porg_rhs(Porg0, Porg_rhs, Sfx0, Sfy0, Sfz0, lev);
for (int ithBH = 0; ithBH < BH_num; ithBH++)
{
@@ -1391,16 +1742,24 @@ void bssnEM_class::Step(int lev, int YN)
DG_List->clearList();
}
}
if (em_step_timing)
em_t_bh += MPI_Wtime() - em_t0;
}
// data analysis part
// Warning NOTE: the variables1 are used as temp storege room
if (lev == a_lev)
{
if (em_step_timing)
em_t0 = MPI_Wtime();
AnalysisStuff_EM(lev, dT_lev);
if (em_step_timing)
em_t_analysis += MPI_Wtime() - em_t0;
}
// corrector
for (iter_count = 1; iter_count < 4; iter_count++)
{
if (em_step_timing)
em_t0 = MPI_Wtime();
// for RK4: t0, t0+dt/2, t0+dt/2, t0+dt;
if (iter_count == 1 || iter_count == 3)
TRK4 += dT_lev / 2;
@@ -1413,6 +1772,7 @@ void bssnEM_class::Step(int lev, int YN)
Block *cg = BP->data;
if (myrank == cg->rank)
{
#if !USE_CUDA_BSSN
#if (AGM == 0)
f_enforce_ga(cg->shape,
cg->fgfs[gxx->sgfn], cg->fgfs[gxy->sgfn], cg->fgfs[gxz->sgfn],
@@ -1426,10 +1786,13 @@ void bssnEM_class::Step(int lev, int YN)
cg->fgfs[gyy->sgfn], cg->fgfs[gyz->sgfn], cg->fgfs[gzz->sgfn],
cg->fgfs[Axx->sgfn], cg->fgfs[Axy->sgfn], cg->fgfs[Axz->sgfn],
cg->fgfs[Ayy->sgfn], cg->fgfs[Ayz->sgfn], cg->fgfs[Azz->sgfn]);
#endif
#endif
int em_rhs_error = 0;
bool used_gpu_substep = false;
if (
#if !USE_CUDA_BSSN
em_rhs_error =
f_compute_rhs_empart(cg->shape, cg->X[0], cg->X[1], cg->X[2],
cg->fgfs[phi->sgfn],
cg->fgfs[gxx->sgfn], cg->fgfs[gxy->sgfn], cg->fgfs[gxz->sgfn],
@@ -1449,17 +1812,55 @@ void bssnEM_class::Step(int lev, int YN)
cg->fgfs[Sx->sgfn], cg->fgfs[Sy->sgfn], cg->fgfs[Sz->sgfn],
cg->fgfs[Sxx->sgfn], cg->fgfs[Sxy->sgfn], cg->fgfs[Sxz->sgfn],
cg->fgfs[Syy->sgfn], cg->fgfs[Syz->sgfn], cg->fgfs[Szz->sgfn],
Symmetry, lev, ndeps) ||
#if USE_CUDA_BSSN
((used_gpu_substep =
(run_bssn_em_cuda_substep(cg, SynchList_pre, SynchList_cor, Pp->data,
dT_lev, TRK4, iter_count, Symmetry, lev,
ndeps, cor, chitiny,
rho, Sx, Sy, Sz, Sxx, Sxy, Sxz, Syy, Syz, Szz) == 0))
? 0
: 1) ||
Symmetry, lev, ndeps);
#endif
(!used_gpu_substep && f_compute_rhs_bssn(cg->shape, TRK4, cg->X[0], cg->X[1], cg->X[2],
#if USE_CUDA_BSSN
if (!em_rhs_error)
{
double *state_in[BSSN_EM_CUDA_STATE_COUNT];
double *state_out[BSSN_EM_CUDA_STATE_COUNT];
double *sources[BSSN_EM_CUDA_SOURCE_COUNT];
double propspeed[BSSN_EM_CUDA_STATE_COUNT];
double soa_flat[3 * BSSN_EM_CUDA_STATE_COUNT];
if (!fill_bssn_em_cuda_views(cg, SynchList_pre, state_in, propspeed, soa_flat) ||
!fill_bssn_em_cuda_views(cg, SynchList_cor, state_out))
{
cout << "CUDA BSSN-EM state list mismatch on corrector step" << endl;
MPI_Abort(MPI_COMM_WORLD, 1);
}
fill_bssn_em_fixed_source_cuda_views(cg, sources, Jx, Jy, Jz, qchar);
int apply_bam_bc = 0;
#if (SommerType == 0)
#ifndef WithShell
apply_bam_bc = (lev == 0) ? 1 : 0;
#endif
#endif
int apply_enforce_ga = 0;
#if (AGM == 0)
apply_enforce_ga = 1;
#elif (AGM == 1)
if (iter_count == 3)
apply_enforce_ga = 1;
#endif
int keep_resident_state = use_cuda_resident_sync ? 1 : 0;
if (bssn_em_cuda_rk4_substep(cg,
cg->shape, cg->X[0], cg->X[1], cg->X[2],
state_in, state_out, sources,
propspeed, soa_flat, Pp->data->bbox,
dT_lev, TRK4, iter_count, apply_bam_bc,
Symmetry, lev, ndeps, cor,
keep_resident_state, apply_enforce_ga, chitiny))
{
ERROR = 1;
}
used_gpu_substep = true;
}
#endif
if (em_rhs_error ||
(!used_gpu_substep &&
f_compute_rhs_bssn(cg->shape, TRK4, cg->X[0], cg->X[1], cg->X[2],
cg->fgfs[phi->sgfn], cg->fgfs[trK->sgfn],
cg->fgfs[gxx->sgfn], cg->fgfs[gxy->sgfn], cg->fgfs[gxz->sgfn],
cg->fgfs[gyy->sgfn], cg->fgfs[gyz->sgfn], cg->fgfs[gzz->sgfn],
@@ -1499,14 +1900,12 @@ void bssnEM_class::Step(int lev, int YN)
<< cg->bbox[2] << ":" << cg->bbox[5] << ")" << endl;
ERROR = 1;
}
if (!used_gpu_substep)
{
// rk4 substep and boundary
{
MyList<var> *varl0 = StateList, *varl = SynchList_pre, *varl1 = SynchList_cor, *varlrhs = RHSList;
// we do not check the correspondence here
#if USE_CUDA_BSSN
if (used_gpu_substep)
skip_bssn_cuda_prefix(varl0, varl, varl1, varlrhs);
#endif
while (varl0)
{
@@ -1546,6 +1945,7 @@ void bssnEM_class::Step(int lev, int YN)
}
f_lowerboundset(cg->shape, cg->fgfs[phi1->sgfn], chitiny);
}
}
if (BP == Pp->data->ble)
break;
BP = BP->next;
@@ -1816,7 +2216,13 @@ void bssnEM_class::Step(int lev, int YN)
}
#endif
if (em_step_timing)
em_t_corrector += MPI_Wtime() - em_t0;
if (em_step_timing)
em_t0 = MPI_Wtime();
Parallel::Sync_cached(GH->PatL[lev], SynchList_cor, Symmetry, sync_cache_cor[lev]);
if (em_step_timing)
em_t_corrector_sync += MPI_Wtime() - em_t0;
#ifdef WithShell
if (lev == 0)
@@ -1838,6 +2244,8 @@ void bssnEM_class::Step(int lev, int YN)
// for black hole position
if (BH_num > 0 && lev == GH->levels - 1)
{
if (em_step_timing)
em_t0 = MPI_Wtime();
compute_Porg_rhs(Porg, Porg1, Sfx, Sfy, Sfz, lev);
for (int ithBH = 0; ithBH < BH_num; ithBH++)
{
@@ -1866,10 +2274,14 @@ void bssnEM_class::Step(int lev, int YN)
DG_List->clearList();
}
}
if (em_step_timing)
em_t_bh += MPI_Wtime() - em_t0;
}
// swap time level
if (iter_count < 3)
{
if (em_step_timing)
em_t0 = MPI_Wtime();
Pp = GH->PatL[lev];
while (Pp)
{
@@ -1913,12 +2325,32 @@ void bssnEM_class::Step(int lev, int YN)
Porg[ithBH][2] = Porg1[ithBH][2];
}
}
if (em_step_timing)
em_t_swap += MPI_Wtime() - em_t0;
}
}
#if USE_CUDA_BSSN
if (use_cuda_resident_sync)
{
if (em_step_timing)
em_t0 = MPI_Wtime();
const bool needs_resident_download =
!bssn_em_cuda_keep_resident_after_step(lev, trfls, a_lev);
if (needs_resident_download)
bssn_em_cuda_download_level_state(GH->PatL[lev], SynchList_cor, myrank, true);
if (em_step_timing)
em_t_resident += MPI_Wtime() - em_t0;
}
#endif
#if (RPS == 0)
// mesh refinement boundary part
if (em_step_timing)
em_t0 = MPI_Wtime();
RestrictProlong(lev, YN, BB);
if (em_step_timing)
em_t_rp += MPI_Wtime() - em_t0;
#ifdef WithShell
if (lev == 0)
@@ -1946,6 +2378,8 @@ void bssnEM_class::Step(int lev, int YN)
//
// OldStateList old -----------
// update
if (em_step_timing)
em_t0 = MPI_Wtime();
Pp = GH->PatL[lev];
while (Pp)
{
@@ -1991,6 +2425,26 @@ void bssnEM_class::Step(int lev, int YN)
Porg0[ithBH][2] = Porg1[ithBH][2];
}
}
if (em_step_timing)
{
em_t_swap += MPI_Wtime() - em_t0;
static int em_step_report_count = 0;
const int em_timing_every = bssn_em_step_timing_every();
const bool report_all_levels = bssn_em_step_timing_all_levels_enabled();
if (lev == GH->levels - 1)
++em_step_report_count;
if ((report_all_levels || lev == GH->levels - 1) &&
(em_timing_every <= 1 || em_step_report_count % em_timing_every == 0))
{
fprintf(stderr,
"[AMSS-EM-STEP-TIMING] lev=%d wall=%.6f predictor=%.6f pre_sync=%.6f "
"analysis=%.6f corrector=%.6f cor_sync=%.6f bh=%.6f swap=%.6f resident=%.6f rp=%.6f\n",
lev, MPI_Wtime() - em_step_t0,
em_t_predictor, em_t_predictor_sync,
em_t_analysis, em_t_corrector, em_t_corrector_sync,
em_t_bh, em_t_swap, em_t_resident, em_t_rp);
}
}
}
//================================================================================================
@@ -2169,6 +2623,59 @@ void bssnEM_class::AnalysisStuff_EM(int lev, double dT_lev)
if (LastAnas >= AnasTime)
{
#if USE_CUDA_BSSN
const bool zero_em_analysis =
bssn_em_analysis_zero_fastpath_ready(GH->PatL[lev],
#ifdef WithShell
SH
#else
0
#endif
, myrank
);
#else
const bool zero_em_analysis = false;
#endif
if (zero_em_analysis)
{
#if USE_CUDA_BSSN
zero_em_analysis_outputs(GH->PatL[lev],
#ifdef WithShell
SH,
#else
0,
#endif
myrank,
Rphi2, Iphi2, Rphi1, Iphi1);
#endif
int NN = 0;
for (int pl = 1; pl < maxl + 1; pl++)
for (int pm = -pl; pm < pl + 1; pm++)
NN++;
double *RP = new double[NN];
double *IP = new double[NN];
std::memset(RP, 0, NN * sizeof(double));
std::memset(IP, 0, NN * sizeof(double));
for (int i = 0; i < decn; i++)
Phi2Monitor->writefile(PhysTime, NN, RP, IP);
delete[] RP;
delete[] IP;
NN = 0;
for (int pl = 0; pl < maxl + 1; pl++)
for (int pm = -pl; pm < pl + 1; pm++)
NN++;
RP = new double[NN];
IP = new double[NN];
std::memset(RP, 0, NN * sizeof(double));
std::memset(IP, 0, NN * sizeof(double));
for (int i = 0; i < decn; i++)
Phi1Monitor->writefile(PhysTime, NN, RP, IP);
delete[] RP;
delete[] IP;
}
else
{
Compute_Phi2(lev);
double *RP, *IP;
int NN = 0;
@@ -2258,6 +2765,7 @@ void bssnEM_class::AnalysisStuff_EM(int lev, double dT_lev)
delete[] RP;
delete[] IP;
}
}
AnalysisStuff(lev, dT_lev); // LastAnas need and only need control here
@@ -2437,11 +2945,12 @@ void bssnEM_class::Interp_Constraint()
}
ofstream outfile;
char filename[50];
sprintf(filename, "%s/interp_constraint_%05d.dat", ErrorMonitor->out_dir.c_str(), int(PhysTime / dT + 0.5));
char suffix[64];
sprintf(suffix, "/interp_constraint_%05d.dat", int(PhysTime / dT + 0.5));
string filename = ErrorMonitor->out_dir + suffix;
// 0.5 for round off
outfile.open(filename);
outfile.open(filename.c_str());
outfile << "# corrdinate, H_Res, Px_Res, Py_Res, Pz_Res, Gx_Res, Gy_Res, Gz_Res, ...." << endl;
for (int i = 0; i < n; i++)
{

File diff suppressed because it is too large Load Diff

View File

@@ -63,10 +63,6 @@ protected:
var *Cons_fR;
MyList<var> *BSSNStateList, *BSSNSynchList_pre, *BSSNSynchList_cor;
MyList<var> *ScalarSynchList_pre, *ScalarSynchList_cor;
Parallel::SyncCache *sync_cache_scalar_pre, *sync_cache_scalar_cor;
monitor *MaxScalar_Monitor;
};

View File

@@ -5,138 +5,6 @@
#include "macrodef.fh"
! scalar RHS and stress-energy only; BSSN RHS can be supplied by CUDA.
function compute_rhs_bssn_escalar_matter(ex, T, X, Y, Z, &
chi , trK , &
dxx , gxy , gxz , dyy , gyz , dzz, &
Axx , Axy , Axz , Ayy , Ayz , Azz, &
Gamx , Gamy , Gamz , &
Lap , betax , betay , betaz , &
dtSfx , dtSfy , dtSfz , &
Sphi , Spi , &
Sphi_rhs , Spi_rhs , &
rho,Sx,Sy,Sz,Sxx,Sxy,Sxz,Syy,Syz,Szz, &
Symmetry,Lev,eps) result(gont)
implicit none
integer,intent(in ):: ex(1:3), Symmetry,Lev
real*8, intent(in ):: T
real*8, intent(in ):: X(1:ex(1)),Y(1:ex(2)),Z(1:ex(3))
real*8, dimension(ex(1),ex(2),ex(3)),intent(inout) :: chi,dxx,dyy,dzz
real*8, dimension(ex(1),ex(2),ex(3)),intent(in ) :: trK
real*8, dimension(ex(1),ex(2),ex(3)),intent(in ) :: gxy,gxz,gyz
real*8, dimension(ex(1),ex(2),ex(3)),intent(in ) :: Axx,Axy,Axz,Ayy,Ayz,Azz
real*8, dimension(ex(1),ex(2),ex(3)),intent(in ) :: Gamx,Gamy,Gamz
real*8, dimension(ex(1),ex(2),ex(3)),intent(inout) :: Lap, betax, betay, betaz
real*8, dimension(ex(1),ex(2),ex(3)),intent(in ) :: dtSfx, dtSfy, dtSfz
real*8, dimension(ex(1),ex(2),ex(3)),intent(in ) :: Sphi,Spi
real*8, dimension(ex(1),ex(2),ex(3)),intent(out) :: Sphi_rhs,Spi_rhs
real*8, dimension(ex(1),ex(2),ex(3)),intent(inout) :: rho,Sx,Sy,Sz
real*8, dimension(ex(1),ex(2),ex(3)),intent(inout) :: Sxx,Sxy,Sxz,Syy,Syz,Szz
real*8,intent(in) :: eps
integer::gont
real*8, dimension(ex(1),ex(2),ex(3)) :: gxx,gyy,gzz
real*8, dimension(ex(1),ex(2),ex(3)) :: chix,chiy,chiz
real*8, dimension(ex(1),ex(2),ex(3)) :: Lapx,Lapy,Lapz
real*8, dimension(ex(1),ex(2),ex(3)) :: Kx,Ky,Kz,S
real*8, dimension(ex(1),ex(2),ex(3)) :: f,fxx,fxy,fxz,fyy,fyz,fzz
real*8, dimension(ex(1),ex(2),ex(3)) :: alpn1,chin1
real*8, dimension(ex(1),ex(2),ex(3)) :: gupxx,gupxy,gupxz
real*8, dimension(ex(1),ex(2),ex(3)) :: gupyy,gupyz,gupzz
real*8 :: dX
real*8, parameter :: ZEO=0.d0, ONE = 1.D0, TWO = 2.D0, HALF = 0.5D0
real*8, parameter :: SYM = 1.D0
dX = sum(chi)+sum(trK)+sum(dxx)+sum(gxy)+sum(gxz)+sum(dyy)+sum(gyz)+sum(dzz) &
+sum(Gamx)+sum(Gamy)+sum(Gamz) &
+sum(Lap)+sum(Sphi)+sum(Spi)
if(dX.ne.dX) then
if(sum(chi).ne.sum(chi))write(*,*)"bssn_escalar_matter: find NaN in chi"
if(sum(trK).ne.sum(trK))write(*,*)"bssn_escalar_matter: find NaN in trk"
if(sum(dxx).ne.sum(dxx))write(*,*)"bssn_escalar_matter: find NaN in dxx"
if(sum(gxy).ne.sum(gxy))write(*,*)"bssn_escalar_matter: find NaN in gxy"
if(sum(gxz).ne.sum(gxz))write(*,*)"bssn_escalar_matter: find NaN in gxz"
if(sum(dyy).ne.sum(dyy))write(*,*)"bssn_escalar_matter: find NaN in dyy"
if(sum(gyz).ne.sum(gyz))write(*,*)"bssn_escalar_matter: find NaN in gyz"
if(sum(dzz).ne.sum(dzz))write(*,*)"bssn_escalar_matter: find NaN in dzz"
if(sum(Gamx).ne.sum(Gamx))write(*,*)"bssn_escalar_matter: find NaN in Gamx"
if(sum(Gamy).ne.sum(Gamy))write(*,*)"bssn_escalar_matter: find NaN in Gamy"
if(sum(Gamz).ne.sum(Gamz))write(*,*)"bssn_escalar_matter: find NaN in Gamz"
if(sum(Lap).ne.sum(Lap))write(*,*)"bssn_escalar_matter: find NaN in Lap"
if(sum(Sphi).ne.sum(Sphi))write(*,*)"bssn_escalar_matter: find NaN in Sphi"
if(sum(Spi).ne.sum(Spi))write(*,*)"bssn_escalar_matter: find NaN in Spi"
gont = 1
return
endif
alpn1 = Lap + ONE
chin1 = chi + ONE
gxx = dxx + ONE
gyy = dyy + ONE
gzz = dzz + ONE
call fderivs(ex,chi,chix,chiy,chiz,X,Y,Z,SYM,SYM,SYM,Symmetry,Lev)
call fderivs(ex,Lap,Lapx,Lapy,Lapz,X,Y,Z,SYM,SYM,SYM,Symmetry,Lev)
gupzz = gxx * gyy * gzz + gxy * gyz * gxz + gxz * gxy * gyz - &
gxz * gyy * gxz - gxy * gxy * gzz - gxx * gyz * gyz
gupxx = ( gyy * gzz - gyz * gyz ) / gupzz
gupxy = - ( gxy * gzz - gyz * gxz ) / gupzz
gupxz = ( gxy * gyz - gyy * gxz ) / gupzz
gupyy = ( gxx * gzz - gxz * gxz ) / gupzz
gupyz = - ( gxx * gyz - gxy * gxz ) / gupzz
gupzz = ( gxx * gyy - gxy * gxy ) / gupzz
#if 1
Sphi_rhs = alpn1 * Spi
call fderivs(ex,Sphi,Kx,Ky,Kz,X,Y,Z,SYM,SYM,SYM,Symmetry,Lev)
call fdderivs(ex,Sphi,fxx,fxy,fxz,fyy,fyz,fzz,X,Y,Z,SYM,SYM,SYM,Symmetry,Lev)
Spi_rhs = gupxx * fxx + gupyy * fyy + gupzz * fzz + &
( gupxy * fxy + gupxz * fxz + gupyz * fyz ) * TWO - &
((Gamx+(gupxx*chix+gupxy*chiy+gupxz*chiz)/TWO/chin1)*Kx &
+ (Gamy+(gupxy*chix+gupyy*chiy+gupyz*chiz)/TWO/chin1)*Ky &
+ (Gamz+(gupxz*chix+gupyz*chiy+gupzz*chiz)/TWO/chin1)*Kz)
Spi_rhs = Spi_rhs*alpn1 + &
(gupxx*Lapx*Kx + gupxy*Lapx*Ky + gupxz*Lapx*Kz &
+gupxy*Lapy*Kx + gupyy*Lapy*Ky + gupyz*Lapy*Kz &
+gupxz*Lapz*Kx + gupyz*Lapz*Ky + gupzz*Lapz*Kz)
call frpotential(ex,Sphi,f,S)
Spi_rhs = Spi_rhs*chin1 + alpn1*(trK*Spi - S)
rho = chin1*((gupxx * Kx * Kx + gupyy * Ky * Ky + gupzz * Kz * Kz)/TWO + &
gupxy * Kx * Ky + gupxz * Kx * Kz + gupyz * Ky * Kz ) &
+ Spi*Spi/TWO+f
Sx = -Spi*Kx
Sy = -Spi*Ky
Sz = -Spi*Kz
f = (rho - Spi*Spi)/chin1
Sxx = Kx*Kx-f*gxx
Sxy = Kx*Ky-f*gxy
Sxz = Kx*Kz-f*gxz
Syy = Ky*Ky-f*gyy
Syz = Ky*Kz-f*gyz
Szz = Kz*Kz-f*gzz
#else
Sphi_rhs = ZEO
Spi_rhs = ZEO
rho = ZEO
Sx = ZEO
Sy = ZEO
Sz = ZEO
Sxx = ZEO
Sxy = ZEO
Sxz = ZEO
Syy = ZEO
Syz = ZEO
Szz = ZEO
#endif
gont = 0
return
end function compute_rhs_bssn_escalar_matter
! rhs for scalar and GR variables
! here we consider vacuum spacetime only
function compute_rhs_bssn_escalar(ex, T,X, Y, Z, &

File diff suppressed because it is too large Load Diff

View File

@@ -144,7 +144,7 @@ public:
bssn_class(double Couranti, double StartTimei, double TotalTimei, double DumpTimei, double d2DumpTimei, double CheckTimei, double AnasTimei,
int Symmetryi, int checkruni, char *checkfilenamei, double numepssi, double numepsbi, double numepshi,
int a_levi, int maxli, int decni, double maxrexi, double drexi);
~bssn_class();
virtual ~bssn_class();
void Evolve(int Steps);
void RecursiveStep(int lev);
@@ -184,9 +184,6 @@ public:
virtual void Constraint_Out();
virtual void Compute_Constraint();
protected:
void Initialize_Level_Runtime();
#ifdef With_AHF
protected:
MyList<var> *AHList, *AHDList, *GaugeList;

View File

@@ -0,0 +1,56 @@
#ifndef BSSN_GPU_H_
#define BSSN_GPU_H_
#include "bssn_macro.h"
#include "macrodef.h"
#define DEVICE_ID 0
// #define DEVICE_ID_BY_MPI_RANK
#define GRID_DIM 256
#define BLOCK_DIM 128
#define _FH2_(i, j, k) fh[(i) + (j) * _1D_SIZE[2] + (k) * _2D_SIZE[2]]
#define _FH3_(i, j, k) fh[(i) + (j) * _1D_SIZE[3] + (k) * _2D_SIZE[3]]
#define pow2(x) ((x) * (x))
#define TimeBetween(a, b) ((b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1000000.0f)
#define M_ metac.
#define Mh_ meta->
#define Ms_ metassc.
#define Msh_ metass->
// #define TIMING
#define RHS_SS_PARA int calledby, int mpi_rank, int *ex, double &T, double *crho, double *sigma, double *R, double *X, double *Y, double *Z, double *drhodx, double *drhody, double *drhodz, double *dsigmadx, double *dsigmady, double *dsigmadz, double *dRdx, double *dRdy, double *dRdz, double *drhodxx, double *drhodxy, double *drhodxz, double *drhodyy, double *drhodyz, double *drhodzz, double *dsigmadxx, double *dsigmadxy, double *dsigmadxz, double *dsigmadyy, double *dsigmadyz, double *dsigmadzz, double *dRdxx, double *dRdxy, double *dRdxz, double *dRdyy, double *dRdyz, double *dRdzz, double *chi, double *trK, double *dxx, double *gxy, double *gxz, double *dyy, double *gyz, double *dzz, double *Axx, double *Axy, double *Axz, double *Ayy, double *Ayz, double *Azz, double *Gamx, double *Gamy, double *Gamz, double *Lap, double *betax, double *betay, double *betaz, double *dtSfx, double *dtSfy, double *dtSfz, double *chi_rhs, double *trK_rhs, double *gxx_rhs, double *gxy_rhs, double *gxz_rhs, double *gyy_rhs, double *gyz_rhs, double *gzz_rhs, double *Axx_rhs, double *Axy_rhs, double *Axz_rhs, double *Ayy_rhs, double *Ayz_rhs, double *Azz_rhs, double *Gamx_rhs, double *Gamy_rhs, double *Gamz_rhs, double *Lap_rhs, double *betax_rhs, double *betay_rhs, double *betaz_rhs, double *dtSfx_rhs, double *dtSfy_rhs, double *dtSfz_rhs, double *rho, double *Sx, double *Sy, double *Sz, double *Sxx, double *Sxy, double *Sxz, double *Syy, double *Syz, double *Szz, double *Gamxxx, double *Gamxxy, double *Gamxxz, double *Gamxyy, double *Gamxyz, double *Gamxzz, double *Gamyxx, double *Gamyxy, double *Gamyxz, double *Gamyyy, double *Gamyyz, double *Gamyzz, double *Gamzxx, double *Gamzxy, double *Gamzxz, double *Gamzyy, double *Gamzyz, double *Gamzzz, double *Rxx, double *Rxy, double *Rxz, double *Ryy, double *Ryz, double *Rzz, double *ham_Res, double *movx_Res, double *movy_Res, double *movz_Res, double *Gmx_Res, double *Gmy_Res, double *Gmz_Res, int &Symmetry, int &Lev, double &eps, int &sst, int &co
/** main function */
int gpu_rhs(int calledby, int mpi_rank, int *ex, double &T,
double *X, double *Y, double *Z,
double *chi, double *trK,
double *dxx, double *gxy, double *gxz, double *dyy, double *gyz, double *dzz,
double *Axx, double *Axy, double *Axz, double *Ayy, double *Ayz, double *Azz,
double *Gamx, double *Gamy, double *Gamz,
double *Lap, double *betax, double *betay, double *betaz,
double *dtSfx, double *dtSfy, double *dtSfz,
double *chi_rhs, double *trK_rhs,
double *gxx_rhs, double *gxy_rhs, double *gxz_rhs, double *gyy_rhs, double *gyz_rhs, double *gzz_rhs,
double *Axx_rhs, double *Axy_rhs, double *Axz_rhs, double *Ayy_rhs, double *Ayz_rhs, double *Azz_rhs,
double *Gamx_rhs, double *Gamy_rhs, double *Gamz_rhs,
double *Lap_rhs, double *betax_rhs, double *betay_rhs, double *betaz_rhs,
double *dtSfx_rhs, double *dtSfy_rhs, double *dtSfz_rhs,
double *rho, double *Sx, double *Sy, double *Sz, double *Sxx,
double *Sxy, double *Sxz, double *Syy, double *Syz, double *Szz,
double *Gamxxx, double *Gamxxy, double *Gamxxz, double *Gamxyy, double *Gamxyz, double *Gamxzz,
double *Gamyxx, double *Gamyxy, double *Gamyxz, double *Gamyyy, double *Gamyyz, double *Gamyzz,
double *Gamzxx, double *Gamzxy, double *Gamzxz, double *Gamzyy, double *Gamzyz, double *Gamzzz,
double *Rxx, double *Rxy, double *Rxz, double *Ryy, double *Ryz, double *Rzz,
double *ham_Res, double *movx_Res, double *movy_Res, double *movz_Res,
double *Gmx_Res, double *Gmy_Res, double *Gmz_Res,
int &Symmetry, int &Lev, double &eps, int &co);
int gpu_rhs_ss(RHS_SS_PARA);
#define Z4C_SS_PARA int calledby, int mpi_rank, int *ex, double &T, double *crho, double *sigma, double *R, double *X, double *Y, double *Z, double *drhodx, double *drhody, double *drhodz, double *dsigmadx, double *dsigmady, double *dsigmadz, double *dRdx, double *dRdy, double *dRdz, double *drhodxx, double *drhodxy, double *drhodxz, double *drhodyy, double *drhodyz, double *drhodzz, double *dsigmadxx, double *dsigmadxy, double *dsigmadxz, double *dsigmadyy, double *dsigmadyz, double *dsigmadzz, double *dRdxx, double *dRdxy, double *dRdxz, double *dRdyy, double *dRdyz, double *dRdzz, double *chi, double *trK, double *dxx, double *gxy, double *gxz, double *dyy, double *gyz, double *dzz, double *Axx, double *Axy, double *Axz, double *Ayy, double *Ayz, double *Azz, double *Gamx, double *Gamy, double *Gamz, double *Lap, double *betax, double *betay, double *betaz, double *dtSfx, double *dtSfy, double *dtSfz, double *TZ, double *chi_rhs, double *trK_rhs, double *gxx_rhs, double *gxy_rhs, double *gxz_rhs, double *gyy_rhs, double *gyz_rhs, double *gzz_rhs, double *Axx_rhs, double *Axy_rhs, double *Axz_rhs, double *Ayy_rhs, double *Ayz_rhs, double *Azz_rhs, double *Gamx_rhs, double *Gamy_rhs, double *Gamz_rhs, double *Lap_rhs, double *betax_rhs, double *betay_rhs, double *betaz_rhs, double *dtSfx_rhs, double *dtSfy_rhs, double *dtSfz_rhs, double *TZ_rhs, double *rho, double *Sx, double *Sy, double *Sz, double *Sxx, double *Sxy, double *Sxz, double *Syy, double *Syz, double *Szz, double *Gamxxx, double *Gamxxy, double *Gamxxz, double *Gamxyy, double *Gamxyz, double *Gamxzz, double *Gamyxx, double *Gamyxy, double *Gamyxz, double *Gamyyy, double *Gamyyz, double *Gamyzz, double *Gamzxx, double *Gamzxy, double *Gamzxz, double *Gamzyy, double *Gamzyz, double *Gamzzz, double *Rxx, double *Rxy, double *Rxz, double *Ryy, double *Ryz, double *Rzz, double *ham_Res, double *movx_Res, double *movy_Res, double *movz_Res, double *Gmx_Res, double *Gmy_Res, double *Gmz_Res, int &Symmetry, int &Lev, double &eps, int &sst, int &co
int gpu_rhs_z4c_ss(Z4C_SS_PARA);
#endif

View File

@@ -20,12 +20,14 @@ using namespace std;
__device__ volatile unsigned int global_count = 0;
#ifdef RESULT_CHECK
void compare_result_gpu(int ftag1,double * datac,int data_num){
double * data = (double*)malloc(sizeof(double)*data_num);
cudaMemcpy(data, datac, data_num * sizeof(double), cudaMemcpyDeviceToHost);
compare_result(ftag1,data,data_num);
free(data);
}
#endif
__global__ void sub_symmetry_bd_ss_partF(int ord, double * func, double *funcc)
{
@@ -153,11 +155,11 @@ __global__ void sub_symmetry_bd_ss_partJ(int ord,double * func, double * funcc,d
inline void sub_symmetry_bd_ss(int ord,double * func, double * funcc,double * SoA){
sub_symmetry_bd_ss_partF<<<GRID_DIM,BLOCK_DIM>>>(ord,func,funcc);
cudaThreadSynchronize();
cudaDeviceSynchronize();
sub_symmetry_bd_ss_partI<<<GRID_DIM,BLOCK_DIM>>>(ord,func,funcc,SoA[0]);
cudaThreadSynchronize();
cudaDeviceSynchronize();
sub_symmetry_bd_ss_partJ<<<GRID_DIM,BLOCK_DIM>>>(ord,func,funcc,SoA[1]);
cudaThreadSynchronize();
cudaDeviceSynchronize();
}
__global__ void sub_fderivs_shc_part1(double *fx,double *fy,double *fz){
@@ -247,13 +249,13 @@ inline void sub_fderivs_shc(int& sst,double * f,double * fh,double *fx,double *f
//cudaMemset(Msh_ gy,0,h_3D_SIZE[0] * sizeof(double));
//cudaMemset(Msh_ gz,0,h_3D_SIZE[0] * sizeof(double));
sub_symmetry_bd_ss(2,f,fh,SoA1);
cudaThreadSynchronize();
cudaDeviceSynchronize();
//compare_result_gpu(0,fh,h_3D_SIZE[2]);
sub_fderivs_sh<<<GRID_DIM,BLOCK_DIM>>>(fh,Msh_ gx,Msh_ gy,Msh_ gz);
cudaThreadSynchronize();
cudaDeviceSynchronize();
sub_fderivs_shc_part1<<<GRID_DIM,BLOCK_DIM>>>(fx,fy,fz);
cudaThreadSynchronize();
cudaDeviceSynchronize();
//compare_result_gpu(1,fx,h_3D_SIZE[0]);
//compare_result_gpu(2,fy,h_3D_SIZE[0]);
//compare_result_gpu(3,fz,h_3D_SIZE[0]);
@@ -451,17 +453,17 @@ inline void sub_fdderivs_shc(int& sst,double * f,double * fh,
//fderivs_sh
sub_symmetry_bd_ss(2,f,fh,SoA1);
cudaThreadSynchronize();
cudaDeviceSynchronize();
//compare_result_gpu(1,fh,h_3D_SIZE[2]);
sub_fderivs_sh<<<GRID_DIM,BLOCK_DIM>>>(fh,Msh_ gx,Msh_ gy,Msh_ gz);
cudaThreadSynchronize();
cudaDeviceSynchronize();
//fdderivs_sh
sub_symmetry_bd_ss(2,f,fh,SoA1);
cudaThreadSynchronize();
cudaDeviceSynchronize();
//compare_result_gpu(21,fh,h_3D_SIZE[2]);
sub_fdderivs_sh<<<GRID_DIM,BLOCK_DIM>>>(fh,Msh_ gxx,Msh_ gxy,Msh_ gxz,Msh_ gyy,Msh_ gyz,Msh_ gzz);
cudaThreadSynchronize();
cudaDeviceSynchronize();
/*compare_result_gpu(11,Msh_ gx,h_3D_SIZE[0]);
compare_result_gpu(12,Msh_ gy,h_3D_SIZE[0]);
compare_result_gpu(13,Msh_ gz,h_3D_SIZE[0]);
@@ -472,7 +474,7 @@ inline void sub_fdderivs_shc(int& sst,double * f,double * fh,
compare_result_gpu(5,Msh_ gyz,h_3D_SIZE[0]);
compare_result_gpu(6,Msh_ gzz,h_3D_SIZE[0]);*/
sub_fdderivs_shc_part1<<<GRID_DIM,BLOCK_DIM>>>(fxx,fxy,fxz,fyy,fyz,fzz);
cudaThreadSynchronize();
cudaDeviceSynchronize();
/*compare_result_gpu(1,fxx,h_3D_SIZE[0]);
compare_result_gpu(2,fxy,h_3D_SIZE[0]);
compare_result_gpu(3,fxz,h_3D_SIZE[0]);
@@ -496,9 +498,9 @@ __global__ void computeRicci_ss_part1(double * dst)
inline void computeRicci_ss(int &sst,double * src,double* dst,double * SoA, Meta* meta)
{
sub_fdderivs_shc(sst,src,Mh_ fh,Mh_ fxx,Mh_ fxy,Mh_ fxz,Mh_ fyy,Mh_ fyz,Mh_ fzz,SoA);
cudaThreadSynchronize();
cudaDeviceSynchronize();
computeRicci_ss_part1<<<GRID_DIM,BLOCK_DIM>>>(dst);
cudaThreadSynchronize();
cudaDeviceSynchronize();
}
__global__ void sub_lopsided_ss_part1(double * dst)
@@ -516,9 +518,9 @@ __global__ void sub_lopsided_ss_part1(double * dst)
inline void sub_lopsided_ss(int& sst,double *src,double* dst,double *SoA)
{
sub_fderivs_shc(sst,src,Mh_ fh,Mh_ fxx,Mh_ fxy,Mh_ fxz,SoA);
cudaThreadSynchronize();
cudaDeviceSynchronize();
sub_lopsided_ss_part1<<<GRID_DIM,BLOCK_DIM>>>(dst);
cudaThreadSynchronize();
cudaDeviceSynchronize();
}
__global__ void sub_kodis_sh_part1(double *f,double *fh,double *f_rhs)
@@ -590,11 +592,11 @@ inline void sub_kodis_ss(int &sst,double *f,double *fh,double *f_rhs,double *SoA
}
//compare_result_gpu(10,f,h_3D_SIZE[0]);
sub_symmetry_bd_ss(3,f,fh,SoA1);
cudaThreadSynchronize();
cudaDeviceSynchronize();
//compare_result_gpu(0,fh,h_3D_SIZE[3]);
sub_kodis_sh_part1<<<GRID_DIM,BLOCK_DIM>>>(f,fh,f_rhs);
cudaThreadSynchronize();
cudaDeviceSynchronize();
//compare_result_gpu(1,f_rhs,h_3D_SIZE[0]);
}
@@ -1699,7 +1701,7 @@ void destroy_meta(Meta *meta,Metass *metass)
if(Msh_ gzz) cudaFree(Msh_ gzz);
#if (GAUGE == 2 || GAUGE == 3 || GAUGE == 4 || GAUGE == 5 || GAUGE == 6 || GAUGE == 7)
if(Mh_ reta) CUDA_SAFE_CALL(cudaFree(Mh_ reta));
if(Mh_ reta) cudaFree(Mh_ reta);
#endif
@@ -1895,7 +1897,7 @@ int gpu_rhs_ss(RHS_SS_PARA)
//1.2 local Data
cudaMalloc((void**)&(Mh_ gxx), matrix_size * sizeof(double));
CUDA_SAFE_CALL( cudaMalloc((void**)&(Mh_ gyy), matrix_size * sizeof(double)));
cudaMalloc((void**)&(Mh_ gyy), matrix_size * sizeof(double));
cudaMalloc((void**)&(Mh_ gzz), matrix_size * sizeof(double));
cudaMalloc((void**)&(Mh_ chix), matrix_size * sizeof(double));
cudaMalloc((void**)&(Mh_ chiy), matrix_size * sizeof(double));
@@ -2160,7 +2162,7 @@ int gpu_rhs_ss(RHS_SS_PARA)
double tmp_con2 = 1/Mass[0] - tmp_con;
cudaMemcpyToSymbol(C1, &tmp_con2, sizeof(double));
double tmp_con2 = 1/Mass[1] - tmp_con;
tmp_con2 = 1/Mass[1] - tmp_con;
cudaMemcpyToSymbol(C2, &tmp_con2, sizeof(double));
@@ -2233,7 +2235,7 @@ int gpu_rhs_ss(RHS_SS_PARA)
if((sst == 2 || sst == 4) && abs[1] < dYh)
{
ijkmin_h[1] = -2;
ijkmin_h[1] = -3;
ijkmin3_h[1] = -3;
}
if((sst == 3 || sst == 5) && abs_Y_ex2 < dYh)
{
@@ -2287,13 +2289,13 @@ int gpu_rhs_ss(RHS_SS_PARA)
#ifdef TIMING1
cudaThreadSynchronize();
cudaDeviceSynchronize();
gettimeofday(&tv2, NULL);
cout<<"TIME USED"<<TimeBetween(tv1, tv2)<<endl;
#endif
//cout<<"GPU meta data ready.\n";
cudaThreadSynchronize();
cudaDeviceSynchronize();
//-------------get device info-------------------------------------
@@ -2306,7 +2308,7 @@ int gpu_rhs_ss(RHS_SS_PARA)
//sub_enforce_ga(matrix_size);
//4.1-----compute rhs---------
compute_rhs_ss_part1<<<GRID_DIM,BLOCK_DIM>>>();
cudaThreadSynchronize();
cudaDeviceSynchronize();
sub_fderivs_shc(sst,Mh_ betax,Mh_ fh,Mh_ betaxx,Mh_ betaxy,Mh_ betaxz,ass);
sub_fderivs_shc(sst,Mh_ betay,Mh_ fh,Mh_ betayx,Mh_ betayy,Mh_ betayz,sas);
@@ -2322,7 +2324,7 @@ int gpu_rhs_ss(RHS_SS_PARA)
sub_fderivs_shc(sst,Mh_ gyz,Mh_ fh,Mh_ gyzx,Mh_ gyzy,Mh_ gyzz, saa);
compute_rhs_ss_part2<<<GRID_DIM,BLOCK_DIM>>>();
cudaThreadSynchronize();
cudaDeviceSynchronize();
sub_fdderivs_shc(sst,Mh_ betax,Mh_ fh,Mh_ gxxx,Mh_ gxyx,Mh_ gxzx,Mh_ gyyx,Mh_ gyzx,Mh_ gzzx,ass);
sub_fdderivs_shc(sst,Mh_ betay,Mh_ fh,Mh_ gxxy,Mh_ gxyy,Mh_ gxzy,Mh_ gyyy,Mh_ gyzy,Mh_ gzzy,sas);
@@ -2332,7 +2334,7 @@ int gpu_rhs_ss(RHS_SS_PARA)
sub_fderivs_shc( sst,Mh_ Gamz, Mh_ fh,Mh_ Gamzx, Mh_ Gamzy, Mh_ Gamzz,ssa);
compute_rhs_ss_part3<<<GRID_DIM,BLOCK_DIM>>>();
cudaThreadSynchronize();
cudaDeviceSynchronize();
computeRicci_ss(sst,Mh_ dxx,Mh_ Rxx,sss, meta);
computeRicci_ss(sst,Mh_ dyy,Mh_ Ryy,sss, meta);
@@ -2340,25 +2342,25 @@ int gpu_rhs_ss(RHS_SS_PARA)
computeRicci_ss(sst,Mh_ gxy,Mh_ Rxy,aas, meta);
computeRicci_ss(sst,Mh_ gxz,Mh_ Rxz,asa, meta);
computeRicci_ss(sst,Mh_ gyz,Mh_ Ryz,saa, meta);
cudaThreadSynchronize();
cudaDeviceSynchronize();
compute_rhs_ss_part4<<<GRID_DIM,BLOCK_DIM>>>();
cudaThreadSynchronize();
cudaDeviceSynchronize();
sub_fdderivs_shc(sst,Mh_ chi,Mh_ fh,Mh_ fxx,Mh_ fxy,Mh_ fxz,Mh_ fyy,Mh_ fyz,Mh_ fzz,sss);
//cudaThreadSynchronize();
//cudaDeviceSynchronize();
//compare_result_gpu(0,Mh_ chi,h_3D_SIZE[0]);
//compare_result_gpu(1,Mh_ chi,h_3D_SIZE[0]);
//compare_result_gpu(2,Mh_ fyz,h_3D_SIZE[0]);
compute_rhs_ss_part5<<<GRID_DIM,BLOCK_DIM>>>();
cudaThreadSynchronize();
cudaDeviceSynchronize();
sub_fdderivs_shc(sst,Mh_ Lap,Mh_ fh,Mh_ fxx,Mh_ fxy,Mh_ fxz,Mh_ fyy,Mh_ fyz,Mh_ fzz,sss);
compute_rhs_ss_part6<<<GRID_DIM,BLOCK_DIM>>>();
cudaThreadSynchronize();
cudaDeviceSynchronize();
#if (GAUGE == 2 || GAUGE == 3 || GAUGE == 4 || GAUGE == 5)
sub_fderivs_shc(sst,Mh_ chi,Mh_ fh, Mh_ dtSfx_rhs, Mh_ dtSfy_rhs, Mh_ dtSfz_rhs,sss);
@@ -2423,7 +2425,7 @@ int gpu_rhs_ss(RHS_SS_PARA)
}
if(co == 0){
compute_rhs_ss_part7<<<GRID_DIM,BLOCK_DIM>>>();
cudaThreadSynchronize();
cudaDeviceSynchronize();
sub_fderivs_shc(sst,Mh_ Axx,Mh_ fh,Mh_ gxxx,Mh_ gxxy,Mh_ gxxz,sss);
sub_fderivs_shc(sst,Mh_ Axy,Mh_ fh,Mh_ gxyx,Mh_ gxyy,Mh_ gxyz,aas);
@@ -2432,7 +2434,7 @@ int gpu_rhs_ss(RHS_SS_PARA)
sub_fderivs_shc(sst,Mh_ Ayz,Mh_ fh,Mh_ gyzx,Mh_ gyzy,Mh_ gyzz,saa);
sub_fderivs_shc(sst,Mh_ Azz,Mh_ fh,Mh_ gzzx,Mh_ gzzy,Mh_ gzzz,sss);
compute_rhs_ss_part8<<<GRID_DIM,BLOCK_DIM>>>();
cudaThreadSynchronize();
cudaDeviceSynchronize();
}
#if (ABV == 1)
@@ -2512,7 +2514,7 @@ int gpu_rhs_ss(RHS_SS_PARA)
//test kodis
//sub_kodis_sh(sst,Msh_ drhodx,Mh_ fh2,Msh_ drhody,sss);
#ifdef TIMING
cudaThreadSynchronize();
cudaDeviceSynchronize();
gettimeofday(&tv2, NULL);
cout<<"MPI rank is: "<<mpi_rank<<" GPU TIME is"<<TimeBetween(tv1, tv2)<<" (s)."<<endl;
#endif
@@ -2522,4 +2524,55 @@ int gpu_rhs_ss(RHS_SS_PARA)
return 0;//TODO return
}
#if (ABEtype == 2)
// Z4C Shell GPU: calls BSSN gpu_rhs_ss with trKd=trK+2*TZ, then applies
// TZ_rhs = alpn1*Hcon/2 and constraint damping on CPU.
int gpu_rhs_z4c_ss(Z4C_SS_PARA)
{
int matrix_size = ex[0] * ex[1] * ex[2];
double k1 = 0.02, k2 = 0.0;
double *trKd_host = new double[matrix_size];
for (int _i = 0; _i < matrix_size; _i++)
trKd_host[_i] = trK[_i] + 2.0 * TZ[_i];
int result = gpu_rhs_ss(calledby, mpi_rank,
ex, T, crho, sigma, R, X, Y, Z,
drhodx, drhody, drhodz, dsigmadx, dsigmady, dsigmadz,
dRdx, dRdy, dRdz,
drhodxx, drhodxy, drhodxz, drhodyy, drhodyz, drhodzz,
dsigmadxx, dsigmadxy, dsigmadxz, dsigmadyy, dsigmadyz, dsigmadzz,
dRdxx, dRdxy, dRdxz, dRdyy, dRdyz, dRdzz,
chi, trKd_host, dxx, gxy, gxz, dyy, gyz, dzz,
Axx, Axy, Axz, Ayy, Ayz, Azz,
Gamx, Gamy, Gamz,
Lap, betax, betay, betaz,
dtSfx, dtSfy, dtSfz,
chi_rhs, trK_rhs,
gxx_rhs, gxy_rhs, gxz_rhs, gyy_rhs, gyz_rhs, gzz_rhs,
Axx_rhs, Axy_rhs, Axz_rhs, Ayy_rhs, Ayz_rhs, Azz_rhs,
Gamx_rhs, Gamy_rhs, Gamz_rhs,
Lap_rhs, betax_rhs, betay_rhs, betaz_rhs,
dtSfx_rhs, dtSfy_rhs, dtSfz_rhs,
rho, Sx, Sy, Sz, Sxx, Sxy, Sxz, Syy, Syz, Szz,
Gamxxx, Gamxxy, Gamxxz, Gamxyy, Gamxyz, Gamxzz,
Gamyxx, Gamyxy, Gamyxz, Gamyyy, Gamyyz, Gamyzz,
Gamzxx, Gamzxy, Gamzxz, Gamzyy, Gamzyz, Gamzzz,
Rxx, Rxy, Rxz, Ryy, Ryz, Rzz,
ham_Res, movx_Res, movy_Res, movz_Res,
Gmx_Res, Gmy_Res, Gmz_Res,
Symmetry, Lev, eps, sst, co);
delete[] trKd_host;
if (result != 0) return result;
for (int _i = 0; _i < matrix_size; _i++) {
double alp = Lap[_i] + 1.0;
TZ_rhs[_i] = alp * ham_Res[_i] * 0.5;
TZ_rhs[_i] -= alp * (2.0 + k2) * k1 * TZ[_i];
trK_rhs[_i] += alp * k1 * (1.0 - k2) * TZ[_i];
}
return 0;
}
#endif // ABEtype == 2
#endif //WithShell

View File

@@ -6,7 +6,6 @@
#define f_compute_rhs_bssn compute_rhs_bssn
#define f_compute_rhs_bssn_ss compute_rhs_bssn_ss
#define f_compute_rhs_bssn_escalar compute_rhs_bssn_escalar
#define f_compute_rhs_bssn_escalar_matter compute_rhs_bssn_escalar_matter
#define f_compute_rhs_bssn_escalar_ss compute_rhs_bssn_escalar_ss
#define f_compute_rhs_Z4c compute_rhs_z4c
#define f_compute_rhs_Z4cnot compute_rhs_z4cnot
@@ -17,7 +16,6 @@
#define f_compute_rhs_bssn COMPUTE_RHS_BSSN
#define f_compute_rhs_bssn_ss COMPUTE_RHS_BSSN_SS
#define f_compute_rhs_bssn_escalar COMPUTE_RHS_BSSN_ESCALAR
#define f_compute_rhs_bssn_escalar_matter COMPUTE_RHS_BSSN_ESCALAR_MATTER
#define f_compute_rhs_bssn_escalar_ss COMPUTE_RHS_BSSN_ESCALAR_SS
#define f_compute_rhs_Z4c COMPUTE_RHS_Z4C
#define f_compute_rhs_Z4cnot COMPUTE_RHS_Z4CNOT
@@ -28,7 +26,6 @@
#define f_compute_rhs_bssn compute_rhs_bssn_
#define f_compute_rhs_bssn_ss compute_rhs_bssn_ss_
#define f_compute_rhs_bssn_escalar compute_rhs_bssn_escalar_
#define f_compute_rhs_bssn_escalar_matter compute_rhs_bssn_escalar_matter_
#define f_compute_rhs_bssn_escalar_ss compute_rhs_bssn_escalar_ss_
#define f_compute_rhs_Z4c compute_rhs_z4c_
#define f_compute_rhs_Z4cnot compute_rhs_z4cnot_
@@ -99,20 +96,6 @@ extern "C"
int &, int &, double &, int &, int &);
}
extern "C"
{
int f_compute_rhs_bssn_escalar_matter(int *, double &, double *, double *, double *, // ex,T,X,Y,Z
double *, double *, // chi, trK
double *, double *, double *, double *, double *, double *, // gij
double *, double *, double *, double *, double *, double *, // Aij
double *, double *, double *, // Gam
double *, double *, double *, double *, double *, double *, double *, // Gauge
double *, double *, // Sphi, Spi
double *, double *, // Sphi, Spi rhs
double *, double *, double *, double *, double *, double *, double *, double *, double *, double *, // stress-energy
int &, int &, double &);
}
extern "C"
{
int f_compute_rhs_bssn_escalar(int *, double &, double *, double *, double *, // ex,T,X,Y,Z

File diff suppressed because it is too large Load Diff

View File

@@ -7,6 +7,9 @@ extern "C" {
enum {
BSSN_CUDA_STATE_COUNT = 24,
BSSN_ESCALAR_CUDA_STATE_COUNT = 26,
BSSN_EM_CUDA_STATE_COUNT = 32,
BSSN_EM_CUDA_SOURCE_COUNT = 4,
BSSN_CUDA_MATTER_COUNT = 10
};
@@ -55,116 +58,53 @@ int bssn_cuda_rk4_substep(void *block_tag,
int &apply_enforce_ga,
double &chitiny);
int bssn_cuda_compute_escalar_matter(void *block_tag,
int bssn_escalar_cuda_rk4_substep(void *block_tag,
int *ex, double *X, double *Y, double *Z,
double **state_host_in,
double *Sphi_host,
double *Spi_host,
double *Sphi_rhs_host,
double *Spi_rhs_host,
double a2,
int &Symmetry,
int &Lev,
double &eps,
int &co,
int &apply_enforce_ga);
int bssn_cuda_escalar_finalize_scalar_fields(void *block_tag,
int *ex, double *X, double *Y, double *Z,
double *Sphi_out_host,
double *Spi_out_host,
double **state_host_out,
const double *propspeed,
const double *soa_flat,
const double *bbox,
double &dT,
double &T,
int &RK4,
int &apply_bam_bc,
int &Symmetry,
int &Lev,
double &eps,
int &precor);
int &co,
int &keep_resident_state,
int &apply_enforce_ga,
double &chitiny);
int bssn_cuda_escalar_has_resident_fields(void *block_tag,
double *Sphi_host,
double *Spi_host);
int bssn_escalar_cuda_compute_constraints(int *ex, double *X, double *Y, double *Z,
double **state_host_in,
double **constraint_host_out,
int &Symmetry,
int &Lev,
double &eps);
int bssn_cuda_escalar_has_any_resident_fields(void *block_tag);
int bssn_em_cuda_rk4_substep(void *block_tag,
int *ex, double *X, double *Y, double *Z,
double **state_host_in,
double **state_host_out,
double **source_host,
const double *propspeed,
const double *soa_flat,
const double *bbox,
double &dT,
double &T,
int &RK4,
int &apply_bam_bc,
int &Symmetry,
int &Lev,
double &eps,
int &co,
int &keep_resident_state,
int &apply_enforce_ga,
double &chitiny);
int bssn_cuda_escalar_download_fields_if_present(void *block_tag,
int *ex,
double *Sphi_host,
double *Spi_host);
int bssn_cuda_pack_escalar_batch_to_host_buffer(void *block_tag,
double **scalar_host_key,
double *host_buffer,
int *ex,
int i0, int j0, int k0,
int sx, int sy, int sz);
int bssn_cuda_unpack_escalar_batch_from_host_buffer(void *block_tag,
double **scalar_host_key,
double *host_buffer,
int *ex,
int i0, int j0, int k0,
int sx, int sy, int sz);
int bssn_cuda_pack_escalar_batch_to_device_buffer(void *block_tag,
double **scalar_host_key,
double *device_buffer,
int *ex,
int i0, int j0, int k0,
int sx, int sy, int sz);
int bssn_cuda_unpack_escalar_batch_from_device_buffer(void *block_tag,
double **scalar_host_key,
double *device_buffer,
int *ex,
int i0, int j0, int k0,
int sx, int sy, int sz);
int bssn_cuda_restrict_escalar_batch_to_host_buffer(void *block_tag,
double **scalar_host_key,
double *host_buffer,
int *ex,
int sx, int sy, int sz,
int fi0, int fj0, int fk0,
const double *scalar_soa);
int bssn_cuda_prolong_escalar_batch_to_host_buffer(void *block_tag,
double **scalar_host_key,
double *host_buffer,
int *ex,
int sx, int sy, int sz,
int ii0, int jj0, int kk0,
int lbc_i, int lbc_j, int lbc_k,
const double *scalar_soa);
int bssn_cuda_restrict_escalar_batch_to_device_buffer(void *block_tag,
double **scalar_host_key,
double *device_buffer,
int *ex,
int sx, int sy, int sz,
int fi0, int fj0, int fk0,
const double *scalar_soa);
int bssn_cuda_prolong_escalar_batch_to_device_buffer(void *block_tag,
double **scalar_host_key,
double *device_buffer,
int *ex,
int sx, int sy, int sz,
int ii0, int jj0, int kk0,
int lbc_i, int lbc_j, int lbc_k,
const double *scalar_soa);
int bssn_cuda_prepare_escalar_inter_time_level(void *block_tag,
int *ex,
double **src1_host_key,
double **src2_host_key,
double **src3_host_key,
double **dst_host_key,
int source_count,
int tindex);
int bssn_em_cuda_resident_zero_fast_state(void *block_tag);
int bssn_cuda_copy_state_region_to_host(void *block_tag,
int state_index,
@@ -184,13 +124,37 @@ int bssn_cuda_download_resident_state(void *block_tag,
int *ex,
double **state_host_out);
int bssn_cuda_download_resident_state_if_present(void *block_tag,
int bssn_escalar_cuda_download_resident_state(void *block_tag,
int *ex,
double **state_host_out);
int bssn_cuda_resident_state_matches(void *block_tag,
int bssn_cuda_upload_resident_state_count(void *block_tag,
int *ex,
double **state_host_in,
int state_count);
int bssn_escalar_cuda_upload_resident_state(void *block_tag,
int *ex,
double **state_host_in);
int bssn_cuda_keep_only_resident_state_count(void *block_tag,
int *ex,
double **state_host_key,
int state_count);
int bssn_escalar_cuda_keep_only_resident_state(void *block_tag,
int *ex,
double **state_host_key);
int bssn_cuda_download_resident_state_count_if_present(void *block_tag,
int *ex,
double **state_host_out,
int state_count);
int bssn_cuda_download_resident_state_if_present(void *block_tag,
int *ex,
double **state_host_out);
int bssn_cuda_download_constraint_outputs(int *ex,
double **constraint_host_out);
@@ -217,6 +181,7 @@ int bssn_cuda_interp_state_point3(void *block_tag,
double pz,
int ordn,
int symmetry,
double **state_host_key,
const double *soa3,
double *out3);
@@ -246,6 +211,15 @@ int bssn_cuda_unpack_state_region_from_host_buffer(void *block_tag,
int i0, int j0, int k0,
int sx, int sy, int sz);
int bssn_cuda_unpack_state_region_from_host_buffer_for_host_views(void *block_tag,
double **state_host_key,
int state_count,
int state_index,
double *host_buffer,
int *ex,
int i0, int j0, int k0,
int sx, int sy, int sz);
int bssn_cuda_pack_state_batch_to_host_buffer(void *block_tag,
int state_count,
double *host_buffer,
@@ -276,42 +250,6 @@ int bssn_cuda_unpack_state_batch_from_host_buffer_for_host_views(void *block_tag
int i0, int j0, int k0,
int sx, int sy, int sz);
int bssn_cuda_restrict_state_batch_to_host_buffer_for_host_views(void *block_tag,
double **state_host_key,
int state_count,
double *host_buffer,
int *ex,
int sx, int sy, int sz,
int fi0, int fj0, int fk0,
const double *state_soa);
int bssn_cuda_restrict_state_batch_to_host_buffer(void *block_tag,
int state_count,
double *host_buffer,
int *ex,
int sx, int sy, int sz,
int fi0, int fj0, int fk0,
const double *state_soa);
int bssn_cuda_prolong_state_batch_to_host_buffer_for_host_views(void *block_tag,
double **state_host_key,
int state_count,
double *host_buffer,
int *ex,
int sx, int sy, int sz,
int ii0, int jj0, int kk0,
int lbc_i, int lbc_j, int lbc_k,
const double *state_soa);
int bssn_cuda_prolong_state_batch_to_host_buffer(void *block_tag,
int state_count,
double *host_buffer,
int *ex,
int sx, int sy, int sz,
int ii0, int jj0, int kk0,
int lbc_i, int lbc_j, int lbc_k,
const double *state_soa);
int bssn_cuda_pack_state_batch_to_device_buffer(void *block_tag,
int state_count,
double *device_buffer,
@@ -452,6 +390,7 @@ int bssn_cuda_upload_state_subset(void *block_tag,
int bssn_cuda_prepare_inter_time_level(void *block_tag,
int *ex,
int state_count,
double **src1_host_key,
double **src2_host_key,
double **src3_host_key,
@@ -465,6 +404,10 @@ void bssn_cuda_release_step_ctx(void *block_tag);
#ifdef __cplusplus
}
// C++-only helpers declared for derived equation classes (Z4C, etc.)
// Defined in bssn_class.C. Requires MyList, Patch, var from including TU.
bool bssn_cuda_use_resident_sync(int lev);
void bssn_cuda_download_level_state_if_present(MyList<Patch> *PatL, MyList<var> *vars, int myrank);
#endif
#endif

View File

@@ -76,8 +76,11 @@ checkpoint::checkpoint(bool checked, const char fname[], int myrank) : filename(
I_Print = (myrank == 0);
int i = strlen(fname);
filename = new char[i+30];
size_t filename_len = out_dir.size() + strlen(fname) + 32;
#ifdef CHECKDETAIL
filename_len += 32;
#endif
filename = new char[filename_len];
// cout << filename << endl;
// cout << i << endl;
@@ -103,7 +106,7 @@ checkpoint::checkpoint(bool checked, const char fname[], int myrank) : filename(
checkpoint::~checkpoint()
{
CheckList->clearList();
if (I_Print)
if (filename)
delete[] filename;
}
@@ -136,7 +139,7 @@ void checkpoint::writecheck_cgh(double time, cgh *GH)
if (I_Print)
{
// char fname[50];
char fname[50+50];
char fname[4096];
sprintf(fname, "%s_cgh.CHK", filename);
outfile.open(fname, ios::out | ios::trunc);
@@ -195,7 +198,7 @@ void checkpoint::readcheck_cgh(double &time, cgh *GH, int myrank, int nprocs, in
int DIM = dim;
ifstream infile;
// char fname[50];
char fname[50+50];
char fname[4096];
sprintf(fname, "%s_cgh.CHK", filename);
infile.open(fname);
@@ -297,7 +300,7 @@ void checkpoint::writecheck_sh(double time, ShellPatch *SH)
if (I_Print)
{
char fname[50];
char fname[4096];
sprintf(fname, "%s_sh.CHK", filename);
outfile.open(fname, ios::out | ios::trunc);
@@ -335,7 +338,7 @@ void checkpoint::readcheck_sh(ShellPatch *SH, int myrank)
int DIM = dim;
ifstream infile;
// char fname[50];
char fname[50+50];
char fname[4096];
sprintf(fname, "%s_sh.CHK", filename);
infile.open(fname);
@@ -390,7 +393,7 @@ void checkpoint::write_Black_Hole_position(int BH_num_input, int BH_num, double
if (I_Print)
{
char fname[50];
char fname[4096];
sprintf(fname, "%s_BHp.CHK", filename);
outfile.open(fname, ios::out | ios::trunc);
@@ -417,7 +420,7 @@ void checkpoint::read_Black_Hole_position(int &BH_num_input, int &BH_num, double
{
ifstream infile;
// char fname[50];
char fname[50+50];
char fname[4096];
sprintf(fname, "%s_BHp.CHK", filename);
infile.open(fname);
@@ -461,7 +464,7 @@ void checkpoint::write_bssn(double LastDump, double Last2dDump, double LastAnas)
if (I_Print)
{
char fname[50];
char fname[4096];
sprintf(fname, "%s_bssn.CHK", filename);
outfile.open(fname, ios::out | ios::trunc);
@@ -481,7 +484,7 @@ void checkpoint::read_bssn(double &LastDump, double &Last2dDump, double &LastAna
{
ifstream infile;
// char fname[50];
char fname[50+50];
char fname[4096];
sprintf(fname, "%s_bssn.CHK", filename);
infile.open(fname);
@@ -506,7 +509,7 @@ void checkpoint::write_bssn(double LastDump, double Last2dDump, double LastAnas)
ofstream outfile;
// char fname[50];
char fname[50+50];
char fname[4096];
sprintf(fname, "%s_bssn.CHK", filename);
outfile.open(fname, ios::out | ios::trunc);
@@ -527,7 +530,7 @@ void checkpoint::read_bssn(double &LastDump, double &Last2dDump, double &LastAna
{
ifstream infile;
// char fname[50];
char fname[50+50];
char fname[4096];
sprintf(fname, "%s_bssn.CHK", filename);
infile.open(fname);
@@ -551,7 +554,7 @@ void checkpoint::write_Black_Hole_position(int BH_num_input, int BH_num, double
ofstream outfile;
// char fname[50];
char fname[50+50];
char fname[4096];
sprintf(fname, "%s_BHp.CHK", filename);
outfile.open(fname, ios::out | ios::trunc);
@@ -581,7 +584,7 @@ void checkpoint::read_Black_Hole_position(int &BH_num_input, int &BH_num, double
{
ifstream infile;
// char fname[50];
char fname[50+50];
char fname[4096];
sprintf(fname, "%s_BHp.CHK", filename);
infile.open(fname);
@@ -628,7 +631,7 @@ void checkpoint::writecheck_cgh(double time, cgh *GH)
ofstream outfile;
// char fname[50];
char fname[50+50];
char fname[4096];
sprintf(fname, "%s_cgh.CHK", filename);
outfile.open(fname, ios::out | ios::trunc);
@@ -738,7 +741,7 @@ void checkpoint::readcheck_cgh(double &time, cgh *GH, int myrank, int nprocs, in
int DIM = dim;
ifstream infile;
// char fname[50];
char fname[50+50];
char fname[4096];
sprintf(fname, "%s_cgh.CHK", filename);
infile.open(fname);

View File

@@ -0,0 +1,412 @@
#ifndef AMSS_NCKU_FD_CUDA_HELPERS_CUH
#define AMSS_NCKU_FD_CUDA_HELPERS_CUH
#ifndef ghost_width
#error "ghost_width must be defined before including fd_cuda_helpers.cuh"
#endif
#if ghost_width < 2 || ghost_width > 5
#error "CUDA finite-difference helpers support ghost_width 2..5"
#endif
#define AMSS_FD_CENTER_RADIUS (ghost_width - 1)
#define AMSS_FD_LK_RADIUS (ghost_width)
__device__ __forceinline__ int fd_axis_radius(int qF, int qminF, int qmaxF)
{
#if AMSS_FD_CENTER_RADIUS >= 4
if (qF - 4 >= qminF && qF + 4 <= qmaxF) return 4;
#endif
#if AMSS_FD_CENTER_RADIUS >= 3
if (qF - 3 >= qminF && qF + 3 <= qmaxF) return 3;
#endif
#if AMSS_FD_CENTER_RADIUS >= 2
if (qF - 2 >= qminF && qF + 2 <= qmaxF) return 2;
#endif
if (qF - 1 >= qminF && qF + 1 <= qmaxF) return 1;
return 0;
}
__device__ __forceinline__ int fd_common_radius(int iF, int jF, int kF,
int iminF, int jminF, int kminF,
int imaxF, int jmaxF, int kmaxF)
{
int r = fd_axis_radius(iF, iminF, imaxF);
const int ry = fd_axis_radius(jF, jminF, jmaxF);
const int rz = fd_axis_radius(kF, kminF, kmaxF);
if (ry < r) r = ry;
if (rz < r) r = rz;
return r;
}
__device__ __forceinline__ double fd_first_coef(int r, int off)
{
switch (r) {
case 1:
if (off == -1) return -1.0;
if (off == 1) return 1.0;
return 0.0;
case 2:
if (off == -2) return 1.0;
if (off == -1) return -8.0;
if (off == 1) return 8.0;
if (off == 2) return -1.0;
return 0.0;
case 3:
if (off == -3) return -1.0;
if (off == -2) return 9.0;
if (off == -1) return -45.0;
if (off == 1) return 45.0;
if (off == 2) return -9.0;
if (off == 3) return 1.0;
return 0.0;
case 4:
if (off == -4) return 3.0;
if (off == -3) return -32.0;
if (off == -2) return 168.0;
if (off == -1) return -672.0;
if (off == 1) return 672.0;
if (off == 2) return -168.0;
if (off == 3) return 32.0;
if (off == 4) return -3.0;
return 0.0;
default:
return 0.0;
}
}
__device__ __forceinline__ double fd_second_coef(int r, int off)
{
switch (r) {
case 1:
if (off == -1) return 1.0;
if (off == 0) return -2.0;
if (off == 1) return 1.0;
return 0.0;
case 2:
if (off == -2) return -1.0;
if (off == -1) return 16.0;
if (off == 0) return -30.0;
if (off == 1) return 16.0;
if (off == 2) return -1.0;
return 0.0;
case 3:
if (off == -3) return 2.0;
if (off == -2) return -27.0;
if (off == -1) return 270.0;
if (off == 0) return -490.0;
if (off == 1) return 270.0;
if (off == 2) return -27.0;
if (off == 3) return 2.0;
return 0.0;
case 4:
if (off == -4) return -9.0;
if (off == -3) return 128.0;
if (off == -2) return -1008.0;
if (off == -1) return 8064.0;
if (off == 0) return -14350.0;
if (off == 1) return 8064.0;
if (off == 2) return -1008.0;
if (off == 3) return 128.0;
if (off == 4) return -9.0;
return 0.0;
default:
return 0.0;
}
}
__device__ __forceinline__ double fd_first_denom(int r)
{
return (r == 4) ? 840.0 : ((r == 3) ? 60.0 : ((r == 2) ? 12.0 : 2.0));
}
__device__ __forceinline__ double fd_second_denom(int r)
{
return (r == 4) ? 5040.0 : ((r == 3) ? 180.0 : ((r == 2) ? 12.0 : 1.0));
}
__device__ __forceinline__ double fd_fetch_axis(const double *src,
int iF, int jF, int kF,
int axis, int off,
int SoA0, int SoA1, int SoA2)
{
if (axis == 0) iF += off;
else if (axis == 1) jF += off;
else kF += off;
return fetch_sym_ord2_direct(src, iF, jF, kF, SoA0, SoA1, SoA2);
}
__device__ __forceinline__ double fd_fetch_axis2(const double *src,
int iF, int jF, int kF,
int axis_a, int off_a,
int axis_b, int off_b,
int SoA0, int SoA1, int SoA2)
{
if (axis_a == 0) iF += off_a;
else if (axis_a == 1) jF += off_a;
else kF += off_a;
if (axis_b == 0) iF += off_b;
else if (axis_b == 1) jF += off_b;
else kF += off_b;
return fetch_sym_ord2_direct(src, iF, jF, kF, SoA0, SoA1, SoA2);
}
__device__ __forceinline__ double fd_first_axis_radius(const double *src,
int iF, int jF, int kF,
int axis, int r, double h,
int SoA0, int SoA1, int SoA2)
{
if (r <= 0) return 0.0;
double s = 0.0;
#pragma unroll
for (int off = -4; off <= 4; ++off) {
const double c = fd_first_coef(r, off);
if (c != 0.0) {
s += c * fd_fetch_axis(src, iF, jF, kF, axis, off, SoA0, SoA1, SoA2);
}
}
return s / (fd_first_denom(r) * h);
}
__device__ __forceinline__ double fd_second_axis_radius(const double *src,
int iF, int jF, int kF,
int axis, int r, double h,
int SoA0, int SoA1, int SoA2)
{
if (r <= 0) return 0.0;
double s = 0.0;
#pragma unroll
for (int off = -4; off <= 4; ++off) {
const double c = fd_second_coef(r, off);
if (c != 0.0) {
s += c * fd_fetch_axis(src, iF, jF, kF, axis, off, SoA0, SoA1, SoA2);
}
}
return s / (fd_second_denom(r) * h * h);
}
__device__ __forceinline__ double fd_mixed_axis_radius(const double *src,
int iF, int jF, int kF,
int axis_a, int r_a, double h_a,
int axis_b, int r_b, double h_b,
int SoA0, int SoA1, int SoA2)
{
if (r_a <= 0 || r_b <= 0) return 0.0;
double s = 0.0;
#pragma unroll
for (int off_a = -4; off_a <= 4; ++off_a) {
const double ca = fd_first_coef(r_a, off_a);
if (ca == 0.0) continue;
#pragma unroll
for (int off_b = -4; off_b <= 4; ++off_b) {
const double cb = fd_first_coef(r_b, off_b);
if (cb != 0.0) {
s += ca * cb * fd_fetch_axis2(src, iF, jF, kF, axis_a, off_a,
axis_b, off_b, SoA0, SoA1, SoA2);
}
}
}
return s / (fd_first_denom(r_a) * fd_first_denom(r_b) * h_a * h_b);
}
__device__ __forceinline__ void fd_compute_first3(const double *src,
int iF, int jF, int kF,
int iminF, int jminF, int kminF,
int imaxF, int jmaxF, int kmaxF,
int SoA0, int SoA1, int SoA2,
double &fx, double &fy, double &fz)
{
#if ghost_width == 3
const int r = fd_common_radius(iF, jF, kF, iminF, jminF, kminF, imaxF, jmaxF, kmaxF);
fx = fd_first_axis_radius(src, iF, jF, kF, 0, r, d_gp.dX, SoA0, SoA1, SoA2);
fy = fd_first_axis_radius(src, iF, jF, kF, 1, r, d_gp.dY, SoA0, SoA1, SoA2);
fz = fd_first_axis_radius(src, iF, jF, kF, 2, r, d_gp.dZ, SoA0, SoA1, SoA2);
#else
fx = fd_first_axis_radius(src, iF, jF, kF, 0, fd_axis_radius(iF, iminF, imaxF),
d_gp.dX, SoA0, SoA1, SoA2);
fy = fd_first_axis_radius(src, iF, jF, kF, 1, fd_axis_radius(jF, jminF, jmaxF),
d_gp.dY, SoA0, SoA1, SoA2);
fz = fd_first_axis_radius(src, iF, jF, kF, 2, fd_axis_radius(kF, kminF, kmaxF),
d_gp.dZ, SoA0, SoA1, SoA2);
#endif
}
__device__ __forceinline__ void fd_compute_second6(const double *src,
int iF, int jF, int kF,
int iminF, int jminF, int kminF,
int imaxF, int jmaxF, int kmaxF,
int SoA0, int SoA1, int SoA2,
double &fxx, double &fxy, double &fxz,
double &fyy, double &fyz, double &fzz)
{
#if ghost_width == 3
const int r = fd_common_radius(iF, jF, kF, iminF, jminF, kminF, imaxF, jmaxF, kmaxF);
const int rx = r, ry = r, rz = r;
#else
const int rx = fd_axis_radius(iF, iminF, imaxF);
const int ry = fd_axis_radius(jF, jminF, jmaxF);
const int rz = fd_axis_radius(kF, kminF, kmaxF);
#endif
fxx = fd_second_axis_radius(src, iF, jF, kF, 0, rx, d_gp.dX, SoA0, SoA1, SoA2);
fyy = fd_second_axis_radius(src, iF, jF, kF, 1, ry, d_gp.dY, SoA0, SoA1, SoA2);
fzz = fd_second_axis_radius(src, iF, jF, kF, 2, rz, d_gp.dZ, SoA0, SoA1, SoA2);
fxy = fd_mixed_axis_radius(src, iF, jF, kF, 0, rx, d_gp.dX, 1, ry, d_gp.dY, SoA0, SoA1, SoA2);
fxz = fd_mixed_axis_radius(src, iF, jF, kF, 0, rx, d_gp.dX, 2, rz, d_gp.dZ, SoA0, SoA1, SoA2);
fyz = fd_mixed_axis_radius(src, iF, jF, kF, 1, ry, d_gp.dY, 2, rz, d_gp.dZ, SoA0, SoA1, SoA2);
}
__device__ __forceinline__ bool fd_lop_fits(int qF, int qminF, int qmaxF,
int dir, int lo, int hi)
{
for (int off = lo; off <= hi; ++off) {
const int q = qF + dir * off;
if (q < qminF || q > qmaxF) return false;
}
return true;
}
__device__ __forceinline__ double fd_lop_fetch_sum(const double *src,
int iF, int jF, int kF,
int axis, int dir,
const double *coef,
int lo, int hi,
int SoA0, int SoA1, int SoA2)
{
double s = 0.0;
for (int off = lo; off <= hi; ++off) {
const double c = coef[off - lo];
if (c != 0.0) {
s += c * fd_fetch_axis(src, iF, jF, kF, axis, dir * off, SoA0, SoA1, SoA2);
}
}
return s;
}
__device__ __forceinline__ double fd_lopsided_axis(const double *src,
int iF, int jF, int kF,
int axis, double speed,
int qF, int qminF, int qmaxF,
double h,
int SoA0, int SoA1, int SoA2)
{
if (speed == 0.0) return 0.0;
const int dir = (speed > 0.0) ? 1 : -1;
const double mag = (speed > 0.0) ? speed : -speed;
#if ghost_width == 2
if (fd_lop_fits(qF, qminF, qmaxF, dir, 0, 2)) {
const double c[] = {-3.0, 4.0, -1.0};
return mag * fd_lop_fetch_sum(src, iF, jF, kF, axis, dir, c, 0, 2, SoA0, SoA1, SoA2) / (2.0 * h);
}
if (fd_lop_fits(qF, qminF, qmaxF, dir, 0, 1)) {
const double c[] = {-1.0, 1.0};
return mag * fd_lop_fetch_sum(src, iF, jF, kF, axis, dir, c, 0, 1, SoA0, SoA1, SoA2) / (2.0 * h);
}
return 0.0;
#elif ghost_width == 3
if (fd_lop_fits(qF, qminF, qmaxF, dir, -1, 3)) {
const double c[] = {-3.0, -10.0, 18.0, -6.0, 1.0};
return mag * fd_lop_fetch_sum(src, iF, jF, kF, axis, dir, c, -1, 3, SoA0, SoA1, SoA2) / (12.0 * h);
}
const int r = fd_axis_radius(qF, qminF, qmaxF);
return speed * fd_first_axis_radius(src, iF, jF, kF, axis, r, h, SoA0, SoA1, SoA2);
#elif ghost_width == 4
if (fd_lop_fits(qF, qminF, qmaxF, dir, -2, 4)) {
const double c[] = {2.0, -24.0, -35.0, 80.0, -30.0, 8.0, -1.0};
return mag * fd_lop_fetch_sum(src, iF, jF, kF, axis, dir, c, -2, 4, SoA0, SoA1, SoA2) / (60.0 * h);
}
if (fd_lop_fits(qF, qminF, qmaxF, dir, -1, 5)) {
const double c[] = {-10.0, -77.0, 150.0, -100.0, 50.0, -15.0, 2.0};
return mag * fd_lop_fetch_sum(src, iF, jF, kF, axis, dir, c, -1, 5, SoA0, SoA1, SoA2) / (60.0 * h);
}
const int r = fd_axis_radius(qF, qminF, qmaxF);
return speed * fd_first_axis_radius(src, iF, jF, kF, axis, r, h, SoA0, SoA1, SoA2);
#else
if (fd_lop_fits(qF, qminF, qmaxF, dir, -3, 5)) {
const double c[] = {-5.0, 60.0, -420.0, -378.0, 1050.0, -420.0, 140.0, -30.0, 3.0};
return mag * fd_lop_fetch_sum(src, iF, jF, kF, axis, dir, c, -3, 5, SoA0, SoA1, SoA2) / (840.0 * h);
}
const int r = fd_axis_radius(qF, qminF, qmaxF);
return speed * fd_first_axis_radius(src, iF, jF, kF, axis, r, h, SoA0, SoA1, SoA2);
#endif
}
__device__ __forceinline__ double fd_ko_coef(int r, int off)
{
const int a = off < 0 ? -off : off;
if (r == 2) {
if (a == 0) return 6.0;
if (a == 1) return -4.0;
if (a == 2) return 1.0;
} else if (r == 3) {
if (a == 0) return -20.0;
if (a == 1) return 15.0;
if (a == 2) return -6.0;
if (a == 3) return 1.0;
} else if (r == 4) {
if (a == 0) return 70.0;
if (a == 1) return -56.0;
if (a == 2) return 28.0;
if (a == 3) return -8.0;
if (a == 4) return 1.0;
} else if (r == 5) {
if (a == 0) return -252.0;
if (a == 1) return 210.0;
if (a == 2) return -120.0;
if (a == 3) return 45.0;
if (a == 4) return -10.0;
if (a == 5) return 1.0;
}
return 0.0;
}
__device__ __forceinline__ double fd_ko_axis(const double *src,
int iF, int jF, int kF,
int axis, int r,
int SoA0, int SoA1, int SoA2)
{
double s = 0.0;
#pragma unroll
for (int off = -5; off <= 5; ++off) {
if (off < -r || off > r) continue;
const double c = fd_ko_coef(r, off);
if (c != 0.0) {
s += c * fd_fetch_axis(src, iF, jF, kF, axis, off, SoA0, SoA1, SoA2);
}
}
return s;
}
__device__ __forceinline__ double fd_ko_term(const double *src,
int iF, int jF, int kF,
int iminF, int jminF, int kminF,
int imaxF, int jmaxF, int kmaxF,
double eps_val,
int SoA0, int SoA1, int SoA2)
{
const int r = AMSS_FD_LK_RADIUS;
if (eps_val <= 0.0) return 0.0;
#if ghost_width >= 4
if (iF - r <= iminF || iF + r >= imaxF ||
jF - r <= jminF || jF + r >= jmaxF ||
kF - r <= kminF || kF + r >= kmaxF) {
return 0.0;
}
#else
if (iF - r < iminF || iF + r > imaxF ||
jF - r < jminF || jF + r > jmaxF ||
kF - r < kminF || kF + r > kmaxF) {
return 0.0;
}
#endif
double cof = 1.0;
#pragma unroll
for (int n = 0; n < 2 * r; ++n) cof *= 2.0;
const double sign = (r & 1) ? 1.0 : -1.0;
const double dx = fd_ko_axis(src, iF, jF, kF, 0, r, SoA0, SoA1, SoA2);
const double dy = fd_ko_axis(src, iF, jF, kF, 1, r, SoA0, SoA1, SoA2);
const double dz = fd_ko_axis(src, iF, jF, kF, 2, r, SoA0, SoA1, SoA2);
return sign * eps_val * (dx / d_gp.dX + dy / d_gp.dY + dz / d_gp.dZ) / cof;
}
#endif

View File

@@ -1,6 +1,6 @@
#ifndef GPU_MEM_H_
#define GPU_MEM_H_
#include "macrodef.fh"
#include "macrodef.h"
#ifdef WithShell
struct Metass
@@ -48,6 +48,8 @@ struct Meta
double * Gamx_rhs,*Gamy_rhs,*Gamz_rhs;//out
double * Lap_rhs, *betax_rhs, *betay_rhs, *betaz_rhs;//out
double * dtSfx_rhs,*dtSfy_rhs,*dtSfz_rhs;//out
double * TZ; //in (Z4C)
double * TZ_rhs; //out (Z4C)
double * rho,*Sx,*Sy,*Sz ; //in
double * Sxx,*Sxy,*Sxz,*Syy,*Syz,*Szz; //in
@@ -132,6 +134,8 @@ __constant__ double SYM = 1.0;
__constant__ double ANTI = -1.0;
__constant__ double FF = 0.75;
__constant__ double eta = 2.0;
__constant__ double kappa1_c = 0.02;
__constant__ double kappa2_c = 0.0;
__constant__ double F1o3;
__constant__ double F2o3;
__constant__ double F3o2 = 1.5;

View File

@@ -10,7 +10,7 @@
#define GaussInt
#define ABEtype 1
#define ABEtype 0
//#define With_AHF
#define Psi4type 0
@@ -167,4 +167,3 @@
#define TINY 1e-10
#endif /* MICRODEF_H */

View File

@@ -35,7 +35,6 @@ f90appflags = -O3 -xHost -fp-model fast=2 -fma -ipo \
endif
TP_OPTFLAGS = -O3 -xHost -fp-model fast=2 -fma -ipo \
-fprofile-instr-use=$(TP_PROFDATA) \
-Dfortran3 -Dnewc $(MKL_INC)
else
## NVHPC defaults: mpicc/mpicxx/mpifort wrappers
@@ -57,6 +56,10 @@ endif
.C.o:
${CXX} $(CXXAPPFLAGS) -c $< $(filein) -o $@
# ShellPatch.C uses OpenMP for setupintintstuff search loops
ShellPatch.o: ShellPatch.C
${CXX} $(CXXAPPFLAGS) $(OMP_FLAG) -c $< $(filein) -o $@
.for.o:
$(f77) -c $< -o $@
@@ -64,11 +67,15 @@ endif
$(Cu) $(CUDA_APP_FLAGS) -c $< -o $@ $(CUDA_LIB_PATH)
# CUDA rewrite of BSSN RHS (drop-in replacement for bssn_rhs_c + stencil helpers)
bssn_rhs_cuda.o: bssn_rhs_cuda.cu bssn_rhs.h macrodef.h
bssn_rhs_cuda.o: bssn_rhs_cuda.cu bssn_rhs.h macrodef.h fd_cuda_helpers.cuh
$(Cu) $(CUDA_APP_FLAGS) -c $< -o $@ $(CUDA_LIB_PATH)
# CUDA rewrite of BSSN Shell-Patch RHS (drop-in replacement for bssn_rhs_ss)
bssn_gpu_rhs_ss.o: bssn_gpu_rhs_ss.cu bssn_gpu.h gpu_rhsSS_mem.h bssn_macro.h macrodef.fh
$(Cu) $(CUDA_APP_FLAGS) -c $< -o $@ $(CUDA_LIB_PATH)
# CUDA rewrite of Z4C Cartesian RHS
z4c_rhs_cuda.o: z4c_rhs_cuda.cu z4c_rhs_cuda.h bssn_rhs.h macrodef.h ricci_gamma.h
z4c_rhs_cuda.o: z4c_rhs_cuda.cu z4c_rhs_cuda.h bssn_rhs.h macrodef.h ricci_gamma.h fd_cuda_helpers.cuh
$(Cu) $(CUDA_APP_FLAGS) -c $< -o $@ $(CUDA_LIB_PATH)
# C rewrite of BSSN RHS kernel and helpers
@@ -124,7 +131,7 @@ else
CFILES_CPU = bssn_rhs_c.o fderivs_c.o fdderivs_c.o kodiss_c.o lopsided_c.o lopsided_kodis_c.o
endif
CFILES_CUDA_BSSN = bssn_rhs_cuda.o
CFILES_CUDA_BSSN = bssn_rhs_cuda.o bssn_gpu_rhs_ss.o
ifeq ($(USE_CUDA_BSSN),1)
CFILES = $(CFILES_CUDA_BSSN)

View File

@@ -1,7 +1,7 @@
## Toolchain selection
## nvhpc : NVIDIA HPC SDK + CUDA-aware MPI (default)
## intel : Intel oneAPI toolchain (legacy path)
TOOLCHAIN ?= nvhpc
TOOLCHAIN ?= intel
## PGO build mode switch (ABE only; TwoPunctureABE always uses opt flags)
## opt : (default) maximum performance with PGO profile-guided optimization

View File

@@ -1,6 +1,7 @@
#ifdef newc
#include <cstdio>
#include <sstream>
using namespace std;
#else
#include <stdio.h>
@@ -77,16 +78,17 @@ monitor::monitor(const char fname[], int myrank, string head)
parameters::str_par.insert(map<string, string>::value_type("output dir", out_dir));
}
// considering checkpoint run
char filename[50];
sprintf(filename, "%s/%s", out_dir.c_str(), fname);
string filename = out_dir + "/" + fname;
int i = 1;
while ((access(filename, F_OK)) != -1)
while ((access(filename.c_str(), F_OK)) != -1)
{
sprintf(filename, "%s/%d_%s", out_dir.c_str(), i, fname);
stringstream ss;
ss << out_dir << "/" << i << "_" << fname;
filename = ss.str();
i++;
}
outfile.open(filename, ios::trunc);
outfile.open(filename.c_str(), ios::trunc);
time_t tnow;
time(&tnow);
@@ -107,16 +109,17 @@ monitor::monitor(const char fname[], int myrank, const int out_rank, string head
if (I_Print)
{
// considering checkpoint run
char filename[50];
sprintf(filename, "%s/%s", out_dir.c_str(), fname);
string filename = out_dir + "/" + fname;
int i = 1;
while ((access(filename, F_OK)) != -1)
while ((access(filename.c_str(), F_OK)) != -1)
{
sprintf(filename, "%s/%d_%s", out_dir.c_str(), i, fname);
stringstream ss;
ss << out_dir << "/" << i << "_" << fname;
filename = ss.str();
i++;
}
outfile.open(filename, ios::trunc);
outfile.open(filename.c_str(), ios::trunc);
time_t tnow;
time(&tnow);

File diff suppressed because it is too large Load Diff

View File

@@ -53,14 +53,6 @@ int z4c_cuda_pack_state_batch_to_host_buffer(void *block_tag,
int i0, int j0, int k0,
int sx, int sy, int sz);
int z4c_cuda_pack_state_batch_to_host_buffer_for_host_views(void *block_tag,
double **state_host_key,
int state_count,
double *host_buffer,
int *ex,
int i0, int j0, int k0,
int sx, int sy, int sz);
int z4c_cuda_unpack_state_batch_from_host_buffer(void *block_tag,
int state_count,
double *host_buffer,
@@ -68,14 +60,6 @@ int z4c_cuda_unpack_state_batch_from_host_buffer(void *block_tag,
int i0, int j0, int k0,
int sx, int sy, int sz);
int z4c_cuda_unpack_state_batch_from_host_buffer_for_host_views(void *block_tag,
double **state_host_key,
int state_count,
double *host_buffer,
int *ex,
int i0, int j0, int k0,
int sx, int sy, int sz);
int z4c_cuda_pack_state_batch_to_device_buffer(void *block_tag,
int state_count,
double *device_buffer,
@@ -83,14 +67,6 @@ int z4c_cuda_pack_state_batch_to_device_buffer(void *block_tag,
int i0, int j0, int k0,
int sx, int sy, int sz);
int z4c_cuda_pack_state_batch_to_device_buffer_for_host_views(void *block_tag,
double **state_host_key,
int state_count,
double *device_buffer,
int *ex,
int i0, int j0, int k0,
int sx, int sy, int sz);
int z4c_cuda_unpack_state_batch_from_device_buffer(void *block_tag,
int state_count,
double *device_buffer,
@@ -98,78 +74,6 @@ int z4c_cuda_unpack_state_batch_from_device_buffer(void *block_tag,
int i0, int j0, int k0,
int sx, int sy, int sz);
int z4c_cuda_unpack_state_batch_from_device_buffer_for_host_views(void *block_tag,
double **state_host_key,
int state_count,
double *device_buffer,
int *ex,
int i0, int j0, int k0,
int sx, int sy, int sz);
int z4c_cuda_pack_state_segments_to_device_buffer(void *block_tag,
int state_count,
double *device_buffer,
int *ex,
int segment_count,
const int *segment_meta);
int z4c_cuda_pack_state_segments_to_device_buffer_for_host_views(void *block_tag,
double **state_host_key,
int state_count,
double *device_buffer,
int *ex,
int segment_count,
const int *segment_meta);
int z4c_cuda_unpack_state_segments_from_device_buffer(void *block_tag,
int state_count,
double *device_buffer,
int *ex,
int segment_count,
const int *segment_meta);
int z4c_cuda_unpack_state_segments_from_device_buffer_for_host_views(void *block_tag,
double **state_host_key,
int state_count,
double *device_buffer,
int *ex,
int segment_count,
const int *segment_meta);
int z4c_cuda_restrict_state_segments_to_device_buffer(void *block_tag,
int state_count,
double *device_buffer,
int *ex,
int segment_count,
const int *segment_meta,
const double *state_soa);
int z4c_cuda_restrict_state_segments_to_device_buffer_for_host_views(void *block_tag,
double **state_host_key,
int state_count,
double *device_buffer,
int *ex,
int segment_count,
const int *segment_meta,
const double *state_soa);
int z4c_cuda_prolong_state_segments_to_device_buffer(void *block_tag,
int state_count,
double *device_buffer,
int *ex,
int segment_count,
const int *segment_meta,
const double *state_soa);
int z4c_cuda_prolong_state_segments_to_device_buffer_for_host_views(void *block_tag,
double **state_host_key,
int state_count,
double *device_buffer,
int *ex,
int segment_count,
const int *segment_meta,
const double *state_soa);
int z4c_cuda_restrict_state_batch_to_device_buffer(void *block_tag,
int state_count,
double *device_buffer,
@@ -178,15 +82,6 @@ int z4c_cuda_restrict_state_batch_to_device_buffer(void *block_tag,
int fi0, int fj0, int fk0,
const double *state_soa);
int z4c_cuda_restrict_state_batch_to_device_buffer_for_host_views(void *block_tag,
double **state_host_key,
int state_count,
double *device_buffer,
int *ex,
int sx, int sy, int sz,
int fi0, int fj0, int fk0,
const double *state_soa);
int z4c_cuda_prolong_state_batch_to_device_buffer(void *block_tag,
int state_count,
double *device_buffer,
@@ -196,16 +91,6 @@ int z4c_cuda_prolong_state_batch_to_device_buffer(void *block_tag,
int lbc_i, int lbc_j, int lbc_k,
const double *state_soa);
int z4c_cuda_prolong_state_batch_to_device_buffer_for_host_views(void *block_tag,
double **state_host_key,
int state_count,
double *device_buffer,
int *ex,
int sx, int sy, int sz,
int ii0, int jj0, int kk0,
int lbc_i, int lbc_j, int lbc_k,
const double *state_soa);
int z4c_cuda_download_state_subset(void *block_tag,
int *ex,
int subset_count,
@@ -218,36 +103,7 @@ int z4c_cuda_upload_state_subset(void *block_tag,
const int *state_indices,
double **state_host_in);
int z4c_cuda_compute_constraints_resident(void *block_tag,
int *ex, double *X, double *Y, double *Z,
int Symmetry, double eps, int co,
double **constraint_host_out);
int z4c_cuda_interp_state_point3(void *block_tag,
int *ex,
int state0,
int state1,
int state2,
double x0,
double y0,
double z0,
double dx,
double dy,
double dz,
double px,
double py,
double pz,
int ordn,
int symmetry,
const double *soa3,
double *out3);
int z4c_cuda_download_constraint_outputs(int *ex,
double **constraint_host_out);
int z4c_cuda_has_resident_state(void *block_tag);
int z4c_cuda_resident_state_matches(void *block_tag,
double **state_host_key);
void z4c_cuda_release_step_ctx(void *block_tag);

View File

@@ -138,25 +138,58 @@ def _stop_cuda_mps(runtime_env):
def _gpu_runtime_env():
runtime_env = os.environ.copy()
finite_difference = str(getattr(input_data, "Finite_Diffenence_Method", "4th-order")).strip()
defaults = {
"AMSS_EVOLVE_TIMING": "1",
"AMSS_ESCALAR_STEP_TIMING": "0",
"AMSS_INTERP_FAST": "1",
"AMSS_INTERP_GPU": "1",
"AMSS_ANALYSIS_MAP_EVERY": "1000000",
"AMSS_CUDA_AWARE_MPI": "1",
"AMSS_CUDA_KEEP_RESIDENT_AFTER_STEP": "1",
"AMSS_CUDA_Z4C_KEEP_RESIDENT_AFTER_STEP": "1",
"AMSS_CUDA_KEEP_ALL_LEVELS": "1",
"AMSS_CUDA_Z4C_AMR_DEVICE": "0",
"AMSS_CUDA_AMR_RESTRICT_DEVICE": "1",
"AMSS_CUDA_ESCALAR_KEEP_RESIDENT_AFTER_STEP": "1",
"AMSS_CUDA_ESCALAR_KEEP_ALL_LEVELS": "1",
"AMSS_CUDA_EM_CACHE_SOURCES": "1",
"AMSS_CUDA_EM_ZERO_FASTPATH": "1",
"AMSS_EM_ZERO_ANALYSIS_FASTPATH": "1",
"AMSS_EM_ZERO_RESIDENT_DOWNLOAD_FASTPATH": "1",
"AMSS_CUDA_AMR_HOST_STAGED": "1",
"AMSS_CUDA_AMR_RESTRICT_DEVICE": "0",
"AMSS_CUDA_AMR_RESTRICT_BATCH": "0",
"AMSS_CUDA_DEVICE_SEGMENT_BATCH": "0",
"AMSS_CUDA_PIN_ESCALAR_TRANSFERS": "0",
"AMSS_ESCALAR_GPU_RK": "0",
"AMSS_CUDA_UNCACHED_DEVICE_BUFFERS": "1",
"AMSS_SHELL_FAST_INTERP": "0",
"AMSS_SHELL_PARALLEL_INTERP": "0",
"AMSS_SHELL_CUDA_INTERP": "0",
}
if finite_difference in ("2nd-order", "8th-order"):
defaults.update({
"AMSS_INTERP_FAST": "0",
"AMSS_INTERP_GPU": "0",
"AMSS_CUDA_AWARE_MPI": "0",
})
if finite_difference == "8th-order" and getattr(input_data, "Equation_Class", "") == "BSSN-EM":
defaults.update({
"AMSS_CUDA_AMR_RESTRICT_DEVICE": "1",
"AMSS_CUDA_AMR_RESTRICT_BATCH": "1",
"AMSS_CUDA_DEVICE_SEGMENT_BATCH": "1",
})
if getattr(input_data, "basic_grid_set", "") == "Shell-Patch":
defaults.update({
"AMSS_CUDA_AWARE_MPI": "0",
"AMSS_SHELL_FAST_INTERP": "1",
"AMSS_SHELL_PARALLEL_INTERP": "1",
"AMSS_SHELL_INTERP_THREADS": "16",
})
if getattr(input_data, "Equation_Class", "") in ("BSSN", "BSSN-EScalar", "Z4C"):
defaults["AMSS_CUDA_AMR_RESTRICT_DEVICE"] = "1"
if getattr(input_data, "Equation_Class", "") == "Z4C":
defaults["AMSS_CUDA_Z4C_KEEP_RESIDENT_AFTER_STEP"] = "0"
defaults["AMSS_CUDA_KEEP_ALL_LEVELS"] = "0"
defaults.update({
"AMSS_Z4C_CUDA_RESIDENT": "1",
"AMSS_CONSTRAINT_OUT_EVERY": "1000000",
})
for key, value in defaults.items():
runtime_env.setdefault(key, value)
@@ -268,29 +301,58 @@ def run_ABE():
mpi_env = None
started_mps = False
mpi_processes = int(input_data.MPI_processes)
if (input_data.GPU_Calculation == "yes" and
getattr(input_data, "Equation_Class", "") == "Z4C"):
z4c_env_np = os.environ.get("AMSS_Z4C_GPU_MPI_PROCESSES")
if z4c_env_np and int(z4c_env_np) > 0:
mpi_processes = int(z4c_env_np)
elif mpi_processes < 4:
mpi_processes = 4
if (input_data.GPU_Calculation == "yes" and
getattr(input_data, "basic_grid_set", "") == "Shell-Patch"):
shell_env_np = os.environ.get("AMSS_SHELL_GPU_MPI_PROCESSES")
if shell_env_np and int(shell_env_np) > 0:
mpi_processes = int(shell_env_np)
elif mpi_processes < 4:
mpi_processes = 4
if (input_data.GPU_Calculation == "no"):
mpi_command = NUMACTL_CPU_BIND + " mpirun -np " + str(input_data.MPI_processes) + " ./ABE"
mpi_command = NUMACTL_CPU_BIND + " mpirun -np " + str(mpi_processes) + " ./ABE"
#mpi_command = " mpirun -np " + str(input_data.MPI_processes) + " ./ABE"
mpi_command_outfile = "ABE_out.log"
elif (input_data.GPU_Calculation == "yes"):
mpi_command = NUMACTL_CPU_BIND + " mpirun -np " + str(input_data.MPI_processes) + " ./ABE_CUDA"
mpi_command = NUMACTL_CPU_BIND + " I_MPI_OFFLOAD=1 I_MPI_OFFLOAD_IPC=0 mpirun -np " + str(mpi_processes) + " ./ABE_CUDA"
mpi_command_outfile = "ABEGPU_out.log"
mpi_env = _gpu_runtime_env()
started_mps = _start_cuda_mps_if_requested(mpi_env)
print(" GPU optimized runtime switches:")
print(f" MPI processes={mpi_processes}")
print(f" AMSS_INTERP_FAST={mpi_env.get('AMSS_INTERP_FAST', '')}")
print(f" AMSS_INTERP_GPU={mpi_env.get('AMSS_INTERP_GPU', '')}")
print(f" AMSS_ANALYSIS_MAP_EVERY={mpi_env.get('AMSS_ANALYSIS_MAP_EVERY', '')}")
print(f" AMSS_EVOLVE_TIMING={mpi_env.get('AMSS_EVOLVE_TIMING', '')}")
print(f" AMSS_ESCALAR_STEP_TIMING={mpi_env.get('AMSS_ESCALAR_STEP_TIMING', '')}")
print(f" AMSS_CUDA_AWARE_MPI={mpi_env.get('AMSS_CUDA_AWARE_MPI', '')}")
print(f" AMSS_CUDA_KEEP_RESIDENT_AFTER_STEP={mpi_env.get('AMSS_CUDA_KEEP_RESIDENT_AFTER_STEP', '')}")
print(f" AMSS_CUDA_Z4C_KEEP_RESIDENT_AFTER_STEP={mpi_env.get('AMSS_CUDA_Z4C_KEEP_RESIDENT_AFTER_STEP', '')}")
print(f" AMSS_CUDA_KEEP_ALL_LEVELS={mpi_env.get('AMSS_CUDA_KEEP_ALL_LEVELS', '')}")
print(f" AMSS_CUDA_Z4C_AMR_DEVICE={mpi_env.get('AMSS_CUDA_Z4C_AMR_DEVICE', '')}")
print(f" AMSS_CUDA_ESCALAR_KEEP_RESIDENT_AFTER_STEP={mpi_env.get('AMSS_CUDA_ESCALAR_KEEP_RESIDENT_AFTER_STEP', '')}")
print(f" AMSS_CUDA_ESCALAR_KEEP_ALL_LEVELS={mpi_env.get('AMSS_CUDA_ESCALAR_KEEP_ALL_LEVELS', '')}")
print(f" AMSS_CUDA_EM_CACHE_SOURCES={mpi_env.get('AMSS_CUDA_EM_CACHE_SOURCES', '')}")
print(f" AMSS_CUDA_EM_ZERO_FASTPATH={mpi_env.get('AMSS_CUDA_EM_ZERO_FASTPATH', '')}")
print(f" AMSS_EM_ZERO_ANALYSIS_FASTPATH={mpi_env.get('AMSS_EM_ZERO_ANALYSIS_FASTPATH', '')}")
print(f" AMSS_EM_ZERO_RESIDENT_DOWNLOAD_FASTPATH={mpi_env.get('AMSS_EM_ZERO_RESIDENT_DOWNLOAD_FASTPATH', '')}")
print(f" AMSS_CUDA_AMR_HOST_STAGED={mpi_env.get('AMSS_CUDA_AMR_HOST_STAGED', '')}")
print(f" AMSS_CUDA_AMR_RESTRICT_DEVICE={mpi_env.get('AMSS_CUDA_AMR_RESTRICT_DEVICE', '')}")
print(f" AMSS_CUDA_AMR_RESTRICT_BATCH={mpi_env.get('AMSS_CUDA_AMR_RESTRICT_BATCH', '')}")
print(f" AMSS_CUDA_DEVICE_SEGMENT_BATCH={mpi_env.get('AMSS_CUDA_DEVICE_SEGMENT_BATCH', '')}")
print(f" AMSS_CUDA_PIN_ESCALAR_TRANSFERS={mpi_env.get('AMSS_CUDA_PIN_ESCALAR_TRANSFERS', '')}")
print(f" AMSS_ESCALAR_GPU_RK={mpi_env.get('AMSS_ESCALAR_GPU_RK', '')}")
print(f" AMSS_CUDA_UNCACHED_DEVICE_BUFFERS={mpi_env.get('AMSS_CUDA_UNCACHED_DEVICE_BUFFERS', '')}")
print(f" AMSS_SHELL_FAST_INTERP={mpi_env.get('AMSS_SHELL_FAST_INTERP', '')}")
print(f" AMSS_SHELL_PARALLEL_INTERP={mpi_env.get('AMSS_SHELL_PARALLEL_INTERP', '')}")
print(f" AMSS_SHELL_CUDA_INTERP={mpi_env.get('AMSS_SHELL_CUDA_INTERP', '')}")
print(f" AMSS_SHELL_INTERP_THREADS={mpi_env.get('AMSS_SHELL_INTERP_THREADS', '')}")
print(f" AMSS_Z4C_CUDA_RESIDENT={mpi_env.get('AMSS_Z4C_CUDA_RESIDENT', '')}")
print(f" AMSS_CONSTRAINT_OUT_EVERY={mpi_env.get('AMSS_CONSTRAINT_OUT_EVERY', '')}")
if "CUDA_MPS_PIPE_DIRECTORY" in mpi_env:
print(f" CUDA_MPS_PIPE_DIRECTORY={mpi_env['CUDA_MPS_PIPE_DIRECTORY']}")

View File

@@ -822,8 +822,21 @@ def generate_constraint_check_plot( outdir, figure_outdir, input_level_number ):
print( " Begin the constraint violation plot for grid level number = ", input_level_number )
if (not os.path.exists(file0)) or os.path.getsize(file0) == 0:
if ( input_level_number == 0 ):
print( " Constraint data file is empty; skip constraint violation plots" )
print( )
return
# load the full data file (assumed whitespace-separated floats)
data = numpy.loadtxt(file0)
data = numpy.atleast_2d(data)
if data.shape[1] < 8:
if ( input_level_number == 0 ):
print( " Constraint data file has insufficient columns; skip constraint violation plots" )
print( )
return
# extract columns from the constraint data file
time = data[:,0]