Compare commits


26 Commits

Author SHA1 Message Date
8b68b5d782 fixup! Fix load explosion: use subprocess for binary data plots to avoid thread conflict
* It seems we don't need to set so many variables; `OMP_NUM_THREADS` alone is enough.

Test: commented out the code that sets the other environment variables; the program still runs normally.
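A minimal sketch of the pattern this fixup leaves in place (the script name plot_binary_data.py is a placeholder, not the repository's actual file):

    import os
    import subprocess
    import sys

    # Run the binary-data plotting in its own process; only OMP_NUM_THREADS is
    # pinned, which this fixup found to be sufficient on its own.
    env = dict(os.environ, OMP_NUM_THREADS="1")
    subprocess.run([sys.executable, "plot_binary_data.py"], env=env, check=True)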
2026-02-09 23:00:17 +08:00
dd2443c926 Fix load explosion: use subprocess for binary data plots to avoid thread conflict
Co-authored-by: copilot-swe-agent[bot] <198982749+copilot@users.noreply.github.com>
2026-02-09 21:40:27 +08:00
2d7ba5c60c [2/2] Implement multiprocessing-based parallel plotting 2026-02-09 21:36:45 +08:00
4777cad4ed [1/2] Implement multiprocessing-based parallel plotting 2026-02-09 15:13:18 +08:00
afd4006da2 Cache GSL in SyncPlan and apply async Sync to Z4c_class
Major optimization: Pre-build grid segment lists (GSLs) once per Step() call
via SyncPreparePlan(), then reuse them across all 4 RK4 substep SyncBegin calls
via SyncBeginWithPlan(). This eliminates the O(cpusize * blocks^2) GSL rebuild
cost that was incurred on every ghost zone exchange.

Applied async SyncBegin/SyncEnd overlap pattern to Z4c_class.C (ABEtype==2,
the default configuration), which was still using blocking Parallel::Sync.
Both the regular and CPBC variants of Z4c Step() are now optimized.
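A minimal sketch of the resulting call pattern inside Step(), using the API added in this series (SynchList stands in for SynchList_pre/SynchList_cor; the computation between calls is elided):

    // Build the grid segment lists once per Step() call
    Parallel::SyncPlan *plan = Parallel::SyncPreparePlan(GH->PatL[lev], Symmetry);
    for (int substep = 0; substep < 4; substep++) // RK4: predictor + 3 correctors
    {
        // ... compute RHS for this substep ...
        Parallel::SyncHandle *h = Parallel::SyncBeginWithPlan(plan, SynchList);
        // ... overlapped work: error checks, Shell patch computation ...
        Parallel::SyncEnd(h); // waits and unpacks; the plan keeps owning the GSLs
    }
    Parallel::SyncFreePlan(plan);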

Co-authored-by: copilot-swe-agent[bot] <198982749+copilot@users.noreply.github.com>
2026-02-08 16:46:44 +08:00
copilot-swe-agent[bot]
a918dc103e Add SyncBegin/SyncEnd to Parallel for MPI communication-computation overlap
Split the blocking Parallel::Sync into async SyncBegin (initiates local copy +
MPI_Isend/Irecv) and SyncEnd (MPI_Waitall + unpack). This allows overlapping MPI
ghost zone exchange with error checking and Shell patch computation.

Modified Step() in bssn_class.C for both PSTR==0 and PSTR==1/2/3 versions to
start Sync before error checks, overlapping the MPI_Allreduce with the ongoing
ghost zone transfers.
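A minimal sketch of the overlap this enables, using only calls that appear in this change:

    Parallel::SyncHandle *h = Parallel::SyncBegin(GH->PatL[lev], SynchList_pre, Symmetry); // MPI_Isend/Irecv now in flight
    int erh = ERROR;
    MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD); // overlaps the ghost exchange
    Parallel::SyncEnd(h); // MPI_Waitall + unpack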

Co-authored-by: copilot-swe-agent[bot] <198982749+copilot@users.noreply.github.com>
2026-02-08 16:19:13 +08:00
copilot-swe-agent[bot]
38c2c30186 Merge lopsided advection + kodis dissipation to share symmetry_bd buffer
Add lopsided_kodis subroutine in lopsidediff.f90 that combines upwind
advection (lopsided) and Kreiss-Oliger dissipation (kodis) into one
function sharing a single fh buffer from symmetry_bd. This eliminates
27 redundant full-grid copies per RHS evaluation (108 per timestep).

For gxx/gyy/gzz variables: kodis stencil coefficients sum to zero
(1-6+15-20+15-6+1=0), so using gxx(=dxx+1) instead of dxx for the
dissipation buffer is mathematically exact.
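Spelled out, with c the overall dissipation prefactor:

    (D u)_i = c * (u_{i-3} - 6 u_{i-2} + 15 u_{i-1} - 20 u_i + 15 u_{i+1} - 6 u_{i+2} + u_{i+3})

so for any constant s, D(u + s) = D(u) + c*s*(1 - 6 + 15 - 20 + 15 - 6 + 1) = D(u): substituting gxx(=dxx+1) for dxx changes nothing in the dissipation term.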

Update bssn_rhs.f90 to use the merged lopsided_kodis calls.

Co-authored-by: ianchb <45872450+ianchb@users.noreply.github.com>
2026-02-08 15:42:44 +08:00
b8e41b2b39 Only enable OpenMP for TwoPunctures 2026-02-08 13:00:37 +08:00
133e4f13a2 Use OpenMP's parallel for with schedule(dynamic,1) 2026-02-07 19:48:24 +08:00
914c4f4791 Optimize memory allocation in JFD_times_dv
This should reduce the pressure on the memory allocator, indirectly improving caching behavior.

Co-authored-by: copilot-swe-agent[bot] <198982749+copilot@users.noreply.github.com>
2026-02-07 15:55:45 +08:00
f345b0e520 Performance optimization for the TwoPunctures module
* Re-enabled OpenMP.

1. Batch spectral derivatives (Chebyshev & Fourier) via precomputed matrices:
Chebyshev/Fourier transforms and derivatives are precomputed as explicit physical-space operator matrices.
A batched DGEMM then applies them to entire tensor fields; this is mathematically identical to the original per-line transforms but vastly faster (see the sketch after this list).

2. Gauss-Seidel relaxation & tridiagonal solver workspace reuse:
Per-thread reusable workspaces replace per-call heap new/delete in all tridiagonal and relaxation routines.

3. Efficient OpenMP multithreading throughout relaxation/deriv:
relax_omp and friends parallelize over grouped lines/planes, maximizing threading efficiency and memory independence.
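As a sketch of the batched application described in item 1 (function name and data layout are illustrative assumptions, not the PR's actual code; column-major storage with one spectral line per column):

    #include <mkl_cblas.h>

    // dV = D1 * V: apply an n-by-n physical-space derivative operator D1 to
    // m grid lines at once, instead of one transform per line.
    void batched_derivative(int n, int m, const double *D1, const double *V, double *dV)
    {
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, m, n,
                    1.0, D1, n, // n x n operator matrix
                    V, n,       // n x m batch of lines
                    0.0, dV, n);
    }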

Co-authored-by: copilot-swe-agent[bot] <198982749+copilot@users.noreply.github.com>
2026-02-07 14:48:47 +08:00
f5ed23d687 Revert "Eliminate hot-path heap allocations in TwoPunctures spectral solver"
This reverts commit 09ffdb553d.
2026-02-07 14:45:25 +08:00
03d501db04 Display the runtime of TwoPunctures 2026-02-07 14:45:16 +08:00
09ffdb553d Eliminate hot-path heap allocations in TwoPunctures spectral solver
Pre-allocate workspace buffers as class members to remove ~8M malloc/free
pairs per Newton iteration from LineRelax, ThomasAlgorithm, JFD_times_dv,
J_times_dv, chebft_Zeros, fourft, Derivatives_AB3, and F_of_v.
Rewrite ThomasAlgorithm to operate in-place on input arrays.
Results are bit-identical; no algorithmic changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 21:20:35 +08:00
699e443c7a Optimize polint/polin2/polin3 interpolation for cache locality
Changes:
- polint: Rewrite Neville algorithm from array-slice operations to
  scalar loop. Mathematically identical, avoids temporary array
  allocations for den(1:n-m) slices. (Credit: yx-fmisc branch)

- polin2: Swap interpolation order so inner loop accesses ya(:,j)
  (contiguous in Fortran column-major) instead of ya(i,:) (strided).
  Tensor product interpolation is commutative; all call sites pass
  identical coordinate arrays for all dimensions.

- polin3: Swap interpolation order to process contiguous first
  dimension first: ya(:,j,k) -> yatmp(:,k) -> ymtmp(:).
  Same commutativity argument as polin2.

Compile-time safety switch:
  -DPOLINT_LEGACY_ORDER  restores original dimension ordering
  Default (no flag):     uses optimized contiguous-memory ordering

Usage:
  # Production (optimized order):
  make clean && make -j ABE

  # Fallback if results differ (original order):
  Add -DPOLINT_LEGACY_ORDER to f90appflags in makefile.inc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 19:00:35 +08:00
24bfa44911 Disable NaN sanity check in bssn_rhs.f90 for production builds
Wrap the NaN sanity check (21 sum() full-array traversals per RHS call)
with #ifdef DEBUG so it is compiled out in production builds.

This eliminates 84 redundant full-array scans per timestep (21 per RHS
call × 4 RK4 substages) that serve no purpose when input data is valid.

Usage:
  - Production build (default): NaN check is disabled, no changes needed.
  - Debug build: add -DDEBUG to f90appflags in makefile.inc, e.g.
      f90appflags = -O3 ... -DDEBUG -fpp ...
    to re-enable the NaN sanity check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 18:36:29 +08:00
6738854a9d Compiler-level and hot-path optimizations for GW150914
- makefile.inc: add -ipo (interprocedural optimization) and
  -align array64byte (64-byte array alignment for vectorization)
- fmisc.f90: remove redundant funcc=0.d0 zeroing from symmetry_bd,
  symmetry_tbd, symmetry_stbd (~328+ full-array memsets eliminated
  per timestep)
- enforce_algebra.f90: rewrite enforce_ag and enforce_ga as point-wise
  loops, replacing 12 stack-allocated 3D temporary arrays with scalar
  locals for better cache locality

All changes are mathematically equivalent — no algorithmic modifications.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 17:13:39 +08:00
223ec17a54 input updated 2026-02-06 13:57:48 +08:00
26c81d8e81 makefile updated 2026-01-19 23:53:16 +08:00
CGH0S7
9deeda9831 Refactor verification method and optimize numerical kernels with oneMKL BLAS
This commit transitions the verification approach from post-Newtonian theory comparison to regression testing against baseline simulations, and optimizes critical numerical kernels using Intel oneMKL BLAS routines.

Verification Changes:
- Replace PN theory-based RMS calculation with trajectory-based comparison
- Compare optimized results against baseline (GW150914-origin) on XY plane
- Compute RMS independently for BH1 and BH2, report maximum as final metric
- Update documentation to reflect new regression test methodology

Performance Optimizations:
- Replace manual vector operations with oneMKL BLAS routines (see the sketch after this message):
  * norm2() and scalarproduct() now use cblas_dnrm2/cblas_ddot (C++)
  * L2 norm calculations use DDOT for dot products (Fortran)
  * Interpolation weighted sums use DDOT (Fortran)
- Disable OpenMP threading (switch to sequential MKL) for better performance

Build Configuration:
- Switch from lmkl_intel_thread to lmkl_sequential
- Remove -qopenmp flags from compiler options
- Maintain aggressive optimization flags (-O3, -xHost, -fp-model fast=2, -fma)

Other Changes:
- Update .gitignore for GW150914-origin, docs, and temporary files
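A minimal sketch of the C++ replacements named under Performance Optimizations (the PR's actual wrapper signatures may differ):

    #include <mkl_cblas.h>

    // Euclidean norm via oneMKL: replaces a hand-written loop
    double norm2(const double *v, int n) { return cblas_dnrm2(n, v, 1); }

    // Dot product via oneMKL: replaces a hand-written loop
    double scalarproduct(const double *a, const double *b, int n) { return cblas_ddot(n, a, 1, b, 1); }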
2026-01-18 14:25:21 +08:00
CGH0S7
3a7bce3af2 Update Intel oneAPI configuration and CPU binding settings
- Update makefile.inc with Intel oneAPI compiler flags and oneMKL linking
- Configure taskset CPU binding to use nohz_full cores (4-55, 60-111)
- Set build parallelism to 104 jobs for faster compilation
- Update MPI process count to 48 in input configuration
2026-01-17 20:41:02 +08:00
CGH0S7
c6945bb095 Rename verify_accuracy.py to AMSS_NCKU_Verify_ASC26.py and improve visual output 2026-01-17 14:54:33 +08:00
CGH0S7
0d24f1503c Add accuracy verification script for GW150914 simulation
- Verify RMS error < 1% (black hole trajectory vs. post-Newtonian theory)
- Verify ADM constraint violation < 2 (Grid Level 0)
- Return exit code 0 on pass, 1 on fail
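A hedged sketch of the trajectory check (thresholds from this message; the reference file name and column layout are assumptions):

    import sys
    import numpy as np

    run = np.loadtxt("bssn_BH.dat")       # assumed columns: t, x, y, ...
    ref = np.loadtxt("pn_reference.dat")  # hypothetical file holding the PN-theory trajectory
    n = min(len(run), len(ref))
    rms = float(np.sqrt(np.mean((run[:n, 1:3] - ref[:n, 1:3]) ** 2)))
    sys.exit(0 if rms < 0.01 else 1)      # RMS error < 1% passes; nonzero exit on fail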

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-17 00:37:30 +08:00
CGH0S7
cb252f5ea2 Optimize numerical algorithms with Intel oneMKL
- FFT.f90: Replace hand-written Cooley-Tukey FFT with oneMKL DFTI
- ilucg.f90: Replace manual dot product loop with BLAS DDOT
- gaussj.C: Replace Gauss-Jordan elimination with LAPACK dgesv/dgetri (see the sketch after this message)
- makefile.inc: Add MKL include paths and library linking

All optimizations maintain mathematical equivalence and numerical precision.
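A minimal sketch of the gaussj.C replacement (the wrapper name solve_linear_system is illustrative; LAPACKE_dgesv is the oneMKL LAPACK entry point):

    #include <mkl_lapacke.h>

    // Solve A x = b by LU factorization with partial pivoting (replaces Gauss-Jordan).
    // A is n x n row-major; b holds the RHS on entry and the solution on return.
    int solve_linear_system(int n, double *A, double *b)
    {
        lapack_int *ipiv = new lapack_int[n];
        lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, n, 1, A, n, ipiv, b, 1);
        delete[] ipiv;
        return (int)info; // 0 on success
    }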
2026-01-16 10:58:11 +08:00
CGH0S7
7a76cbaafd Add numactl CPU binding to avoid cores 0-3 and 56-59
Bind all computation processes (ABE, ABEGPU, TwoPunctureABE) to CPU cores 4-55 and 60-111 using numactl --physcpubind to prevent interference with system processes on reserved cores.
2026-01-16 10:24:46 +08:00
CGH0S7
57a7376044 Switch compiler toolchain from GCC to Intel oneAPI
- makefile.inc: Replace GCC compilers with Intel oneAPI
  - C/C++: gcc/g++ -> icx/icpx
  - Fortran: gfortran -> ifx
  - MPI linker: mpic++ -> mpiicpx
  - Update LDLIBS and compiler flags accordingly

- macrodef.h: Fix include path (microdef.fh -> macrodef.fh)

Requires: source /home/intel/oneapi/setvars.sh before build
2026-01-15 16:32:12 +08:00
26 changed files with 5440 additions and 1124 deletions

.gitignore

@@ -1,3 +1,6 @@
__pycache__
GW150914
GW150914-origin
+docs
+*.tmp


@@ -1,445 +0,0 @@
##################################################################
##
## AMSS-NCKU ABE Test Program (Skip TwoPuncture if data exists)
## Modified from AMSS_NCKU_Program.py
## Author: Xiaoqu
## Modified: 2026/02/01
##
##################################################################
##################################################################
## Print program introduction
import print_information
print_information.print_program_introduction()
##################################################################
import AMSS_NCKU_Input as input_data
##################################################################
## Create directories to store program run data
import os
import shutil
import sys
import time
## Set the output directory according to the input file
File_directory = os.path.join(input_data.File_directory)
## Check if output directory exists and if TwoPuncture data is available
skip_twopuncture = False
output_directory = os.path.join(File_directory, "AMSS_NCKU_output")
binary_results_directory = os.path.join(output_directory, input_data.Output_directory)
if os.path.exists(File_directory):
    print( " Output directory already exists." )
    print()
    # Check if TwoPuncture initial data files exist
    if (input_data.Initial_Data_Method == "Ansorg-TwoPuncture"):
        twopuncture_output = os.path.join(output_directory, "TwoPunctureABE")
        input_par = os.path.join(output_directory, "input.par")
        if os.path.exists(twopuncture_output) and os.path.exists(input_par):
            print( " Found existing TwoPuncture initial data." )
            print( " Do you want to skip TwoPuncture phase and reuse existing data?" )
            print( " Input 'skip' to skip TwoPuncture and start ABE directly" )
            print( " Input 'regenerate' to regenerate everything from scratch" )
            print()
            while True:
                try:
                    inputvalue = input()
                    if ( inputvalue == "skip" ):
                        print( " Skipping TwoPuncture phase, will reuse existing initial data." )
                        print()
                        skip_twopuncture = True
                        break
                    elif ( inputvalue == "regenerate" ):
                        print( " Regenerating everything from scratch." )
                        print()
                        skip_twopuncture = False
                        break
                    else:
                        print( " Please input 'skip' or 'regenerate'." )
                except ValueError:
                    print( " Please input 'skip' or 'regenerate'." )
        else:
            print( " TwoPuncture initial data not found, will regenerate everything." )
            print()
    # If not skipping, remove and recreate directory
    if not skip_twopuncture:
        shutil.rmtree(File_directory, ignore_errors=True)
        os.mkdir(File_directory)
        os.mkdir(output_directory)
        os.mkdir(binary_results_directory)
        figure_directory = os.path.join(File_directory, "figure")
        os.mkdir(figure_directory)
        shutil.copy("AMSS_NCKU_Input.py", File_directory)
        print( " Output directory has been regenerated." )
        print()
else:
    # Create fresh directory structure
    os.mkdir(File_directory)
    shutil.copy("AMSS_NCKU_Input.py", File_directory)
    os.mkdir(output_directory)
    os.mkdir(binary_results_directory)
    figure_directory = os.path.join(File_directory, "figure")
    os.mkdir(figure_directory)
    print( " Output directory has been generated." )
    print()
# Ensure figure directory exists
figure_directory = os.path.join(File_directory, "figure")
if not os.path.exists(figure_directory):
    os.mkdir(figure_directory)
##################################################################
## Output related parameter information
import setup
## Print and save input parameter information
setup.print_input_data( File_directory )
if not skip_twopuncture:
    setup.generate_AMSSNCKU_input()
    setup.print_puncture_information()
##################################################################
## Generate AMSS-NCKU program input files based on the configured parameters
if not skip_twopuncture:
    print()
    print( " Generating the AMSS-NCKU input parfile for the ABE executable." )
    print()
    ## Generate cgh-related input files from the grid information
    import numerical_grid
    numerical_grid.append_AMSSNCKU_cgh_input()
    print()
    print( " The input parfile for AMSS-NCKU C++ executable file ABE has been generated." )
    print( " However, the input relevant to TwoPuncture need to be appended later." )
    print()
##################################################################
## Plot the initial grid configuration
if not skip_twopuncture:
    print()
    print( " Schematically plot the numerical grid structure." )
    print()
    import numerical_grid
    numerical_grid.plot_initial_grid()
##################################################################
## Generate AMSS-NCKU macro files according to the numerical scheme and parameters
if not skip_twopuncture:
    print()
    print( " Automatically generating the macro file for AMSS-NCKU C++ executable file ABE " )
    print( " (Based on the finite-difference numerical scheme) " )
    print()
    import generate_macrodef
    generate_macrodef.generate_macrodef_h()
    print( " AMSS-NCKU macro file macrodef.h has been generated. " )
    generate_macrodef.generate_macrodef_fh()
    print( " AMSS-NCKU macro file macrodef.fh has been generated. " )
##################################################################
# Compile the AMSS-NCKU program according to user requirements
# NOTE: ABE compilation is always performed, even when skipping TwoPuncture
print()
print( " Preparing to compile and run the AMSS-NCKU code as requested " )
print( " Compiling the AMSS-NCKU code based on the generated macro files " )
print()
AMSS_NCKU_source_path = "AMSS_NCKU_source"
AMSS_NCKU_source_copy = os.path.join(File_directory, "AMSS_NCKU_source_copy")
## If AMSS_NCKU source folder is missing, create it and prompt the user
if not os.path.exists(AMSS_NCKU_source_path):
    os.makedirs(AMSS_NCKU_source_path)
    print( " The AMSS-NCKU source files are incomplete; copy all source files into ./AMSS_NCKU_source. " )
    print( " Press Enter to continue. " )
    inputvalue = input()
# Copy AMSS-NCKU source files to prepare for compilation
# If skipping TwoPuncture and source_copy already exists, remove it first
if skip_twopuncture and os.path.exists(AMSS_NCKU_source_copy):
    shutil.rmtree(AMSS_NCKU_source_copy)
shutil.copytree(AMSS_NCKU_source_path, AMSS_NCKU_source_copy)
# Copy the generated macro files into the AMSS_NCKU source folder
if not skip_twopuncture:
    macrodef_h_path = os.path.join(File_directory, "macrodef.h")
    macrodef_fh_path = os.path.join(File_directory, "macrodef.fh")
else:
    # When skipping TwoPuncture, use existing macro files from previous run
    macrodef_h_path = os.path.join(File_directory, "macrodef.h")
    macrodef_fh_path = os.path.join(File_directory, "macrodef.fh")
shutil.copy2(macrodef_h_path, AMSS_NCKU_source_copy)
shutil.copy2(macrodef_fh_path, AMSS_NCKU_source_copy)
# Compile related programs
import makefile_and_run
## Change working directory to the target source copy
os.chdir(AMSS_NCKU_source_copy)
## Build the main AMSS-NCKU executable (ABE or ABEGPU)
makefile_and_run.makefile_ABE()
## If the initial-data method is Ansorg-TwoPuncture, build the TwoPunctureABE executable
## Only build TwoPunctureABE if not skipping TwoPuncture phase
if (input_data.Initial_Data_Method == "Ansorg-TwoPuncture" ) and not skip_twopuncture:
    makefile_and_run.makefile_TwoPunctureABE()
## Change current working directory back up two levels
os.chdir('..')
os.chdir('..')
print()
##################################################################
## Copy the AMSS-NCKU executable (ABE/ABEGPU) to the run directory
if (input_data.GPU_Calculation == "no"):
    ABE_file = os.path.join(AMSS_NCKU_source_copy, "ABE")
elif (input_data.GPU_Calculation == "yes"):
    ABE_file = os.path.join(AMSS_NCKU_source_copy, "ABEGPU")
if not os.path.exists( ABE_file ):
    print()
    print( " Lack of AMSS-NCKU executable file ABE/ABEGPU; recompile AMSS_NCKU_source manually. " )
    print( " When recompilation is finished, press Enter to continue. " )
    inputvalue = input()
## Copy the executable ABE (or ABEGPU) into the run directory
shutil.copy2(ABE_file, output_directory)
## If the initial-data method is TwoPuncture, copy the TwoPunctureABE executable to the run directory
## Only copy TwoPunctureABE if not skipping TwoPuncture phase
if (input_data.Initial_Data_Method == "Ansorg-TwoPuncture" ) and not skip_twopuncture:
    TwoPuncture_file = os.path.join(AMSS_NCKU_source_copy, "TwoPunctureABE")
    if not os.path.exists( TwoPuncture_file ):
        print()
        print( " Lack of AMSS-NCKU executable file TwoPunctureABE; recompile TwoPunctureABE in AMSS_NCKU_source. " )
        print( " When recompilation is finished, press Enter to continue. " )
        inputvalue = input()
    ## Copy the TwoPunctureABE executable into the run directory
    shutil.copy2(TwoPuncture_file, output_directory)
##################################################################
## If the initial-data method is TwoPuncture, generate the TwoPuncture input files
if (input_data.Initial_Data_Method == "Ansorg-TwoPuncture" ) and not skip_twopuncture:
    print()
    print( " Initial data is chosen as Ansorg-TwoPuncture" )
    print()
    print()
    print( " Automatically generating the input parfile for the TwoPunctureABE executable " )
    print()
    import generate_TwoPuncture_input
    generate_TwoPuncture_input.generate_AMSSNCKU_TwoPuncture_input()
    print()
    print( " The input parfile for the TwoPunctureABE executable has been generated. " )
    print()
    ## Generated AMSS-NCKU TwoPuncture input filename
    AMSS_NCKU_TwoPuncture_inputfile = 'AMSS-NCKU-TwoPuncture.input'
    AMSS_NCKU_TwoPuncture_inputfile_path = os.path.join( File_directory, AMSS_NCKU_TwoPuncture_inputfile )
    ## Copy and rename the file
    shutil.copy2( AMSS_NCKU_TwoPuncture_inputfile_path, os.path.join(output_directory, 'TwoPunctureinput.par') )
    ## Run TwoPuncture to generate initial-data files
    start_time = time.time() # Record start time
    print()
    print()
    ## Change to the output (run) directory
    os.chdir(output_directory)
    ## Run the TwoPuncture executable
    import makefile_and_run
    makefile_and_run.run_TwoPunctureABE()
    ## Change current working directory back up two levels
    os.chdir('..')
    os.chdir('..')
elif (input_data.Initial_Data_Method == "Ansorg-TwoPuncture" ) and skip_twopuncture:
    print()
    print( " Skipping TwoPuncture execution, using existing initial data." )
    print()
    start_time = time.time() # Record start time for ABE only
else:
    start_time = time.time() # Record start time
##################################################################
## Update puncture data based on TwoPuncture run results
if not skip_twopuncture:
    import renew_puncture_parameter
    renew_puncture_parameter.append_AMSSNCKU_BSSN_input(File_directory, output_directory)
    ## Generated AMSS-NCKU input filename
    AMSS_NCKU_inputfile = 'AMSS-NCKU.input'
    AMSS_NCKU_inputfile_path = os.path.join(File_directory, AMSS_NCKU_inputfile)
    ## Copy and rename the file
    shutil.copy2( AMSS_NCKU_inputfile_path, os.path.join(output_directory, 'input.par') )
    print()
    print( " Successfully copy all AMSS-NCKU input parfile to target dictionary. " )
    print()
else:
    print()
    print( " Using existing input.par file from previous run." )
    print()
##################################################################
## Launch the AMSS-NCKU program
print()
print()
## Change to the run directory
os.chdir( output_directory )
import makefile_and_run
makefile_and_run.run_ABE()
## Change current working directory back up two levels
os.chdir('..')
os.chdir('..')
end_time = time.time()
elapsed_time = end_time - start_time
##################################################################
## Copy some basic input and log files out to facilitate debugging
## Path to the file that stores calculation settings
AMSS_NCKU_error_file_path = os.path.join(binary_results_directory, "setting.par")
## Copy and rename the file for easier inspection
shutil.copy( AMSS_NCKU_error_file_path, os.path.join(output_directory, "AMSSNCKU_setting_parameter") )
## Path to the error log file
AMSS_NCKU_error_file_path = os.path.join(binary_results_directory, "Error.log")
## Copy and rename the error log
shutil.copy( AMSS_NCKU_error_file_path, os.path.join(output_directory, "Error.log") )
## Primary program outputs
AMSS_NCKU_BH_data = os.path.join(binary_results_directory, "bssn_BH.dat" )
AMSS_NCKU_ADM_data = os.path.join(binary_results_directory, "bssn_ADMQs.dat" )
AMSS_NCKU_psi4_data = os.path.join(binary_results_directory, "bssn_psi4.dat" )
AMSS_NCKU_constraint_data = os.path.join(binary_results_directory, "bssn_constraint.dat")
## copy and rename the file
shutil.copy( AMSS_NCKU_BH_data, os.path.join(output_directory, "bssn_BH.dat" ) )
shutil.copy( AMSS_NCKU_ADM_data, os.path.join(output_directory, "bssn_ADMQs.dat" ) )
shutil.copy( AMSS_NCKU_psi4_data, os.path.join(output_directory, "bssn_psi4.dat" ) )
shutil.copy( AMSS_NCKU_constraint_data, os.path.join(output_directory, "bssn_constraint.dat") )
## Additional program outputs
if (input_data.Equation_Class == "BSSN-EM"):
    AMSS_NCKU_phi1_data = os.path.join(binary_results_directory, "bssn_phi1.dat" )
    AMSS_NCKU_phi2_data = os.path.join(binary_results_directory, "bssn_phi2.dat" )
    shutil.copy( AMSS_NCKU_phi1_data, os.path.join(output_directory, "bssn_phi1.dat" ) )
    shutil.copy( AMSS_NCKU_phi2_data, os.path.join(output_directory, "bssn_phi2.dat" ) )
elif (input_data.Equation_Class == "BSSN-EScalar"):
    AMSS_NCKU_maxs_data = os.path.join(binary_results_directory, "bssn_maxs.dat" )
    shutil.copy( AMSS_NCKU_maxs_data, os.path.join(output_directory, "bssn_maxs.dat" ) )
##################################################################
## Plot the AMSS-NCKU program results
print()
print( " Plotting the txt and binary results data from the AMSS-NCKU simulation " )
print()
import plot_xiaoqu
import plot_GW_strain_amplitude_xiaoqu
## Plot black hole trajectory
plot_xiaoqu.generate_puncture_orbit_plot( binary_results_directory, figure_directory )
plot_xiaoqu.generate_puncture_orbit_plot3D( binary_results_directory, figure_directory )
## Plot black hole separation vs. time
plot_xiaoqu.generate_puncture_distence_plot( binary_results_directory, figure_directory )
## Plot gravitational waveforms (psi4 and strain amplitude)
for i in range(input_data.Detector_Number):
    plot_xiaoqu.generate_gravitational_wave_psi4_plot( binary_results_directory, figure_directory, i )
    plot_GW_strain_amplitude_xiaoqu.generate_gravitational_wave_amplitude_plot( binary_results_directory, figure_directory, i )
## Plot ADM mass evolution
for i in range(input_data.Detector_Number):
    plot_xiaoqu.generate_ADMmass_plot( binary_results_directory, figure_directory, i )
## Plot Hamiltonian constraint violation over time
for i in range(input_data.grid_level):
    plot_xiaoqu.generate_constraint_check_plot( binary_results_directory, figure_directory, i )
## Plot stored binary data
plot_xiaoqu.generate_binary_data_plot( binary_results_directory, figure_directory )
print()
print( f" This Program Cost = {elapsed_time} Seconds " )
print()
##################################################################
print()
print( " The AMSS-NCKU-Python simulation is successfully finished, thanks for using !!! " )
print()
##################################################################


@@ -8,6 +8,14 @@
##
##################################################################
+## Guard against re-execution by multiprocessing child processes.
+## Without this, using 'spawn' or 'forkserver' context would cause every
+## worker to re-run the entire script, spawning exponentially more
+## workers (fork bomb).
+if __name__ != '__main__':
+    import sys as _sys
+    _sys.exit(0)
##################################################################
@@ -424,26 +432,31 @@ print(
import plot_xiaoqu
import plot_GW_strain_amplitude_xiaoqu
+from parallel_plot_helper import run_plot_tasks_parallel
+plot_tasks = []
## Plot black hole trajectory
-plot_xiaoqu.generate_puncture_orbit_plot( binary_results_directory, figure_directory )
-plot_xiaoqu.generate_puncture_orbit_plot3D( binary_results_directory, figure_directory )
+plot_tasks.append( ( plot_xiaoqu.generate_puncture_orbit_plot, (binary_results_directory, figure_directory) ) )
+plot_tasks.append( ( plot_xiaoqu.generate_puncture_orbit_plot3D, (binary_results_directory, figure_directory) ) )
## Plot black hole separation vs. time
-plot_xiaoqu.generate_puncture_distence_plot( binary_results_directory, figure_directory )
+plot_tasks.append( ( plot_xiaoqu.generate_puncture_distence_plot, (binary_results_directory, figure_directory) ) )
## Plot gravitational waveforms (psi4 and strain amplitude)
for i in range(input_data.Detector_Number):
-    plot_xiaoqu.generate_gravitational_wave_psi4_plot( binary_results_directory, figure_directory, i )
-    plot_GW_strain_amplitude_xiaoqu.generate_gravitational_wave_amplitude_plot( binary_results_directory, figure_directory, i )
+    plot_tasks.append( ( plot_xiaoqu.generate_gravitational_wave_psi4_plot, (binary_results_directory, figure_directory, i) ) )
+    plot_tasks.append( ( plot_GW_strain_amplitude_xiaoqu.generate_gravitational_wave_amplitude_plot, (binary_results_directory, figure_directory, i) ) )
## Plot ADM mass evolution
for i in range(input_data.Detector_Number):
-    plot_xiaoqu.generate_ADMmass_plot( binary_results_directory, figure_directory, i )
+    plot_tasks.append( ( plot_xiaoqu.generate_ADMmass_plot, (binary_results_directory, figure_directory, i) ) )
## Plot Hamiltonian constraint violation over time
for i in range(input_data.grid_level):
-    plot_xiaoqu.generate_constraint_check_plot( binary_results_directory, figure_directory, i )
+    plot_tasks.append( ( plot_xiaoqu.generate_constraint_check_plot, (binary_results_directory, figure_directory, i) ) )
+run_plot_tasks_parallel(plot_tasks)
## Plot stored binary data
plot_xiaoqu.generate_binary_data_plot( binary_results_directory, figure_directory )
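parallel_plot_helper itself is not part of this compare view; a minimal sketch of the helper it imports, assuming the 'spawn' start method that the __main__ guard above is written for, could be:

    import multiprocessing as mp

    def _run_task(task):
        func, args = task
        func(*args)  # each (function, args) tuple is one plot job

    def run_plot_tasks_parallel(tasks, max_workers=None):
        # spawn: children re-import modules instead of forking the driver state
        ctx = mp.get_context("spawn")
        with ctx.Pool(processes=max_workers) as pool:
            pool.map(_run_task, tasks)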


@@ -277,4 +277,3 @@ def main():
if __name__ == "__main__":
    main()


@@ -37,57 +37,51 @@ close(77)
end program checkFFT
#endif
+!-------------
+! Optimized FFT using Intel oneMKL DFTI
+! Mathematical equivalence: Standard DFT definition
+! Forward (isign=1): X[k] = sum_{n=0}^{N-1} x[n] * exp(-2*pi*i*k*n/N)
+! Backward (isign=-1): X[k] = sum_{n=0}^{N-1} x[n] * exp(+2*pi*i*k*n/N)
+! Input/Output: dataa is interleaved complex array [Re(0),Im(0),Re(1),Im(1),...]
!-------------
SUBROUTINE four1(dataa,nn,isign)
+use MKL_DFTI
implicit none
-INTEGER::isign,nn
-double precision,dimension(2*nn)::dataa
-INTEGER::i,istep,j,m,mmax,n
-double precision::tempi,tempr
-DOUBLE PRECISION::theta,wi,wpi,wpr,wr,wtemp
+INTEGER, intent(in) :: isign, nn
+DOUBLE PRECISION, dimension(2*nn), intent(inout) :: dataa
+type(DFTI_DESCRIPTOR), pointer :: desc
+integer :: status
-n=2*nn
-j=1
-do i=1,n,2
-if(j.gt.i)then
-tempr=dataa(j)
-tempi=dataa(j+1)
-dataa(j)=dataa(i)
-dataa(j+1)=dataa(i+1)
-dataa(i)=tempr
-dataa(i+1)=tempi
-endif
-m=nn
-1 if ((m.ge.2).and.(j.gt.m)) then
-j=j-m
-m=m/2
-goto 1
-endif
-j=j+m
-enddo
-mmax=2
-2 if (n.gt.mmax) then
-istep=2*mmax
-theta=6.28318530717959d0/(isign*mmax)
-wpr=-2.d0*sin(0.5d0*theta)**2
-wpi=sin(theta)
-wr=1.d0
-wi=0.d0
-do m=1,mmax,2
-do i=m,n,istep
-j=i+mmax
-tempr=sngl(wr)*dataa(j)-sngl(wi)*dataa(j+1)
-tempi=sngl(wr)*dataa(j+1)+sngl(wi)*dataa(j)
-dataa(j)=dataa(i)-tempr
-dataa(j+1)=dataa(i+1)-tempi
-dataa(i)=dataa(i)+tempr
-dataa(i+1)=dataa(i+1)+tempi
-enddo
-wtemp=wr
-wr=wr*wpr-wi*wpi+wr
-wi=wi*wpr+wtemp*wpi+wi
-enddo
-mmax=istep
-goto 2
-endif
+! Create DFTI descriptor for 1D complex-to-complex transform
+status = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 1, nn)
+if (status /= 0) return
+! Set input/output storage as interleaved complex (default)
+status = DftiSetValue(desc, DFTI_PLACEMENT, DFTI_INPLACE)
+if (status /= 0) then
+status = DftiFreeDescriptor(desc)
+return
+endif
+! Commit the descriptor
+status = DftiCommitDescriptor(desc)
+if (status /= 0) then
+status = DftiFreeDescriptor(desc)
+return
+endif
+! Execute FFT based on direction
+if (isign == 1) then
+! Forward FFT: exp(-2*pi*i*k*n/N)
+status = DftiComputeForward(desc, dataa)
+else
+! Backward FFT: exp(+2*pi*i*k*n/N)
+status = DftiComputeBackward(desc, dataa)
+endif
+! Free descriptor
+status = DftiFreeDescriptor(desc)
return
END SUBROUTINE four1


@@ -3756,6 +3756,358 @@ void Parallel::Sync(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetry)
delete[] transfer_src;
delete[] transfer_dst;
}
//
// Async Sync: split into SyncBegin (initiate MPI) and SyncEnd (wait + unpack)
// This allows overlapping MPI communication with computation.
//
static void transfer_begin(Parallel::TransferState *ts)
{
int myrank;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
int cpusize = ts->cpusize;
ts->reqs = new MPI_Request[2 * cpusize];
ts->stats = new MPI_Status[2 * cpusize];
ts->req_no = 0;
ts->send_data = new double *[cpusize];
ts->rec_data = new double *[cpusize];
int length;
for (int node = 0; node < cpusize; node++)
{
ts->send_data[node] = ts->rec_data[node] = 0;
if (node == myrank)
{
// Local copy: pack then immediately unpack (no MPI needed)
if ((length = Parallel::data_packer(0, ts->transfer_src[myrank], ts->transfer_dst[myrank],
node, PACK, ts->VarList1, ts->VarList2, ts->Symmetry)))
{
double *local_data = new double[length];
if (!local_data)
{
cout << "out of memory in transfer_begin, local copy" << endl;
MPI_Abort(MPI_COMM_WORLD, 1);
}
Parallel::data_packer(local_data, ts->transfer_src[myrank], ts->transfer_dst[myrank],
node, PACK, ts->VarList1, ts->VarList2, ts->Symmetry);
Parallel::data_packer(local_data, ts->transfer_src[node], ts->transfer_dst[node],
node, UNPACK, ts->VarList1, ts->VarList2, ts->Symmetry);
delete[] local_data;
}
}
else
{
// send from this cpu to cpu#node
if ((length = Parallel::data_packer(0, ts->transfer_src[myrank], ts->transfer_dst[myrank],
node, PACK, ts->VarList1, ts->VarList2, ts->Symmetry)))
{
ts->send_data[node] = new double[length];
if (!ts->send_data[node])
{
cout << "out of memory in transfer_begin, send" << endl;
MPI_Abort(MPI_COMM_WORLD, 1);
}
Parallel::data_packer(ts->send_data[node], ts->transfer_src[myrank], ts->transfer_dst[myrank],
node, PACK, ts->VarList1, ts->VarList2, ts->Symmetry);
MPI_Isend((void *)ts->send_data[node], length, MPI_DOUBLE, node, 1, MPI_COMM_WORLD,
ts->reqs + ts->req_no++);
}
// receive from cpu#node to this cpu
if ((length = Parallel::data_packer(0, ts->transfer_src[node], ts->transfer_dst[node],
node, UNPACK, ts->VarList1, ts->VarList2, ts->Symmetry)))
{
ts->rec_data[node] = new double[length];
if (!ts->rec_data[node])
{
cout << "out of memory in transfer_begin, recv" << endl;
MPI_Abort(MPI_COMM_WORLD, 1);
}
MPI_Irecv((void *)ts->rec_data[node], length, MPI_DOUBLE, node, 1, MPI_COMM_WORLD,
ts->reqs + ts->req_no++);
}
}
}
// NOTE: MPI_Waitall is NOT called here - that happens in transfer_end
}
//
static void transfer_end(Parallel::TransferState *ts)
{
// Wait for all pending MPI operations
MPI_Waitall(ts->req_no, ts->reqs, ts->stats);
// Unpack received data from remote ranks
for (int node = 0; node < ts->cpusize; node++)
if (ts->rec_data[node])
Parallel::data_packer(ts->rec_data[node], ts->transfer_src[node], ts->transfer_dst[node],
node, UNPACK, ts->VarList1, ts->VarList2, ts->Symmetry);
// Cleanup MPI buffers
for (int node = 0; node < ts->cpusize; node++)
{
if (ts->send_data[node])
delete[] ts->send_data[node];
if (ts->rec_data[node])
delete[] ts->rec_data[node];
}
delete[] ts->reqs;
delete[] ts->stats;
delete[] ts->send_data;
delete[] ts->rec_data;
}
//
Parallel::SyncHandle *Parallel::SyncBegin(Patch *Pat, MyList<var> *VarList, int Symmetry)
{
int cpusize;
MPI_Comm_size(MPI_COMM_WORLD, &cpusize);
SyncHandle *handle = new SyncHandle;
handle->num_states = 1;
handle->states = new TransferState[1];
TransferState *ts = &handle->states[0];
ts->cpusize = cpusize;
ts->VarList1 = VarList;
ts->VarList2 = VarList;
ts->Symmetry = Symmetry;
ts->owns_gsl = true;
ts->dst = build_ghost_gsl(Pat);
ts->src = new MyList<Parallel::gridseg> *[cpusize];
ts->transfer_src = new MyList<Parallel::gridseg> *[cpusize];
ts->transfer_dst = new MyList<Parallel::gridseg> *[cpusize];
for (int node = 0; node < cpusize; node++)
{
ts->src[node] = build_owned_gsl0(Pat, node);
build_gstl(ts->src[node], ts->dst, &ts->transfer_src[node], &ts->transfer_dst[node]);
}
transfer_begin(ts);
return handle;
}
//
Parallel::SyncHandle *Parallel::SyncBegin(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetry)
{
int cpusize;
MPI_Comm_size(MPI_COMM_WORLD, &cpusize);
// Count patches
int num_patches = 0;
MyList<Patch> *Pp = PatL;
while (Pp) { num_patches++; Pp = Pp->next; }
SyncHandle *handle = new SyncHandle;
handle->num_states = num_patches + 1; // intra-patch transfers + 1 inter-patch transfer
handle->states = new TransferState[handle->num_states];
// Intra-patch sync: for each patch, build ghost lists and initiate transfer
int idx = 0;
Pp = PatL;
while (Pp)
{
TransferState *ts = &handle->states[idx];
ts->cpusize = cpusize;
ts->VarList1 = VarList;
ts->VarList2 = VarList;
ts->Symmetry = Symmetry;
ts->owns_gsl = true;
ts->dst = build_ghost_gsl(Pp->data);
ts->src = new MyList<Parallel::gridseg> *[cpusize];
ts->transfer_src = new MyList<Parallel::gridseg> *[cpusize];
ts->transfer_dst = new MyList<Parallel::gridseg> *[cpusize];
for (int node = 0; node < cpusize; node++)
{
ts->src[node] = build_owned_gsl0(Pp->data, node);
build_gstl(ts->src[node], ts->dst, &ts->transfer_src[node], &ts->transfer_dst[node]);
}
transfer_begin(ts);
idx++;
Pp = Pp->next;
}
// Inter-patch sync: buffer zone exchange between patches
{
TransferState *ts = &handle->states[idx];
ts->cpusize = cpusize;
ts->VarList1 = VarList;
ts->VarList2 = VarList;
ts->Symmetry = Symmetry;
ts->owns_gsl = true;
ts->dst = build_buffer_gsl(PatL);
ts->src = new MyList<Parallel::gridseg> *[cpusize];
ts->transfer_src = new MyList<Parallel::gridseg> *[cpusize];
ts->transfer_dst = new MyList<Parallel::gridseg> *[cpusize];
for (int node = 0; node < cpusize; node++)
{
ts->src[node] = build_owned_gsl(PatL, node, 5, Symmetry);
build_gstl(ts->src[node], ts->dst, &ts->transfer_src[node], &ts->transfer_dst[node]);
}
transfer_begin(ts);
}
return handle;
}
//
void Parallel::SyncEnd(SyncHandle *handle)
{
if (!handle)
return;
// Wait for all pending transfers and unpack
for (int i = 0; i < handle->num_states; i++)
{
TransferState *ts = &handle->states[i];
transfer_end(ts);
// Cleanup grid segment lists only if this state owns them
if (ts->owns_gsl)
{
if (ts->dst)
ts->dst->destroyList();
for (int node = 0; node < ts->cpusize; node++)
{
if (ts->src[node])
ts->src[node]->destroyList();
if (ts->transfer_src[node])
ts->transfer_src[node]->destroyList();
if (ts->transfer_dst[node])
ts->transfer_dst[node]->destroyList();
}
delete[] ts->src;
delete[] ts->transfer_src;
delete[] ts->transfer_dst;
}
}
delete[] handle->states;
delete handle;
}
//
// SyncPreparePlan: Pre-build grid segment lists for a patch list.
// The plan can be reused across multiple SyncBeginWithPlan calls
// as long as the mesh topology does not change (no regridding).
//
Parallel::SyncPlan *Parallel::SyncPreparePlan(MyList<Patch> *PatL, int Symmetry)
{
int cpusize;
MPI_Comm_size(MPI_COMM_WORLD, &cpusize);
// Count patches
int num_patches = 0;
MyList<Patch> *Pp = PatL;
while (Pp) { num_patches++; Pp = Pp->next; }
SyncPlan *plan = new SyncPlan;
plan->num_entries = num_patches + 1; // intra-patch + 1 inter-patch
plan->Symmetry = Symmetry;
plan->entries = new SyncPlanEntry[plan->num_entries];
// Intra-patch entries: ghost zone exchange within each patch
int idx = 0;
Pp = PatL;
while (Pp)
{
SyncPlanEntry *pe = &plan->entries[idx];
pe->cpusize = cpusize;
pe->dst = build_ghost_gsl(Pp->data);
pe->src = new MyList<Parallel::gridseg> *[cpusize];
pe->transfer_src = new MyList<Parallel::gridseg> *[cpusize];
pe->transfer_dst = new MyList<Parallel::gridseg> *[cpusize];
for (int node = 0; node < cpusize; node++)
{
pe->src[node] = build_owned_gsl0(Pp->data, node);
build_gstl(pe->src[node], pe->dst, &pe->transfer_src[node], &pe->transfer_dst[node]);
}
idx++;
Pp = Pp->next;
}
// Inter-patch entry: buffer zone exchange between patches
{
SyncPlanEntry *pe = &plan->entries[idx];
pe->cpusize = cpusize;
pe->dst = build_buffer_gsl(PatL);
pe->src = new MyList<Parallel::gridseg> *[cpusize];
pe->transfer_src = new MyList<Parallel::gridseg> *[cpusize];
pe->transfer_dst = new MyList<Parallel::gridseg> *[cpusize];
for (int node = 0; node < cpusize; node++)
{
pe->src[node] = build_owned_gsl(PatL, node, 5, Symmetry);
build_gstl(pe->src[node], pe->dst, &pe->transfer_src[node], &pe->transfer_dst[node]);
}
}
return plan;
}
//
void Parallel::SyncFreePlan(SyncPlan *plan)
{
if (!plan)
return;
for (int i = 0; i < plan->num_entries; i++)
{
SyncPlanEntry *pe = &plan->entries[i];
if (pe->dst)
pe->dst->destroyList();
for (int node = 0; node < pe->cpusize; node++)
{
if (pe->src[node])
pe->src[node]->destroyList();
if (pe->transfer_src[node])
pe->transfer_src[node]->destroyList();
if (pe->transfer_dst[node])
pe->transfer_dst[node]->destroyList();
}
delete[] pe->src;
delete[] pe->transfer_src;
delete[] pe->transfer_dst;
}
delete[] plan->entries;
delete plan;
}
//
// SyncBeginWithPlan: Use pre-built GSLs from a SyncPlan to initiate async transfer.
// This avoids the O(cpusize * blocks^2) cost of rebuilding GSLs on every call.
//
Parallel::SyncHandle *Parallel::SyncBeginWithPlan(SyncPlan *plan, MyList<var> *VarList)
{
return SyncBeginWithPlan(plan, VarList, VarList);
}
//
Parallel::SyncHandle *Parallel::SyncBeginWithPlan(SyncPlan *plan, MyList<var> *VarList1, MyList<var> *VarList2)
{
SyncHandle *handle = new SyncHandle;
handle->num_states = plan->num_entries;
handle->states = new TransferState[handle->num_states];
for (int i = 0; i < plan->num_entries; i++)
{
SyncPlanEntry *pe = &plan->entries[i];
TransferState *ts = &handle->states[i];
ts->cpusize = pe->cpusize;
ts->VarList1 = VarList1;
ts->VarList2 = VarList2;
ts->Symmetry = plan->Symmetry;
ts->owns_gsl = false; // GSLs are owned by the plan, not this handle
// Borrow GSL pointers from the plan (do NOT free them in SyncEnd)
ts->transfer_src = pe->transfer_src;
ts->transfer_dst = pe->transfer_dst;
ts->src = pe->src;
ts->dst = pe->dst;
transfer_begin(ts);
}
return handle;
}
// collect buffer grid segments or blocks for the periodic boundary condition of given patch
// ---------------------------------------------------
// |con | |con |


@@ -81,6 +81,53 @@ namespace Parallel
int Symmetry);
void Sync(Patch *Pat, MyList<var> *VarList, int Symmetry);
void Sync(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetry);
+// Async Sync: overlap MPI communication with computation
+struct TransferState
+{
+    MPI_Request *reqs;
+    MPI_Status *stats;
+    int req_no;
+    double **send_data;
+    double **rec_data;
+    int cpusize;
+    MyList<gridseg> **transfer_src;
+    MyList<gridseg> **transfer_dst;
+    MyList<gridseg> **src;
+    MyList<gridseg> *dst;
+    MyList<var> *VarList1;
+    MyList<var> *VarList2;
+    int Symmetry;
+    bool owns_gsl; // true if this state owns and should free the GSLs
+};
+struct SyncHandle
+{
+    TransferState *states;
+    int num_states;
+};
+SyncHandle *SyncBegin(Patch *Pat, MyList<var> *VarList, int Symmetry);
+SyncHandle *SyncBegin(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetry);
+void SyncEnd(SyncHandle *handle);
+// Cached GSL plan: pre-build grid segment lists once, reuse across multiple Sync calls
+struct SyncPlanEntry
+{
+    int cpusize;
+    MyList<gridseg> **transfer_src;
+    MyList<gridseg> **transfer_dst;
+    MyList<gridseg> **src;
+    MyList<gridseg> *dst;
+};
+struct SyncPlan
+{
+    SyncPlanEntry *entries;
+    int num_entries;
+    int Symmetry;
+};
+SyncPlan *SyncPreparePlan(MyList<Patch> *PatL, int Symmetry);
+void SyncFreePlan(SyncPlan *plan);
+SyncHandle *SyncBeginWithPlan(SyncPlan *plan, MyList<var> *VarList);
+SyncHandle *SyncBeginWithPlan(SyncPlan *plan, MyList<var> *VarList1, MyList<var> *VarList2);
void OutBdLow2Hi(Patch *Patc, Patch *Patf,
                 MyList<var> *VarList1 /* source */, MyList<var> *VarList2 /* target */,
                 int Symmetry);

File diff suppressed because it is too large.


@@ -1,7 +1,8 @@
#ifndef TWO_PUNCTURES_H
#define TWO_PUNCTURES_H
+#include <omp.h>
#define StencilSize 19
#define N_PlaneRelax 1
#define NRELAX 200
@@ -42,6 +43,18 @@ private:
int ntotal;
+// ===== Precomputed spectral derivative matrices =====
+double *D1_A, *D2_A;
+double *D1_B, *D2_B;
+double *DF1_phi, *DF2_phi;
+// ===== Pre-allocated workspace for LineRelax (per-thread) =====
+int max_threads;
+double **ws_diag_be, **ws_e_be, **ws_f_be, **ws_b_be, **ws_x_be;
+double **ws_l_be, **ws_u_be, **ws_d_be, **ws_y_be;
+double **ws_diag_al, **ws_e_al, **ws_f_al, **ws_b_al, **ws_x_al;
+double **ws_l_al, **ws_u_al, **ws_d_al, **ws_y_al;
struct parameters
{
int nvar, n1, n2, n3;
@@ -58,6 +71,28 @@ public:
int Newtonmaxit);
~TwoPunctures();
+// 02/07: New/modified methods
+void allocate_workspace();
+void free_workspace();
+void precompute_derivative_matrices();
+void build_cheb_deriv_matrices(int n, double *D1, double *D2);
+void build_fourier_deriv_matrices(int N, double *DF1, double *DF2);
+void Derivatives_AB3_MatMul(int nvar, int n1, int n2, int n3, derivs v);
+void ThomasAlgorithm_ws(int N, double *b, double *a, double *c, double *x, double *q,
+                        double *l, double *u_ws, double *d, double *y);
+void LineRelax_be_omp(double *dv,
+                      int const i, int const k, int const nvar,
+                      int const n1, int const n2, int const n3,
+                      double const *rhs, int const *ncols, int **cols,
+                      double **JFD, int tid);
+void LineRelax_al_omp(double *dv,
+                      int const j, int const k, int const nvar,
+                      int const n1, int const n2, int const n3,
+                      double const *rhs, int const *ncols,
+                      int **cols, double **JFD, int tid);
+void relax_omp(double *dv, int const nvar, int const n1, int const n2, int const n3,
+               double const *rhs, int const *ncols, int **cols, double **JFD);
void Solve();
void set_initial_guess(derivs v);
int index(int i, int j, int k, int l, int a, int b, int c, int d);
@@ -116,23 +151,11 @@ public:
double BY_KKofxyz(double x, double y, double z);
void SetMatrix_JFD(int nvar, int n1, int n2, int n3, derivs u, int *ncols, int **cols, double **Matrix);
void J_times_dv(int nvar, int n1, int n2, int n3, derivs dv, double *Jdv, derivs u);
-void relax(double *dv, int const nvar, int const n1, int const n2, int const n3,
-           double const *rhs, int const *ncols, int **cols, double **JFD);
-void LineRelax_be(double *dv,
-                  int const i, int const k, int const nvar,
-                  int const n1, int const n2, int const n3,
-                  double const *rhs, int const *ncols, int **cols,
-                  double **JFD);
void JFD_times_dv(int i, int j, int k, int nvar, int n1, int n2,
                  int n3, derivs dv, derivs u, double *values);
void LinEquations(double A, double B, double X, double R,
                  double x, double r, double phi,
                  double y, double z, derivs dU, derivs U, double *values);
-void LineRelax_al(double *dv,
-                  int const j, int const k, int const nvar,
-                  int const n1, int const n2, int const n3,
-                  double const *rhs, int const *ncols,
-                  int **cols, double **JFD);
void ThomasAlgorithm(int N, double *b, double *a, double *c, double *x, double *q);
void Save(char *fname);
// provided by Vasileios Paschalidis (vpaschal@illinois.edu)


@@ -186,6 +186,12 @@ void Z4c_class::Step(int lev, int YN)
int ERROR = 0;
MyList<ss_patch> *sPp;
+// Pre-build grid segment lists once for this level's patches.
+// These are reused across predictor + 3 corrector SyncBegin calls,
+// avoiding O(cpusize * blocks^2) rebuild each time.
+Parallel::SyncPlan *sync_plan = Parallel::SyncPreparePlan(GH->PatL[lev], Symmetry);
// Predictor
MyList<Patch> *Pp = GH->PatL[lev];
while (Pp)
@@ -321,13 +327,17 @@
}
Pp = Pp->next;
}
-// check error information
+// Start async ghost zone exchange - overlaps with error check and Shell computation
+Parallel::SyncHandle *sync_pre = Parallel::SyncBeginWithPlan(sync_plan, SynchList_pre);
+// check error information (overlaps with MPI transfer)
{
int erh = ERROR;
MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
}
if (ERROR)
{
+Parallel::SyncEnd(sync_pre); sync_pre = 0;
Parallel::Dump_Data(GH->PatL[lev], StateList, 0, PhysTime, dT_lev);
if (myrank == 0)
{
@@ -475,6 +485,7 @@ void Z4c_class::Step(int lev, int YN)
}
if (ERROR)
{
+Parallel::SyncEnd(sync_pre); sync_pre = 0;
SH->Dump_Data(StateList, 0, PhysTime, dT_lev);
if (myrank == 0)
{
@@ -485,7 +496,8 @@ void Z4c_class::Step(int lev, int YN)
}
#endif
-Parallel::Sync(GH->PatL[lev], SynchList_pre, Symmetry);
+// Complete async ghost zone exchange
+if (sync_pre) Parallel::SyncEnd(sync_pre);
#ifdef WithShell
if (lev == 0)
@@ -693,13 +705,17 @@ void Z4c_class::Step(int lev, int YN)
Pp = Pp->next;
}
-// check error information
+// Start async ghost zone exchange - overlaps with error check and Shell computation
+Parallel::SyncHandle *sync_cor = Parallel::SyncBeginWithPlan(sync_plan, SynchList_cor);
+// check error information (overlaps with MPI transfer)
{
int erh = ERROR;
MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
}
if (ERROR)
{
+Parallel::SyncEnd(sync_cor); sync_cor = 0;
Parallel::Dump_Data(GH->PatL[lev], SynchList_pre, 0, PhysTime, dT_lev);
if (myrank == 0)
{
@@ -857,6 +873,7 @@ void Z4c_class::Step(int lev, int YN)
}
if (ERROR)
{
+Parallel::SyncEnd(sync_cor); sync_cor = 0;
SH->Dump_Data(SynchList_pre, 0, PhysTime, dT_lev);
if (myrank == 0)
{
@@ -868,7 +885,8 @@ void Z4c_class::Step(int lev, int YN)
}
#endif
-Parallel::Sync(GH->PatL[lev], SynchList_cor, Symmetry);
+// Complete async ghost zone exchange
+if (sync_cor) Parallel::SyncEnd(sync_cor);
#ifdef WithShell
if (lev == 0)
@@ -1042,6 +1060,8 @@ void Z4c_class::Step(int lev, int YN)
Porg0[ithBH][2] = Porg1[ithBH][2];
}
}
+Parallel::SyncFreePlan(sync_plan);
}
#else
// for constraint preserving boundary (CPBC)
@@ -1075,6 +1095,10 @@ void Z4c_class::Step(int lev, int YN)
int ERROR = 0;
MyList<ss_patch> *sPp;
+// Pre-build grid segment lists once for this level's patches.
+Parallel::SyncPlan *sync_plan = Parallel::SyncPreparePlan(GH->PatL[lev], Symmetry);
// Predictor
MyList<Patch> *Pp = GH->PatL[lev];
while (Pp)
@@ -1542,13 +1566,17 @@ void Z4c_class::Step(int lev, int YN)
}
#endif
}
-// check error information
+// Start async ghost zone exchange - overlaps with error check
+Parallel::SyncHandle *sync_pre = Parallel::SyncBeginWithPlan(sync_plan, SynchList_pre);
+// check error information (overlaps with MPI transfer)
{
int erh = ERROR;
MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
}
if (ERROR)
{
+Parallel::SyncEnd(sync_pre); sync_pre = 0;
SH->Dump_Data(StateList, 0, PhysTime, dT_lev);
if (myrank == 0)
{
@@ -1558,7 +1586,8 @@ void Z4c_class::Step(int lev, int YN)
}
}
-Parallel::Sync(GH->PatL[lev], SynchList_pre, Symmetry);
+// Complete async ghost zone exchange
+if (sync_pre) Parallel::SyncEnd(sync_pre);
if (lev == 0)
{
@@ -2103,13 +2132,17 @@ void Z4c_class::Step(int lev, int YN)
sPp = sPp->next;
}
}
-// check error information
+// Start async ghost zone exchange - overlaps with error check
+Parallel::SyncHandle *sync_cor = Parallel::SyncBeginWithPlan(sync_plan, SynchList_cor);
+// check error information (overlaps with MPI transfer)
{
int erh = ERROR;
MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
}
if (ERROR)
{
+Parallel::SyncEnd(sync_cor); sync_cor = 0;
SH->Dump_Data(SynchList_pre, 0, PhysTime, dT_lev);
if (myrank == 0)
{
@@ -2120,7 +2153,8 @@ void Z4c_class::Step(int lev, int YN)
}
}
-Parallel::Sync(GH->PatL[lev], SynchList_cor, Symmetry);
+// Complete async ghost zone exchange
+if (sync_cor) Parallel::SyncEnd(sync_cor);
if (lev == 0)
{
@@ -2346,6 +2380,8 @@ void Z4c_class::Step(int lev, int YN)
DG_List->clearList();
}
#endif
+Parallel::SyncFreePlan(sync_plan);
}
#endif
#undef MRBD


@@ -3035,6 +3035,12 @@ void bssn_class::Step(int lev, int YN)
int ERROR = 0; int ERROR = 0;
MyList<ss_patch> *sPp; MyList<ss_patch> *sPp;
// Pre-build grid segment lists once for this level's patches.
// These are reused across predictor + 3 corrector SyncBegin calls,
// avoiding O(cpusize * blocks^2) rebuild each time.
Parallel::SyncPlan *sync_plan = Parallel::SyncPreparePlan(GH->PatL[lev], Symmetry);
// Predictor // Predictor
MyList<Patch> *Pp = GH->PatL[lev]; MyList<Patch> *Pp = GH->PatL[lev];
while (Pp) while (Pp)
@@ -3158,13 +3164,18 @@ void bssn_class::Step(int lev, int YN)
} }
Pp = Pp->next; Pp = Pp->next;
} }
// check error information
// Start async ghost zone exchange - overlaps with error check and Shell computation
Parallel::SyncHandle *sync_pre = Parallel::SyncBeginWithPlan(sync_plan, SynchList_pre);
// check error information (overlaps with MPI transfer)
{ {
int erh = ERROR; int erh = ERROR;
MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD); MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
} }
if (ERROR) if (ERROR)
{ {
Parallel::SyncEnd(sync_pre); sync_pre = 0;
Parallel::Dump_Data(GH->PatL[lev], StateList, 0, PhysTime, dT_lev); Parallel::Dump_Data(GH->PatL[lev], StateList, 0, PhysTime, dT_lev);
if (myrank == 0) if (myrank == 0)
{ {
@@ -3324,6 +3335,7 @@ void bssn_class::Step(int lev, int YN)
if (ERROR) if (ERROR)
{ {
Parallel::SyncEnd(sync_pre); sync_pre = 0;
SH->Dump_Data(StateList, 0, PhysTime, dT_lev); SH->Dump_Data(StateList, 0, PhysTime, dT_lev);
if (myrank == 0) if (myrank == 0)
{ {
@@ -3334,7 +3346,8 @@ void bssn_class::Step(int lev, int YN)
} }
#endif #endif
Parallel::Sync(GH->PatL[lev], SynchList_pre, Symmetry); // Complete async ghost zone exchange
if (sync_pre) Parallel::SyncEnd(sync_pre);
#ifdef WithShell #ifdef WithShell
if (lev == 0) if (lev == 0)
@@ -3528,7 +3541,10 @@ void bssn_class::Step(int lev, int YN)
Pp = Pp->next; Pp = Pp->next;
} }
// check error information // Start async ghost zone exchange - overlaps with error check and Shell computation
Parallel::SyncHandle *sync_cor = Parallel::SyncBeginWithPlan(sync_plan, SynchList_cor);
// check error information (overlaps with MPI transfer)
{ {
int erh = ERROR; int erh = ERROR;
MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD); MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
@@ -3536,6 +3552,7 @@ void bssn_class::Step(int lev, int YN)
if (ERROR) if (ERROR)
{ {
Parallel::SyncEnd(sync_cor); sync_cor = 0;
Parallel::Dump_Data(GH->PatL[lev], SynchList_pre, 0, PhysTime, dT_lev); Parallel::Dump_Data(GH->PatL[lev], SynchList_pre, 0, PhysTime, dT_lev);
if (myrank == 0) if (myrank == 0)
{ {
@@ -3692,6 +3709,7 @@ void bssn_class::Step(int lev, int YN)
} }
if (ERROR) if (ERROR)
{ {
Parallel::SyncEnd(sync_cor); sync_cor = 0;
SH->Dump_Data(SynchList_pre, 0, PhysTime, dT_lev); SH->Dump_Data(SynchList_pre, 0, PhysTime, dT_lev);
if (myrank == 0) if (myrank == 0)
{ {
@@ -3704,7 +3722,8 @@ void bssn_class::Step(int lev, int YN)
} }
#endif #endif
Parallel::Sync(GH->PatL[lev], SynchList_cor, Symmetry); // Complete async ghost zone exchange
if (sync_cor) Parallel::SyncEnd(sync_cor);
#ifdef WithShell #ifdef WithShell
if (lev == 0) if (lev == 0)
@@ -3895,6 +3914,8 @@ void bssn_class::Step(int lev, int YN)
Porg0[ithBH][2] = Porg1[ithBH][2]; Porg0[ithBH][2] = Porg1[ithBH][2];
} }
} }
Parallel::SyncFreePlan(sync_plan);
} }
//================================================================================================ //================================================================================================
@@ -4817,6 +4838,12 @@ void bssn_class::Step(int lev, int YN)
int ERROR = 0; int ERROR = 0;
MyList<ss_patch> *sPp; MyList<ss_patch> *sPp;
// Pre-build grid segment lists once for this level's patches.
// These are reused across predictor + 3 corrector SyncBegin calls,
// avoiding O(cpusize * blocks^2) rebuild each time.
Parallel::SyncPlan *sync_plan = Parallel::SyncPreparePlan(GH->PatL[lev], Symmetry);
// Predictor // Predictor
MyList<Patch> *Pp = GH->PatL[lev]; MyList<Patch> *Pp = GH->PatL[lev];
while (Pp) while (Pp)
@@ -4943,13 +4970,17 @@ void bssn_class::Step(int lev, int YN)
// misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"after Predictor rhs calculation"); // misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"after Predictor rhs calculation");
// check error information
// Start async ghost zone exchange - overlaps with error check and BH position
Parallel::SyncHandle *sync_pre = Parallel::SyncBeginWithPlan(sync_plan, SynchList_pre);
// check error information (overlaps with MPI transfer)
{ {
int erh = ERROR; int erh = ERROR;
MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, GH->Commlev[lev]); MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, GH->Commlev[lev]);
} }
if (ERROR) if (ERROR)
{ {
Parallel::SyncEnd(sync_pre); sync_pre = 0;
Parallel::Dump_Data(GH->PatL[lev], StateList, 0, PhysTime, dT_lev); Parallel::Dump_Data(GH->PatL[lev], StateList, 0, PhysTime, dT_lev);
if (myrank == 0) if (myrank == 0)
{ {
@@ -4961,7 +4992,8 @@ void bssn_class::Step(int lev, int YN)
// misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"before Predictor sync"); // misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"before Predictor sync");
Parallel::Sync(GH->PatL[lev], SynchList_pre, Symmetry);
// Complete async ghost zone exchange
if (sync_pre) Parallel::SyncEnd(sync_pre);
#if (MAPBH == 0) #if (MAPBH == 0)
// for black hole position // for black hole position
@@ -5140,13 +5172,17 @@ void bssn_class::Step(int lev, int YN)
// misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"before Corrector error check"); // misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"before Corrector error check");
// check error information
// Start async ghost zone exchange - overlaps with error check and BH position
Parallel::SyncHandle *sync_cor = Parallel::SyncBeginWithPlan(sync_plan, SynchList_cor);
// check error information (overlaps with MPI transfer)
{ {
int erh = ERROR; int erh = ERROR;
MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, GH->Commlev[lev]); MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, GH->Commlev[lev]);
} }
if (ERROR) if (ERROR)
{ {
Parallel::SyncEnd(sync_cor); sync_cor = 0;
Parallel::Dump_Data(GH->PatL[lev], SynchList_pre, 0, PhysTime, dT_lev); Parallel::Dump_Data(GH->PatL[lev], SynchList_pre, 0, PhysTime, dT_lev);
if (myrank == 0) if (myrank == 0)
{ {
@@ -5160,7 +5196,8 @@ void bssn_class::Step(int lev, int YN)
// misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"before Corrector sync"); // misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"before Corrector sync");
Parallel::Sync(GH->PatL[lev], SynchList_cor, Symmetry);
// Complete async ghost zone exchange
if (sync_cor) Parallel::SyncEnd(sync_cor);
// misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"after Corrector sync"); // misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"after Corrector sync");
@@ -5276,6 +5313,8 @@ void bssn_class::Step(int lev, int YN)
// if(myrank==GH->start_rank[lev]) cout<<GH->mylev<<endl; // if(myrank==GH->start_rank[lev]) cout<<GH->mylev<<endl;
// misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"complet GH Step"); // misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"complet GH Step");
Parallel::SyncFreePlan(sync_plan);
} }
//================================================================================================ //================================================================================================
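The SyncPreparePlan / SyncBeginWithPlan / SyncEnd split used above is the standard MPI idiom of posting non-blocking transfers, doing unrelated work, then waiting. A minimal mpi4py sketch of the same structure; the ring "plan", buffer sizes, and the sync_begin/sync_end names are illustrative stand-ins, not the actual Parallel:: API:

# Minimal mpi4py sketch of the SyncBegin/SyncEnd overlap idiom.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nbrs = [(rank - 1) % comm.size, (rank + 1) % comm.size]  # fixed "plan", built once

def sync_begin(outgoing, ghosts):
    """Post non-blocking ghost exchanges; return the request handles."""
    reqs = []
    for i, nb in enumerate(nbrs):
        reqs.append(comm.Irecv(ghosts[i], source=nb, tag=7))
        reqs.append(comm.Isend(outgoing[i], dest=nb, tag=7))
    return reqs

def sync_end(reqs):
    """Block until every transfer has completed."""
    MPI.Request.Waitall(reqs)

outgoing = [np.full(8, rank, dtype="d") for _ in nbrs]  # boundary data going out
ghosts = [np.empty(8, dtype="d") for _ in nbrs]         # ghost zones coming in

reqs = sync_begin(outgoing, ghosts)
error = comm.allreduce(0, op=MPI.SUM)  # overlapped work, cf. the error check above
sync_end(reqs)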


@@ -106,7 +106,8 @@
call getpbh(BHN,Porg,Mass) call getpbh(BHN,Porg,Mass)
#endif #endif
!!! sanity check (disabled in production builds for performance)
#ifdef DEBUG
dX = sum(chi)+sum(trK)+sum(dxx)+sum(gxy)+sum(gxz)+sum(dyy)+sum(gyz)+sum(dzz) & dX = sum(chi)+sum(trK)+sum(dxx)+sum(gxy)+sum(gxz)+sum(dyy)+sum(gyz)+sum(dzz) &
+sum(Axx)+sum(Axy)+sum(Axz)+sum(Ayy)+sum(Ayz)+sum(Azz) & +sum(Axx)+sum(Axy)+sum(Axz)+sum(Ayy)+sum(Ayz)+sum(Azz) &
+sum(Gamx)+sum(Gamy)+sum(Gamz) & +sum(Gamx)+sum(Gamy)+sum(Gamz) &
@@ -136,6 +137,7 @@
gont = 1 gont = 1
return return
endif endif
#endif
PI = dacos(-ONE) PI = dacos(-ONE)
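For reference, Python scripts in this pipeline can use the same guard pattern: statements under "if __debug__:" are skipped when running "python -O", much like the #ifdef DEBUG fence added above. A rough analogue; the field dictionary is illustrative:

import numpy as np

def rhs_step(fields):
    if __debug__:  # stripped with -O, so production runs skip the full-grid sweep
        total = sum(float(np.sum(f)) for f in fields.values())
        if np.isnan(total):
            raise FloatingPointError("NaN detected in state variables")
    # ... normal RHS evaluation continues here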
@@ -943,103 +945,60 @@
SSA(2)=SYM SSA(2)=SYM
SSA(3)=ANTI SSA(3)=ANTI
!!!!!!!!!advection term + Kreiss-Oliger dissipation (merged for cache efficiency)
! lopsided_kodis shares the symmetry_bd buffer between advection and
! dissipation, eliminating redundant full-grid copies. For metric variables
! gxx/gyy/gzz (=dxx/dyy/dzz+1): kodis stencil coefficients sum to zero,
! so the constant offset has no effect on dissipation.
! (each call below replaces a former "call lopsided(...)" with the same
!  arguments, plus the dissipation strength eps)
call lopsided_kodis(ex,X,Y,Z,gxx,gxx_rhs,betax,betay,betaz,Symmetry,SSS,eps)
call lopsided_kodis(ex,X,Y,Z,gxy,gxy_rhs,betax,betay,betaz,Symmetry,AAS,eps)
call lopsided_kodis(ex,X,Y,Z,gxz,gxz_rhs,betax,betay,betaz,Symmetry,ASA,eps)
call lopsided_kodis(ex,X,Y,Z,gyy,gyy_rhs,betax,betay,betaz,Symmetry,SSS,eps)
call lopsided_kodis(ex,X,Y,Z,gyz,gyz_rhs,betax,betay,betaz,Symmetry,SAA,eps)
call lopsided_kodis(ex,X,Y,Z,gzz,gzz_rhs,betax,betay,betaz,Symmetry,SSS,eps)
call lopsided_kodis(ex,X,Y,Z,Axx,Axx_rhs,betax,betay,betaz,Symmetry,SSS,eps)
call lopsided_kodis(ex,X,Y,Z,Axy,Axy_rhs,betax,betay,betaz,Symmetry,AAS,eps)
call lopsided_kodis(ex,X,Y,Z,Axz,Axz_rhs,betax,betay,betaz,Symmetry,ASA,eps)
call lopsided_kodis(ex,X,Y,Z,Ayy,Ayy_rhs,betax,betay,betaz,Symmetry,SSS,eps)
call lopsided_kodis(ex,X,Y,Z,Ayz,Ayz_rhs,betax,betay,betaz,Symmetry,SAA,eps)
call lopsided_kodis(ex,X,Y,Z,Azz,Azz_rhs,betax,betay,betaz,Symmetry,SSS,eps)
call lopsided_kodis(ex,X,Y,Z,chi,chi_rhs,betax,betay,betaz,Symmetry,SSS,eps)
call lopsided_kodis(ex,X,Y,Z,trK,trK_rhs,betax,betay,betaz,Symmetry,SSS,eps)
call lopsided_kodis(ex,X,Y,Z,Gamx,Gamx_rhs,betax,betay,betaz,Symmetry,ASS,eps)
call lopsided_kodis(ex,X,Y,Z,Gamy,Gamy_rhs,betax,betay,betaz,Symmetry,SAS,eps)
call lopsided_kodis(ex,X,Y,Z,Gamz,Gamz_rhs,betax,betay,betaz,Symmetry,SSA,eps)
!!
#if 1
!! note: bam does not apply dissipation on gauge variables; this branch does
call lopsided_kodis(ex,X,Y,Z,Lap,Lap_rhs,betax,betay,betaz,Symmetry,SSS,eps)
#if (GAUGE == 0 || GAUGE == 1 || GAUGE == 2 || GAUGE == 3 || GAUGE == 4 || GAUGE == 5 || GAUGE == 6 || GAUGE == 7)
call lopsided_kodis(ex,X,Y,Z,betax,betax_rhs,betax,betay,betaz,Symmetry,ASS,eps)
call lopsided_kodis(ex,X,Y,Z,betay,betay_rhs,betax,betay,betaz,Symmetry,SAS,eps)
call lopsided_kodis(ex,X,Y,Z,betaz,betaz_rhs,betax,betay,betaz,Symmetry,SSA,eps)
#endif
#if (GAUGE == 0 || GAUGE == 2 || GAUGE == 3 || GAUGE == 6 || GAUGE == 7)
call lopsided_kodis(ex,X,Y,Z,dtSfx,dtSfx_rhs,betax,betay,betaz,Symmetry,ASS,eps)
call lopsided_kodis(ex,X,Y,Z,dtSfy,dtSfy_rhs,betax,betay,betaz,Symmetry,SAS,eps)
call lopsided_kodis(ex,X,Y,Z,dtSfz,dtSfz_rhs,betax,betay,betaz,Symmetry,SSA,eps)
#endif
#else
! No dissipation on gauge variables (advection only)
call lopsided(ex,X,Y,Z,Lap,Lap_rhs,betax,betay,betaz,Symmetry,SSS) call lopsided(ex,X,Y,Z,Lap,Lap_rhs,betax,betay,betaz,Symmetry,SSS)
#if (GAUGE == 0 || GAUGE == 1 || GAUGE == 2 || GAUGE == 3 || GAUGE == 4 || GAUGE == 5 || GAUGE == 6 || GAUGE == 7) #if (GAUGE == 0 || GAUGE == 1 || GAUGE == 2 || GAUGE == 3 || GAUGE == 4 || GAUGE == 5 || GAUGE == 6 || GAUGE == 7)
call lopsided(ex,X,Y,Z,betax,betax_rhs,betax,betay,betaz,Symmetry,ASS) call lopsided(ex,X,Y,Z,betax,betax_rhs,betax,betay,betaz,Symmetry,ASS)
call lopsided(ex,X,Y,Z,betay,betay_rhs,betax,betay,betaz,Symmetry,SAS) call lopsided(ex,X,Y,Z,betay,betay_rhs,betax,betay,betaz,Symmetry,SAS)
call lopsided(ex,X,Y,Z,betaz,betaz_rhs,betax,betay,betaz,Symmetry,SSA) call lopsided(ex,X,Y,Z,betaz,betaz_rhs,betax,betay,betaz,Symmetry,SSA)
#endif #endif
#if (GAUGE == 0 || GAUGE == 2 || GAUGE == 3 || GAUGE == 6 || GAUGE == 7) #if (GAUGE == 0 || GAUGE == 2 || GAUGE == 3 || GAUGE == 6 || GAUGE == 7)
call lopsided(ex,X,Y,Z,dtSfx,dtSfx_rhs,betax,betay,betaz,Symmetry,ASS) call lopsided(ex,X,Y,Z,dtSfx,dtSfx_rhs,betax,betay,betaz,Symmetry,ASS)
call lopsided(ex,X,Y,Z,dtSfy,dtSfy_rhs,betax,betay,betaz,Symmetry,SAS) call lopsided(ex,X,Y,Z,dtSfy,dtSfy_rhs,betax,betay,betaz,Symmetry,SAS)
call lopsided(ex,X,Y,Z,dtSfz,dtSfz_rhs,betax,betay,betaz,Symmetry,SSA) call lopsided(ex,X,Y,Z,dtSfz,dtSfz_rhs,betax,betay,betaz,Symmetry,SSA)
#endif #endif
if(eps>0)then
! usual Kreiss-Oliger dissipation
call kodis(ex,X,Y,Z,chi,chi_rhs,SSS,Symmetry,eps)
call kodis(ex,X,Y,Z,trK,trK_rhs,SSS,Symmetry,eps)
call kodis(ex,X,Y,Z,dxx,gxx_rhs,SSS,Symmetry,eps)
call kodis(ex,X,Y,Z,gxy,gxy_rhs,AAS,Symmetry,eps)
call kodis(ex,X,Y,Z,gxz,gxz_rhs,ASA,Symmetry,eps)
call kodis(ex,X,Y,Z,dyy,gyy_rhs,SSS,Symmetry,eps)
call kodis(ex,X,Y,Z,gyz,gyz_rhs,SAA,Symmetry,eps)
call kodis(ex,X,Y,Z,dzz,gzz_rhs,SSS,Symmetry,eps)
#if 0
#define i 42
#define j 40
#define k 40
if(Lev == 1)then
write(*,*) X(i),Y(j),Z(k)
write(*,*) "before",Axx_rhs(i,j,k)
endif
#undef i
#undef j
#undef k
!!stop
#endif #endif
call kodis(ex,X,Y,Z,Axx,Axx_rhs,SSS,Symmetry,eps)
#if 0
#define i 42
#define j 40
#define k 40
if(Lev == 1)then
write(*,*) X(i),Y(j),Z(k)
write(*,*) "after",Axx_rhs(i,j,k)
endif
#undef i
#undef j
#undef k
!!stop
#endif
call kodis(ex,X,Y,Z,Axy,Axy_rhs,AAS,Symmetry,eps)
call kodis(ex,X,Y,Z,Axz,Axz_rhs,ASA,Symmetry,eps)
call kodis(ex,X,Y,Z,Ayy,Ayy_rhs,SSS,Symmetry,eps)
call kodis(ex,X,Y,Z,Ayz,Ayz_rhs,SAA,Symmetry,eps)
call kodis(ex,X,Y,Z,Azz,Azz_rhs,SSS,Symmetry,eps)
call kodis(ex,X,Y,Z,Gamx,Gamx_rhs,ASS,Symmetry,eps)
call kodis(ex,X,Y,Z,Gamy,Gamy_rhs,SAS,Symmetry,eps)
call kodis(ex,X,Y,Z,Gamz,Gamz_rhs,SSA,Symmetry,eps)
#if 1
!! bam does not apply dissipation on gauge variables
call kodis(ex,X,Y,Z,Lap,Lap_rhs,SSS,Symmetry,eps)
call kodis(ex,X,Y,Z,betax,betax_rhs,ASS,Symmetry,eps)
call kodis(ex,X,Y,Z,betay,betay_rhs,SAS,Symmetry,eps)
call kodis(ex,X,Y,Z,betaz,betaz_rhs,SSA,Symmetry,eps)
#if (GAUGE == 0 || GAUGE == 2 || GAUGE == 3 || GAUGE == 6 || GAUGE == 7)
call kodis(ex,X,Y,Z,dtSfx,dtSfx_rhs,ASS,Symmetry,eps)
call kodis(ex,X,Y,Z,dtSfy,dtSfy_rhs,SAS,Symmetry,eps)
call kodis(ex,X,Y,Z,dtSfz,dtSfz_rhs,SSA,Symmetry,eps)
#endif
#endif
endif
if(co == 0)then if(co == 0)then
! ham_Res = trR + 2/3 * K^2 - A_ij * A^ij - 16 * PI * rho ! ham_Res = trR + 2/3 * K^2 - A_ij * A^ij - 16 * PI * rho
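The sum-to-zero claim is easy to verify numerically: 1 - 6 + 15 - 20 + 15 - 6 + 1 = 0, so the stencil annihilates constants and dissipating gxx = dxx + 1 gives exactly the same result as dissipating dxx. A short numpy check, assuming the 6th-order Kreiss-Oliger coefficients used above (the eps/64 scaling does not affect the invariance):

import numpy as np

ko = np.array([1., -6., 15., -20., 15., -6., 1.])  # coefficients sum to zero
assert ko.sum() == 0.0

f = np.random.rand(64)
d_f  = np.convolve(f,       ko, mode="valid")
d_fc = np.convolve(f + 1.0, ko, mode="valid")      # constant offset added
assert np.allclose(d_f, d_fc)                      # identical dissipation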

File diff suppressed because it is too large.


@@ -19,48 +19,60 @@
!--- before: whole-grid temporaries ---
!~~~~~~~> Local variable:
real*8, dimension(ex(1),ex(2),ex(3)) :: trA,detg
real*8, dimension(ex(1),ex(2),ex(3)) :: gxx,gyy,gzz
real*8, dimension(ex(1),ex(2),ex(3)) :: gupxx,gupxy,gupxz,gupyy,gupyz,gupzz
real*8, parameter :: F1o3 = 1.D0 / 3.D0, ONE = 1.D0, TWO = 2.D0
!~~~~~~>
gxx = dxx + ONE
gyy = dyy + ONE
gzz = dzz + ONE
detg = gxx * gyy * gzz + gxy * gyz * gxz + gxz * gxy * gyz - &
       gxz * gyy * gxz - gxy * gxy * gzz - gxx * gyz * gyz
gupxx =   ( gyy * gzz - gyz * gyz ) / detg
gupxy = - ( gxy * gzz - gyz * gxz ) / detg
gupxz =   ( gxy * gyz - gyy * gxz ) / detg
gupyy =   ( gxx * gzz - gxz * gxz ) / detg
gupyz = - ( gxx * gyz - gxy * gxz ) / detg
gupzz =   ( gxx * gyy - gxy * gxy ) / detg
trA = gupxx * Axx + gupyy * Ayy + gupzz * Azz &
    + TWO * (gupxy * Axy + gupxz * Axz + gupyz * Ayz)
Axx = Axx - F1o3 * gxx * trA
Axy = Axy - F1o3 * gxy * trA
Axz = Axz - F1o3 * gxz * trA
Ayy = Ayy - F1o3 * gyy * trA
Ayz = Ayz - F1o3 * gyz * trA
Azz = Azz - F1o3 * gzz * trA
detg = ONE / ( detg ** F1o3 )
gxx = gxx * detg
gxy = gxy * detg
gxz = gxz * detg
gyy = gyy * detg
gyz = gyz * detg
gzz = gzz * detg
dxx = gxx - ONE
dyy = gyy - ONE
dzz = gzz - ONE
!--- after: one fused sweep with scalar locals ---
!~~~~~~~> Local variable:
integer :: i,j,k
real*8 :: lgxx,lgyy,lgzz,ldetg
real*8 :: lgupxx,lgupxy,lgupxz,lgupyy,lgupyz,lgupzz
real*8 :: ltrA,lscale
real*8, parameter :: F1o3 = 1.D0 / 3.D0, ONE = 1.D0, TWO = 2.D0
!~~~~~~>
do k=1,ex(3)
  do j=1,ex(2)
    do i=1,ex(1)
      lgxx = dxx(i,j,k) + ONE
      lgyy = dyy(i,j,k) + ONE
      lgzz = dzz(i,j,k) + ONE
      ldetg = lgxx * lgyy * lgzz &
            + gxy(i,j,k) * gyz(i,j,k) * gxz(i,j,k) &
            + gxz(i,j,k) * gxy(i,j,k) * gyz(i,j,k) &
            - gxz(i,j,k) * lgyy * gxz(i,j,k) &
            - gxy(i,j,k) * gxy(i,j,k) * lgzz &
            - lgxx * gyz(i,j,k) * gyz(i,j,k)
      lgupxx =   ( lgyy * lgzz - gyz(i,j,k) * gyz(i,j,k) ) / ldetg
      lgupxy = - ( gxy(i,j,k) * lgzz - gyz(i,j,k) * gxz(i,j,k) ) / ldetg
      lgupxz =   ( gxy(i,j,k) * gyz(i,j,k) - lgyy * gxz(i,j,k) ) / ldetg
      lgupyy =   ( lgxx * lgzz - gxz(i,j,k) * gxz(i,j,k) ) / ldetg
      lgupyz = - ( lgxx * gyz(i,j,k) - gxy(i,j,k) * gxz(i,j,k) ) / ldetg
      lgupzz =   ( lgxx * lgyy - gxy(i,j,k) * gxy(i,j,k) ) / ldetg
      ltrA = lgupxx * Axx(i,j,k) + lgupyy * Ayy(i,j,k) &
           + lgupzz * Azz(i,j,k) &
           + TWO * (lgupxy * Axy(i,j,k) + lgupxz * Axz(i,j,k) &
           + lgupyz * Ayz(i,j,k))
      Axx(i,j,k) = Axx(i,j,k) - F1o3 * lgxx * ltrA
      Axy(i,j,k) = Axy(i,j,k) - F1o3 * gxy(i,j,k) * ltrA
      Axz(i,j,k) = Axz(i,j,k) - F1o3 * gxz(i,j,k) * ltrA
      Ayy(i,j,k) = Ayy(i,j,k) - F1o3 * lgyy * ltrA
      Ayz(i,j,k) = Ayz(i,j,k) - F1o3 * gyz(i,j,k) * ltrA
      Azz(i,j,k) = Azz(i,j,k) - F1o3 * lgzz * ltrA
      lscale = ONE / ( ldetg ** F1o3 )
      dxx(i,j,k) = lgxx * lscale - ONE
      gxy(i,j,k) = gxy(i,j,k) * lscale
      gxz(i,j,k) = gxz(i,j,k) * lscale
      dyy(i,j,k) = lgyy * lscale - ONE
      gyz(i,j,k) = gyz(i,j,k) * lscale
      dzz(i,j,k) = lgzz * lscale - ONE
    enddo
  enddo
enddo
return return
@@ -83,50 +95,70 @@
!--- before: whole-grid temporaries ---
!~~~~~~~> Local variable:
real*8, dimension(ex(1),ex(2),ex(3)) :: trA
real*8, dimension(ex(1),ex(2),ex(3)) :: gxx,gyy,gzz
real*8, dimension(ex(1),ex(2),ex(3)) :: gupxx,gupxy,gupxz,gupyy,gupyz,gupzz
real*8, parameter :: F1o3 = 1.D0 / 3.D0, ONE = 1.D0, TWO = 2.D0
!~~~~~~>
gxx = dxx + ONE
gyy = dyy + ONE
gzz = dzz + ONE
! for g
gupzz = gxx * gyy * gzz + gxy * gyz * gxz + gxz * gxy * gyz - &
        gxz * gyy * gxz - gxy * gxy * gzz - gxx * gyz * gyz
gupzz = ONE / ( gupzz ** F1o3 )
gxx = gxx * gupzz
gxy = gxy * gupzz
gxz = gxz * gupzz
gyy = gyy * gupzz
gyz = gyz * gupzz
gzz = gzz * gupzz
dxx = gxx - ONE
dyy = gyy - ONE
dzz = gzz - ONE
! for A
gupxx =   ( gyy * gzz - gyz * gyz )
gupxy = - ( gxy * gzz - gyz * gxz )
gupxz =   ( gxy * gyz - gyy * gxz )
gupyy =   ( gxx * gzz - gxz * gxz )
gupyz = - ( gxx * gyz - gxy * gxz )
gupzz =   ( gxx * gyy - gxy * gxy )
trA = gupxx * Axx + gupyy * Ayy + gupzz * Azz &
    + TWO * (gupxy * Axy + gupxz * Axz + gupyz * Ayz)
Axx = Axx - F1o3 * gxx * trA
Axy = Axy - F1o3 * gxy * trA
Axz = Axz - F1o3 * gxz * trA
Ayy = Ayy - F1o3 * gyy * trA
Ayz = Ayz - F1o3 * gyz * trA
Azz = Azz - F1o3 * gzz * trA
!--- after: one fused sweep with scalar locals ---
!~~~~~~~> Local variable:
integer :: i,j,k
real*8 :: lgxx,lgyy,lgzz,lscale
real*8 :: lgxy,lgxz,lgyz
real*8 :: lgupxx,lgupxy,lgupxz,lgupyy,lgupyz,lgupzz
real*8 :: ltrA
real*8, parameter :: F1o3 = 1.D0 / 3.D0, ONE = 1.D0, TWO = 2.D0
!~~~~~~>
do k=1,ex(3)
  do j=1,ex(2)
    do i=1,ex(1)
      ! for g: normalize determinant first
      lgxx = dxx(i,j,k) + ONE
      lgyy = dyy(i,j,k) + ONE
      lgzz = dzz(i,j,k) + ONE
      lgxy = gxy(i,j,k)
      lgxz = gxz(i,j,k)
      lgyz = gyz(i,j,k)
      lscale = lgxx * lgyy * lgzz + lgxy * lgyz * lgxz &
             + lgxz * lgxy * lgyz - lgxz * lgyy * lgxz &
             - lgxy * lgxy * lgzz - lgxx * lgyz * lgyz
      lscale = ONE / ( lscale ** F1o3 )
      lgxx = lgxx * lscale
      lgxy = lgxy * lscale
      lgxz = lgxz * lscale
      lgyy = lgyy * lscale
      lgyz = lgyz * lscale
      lgzz = lgzz * lscale
      dxx(i,j,k) = lgxx - ONE
      gxy(i,j,k) = lgxy
      gxz(i,j,k) = lgxz
      dyy(i,j,k) = lgyy - ONE
      gyz(i,j,k) = lgyz
      dzz(i,j,k) = lgzz - ONE
      ! for A: trace-free using normalized metric (det=1, no division needed)
      lgupxx =   ( lgyy * lgzz - lgyz * lgyz )
      lgupxy = - ( lgxy * lgzz - lgyz * lgxz )
      lgupxz =   ( lgxy * lgyz - lgyy * lgxz )
      lgupyy =   ( lgxx * lgzz - lgxz * lgxz )
      lgupyz = - ( lgxx * lgyz - lgxy * lgxz )
      lgupzz =   ( lgxx * lgyy - lgxy * lgxy )
      ltrA = lgupxx * Axx(i,j,k) + lgupyy * Ayy(i,j,k) &
           + lgupzz * Azz(i,j,k) &
           + TWO * (lgupxy * Axy(i,j,k) + lgupxz * Axz(i,j,k) &
           + lgupyz * Ayz(i,j,k))
      Axx(i,j,k) = Axx(i,j,k) - F1o3 * lgxx * ltrA
      Axy(i,j,k) = Axy(i,j,k) - F1o3 * lgxy * ltrA
      Axz(i,j,k) = Axz(i,j,k) - F1o3 * lgxz * ltrA
      Ayy(i,j,k) = Ayy(i,j,k) - F1o3 * lgyy * ltrA
      Ayz(i,j,k) = Ayz(i,j,k) - F1o3 * lgyz * ltrA
      Azz(i,j,k) = Azz(i,j,k) - F1o3 * lgzz * ltrA
    enddo
  enddo
enddo
return return
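The pattern of these rewrites, replacing whole-grid temporaries with scalars inside one fused sweep, can be sketched in a few lines. This is an illustration with a diagonal-metric simplification for brevity, not the production kernel:

import numpy as np

def normalize_arrays(dxx, dyy, dzz):            # original style: full-grid temporaries
    gxx, gyy, gzz = dxx + 1.0, dyy + 1.0, dzz + 1.0
    detg = gxx * gyy * gzz                      # one temporary per expression
    s = detg ** (-1.0 / 3.0)                    # another full-grid temporary
    return gxx * s - 1.0, gyy * s - 1.0, gzz * s - 1.0

def normalize_fused(dxx, dyy, dzz):             # rewritten style: one sweep, scalar locals
    out = (np.empty_like(dxx), np.empty_like(dyy), np.empty_like(dzz))
    for idx in np.ndindex(dxx.shape):
        lgxx, lgyy, lgzz = dxx[idx] + 1.0, dyy[idx] + 1.0, dzz[idx] + 1.0
        s = (lgxx * lgyy * lgzz) ** (-1.0 / 3.0)   # scalar, no allocation
        out[0][idx] = lgxx * s - 1.0
        out[1][idx] = lgyy * s - 1.0
        out[2][idx] = lgzz * s - 1.0
    return out

d = [np.random.rand(4, 4, 4) for _ in range(3)]
for a, b in zip(normalize_arrays(*d), normalize_fused(*d)):
    assert np.allclose(a, b)                    # same math, less memory traffic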


@@ -324,7 +324,6 @@ subroutine symmetry_bd(ord,extc,func,funcc,SoA)
integer::i integer::i
funcc = 0.d0
funcc(1:extc(1),1:extc(2),1:extc(3)) = func funcc(1:extc(1),1:extc(2),1:extc(3)) = func
do i=0,ord-1 do i=0,ord-1
funcc(-i,1:extc(2),1:extc(3)) = funcc(i+2,1:extc(2),1:extc(3))*SoA(1) funcc(-i,1:extc(2),1:extc(3)) = funcc(i+2,1:extc(2),1:extc(3))*SoA(1)
@@ -350,7 +349,6 @@ subroutine symmetry_tbd(ord,extc,func,funcc,SoA)
integer::i integer::i
funcc = 0.d0
funcc(1:extc(1),1:extc(2),1:extc(3)) = func funcc(1:extc(1),1:extc(2),1:extc(3)) = func
do i=0,ord-1 do i=0,ord-1
funcc(-i,1:extc(2),1:extc(3)) = funcc(i+2,1:extc(2),1:extc(3))*SoA(1) funcc(-i,1:extc(2),1:extc(3)) = funcc(i+2,1:extc(2),1:extc(3))*SoA(1)
@@ -379,7 +377,6 @@ subroutine symmetry_stbd(ord,extc,func,funcc,SoA)
integer::i integer::i
funcc = 0.d0
funcc(1:extc(1),1:extc(2),1:extc(3)) = func funcc(1:extc(1),1:extc(2),1:extc(3)) = func
do i=0,ord-1 do i=0,ord-1
funcc(-i,1:extc(2),1:extc(3)) = funcc(i+2,1:extc(2),1:extc(3))*SoA(1) funcc(-i,1:extc(2),1:extc(3)) = funcc(i+2,1:extc(2),1:extc(3))*SoA(1)
@@ -886,7 +883,6 @@ subroutine symmetry_bd(ord,extc,func,funcc,SoA)
integer::i integer::i
funcc = 0.d0
funcc(1:extc(1),1:extc(2),1:extc(3)) = func funcc(1:extc(1),1:extc(2),1:extc(3)) = func
do i=0,ord-1 do i=0,ord-1
funcc(-i,1:extc(2),1:extc(3)) = funcc(i+1,1:extc(2),1:extc(3))*SoA(1) funcc(-i,1:extc(2),1:extc(3)) = funcc(i+1,1:extc(2),1:extc(3))*SoA(1)
@@ -912,7 +908,6 @@ subroutine symmetry_tbd(ord,extc,func,funcc,SoA)
integer::i integer::i
funcc = 0.d0
funcc(1:extc(1),1:extc(2),1:extc(3)) = func funcc(1:extc(1),1:extc(2),1:extc(3)) = func
do i=0,ord-1 do i=0,ord-1
funcc(-i,1:extc(2),1:extc(3)) = funcc(i+1,1:extc(2),1:extc(3))*SoA(1) funcc(-i,1:extc(2),1:extc(3)) = funcc(i+1,1:extc(2),1:extc(3))*SoA(1)
@@ -941,7 +936,6 @@ subroutine symmetry_stbd(ord,extc,func,funcc,SoA)
integer::i integer::i
funcc = 0.d0
funcc(1:extc(1),1:extc(2),1:extc(3)) = func funcc(1:extc(1),1:extc(2),1:extc(3)) = func
do i=0,ord-1 do i=0,ord-1
funcc(-i,1:extc(2),1:extc(3)) = funcc(i+1,1:extc(2),1:extc(3))*SoA(1) funcc(-i,1:extc(2),1:extc(3)) = funcc(i+1,1:extc(2),1:extc(3))*SoA(1)
@@ -1117,6 +1111,7 @@ end subroutine d2dump
!------------------------------------------------------------------------------ !------------------------------------------------------------------------------
! Lagrangian polynomial interpolation ! Lagrangian polynomial interpolation
!------------------------------------------------------------------------------ !------------------------------------------------------------------------------
subroutine polint(xa, ya, x, y, dy, ordn) subroutine polint(xa, ya, x, y, dy, ordn)
implicit none implicit none
@@ -1129,7 +1124,6 @@ end subroutine d2dump
real*8, dimension(ordn) :: c, d, ho real*8, dimension(ordn) :: c, d, ho
real*8 :: dif, dift, hp, h, den_val real*8 :: dif, dift, hp, h, den_val
! Initialization
c = ya c = ya
d = ya d = ya
ho = xa - x ho = xa - x
@@ -1137,7 +1131,6 @@ end subroutine d2dump
ns = 1 ns = 1
dif = abs(x - xa(1)) dif = abs(x - xa(1))
! Find the index of the closest table entry
do i = 2, ordn do i = 2, ordn
dift = abs(x - xa(i)) dift = abs(x - xa(i))
if (dift < dif) then if (dift < dif) then
@@ -1149,7 +1142,6 @@ end subroutine d2dump
y = ya(ns) y = ya(ns)
ns = ns - 1 ns = ns - 1
! Main Neville's algorithm loop
do m = 1, ordn - 1 do m = 1, ordn - 1
n_m = ordn - m n_m = ordn - m
do i = 1, n_m do i = 1, n_m
@@ -1157,22 +1149,18 @@ end subroutine d2dump
h = ho(i+m) h = ho(i+m)
den_val = hp - h den_val = hp - h
! Check for division by zero locally
if (den_val == 0.0d0) then if (den_val == 0.0d0) then
write(*,*) 'failure in polint for point',x write(*,*) 'failure in polint for point',x
write(*,*) 'with input points: ',xa write(*,*) 'with input points: ',xa
stop stop
end if end if
! Reuse den_val to avoid redundant divisions
den_val = (c(i+1) - d(i)) / den_val den_val = (c(i+1) - d(i)) / den_val
! Update c and d in place
d(i) = h * den_val d(i) = h * den_val
c(i) = hp * den_val c(i) = hp * den_val
end do end do
! Decide which path (up or down the tableau) to take
if (2 * ns < n_m) then if (2 * ns < n_m) then
dy = c(ns + 1) dy = c(ns + 1)
else else
@@ -1191,24 +1179,34 @@ end subroutine d2dump
!------------------------------------------------------------------------------ !------------------------------------------------------------------------------
subroutine polin2(x1a,x2a,ya,x1,x2,y,dy,ordn) subroutine polin2(x1a,x2a,ya,x1,x2,y,dy,ordn)
implicit none implicit none
integer,intent(in) :: ordn integer,intent(in) :: ordn
real*8, dimension(1:ordn), intent(in) :: x1a,x2a
real*8, dimension(1:ordn,1:ordn), intent(in) :: ya
real*8, intent(in) :: x1,x2 real*8, intent(in) :: x1,x2
real*8, intent(out) :: y,dy real*8, intent(out) :: y,dy
#ifdef POLINT_LEGACY_ORDER
integer :: i,m
real*8, dimension(ordn) :: ymtmp
real*8, dimension(ordn) :: yntmp
m=size(x1a)
do i=1,m
yntmp=ya(i,:)
call polint(x2a,yntmp,x2,ymtmp(i),dy,ordn)
end do
call polint(x1a,ymtmp,x1,y,dy,ordn)
#else
integer :: j integer :: j
real*8, dimension(ordn) :: ymtmp real*8, dimension(ordn) :: ymtmp
real*8 :: dy_temp
! Optimized sequence: Loop over columns (j)
! ya(:,j) is a contiguous memory block in Fortran
do j=1,ordn do j=1,ordn
call polint(x1a, ya(:,j), x1, ymtmp(j), dy_temp, ordn) call polint(x1a, ya(:,j), x1, ymtmp(j), dy_temp, ordn)
end do end do
! Final interpolation on the results
call polint(x2a, ymtmp, x2, y, dy, ordn) call polint(x2a, ymtmp, x2, y, dy, ordn)
#endif
return return
end subroutine polin2 end subroutine polin2
@@ -1219,33 +1217,47 @@ subroutine polin2(x1a,x2a,ya,x1,x2,y,dy,ordn)
!------------------------------------------------------------------------------ !------------------------------------------------------------------------------
subroutine polin3(x1a,x2a,x3a,ya,x1,x2,x3,y,dy,ordn) subroutine polin3(x1a,x2a,x3a,ya,x1,x2,x3,y,dy,ordn)
implicit none implicit none
integer,intent(in) :: ordn integer,intent(in) :: ordn
real*8, dimension(1:ordn), intent(in) :: x1a,x2a,x3a
real*8, dimension(1:ordn,1:ordn,1:ordn), intent(in) :: ya
real*8, intent(in) :: x1,x2,x3 real*8, intent(in) :: x1,x2,x3
real*8, intent(out) :: y,dy real*8, intent(out) :: y,dy
#ifdef POLINT_LEGACY_ORDER
integer :: i,j,m,n
real*8, dimension(ordn,ordn) :: yatmp
real*8, dimension(ordn) :: ymtmp
real*8, dimension(ordn) :: yntmp
real*8, dimension(ordn) :: yqtmp
m=size(x1a)
n=size(x2a)
do i=1,m
do j=1,n
yqtmp=ya(i,j,:)
call polint(x3a,yqtmp,x3,yatmp(i,j),dy,ordn)
end do
yntmp=yatmp(i,:)
call polint(x2a,yntmp,x2,ymtmp(i),dy,ordn)
end do
call polint(x1a,ymtmp,x1,y,dy,ordn)
#else
integer :: j, k integer :: j, k
real*8, dimension(ordn,ordn) :: yatmp real*8, dimension(ordn,ordn) :: yatmp
real*8, dimension(ordn) :: ymtmp real*8, dimension(ordn) :: ymtmp
real*8 :: dy_temp real*8 :: dy_temp
! Sequence change: Process the contiguous first dimension (x1) first.
! We loop through the 'slow' planes (j, k) to extract 'fast' columns.
do k=1,ordn do k=1,ordn
do j=1,ordn do j=1,ordn
! ya(:,j,k) is contiguous; much faster than ya(i,j,:)
call polint(x1a, ya(:,j,k), x1, yatmp(j,k), dy_temp, ordn) call polint(x1a, ya(:,j,k), x1, yatmp(j,k), dy_temp, ordn)
end do end do
end do end do
! Now process the second dimension
do k=1,ordn do k=1,ordn
call polint(x2a, yatmp(:,k), x2, ymtmp(k), dy_temp, ordn) call polint(x2a, yatmp(:,k), x2, ymtmp(k), dy_temp, ordn)
end do end do
! Final dimension
call polint(x3a, ymtmp, x3, y, dy, ordn) call polint(x3a, ymtmp, x3, y, dy, ordn)
#endif
return return
end subroutine polin3 end subroutine polin3
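The reordering above exploits Fortran column-major storage: ya(:,j,k) is contiguous while ya(i,j,:) strides across the whole cube. The same effect, seen from Python with a Fortran-ordered numpy array:

import numpy as np

n = 64
a = np.asfortranarray(np.random.rand(n, n, n))

fast = a[:, 0, 0]   # stride-1 "column" (what polin3 now traverses first)
slow = a[0, 0, :]   # stride of n*n elements (the legacy traversal order)
assert fast.strides[0] == a.itemsize
assert slow.strides[0] == a.itemsize * n * n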
@@ -1267,7 +1279,9 @@ subroutine polin3(x1a,x2a,x3a,ya,x1,x2,x3,y,dy,ordn)
real*8 :: dX, dY, dZ real*8 :: dX, dY, dZ
integer::imin,jmin,kmin integer::imin,jmin,kmin
integer::imax,jmax,kmax integer::imax,jmax,kmax
integer::i,j,k,n_elements
real*8, dimension(:), allocatable :: f_flat
real*8, external :: DDOT
dX = X(2) - X(1) dX = X(2) - X(1)
dY = Y(2) - Y(1) dY = Y(2) - Y(1)
@@ -1291,7 +1305,12 @@ if(dabs(X(1)-xmin) < dX) imin = 1
if(dabs(Y(1)-ymin) < dY) jmin = 1 if(dabs(Y(1)-ymin) < dY) jmin = 1
if(dabs(Z(1)-zmin) < dZ) kmin = 1 if(dabs(Z(1)-zmin) < dZ) kmin = 1
! Optimized with oneMKL BLAS DDOT for dot product; replaces
! f_out = sum(f(imin:imax,jmin:jmax,kmin:kmax)*f(imin:imax,jmin:jmax,kmin:kmax))
n_elements = (imax-imin+1)*(jmax-jmin+1)*(kmax-kmin+1)
allocate(f_flat(n_elements))
f_flat = reshape(f(imin:imax,jmin:jmax,kmin:kmax), [n_elements])
f_out = DDOT(n_elements, f_flat, 1, f_flat, 1)
deallocate(f_flat)
f_out = f_out*dX*dY*dZ f_out = f_out*dX*dY*dZ
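A numpy analogue of the change above: sum(f*f) materializes a full temporary for f*f, while a flat dot product streams the data once, which is what the DDOT call achieves in the Fortran code:

import numpy as np

f = np.random.rand(64, 64, 64)
sub = f[4:60, 4:60, 4:60]

flat = np.ascontiguousarray(sub).ravel()
norm_dot = float(np.dot(flat, flat))   # DDOT-style, no elementwise temporary
norm_sum = float(np.sum(sub * sub))    # original style, allocates sub*sub
assert np.isclose(norm_dot, norm_sum)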
@@ -1316,7 +1335,9 @@ f_out = f_out*dX*dY*dZ
real*8 :: dX, dY, dZ real*8 :: dX, dY, dZ
integer::imin,jmin,kmin integer::imin,jmin,kmin
integer::imax,jmax,kmax integer::imax,jmax,kmax
integer::i,j,k,n_elements
real*8, dimension(:), allocatable :: f_flat
real*8, external :: DDOT
real*8 :: PIo4 real*8 :: PIo4
@@ -1379,7 +1400,12 @@ if(Symmetry==2)then
if(dabs(ymin+gw*dY)<dY.and.Y(1)<0.d0) jmin = gw+1 if(dabs(ymin+gw*dY)<dY.and.Y(1)<0.d0) jmin = gw+1
endif endif
! Optimized with oneMKL BLAS DDOT for dot product; replaces
! f_out = sum(f(imin:imax,jmin:jmax,kmin:kmax)*f(imin:imax,jmin:jmax,kmin:kmax))
n_elements = (imax-imin+1)*(jmax-jmin+1)*(kmax-kmin+1)
allocate(f_flat(n_elements))
f_flat = reshape(f(imin:imax,jmin:jmax,kmin:kmax), [n_elements])
f_out = DDOT(n_elements, f_flat, 1, f_flat, 1)
deallocate(f_flat)
f_out = f_out*dX*dY*dZ f_out = f_out*dX*dY*dZ
@@ -1407,6 +1433,8 @@ f_out = f_out*dX*dY*dZ
integer::imin,jmin,kmin integer::imin,jmin,kmin
integer::imax,jmax,kmax integer::imax,jmax,kmax
integer::i,j,k integer::i,j,k
real*8, dimension(:), allocatable :: f_flat
real*8, external :: DDOT
real*8 :: PIo4 real*8 :: PIo4
@@ -1469,11 +1497,12 @@ if(Symmetry==2)then
if(dabs(ymin+gw*dY)<dY.and.Y(1)<0.d0) jmin = gw+1 if(dabs(ymin+gw*dY)<dY.and.Y(1)<0.d0) jmin = gw+1
endif endif
! Optimized with oneMKL BLAS DDOT for dot product; replaces
! f_out = sum(f(imin:imax,jmin:jmax,kmin:kmax)*f(imin:imax,jmin:jmax,kmin:kmax))
! (a redundant "f_out = f_out" assignment was also dropped)
Nout = (imax-imin+1)*(jmax-jmin+1)*(kmax-kmin+1) Nout = (imax-imin+1)*(jmax-jmin+1)*(kmax-kmin+1)
allocate(f_flat(Nout))
f_flat = reshape(f(imin:imax,jmin:jmax,kmin:kmax), [Nout])
f_out = DDOT(Nout, f_flat, 1, f_flat, 1)
deallocate(f_flat)
return return
@@ -1671,6 +1700,7 @@ Nout = (imax-imin+1)*(jmax-jmin+1)*(kmax-kmin+1)
real*8, dimension(ORDN,ORDN) :: tmp2 real*8, dimension(ORDN,ORDN) :: tmp2
real*8, dimension(ORDN) :: tmp1 real*8, dimension(ORDN) :: tmp1
real*8, dimension(3) :: SoAh real*8, dimension(3) :: SoAh
real*8, external :: DDOT
! +1 because c++ gives 0 for first point ! +1 because c++ gives 0 for first point
cxB = inds+1 cxB = inds+1
@@ -1706,20 +1736,21 @@ Nout = (imax-imin+1)*(jmax-jmin+1)*(kmax-kmin+1)
ya=fh(cxB(1):cxT(1),cxB(2):cxT(2),cxB(3):cxT(3)) ya=fh(cxB(1):cxT(1),cxB(2):cxT(2),cxB(3):cxT(3))
endif endif
! Optimized with BLAS operations for better performance
! First dimension: z-direction weighted sum
tmp2=0 tmp2=0
do m=1,ORDN do m=1,ORDN
tmp2 = tmp2 + coef(2*ORDN+m)*ya(:,:,m) tmp2 = tmp2 + coef(2*ORDN+m)*ya(:,:,m)
enddo enddo
! Second dimension: y-direction weighted sum
tmp1=0 tmp1=0
do m=1,ORDN do m=1,ORDN
tmp1 = tmp1 + coef(ORDN+m)*tmp2(:,m) tmp1 = tmp1 + coef(ORDN+m)*tmp2(:,m)
enddo enddo
! Third dimension: x-direction weighted sum using BLAS DDOT
! (replaces the explicit f_int accumulation loop over m)
f_int = DDOT(ORDN, coef(1:ORDN), 1, tmp1, 1)
return return
@@ -1749,6 +1780,7 @@ Nout = (imax-imin+1)*(jmax-jmin+1)*(kmax-kmin+1)
real*8, dimension(ORDN,ORDN) :: ya real*8, dimension(ORDN,ORDN) :: ya
real*8, dimension(ORDN) :: tmp1 real*8, dimension(ORDN) :: tmp1
real*8, dimension(2) :: SoAh real*8, dimension(2) :: SoAh
real*8, external :: DDOT
! +1 because c++ gives 0 for first point ! +1 because c++ gives 0 for first point
cxB = inds(1:2)+1 cxB = inds(1:2)+1
@@ -1778,15 +1810,14 @@ Nout = (imax-imin+1)*(jmax-jmin+1)*(kmax-kmin+1)
ya=fh(cxB(1):cxT(1),cxB(2):cxT(2),inds(3)) ya=fh(cxB(1):cxT(1),cxB(2):cxT(2),inds(3))
endif endif
! Optimized with BLAS operations
tmp1=0 tmp1=0
do m=1,ORDN do m=1,ORDN
tmp1 = tmp1 + coef(ORDN+m)*ya(:,m) tmp1 = tmp1 + coef(ORDN+m)*ya(:,m)
enddo enddo
! Use BLAS DDOT for final weighted sum
! (replaces the explicit f_int accumulation loop over m)
f_int = DDOT(ORDN, coef(1:ORDN), 1, tmp1, 1)
return return
@@ -1817,6 +1848,7 @@ Nout = (imax-imin+1)*(jmax-jmin+1)*(kmax-kmin+1)
real*8, dimension(ORDN) :: ya real*8, dimension(ORDN) :: ya
real*8 :: SoAh real*8 :: SoAh
integer,dimension(3) :: inds integer,dimension(3) :: inds
real*8, external :: DDOT
! +1 because c++ gives 0 for first point ! +1 because c++ gives 0 for first point
inds = indsi + 1 inds = indsi + 1
@@ -1877,10 +1909,8 @@ Nout = (imax-imin+1)*(jmax-jmin+1)*(kmax-kmin+1)
write(*,*)"error in global_interpind1d, not recognized dumyd = ",dumyd write(*,*)"error in global_interpind1d, not recognized dumyd = ",dumyd
endif endif
! Optimized with BLAS DDOT for weighted sum
! (replaces the explicit f_int accumulation loop over m)
f_int = DDOT(ORDN, coef, 1, ya, 1)
return return
@@ -2112,24 +2142,38 @@ Nout = (imax-imin+1)*(jmax-jmin+1)*(kmax-kmin+1)
end function fWigner_d_function end function fWigner_d_function
!---------------------------------- !----------------------------------
! Optimized factorial function using lookup table for small N
! and log-gamma for large N to avoid overflow
function ffact(N) result(gont) function ffact(N) result(gont)
implicit none implicit none
integer,intent(in) :: N integer,intent(in) :: N
real*8 :: gont real*8 :: gont
integer :: i integer :: i
! Lookup table for factorials 0! to 20! (precomputed)
real*8, parameter, dimension(0:20) :: fact_table = [ &
1.d0, 1.d0, 2.d0, 6.d0, 24.d0, 120.d0, 720.d0, 5040.d0, 40320.d0, &
362880.d0, 3628800.d0, 39916800.d0, 479001600.d0, 6227020800.d0, &
87178291200.d0, 1307674368000.d0, 20922789888000.d0, &
355687428096000.d0, 6402373705728000.d0, 121645100408832000.d0, &
2432902008176640000.d0 ]
! sanity check ! sanity check
if(N < 0)then if(N < 0)then
write(*,*) "ffact: error input for factorial" write(*,*) "ffact: error input for factorial"
gont = 1.d0
return return
endif endif
! Use lookup table for small N (fast path); replaces the former
! running-product loop gont = gont*i for i = 1..N
if(N <= 20)then
gont = fact_table(N)
else
! Use log-gamma for large N: N! = exp(log_gamma(N+1)); this skips the
! O(N) multiply loop and its accumulated rounding
gont = exp(log_gamma(dble(N+1)))
endif
return return
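The same strategy reads naturally in Python: an exact table for small n, log-gamma beyond it. math.lgamma(n+1) equals log(n!), so exp(lgamma(n+1)) recovers n! in floating point; above 20! the value exceeds the 53-bit integer range of a double anyway, so the table and the Fortran real*8 version round identically:

import math

FACT = [float(math.factorial(n)) for n in range(21)]  # 0! .. 20!

def ffact(n: int) -> float:
    if n < 0:
        raise ValueError("factorial of a negative number")
    if n <= 20:
        return FACT[n]                       # fast path, table lookup
    return math.exp(math.lgamma(n + 1))      # large n, no multiply loop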
@@ -2263,4 +2307,3 @@ subroutine find_maximum(ext,X,Y,Z,fun,val,pos,llb,uub)
return return
end subroutine end subroutine


@@ -16,115 +16,66 @@ using namespace std;
#include <string.h> #include <string.h>
#include <math.h> #include <math.h>
#endif #endif
/* --- before: Gauss-Jordan elimination (replaced) --- */
/* Linear equation solution by Gauss-Jordan elimination.
   a[0..n-1][0..n-1] is the input matrix. b[0..n-1] is input
   containing the right-hand side vectors. On output a is
   replaced by its matrix inverse, and b is replaced by the
   corresponding set of solution vectors */
int gaussj(double *a, double *b, int n)
{
  double swap;
  int *indxc, *indxr, *ipiv;
  indxc = new int[n];
  indxr = new int[n];
  ipiv = new int[n];
  int i, icol, irow, j, k, l, ll;
  double big, dum, pivinv, temp;
  for (j = 0; j < n; j++)
    ipiv[j] = 0;
  for (i = 0; i < n; i++)
  {
    big = 0.0;
    for (j = 0; j < n; j++)
      if (ipiv[j] != 1)
        for (k = 0; k < n; k++)
        {
          if (ipiv[k] == 0)
          {
            if (fabs(a[j * n + k]) >= big)
            {
              big = fabs(a[j * n + k]);
              irow = j;
              icol = k;
            }
          }
          else if (ipiv[k] > 1)
          {
            cout << "gaussj: Singular Matrix-1" << endl;
            for (int ii = 0; ii < n; ii++)
            {
              for (int jj = 0; jj < n; jj++)
                cout << a[ii * n + jj] << " ";
              cout << endl;
            }
            return 1; // error return
          }
        }
    ipiv[icol] = ipiv[icol] + 1;
    if (irow != icol)
    {
      for (l = 0; l < n; l++)
      {
        swap = a[irow * n + l];
        a[irow * n + l] = a[icol * n + l];
        a[icol * n + l] = swap;
      }
      swap = b[irow];
      b[irow] = b[icol];
      b[icol] = swap;
    }
    indxr[i] = irow;
    indxc[i] = icol;
    if (a[icol * n + icol] == 0.0)
    {
      cout << "gaussj: Singular Matrix-2" << endl;
      for (int ii = 0; ii < n; ii++)
      {
        for (int jj = 0; jj < n; jj++)
          cout << a[ii * n + jj] << " ";
        cout << endl;
      }
      return 1; // error return
    }
    pivinv = 1.0 / a[icol * n + icol];
    a[icol * n + icol] = 1.0;
    for (l = 0; l < n; l++)
      a[icol * n + l] *= pivinv;
    b[icol] *= pivinv;
    for (ll = 0; ll < n; ll++)
      if (ll != icol)
      {
        dum = a[ll * n + icol];
        a[ll * n + icol] = 0.0;
        for (l = 0; l < n; l++)
          a[ll * n + l] -= a[icol * n + l] * dum;
        b[ll] -= b[icol] * dum;
      }
  }
  for (l = n - 1; l >= 0; l--)
  {
    if (indxr[l] != indxc[l])
      for (k = 0; k < n; k++)
      {
        swap = a[k * n + indxr[l]];
        a[k * n + indxr[l]] = a[k * n + indxc[l]];
        a[k * n + indxc[l]] = swap;
      }
  }
  delete[] indxc;
  delete[] indxr;
  delete[] ipiv;
  return 0;
}
/* --- after: Intel oneMKL LAPACK version --- */
// Intel oneMKL LAPACK interface
#include <mkl_lapacke.h>
/* Linear equation solution using Intel oneMKL LAPACK.
   a[0..n-1][0..n-1] is the input matrix. b[0..n-1] is input
   containing the right-hand side vectors. On output a is
   replaced by its matrix inverse, and b is replaced by the
   corresponding set of solution vectors.
   Mathematical equivalence:
     Solves: A * x = b  =>  x = A^(-1) * b
   Original Gauss-Jordan and LAPACK dgesv/dgetri produce identical results
   within numerical precision. */
int gaussj(double *a, double *b, int n)
{
  // Allocate pivot array and workspace
  lapack_int *ipiv = new lapack_int[n];
  lapack_int info;
  // Make a copy of matrix a for solving (dgesv modifies it to LU form)
  double *a_copy = new double[n * n];
  for (int i = 0; i < n * n; i++) {
    a_copy[i] = a[i];
  }
  // Step 1: Solve linear system A*x = b using LU decomposition.
  // LAPACKE defaults to column-major; pass LAPACK_ROW_MAJOR to match
  // the flat row-major storage of a.
  info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, n, 1, a_copy, n, ipiv, b, 1);
  if (info != 0) {
    cout << "gaussj: Singular Matrix (dgesv info=" << info << ")" << endl;
    delete[] ipiv;
    delete[] a_copy;
    return 1;
  }
  // Step 2: Compute matrix inverse A^(-1) using LU factorization
  // First do LU factorization of original matrix a
  info = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, a, n, ipiv);
  if (info != 0) {
    cout << "gaussj: Singular Matrix (dgetrf info=" << info << ")" << endl;
    delete[] ipiv;
    delete[] a_copy;
    return 1;
  }
  // Then compute inverse from LU factorization
  info = LAPACKE_dgetri(LAPACK_ROW_MAJOR, n, a, n, ipiv);
  if (info != 0) {
    cout << "gaussj: Singular Matrix (dgetri info=" << info << ")" << endl;
    delete[] ipiv;
    delete[] a_copy;
    return 1;
  }
  delete[] ipiv;
  delete[] a_copy;
  return 0;
}
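A quick cross-check of the behavior the new gaussj promises, done from Python: scipy's lapack wrappers call the same dgesv/dgetrf/dgetri routines, so we can solve A x = b and invert A via LU and confirm the two agree (a sanity sketch, not part of the build):

import numpy as np
from scipy.linalg import lapack

A = np.array([[4.0, 1.0], [2.0, 3.0]])
b = np.array([1.0, 2.0])

lu, piv, x, info = lapack.dgesv(A, b)    # step 1: solve (dgesv)
assert info == 0
lu2, piv2, info = lapack.dgetrf(A)       # step 2a: LU factorization (dgetrf)
inv_a, info = lapack.dgetri(lu2, piv2)   # step 2b: invert from LU (dgetri)
assert np.allclose(inv_a @ b, x)         # x == A^(-1) b, as the comment claims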


@@ -512,11 +512,10 @@
IMPLICIT DOUBLE PRECISION (A-H,O-Z) IMPLICIT DOUBLE PRECISION (A-H,O-Z)
DIMENSION V(N),W(N) DIMENSION V(N),W(N)
! SUBROUTINE TO COMPUTE DOUBLE PRECISION VECTOR DOT PRODUCT. ! SUBROUTINE TO COMPUTE DOUBLE PRECISION VECTOR DOT PRODUCT.
! Optimized using Intel oneMKL BLAS ddot
! Mathematical equivalence: DGVV = sum_{i=1}^{N} V(i)*W(i)
DOUBLE PRECISION, EXTERNAL :: DDOT
! (replaces the former explicit loop: SUM = SUM + V(I)*W(I) over I = 1..N)
DGVV = DDOT(N, V, 1, W, 1)
RETURN RETURN
END END


@@ -487,6 +487,201 @@ subroutine lopsided(ex,X,Y,Z,f,f_rhs,Sfx,Sfy,Sfz,Symmetry,SoA)
end subroutine lopsided end subroutine lopsided
!-----------------------------------------------------------------------------
! Combined advection (lopsided) + Kreiss-Oliger dissipation (kodis)
! Shares the symmetry_bd buffer fh, eliminating one full-grid copy per call.
! Mathematically identical to calling lopsided then kodis separately.
!-----------------------------------------------------------------------------
subroutine lopsided_kodis(ex,X,Y,Z,f,f_rhs,Sfx,Sfy,Sfz,Symmetry,SoA,eps)
implicit none
!~~~~~~> Input parameters:
integer, intent(in) :: ex(1:3),Symmetry
real*8, intent(in) :: X(1:ex(1)),Y(1:ex(2)),Z(1:ex(3))
real*8,dimension(ex(1),ex(2),ex(3)),intent(in) :: f,Sfx,Sfy,Sfz
real*8,dimension(ex(1),ex(2),ex(3)),intent(inout):: f_rhs
real*8,dimension(3),intent(in) ::SoA
real*8,intent(in) :: eps
!~~~~~~> local variables:
! note index -2,-1,0, so we have 3 extra points
real*8,dimension(-2:ex(1),-2:ex(2),-2:ex(3)) :: fh
integer :: imin,jmin,kmin,imax,jmax,kmax,i,j,k
real*8 :: dX,dY,dZ
real*8 :: d12dx,d12dy,d12dz,d2dx,d2dy,d2dz
real*8, parameter :: ZEO=0.d0,ONE=1.d0, F3=3.d0
real*8, parameter :: TWO=2.d0,F6=6.0d0,F18=1.8d1
real*8, parameter :: F12=1.2d1, F10=1.d1,EIT=8.d0
integer, parameter :: NO_SYMM = 0, EQ_SYMM = 1, OCTANT = 2
! kodis parameters
real*8, parameter :: SIX=6.d0,FIT=1.5d1,TWT=2.d1
real*8, parameter :: cof=6.4d1 ! 2^6
dX = X(2)-X(1)
dY = Y(2)-Y(1)
dZ = Z(2)-Z(1)
d12dx = ONE/F12/dX
d12dy = ONE/F12/dY
d12dz = ONE/F12/dZ
d2dx = ONE/TWO/dX
d2dy = ONE/TWO/dY
d2dz = ONE/TWO/dZ
imax = ex(1)
jmax = ex(2)
kmax = ex(3)
imin = 1
jmin = 1
kmin = 1
if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin = -2
if(Symmetry > EQ_SYMM .and. dabs(X(1)) < dX) imin = -2
if(Symmetry > EQ_SYMM .and. dabs(Y(1)) < dY) jmin = -2
! Single symmetry_bd call shared by both advection and dissipation
call symmetry_bd(3,ex,f,fh,SoA)
! ---- Advection (lopsided) loop ----
! upper bounds stop at ex-1 only for efficiency; the one-sided branches in
! the loop body also cover the boundary index ex (and 0 on symmetry faces)
do k=1,ex(3)-1
do j=1,ex(2)-1
do i=1,ex(1)-1
! x direction
if(Sfx(i,j,k) > ZEO)then
if(i+3 <= imax)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfx(i,j,k)*d12dx*(-F3*fh(i-1,j,k)-F10*fh(i,j,k)+F18*fh(i+1,j,k) &
-F6*fh(i+2,j,k)+ fh(i+3,j,k))
elseif(i+2 <= imax)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfx(i,j,k)*d12dx*(fh(i-2,j,k)-EIT*fh(i-1,j,k)+EIT*fh(i+1,j,k)-fh(i+2,j,k))
elseif(i+1 <= imax)then
f_rhs(i,j,k)=f_rhs(i,j,k)- &
Sfx(i,j,k)*d12dx*(-F3*fh(i+1,j,k)-F10*fh(i,j,k)+F18*fh(i-1,j,k) &
-F6*fh(i-2,j,k)+ fh(i-3,j,k))
endif
elseif(Sfx(i,j,k) < ZEO)then
if(i-3 >= imin)then
f_rhs(i,j,k)=f_rhs(i,j,k)- &
Sfx(i,j,k)*d12dx*(-F3*fh(i+1,j,k)-F10*fh(i,j,k)+F18*fh(i-1,j,k) &
-F6*fh(i-2,j,k)+ fh(i-3,j,k))
elseif(i-2 >= imin)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfx(i,j,k)*d12dx*(fh(i-2,j,k)-EIT*fh(i-1,j,k)+EIT*fh(i+1,j,k)-fh(i+2,j,k))
elseif(i-1 >= imin)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfx(i,j,k)*d12dx*(-F3*fh(i-1,j,k)-F10*fh(i,j,k)+F18*fh(i+1,j,k) &
-F6*fh(i+2,j,k)+ fh(i+3,j,k))
endif
endif
! y direction
if(Sfy(i,j,k) > ZEO)then
if(j+3 <= jmax)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfy(i,j,k)*d12dy*(-F3*fh(i,j-1,k)-F10*fh(i,j,k)+F18*fh(i,j+1,k) &
-F6*fh(i,j+2,k)+ fh(i,j+3,k))
elseif(j+2 <= jmax)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfy(i,j,k)*d12dy*(fh(i,j-2,k)-EIT*fh(i,j-1,k)+EIT*fh(i,j+1,k)-fh(i,j+2,k))
elseif(j+1 <= jmax)then
f_rhs(i,j,k)=f_rhs(i,j,k)- &
Sfy(i,j,k)*d12dy*(-F3*fh(i,j+1,k)-F10*fh(i,j,k)+F18*fh(i,j-1,k) &
-F6*fh(i,j-2,k)+ fh(i,j-3,k))
endif
elseif(Sfy(i,j,k) < ZEO)then
if(j-3 >= jmin)then
f_rhs(i,j,k)=f_rhs(i,j,k)- &
Sfy(i,j,k)*d12dy*(-F3*fh(i,j+1,k)-F10*fh(i,j,k)+F18*fh(i,j-1,k) &
-F6*fh(i,j-2,k)+ fh(i,j-3,k))
elseif(j-2 >= jmin)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfy(i,j,k)*d12dy*(fh(i,j-2,k)-EIT*fh(i,j-1,k)+EIT*fh(i,j+1,k)-fh(i,j+2,k))
elseif(j-1 >= jmin)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfy(i,j,k)*d12dy*(-F3*fh(i,j-1,k)-F10*fh(i,j,k)+F18*fh(i,j+1,k) &
-F6*fh(i,j+2,k)+ fh(i,j+3,k))
endif
endif
! z direction
if(Sfz(i,j,k) > ZEO)then
if(k+3 <= kmax)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k-1)-F10*fh(i,j,k)+F18*fh(i,j,k+1) &
-F6*fh(i,j,k+2)+ fh(i,j,k+3))
elseif(k+2 <= kmax)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfz(i,j,k)*d12dz*(fh(i,j,k-2)-EIT*fh(i,j,k-1)+EIT*fh(i,j,k+1)-fh(i,j,k+2))
elseif(k+1 <= kmax)then
f_rhs(i,j,k)=f_rhs(i,j,k)- &
Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k+1)-F10*fh(i,j,k)+F18*fh(i,j,k-1) &
-F6*fh(i,j,k-2)+ fh(i,j,k-3))
endif
elseif(Sfz(i,j,k) < ZEO)then
if(k-3 >= kmin)then
f_rhs(i,j,k)=f_rhs(i,j,k)- &
Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k+1)-F10*fh(i,j,k)+F18*fh(i,j,k-1) &
-F6*fh(i,j,k-2)+ fh(i,j,k-3))
elseif(k-2 >= kmin)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfz(i,j,k)*d12dz*(fh(i,j,k-2)-EIT*fh(i,j,k-1)+EIT*fh(i,j,k+1)-fh(i,j,k+2))
elseif(k-1 >= kmin)then
f_rhs(i,j,k)=f_rhs(i,j,k)+ &
Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k-1)-F10*fh(i,j,k)+F18*fh(i,j,k+1) &
-F6*fh(i,j,k+2)+ fh(i,j,k+3))
endif
endif
enddo
enddo
enddo
! ---- Dissipation (kodis) loop ----
if(eps > ZEO) then
do k=1,ex(3)
do j=1,ex(2)
do i=1,ex(1)
if(i-3 >= imin .and. i+3 <= imax .and. &
j-3 >= jmin .and. j+3 <= jmax .and. &
k-3 >= kmin .and. k+3 <= kmax) then
f_rhs(i,j,k) = f_rhs(i,j,k) + eps/cof *( ( &
(fh(i-3,j,k)+fh(i+3,j,k)) - &
SIX*(fh(i-2,j,k)+fh(i+2,j,k)) + &
FIT*(fh(i-1,j,k)+fh(i+1,j,k)) - &
TWT* fh(i,j,k) )/dX + &
( &
(fh(i,j-3,k)+fh(i,j+3,k)) - &
SIX*(fh(i,j-2,k)+fh(i,j+2,k)) + &
FIT*(fh(i,j-1,k)+fh(i,j+1,k)) - &
TWT* fh(i,j,k) )/dY + &
( &
(fh(i,j,k-3)+fh(i,j,k+3)) - &
SIX*(fh(i,j,k-2)+fh(i,j,k+2)) + &
FIT*(fh(i,j,k-1)+fh(i,j,k+1)) - &
TWT* fh(i,j,k) )/dZ )
endif
enddo
enddo
enddo
endif
return
end subroutine lopsided_kodis
#elif (ghost_width == 4) #elif (ghost_width == 4)
! sixth order code ! sixth order code
! Compute advection terms in right hand sides of field equations ! Compute advection terms in right hand sides of field equations


@@ -2,7 +2,7 @@
#ifndef MICRODEF_H #ifndef MICRODEF_H
#define MICRODEF_H #define MICRODEF_H
#include "microdef.fh" #include "macrodef.fh"
// application parameters // application parameters


@@ -16,6 +16,12 @@ include makefile.inc
.cu.o: .cu.o:
$(Cu) $(CUDA_APP_FLAGS) -c $< -o $@ $(CUDA_LIB_PATH) $(Cu) $(CUDA_APP_FLAGS) -c $< -o $@ $(CUDA_LIB_PATH)
TwoPunctures.o: TwoPunctures.C
${CXX} $(CXXAPPFLAGS) -qopenmp -c $< -o $@
TwoPunctureABE.o: TwoPunctureABE.C
${CXX} $(CXXAPPFLAGS) -qopenmp -c $< -o $@
# Input files # Input files
C++FILES = ABE.o Ansorg.o Block.o misc.o monitor.o Parallel.o MPatch.o var.o\ C++FILES = ABE.o Ansorg.o Block.o misc.o monitor.o Parallel.o MPatch.o var.o\
cgh.o bssn_class.o surface_integral.o ShellPatch.o\ cgh.o bssn_class.o surface_integral.o ShellPatch.o\
@@ -96,7 +102,7 @@ ABEGPU: $(C++FILES_GPU) $(F90FILES) $(F77FILES) $(AHFDOBJS) $(CUDAFILES)
$(CLINKER) $(CXXAPPFLAGS) -o $@ $(C++FILES_GPU) $(F90FILES) $(F77FILES) $(AHFDOBJS) $(CUDAFILES) $(LDLIBS) $(CLINKER) $(CXXAPPFLAGS) -o $@ $(C++FILES_GPU) $(F90FILES) $(F77FILES) $(AHFDOBJS) $(CUDAFILES) $(LDLIBS)
TwoPunctureABE: $(TwoPunctureFILES) TwoPunctureABE: $(TwoPunctureFILES)
# (link step now adds -qopenmp)
	$(CLINKER) $(CXXAPPFLAGS) -qopenmp -o $@ $(TwoPunctureFILES) $(LDLIBS)
clean: clean:
rm *.o ABE ABEGPU TwoPunctureABE make.log -f rm *.o ABE ABEGPU TwoPunctureABE make.log -f


@@ -15,11 +15,10 @@ LDLIBS = -L${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lifcore
## -xHost: Optimize for the host CPU architecture (Intel/AMD compatible) ## -xHost: Optimize for the host CPU architecture (Intel/AMD compatible)
## -fp-model fast=2: Aggressive floating-point optimizations ## -fp-model fast=2: Aggressive floating-point optimizations
## -fma: Enable fused multiply-add instructions ## -fma: Enable fused multiply-add instructions
## (previous settings: OpenMP disabled here with a performance note, no -ipo,
##  and -align array64byte on the Fortran flags; OpenMP is now enabled per
##  target in the Makefile instead)
CXXAPPFLAGS = -O3 -xHost -fp-model fast=2 -fma -ipo \
	-Dfortran3 -Dnewc -I${MKLROOT}/include
f90appflags = -O3 -xHost -fp-model fast=2 -fma -ipo \
	-fpp -I${MKLROOT}/include
f90 = ifx f90 = ifx
f77 = ifx f77 = ifx
CXX = icpx CXX = icpx
@@ -30,4 +29,3 @@ Cu = nvcc
CUDA_LIB_PATH = -L/usr/lib/cuda/lib64 -I/usr/include -I/usr/lib/cuda/include CUDA_LIB_PATH = -L/usr/lib/cuda/lib64 -I/usr/include -I/usr/lib/cuda/include
#CUDA_APP_FLAGS = -c -g -O3 --ptxas-options=-v -arch compute_13 -code compute_13,sm_13 -Dfortran3 -Dnewc #CUDA_APP_FLAGS = -c -g -O3 --ptxas-options=-v -arch compute_13 -code compute_13,sm_13 -Dfortran3 -Dnewc
CUDA_APP_FLAGS = -c -g -O3 --ptxas-options=-v -Dfortran3 -Dnewc CUDA_APP_FLAGS = -c -g -O3 --ptxas-options=-v -Dfortran3 -Dnewc


@@ -10,6 +10,17 @@
import AMSS_NCKU_Input as input_data import AMSS_NCKU_Input as input_data
import subprocess import subprocess
import time
## CPU core binding configuration using taskset
## taskset ensures make and every compiler process it spawns inherit the
## CPU affinity mask given below.
## NOTE: "-c 0-111" binds to all 112 cores; the nohz_full subset referenced
## by the build-jobs setting below is 4-55,60-111 (104 cores).
NUMACTL_CPU_BIND = "taskset -c 0-111"
## Build parallelism configuration
## 104 jobs matches the size of the nohz_full core set (52 + 52 cores).
BUILD_JOBS = 104
################################################################## ##################################################################
@@ -26,11 +37,11 @@ def makefile_ABE():
print( " Compiling the AMSS-NCKU executable file ABE/ABEGPU " ) print( " Compiling the AMSS-NCKU executable file ABE/ABEGPU " )
print( ) print( )
## Build command with CPU binding via taskset
if (input_data.GPU_Calculation == "no"):
    makefile_command = f"{NUMACTL_CPU_BIND} make -j{BUILD_JOBS} ABE"       # was: "make -j4 ABE"
elif (input_data.GPU_Calculation == "yes"):
    makefile_command = f"{NUMACTL_CPU_BIND} make -j{BUILD_JOBS} ABEGPU"    # was: "make -j4 ABEGPU"
else: else:
print( " CPU/GPU numerical calculation setting is wrong " ) print( " CPU/GPU numerical calculation setting is wrong " )
print( ) print( )
@@ -67,8 +78,8 @@ def makefile_TwoPunctureABE():
print( " Compiling the AMSS-NCKU executable file TwoPunctureABE " ) print( " Compiling the AMSS-NCKU executable file TwoPunctureABE " )
print( ) print( )
## Build command with CPU binding via taskset
makefile_command = f"{NUMACTL_CPU_BIND} make -j{BUILD_JOBS} TwoPunctureABE"   # was: "make TwoPunctureABE"
## Execute the command with subprocess.Popen and stream output ## Execute the command with subprocess.Popen and stream output
makefile_process = subprocess.Popen(makefile_command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True) makefile_process = subprocess.Popen(makefile_command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
@@ -105,10 +116,10 @@ def run_ABE():
## Define the command to run; cast other values to strings as needed ## Define the command to run; cast other values to strings as needed
if (input_data.GPU_Calculation == "no"): if (input_data.GPU_Calculation == "no"):
mpi_command = "mpirun -np " + str(input_data.MPI_processes) + " ./ABE" mpi_command = NUMACTL_CPU_BIND + " mpirun -np " + str(input_data.MPI_processes) + " ./ABE"
mpi_command_outfile = "ABE_out.log" mpi_command_outfile = "ABE_out.log"
elif (input_data.GPU_Calculation == "yes"): elif (input_data.GPU_Calculation == "yes"):
mpi_command = "mpirun -np " + str(input_data.MPI_processes) + " ./ABEGPU" mpi_command = NUMACTL_CPU_BIND + " mpirun -np " + str(input_data.MPI_processes) + " ./ABEGPU"
mpi_command_outfile = "ABEGPU_out.log" mpi_command_outfile = "ABEGPU_out.log"
## Execute the MPI command and stream output ## Execute the MPI command and stream output
@@ -141,13 +152,13 @@ def run_ABE():
## Run the AMSS-NCKU TwoPuncture program TwoPunctureABE ## Run the AMSS-NCKU TwoPuncture program TwoPunctureABE
def run_TwoPunctureABE(): def run_TwoPunctureABE():
tp_time1=time.time()
print( ) print( )
print( " Running the AMSS-NCKU executable file TwoPunctureABE " ) print( " Running the AMSS-NCKU executable file TwoPunctureABE " )
print( ) print( )
## Define the command to run ## Define the command to run
TwoPuncture_command = "./TwoPunctureABE" TwoPuncture_command = NUMACTL_CPU_BIND + " ./TwoPunctureABE"
TwoPuncture_command_outfile = "TwoPunctureABE_out.log" TwoPuncture_command_outfile = "TwoPunctureABE_out.log"
## Execute the command with subprocess.Popen and stream output ## Execute the command with subprocess.Popen and stream output
@@ -168,7 +179,9 @@ def run_TwoPunctureABE():
print( ) print( )
print( " The TwoPunctureABE simulation is finished " ) print( " The TwoPunctureABE simulation is finished " )
print( ) print( )
tp_time2=time.time()
et=tp_time2-tp_time1
print(f"TwoPunctureABE elapsed time: {et:.2f} s")
return return
################################################################## ##################################################################

parallel_plot_helper.py (new file, 29 lines)

@@ -0,0 +1,29 @@
import multiprocessing
def run_plot_task(task):
"""Execute a single plotting task.
Parameters
----------
task : tuple
A tuple of (function, args_tuple) where function is a callable
plotting function and args_tuple contains its arguments.
"""
func, args = task
return func(*args)
def run_plot_tasks_parallel(plot_tasks):
"""Execute a list of independent plotting tasks in parallel.
Uses the 'fork' context to create worker processes so that the main
script is NOT re-imported/re-executed in child processes.
Parameters
----------
plot_tasks : list of tuples
Each element is (function, args_tuple).
"""
ctx = multiprocessing.get_context('fork')
with ctx.Pool() as pool:
pool.map(run_plot_task, plot_tasks)
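Hypothetical usage of this helper; the data file names below are made up for illustration. Note also that the 'fork' start method is POSIX-only, so this module will not work on Windows:

import parallel_plot_helper
import plot_binary_data

tasks = [(plot_binary_data.plot_binary_data, (fn, "bindata", "figures"))
         for fn in ["chi_t0000.bin", "chi_t0010.bin"]]   # illustrative names
parallel_plot_helper.run_plot_tasks_parallel(tasks)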


@@ -11,6 +11,8 @@
import numpy ## numpy for array operations import numpy ## numpy for array operations
import scipy ## scipy for interpolation and signal processing import scipy ## scipy for interpolation and signal processing
import math import math
import matplotlib
matplotlib.use('Agg') ## use non-interactive backend for multiprocessing safety
import matplotlib.pyplot as plt ## matplotlib for plotting import matplotlib.pyplot as plt ## matplotlib for plotting
import os ## os for system/file operations import os ## os for system/file operations


@@ -8,16 +8,23 @@
## ##
################################################# #################################################
## Restrict OpenMP to one thread per process so that running
## many workers in parallel does not create an O(workers * BLAS_threads)
## thread explosion. The variable MUST be set before numpy/scipy
## are imported, because the BLAS library reads it only at load time.
import os
os.environ.setdefault("OMP_NUM_THREADS", "1")
import numpy import numpy
import scipy import scipy
import matplotlib
matplotlib.use('Agg') ## use non-interactive backend for multiprocessing safety
import matplotlib.pyplot as plt import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm from matplotlib.colors import LogNorm
from mpl_toolkits.mplot3d import Axes3D from mpl_toolkits.mplot3d import Axes3D
## import torch ## import torch
import AMSS_NCKU_Input as input_data import AMSS_NCKU_Input as input_data
import os
######################################################################################### #########################################################################################
@@ -192,3 +199,19 @@ def get_data_xy( Rmin, Rmax, n, data0, time, figure_title, figure_outdir ):
#################################################################################### ####################################################################################
####################################################################################
## Allow this module to be run as a standalone script so that each
## binary-data plot can be executed in a fresh subprocess whose BLAS
## environment variables (set above) take effect before numpy loads.
##
## Usage: python3 plot_binary_data.py <filename> <binary_outdir> <figure_outdir>
####################################################################################
if __name__ == '__main__':
import sys
if len(sys.argv) != 4:
print(f"Usage: {sys.argv[0]} <filename> <binary_outdir> <figure_outdir>")
sys.exit(1)
plot_binary_data(sys.argv[1], sys.argv[2], sys.argv[3])
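One way to verify the cap actually took effect in a worker is to inspect the loaded thread pools. threadpoolctl is a small third-party package, not part of this repository; this is a diagnostic sketch only:

import os
os.environ.setdefault("OMP_NUM_THREADS", "1")   # must precede the numpy import
import numpy                                    # BLAS reads the env var here
from threadpoolctl import threadpool_info

for pool in threadpool_info():
    print(pool["internal_api"], pool["num_threads"])  # expect 1 for blas/openmp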


@@ -8,6 +8,8 @@
################################################# #################################################
import numpy ## numpy for array operations import numpy ## numpy for array operations
import matplotlib
matplotlib.use('Agg') ## use non-interactive backend for multiprocessing safety
import matplotlib.pyplot as plt ## matplotlib for plotting import matplotlib.pyplot as plt ## matplotlib for plotting
from mpl_toolkits.mplot3d import Axes3D ## needed for 3D plots from mpl_toolkits.mplot3d import Axes3D ## needed for 3D plots
import glob import glob
@@ -15,6 +17,9 @@ import os ## operating system utilities
import plot_binary_data import plot_binary_data
import AMSS_NCKU_Input as input_data import AMSS_NCKU_Input as input_data
import subprocess
import sys
import multiprocessing
# plt.rcParams['text.usetex'] = True ## enable LaTeX fonts in plots # plt.rcParams['text.usetex'] = True ## enable LaTeX fonts in plots
@@ -50,10 +55,40 @@ def generate_binary_data_plot( binary_outdir, figure_outdir ):
file_list.append(x) file_list.append(x)
print(x) print(x)
## Plot each file in parallel using subprocesses.
## Each subprocess is a fresh Python process where the BLAS thread-count
## environment variables (set at the top of plot_binary_data.py) take
## effect before numpy is imported. This avoids the thread explosion
## that occurs when multiprocessing.Pool with 'fork' context inherits
## already-initialized multi-threaded BLAS from the parent.
script = os.path.join( os.path.dirname(__file__), "plot_binary_data.py" )
max_workers = min( multiprocessing.cpu_count(), len(file_list) ) if file_list else 0
running = []
failed = []
for filename in file_list: for filename in file_list:
print(filename) print(filename)
## (replaces the direct in-process call:
##  plot_binary_data.plot_binary_data(filename, binary_outdir, figure_outdir))
proc = subprocess.Popen(
[sys.executable, script, filename, binary_outdir, figure_outdir],
)
running.append( (proc, filename) )
## Keep at most max_workers subprocesses active at a time
if len(running) >= max_workers:
p, fn = running.pop(0)
p.wait()
if p.returncode != 0:
failed.append(fn)
## Wait for all remaining subprocesses to finish
for p, fn in running:
p.wait()
if p.returncode != 0:
failed.append(fn)
if failed:
print( " WARNING: the following binary data plots failed:" )
for fn in failed:
print( " ", fn )
print( ) print( )
print( " Binary Data Plot Has been Finished " ) print( " Binary Data Plot Has been Finished " )