Add OpenMP parallelization to Fortran compute kernels

Add !$omp parallel do collapse(2) directives to all triple-loop stencil kernels (fderivs, fdderivs, fdx/fdy/fdz, kodis, lopsided, enforce_ag/enforce_ga) across all ghost_width variants. Add !$omp parallel workshare to RK4/ICN/Euler whole-array update routines. Build system: add -qopenmp to compile and link flags, switch MKL from sequential to threaded (-lmkl_intel_thread -liomp5). Runtime: set OMP_NUM_THREADS=96, OMP_STACKSIZE=16M, OMP_PROC_BIND=close, OMP_PLACES=cores for 96-core server target. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove MPI dependency, replace with single-process stub for non-MPI builds
2026-02-10 23:40:17 +08:00 · 2026-02-10 22:51:11 +08:00 · 2026-02-10 19:18:56 +08:00 · 2026-02-09 22:39:18 +08:00 · 2026-02-09 22:07:12 +08:00 · 2026-02-09 21:03:37 +08:00
53 changed files with 3034 additions and 1307 deletions
--- a/AMSS_NCKU_ABEtest.py
+++ b/AMSS_NCKU_ABEtest.py
@@ -1,447 +0,0 @@
-
-##################################################################
-##
-## AMSS-NCKU ABE Test Program (Skip TwoPuncture if data exists)
-## Modified from AMSS_NCKU_Program.py
-## Author: Xiaoqu
-## Modified: 2026/02/01
-##
-##################################################################
-
-
-##################################################################
-
-## Print program introduction
-
-import print_information
-
-print_information.print_program_introduction()
-
-##################################################################
-
-import AMSS_NCKU_Input as input_data
-
-##################################################################
-
-## Create directories to store program run data
-
-import os
-import shutil
-import sys
-import time
-
-## Set the output directory according to the input file
-File_directory = os.path.join(input_data.File_directory)
-
-## Check if output directory exists and if TwoPuncture data is available
-#skip_twopuncture = False
-skip_twopuncture = True
-output_directory = os.path.join(File_directory, "AMSS_NCKU_output")
-binary_results_directory = os.path.join(output_directory, input_data.Output_directory)
-
-if os.path.exists(File_directory):
-    print( " Output directory already exists." )
-    print()
-    '''
-    # Check if TwoPuncture initial data files exist
-    if (input_data.Initial_Data_Method == "Ansorg-TwoPuncture"):
-        twopuncture_output = os.path.join(output_directory, "TwoPunctureABE")
-        input_par = os.path.join(output_directory, "input.par")
-
-        if os.path.exists(twopuncture_output) and os.path.exists(input_par):
-            print( " Found existing TwoPuncture initial data." )
-            print( " Do you want to skip TwoPuncture phase and reuse existing data?" )
-            print( " Input 'skip' to skip TwoPuncture and start ABE directly" )
-            print( " Input 'regenerate' to regenerate everything from scratch" )
-            print()
-            
-            while True:
-                try:
-                    inputvalue = input()
-                    if ( inputvalue == "skip" ):
-                        print( " Skipping TwoPuncture phase, will reuse existing initial data." )
-                        print()
-                        skip_twopuncture = True
-                        break
-                    elif ( inputvalue == "regenerate" ):
-                        print( " Regenerating everything from scratch." )
-                        print()
-                        skip_twopuncture = False
-                        break
-                    else:
-                        print( " Please input 'skip' or 'regenerate'." )
-                except ValueError:
-                    print( " Please input 'skip' or 'regenerate'." )
-            
-        else:
-            print( " TwoPuncture initial data not found, will regenerate everything." )
-            print()
-'''
-    # If not skipping, remove and recreate directory
-    if not skip_twopuncture:
-        shutil.rmtree(File_directory, ignore_errors=True)
-        os.mkdir(File_directory)
-        os.mkdir(output_directory)
-        os.mkdir(binary_results_directory)
-        figure_directory = os.path.join(File_directory, "figure")
-        os.mkdir(figure_directory)
-        shutil.copy("AMSS_NCKU_Input.py", File_directory)
-        print( " Output directory has been regenerated." )
-        print()
-else:
-    # Create fresh directory structure
-    os.mkdir(File_directory)
-    shutil.copy("AMSS_NCKU_Input.py", File_directory)
-    os.mkdir(output_directory)
-    os.mkdir(binary_results_directory)
-    figure_directory = os.path.join(File_directory, "figure")
-    os.mkdir(figure_directory)
-    print( " Output directory has been generated." )
-    print()
-
-# Ensure figure directory exists
-figure_directory = os.path.join(File_directory, "figure")
-if not os.path.exists(figure_directory):
-    os.mkdir(figure_directory)
-
-##################################################################
-
-## Output related parameter information
-
-import setup
-
-## Print and save input parameter information
-setup.print_input_data( File_directory )
-
-if not skip_twopuncture:
-    setup.generate_AMSSNCKU_input()
-
-setup.print_puncture_information()
-
-
-##################################################################
-
-## Generate AMSS-NCKU program input files based on the configured parameters
-
-if not skip_twopuncture:
-    print()
-    print( " Generating the AMSS-NCKU input parfile for the ABE executable." )
-    print()
-
-    ## Generate cgh-related input files from the grid information
-
-    import numerical_grid
-
-    numerical_grid.append_AMSSNCKU_cgh_input()
-
-    print()
-    print( " The input parfile for AMSS-NCKU C++ executable file ABE has been generated." )
-    print( " However, the input relevant to TwoPuncture need to be appended later." )
-    print()
-
-
-##################################################################
-
-## Plot the initial grid configuration
-
-if not skip_twopuncture:
-    print()
-    print( " Schematically plot the numerical grid structure." )
-    print()
-
-    import numerical_grid
-    numerical_grid.plot_initial_grid()
-
-
-##################################################################
-
-## Generate AMSS-NCKU macro files according to the numerical scheme and parameters
-
-if not skip_twopuncture:
-    print()
-    print( " Automatically generating the macro file for AMSS-NCKU C++ executable file ABE " )
-    print( " (Based on the finite-difference numerical scheme) " )
-    print()
-
-    import generate_macrodef
-
-    generate_macrodef.generate_macrodef_h()
-    print( " AMSS-NCKU macro file macrodef.h has been generated. " )
-
-    generate_macrodef.generate_macrodef_fh()
-    print( " AMSS-NCKU macro file macrodef.fh has been generated. " )
-
-
-##################################################################
-
-# Compile the AMSS-NCKU program according to user requirements
-# NOTE: ABE compilation is always performed, even when skipping TwoPuncture
-
-print()
-print( " Preparing to compile and run the AMSS-NCKU code as requested " )
-print( " Compiling the AMSS-NCKU code based on the generated macro files " )
-print()
-
-AMSS_NCKU_source_path = "AMSS_NCKU_source"
-AMSS_NCKU_source_copy = os.path.join(File_directory, "AMSS_NCKU_source_copy")
-
-## If AMSS_NCKU source folder is missing, create it and prompt the user
-if not os.path.exists(AMSS_NCKU_source_path):
-    os.makedirs(AMSS_NCKU_source_path)
-    print( " The AMSS-NCKU source files are incomplete; copy all source files into ./AMSS_NCKU_source. " )
-    print( " Press Enter to continue. " )
-    inputvalue = input()
-
-# Copy AMSS-NCKU source files to prepare for compilation
-# If skipping TwoPuncture and source_copy already exists, remove it first
-if skip_twopuncture and os.path.exists(AMSS_NCKU_source_copy):
-    shutil.rmtree(AMSS_NCKU_source_copy)
-
-shutil.copytree(AMSS_NCKU_source_path, AMSS_NCKU_source_copy)
-
-# Copy the generated macro files into the AMSS_NCKU source folder
-if not skip_twopuncture:
-    macrodef_h_path  = os.path.join(File_directory, "macrodef.h")
-    macrodef_fh_path = os.path.join(File_directory, "macrodef.fh")
-else:
-    # When skipping TwoPuncture, use existing macro files from previous run
-    macrodef_h_path  = os.path.join(File_directory, "macrodef.h")
-    macrodef_fh_path = os.path.join(File_directory, "macrodef.fh")
-
-shutil.copy2(macrodef_h_path,  AMSS_NCKU_source_copy)
-shutil.copy2(macrodef_fh_path, AMSS_NCKU_source_copy)
-
-# Compile related programs
-import makefile_and_run
-
-## Change working directory to the target source copy
-os.chdir(AMSS_NCKU_source_copy)
-
-## Build the main AMSS-NCKU executable (ABE or ABEGPU)
-makefile_and_run.makefile_ABE()
-
-## If the initial-data method is Ansorg-TwoPuncture, build the TwoPunctureABE executable
-## Only build TwoPunctureABE if not skipping TwoPuncture phase
-if (input_data.Initial_Data_Method == "Ansorg-TwoPuncture" ) and not skip_twopuncture:
-    makefile_and_run.makefile_TwoPunctureABE()
-
-## Change current working directory back up two levels
-os.chdir('..')
-os.chdir('..')
-
-print()
-
-##################################################################
-
-## Copy the AMSS-NCKU executable (ABE/ABEGPU) to the run directory
-
-if (input_data.GPU_Calculation == "no"):
-    ABE_file = os.path.join(AMSS_NCKU_source_copy, "ABE")
-elif (input_data.GPU_Calculation == "yes"):
-    ABE_file = os.path.join(AMSS_NCKU_source_copy, "ABEGPU")
-
-if not os.path.exists( ABE_file ):
-    print()
-    print( " Lack of AMSS-NCKU executable file ABE/ABEGPU; recompile AMSS_NCKU_source manually. " )
-    print( " When recompilation is finished, press Enter to continue. " )
-    inputvalue = input()
-
-## Copy the executable ABE (or ABEGPU) into the run directory
-shutil.copy2(ABE_file, output_directory)
-
-## If the initial-data method is TwoPuncture, copy the TwoPunctureABE executable to the run directory
-## Only copy TwoPunctureABE if not skipping TwoPuncture phase
-if (input_data.Initial_Data_Method == "Ansorg-TwoPuncture" ) and not skip_twopuncture:
-    TwoPuncture_file = os.path.join(AMSS_NCKU_source_copy, "TwoPunctureABE")
-
-    if not os.path.exists( TwoPuncture_file ):
-        print()
-        print( " Lack of AMSS-NCKU executable file TwoPunctureABE; recompile TwoPunctureABE in AMSS_NCKU_source. " )
-        print( " When recompilation is finished, press Enter to continue. " )
-        inputvalue = input()
-
-    ## Copy the TwoPunctureABE executable into the run directory
-    shutil.copy2(TwoPuncture_file, output_directory)
-
-##################################################################
-
-## If the initial-data method is TwoPuncture, generate the TwoPuncture input files
-
-if (input_data.Initial_Data_Method == "Ansorg-TwoPuncture" ) and not skip_twopuncture:
-
-    print()
-    print( " Initial data is chosen as Ansorg-TwoPuncture" )
-    print()
-    
-    print()
-    print( " Automatically generating the input parfile for the TwoPunctureABE executable " )
-    print()
-    
-    import generate_TwoPuncture_input
-    
-    generate_TwoPuncture_input.generate_AMSSNCKU_TwoPuncture_input()
-    
-    print()
-    print( " The input parfile for the TwoPunctureABE executable has been generated. " )
-    print()
-    
-    ## Generated AMSS-NCKU TwoPuncture input filename
-    AMSS_NCKU_TwoPuncture_inputfile      = 'AMSS-NCKU-TwoPuncture.input'
-    AMSS_NCKU_TwoPuncture_inputfile_path = os.path.join( File_directory, AMSS_NCKU_TwoPuncture_inputfile )
- 
-    ## Copy and rename the file
-    shutil.copy2( AMSS_NCKU_TwoPuncture_inputfile_path, os.path.join(output_directory, 'TwoPunctureinput.par') )
-    
-    ## Run TwoPuncture to generate initial-data files
-    
-    start_time = time.time()  # Record start time
-
-    print()
-    print()
-    
-    ## Change to the output (run) directory
-    os.chdir(output_directory)
-
-    ## Run the TwoPuncture executable
-    import makefile_and_run
-    makefile_and_run.run_TwoPunctureABE()
-    
-    ## Change current working directory back up two levels
-    os.chdir('..')
-    os.chdir('..')
-
-elif (input_data.Initial_Data_Method == "Ansorg-TwoPuncture" ) and skip_twopuncture:
-    print()
-    print( " Skipping TwoPuncture execution, using existing initial data." )
-    print()
-    start_time = time.time()  # Record start time for ABE only
-else:
-    start_time = time.time()  # Record start time
-    
-##################################################################
-    
-## Update puncture data based on TwoPuncture run results
-
-if not skip_twopuncture:
-    import renew_puncture_parameter
-    renew_puncture_parameter.append_AMSSNCKU_BSSN_input(File_directory, output_directory)
-
-    ## Generated AMSS-NCKU input filename
-    AMSS_NCKU_inputfile      = 'AMSS-NCKU.input'
-    AMSS_NCKU_inputfile_path = os.path.join(File_directory, AMSS_NCKU_inputfile)
- 
-    ## Copy and rename the file
-    shutil.copy2( AMSS_NCKU_inputfile_path, os.path.join(output_directory, 'input.par') )
-
-    print()
-    print( " Successfully copy all AMSS-NCKU input parfile to target dictionary. " )
-    print()
-else:
-    print()
-    print( " Using existing input.par file from previous run." )
-    print()
-
-##################################################################
-
-## Launch the AMSS-NCKU program
-
-print()
-print()
-
-## Change to the run directory
-os.chdir( output_directory )
-
-import makefile_and_run
-makefile_and_run.run_ABE()
-
-## Change current working directory back up two levels
-os.chdir('..')
-os.chdir('..')
-
-end_time = time.time()
-elapsed_time = end_time - start_time
-
-##################################################################
-
-## Copy some basic input and log files out to facilitate debugging
-
-## Path to the file that stores calculation settings
-AMSS_NCKU_error_file_path = os.path.join(binary_results_directory, "setting.par")
-## Copy and rename the file for easier inspection
-shutil.copy( AMSS_NCKU_error_file_path, os.path.join(output_directory, "AMSSNCKU_setting_parameter") )
-
-## Path to the error log file
-AMSS_NCKU_error_file_path = os.path.join(binary_results_directory, "Error.log")
-## Copy and rename the error log
-shutil.copy( AMSS_NCKU_error_file_path, os.path.join(output_directory, "Error.log") )
-
-## Primary program outputs
-AMSS_NCKU_BH_data         = os.path.join(binary_results_directory, "bssn_BH.dat"        )
-AMSS_NCKU_ADM_data        = os.path.join(binary_results_directory, "bssn_ADMQs.dat"     )
-AMSS_NCKU_psi4_data       = os.path.join(binary_results_directory, "bssn_psi4.dat"      )
-AMSS_NCKU_constraint_data = os.path.join(binary_results_directory, "bssn_constraint.dat")
-## copy and rename the file
-shutil.copy( AMSS_NCKU_BH_data,         os.path.join(output_directory, "bssn_BH.dat"        ) )
-shutil.copy( AMSS_NCKU_ADM_data,        os.path.join(output_directory, "bssn_ADMQs.dat"     ) )
-shutil.copy( AMSS_NCKU_psi4_data,       os.path.join(output_directory, "bssn_psi4.dat"      ) )
-shutil.copy( AMSS_NCKU_constraint_data, os.path.join(output_directory, "bssn_constraint.dat") )
-
-## Additional program outputs
-if (input_data.Equation_Class == "BSSN-EM"):
-    AMSS_NCKU_phi1_data = os.path.join(binary_results_directory, "bssn_phi1.dat" )
-    AMSS_NCKU_phi2_data = os.path.join(binary_results_directory, "bssn_phi2.dat" )
-    shutil.copy( AMSS_NCKU_phi1_data, os.path.join(output_directory, "bssn_phi1.dat" ) )
-    shutil.copy( AMSS_NCKU_phi2_data, os.path.join(output_directory, "bssn_phi2.dat" ) )
-elif (input_data.Equation_Class == "BSSN-EScalar"):
-    AMSS_NCKU_maxs_data = os.path.join(binary_results_directory, "bssn_maxs.dat" )
-    shutil.copy( AMSS_NCKU_maxs_data, os.path.join(output_directory, "bssn_maxs.dat" ) )
-
-##################################################################
-
-## Plot the AMSS-NCKU program results
-
-print()
-print( " Plotting the txt and binary results data from the AMSS-NCKU simulation " )
-print()
-
-
-import plot_xiaoqu
-import plot_GW_strain_amplitude_xiaoqu
-
-## Plot black hole trajectory
-plot_xiaoqu.generate_puncture_orbit_plot(   binary_results_directory, figure_directory )
-plot_xiaoqu.generate_puncture_orbit_plot3D( binary_results_directory, figure_directory )
-
-## Plot black hole separation vs. time
-plot_xiaoqu.generate_puncture_distence_plot( binary_results_directory, figure_directory )
-
-## Plot gravitational waveforms (psi4 and strain amplitude)
-for i in range(input_data.Detector_Number):
-    plot_xiaoqu.generate_gravitational_wave_psi4_plot( binary_results_directory, figure_directory, i )
-    plot_GW_strain_amplitude_xiaoqu.generate_gravitational_wave_amplitude_plot( binary_results_directory, figure_directory, i )
-
-## Plot ADM mass evolution
-for i in range(input_data.Detector_Number):
-    plot_xiaoqu.generate_ADMmass_plot( binary_results_directory, figure_directory, i )
-
-## Plot Hamiltonian constraint violation over time
-for i in range(input_data.grid_level):
-    plot_xiaoqu.generate_constraint_check_plot( binary_results_directory, figure_directory, i )
-
-## Plot stored binary data
-plot_xiaoqu.generate_binary_data_plot( binary_results_directory, figure_directory )
-
-print()
-print( f" This Program Cost = {elapsed_time} Seconds " )
-print()
-
-
-##################################################################
-
-print()
-print( " The AMSS-NCKU-Python simulation is successfully finished, thanks for using !!! " )
-print()
-
-##################################################################
-
-
--- a/AMSS_NCKU_Input.py
+++ b/AMSS_NCKU_Input.py
@@ -16,7 +16,7 @@ import numpy
 File_directory   = "GW150914"                    ## output file directory
 Output_directory = "binary_output"               ## binary data file directory
                                                 ## The file directory name should not be too long
-MPI_processes    = 64                             ## number of mpi processes used in the simulation
+MPI_processes    = 1                              ## number of processes (MPI removed, single-process mode)

 GPU_Calculation  = "no"                          ## Use GPU or not 
                                                 ## (prefer "no" in the current version, because the GPU part may have bugs when integrated in this Python interface)
--- a/AMSS_NCKU_source/ABE.C
+++ b/AMSS_NCKU_source/ABE.C
@@ -20,7 +20,11 @@ using namespace std;
 #include <map.h>
 #endif

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 #include "misc.h"
 #include "macrodef.h"
--- a/AMSS_NCKU_source/Ansorg.h
+++ b/AMSS_NCKU_source/Ansorg.h
@@ -19,7 +19,11 @@ using namespace std;
 #include <math.h>
 #endif

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 #define PI M_PI

--- a/AMSS_NCKU_source/Block.h
+++ b/AMSS_NCKU_source/Block.h
@@ -2,7 +2,11 @@
 #ifndef BLOCK_H
 #define BLOCK_H

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif
 #include "macrodef.h" //need dim here; Vertex or Cell
 #include "var.h"
 #include "MyList.h"
--- a/AMSS_NCKU_source/IntPnts0.C
+++ b/AMSS_NCKU_source/IntPnts0.C
@@ -4,7 +4,11 @@
 #include <stdarg.h>
 #include <string.h>

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 #include "myglobal.h"

--- a/AMSS_NCKU_source/MPatch.C
+++ b/AMSS_NCKU_source/MPatch.C
@@ -341,8 +341,9 @@ void Patch::Interp_Points(MyList<var> *VarList,
                          double *Shellf, int Symmetry)
 {
  // NOTE: we do not Synchnize variables here, make sure of that before calling this routine
-  int myrank;
+  int myrank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
+  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  int ordn = 2 * ghost_width;
  MyList<var> *varl;
@@ -354,24 +355,18 @@ void Patch::Interp_Points(MyList<var> *VarList,
    varl = varl->next;
  }

-  double *shellf;
-  shellf = new double[NN * num_var];
-  memset(shellf, 0, sizeof(double) * NN * num_var);
+  memset(Shellf, 0, sizeof(double) * NN * num_var);

-  // we use weight to monitor code, later some day we can move it for optimization
-  int *weight;
-  weight = new int[NN];
-  memset(weight, 0, sizeof(int) * NN);
-
-  double *DH, *llb, *uub;
-  DH = new double[dim];
+  // owner_rank[j] records which MPI rank owns point j
+  // All ranks traverse the same block list so they all agree on ownership
+  int *owner_rank;
+  owner_rank = new int[NN];
+  for (int j = 0; j < NN; j++)
+    owner_rank[j] = -1;

+  double DH[dim], llb[dim], uub[dim];
  for (int i = 0; i < dim; i++)
-  {
    DH[i] = getdX(i);
-  }
-  llb = new double[dim];
-  uub = new double[dim];

  for (int j = 0; j < NN; j++) // run along points
  {
@@ -403,12 +398,6 @@ void Patch::Interp_Points(MyList<var> *VarList,
      bool flag = true;
      for (int i = 0; i < dim; i++)
      {
-// NOTE: our dividing structure is (exclude ghost)
-// -1 0
-//       1  2
-// so (0,1) does not belong to any part for vertex structure
-// here we put (0,0.5) to left part and (0.5,1) to right part
-// BUT for cell structure the bbox is (-1.5,0.5) and (0.5,2.5), there is no missing region at all
 #ifdef Vertex
 #ifdef Cell
 #error Both Cell and Vertex are defined
@@ -433,6 +422,7 @@ void Patch::Interp_Points(MyList<var> *VarList,
      if (flag)
      {
        notfind = false;
+        owner_rank[j] = BP->rank;
        if (myrank == BP->rank)
        {
          //---> interpolation
@@ -440,14 +430,11 @@ void Patch::Interp_Points(MyList<var> *VarList,
          int k = 0;
          while (varl) // run along variables
          {
-            //              shellf[j*num_var+k] = Parallel::global_interp(dim,BP->shape,BP->X,BP->fgfs[varl->data->sgfn],
-            //	  		                                    pox,ordn,varl->data->SoA,Symmetry);
-            f_global_interp(BP->shape, BP->X[0], BP->X[1], BP->X[2], BP->fgfs[varl->data->sgfn], shellf[j * num_var + k],
+            f_global_interp(BP->shape, BP->X[0], BP->X[1], BP->X[2], BP->fgfs[varl->data->sgfn], Shellf[j * num_var + k],
                            pox[0], pox[1], pox[2], ordn, varl->data->SoA, Symmetry);
            varl = varl->next;
            k++;
          }
-          weight[j] = 1;
        }
      }
      if (Bp == ble)
@@ -456,61 +443,125 @@ void Patch::Interp_Points(MyList<var> *VarList,
    }
  }

-  MPI_Allreduce(shellf, Shellf, NN * num_var, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  int *Weight;
-  Weight = new int[NN];
-  MPI_Allreduce(weight, Weight, NN, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
-
-  //  misc::tillherecheck("print me");
-
-  for (int i = 0; i < NN; i++)
+  // Replace MPI_Allreduce with per-owner MPI_Bcast:
+  // Group consecutive points by owner rank and broadcast each group.
+  // Since each point's data is non-zero only on the owner rank,
+  // Bcast from owner is equivalent to Allreduce(MPI_SUM) but much cheaper.
  {
-    if (Weight[i] > 1)
+    int j = 0;
+    while (j < NN)
+    {
+      int cur_owner = owner_rank[j];
+      if (cur_owner < 0)
      {
        if (myrank == 0)
-        cout << "WARNING: Patch::Interp_Points meets multiple weight" << endl;
-      for (int j = 0; j < num_var; j++)
-        Shellf[j + i * num_var] = Shellf[j + i * num_var] / Weight[i];
-    }
-    else if (Weight[i] == 0 && myrank == 0)
        {
          cout << "ERROR: Patch::Interp_Points fails to find point (";
-      for (int j = 0; j < dim; j++)
+          for (int d = 0; d < dim; d++)
          {
-        cout << XX[j][i];
-        if (j < dim - 1)
+            cout << XX[d][j];
+            if (d < dim - 1)
              cout << ",";
            else
              cout << ")";
          }
          cout << " on Patch (";
-      for (int j = 0; j < dim; j++)
+          for (int d = 0; d < dim; d++)
          {
-        cout << bbox[j] << "+" << lli[j] * getdX(j);
-        if (j < dim - 1)
+            cout << bbox[d] << "+" << lli[d] * DH[d];
+            if (d < dim - 1)
              cout << ",";
            else
              cout << ")--";
          }
          cout << "(";
-      for (int j = 0; j < dim; j++)
+          for (int d = 0; d < dim; d++)
          {
-        cout << bbox[dim + j] << "-" << uui[j] * getdX(j);
-        if (j < dim - 1)
+            cout << bbox[dim + d] << "-" << uui[d] * DH[d];
+            if (d < dim - 1)
              cout << ",";
            else
              cout << ")" << endl;
          }
-#if 0
-       checkBlock();
-#else
-      cout << "splited domains:" << endl;
+          MPI_Abort(MPI_COMM_WORLD, 1);
+        }
+        j++;
+        continue;
+      }
+      // Find contiguous run of points with the same owner
+      int jstart = j;
+      while (j < NN && owner_rank[j] == cur_owner)
+        j++;
+      int count = (j - jstart) * num_var;
+      MPI_Bcast(Shellf + jstart * num_var, count, MPI_DOUBLE, cur_owner, MPI_COMM_WORLD);
+    }
+  }
+
+  delete[] owner_rank;
+}
+void Patch::Interp_Points(MyList<var> *VarList,
+                          int NN, double **XX,
+                          double *Shellf, int Symmetry,
+                          int Nmin_consumer, int Nmax_consumer)
+{
+  // Targeted point-to-point overload: each owner sends each point only to
+  // the one rank that needs it for integration (consumer), reducing
+  // communication volume by ~nprocs times compared to the Bcast version.
+  int myrank, nprocs;
+  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
+  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
+
+  int ordn = 2 * ghost_width;
+  MyList<var> *varl;
+  int num_var = 0;
+  varl = VarList;
+  while (varl)
  {
+    num_var++;
+    varl = varl->next;
+  }
+
+  memset(Shellf, 0, sizeof(double) * NN * num_var);
+
+  // owner_rank[j] records which MPI rank owns point j
+  int *owner_rank;
+  owner_rank = new int[NN];
+  for (int j = 0; j < NN; j++)
+    owner_rank[j] = -1;
+
+  double DH[dim], llb[dim], uub[dim];
+  for (int i = 0; i < dim; i++)
+    DH[i] = getdX(i);
+
+  // --- Interpolation phase (identical to original) ---
+  for (int j = 0; j < NN; j++)
+  {
+    double pox[dim];
+    for (int i = 0; i < dim; i++)
+    {
+      pox[i] = XX[i][j];
+      if (myrank == 0 && (XX[i][j] < bbox[i] + lli[i] * DH[i] || XX[i][j] > bbox[dim + i] - uui[i] * DH[i]))
+      {
+        cout << "Patch::Interp_Points: point (";
+        for (int k = 0; k < dim; k++)
+        {
+          cout << XX[k][j];
+          if (k < dim - 1)
+            cout << ",";
+          else
+            cout << ") is out of current Patch." << endl;
+        }
+        MPI_Abort(MPI_COMM_WORLD, 1);
+      }
+    }
+
    MyList<Block> *Bp = blb;
-        while (Bp)
+    bool notfind = true;
+    while (notfind && Bp)
    {
      Block *BP = Bp->data;

+      bool flag = true;
      for (int i = 0; i < dim; i++)
      {
 #ifdef Vertex
@@ -527,32 +578,192 @@ void Patch::Interp_Points(MyList<var> *VarList,
 #error Not define Vertex nor Cell
 #endif
 #endif
-          }
-          cout << "(";
-          for (int j = 0; j < dim; j++)
+        if (XX[i][j] - llb[i] < -DH[i] / 2 || XX[i][j] - uub[i] > DH[i] / 2)
        {
-            cout << llb[j] << ":" << uub[j];
-            if (j < dim - 1)
-              cout << ",";
-            else
-              cout << ")" << endl;
+          flag = false;
+          break;
+        }
+      }
+
+      if (flag)
+      {
+        notfind = false;
+        owner_rank[j] = BP->rank;
+        if (myrank == BP->rank)
+        {
+          varl = VarList;
+          int k = 0;
+          while (varl)
+          {
+            f_global_interp(BP->shape, BP->X[0], BP->X[1], BP->X[2], BP->fgfs[varl->data->sgfn], Shellf[j * num_var + k],
+                            pox[0], pox[1], pox[2], ordn, varl->data->SoA, Symmetry);
+            varl = varl->next;
+            k++;
+          }
+        }
      }
      if (Bp == ble)
        break;
      Bp = Bp->next;
    }
  }
-#endif
+
+  // --- Error check for unfound points ---
+  for (int j = 0; j < NN; j++)
+  {
+    if (owner_rank[j] < 0 && myrank == 0)
+    {
+      cout << "ERROR: Patch::Interp_Points fails to find point (";
+      for (int d = 0; d < dim; d++)
+      {
+        cout << XX[d][j];
+        if (d < dim - 1)
+          cout << ",";
+        else
+          cout << ")";
+      }
+      cout << " on Patch (";
+      for (int d = 0; d < dim; d++)
+      {
+        cout << bbox[d] << "+" << lli[d] * DH[d];
+        if (d < dim - 1)
+          cout << ",";
+        else
+          cout << ")--";
+      }
+      cout << "(";
+      for (int d = 0; d < dim; d++)
+      {
+        cout << bbox[dim + d] << "-" << uui[d] * DH[d];
+        if (d < dim - 1)
+          cout << ",";
+        else
+          cout << ")" << endl;
+      }
      MPI_Abort(MPI_COMM_WORLD, 1);
    }
  }

-  delete[] shellf;
-  delete[] weight;
-  delete[] Weight;
-  delete[] DH;
-  delete[] llb;
-  delete[] uub;
+  // --- Targeted point-to-point communication phase ---
+  // Compute consumer_rank[j] using the same deterministic formula as surface_integral
+  int *consumer_rank = new int[NN];
+  {
+    int mp = NN / nprocs;
+    int Lp = NN - nprocs * mp;
+    for (int j = 0; j < NN; j++)
+    {
+      if (j < Lp * (mp + 1))
+        consumer_rank[j] = j / (mp + 1);
+      else
+        consumer_rank[j] = Lp + (j - Lp * (mp + 1)) / mp;
+    }
+  }
+
+  // Count sends and recvs per rank
+  int *send_count = new int[nprocs];
+  int *recv_count = new int[nprocs];
+  memset(send_count, 0, sizeof(int) * nprocs);
+  memset(recv_count, 0, sizeof(int) * nprocs);
+
+  for (int j = 0; j < NN; j++)
+  {
+    int own = owner_rank[j];
+    int con = consumer_rank[j];
+    if (own == con)
+      continue; // local — no communication needed
+    if (own == myrank)
+      send_count[con]++;
+    if (con == myrank)
+      recv_count[own]++;
+  }
+
+  // Build send buffers: for each destination rank, pack (index, data) pairs
+  // Each entry: 1 int (point index j) + num_var doubles
+  int total_send = 0, total_recv = 0;
+  int *send_offset = new int[nprocs];
+  int *recv_offset = new int[nprocs];
+  for (int r = 0; r < nprocs; r++)
+  {
+    send_offset[r] = total_send;
+    total_send += send_count[r];
+    recv_offset[r] = total_recv;
+    total_recv += recv_count[r];
+  }
+
+  // Pack send buffers: each message contains (j, data[0..num_var-1]) per point
+  int stride = 1 + num_var; // 1 double for index + num_var doubles for data
+  double *sendbuf = new double[total_send * stride];
+  double *recvbuf = new double[total_recv * stride];
+
+  // Temporary counters for packing
+  int *pack_pos = new int[nprocs];
+  memset(pack_pos, 0, sizeof(int) * nprocs);
+
+  for (int j = 0; j < NN; j++)
+  {
+    int own = owner_rank[j];
+    int con = consumer_rank[j];
+    if (own != myrank || con == myrank)
+      continue;
+    int pos = (send_offset[con] + pack_pos[con]) * stride;
+    sendbuf[pos] = (double)j; // point index
+    for (int v = 0; v < num_var; v++)
+      sendbuf[pos + 1 + v] = Shellf[j * num_var + v];
+    pack_pos[con]++;
+  }
+
+  // Post non-blocking recvs and sends
+  int n_req = 0;
+  for (int r = 0; r < nprocs; r++)
+  {
+    if (recv_count[r] > 0) n_req++;
+    if (send_count[r] > 0) n_req++;
+  }
+
+  MPI_Request *reqs = new MPI_Request[n_req];
+  int req_idx = 0;
+
+  for (int r = 0; r < nprocs; r++)
+  {
+    if (recv_count[r] > 0)
+    {
+      MPI_Irecv(recvbuf + recv_offset[r] * stride,
+                recv_count[r] * stride, MPI_DOUBLE,
+                r, 0, MPI_COMM_WORLD, &reqs[req_idx++]);
+    }
+  }
+  for (int r = 0; r < nprocs; r++)
+  {
+    if (send_count[r] > 0)
+    {
+      MPI_Isend(sendbuf + send_offset[r] * stride,
+                send_count[r] * stride, MPI_DOUBLE,
+                r, 0, MPI_COMM_WORLD, &reqs[req_idx++]);
+    }
+  }
+
+  if (n_req > 0)
+    MPI_Waitall(n_req, reqs, MPI_STATUSES_IGNORE);
+
+  // Unpack recv buffers into Shellf
+  for (int i = 0; i < total_recv; i++)
+  {
+    int pos = i * stride;
+    int j = (int)recvbuf[pos];
+    for (int v = 0; v < num_var; v++)
+      Shellf[j * num_var + v] = recvbuf[pos + 1 + v];
+  }
+
+  delete[] reqs;
+  delete[] sendbuf;
+  delete[] recvbuf;
+  delete[] pack_pos;
+  delete[] send_offset;
+  delete[] recv_offset;
+  delete[] send_count;
+  delete[] recv_count;
+  delete[] consumer_rank;
+  delete[] owner_rank;
 }
 void Patch::Interp_Points(MyList<var> *VarList,
                          int NN, double **XX,
@@ -573,24 +784,22 @@ void Patch::Interp_Points(MyList<var> *VarList,
    varl = varl->next;
  }

-  double *shellf;
-  shellf = new double[NN * num_var];
-  memset(shellf, 0, sizeof(double) * NN * num_var);
+  memset(Shellf, 0, sizeof(double) * NN * num_var);

-  // we use weight to monitor code, later some day we can move it for optimization
-  int *weight;
-  weight = new int[NN];
-  memset(weight, 0, sizeof(int) * NN);
+  // owner_rank[j] stores the global rank that owns point j
+  int *owner_rank;
+  owner_rank = new int[NN];
+  for (int j = 0; j < NN; j++)
+    owner_rank[j] = -1;

-  double *DH, *llb, *uub;
-  DH = new double[dim];
+  // Build global-to-local rank translation for Comm_here
+  MPI_Group world_group, local_group;
+  MPI_Comm_group(MPI_COMM_WORLD, &world_group);
+  MPI_Comm_group(Comm_here, &local_group);

+  double DH[dim], llb[dim], uub[dim];
  for (int i = 0; i < dim; i++)
-  {
    DH[i] = getdX(i);
-  }
-  llb = new double[dim];
-  uub = new double[dim];

  for (int j = 0; j < NN; j++) // run along points
  {
@@ -622,12 +831,6 @@ void Patch::Interp_Points(MyList<var> *VarList,
      bool flag = true;
      for (int i = 0; i < dim; i++)
      {
-// NOTE: our dividing structure is (exclude ghost)
-// -1 0
-//       1  2
-// so (0,1) does not belong to any part for vertex structure
-// here we put (0,0.5) to left part and (0.5,1) to right part
-// BUT for cell structure the bbox is (-1.5,0.5) and (0.5,2.5), there is no missing region at all
 #ifdef Vertex
 #ifdef Cell
 #error Both Cell and Vertex are defined
@@ -652,6 +855,7 @@ void Patch::Interp_Points(MyList<var> *VarList,
      if (flag)
      {
        notfind = false;
+        owner_rank[j] = BP->rank;
        if (myrank == BP->rank)
        {
          //---> interpolation
@@ -659,14 +863,11 @@ void Patch::Interp_Points(MyList<var> *VarList,
          int k = 0;
          while (varl) // run along variables
          {
-            //              shellf[j*num_var+k] = Parallel::global_interp(dim,BP->shape,BP->X,BP->fgfs[varl->data->sgfn],
-            //	  		                                    pox,ordn,varl->data->SoA,Symmetry);
-            f_global_interp(BP->shape, BP->X[0], BP->X[1], BP->X[2], BP->fgfs[varl->data->sgfn], shellf[j * num_var + k],
+            f_global_interp(BP->shape, BP->X[0], BP->X[1], BP->X[2], BP->fgfs[varl->data->sgfn], Shellf[j * num_var + k],
                            pox[0], pox[1], pox[2], ordn, varl->data->SoA, Symmetry);
            varl = varl->next;
            k++;
          }
-          weight[j] = 1;
        }
      }
      if (Bp == ble)
@@ -675,97 +876,35 @@ void Patch::Interp_Points(MyList<var> *VarList,
    }
  }

-  MPI_Allreduce(shellf, Shellf, NN * num_var, MPI_DOUBLE, MPI_SUM, Comm_here);
-  int *Weight;
-  Weight = new int[NN];
-  MPI_Allreduce(weight, Weight, NN, MPI_INT, MPI_SUM, Comm_here);
+  // Collect unique global owner ranks and translate to local ranks in Comm_here
+  // Then broadcast each owner's points via MPI_Bcast on Comm_here
+  {
+    int j = 0;
+    while (j < NN)
+    {
+      int cur_owner_global = owner_rank[j];
+      if (cur_owner_global < 0)
+      {
+        // Point not found — skip (error check disabled for sub-communicator levels)
+        j++;
+        continue;
+      }
+      // Translate global rank to local rank in Comm_here
+      int cur_owner_local;
+      MPI_Group_translate_ranks(world_group, 1, &cur_owner_global, local_group, &cur_owner_local);

-  //  misc::tillherecheck("print me");
-  //  if(lmyrank == 0) cout<<"myrank = "<<myrank<<"print me"<<endl;
-
-  for (int i = 0; i < NN; i++)
-  {
-    if (Weight[i] > 1)
-    {
-      if (lmyrank == 0)
-        cout << "WARNING: Patch::Interp_Points meets multiple weight" << endl;
-      for (int j = 0; j < num_var; j++)
-        Shellf[j + i * num_var] = Shellf[j + i * num_var] / Weight[i];
+      // Find contiguous run of points with the same owner
+      int jstart = j;
+      while (j < NN && owner_rank[j] == cur_owner_global)
+        j++;
+      int count = (j - jstart) * num_var;
+      MPI_Bcast(Shellf + jstart * num_var, count, MPI_DOUBLE, cur_owner_local, Comm_here);
    }
-#if 0 // for not involved levels, this may fail     
-     else if(Weight[i] == 0 && lmyrank == 0)
-     {
-       cout<<"ERROR: Patch::Interp_Points fails to find point (";
-       for(int j=0;j<dim;j++)
-       {
-	  cout<<XX[j][i];
-	  if(j<dim-1) cout<<",";
-	  else        cout<<")";
-       }
-       cout<<" on Patch (";
-       for(int j=0;j<dim;j++)
-       {
-	  cout<<bbox[j]<<"+"<<lli[j]*getdX(j);
-	  if(j<dim-1) cout<<",";
-	  else        cout<<")--";
-       }
-       cout<<"(";
-       for(int j=0;j<dim;j++)
-       {
-	  cout<<bbox[dim+j]<<"-"<<uui[j]*getdX(j);
-	  if(j<dim-1) cout<<",";
-	  else        cout<<")"<<endl;
-       }
-#if 0
-       checkBlock();
-#else
-  cout<<"splited domains:"<<endl;
-  {
-     MyList<Block> *Bp=blb;
-     while(Bp)
-     {
-	Block *BP=Bp->data;
-
-	for(int i=0;i<dim;i++)
-	{
-#ifdef Vertex
-#ifdef Cell
-#error Both Cell and Vertex are defined
-#endif    
-          llb[i] = (feq(BP->bbox[i]    ,bbox[i]    ,DH[i]/2)) ? BP->bbox[i]+lli[i]*DH[i]     : BP->bbox[i]    +(ghost_width-0.5)*DH[i];
-          uub[i] = (feq(BP->bbox[dim+i],bbox[dim+i],DH[i]/2)) ? BP->bbox[dim+i]-uui[i]*DH[i] : BP->bbox[dim+i]-(ghost_width-0.5)*DH[i];
-#else
-#ifdef Cell
-          llb[i] = (feq(BP->bbox[i]    ,bbox[i]    ,DH[i]/2)) ? BP->bbox[i]+lli[i]*DH[i]     : BP->bbox[i]    +ghost_width*DH[i];
-          uub[i] = (feq(BP->bbox[dim+i],bbox[dim+i],DH[i]/2)) ? BP->bbox[dim+i]-uui[i]*DH[i] : BP->bbox[dim+i]-ghost_width*DH[i];
-#else
-#error Not define Vertex nor Cell
-#endif
-#endif 
-	}       
-       cout<<"(";
-       for(int j=0;j<dim;j++)
-       {
-	  cout<<llb[j]<<":"<<uub[j];
-	  if(j<dim-1) cout<<",";
-	  else        cout<<")"<<endl;
-       }
-	if(Bp == ble) break;
-	Bp=Bp->next;
-     }
-  }
-#endif       
-       MPI_Abort(MPI_COMM_WORLD,1);
-     }
-#endif
  }

-  delete[] shellf;
-  delete[] weight;
-  delete[] Weight;
-  delete[] DH;
-  delete[] llb;
-  delete[] uub;
+  MPI_Group_free(&world_group);
+  MPI_Group_free(&local_group);
+  delete[] owner_rank;
 }
 void Patch::checkBlock()
 {
--- a/AMSS_NCKU_source/MPatch.h
+++ b/AMSS_NCKU_source/MPatch.h
@@ -2,7 +2,11 @@
 #ifndef PATCH_H
 #define PATCH_H

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif
 #include "MyList.h"
 #include "Block.h"
 #include "var.h"
@@ -39,6 +43,10 @@ public:

   bool Find_Point(double *XX);

+   void Interp_Points(MyList<var> *VarList,
+                      int NN, double **XX,
+                      double *Shellf, int Symmetry,
+                      int Nmin_consumer, int Nmax_consumer);
   void Interp_Points(MyList<var> *VarList,
                      int NN, double **XX,
                      double *Shellf, int Symmetry, MPI_Comm Comm_here);
--- a/AMSS_NCKU_source/Newton.C
+++ b/AMSS_NCKU_source/Newton.C
@@ -8,7 +8,11 @@
 #include <limits.h>
 #include <float.h>
 #include <math.h>
+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 #include "util_Table.h"
 #include "cctk.h"
--- a/AMSS_NCKU_source/NullShellPatch.h
+++ b/AMSS_NCKU_source/NullShellPatch.h
@@ -23,7 +23,11 @@ using namespace std;
 #include <complex.h>
 #endif

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif
 #include "MyList.h"
 #include "Block.h"
 #include "Parallel.h"
--- a/AMSS_NCKU_source/NullShellPatch2.h
+++ b/AMSS_NCKU_source/NullShellPatch2.h
@@ -23,7 +23,11 @@ using namespace std;
 #include <complex.h>
 #endif

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif
 #include "MyList.h"
 #include "Block.h"
 #include "Parallel.h"
--- a/AMSS_NCKU_source/Parallel.C
+++ b/AMSS_NCKU_source/Parallel.C
@@ -3756,6 +3756,484 @@ void Parallel::Sync(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetry)
  delete[] transfer_src;
  delete[] transfer_dst;
 }
+// Merged Sync: collect all intra-patch and inter-patch grid segment lists,
+// then issue a single transfer() call instead of N+1 separate ones.
+void Parallel::Sync_merged(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetry)
+{
+  int cpusize;
+  MPI_Comm_size(MPI_COMM_WORLD, &cpusize);
+
+  MyList<Parallel::gridseg> **combined_src = new MyList<Parallel::gridseg> *[cpusize];
+  MyList<Parallel::gridseg> **combined_dst = new MyList<Parallel::gridseg> *[cpusize];
+  for (int node = 0; node < cpusize; node++)
+    combined_src[node] = combined_dst[node] = 0;
+
+  // Phase A: Intra-patch ghost exchange segments
+  MyList<Patch> *Pp = PatL;
+  while (Pp)
+  {
+    Patch *Pat = Pp->data;
+    MyList<Parallel::gridseg> *dst_ghost = build_ghost_gsl(Pat);
+
+    for (int node = 0; node < cpusize; node++)
+    {
+      MyList<Parallel::gridseg> *src_owned = build_owned_gsl0(Pat, node);
+      MyList<Parallel::gridseg> *tsrc = 0, *tdst = 0;
+      build_gstl(src_owned, dst_ghost, &tsrc, &tdst);
+
+      if (tsrc)
+      {
+        if (combined_src[node])
+          combined_src[node]->catList(tsrc);
+        else
+          combined_src[node] = tsrc;
+      }
+      if (tdst)
+      {
+        if (combined_dst[node])
+          combined_dst[node]->catList(tdst);
+        else
+          combined_dst[node] = tdst;
+      }
+
+      if (src_owned)
+        src_owned->destroyList();
+    }
+
+    if (dst_ghost)
+      dst_ghost->destroyList();
+
+    Pp = Pp->next;
+  }
+
+  // Phase B: Inter-patch buffer exchange segments
+  MyList<Parallel::gridseg> *dst_buffer = build_buffer_gsl(PatL);
+  for (int node = 0; node < cpusize; node++)
+  {
+    MyList<Parallel::gridseg> *src_owned = build_owned_gsl(PatL, node, 5, Symmetry);
+    MyList<Parallel::gridseg> *tsrc = 0, *tdst = 0;
+    build_gstl(src_owned, dst_buffer, &tsrc, &tdst);
+
+    if (tsrc)
+    {
+      if (combined_src[node])
+        combined_src[node]->catList(tsrc);
+      else
+        combined_src[node] = tsrc;
+    }
+    if (tdst)
+    {
+      if (combined_dst[node])
+        combined_dst[node]->catList(tdst);
+      else
+        combined_dst[node] = tdst;
+    }
+
+    if (src_owned)
+      src_owned->destroyList();
+  }
+  if (dst_buffer)
+    dst_buffer->destroyList();
+
+  // Phase C: Single transfer
+  transfer(combined_src, combined_dst, VarList, VarList, Symmetry);
+
+  // Phase D: Cleanup
+  for (int node = 0; node < cpusize; node++)
+  {
+    if (combined_src[node])
+      combined_src[node]->destroyList();
+    if (combined_dst[node])
+      combined_dst[node]->destroyList();
+  }
+  delete[] combined_src;
+  delete[] combined_dst;
+}
+// SyncCache constructor
+Parallel::SyncCache::SyncCache()
+    : valid(false), cpusize(0), combined_src(0), combined_dst(0),
+      send_lengths(0), recv_lengths(0), send_bufs(0), recv_bufs(0),
+      send_buf_caps(0), recv_buf_caps(0), reqs(0), stats(0), max_reqs(0)
+{
+}
+// SyncCache invalidate: free grid segment lists but keep buffers
+void Parallel::SyncCache::invalidate()
+{
+  if (!valid)
+    return;
+  for (int i = 0; i < cpusize; i++)
+  {
+    if (combined_src[i])
+      combined_src[i]->destroyList();
+    if (combined_dst[i])
+      combined_dst[i]->destroyList();
+    combined_src[i] = combined_dst[i] = 0;
+    send_lengths[i] = recv_lengths[i] = 0;
+  }
+  valid = false;
+}
+// SyncCache destroy: free everything
+void Parallel::SyncCache::destroy()
+{
+  invalidate();
+  if (combined_src) delete[] combined_src;
+  if (combined_dst) delete[] combined_dst;
+  if (send_lengths) delete[] send_lengths;
+  if (recv_lengths) delete[] recv_lengths;
+  if (send_buf_caps) delete[] send_buf_caps;
+  if (recv_buf_caps) delete[] recv_buf_caps;
+  for (int i = 0; i < cpusize; i++)
+  {
+    if (send_bufs && send_bufs[i]) delete[] send_bufs[i];
+    if (recv_bufs && recv_bufs[i]) delete[] recv_bufs[i];
+  }
+  if (send_bufs) delete[] send_bufs;
+  if (recv_bufs) delete[] recv_bufs;
+  if (reqs) delete[] reqs;
+  if (stats) delete[] stats;
+  combined_src = combined_dst = 0;
+  send_lengths = recv_lengths = 0;
+  send_buf_caps = recv_buf_caps = 0;
+  send_bufs = recv_bufs = 0;
+  reqs = 0; stats = 0;
+  cpusize = 0; max_reqs = 0;
+}
+// transfer_cached: reuse pre-allocated buffers from SyncCache
+void Parallel::transfer_cached(MyList<Parallel::gridseg> **src, MyList<Parallel::gridseg> **dst,
+                               MyList<var> *VarList1, MyList<var> *VarList2,
+                               int Symmetry, SyncCache &cache)
+{
+  int myrank;
+  MPI_Comm_size(MPI_COMM_WORLD, &cache.cpusize);
+  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
+  int cpusize = cache.cpusize;
+
+  int req_no = 0;
+  int node;
+
+  for (node = 0; node < cpusize; node++)
+  {
+    if (node == myrank)
+    {
+      int length = data_packer(0, src[myrank], dst[myrank], node, PACK, VarList1, VarList2, Symmetry);
+      cache.recv_lengths[node] = length;
+      if (length > 0)
+      {
+        if (length > cache.recv_buf_caps[node])
+        {
+          if (cache.recv_bufs[node]) delete[] cache.recv_bufs[node];
+          cache.recv_bufs[node] = new double[length];
+          cache.recv_buf_caps[node] = length;
+        }
+        data_packer(cache.recv_bufs[node], src[myrank], dst[myrank], node, PACK, VarList1, VarList2, Symmetry);
+      }
+    }
+    else
+    {
+      // send
+      int slength = data_packer(0, src[myrank], dst[myrank], node, PACK, VarList1, VarList2, Symmetry);
+      cache.send_lengths[node] = slength;
+      if (slength > 0)
+      {
+        if (slength > cache.send_buf_caps[node])
+        {
+          if (cache.send_bufs[node]) delete[] cache.send_bufs[node];
+          cache.send_bufs[node] = new double[slength];
+          cache.send_buf_caps[node] = slength;
+        }
+        data_packer(cache.send_bufs[node], src[myrank], dst[myrank], node, PACK, VarList1, VarList2, Symmetry);
+        MPI_Isend((void *)cache.send_bufs[node], slength, MPI_DOUBLE, node, 1, MPI_COMM_WORLD, cache.reqs + req_no++);
+      }
+      // recv
+      int rlength = data_packer(0, src[node], dst[node], node, UNPACK, VarList1, VarList2, Symmetry);
+      cache.recv_lengths[node] = rlength;
+      if (rlength > 0)
+      {
+        if (rlength > cache.recv_buf_caps[node])
+        {
+          if (cache.recv_bufs[node]) delete[] cache.recv_bufs[node];
+          cache.recv_bufs[node] = new double[rlength];
+          cache.recv_buf_caps[node] = rlength;
+        }
+        MPI_Irecv((void *)cache.recv_bufs[node], rlength, MPI_DOUBLE, node, 1, MPI_COMM_WORLD, cache.reqs + req_no++);
+      }
+    }
+  }
+
+  MPI_Waitall(req_no, cache.reqs, cache.stats);
+
+  for (node = 0; node < cpusize; node++)
+    if (cache.recv_bufs[node] && cache.recv_lengths[node] > 0)
+      data_packer(cache.recv_bufs[node], src[node], dst[node], node, UNPACK, VarList1, VarList2, Symmetry);
+}
+// Sync_cached: build grid segment lists on first call, reuse on subsequent calls
+void Parallel::Sync_cached(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetry, SyncCache &cache)
+{
+  if (!cache.valid)
+  {
+    int cpusize;
+    MPI_Comm_size(MPI_COMM_WORLD, &cpusize);
+    cache.cpusize = cpusize;
+
+    // Allocate cache arrays if needed
+    if (!cache.combined_src)
+    {
+      cache.combined_src = new MyList<Parallel::gridseg> *[cpusize];
+      cache.combined_dst = new MyList<Parallel::gridseg> *[cpusize];
+      cache.send_lengths = new int[cpusize];
+      cache.recv_lengths = new int[cpusize];
+      cache.send_bufs = new double *[cpusize];
+      cache.recv_bufs = new double *[cpusize];
+      cache.send_buf_caps = new int[cpusize];
+      cache.recv_buf_caps = new int[cpusize];
+      for (int i = 0; i < cpusize; i++)
+      {
+        cache.send_bufs[i] = cache.recv_bufs[i] = 0;
+        cache.send_buf_caps[i] = cache.recv_buf_caps[i] = 0;
+      }
+      cache.max_reqs = 2 * cpusize;
+      cache.reqs = new MPI_Request[cache.max_reqs];
+      cache.stats = new MPI_Status[cache.max_reqs];
+    }
+
+    for (int node = 0; node < cpusize; node++)
+    {
+      cache.combined_src[node] = cache.combined_dst[node] = 0;
+      cache.send_lengths[node] = cache.recv_lengths[node] = 0;
+    }
+
+    // Build intra-patch segments (same as Sync_merged Phase A)
+    MyList<Patch> *Pp = PatL;
+    while (Pp)
+    {
+      Patch *Pat = Pp->data;
+      MyList<Parallel::gridseg> *dst_ghost = build_ghost_gsl(Pat);
+      for (int node = 0; node < cpusize; node++)
+      {
+        MyList<Parallel::gridseg> *src_owned = build_owned_gsl0(Pat, node);
+        MyList<Parallel::gridseg> *tsrc = 0, *tdst = 0;
+        build_gstl(src_owned, dst_ghost, &tsrc, &tdst);
+        if (tsrc)
+        {
+          if (cache.combined_src[node])
+            cache.combined_src[node]->catList(tsrc);
+          else
+            cache.combined_src[node] = tsrc;
+        }
+        if (tdst)
+        {
+          if (cache.combined_dst[node])
+            cache.combined_dst[node]->catList(tdst);
+          else
+            cache.combined_dst[node] = tdst;
+        }
+        if (src_owned) src_owned->destroyList();
+      }
+      if (dst_ghost) dst_ghost->destroyList();
+      Pp = Pp->next;
+    }
+
+    // Build inter-patch segments (same as Sync_merged Phase B)
+    MyList<Parallel::gridseg> *dst_buffer = build_buffer_gsl(PatL);
+    for (int node = 0; node < cpusize; node++)
+    {
+      MyList<Parallel::gridseg> *src_owned = build_owned_gsl(PatL, node, 5, Symmetry);
+      MyList<Parallel::gridseg> *tsrc = 0, *tdst = 0;
+      build_gstl(src_owned, dst_buffer, &tsrc, &tdst);
+      if (tsrc)
+      {
+        if (cache.combined_src[node])
+          cache.combined_src[node]->catList(tsrc);
+        else
+          cache.combined_src[node] = tsrc;
+      }
+      if (tdst)
+      {
+        if (cache.combined_dst[node])
+          cache.combined_dst[node]->catList(tdst);
+        else
+          cache.combined_dst[node] = tdst;
+      }
+      if (src_owned) src_owned->destroyList();
+    }
+    if (dst_buffer) dst_buffer->destroyList();
+
+    cache.valid = true;
+  }
+
+  // Use cached lists with buffer-reusing transfer
+  transfer_cached(cache.combined_src, cache.combined_dst, VarList, VarList, Symmetry, cache);
+}
+// Sync_start: pack and post MPI_Isend/Irecv, return immediately
+void Parallel::Sync_start(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetry,
+                          SyncCache &cache, AsyncSyncState &state)
+{
+  // Ensure cache is built
+  if (!cache.valid)
+  {
+    // Build cache (same logic as Sync_cached)
+    int cpusize;
+    MPI_Comm_size(MPI_COMM_WORLD, &cpusize);
+    cache.cpusize = cpusize;
+
+    if (!cache.combined_src)
+    {
+      cache.combined_src = new MyList<Parallel::gridseg> *[cpusize];
+      cache.combined_dst = new MyList<Parallel::gridseg> *[cpusize];
+      cache.send_lengths = new int[cpusize];
+      cache.recv_lengths = new int[cpusize];
+      cache.send_bufs = new double *[cpusize];
+      cache.recv_bufs = new double *[cpusize];
+      cache.send_buf_caps = new int[cpusize];
+      cache.recv_buf_caps = new int[cpusize];
+      for (int i = 0; i < cpusize; i++)
+      {
+        cache.send_bufs[i] = cache.recv_bufs[i] = 0;
+        cache.send_buf_caps[i] = cache.recv_buf_caps[i] = 0;
+      }
+      cache.max_reqs = 2 * cpusize;
+      cache.reqs = new MPI_Request[cache.max_reqs];
+      cache.stats = new MPI_Status[cache.max_reqs];
+    }
+
+    for (int node = 0; node < cpusize; node++)
+    {
+      cache.combined_src[node] = cache.combined_dst[node] = 0;
+      cache.send_lengths[node] = cache.recv_lengths[node] = 0;
+    }
+
+    MyList<Patch> *Pp = PatL;
+    while (Pp)
+    {
+      Patch *Pat = Pp->data;
+      MyList<Parallel::gridseg> *dst_ghost = build_ghost_gsl(Pat);
+      for (int node = 0; node < cpusize; node++)
+      {
+        MyList<Parallel::gridseg> *src_owned = build_owned_gsl0(Pat, node);
+        MyList<Parallel::gridseg> *tsrc = 0, *tdst = 0;
+        build_gstl(src_owned, dst_ghost, &tsrc, &tdst);
+        if (tsrc)
+        {
+          if (cache.combined_src[node])
+            cache.combined_src[node]->catList(tsrc);
+          else
+            cache.combined_src[node] = tsrc;
+        }
+        if (tdst)
+        {
+          if (cache.combined_dst[node])
+            cache.combined_dst[node]->catList(tdst);
+          else
+            cache.combined_dst[node] = tdst;
+        }
+        if (src_owned) src_owned->destroyList();
+      }
+      if (dst_ghost) dst_ghost->destroyList();
+      Pp = Pp->next;
+    }
+
+    MyList<Parallel::gridseg> *dst_buffer = build_buffer_gsl(PatL);
+    for (int node = 0; node < cpusize; node++)
+    {
+      MyList<Parallel::gridseg> *src_owned = build_owned_gsl(PatL, node, 5, Symmetry);
+      MyList<Parallel::gridseg> *tsrc = 0, *tdst = 0;
+      build_gstl(src_owned, dst_buffer, &tsrc, &tdst);
+      if (tsrc)
+      {
+        if (cache.combined_src[node])
+          cache.combined_src[node]->catList(tsrc);
+        else
+          cache.combined_src[node] = tsrc;
+      }
+      if (tdst)
+      {
+        if (cache.combined_dst[node])
+          cache.combined_dst[node]->catList(tdst);
+        else
+          cache.combined_dst[node] = tdst;
+      }
+      if (src_owned) src_owned->destroyList();
+    }
+    if (dst_buffer) dst_buffer->destroyList();
+    cache.valid = true;
+  }
+
+  // Now pack and post async MPI operations
+  int myrank;
+  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
+  int cpusize = cache.cpusize;
+  state.req_no = 0;
+  state.active = true;
+
+  MyList<Parallel::gridseg> **src = cache.combined_src;
+  MyList<Parallel::gridseg> **dst = cache.combined_dst;
+
+  for (int node = 0; node < cpusize; node++)
+  {
+    if (node == myrank)
+    {
+      int length = data_packer(0, src[myrank], dst[myrank], node, PACK, VarList, VarList, Symmetry);
+      cache.recv_lengths[node] = length;
+      if (length > 0)
+      {
+        if (length > cache.recv_buf_caps[node])
+        {
+          if (cache.recv_bufs[node]) delete[] cache.recv_bufs[node];
+          cache.recv_bufs[node] = new double[length];
+          cache.recv_buf_caps[node] = length;
+        }
+        data_packer(cache.recv_bufs[node], src[myrank], dst[myrank], node, PACK, VarList, VarList, Symmetry);
+      }
+    }
+    else
+    {
+      int slength = data_packer(0, src[myrank], dst[myrank], node, PACK, VarList, VarList, Symmetry);
+      cache.send_lengths[node] = slength;
+      if (slength > 0)
+      {
+        if (slength > cache.send_buf_caps[node])
+        {
+          if (cache.send_bufs[node]) delete[] cache.send_bufs[node];
+          cache.send_bufs[node] = new double[slength];
+          cache.send_buf_caps[node] = slength;
+        }
+        data_packer(cache.send_bufs[node], src[myrank], dst[myrank], node, PACK, VarList, VarList, Symmetry);
+        MPI_Isend((void *)cache.send_bufs[node], slength, MPI_DOUBLE, node, 2, MPI_COMM_WORLD, cache.reqs + state.req_no++);
+      }
+      int rlength = data_packer(0, src[node], dst[node], node, UNPACK, VarList, VarList, Symmetry);
+      cache.recv_lengths[node] = rlength;
+      if (rlength > 0)
+      {
+        if (rlength > cache.recv_buf_caps[node])
+        {
+          if (cache.recv_bufs[node]) delete[] cache.recv_bufs[node];
+          cache.recv_bufs[node] = new double[rlength];
+          cache.recv_buf_caps[node] = rlength;
+        }
+        MPI_Irecv((void *)cache.recv_bufs[node], rlength, MPI_DOUBLE, node, 2, MPI_COMM_WORLD, cache.reqs + state.req_no++);
+      }
+    }
+  }
+}
+// Sync_finish: wait for async MPI operations and unpack
+void Parallel::Sync_finish(SyncCache &cache, AsyncSyncState &state,
+                           MyList<var> *VarList, int Symmetry)
+{
+  if (!state.active)
+    return;
+
+  MPI_Waitall(state.req_no, cache.reqs, cache.stats);
+
+  int cpusize = cache.cpusize;
+  MyList<Parallel::gridseg> **src = cache.combined_src;
+  MyList<Parallel::gridseg> **dst = cache.combined_dst;
+
+  for (int node = 0; node < cpusize; node++)
+    if (cache.recv_bufs[node] && cache.recv_lengths[node] > 0)
+      data_packer(cache.recv_bufs[node], src[node], dst[node], node, UNPACK, VarList, VarList, Symmetry);
+
+  state.active = false;
+}
 // collect buffer grid segments or blocks for the periodic boundary condition of given patch
 // ---------------------------------------------------
 // |con |                                       |con |
--- a/AMSS_NCKU_source/Parallel.h
+++ b/AMSS_NCKU_source/Parallel.h
@@ -81,6 +81,42 @@ namespace Parallel
                   int Symmetry);
  void Sync(Patch *Pat, MyList<var> *VarList, int Symmetry);
  void Sync(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetry);
+  void Sync_merged(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetry);
+
+  struct SyncCache {
+    bool valid;
+    int cpusize;
+    MyList<gridseg> **combined_src;
+    MyList<gridseg> **combined_dst;
+    int *send_lengths;
+    int *recv_lengths;
+    double **send_bufs;
+    double **recv_bufs;
+    int *send_buf_caps;
+    int *recv_buf_caps;
+    MPI_Request *reqs;
+    MPI_Status *stats;
+    int max_reqs;
+    SyncCache();
+    void invalidate();
+    void destroy();
+  };
+
+  void Sync_cached(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetry, SyncCache &cache);
+  void transfer_cached(MyList<gridseg> **src, MyList<gridseg> **dst,
+                       MyList<var> *VarList1, MyList<var> *VarList2,
+                       int Symmetry, SyncCache &cache);
+
+  struct AsyncSyncState {
+    int req_no;
+    bool active;
+    AsyncSyncState() : req_no(0), active(false) {}
+  };
+
+  void Sync_start(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetry,
+                  SyncCache &cache, AsyncSyncState &state);
+  void Sync_finish(SyncCache &cache, AsyncSyncState &state,
+                   MyList<var> *VarList, int Symmetry);
  void OutBdLow2Hi(Patch *Patc, Patch *Patf,
                   MyList<var> *VarList1 /* source */, MyList<var> *VarList2 /* target */,
                   int Symmetry);
--- a/AMSS_NCKU_source/ShellPatch.h
+++ b/AMSS_NCKU_source/ShellPatch.h
@@ -2,7 +2,11 @@
 #ifndef SHELLPATCH_H
 #define SHELLPATCH_H

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif
 #include "MyList.h"
 #include "Block.h"
 #include "Parallel.h"
--- a/AMSS_NCKU_source/Z4c_class.h
+++ b/AMSS_NCKU_source/Z4c_class.h
@@ -19,7 +19,11 @@ using namespace std;
 #include <math.h>
 #endif

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 #include "cgh.h"
 #include "ShellPatch.h"
--- a/AMSS_NCKU_source/bssnEM_class.h
+++ b/AMSS_NCKU_source/bssnEM_class.h
@@ -19,7 +19,11 @@ using namespace std;
 #include <math.h>
 #endif

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 #include "cgh.h"
 #include "ShellPatch.h"
--- a/AMSS_NCKU_source/bssnEScalar_class.h
+++ b/AMSS_NCKU_source/bssnEScalar_class.h
@@ -19,7 +19,11 @@ using namespace std;
 #include <math.h>
 #endif

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 #include "cgh.h"
 #include "ShellPatch.h"
--- a/AMSS_NCKU_source/bssn_class.C
+++ b/AMSS_NCKU_source/bssn_class.C
@@ -730,6 +730,12 @@ void bssn_class::Initialize()
    PhysTime = StartTime;
    Setup_Black_Hole_position();
  }
+
+  // Initialize sync caches (per-level, for predictor and corrector)
+  sync_cache_pre = new Parallel::SyncCache[GH->levels];
+  sync_cache_cor = new Parallel::SyncCache[GH->levels];
+  sync_cache_rp_coarse = new Parallel::SyncCache[GH->levels];
+  sync_cache_rp_fine = new Parallel::SyncCache[GH->levels];
 }

 //================================================================================================
@@ -981,6 +987,32 @@ bssn_class::~bssn_class()
  delete Azzz;
 #endif

+  // Destroy sync caches before GH
+  if (sync_cache_pre)
+  {
+    for (int i = 0; i < GH->levels; i++)
+      sync_cache_pre[i].destroy();
+    delete[] sync_cache_pre;
+  }
+  if (sync_cache_cor)
+  {
+    for (int i = 0; i < GH->levels; i++)
+      sync_cache_cor[i].destroy();
+    delete[] sync_cache_cor;
+  }
+  if (sync_cache_rp_coarse)
+  {
+    for (int i = 0; i < GH->levels; i++)
+      sync_cache_rp_coarse[i].destroy();
+    delete[] sync_cache_rp_coarse;
+  }
+  if (sync_cache_rp_fine)
+  {
+    for (int i = 0; i < GH->levels; i++)
+      sync_cache_rp_fine[i].destroy();
+    delete[] sync_cache_rp_fine;
+  }
+
  delete GH;
 #ifdef WithShell
  delete SH;
@@ -2181,6 +2213,7 @@ void bssn_class::Evolve(int Steps)
    GH->Regrid(Symmetry, BH_num, Porgbr, Porg0,
               SynchList_cor, OldStateList, StateList, SynchList_pre,
               fgt(PhysTime - dT_mon, StartTime, dT_mon / 2), ErrorMonitor);
+    for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); }
 #endif

 #if (REGLEV == 0 && (PSTR == 1 || PSTR == 2))
@@ -2396,6 +2429,7 @@ void bssn_class::RecursiveStep(int lev)
  GH->Regrid_Onelevel(lev, Symmetry, BH_num, Porgbr, Porg0,
                      SynchList_cor, OldStateList, StateList, SynchList_pre,
                      fgt(PhysTime - dT_lev, StartTime, dT_lev / 2), ErrorMonitor);
+  for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); }
 #endif
 }

@@ -2574,6 +2608,7 @@ void bssn_class::ParallelStep()
  GH->Regrid_Onelevel(GH->mylev, Symmetry, BH_num, Porgbr, Porg0,
                      SynchList_cor, OldStateList, StateList, SynchList_pre,
                      fgt(PhysTime - dT_lev, StartTime, dT_lev / 2), ErrorMonitor);
+  for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); }
 #endif
 }

@@ -2740,6 +2775,7 @@ void bssn_class::ParallelStep()
        GH->Regrid_Onelevel(lev + 1, Symmetry, BH_num, Porgbr, Porg0,
                            SynchList_cor, OldStateList, StateList, SynchList_pre,
                            fgt(PhysTime - dT_levp1, StartTime, dT_levp1 / 2), ErrorMonitor);
+        for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); }

        //               a_stream.clear();
        //               a_stream.str("");
@@ -2754,6 +2790,7 @@ void bssn_class::ParallelStep()
      GH->Regrid_Onelevel(lev, Symmetry, BH_num, Porgbr, Porg0,
                          SynchList_cor, OldStateList, StateList, SynchList_pre,
                          fgt(PhysTime - dT_lev, StartTime, dT_lev / 2), ErrorMonitor);
+      for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); }

      //               a_stream.clear();
      //               a_stream.str("");
@@ -2772,6 +2809,7 @@ void bssn_class::ParallelStep()
          GH->Regrid_Onelevel(lev - 1, Symmetry, BH_num, Porgbr, Porg0,
                              SynchList_cor, OldStateList, StateList, SynchList_pre,
                              fgt(PhysTime - dT_lev, StartTime, dT_levm1 / 2), ErrorMonitor);
+          for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); }

          //               a_stream.clear();
          //               a_stream.str("");
@@ -2787,6 +2825,7 @@ void bssn_class::ParallelStep()
          GH->Regrid_Onelevel(lev - 1, Symmetry, BH_num, Porgbr, Porg0,
                              SynchList_cor, OldStateList, StateList, SynchList_pre,
                              fgt(PhysTime - dT_lev, StartTime, dT_levm1 / 2), ErrorMonitor);
+          for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); }

          //               a_stream.clear();
          //               a_stream.str("");
@@ -3158,21 +3197,7 @@ void bssn_class::Step(int lev, int YN)
    }
    Pp = Pp->next;
  }
-  // check error information
-  {
-    int erh = ERROR;
-    MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
-  }
-  if (ERROR)
-  {
-    Parallel::Dump_Data(GH->PatL[lev], StateList, 0, PhysTime, dT_lev);
-    if (myrank == 0)
-    {
-      if (ErrorMonitor->outfile)
-        ErrorMonitor->outfile << "find NaN in state variables at t = " << PhysTime << ", lev = " << lev << endl;
-      MPI_Abort(MPI_COMM_WORLD, 1);
-    }
-  }
+  // NOTE: error check deferred to after Shell Patch computation to reduce MPI_Allreduce calls

 #ifdef WithShell
  // evolve Shell Patches
@@ -3316,25 +3341,16 @@ void bssn_class::Step(int lev, int YN)
 #endif
  }

-  // check error information
+  // Non-blocking error reduction overlapped with Sync to hide Allreduce latency
+  MPI_Request err_req;
  {
    int erh = ERROR;
-    MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
-  }
-
-  if (ERROR)
-  {
-    SH->Dump_Data(StateList, 0, PhysTime, dT_lev);
-    if (myrank == 0)
-    {
-      if (ErrorMonitor->outfile)
-        ErrorMonitor->outfile << "find NaN in state variables on Shell Patches at t = " << PhysTime << endl;
-      MPI_Abort(MPI_COMM_WORLD, 1);
-    }
+    MPI_Iallreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &err_req);
  }
 #endif

-  Parallel::Sync(GH->PatL[lev], SynchList_pre, Symmetry);
+  Parallel::AsyncSyncState async_pre;
+  Parallel::Sync_start(GH->PatL[lev], SynchList_pre, Symmetry, sync_cache_pre[lev], async_pre);

 #ifdef WithShell
  if (lev == 0)
@@ -3353,6 +3369,23 @@ void bssn_class::Step(int lev, int YN)
    }
  }
 #endif
+  Parallel::Sync_finish(sync_cache_pre[lev], async_pre, SynchList_pre, Symmetry);
+
+#ifdef WithShell
+  // Complete non-blocking error reduction and check
+  MPI_Wait(&err_req, MPI_STATUS_IGNORE);
+  if (ERROR)
+  {
+    Parallel::Dump_Data(GH->PatL[lev], StateList, 0, PhysTime, dT_lev);
+    SH->Dump_Data(StateList, 0, PhysTime, dT_lev);
+    if (myrank == 0)
+    {
+      if (ErrorMonitor->outfile)
+        ErrorMonitor->outfile << "find NaN in state variables at t = " << PhysTime << ", lev = " << lev << endl;
+      MPI_Abort(MPI_COMM_WORLD, 1);
+    }
+  }
+#endif

 #if (MAPBH == 0)
  // for black hole position
@@ -3528,24 +3561,7 @@ void bssn_class::Step(int lev, int YN)
      Pp = Pp->next;
    }

-    // check error information
-    {
-      int erh = ERROR;
-      MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
-    }
-
-    if (ERROR)
-    {
-      Parallel::Dump_Data(GH->PatL[lev], SynchList_pre, 0, PhysTime, dT_lev);
-      if (myrank == 0)
-      {
-        if (ErrorMonitor->outfile)
-          ErrorMonitor->outfile << "find NaN in RK4 substep#" << iter_count 
-                                << " variables at t = " << PhysTime 
-                                << ", lev = " << lev << endl;
-        MPI_Abort(MPI_COMM_WORLD, 1);
-      }
-    }
+    // NOTE: error check deferred to after Shell Patch computation to reduce MPI_Allreduce calls

 #ifdef WithShell
    // evolve Shell Patches
@@ -3685,26 +3701,16 @@ void bssn_class::Step(int lev, int YN)
        sPp = sPp->next;
      }
    }
-    // check error information
+    // Non-blocking error reduction overlapped with Sync to hide Allreduce latency
+    MPI_Request err_req_cor;
    {
      int erh = ERROR;
-      MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
-    }
-    if (ERROR)
-    {
-      SH->Dump_Data(SynchList_pre, 0, PhysTime, dT_lev);
-      if (myrank == 0)
-      {
-        if (ErrorMonitor->outfile)
-          ErrorMonitor->outfile << "find NaN on Shell Patches in RK4 substep#" 
-                                << iter_count << " variables at t = " 
-                                << PhysTime << endl;
-        MPI_Abort(MPI_COMM_WORLD, 1);
-      }
+      MPI_Iallreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &err_req_cor);
    }
 #endif

-    Parallel::Sync(GH->PatL[lev], SynchList_cor, Symmetry);
+    Parallel::AsyncSyncState async_cor;
+    Parallel::Sync_start(GH->PatL[lev], SynchList_cor, Symmetry, sync_cache_cor[lev], async_cor);

 #ifdef WithShell
    if (lev == 0)
@@ -3723,6 +3729,25 @@ void bssn_class::Step(int lev, int YN)
      }
    }
 #endif
+    Parallel::Sync_finish(sync_cache_cor[lev], async_cor, SynchList_cor, Symmetry);
+
+#ifdef WithShell
+    // Complete non-blocking error reduction and check
+    MPI_Wait(&err_req_cor, MPI_STATUS_IGNORE);
+    if (ERROR)
+    {
+      Parallel::Dump_Data(GH->PatL[lev], SynchList_pre, 0, PhysTime, dT_lev);
+      SH->Dump_Data(SynchList_pre, 0, PhysTime, dT_lev);
+      if (myrank == 0)
+      {
+        if (ErrorMonitor->outfile)
+          ErrorMonitor->outfile << "find NaN in RK4 substep#" << iter_count
+                                << " variables at t = " << PhysTime
+                                << ", lev = " << lev << endl;
+        MPI_Abort(MPI_COMM_WORLD, 1);
+      }
+    }
+#endif

 #if (MAPBH == 0)
    // for black hole position
@@ -4034,22 +4059,7 @@ void bssn_class::Step(int lev, int YN)
    }
    Pp = Pp->next;
  }
-  // check error information
-  {
-    int erh = ERROR;
-    MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
-  }
-  if (ERROR)
-  {
-    Parallel::Dump_Data(GH->PatL[lev], StateList, 0, PhysTime, dT_lev);
-    if (myrank == 0)
-    {
-      if (ErrorMonitor->outfile)
-        ErrorMonitor->outfile << "find NaN in state variables at t = " << PhysTime 
-                              << ", lev = " << lev << endl;
-      MPI_Abort(MPI_COMM_WORLD, 1);
-    }
-  }
+  // NOTE: error check deferred to after Shell Patch computation to reduce MPI_Allreduce calls

 #ifdef WithShell
  // evolve Shell Patches
@@ -4190,25 +4200,16 @@ void bssn_class::Step(int lev, int YN)
  }
 #endif
  }
-  // check error information
+  // Non-blocking error reduction overlapped with Sync to hide Allreduce latency
+  MPI_Request err_req;
  {
    int erh = ERROR;
-    MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
-  }
-  if (ERROR)
-  {
-    SH->Dump_Data(StateList, 0, PhysTime, dT_lev);
-    if (myrank == 0)
-    {
-      if (ErrorMonitor->outfile)
-        ErrorMonitor->outfile << "find NaN in state variables on Shell Patches at t = " 
-                              << PhysTime << endl;
-      MPI_Abort(MPI_COMM_WORLD, 1);
-    }
+    MPI_Iallreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &err_req);
  }
 #endif

-  Parallel::Sync(GH->PatL[lev], SynchList_pre, Symmetry);
+  Parallel::AsyncSyncState async_pre;
+  Parallel::Sync_start(GH->PatL[lev], SynchList_pre, Symmetry, sync_cache_pre[lev], async_pre);

 #ifdef WithShell
  if (lev == 0)
@@ -4227,6 +4228,24 @@ void bssn_class::Step(int lev, int YN)
    }
  }
 #endif
+  Parallel::Sync_finish(sync_cache_pre[lev], async_pre, SynchList_pre, Symmetry);
+
+#ifdef WithShell
+  // Complete non-blocking error reduction and check
+  MPI_Wait(&err_req, MPI_STATUS_IGNORE);
+  if (ERROR)
+  {
+    Parallel::Dump_Data(GH->PatL[lev], StateList, 0, PhysTime, dT_lev);
+    SH->Dump_Data(StateList, 0, PhysTime, dT_lev);
+    if (myrank == 0)
+    {
+      if (ErrorMonitor->outfile)
+        ErrorMonitor->outfile << "find NaN in state variables at t = " << PhysTime
+                              << ", lev = " << lev << endl;
+      MPI_Abort(MPI_COMM_WORLD, 1);
+    }
+  }
+#endif

  // for black hole position
  if (BH_num > 0 && lev == GH->levels - 1)
@@ -4386,23 +4405,7 @@ void bssn_class::Step(int lev, int YN)
      Pp = Pp->next;
    }

-    // check error information
-    {
-      int erh = ERROR;
-      MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
-    }
-    if (ERROR)
-    {
-      Parallel::Dump_Data(GH->PatL[lev], SynchList_pre, 0, PhysTime, dT_lev);
-      if (myrank == 0)
-      {
-        if (ErrorMonitor->outfile)
-          ErrorMonitor->outfile << "find NaN in RK4 substep#" << iter_count 
-                                << " variables at t = " << PhysTime 
-                                << ", lev = " << lev << endl;
-        MPI_Abort(MPI_COMM_WORLD, 1);
-      }
-    }
+    // NOTE: error check deferred to after Shell Patch computation to reduce MPI_Allreduce calls

 #ifdef WithShell
    // evolve Shell Patches
@@ -4542,25 +4545,16 @@ void bssn_class::Step(int lev, int YN)
        sPp = sPp->next;
      }
    }
-    // check error information
+    // Non-blocking error reduction overlapped with Sync to hide Allreduce latency
+    MPI_Request err_req_cor;
    {
      int erh = ERROR;
-      MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
-    }
-    if (ERROR)
-    {
-      SH->Dump_Data(SynchList_pre, 0, PhysTime, dT_lev);
-      if (myrank == 0)
-      {
-        if (ErrorMonitor->outfile)
-          ErrorMonitor->outfile << "find NaN on Shell Patches in RK4 substep#" << iter_count 
-                                << " variables at t = " << PhysTime << endl;
-        MPI_Abort(MPI_COMM_WORLD, 1);
-      }
+      MPI_Iallreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &err_req_cor);
    }
 #endif

-    Parallel::Sync(GH->PatL[lev], SynchList_cor, Symmetry);
+    Parallel::AsyncSyncState async_cor;
+    Parallel::Sync_start(GH->PatL[lev], SynchList_cor, Symmetry, sync_cache_cor[lev], async_cor);

 #ifdef WithShell
    if (lev == 0)
@@ -4578,6 +4572,25 @@ void bssn_class::Step(int lev, int YN)
             << " seconds! " << endl;
      }
    }
+#endif
+    Parallel::Sync_finish(sync_cache_cor[lev], async_cor, SynchList_cor, Symmetry);
+
+#ifdef WithShell
+    // Complete non-blocking error reduction and check
+    MPI_Wait(&err_req_cor, MPI_STATUS_IGNORE);
+    if (ERROR)
+    {
+      Parallel::Dump_Data(GH->PatL[lev], SynchList_pre, 0, PhysTime, dT_lev);
+      SH->Dump_Data(SynchList_pre, 0, PhysTime, dT_lev);
+      if (myrank == 0)
+      {
+        if (ErrorMonitor->outfile)
+          ErrorMonitor->outfile << "find NaN in RK4 substep#" << iter_count
+                                << " variables at t = " << PhysTime
+                                << ", lev = " << lev << endl;
+        MPI_Abort(MPI_COMM_WORLD, 1);
+      }
+    }
 #endif
    // for black hole position
    if (BH_num > 0 && lev == GH->levels - 1)
@@ -4943,11 +4956,19 @@ void bssn_class::Step(int lev, int YN)

  //   misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"after Predictor rhs calculation");

-  // check error information
+  // Non-blocking error reduction overlapped with Sync to hide Allreduce latency
+  MPI_Request err_req;
  {
    int erh = ERROR;
-    MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, GH->Commlev[lev]);
+    MPI_Iallreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, GH->Commlev[lev], &err_req);
  }
+
+  //   misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"before Predictor sync");
+
+  Parallel::Sync_cached(GH->PatL[lev], SynchList_pre, Symmetry, sync_cache_pre[lev]);
+
+  // Complete non-blocking error reduction and check
+  MPI_Wait(&err_req, MPI_STATUS_IGNORE);
  if (ERROR)
  {
    Parallel::Dump_Data(GH->PatL[lev], StateList, 0, PhysTime, dT_lev);
@@ -4959,10 +4980,6 @@ void bssn_class::Step(int lev, int YN)
    }
  }

-  //   misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"before Predictor sync");
-
-  Parallel::Sync(GH->PatL[lev], SynchList_pre, Symmetry);
-
 #if (MAPBH == 0)
  // for black hole position
  if (BH_num > 0 && lev == GH->levels - 1)
@@ -5140,11 +5157,21 @@ void bssn_class::Step(int lev, int YN)

    //   misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"before Corrector error check");

-    // check error information
+    // Non-blocking error reduction overlapped with Sync to hide Allreduce latency
+    MPI_Request err_req_cor;
    {
      int erh = ERROR;
-      MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, GH->Commlev[lev]);
+      MPI_Iallreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, GH->Commlev[lev], &err_req_cor);
    }
+
+    //    misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"before Corrector sync");
+
+    Parallel::Sync_cached(GH->PatL[lev], SynchList_cor, Symmetry, sync_cache_cor[lev]);
+
+    //    misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"after Corrector sync");
+
+    // Complete non-blocking error reduction and check
+    MPI_Wait(&err_req_cor, MPI_STATUS_IGNORE);
    if (ERROR)
    {
      Parallel::Dump_Data(GH->PatL[lev], SynchList_pre, 0, PhysTime, dT_lev);
@@ -5158,12 +5185,6 @@ void bssn_class::Step(int lev, int YN)
      }
    }

-    //    misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"before Corrector sync");
-
-    Parallel::Sync(GH->PatL[lev], SynchList_cor, Symmetry);
-
-    //    misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"after Corrector sync");
-
 #if (MAPBH == 0)
    // for black hole position
    if (BH_num > 0 && lev == GH->levels - 1)
@@ -5447,21 +5468,11 @@ void bssn_class::SHStep()
 #if (PSTR == 1 || PSTR == 2)
 //   misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"before Predictor's error check");
 #endif
-  // check error information
+  // Non-blocking error reduction overlapped with Synch to hide Allreduce latency
+  MPI_Request err_req;
  {
    int erh = ERROR;
-    MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
-  }
-
-  if (ERROR)
-  {
-    SH->Dump_Data(StateList, 0, PhysTime, dT_lev);
-    if (myrank == 0)
-    {
-      if (ErrorMonitor->outfile)
-        ErrorMonitor->outfile << "find NaN in state variables on Shell Patches at t = " << PhysTime << endl;
-      MPI_Abort(MPI_COMM_WORLD, 1);
-    }
+    MPI_Iallreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &err_req);
  }

  {
@@ -5479,6 +5490,19 @@ void bssn_class::SHStep()
    }
  }

+  // Complete non-blocking error reduction and check
+  MPI_Wait(&err_req, MPI_STATUS_IGNORE);
+  if (ERROR)
+  {
+    SH->Dump_Data(StateList, 0, PhysTime, dT_lev);
+    if (myrank == 0)
+    {
+      if (ErrorMonitor->outfile)
+        ErrorMonitor->outfile << "find NaN in state variables on Shell Patches at t = " << PhysTime << endl;
+      MPI_Abort(MPI_COMM_WORLD, 1);
+    }
+  }
+
  // corrector
  for (iter_count = 1; iter_count < 4; iter_count++)
  {
@@ -5621,21 +5645,11 @@ void bssn_class::SHStep()
        sPp = sPp->next;
      }
    }
-    // check error information
+    // Non-blocking error reduction overlapped with Synch to hide Allreduce latency
+    MPI_Request err_req_cor;
    {
      int erh = ERROR;
-      MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
-    }
-    if (ERROR)
-    {
-      SH->Dump_Data(SynchList_pre, 0, PhysTime, dT_lev);
-      if (myrank == 0)
-      {
-        if (ErrorMonitor->outfile)
-          ErrorMonitor->outfile << "find NaN on Shell Patches in RK4 substep#" << iter_count 
-                                << " variables at t = " << PhysTime << endl;
-        MPI_Abort(MPI_COMM_WORLD, 1);
-      }
+      MPI_Iallreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &err_req_cor);
    }

    {
@@ -5653,6 +5667,20 @@ void bssn_class::SHStep()
      }
    }

+    // Complete non-blocking error reduction and check
+    MPI_Wait(&err_req_cor, MPI_STATUS_IGNORE);
+    if (ERROR)
+    {
+      SH->Dump_Data(SynchList_pre, 0, PhysTime, dT_lev);
+      if (myrank == 0)
+      {
+        if (ErrorMonitor->outfile)
+          ErrorMonitor->outfile << "find NaN on Shell Patches in RK4 substep#" << iter_count
+                                << " variables at t = " << PhysTime << endl;
+        MPI_Abort(MPI_COMM_WORLD, 1);
+      }
+    }
+
    sPp = SH->PatL;
    while (sPp)
    {
@@ -5781,7 +5809,7 @@ void bssn_class::RestrictProlong(int lev, int YN, bool BB,
 //       misc::tillherecheck(GH->Commlev[GH->mylev],GH->start_rank[GH->mylev],a_stream.str());
 #endif

-      Parallel::Sync(GH->PatL[lev - 1], SynchList_pre, Symmetry);
+      Parallel::Sync_cached(GH->PatL[lev - 1], SynchList_pre, Symmetry, sync_cache_rp_coarse[lev]);

 #if (PSTR == 1 || PSTR == 2)
 //       a_stream.clear();
@@ -5842,7 +5870,7 @@ void bssn_class::RestrictProlong(int lev, int YN, bool BB,
 //       misc::tillherecheck(GH->Commlev[GH->mylev],GH->start_rank[GH->mylev],a_stream.str());
 #endif

-      Parallel::Sync(GH->PatL[lev - 1], SL, Symmetry);
+      Parallel::Sync_cached(GH->PatL[lev - 1], SL, Symmetry, sync_cache_rp_coarse[lev]);

 #if (PSTR == 1 || PSTR == 2)
 //       a_stream.clear();
@@ -5880,7 +5908,7 @@ void bssn_class::RestrictProlong(int lev, int YN, bool BB,
 #endif
    }

-    Parallel::Sync(GH->PatL[lev], SL, Symmetry);
+    Parallel::Sync_cached(GH->PatL[lev], SL, Symmetry, sync_cache_rp_fine[lev]);

 #if (PSTR == 1 || PSTR == 2)
 //    a_stream.clear();
@@ -5938,7 +5966,7 @@ void bssn_class::RestrictProlong_aux(int lev, int YN, bool BB,
      Parallel::Restrict_bam(GH->PatL[lev - 1], GH->PatL[lev], SL, SynchList_pre, GH->rsul[lev], Symmetry);
 #endif

-      Parallel::Sync(GH->PatL[lev - 1], SynchList_pre, Symmetry);
+      Parallel::Sync_cached(GH->PatL[lev - 1], SynchList_pre, Symmetry, sync_cache_rp_coarse[lev]);

 #if (RPB == 0)
      Ppc = GH->PatL[lev - 1];
@@ -5970,7 +5998,7 @@ void bssn_class::RestrictProlong_aux(int lev, int YN, bool BB,
      Parallel::Restrict_bam(GH->PatL[lev - 1], GH->PatL[lev], SL, SL, GH->rsul[lev], Symmetry);
 #endif

-      Parallel::Sync(GH->PatL[lev - 1], SL, Symmetry);
+      Parallel::Sync_cached(GH->PatL[lev - 1], SL, Symmetry, sync_cache_rp_coarse[lev]);

 #if (RPB == 0)
      Ppc = GH->PatL[lev - 1];
@@ -5994,7 +6022,7 @@ void bssn_class::RestrictProlong_aux(int lev, int YN, bool BB,
 #endif
    }

-    Parallel::Sync(GH->PatL[lev], SL, Symmetry);
+    Parallel::Sync_cached(GH->PatL[lev], SL, Symmetry, sync_cache_rp_fine[lev]);
  }
 }

@@ -6045,7 +6073,7 @@ void bssn_class::RestrictProlong(int lev, int YN, bool BB)
      Parallel::Restrict_bam(GH->PatL[lev - 1], GH->PatL[lev], SynchList_cor, SynchList_pre, GH->rsul[lev], Symmetry);
 #endif

-      Parallel::Sync(GH->PatL[lev - 1], SynchList_pre, Symmetry);
+      Parallel::Sync_cached(GH->PatL[lev - 1], SynchList_pre, Symmetry, sync_cache_rp_coarse[lev]);

 #if (RPB == 0)
      Ppc = GH->PatL[lev - 1];
@@ -6079,7 +6107,7 @@ void bssn_class::RestrictProlong(int lev, int YN, bool BB)
      Parallel::Restrict_bam(GH->PatL[lev - 1], GH->PatL[lev], SynchList_cor, StateList, GH->rsul[lev], Symmetry);
 #endif

-      Parallel::Sync(GH->PatL[lev - 1], StateList, Symmetry);
+      Parallel::Sync_cached(GH->PatL[lev - 1], StateList, Symmetry, sync_cache_rp_coarse[lev]);

 #if (RPB == 0)
      Ppc = GH->PatL[lev - 1];
@@ -6103,7 +6131,7 @@ void bssn_class::RestrictProlong(int lev, int YN, bool BB)
 #endif
    }

-    Parallel::Sync(GH->PatL[lev], SynchList_cor, Symmetry);
+    Parallel::Sync_cached(GH->PatL[lev], SynchList_cor, Symmetry, sync_cache_rp_fine[lev]);
  }
 }

@@ -6186,10 +6214,10 @@ void bssn_class::ProlongRestrict(int lev, int YN, bool BB)
 #else
      Parallel::Restrict_after(GH->PatL[lev - 1], GH->PatL[lev], SynchList_cor, StateList, Symmetry);
 #endif
-      Parallel::Sync(GH->PatL[lev - 1], StateList, Symmetry);
+      Parallel::Sync_cached(GH->PatL[lev - 1], StateList, Symmetry, sync_cache_rp_coarse[lev]);
    }

-    Parallel::Sync(GH->PatL[lev], SynchList_cor, Symmetry);
+    Parallel::Sync_cached(GH->PatL[lev], SynchList_cor, Symmetry, sync_cache_rp_fine[lev]);
  }
 }
 #undef MIXOUTB
--- a/AMSS_NCKU_source/bssn_class.h
+++ b/AMSS_NCKU_source/bssn_class.h
@@ -19,7 +19,11 @@ using namespace std;
 #include <math.h>
 #endif

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 #include "macrodef.h"
 #include "cgh.h"
@@ -126,6 +130,11 @@ public:
       MyList<var> *OldStateList, *DumpList;
       MyList<var> *ConstraintList;

+       Parallel::SyncCache *sync_cache_pre;  // per-level cache for predictor sync
+       Parallel::SyncCache *sync_cache_cor;  // per-level cache for corrector sync
+       Parallel::SyncCache *sync_cache_rp_coarse;  // RestrictProlong sync on PatL[lev-1]
+       Parallel::SyncCache *sync_cache_rp_fine;    // RestrictProlong sync on PatL[lev]
+
       monitor *ErrorMonitor, *Psi4Monitor, *BHMonitor, *MAPMonitor;
       monitor *ConVMonitor;
       surface_integral *Waveshell;
--- a/AMSS_NCKU_source/bssn_gpu_class.h
+++ b/AMSS_NCKU_source/bssn_gpu_class.h
@@ -19,7 +19,11 @@ using namespace std;
 #include <math.h>
 #endif

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 #include "macrodef.h"
 #include "cgh.h"
--- a/AMSS_NCKU_source/bssn_rhs.f90
+++ b/AMSS_NCKU_source/bssn_rhs.f90
@@ -161,8 +161,36 @@

  chi_rhs = F2o3 *chin1*( alpn1 * trK - div_beta ) !rhs for chi

+  call fderivs(ex,dxx,gxxx,gxxy,gxxz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,Lev)
+  call fderivs(ex,gxy,gxyx,gxyy,gxyz,X,Y,Z,ANTI,ANTI,SYM ,Symmetry,Lev)
+  call fderivs(ex,gxz,gxzx,gxzy,gxzz,X,Y,Z,ANTI,SYM ,ANTI,Symmetry,Lev)
+  call fderivs(ex,dyy,gyyx,gyyy,gyyz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,Lev)
+  call fderivs(ex,gyz,gyzx,gyzy,gyzz,X,Y,Z,SYM ,ANTI,ANTI,Symmetry,Lev)
+  call fderivs(ex,dzz,gzzx,gzzy,gzzz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,Lev)

+  gxx_rhs = - TWO * alpn1 * Axx    -  F2o3 * gxx * div_beta          + &
+              TWO *(  gxx * betaxx +   gxy * betayx +   gxz * betazx)

+  gyy_rhs = - TWO * alpn1 * Ayy    -  F2o3 * gyy * div_beta          + &
+              TWO *(  gxy * betaxy +   gyy * betayy +   gyz * betazy)
+
+  gzz_rhs = - TWO * alpn1 * Azz    -  F2o3 * gzz * div_beta          + &
+              TWO *(  gxz * betaxz +   gyz * betayz +   gzz * betazz)
+
+  gxy_rhs = - TWO * alpn1 * Axy    +  F1o3 * gxy    * div_beta       + &
+                      gxx * betaxy                  +   gxz * betazy + &
+                                       gyy * betayx +   gyz * betazx   &
+                                                    -   gxy * betazz
+
+  gyz_rhs = - TWO * alpn1 * Ayz    +  F1o3 * gyz    * div_beta       + &
+                      gxy * betaxz +   gyy * betayz                  + &
+                      gxz * betaxy                  +   gzz * betazy   &
+                                                    -   gyz * betaxx
+ 
+  gxz_rhs = - TWO * alpn1 * Axz    +  F1o3 * gxz    * div_beta       + &
+                      gxx * betaxz +   gxy * betayz                  + &
+                                       gyz * betayx +   gzz * betazx   &
+                                                    -   gxz * betayy     !rhs for gij

 ! invert tilted metric
  gupzz =  gxx * gyy * gzz + gxy * gyz * gxz + gxz * gxy * gyz - &
@@ -173,12 +201,7 @@
  gupyy =   ( gxx * gzz - gxz * gxz ) / gupzz
  gupyz = - ( gxx * gyz - gxy * gxz ) / gupzz
  gupzz =   ( gxx * gyy - gxy * gxy ) / gupzz
-  call fderivs(ex,dxx,gxxx,gxxy,gxxz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,Lev)
-  call fderivs(ex,gxy,gxyx,gxyy,gxyz,X,Y,Z,ANTI,ANTI,SYM ,Symmetry,Lev)
-  call fderivs(ex,gxz,gxzx,gxzy,gxzz,X,Y,Z,ANTI,SYM ,ANTI,Symmetry,Lev)
-  call fderivs(ex,dyy,gyyx,gyyy,gyyz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,Lev)
-  call fderivs(ex,gyz,gyzx,gyzy,gyzz,X,Y,Z,SYM ,ANTI,ANTI,Symmetry,Lev)
-  call fderivs(ex,dzz,gzzx,gzzy,gzzz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,Lev)
+
  if(co == 0)then
 ! Gam^i_Res = Gam^i + gup^ij_,j
  Gmx_Res = Gamx - (gupxx*(gupxx*gxxx+gupxy*gxyx+gupxz*gxzx)&
@@ -924,99 +947,99 @@

 !!!!!!!!!advection term part

-  gxx_rhs = - TWO * alpn1 * Axx    -  F2o3 * gxx * div_beta          + &
-              TWO *(  gxx * betaxx +   gxy * betayx +   gxz * betazx)
-
-  gyy_rhs = - TWO * alpn1 * Ayy    -  F2o3 * gyy * div_beta          + &
-              TWO *(  gxy * betaxy +   gyy * betayy +   gyz * betazy)
-
-  gzz_rhs = - TWO * alpn1 * Azz    -  F2o3 * gzz * div_beta          + &
-              TWO *(  gxz * betaxz +   gyz * betayz +   gzz * betazz)
-
-  gxy_rhs = - TWO * alpn1 * Axy    +  F1o3 * gxy    * div_beta       + &
-                      gxx * betaxy                  +   gxz * betazy + &
-                                        gyy * betayx +   gyz * betazx   &
-                                                    -   gxy * betazz
-
-  gyz_rhs = - TWO * alpn1 * Ayz    +  F1o3 * gyz    * div_beta       + &
-                      gxy * betaxz +   gyy * betayz                  + &
-                      gxz * betaxy                  +   gzz * betazy   &
-                                                    -   gyz * betaxx
-
-  gxz_rhs = - TWO * alpn1 * Axz    +  F1o3 * gxz    * div_beta       + &
-                      gxx * betaxz +   gxy * betayz                  + &
-                                        gyz * betayx +   gzz * betazx   &
-                                                    -   gxz * betayy     !rhs for gij
-
-
-
-
-
-  if(eps>0)then 
-! usual Kreiss-Oliger dissipation     
-  call merge_lopsided_kodis(ex,X,Y,Z,chi,chi_rhs,betax,betay,betaz,Symmetry,SSS,eps) 
-  call merge_lopsided_kodis(ex,X,Y,Z,gxx,gxx_rhs,betax,betay,betaz,Symmetry,SSS,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,gxy,gxy_rhs,betax,betay,betaz,Symmetry,AAS,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,gxz,gxz_rhs,betax,betay,betaz,Symmetry,ASA,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,gyy,gyy_rhs,betax,betay,betaz,Symmetry,SSS,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,gyz,gyz_rhs,betax,betay,betaz,Symmetry,SAA,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,gzz,gzz_rhs,betax,betay,betaz,Symmetry,SSS,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,Axx,Axx_rhs,betax,betay,betaz,Symmetry,SSS,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,Axy,Axy_rhs,betax,betay,betaz,Symmetry,AAS,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,Axz,Axz_rhs,betax,betay,betaz,Symmetry,ASA,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,Ayy,Ayy_rhs,betax,betay,betaz,Symmetry,SSS,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,Ayz,Ayz_rhs,betax,betay,betaz,Symmetry,SAA,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,Azz,Azz_rhs,betax,betay,betaz,Symmetry,SSS,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,chi,chi_rhs,betax,betay,betaz,Symmetry,SSS,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,trK,trK_rhs,betax,betay,betaz,Symmetry,SSS,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,Gamx,Gamx_rhs,betax,betay,betaz,Symmetry,ASS,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,Gamy,Gamy_rhs,betax,betay,betaz,Symmetry,SAS,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,Gamz,Gamz_rhs,betax,betay,betaz,Symmetry,SSA,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,Lap,Lap_rhs,betax,betay,betaz,Symmetry,SSS,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,betax,betax_rhs,betax,betay,betaz,Symmetry,ASS,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,betay,betay_rhs,betax,betay,betaz,Symmetry,SAS,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,betaz,betaz_rhs,betax,betay,betaz,Symmetry,SSA,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,dtSfx,dtSfx_rhs,betax,betay,betaz,Symmetry,ASS,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,dtSfy,dtSfy_rhs,betax,betay,betaz,Symmetry,SAS,eps)
-  call merge_lopsided_kodis(ex,X,Y,Z,dtSfz,dtSfz_rhs,betax,betay,betaz,Symmetry,SSA,eps)
-
-
-
-
-
-
-
-
-
-
-
-
-  else 
  call lopsided(ex,X,Y,Z,gxx,gxx_rhs,betax,betay,betaz,Symmetry,SSS)
  call lopsided(ex,X,Y,Z,gxy,gxy_rhs,betax,betay,betaz,Symmetry,AAS)
  call lopsided(ex,X,Y,Z,gxz,gxz_rhs,betax,betay,betaz,Symmetry,ASA)
  call lopsided(ex,X,Y,Z,gyy,gyy_rhs,betax,betay,betaz,Symmetry,SSS)
  call lopsided(ex,X,Y,Z,gyz,gyz_rhs,betax,betay,betaz,Symmetry,SAA)
  call lopsided(ex,X,Y,Z,gzz,gzz_rhs,betax,betay,betaz,Symmetry,SSS)
+
  call lopsided(ex,X,Y,Z,Axx,Axx_rhs,betax,betay,betaz,Symmetry,SSS)
  call lopsided(ex,X,Y,Z,Axy,Axy_rhs,betax,betay,betaz,Symmetry,AAS)
  call lopsided(ex,X,Y,Z,Axz,Axz_rhs,betax,betay,betaz,Symmetry,ASA)
  call lopsided(ex,X,Y,Z,Ayy,Ayy_rhs,betax,betay,betaz,Symmetry,SSS)
  call lopsided(ex,X,Y,Z,Ayz,Ayz_rhs,betax,betay,betaz,Symmetry,SAA)
  call lopsided(ex,X,Y,Z,Azz,Azz_rhs,betax,betay,betaz,Symmetry,SSS)
+
  call lopsided(ex,X,Y,Z,chi,chi_rhs,betax,betay,betaz,Symmetry,SSS)
  call lopsided(ex,X,Y,Z,trK,trK_rhs,betax,betay,betaz,Symmetry,SSS)
+
  call lopsided(ex,X,Y,Z,Gamx,Gamx_rhs,betax,betay,betaz,Symmetry,ASS)
  call lopsided(ex,X,Y,Z,Gamy,Gamy_rhs,betax,betay,betaz,Symmetry,SAS)
  call lopsided(ex,X,Y,Z,Gamz,Gamz_rhs,betax,betay,betaz,Symmetry,SSA)
+!!
  call lopsided(ex,X,Y,Z,Lap,Lap_rhs,betax,betay,betaz,Symmetry,SSS)
+
+#if (GAUGE == 0 || GAUGE == 1 || GAUGE == 2 || GAUGE == 3 || GAUGE == 4 || GAUGE == 5 || GAUGE == 6 || GAUGE == 7)
  call lopsided(ex,X,Y,Z,betax,betax_rhs,betax,betay,betaz,Symmetry,ASS)
  call lopsided(ex,X,Y,Z,betay,betay_rhs,betax,betay,betaz,Symmetry,SAS)
  call lopsided(ex,X,Y,Z,betaz,betaz_rhs,betax,betay,betaz,Symmetry,SSA)
+#endif
+
+#if (GAUGE == 0 || GAUGE == 2 || GAUGE == 3 || GAUGE == 6 || GAUGE == 7)
  call lopsided(ex,X,Y,Z,dtSfx,dtSfx_rhs,betax,betay,betaz,Symmetry,ASS)
  call lopsided(ex,X,Y,Z,dtSfy,dtSfy_rhs,betax,betay,betaz,Symmetry,SAS)
  call lopsided(ex,X,Y,Z,dtSfz,dtSfz_rhs,betax,betay,betaz,Symmetry,SSA)
+#endif

+  if(eps>0)then 
+! usual Kreiss-Oliger dissipation      
+  call kodis(ex,X,Y,Z,chi,chi_rhs,SSS,Symmetry,eps)
+  call kodis(ex,X,Y,Z,trK,trK_rhs,SSS,Symmetry,eps)
+  call kodis(ex,X,Y,Z,dxx,gxx_rhs,SSS,Symmetry,eps)
+  call kodis(ex,X,Y,Z,gxy,gxy_rhs,AAS,Symmetry,eps)
+  call kodis(ex,X,Y,Z,gxz,gxz_rhs,ASA,Symmetry,eps)
+  call kodis(ex,X,Y,Z,dyy,gyy_rhs,SSS,Symmetry,eps)
+  call kodis(ex,X,Y,Z,gyz,gyz_rhs,SAA,Symmetry,eps)
+  call kodis(ex,X,Y,Z,dzz,gzz_rhs,SSS,Symmetry,eps)
+#if 0
+#define i 42
+#define j 40
+#define k 40
+if(Lev == 1)then
+write(*,*) X(i),Y(j),Z(k)
+write(*,*) "before",Axx_rhs(i,j,k)
+endif
+#undef i
+#undef j
+#undef k
+!!stop
+#endif
+  call kodis(ex,X,Y,Z,Axx,Axx_rhs,SSS,Symmetry,eps)
+#if 0
+#define i 42
+#define j 40
+#define k 40
+if(Lev == 1)then
+write(*,*) X(i),Y(j),Z(k)
+write(*,*) "after",Axx_rhs(i,j,k)
+endif
+#undef i
+#undef j
+#undef k
+!!stop
+#endif
+  call kodis(ex,X,Y,Z,Axy,Axy_rhs,AAS,Symmetry,eps)
+  call kodis(ex,X,Y,Z,Axz,Axz_rhs,ASA,Symmetry,eps)
+  call kodis(ex,X,Y,Z,Ayy,Ayy_rhs,SSS,Symmetry,eps)
+  call kodis(ex,X,Y,Z,Ayz,Ayz_rhs,SAA,Symmetry,eps)
+  call kodis(ex,X,Y,Z,Azz,Azz_rhs,SSS,Symmetry,eps)
+  call kodis(ex,X,Y,Z,Gamx,Gamx_rhs,ASS,Symmetry,eps)
+  call kodis(ex,X,Y,Z,Gamy,Gamy_rhs,SAS,Symmetry,eps)
+  call kodis(ex,X,Y,Z,Gamz,Gamz_rhs,SSA,Symmetry,eps)
+
+#if 1 
+!! bam does not apply dissipation on gauge variables
+  call kodis(ex,X,Y,Z,Lap,Lap_rhs,SSS,Symmetry,eps)
+  call kodis(ex,X,Y,Z,betax,betax_rhs,ASS,Symmetry,eps)
+  call kodis(ex,X,Y,Z,betay,betay_rhs,SAS,Symmetry,eps)
+  call kodis(ex,X,Y,Z,betaz,betaz_rhs,SSA,Symmetry,eps)
+#if (GAUGE == 0 || GAUGE == 2 || GAUGE == 3 || GAUGE == 6 || GAUGE == 7)
+  call kodis(ex,X,Y,Z,dtSfx,dtSfx_rhs,ASS,Symmetry,eps)
+  call kodis(ex,X,Y,Z,dtSfy,dtSfy_rhs,SAS,Symmetry,eps)
+  call kodis(ex,X,Y,Z,dtSfz,dtSfz_rhs,SSA,Symmetry,eps)
+#endif
+#endif

  endif

@@ -1163,265 +1186,3 @@ endif
  return

  end function compute_rhs_bssn
-
-
-
-
-  subroutine merge_lopsided_kodis(ex,X,Y,Z,f,f_rhs,Sfx,Sfy,Sfz,Symmetry,SoA,eps)
-    implicit none
-
-  !~~~~~~> Input parameters:
-
-    integer, intent(in)  :: ex(1:3),Symmetry
-    real*8,  intent(in)  :: X(1:ex(1)),Y(1:ex(2)),Z(1:ex(3))
-    real*8,dimension(ex(1),ex(2),ex(3)),intent(in)   :: f,Sfx,Sfy,Sfz
-
-    real*8,dimension(ex(1),ex(2),ex(3)),intent(inout):: f_rhs
-    real*8,dimension(3),intent(in) ::SoA
-
-  !~~~~~~> local variables:
-  ! note index -2,-1,0, so we have 3 extra points
-    real*8,dimension(-2:ex(1),-2:ex(2),-2:ex(3))   :: fh
-    integer :: imin_lopsided,jmin_lopsided,kmin_lopsided,imin_kodis,jmin_kodis,kmin_kodis,imax,jmax,kmax,i,j,k
-    real*8 :: dX,dY,dZ
-    real*8 :: d12dx,d12dy,d12dz,d2dx,d2dy,d2dz
-    real*8,  parameter :: ZEO=0.d0,ONE=1.d0, F3=3.d0
-    real*8,  parameter :: TWO=2.d0,F6=6.0d0,F18=1.8d1
-    real*8,  parameter :: F12=1.2d1, F10=1.d1,EIT=8.d0
-    integer, parameter :: NO_SYMM = 0, EQ_SYMM = 1, OCTANT = 2
-    real*8, parameter :: SIX=6.d0,FIT=1.5d1,TWT=2.d1
-    real*8,parameter::cof=6.4d1   ! 2^6
-    real*8,intent(in) :: eps
-    dX = X(2)-X(1)
-    dY = Y(2)-Y(1)
-    dZ = Z(2)-Z(1)
-
-    d12dx = ONE/F12/dX
-    d12dy = ONE/F12/dY
-    d12dz = ONE/F12/dZ
-
-    d2dx = ONE/TWO/dX
-    d2dy = ONE/TWO/dY
-    d2dz = ONE/TWO/dZ
-
-    imax = ex(1)
-    jmax = ex(2)
-    kmax = ex(3)
-
-    imin_lopsided = 1
-    jmin_lopsided = 1
-    kmin_lopsided = 1
-    if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin_lopsided = -2
-    if(Symmetry > EQ_SYMM .and. dabs(X(1)) < dX) imin_lopsided = -2
-    if(Symmetry > EQ_SYMM .and. dabs(Y(1)) < dY) jmin_lopsided = -2
-
-    imin_kodis = 1
-    jmin_kodis = 1
-    kmin_kodis = 1
-
-    if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin_kodis = -2
-    if(Symmetry == OCTANT .and. dabs(X(1)) < dX) imin_kodis = -2
-    if(Symmetry == OCTANT .and. dabs(Y(1)) < dY) jmin_kodis = -2
-
-
-    call symmetry_bd(3,ex,f,fh,SoA)
-
-  ! upper bound set ex-1 only for efficiency, 
-  ! the loop body will set ex 0 also
-    do k=1,ex(3)-1
-    do j=1,ex(2)-1
-    do i=1,ex(1)-1
-
-  !! new code, 2012dec27, based on bam
-  ! x direction   
-      if(Sfx(i,j,k) > ZEO)then
-        if(i+3 <= imax)then
-  !         v
-  ! D f = ------[ - 3f    - 10f  + 18f    - 6f     + f     ]
-  !  i     12dx       i-v      i      i+v     i+2v    i+3v
-      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
-                    Sfx(i,j,k)*d12dx*(-F3*fh(i-1,j,k)-F10*fh(i,j,k)+F18*fh(i+1,j,k) &
-                                      -F6*fh(i+2,j,k)+    fh(i+3,j,k))
-      elseif(i+2 <= imax)then
-  !
-  !              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
-  !  fx(i) = ---------------------------------------------
-  !                             12 dx
-      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
-                    Sfx(i,j,k)*d12dx*(fh(i-2,j,k)-EIT*fh(i-1,j,k)+EIT*fh(i+1,j,k)-fh(i+2,j,k))
-
-      elseif(i+1 <= imax)then
-  !         v
-  ! D f = ------[   3f    + 10f  - 18f    + 6f     - f     ]
-  !  i     12dx       i+v      i      i-v     i-2v    i-3v
-      f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
-                    Sfx(i,j,k)*d12dx*(-F3*fh(i+1,j,k)-F10*fh(i,j,k)+F18*fh(i-1,j,k) &
-                                      -F6*fh(i-2,j,k)+    fh(i-3,j,k))
-  ! set imax and imin_lopsided 0
-      endif
-    elseif(Sfx(i,j,k) < ZEO)then
-        if(i-3 >= imin_lopsided)then
-  !         v
-  ! D f = ------[ - 3f    - 10f  + 18f    - 6f     + f     ]
-  !  i     12dx       i-v      i      i+v     i+2v    i+3v
-      f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
-                    Sfx(i,j,k)*d12dx*(-F3*fh(i+1,j,k)-F10*fh(i,j,k)+F18*fh(i-1,j,k) &
-                                      -F6*fh(i-2,j,k)+    fh(i-3,j,k))
-      elseif(i-2 >= imin_lopsided)then
-  !
-  !              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
-  !  fx(i) = ---------------------------------------------
-  !                             12 dx
-      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
-                    Sfx(i,j,k)*d12dx*(fh(i-2,j,k)-EIT*fh(i-1,j,k)+EIT*fh(i+1,j,k)-fh(i+2,j,k))
-
-      elseif(i-1 >= imin_lopsided)then
-  !         v
-  ! D f = ------[   3f    + 10f  - 18f    + 6f     - f     ]
-  !  i     12dx       i+v      i      i-v     i-2v    i-3v
-      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
-                    Sfx(i,j,k)*d12dx*(-F3*fh(i-1,j,k)-F10*fh(i,j,k)+F18*fh(i+1,j,k) &
-                                      -F6*fh(i+2,j,k)+    fh(i+3,j,k))
-  ! set imax and imin_lopsided 0
-      endif
-    endif
-
-  ! y direction   
-      if(Sfy(i,j,k) > ZEO)then
-        if(j+3 <= jmax)then
-  !         v
-  ! D f = ------[ - 3f    - 10f  + 18f    - 6f     + f     ]
-  !  i     12dx       i-v      i      i+v     i+2v    i+3v
-      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
-                    Sfy(i,j,k)*d12dy*(-F3*fh(i,j-1,k)-F10*fh(i,j,k)+F18*fh(i,j+1,k) &
-                                      -F6*fh(i,j+2,k)+    fh(i,j+3,k))
-      elseif(j+2 <= jmax)then
-  !
-  !              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
-  !  fx(i) = ---------------------------------------------
-  !                             12 dx
-      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
-                    Sfy(i,j,k)*d12dy*(fh(i,j-2,k)-EIT*fh(i,j-1,k)+EIT*fh(i,j+1,k)-fh(i,j+2,k))
-
-      elseif(j+1 <= jmax)then
-  !         v
-  ! D f = ------[   3f    + 10f  - 18f    + 6f     - f     ]
-  !  i     12dx       i+v      i      i-v     i-2v    i-3v
-      f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
-                    Sfy(i,j,k)*d12dy*(-F3*fh(i,j+1,k)-F10*fh(i,j,k)+F18*fh(i,j-1,k) &
-                                      -F6*fh(i,j-2,k)+    fh(i,j-3,k))
-  ! set imax and imin_lopsided 0
-      endif
-    elseif(Sfy(i,j,k) < ZEO)then
-        if(j-3 >= jmin_lopsided)then
-  !         v
-  ! D f = ------[ - 3f    - 10f  + 18f    - 6f     + f     ]
-  !  i     12dx       i-v      i      i+v     i+2v    i+3v
-      f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
-                    Sfy(i,j,k)*d12dy*(-F3*fh(i,j+1,k)-F10*fh(i,j,k)+F18*fh(i,j-1,k) &
-                                      -F6*fh(i,j-2,k)+    fh(i,j-3,k))
-      elseif(j-2 >= jmin_lopsided)then
-  !
-  !              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
-  !  fx(i) = ---------------------------------------------
-  !                             12 dx
-      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
-                    Sfy(i,j,k)*d12dy*(fh(i,j-2,k)-EIT*fh(i,j-1,k)+EIT*fh(i,j+1,k)-fh(i,j+2,k))
-
-      elseif(j-1 >= jmin_lopsided)then
-  !         v
-  ! D f = ------[   3f    + 10f  - 18f    + 6f     - f     ]
-  !  i     12dx       i+v      i      i-v     i-2v    i-3v
-      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
-                    Sfy(i,j,k)*d12dy*(-F3*fh(i,j-1,k)-F10*fh(i,j,k)+F18*fh(i,j+1,k) &
-                                      -F6*fh(i,j+2,k)+    fh(i,j+3,k))
-  ! set jmax and jmin_lopsided 0
-      endif
-    endif
-
-  ! z direction   
-      if(Sfz(i,j,k) > ZEO)then
-        if(k+3 <= kmax)then
-  !         v
-  ! D f = ------[ - 3f    - 10f  + 18f    - 6f     + f     ]
-  !  i     12dx       i-v      i      i+v     i+2v    i+3v
-      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
-                    Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k-1)-F10*fh(i,j,k)+F18*fh(i,j,k+1) &
-                                      -F6*fh(i,j,k+2)+    fh(i,j,k+3))
-      elseif(k+2 <= kmax)then
-  !
-  !              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
-  !  fx(i) = ---------------------------------------------
-  !                             12 dx
-      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
-                    Sfz(i,j,k)*d12dz*(fh(i,j,k-2)-EIT*fh(i,j,k-1)+EIT*fh(i,j,k+1)-fh(i,j,k+2))
-
-      elseif(k+1 <= kmax)then
-  !         v
-  ! D f = ------[   3f    + 10f  - 18f    + 6f     - f     ]
-  !  i     12dx       i+v      i      i-v     i-2v    i-3v
-      f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
-                    Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k+1)-F10*fh(i,j,k)+F18*fh(i,j,k-1) &
-                                      -F6*fh(i,j,k-2)+    fh(i,j,k-3))
-  ! set imax and imin_lopsided 0
-      endif
-    elseif(Sfz(i,j,k) < ZEO)then
-        if(k-3 >= kmin_lopsided)then
-  !         v
-  ! D f = ------[ - 3f    - 10f  + 18f    - 6f     + f     ]
-  !  i     12dx       i-v      i      i+v     i+2v    i+3v
-      f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
-                    Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k+1)-F10*fh(i,j,k)+F18*fh(i,j,k-1) &
-                                      -F6*fh(i,j,k-2)+    fh(i,j,k-3))
-      elseif(k-2 >= kmin_lopsided)then
-  !
-  !              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
-  !  fx(i) = ---------------------------------------------
-  !                             12 dx
-      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
-                    Sfz(i,j,k)*d12dz*(fh(i,j,k-2)-EIT*fh(i,j,k-1)+EIT*fh(i,j,k+1)-fh(i,j,k+2))
-
-      elseif(k-1 >= kmin_lopsided)then
-  !         v
-  ! D f = ------[   3f    + 10f  - 18f    + 6f     - f     ]
-  !  i     12dx       i+v      i      i-v     i-2v    i-3v
-      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
-                    Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k-1)-F10*fh(i,j,k)+F18*fh(i,j,k+1) &
-                                      -F6*fh(i,j,k+2)+    fh(i,j,k+3))
-  ! set kmax and kmin_lopsided 0
-      endif
-    endif
-
-
-    if(i-3 >= imin_kodis .and. i+3 <= imax .and. &
-      j-3 >= jmin_kodis .and. j+3 <= jmax .and. &
-      k-3 >= kmin_kodis .and. k+3 <= kmax) then
-
-  ! calculation order if important ?
-    f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/cof *( (     &
-                                (fh(i-3,j,k)+fh(i+3,j,k)) - &
-                            SIX*(fh(i-2,j,k)+fh(i+2,j,k)) + &
-                            FIT*(fh(i-1,j,k)+fh(i+1,j,k)) - &
-                            TWT* fh(i,j,k)            )/dX + &
-                                                    (     &
-                                (fh(i,j-3,k)+fh(i,j+3,k)) - &
-                            SIX*(fh(i,j-2,k)+fh(i,j+2,k)) + &
-                            FIT*(fh(i,j-1,k)+fh(i,j+1,k)) - &
-                            TWT* fh(i,j,k)            )/dY + &
-                                                    (     &
-                                (fh(i,j,k-3)+fh(i,j,k+3)) - &
-                            SIX*(fh(i,j,k-2)+fh(i,j,k+2)) + &
-                            FIT*(fh(i,j,k-1)+fh(i,j,k+1)) - &
-                            TWT* fh(i,j,k)            )/dZ )
-
-    endif
-
-    enddo
-    enddo
-    enddo
-
-    return
-
-
-
-  end subroutine merge_lopsided_kodis
--- a/AMSS_NCKU_source/cgh.C
+++ b/AMSS_NCKU_source/cgh.C
@@ -20,7 +20,11 @@ using namespace std;
 #include <map.h>
 #endif

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 #include "macrodef.h"
 #include "misc.h"
--- a/AMSS_NCKU_source/cgh.h
+++ b/AMSS_NCKU_source/cgh.h
@@ -2,7 +2,11 @@
 #ifndef CGH_H
 #define CGH_H

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif
 #include "MyList.h"
 #include "MPatch.h"
 #include "macrodef.h"
--- a/AMSS_NCKU_source/checkpoint.h
+++ b/AMSS_NCKU_source/checkpoint.h
@@ -19,7 +19,11 @@ using namespace std;
 #include <time.h>
 #include <stdlib.h>

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 #include "var.h"
 #include "MyList.h"
--- a/AMSS_NCKU_source/diff_new.f90
+++ b/AMSS_NCKU_source/diff_new.f90
@@ -69,6 +69,7 @@
  fy = ZEO
  fz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -151,6 +152,7 @@

  fx = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -218,6 +220,7 @@

  fy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -282,6 +285,7 @@

  fz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -371,6 +375,7 @@
  fxz = ZEO
  fyz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -469,6 +474,7 @@

  fxx = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -531,6 +537,7 @@

  fyy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -594,6 +601,7 @@

  fzz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -657,6 +665,7 @@

  fxy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -719,6 +728,7 @@

  fxz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -780,6 +790,7 @@

  fyz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -866,6 +877,7 @@
  fxz = ZEO
  fyz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -997,10 +1009,90 @@
  fy = ZEO
  fz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
+#if 0  
+! x direction   
+        if(i+2 <= imax .and. i-2 >= imin)then
+!
+!              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
+!  fx(i) = ---------------------------------------------
+!                             12 dx
+      fx(i,j,k)=d12dx*(fh(i-2,j,k)-EIT*fh(i-1,j,k)+EIT*fh(i+1,j,k)-fh(i+2,j,k))

+    elseif(i+1 <= imax .and. i-1 >= imin)then
+!
+!              - f(i-1) + f(i+1)
+!  fx(i) = --------------------------------
+!                     2 dx
+      fx(i,j,k)=d2dx*(-fh(i-1,j,k)+fh(i+1,j,k))
+
+! set imax and imin 0
+    endif
+! y direction   
+        if(j+2 <= jmax .and. j-2 >= jmin)then
+
+      fy(i,j,k)=d12dy*(fh(i,j-2,k)-EIT*fh(i,j-1,k)+EIT*fh(i,j+1,k)-fh(i,j+2,k))
+
+    elseif(j+1 <= jmax .and. j-1 >= jmin)then
+
+     fy(i,j,k)=d2dy*(-fh(i,j-1,k)+fh(i,j+1,k))
+
+! set jmax and jmin 0
+    endif
+! z direction   
+        if(k+2 <= kmax .and. k-2 >= kmin)then
+
+      fz(i,j,k)=d12dz*(fh(i,j,k-2)-EIT*fh(i,j,k-1)+EIT*fh(i,j,k+1)-fh(i,j,k+2))
+
+    elseif(k+1 <= kmax .and. k-1 >= kmin)then
+
+      fz(i,j,k)=d2dz*(-fh(i,j,k-1)+fh(i,j,k+1))
+
+! set kmax and kmin 0
+    endif
+#elif 0
+! x direction   
+        if(i+2 <= imax .and. i-2 >= imin)then
+!
+!              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
+!  fx(i) = ---------------------------------------------
+!                             12 dx
+      fx(i,j,k)=d12dx*(fh(i-2,j,k)-EIT*fh(i-1,j,k)+EIT*fh(i+1,j,k)-fh(i+2,j,k))
+
+    elseif(i+3 <= imax .and. i-1 >= imin)then
+      fx(i,j,k)=d12dx*(-3.d0*fh(i-1,j,k)-1.d1*fh(i,j,k)+1.8d1*fh(i+1,j,k)-6.d0*fh(i+2,j,k)+fh(i+3,j,k))
+    elseif(i+1 <= imax .and. i-3 >= imin)then
+      fx(i,j,k)=d12dx*( 3.d0*fh(i+1,j,k)+1.d1*fh(i,j,k)-1.8d1*fh(i-1,j,k)+6.d0*fh(i-2,j,k)-fh(i-3,j,k))
+! set imax and imin 0
+    endif
+! y direction   
+        if(j+2 <= jmax .and. j-2 >= jmin)then
+
+      fy(i,j,k)=d12dy*(fh(i,j-2,k)-EIT*fh(i,j-1,k)+EIT*fh(i,j+1,k)-fh(i,j+2,k))
+
+    elseif(j+3 <= jmax .and. j-1 >= jmin)then
+      fy(i,j,k)=d12dy*(-3.d0*fh(i,j-1,k)-1.d1*fh(i,j,k)+1.8d1*fh(i,j+1,k)-6.d0*fh(i,j+2,k)+fh(i,j+3,k))
+    elseif(j+1 <= jmax .and. j-3 >= jmin)then
+      fy(i,j,k)=d12dy*( 3.d0*fh(i,j+1,k)+1.d1*fh(i,j,k)-1.8d1*fh(i,j-1,k)+6.d0*fh(i,j-2,k)-fh(i,j-3,k))
+
+! set jmax and jmin 0
+    endif
+! z direction   
+        if(k+2 <= kmax .and. k-2 >= kmin)then
+
+      fz(i,j,k)=d12dz*(fh(i,j,k-2)-EIT*fh(i,j,k-1)+EIT*fh(i,j,k+1)-fh(i,j,k+2))
+
+    elseif(k+3 <= kmax .and. k-1 >= kmin)then
+      fz(i,j,k)=d12dz*(-3.d0*fh(i,j,k-1)-1.d1*fh(i,j,k)+1.8d1*fh(i,j,k+1)-6.d0*fh(i,j,k+2)+fh(i,j,k+3))
+    elseif(k+1 <= kmax .and. k-3 >= kmin)then
+      fz(i,j,k)=d12dz*( 3.d0*fh(i,j,k+1)+1.d1*fh(i,j,k)-1.8d1*fh(i,j,k-1)+6.d0*fh(i,j,k-2)-fh(i,j,k-3))
+
+! set kmax and kmin 0
+    endif
+#else
 ! for bam comparison
   if(i+2 <= imax .and. i-2 >= imin .and. &
      j+2 <= jmax .and. j-2 >= jmin .and. &
@@ -1015,7 +1107,7 @@
      fy(i,j,k)=d2dy*(-fh(i,j-1,k)+fh(i,j+1,k))
      fz(i,j,k)=d2dz*(-fh(i,j,k-1)+fh(i,j,k+1))
   endif
-
+#endif
  enddo
  enddo
  enddo
@@ -1072,6 +1164,7 @@

  fx = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -1148,6 +1241,7 @@

  fy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -1218,6 +1312,7 @@

  fz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -1322,10 +1417,89 @@
  fxz = ZEO
  fyz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
+#if 0  
+!~~~~~~ fxx
+        if(i+2 <= imax .and. i-2 >= imin)then
+!
+!               - f(i-2) + 16 f(i-1) - 30 f(i) + 16 f(i+1) - f(i+2)
+!  fxx(i) = ----------------------------------------------------------
+!                                  12 dx^2 
+   fxx(i,j,k) = Fdxdx*(-fh(i-2,j,k)+F16*fh(i-1,j,k)-F30*fh(i,j,k) &
+                       -fh(i+2,j,k)+F16*fh(i+1,j,k)              )
+   elseif(i+1 <= imax .and. i-1 >= imin)then
+!
+!               f(i-1) - 2 f(i) + f(i+1)
+!  fxx(i) = --------------------------------
+!                         dx^2 
+   fxx(i,j,k) = Sdxdx*(fh(i-1,j,k)-TWO*fh(i,j,k) &
+                      +fh(i+1,j,k)              )
+   endif

+
+!~~~~~~ fyy
+        if(j+2 <= jmax .and. j-2 >= jmin)then
+
+   fyy(i,j,k) = Fdydy*(-fh(i,j-2,k)+F16*fh(i,j-1,k)-F30*fh(i,j,k) &
+                       -fh(i,j+2,k)+F16*fh(i,j+1,k)              )
+   elseif(j+1 <= jmax .and. j-1 >= jmin)then
+
+   fyy(i,j,k) = Sdydy*(fh(i,j-1,k)-TWO*fh(i,j,k) &
+                      +fh(i,j+1,k)              )
+   endif
+
+!~~~~~~ fzz
+        if(k+2 <= kmax .and. k-2 >= kmin)then
+
+   fzz(i,j,k) = Fdzdz*(-fh(i,j,k-2)+F16*fh(i,j,k-1)-F30*fh(i,j,k) &
+                       -fh(i,j,k+2)+F16*fh(i,j,k+1)              )
+   elseif(k+1 <= kmax .and. k-1 >= kmin)then
+
+   fzz(i,j,k) = Sdzdz*(fh(i,j,k-1)-TWO*fh(i,j,k) &
+                      +fh(i,j,k+1)              )
+   endif
+!~~~~~~ fxy
+       if(i+2 <= imax .and. i-2 >= imin .and. j+2 <= jmax .and. j-2 >= jmin)then
+!
+!                 ( f(i-2,j-2) - 8 f(i-1,j-2) + 8 f(i+1,j-2) - f(i+2,j-2) )
+!             - 8 ( f(i-2,j-1) - 8 f(i-1,j-1) + 8 f(i+1,j-1) - f(i+2,j-1) )
+!             + 8 ( f(i-2,j+1) - 8 f(i-1,j+1) + 8 f(i+1,j+1) - f(i+2,j+1) )
+!             -   ( f(i-2,j+2) - 8 f(i-1,j+2) + 8 f(i+1,j+2) - f(i+2,j+2) )
+!  fxy(i,j) = ----------------------------------------------------------------
+!                                  144 dx dy
+   fxy(i,j,k) = Fdxdy*(     (fh(i-2,j-2,k)-F8*fh(i-1,j-2,k)+F8*fh(i+1,j-2,k)-fh(i+2,j-2,k))  &
+                       -F8 *(fh(i-2,j-1,k)-F8*fh(i-1,j-1,k)+F8*fh(i+1,j-1,k)-fh(i+2,j-1,k))  &
+                       +F8 *(fh(i-2,j+1,k)-F8*fh(i-1,j+1,k)+F8*fh(i+1,j+1,k)-fh(i+2,j+1,k))  &
+                       -    (fh(i-2,j+2,k)-F8*fh(i-1,j+2,k)+F8*fh(i+1,j+2,k)-fh(i+2,j+2,k)))
+
+   elseif(i+1 <= imax .and. i-1 >= imin .and. j+1 <= jmax .and. j-1 >= jmin)then
+!                 f(i-1,j-1) - f(i+1,j-1) - f(i-1,j+1) + f(i+1,j+1) 
+!  fxy(i,j) = -----------------------------------------------------------
+!                                      4 dx dy
+   fxy(i,j,k) = Sdxdy*(fh(i-1,j-1,k)-fh(i+1,j-1,k)-fh(i-1,j+1,k)+fh(i+1,j+1,k))
+   endif
+!~~~~~~ fxz
+       if(i+2 <= imax .and. i-2 >= imin .and. k+2 <= kmax .and. k-2 >= kmin)then
+   fxz(i,j,k) = Fdxdz*(     (fh(i-2,j,k-2)-F8*fh(i-1,j,k-2)+F8*fh(i+1,j,k-2)-fh(i+2,j,k-2))  &
+                       -F8 *(fh(i-2,j,k-1)-F8*fh(i-1,j,k-1)+F8*fh(i+1,j,k-1)-fh(i+2,j,k-1))  &
+                       +F8 *(fh(i-2,j,k+1)-F8*fh(i-1,j,k+1)+F8*fh(i+1,j,k+1)-fh(i+2,j,k+1))  &
+                       -    (fh(i-2,j,k+2)-F8*fh(i-1,j,k+2)+F8*fh(i+1,j,k+2)-fh(i+2,j,k+2)))
+   elseif(i+1 <= imax .and. i-1 >= imin .and. k+1 <= kmax .and. k-1 >= kmin)then
+   fxz(i,j,k) = Sdxdz*(fh(i-1,j,k-1)-fh(i+1,j,k-1)-fh(i-1,j,k+1)+fh(i+1,j,k+1))
+   endif
+!~~~~~~ fyz
+       if(j+2 <= jmax .and. j-2 >= jmin .and. k+2 <= kmax .and. k-2 >= kmin)then
+   fyz(i,j,k) = Fdydz*(     (fh(i,j-2,k-2)-F8*fh(i,j-1,k-2)+F8*fh(i,j+1,k-2)-fh(i,j+2,k-2))  &
+                       -F8 *(fh(i,j-2,k-1)-F8*fh(i,j-1,k-1)+F8*fh(i,j+1,k-1)-fh(i,j+2,k-1))  &
+                       +F8 *(fh(i,j-2,k+1)-F8*fh(i,j-1,k+1)+F8*fh(i,j+1,k+1)-fh(i,j+2,k+1))  &
+                       -    (fh(i,j-2,k+2)-F8*fh(i,j-1,k+2)+F8*fh(i,j+1,k+2)-fh(i,j+2,k+2)))
+   elseif(j+1 <= jmax .and. j-1 >= jmin .and. k+1 <= kmax .and. k-1 >= kmin)then
+   fyz(i,j,k) = Sdydz*(fh(i,j-1,k-1)-fh(i,j+1,k-1)-fh(i,j-1,k+1)+fh(i,j+1,k+1))
+   endif 
+#else
 ! for bam comparison
   if(i+2 <= imax .and. i-2 >= imin .and. &
      j+2 <= jmax .and. j-2 >= jmin .and. &
@@ -1361,7 +1535,7 @@
   fxz(i,j,k) = Sdxdz*(fh(i-1,j,k-1)-fh(i+1,j,k-1)-fh(i-1,j,k+1)+fh(i+1,j,k+1))
   fyz(i,j,k) = Sdydz*(fh(i,j-1,k-1)-fh(i,j+1,k-1)-fh(i,j-1,k+1)+fh(i,j+1,k+1))
   endif
-
+#endif
   enddo
   enddo
   enddo
@@ -1419,6 +1593,7 @@

  fxx = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -1486,6 +1661,7 @@

  fyy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -1555,6 +1731,7 @@

  fzz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -1624,6 +1801,7 @@

  fxy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -1694,6 +1872,7 @@

  fxz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -1762,6 +1941,7 @@

  fyz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -1854,6 +2034,7 @@
  fy = ZEO
  fz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -1970,6 +2151,7 @@

  fx = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -2055,6 +2237,7 @@

  fy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -2131,6 +2314,7 @@

  fz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -2249,6 +2433,7 @@
  fxz = ZEO
  fyz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -2436,6 +2621,7 @@

  fxx = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -2508,6 +2694,7 @@

  fyy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -2583,6 +2770,7 @@

  fzz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -2658,6 +2846,7 @@

  fxy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -2738,6 +2927,7 @@

  fxz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -2816,6 +3006,7 @@

  fyz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -2923,6 +3114,7 @@
  fy = ZEO
  fz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -3059,6 +3251,7 @@

  fx = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -3154,6 +3347,7 @@

  fy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -3238,6 +3432,7 @@

  fz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -3373,6 +3568,7 @@
  fxz = ZEO
  fyz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -3645,6 +3841,7 @@

  fxx = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -3726,6 +3923,7 @@

  fyy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -3810,6 +4008,7 @@
  
  fzz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -3894,6 +4093,7 @@

  fxy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -3996,6 +4196,7 @@

  fxz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -4096,6 +4297,7 @@

  fyz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
--- a/AMSS_NCKU_source/diff_new_sh.f90
+++ b/AMSS_NCKU_source/diff_new_sh.f90
@@ -81,6 +81,7 @@
  fy = ZEO
  fz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -179,6 +180,7 @@

  fx = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -262,6 +264,7 @@

  fy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -342,6 +345,7 @@

  fz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -443,6 +447,7 @@
  fxz = ZEO
  fyz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -553,6 +558,7 @@

  fxx = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -627,6 +633,7 @@

  fyy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -702,6 +709,7 @@

  fzz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -777,6 +785,7 @@

  fxy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -851,6 +860,7 @@

  fxz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -924,6 +934,7 @@

  fyz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -1019,6 +1030,7 @@
  fy = ZEO
  fz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -1134,6 +1146,7 @@

  fx = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -1227,6 +1240,7 @@

  fy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -1314,6 +1328,7 @@

  fz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -1430,6 +1445,7 @@
  fxz = ZEO
  fyz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -1580,6 +1596,7 @@

  fxx = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -1659,6 +1676,7 @@

  fyy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -1740,6 +1758,7 @@

  fzz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -1821,6 +1840,7 @@

  fxy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -1903,6 +1923,7 @@

  fxz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -1983,6 +2004,7 @@

  fyz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -2087,6 +2109,7 @@
  fy = ZEO
  fz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -2219,6 +2242,7 @@

  fx = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -2321,6 +2345,7 @@

  fy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -2414,6 +2439,7 @@

  fz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -2544,6 +2570,7 @@
  fxz = ZEO
  fyz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -2743,6 +2770,7 @@

  fxx = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -2827,6 +2855,7 @@

  fyy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -2914,6 +2943,7 @@

  fzz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -3001,6 +3031,7 @@

  fxy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -3093,6 +3124,7 @@

  fxz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -3183,6 +3215,7 @@

  fyz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -3302,6 +3335,7 @@
  fy = ZEO
  fz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -3454,6 +3488,7 @@

  fx = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -3566,6 +3601,7 @@

  fy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -3667,6 +3703,7 @@

  fz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -3814,6 +3851,7 @@
  fxz = ZEO
  fyz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -4098,6 +4136,7 @@

  fxx = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -4191,6 +4230,7 @@

  fyy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -4287,6 +4327,7 @@
  
  fzz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -4383,6 +4424,7 @@

  fxy = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -4497,6 +4539,7 @@

  fxz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -4609,6 +4652,7 @@

  fyz = ZEO

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -4679,6 +4723,7 @@ subroutine fderivs_shc(ex,f,fx,fy,fz,crho,sigma,R,SYM1,SYM2,SYM3,Symmetry,Lev,ss
 #if 0  
  integer :: i,j,k
  
+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -4729,6 +4774,7 @@ subroutine fdderivs_shc(ex,f,fxx,fxy,fxz,fyy,fyz,fzz,crho,sigma,R,SYM1,SYM2,SYM3
 #if 0  
  integer :: i,j,k
  
+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
--- a/AMSS_NCKU_source/enforce_algebra.f90
+++ b/AMSS_NCKU_source/enforce_algebra.f90
@@ -27,6 +27,7 @@

 !~~~~~~>

+  !$omp parallel do collapse(2) private(i,j,k,lgxx,lgyy,lgzz,ldetg,lgupxx,lgupxy,lgupxz,lgupyy,lgupyz,lgupzz,ltrA,lscale)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -104,6 +105,7 @@

 !~~~~~~>

+  !$omp parallel do collapse(2) private(i,j,k,lgxx,lgyy,lgzz,lscale,lgxy,lgxz,lgyz,lgupxx,lgupxy,lgupxz,lgupyy,lgupyz,lgupzz,ltrA)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
--- a/AMSS_NCKU_source/expansion.C
+++ b/AMSS_NCKU_source/expansion.C
@@ -6,7 +6,11 @@
 #include <stdio.h>
 #include <assert.h>
 #include <math.h>
+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 #include "util_Table.h"
 #include "cctk.h"
--- a/AMSS_NCKU_source/find_horizons.C
+++ b/AMSS_NCKU_source/find_horizons.C
@@ -6,7 +6,11 @@
 #include <stdio.h>
 #include <assert.h>
 #include <math.h>
+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 #include "cctk.h"

--- a/AMSS_NCKU_source/fmisc.f90
+++ b/AMSS_NCKU_source/fmisc.f90
@@ -326,7 +326,6 @@ subroutine symmetry_bd(ord,extc,func,funcc,SoA)

  funcc(1:extc(1),1:extc(2),1:extc(3)) = func
   do i=0,ord-1
-      
      funcc(-i,1:extc(2),1:extc(3)) = funcc(i+2,1:extc(2),1:extc(3))*SoA(1)
   enddo
   do i=0,ord-1
--- a/AMSS_NCKU_source/kodiss.f90
+++ b/AMSS_NCKU_source/kodiss.f90
@@ -6,6 +6,102 @@
 ! Vertex or Cell is distinguished in routine symmetry_bd which locates in
 ! file "fmisc.f90"

+#if (ghost_width == 2)
+! second order code
+
+!------------------------------------------------------------------------------------------------------------------------------
+!usual type Kreiss-Oliger type numerical dissipation
+!We support cell center only
+!  (D_+D_-)^2 =
+!   f(i-2) - 4 f(i-1) + 6 f(i) - 4 f(i+1) + f(i+2)
+! ------------------------------------------------------
+!                       dx^4
+!------------------------------------------------------------------------------------------------------------------------------
+! do not add dissipation near boundary
+subroutine kodis(ex,X,Y,Z,f,f_rhs,SoA,Symmetry,eps)
+
+implicit none
+! argument variables
+integer,intent(in) :: Symmetry
+integer,dimension(3),intent(in)::ex
+real*8, dimension(1:3), intent(in) :: SoA
+double precision,intent(in),dimension(ex(1))::X
+double precision,intent(in),dimension(ex(2))::Y
+double precision,intent(in),dimension(ex(3))::Z
+double precision,intent(in),dimension(ex(1),ex(2),ex(3))::f
+double precision,intent(inout),dimension(ex(1),ex(2),ex(3))::f_rhs
+real*8,intent(in) :: eps
+
+!~~~~~~ other variables
+
+  real*8 :: dX,dY,dZ
+  real*8,dimension(-1:ex(1),-1:ex(2),-1:ex(3))   :: fh
+  integer :: imin,jmin,kmin,imax,jmax,kmax
+  integer, parameter :: NO_SYMM = 0, EQ_SYMM = 1, OCTANT = 2
+  real*8,parameter   :: cof = 1.6d1 ! 2^4
+  real*8,  parameter :: F4=4.d0,F6=6.d0
+  integer::i,j,k
+
+  dX = X(2)-X(1)
+  dY = Y(2)-Y(1)
+  dZ = Z(2)-Z(1)
+
+  imax = ex(1)
+  jmax = ex(2)
+  kmax = ex(3)
+
+  imin = 1
+  jmin = 1
+  kmin = 1
+
+  if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin = -1
+  if(Symmetry > EQ_SYMM .and. dabs(X(1)) < dX) imin = -1
+  if(Symmetry > EQ_SYMM .and. dabs(Y(1)) < dY) jmin = -1
+
+  call symmetry_bd(2,ex,f,fh,SoA)
+
+!   f(i-2) - 4 f(i-1) + 6 f(i) - 4 f(i+1) + f(i+2)
+! ------------------------------------------------------
+!                       dx^4
+
+!  note the sign (-1)^r-1, now r=2
+  !$omp parallel do collapse(2) private(i,j,k)
+  do k=1,ex(3)
+  do j=1,ex(2)
+  do i=1,ex(1)
+
+  if(i-2 >= imin .and. i+2 <= imax .and. &
+     j-2 >= jmin .and. j+2 <= jmax .and. &
+     k-2 >= kmin .and. k+2 <= kmax) then
+! x direction
+   f_rhs(i,j,k)       = f_rhs(i,j,k) - eps/dX/cof * (     &
+                                (fh(i-2,j,k)+fh(i+2,j,k)) &
+                         - F4 * (fh(i-1,j,k)+fh(i+1,j,k)) &
+                         + F6 *  fh(i,j,k) )
+! y direction
+
+   f_rhs(i,j,k)       = f_rhs(i,j,k) - eps/dY/cof * (     &
+                                (fh(i,j-2,k)+fh(i,j+2,k)) &
+                         - F4 * (fh(i,j-1,k)+fh(i,j+1,k)) &
+                         + F6 *  fh(i,j,k) )
+! z direction
+
+   f_rhs(i,j,k)       = f_rhs(i,j,k) - eps/dZ/cof * (     &
+                                (fh(i,j,k-2)+fh(i,j,k+2)) &
+                         - F4 * (fh(i,j,k-1)+fh(i,j,k+1)) &
+                         + F6 *  fh(i,j,k) )
+
+  endif
+
+  enddo
+  enddo
+  enddo
+
+  return
+
+end subroutine kodis
+
+#elif (ghost_width == 3)
 ! fourth order code

 !---------------------------------------------------------------------------------------------
@@ -61,9 +157,10 @@ integer, parameter :: NO_SYMM=0, OCTANT=2
  if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin = -2
  if(Symmetry == OCTANT .and. dabs(X(1)) < dX) imin = -2
  if(Symmetry == OCTANT .and. dabs(Y(1)) < dY) jmin = -2
-  !print*,'imin,jmin,kmin=',imin,jmin,kmin
+
  call symmetry_bd(3,ex,f,fh,SoA)

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -71,7 +168,28 @@ integer, parameter :: NO_SYMM=0, OCTANT=2
  if(i-3 >= imin .and. i+3 <= imax .and. &
     j-3 >= jmin .and. j+3 <= jmax .and. &
     k-3 >= kmin .and. k+3 <= kmax) then
+#if 0     
+! x direction
+   f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/dX/cof * (     &
+                              (fh(i-3,j,k)+fh(i+3,j,k)) - &
+                          SIX*(fh(i-2,j,k)+fh(i+2,j,k)) + &
+                          FIT*(fh(i-1,j,k)+fh(i+1,j,k)) - &
+                          TWT* fh(i,j,k)            )
+! y direction

+   f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/dY/cof * (     &
+                              (fh(i,j-3,k)+fh(i,j+3,k)) - &
+                          SIX*(fh(i,j-2,k)+fh(i,j+2,k)) + &
+                          FIT*(fh(i,j-1,k)+fh(i,j+1,k)) - &
+                          TWT* fh(i,j,k)            )
+! z direction
+
+   f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/dZ/cof * (     &
+                              (fh(i,j,k-3)+fh(i,j,k+3)) - &
+                          SIX*(fh(i,j,k-2)+fh(i,j,k+2)) + &
+                          FIT*(fh(i,j,k-1)+fh(i,j,k+1)) - &
+                          TWT* fh(i,j,k)            )
+#else
 ! calculation order if important ?
   f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/cof *( (     &
                              (fh(i-3,j,k)+fh(i+3,j,k)) - &
@@ -88,7 +206,7 @@ integer, parameter :: NO_SYMM=0, OCTANT=2
                          SIX*(fh(i,j,k-2)+fh(i,j,k+2)) + &
                          FIT*(fh(i,j,k-1)+fh(i,j,k+1)) - &
                          TWT* fh(i,j,k)            )/dZ )
-
+#endif
  endif

  enddo
@@ -99,6 +217,220 @@ integer, parameter :: NO_SYMM=0, OCTANT=2

  end subroutine kodis

+#elif (ghost_width == 4)
+! sixth order code
+!------------------------------------------------------------------------------------------------------------------------------
+!usual type Kreiss-Oliger type numerical dissipation
+!We support cell center only
+!  (D_+D_-)^4 =
+!   f(i-4) - 8 f(i-3) + 28 f(i-2) - 56 f(i-1) + 70 f(i) - 56 f(i+1) + 28 f(i+2) - 8 f(i+3) + f(i+4)
+! ----------------------------------------------------------------------------------------------------------
+!                                              dx^8
+!------------------------------------------------------------------------------------------------------------------------------
+! do not add dissipation near boundary
+subroutine kodis(ex,X,Y,Z,f,f_rhs,SoA,Symmetry,eps)

+implicit none
+! argument variables
+integer,intent(in) :: Symmetry
+integer,dimension(3),intent(in)::ex
+real*8, dimension(1:3), intent(in) :: SoA
+double precision,intent(in),dimension(ex(1))::X
+double precision,intent(in),dimension(ex(2))::Y
+double precision,intent(in),dimension(ex(3))::Z
+double precision,intent(in),dimension(ex(1),ex(2),ex(3))::f
+double precision,intent(inout),dimension(ex(1),ex(2),ex(3))::f_rhs
+real*8,intent(in) :: eps

+!~~~~~~ other variables

+  real*8 :: dX,dY,dZ
+  real*8,dimension(-3:ex(1),-3:ex(2),-3:ex(3))   :: fh
+  integer :: imin,jmin,kmin,imax,jmax,kmax
+  integer, parameter :: NO_SYMM = 0, EQ_SYMM = 1, OCTANT = 2
+  real*8,parameter   :: cof = 2.56d2 ! 2^8
+  real*8,  parameter :: F8=8.d0,F28=2.8d1,F56=5.6d1,F70=7.d1
+  integer::i,j,k
+
+  dX = X(2)-X(1)
+  dY = Y(2)-Y(1)
+  dZ = Z(2)-Z(1)
+  
+  imax = ex(1)
+  jmax = ex(2)
+  kmax = ex(3)
+
+  imin = 1
+  jmin = 1
+  kmin = 1
+
+  if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin = -3
+  if(Symmetry > EQ_SYMM .and. dabs(X(1)) < dX) imin = -3
+  if(Symmetry > EQ_SYMM .and. dabs(Y(1)) < dY) jmin = -3
+
+  call symmetry_bd(4,ex,f,fh,SoA)
+
+!   f(i-4) - 8 f(i-3) + 28 f(i-2) - 56 f(i-1) + 70 f(i) - 56 f(i+1) + 28 f(i+2) - 8 f(i+3) + f(i+4)
+! ----------------------------------------------------------------------------------------------------------
+!                                              dx^8
+
+!  note the sign (-1)^r-1, now r=4
+  !$omp parallel do collapse(2) private(i,j,k)
+  do k=1,ex(3)
+  do j=1,ex(2)
+  do i=1,ex(1)
+
+  if(i>imin+3 .and. i < imax-3 .and. &
+     j>jmin+3 .and. j < jmax-3 .and. &
+     k>kmin+3 .and. k < kmax-3) then
+! x direction
+   f_rhs(i,j,k)       = f_rhs(i,j,k) - eps/dX/cof * (     &
+                                (fh(i-4,j,k)+fh(i+4,j,k)) &
+                         - F8 * (fh(i-3,j,k)+fh(i+3,j,k)) &
+                         +F28 * (fh(i-2,j,k)+fh(i+2,j,k)) &
+                         -F56 * (fh(i-1,j,k)+fh(i+1,j,k)) &
+                         +F70 *  fh(i,j,k) )
+! y direction
+
+   f_rhs(i,j,k)       = f_rhs(i,j,k) - eps/dY/cof * (     &
+                                (fh(i,j-4,k)+fh(i,j+4,k)) &
+                         - F8 * (fh(i,j-3,k)+fh(i,j+3,k)) &
+                         +F28 * (fh(i,j-2,k)+fh(i,j+2,k)) &
+                         -F56 * (fh(i,j-1,k)+fh(i,j+1,k)) &
+                         +F70 *  fh(i,j,k) )
+! z direction
+
+   f_rhs(i,j,k)       = f_rhs(i,j,k) - eps/dZ/cof * (     &
+                                (fh(i,j,k-4)+fh(i,j,k+4)) &
+                         - F8 * (fh(i,j,k-3)+fh(i,j,k+3)) &
+                         +F28 * (fh(i,j,k-2)+fh(i,j,k+2)) &
+                         -F56 * (fh(i,j,k-1)+fh(i,j,k+1)) &
+                         +F70 *  fh(i,j,k) )
+
+  endif
+
+  enddo
+  enddo
+  enddo
+
+  return
+
+end subroutine kodis
+
+#elif (ghost_width == 5)
+! eighth order code
+!------------------------------------------------------------------------------------------------------------------------------
+!usual type Kreiss-Oliger type numerical dissipation
+!We support cell center only
+! Note the notation D_+ and D_- [P240 of B. Gustafsson, H.-O. Kreiss, and J. Oliger, Time
+! Dependent Problems and Difference Methods (Wiley, New York, 1995).]
+! D_+ = (f(i+1) - f(i))/h
+! D_- = (f(i) - f(i-1))/h
+! then we have D_+D_- = D_-D_+ = (f(i+1) - 2f(i) + f(i-1))/h^2
+! for nth order accurate finite difference code, we need r =n/2+1
+!              D_+^rD_-^r = (D_+D_-)^r 
+! following the tradiation of PRD 77, 024027 (BB's calibration paper, Eq.(64),
+!  correct some typo according to above book) :
+! + eps*(-1)^(r-1)*h^(2r-1)/2^(2r)*(D_+D_-)^r 
+!
+!
+! this is for 8th order accurate finite difference scheme
+!  (D_+D_-)^5 =
+!  f(i-5) - 10 f(i-4) + 45 f(i-3) - 120 f(i-2) + 210 f(i-1) - 252 f(i) + 210 f(i+1) - 120 f(i+2) + 45 f(i+3) - 10 f(i+4) + f(i+5)
+! -------------------------------------------------------------------------------------------------------------------------------
+!                                                              dx^10
+!---------------------------------------------------------------------------------------------------------------------------------
+! do not add dissipation near boundary
+subroutine kodis(ex,X,Y,Z,f,f_rhs,SoA,Symmetry,eps)
+
+implicit none
+! argument variables
+integer,intent(in) :: Symmetry
+integer,dimension(3),intent(in)::ex
+real*8, dimension(1:3), intent(in) :: SoA
+double precision,intent(in),dimension(ex(1))::X
+double precision,intent(in),dimension(ex(2))::Y
+double precision,intent(in),dimension(ex(3))::Z
+double precision,intent(in),dimension(ex(1),ex(2),ex(3))::f
+double precision,intent(inout),dimension(ex(1),ex(2),ex(3))::f_rhs
+real*8,intent(in) :: eps
+
+!~~~~~~ other variables
+
+  real*8 :: dX,dY,dZ
+  real*8,dimension(-4:ex(1),-4:ex(2),-4:ex(3))   :: fh
+  integer :: imin,jmin,kmin,imax,jmax,kmax
+  integer, parameter :: NO_SYMM = 0, EQ_SYMM = 1, OCTANT = 2
+  real*8,parameter   :: cof = 1.024d3 ! 2^2r = 2^10
+  real*8,  parameter :: F10=1.d1,F45=4.5d1,F120=1.2d2,F210=2.1d2,F252=2.52d2
+  integer::i,j,k
+
+  dX = X(2)-X(1)
+  dY = Y(2)-Y(1)
+  dZ = Z(2)-Z(1)
+  
+  imax = ex(1)
+  jmax = ex(2)
+  kmax = ex(3)
+
+  imin = 1
+  jmin = 1
+  kmin = 1
+
+  if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin = -4
+  if(Symmetry > EQ_SYMM .and. dabs(X(1)) < dX) imin = -4
+  if(Symmetry > EQ_SYMM .and. dabs(Y(1)) < dY) jmin = -4
+
+  call symmetry_bd(5,ex,f,fh,SoA)
+
+!  f(i-5) - 10 f(i-4) + 45 f(i-3) - 120 f(i-2) + 210 f(i-1) - 252 f(i) + 210 f(i+1) - 120 f(i+2) + 45 f(i+3) - 10 f(i+4) + f(i+5)
+! -------------------------------------------------------------------------------------------------------------------------------
+!                                                              dx^10
+
+!  note the sign (-1)^r-1, now r=5
+  !$omp parallel do collapse(2) private(i,j,k)
+  do k=1,ex(3)
+  do j=1,ex(2)
+  do i=1,ex(1)
+
+  if(i>imin+4 .and. i < imax-4 .and. &
+     j>jmin+4 .and. j < jmax-4 .and. &
+     k>kmin+4 .and. k < kmax-4) then
+! x direction
+   f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/dX/cof * (      &
+                                 (fh(i-5,j,k)+fh(i+5,j,k)) &
+                         - F10 * (fh(i-4,j,k)+fh(i+4,j,k)) &
+                         + F45 * (fh(i-3,j,k)+fh(i+3,j,k)) &
+                         - F120* (fh(i-2,j,k)+fh(i+2,j,k)) &
+                         + F210* (fh(i-1,j,k)+fh(i+1,j,k)) &
+                         - F252 * fh(i,j,k) )
+! y direction
+
+   f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/dY/cof * (      &
+                                 (fh(i,j-5,k)+fh(i,j+5,k)) &
+                         - F10 * (fh(i,j-4,k)+fh(i,j+4,k)) &
+                         + F45 * (fh(i,j-3,k)+fh(i,j+3,k)) &
+                         - F120* (fh(i,j-2,k)+fh(i,j+2,k)) &
+                         + F210* (fh(i,j-1,k)+fh(i,j+1,k)) &
+                         - F252 * fh(i,j,k) )
+! z direction
+
+   f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/dZ/cof * (      &
+                                 (fh(i,j,k-5)+fh(i,j,k+5)) &
+                         - F10 * (fh(i,j,k-4)+fh(i,j,k+4)) &
+                         + F45 * (fh(i,j,k-3)+fh(i,j,k+3)) &
+                         - F120* (fh(i,j,k-2)+fh(i,j,k+2)) &
+                         + F210* (fh(i,j,k-1)+fh(i,j,k+1)) &
+                         - F252 * fh(i,j,k) )
+
+  endif
+
+  enddo
+  enddo
+  enddo
+
+  return
+
+end subroutine kodis
+
+#endif  
--- a/AMSS_NCKU_source/kodiss_sh.f90
+++ b/AMSS_NCKU_source/kodiss_sh.f90
@@ -80,6 +80,7 @@ real*8,intent(in) :: eps
 !                       dx^4

 !  note the sign (-1)^r-1, now r=2
+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -178,6 +179,7 @@ real*8,intent(in) :: eps
 !                       dx^4

 !  note the sign (-1)^r-1, now r=2
+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -273,6 +275,7 @@ integer, parameter :: NO_SYMM=0, EQ_SYMM=1, OCTANT=2

  call symmetry_stbd(2,ex,f,fh,SoA)

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -369,6 +372,7 @@ integer, parameter :: NO_SYMM=0, EQ_SYMM=1, OCTANT=2

  call symmetry_stbd(3,ex,f,fh,SoA)

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -510,6 +514,7 @@ integer, parameter :: NO_SYMM=0, EQ_SYMM=1, OCTANT=2

  call symmetry_stbd(3,ex,f,fh,SoA)

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -598,6 +603,7 @@ integer, parameter :: NO_SYMM=0, EQ_SYMM=1, OCTANT=2

  call symmetry_stbd(3,ex,f,fh,SoA)

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -694,6 +700,7 @@ real*8,intent(in) :: eps
 !                                              dx^8

 !  note the sign (-1)^r-1, now r=4
+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -794,6 +801,7 @@ integer, parameter :: NO_SYMM=0, EQ_SYMM=1, OCTANT=2

  call symmetry_stbd(4,ex,f,fh,SoA)

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -903,6 +911,7 @@ real*8,intent(in) :: eps
 !                                                              dx^10

 !  note the sign (-1)^r-1, now r=5
+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
@@ -1006,6 +1015,7 @@ integer, parameter :: NO_SYMM=0, EQ_SYMM=1, OCTANT=2

  call symmetry_stbd(5,ex,f,fh,SoA)

+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
--- a/AMSS_NCKU_source/lopsidediff.f90
+++ b/AMSS_NCKU_source/lopsidediff.f90
@@ -7,7 +7,164 @@
 ! Vertex or Cell is distinguished in routine symmetry_bd which locates in
 ! file "fmisc.f90"

+#if (ghost_width == 2)
+! second order code

+!-----------------------------------------------------------------------------
+!         v
+! D f = ------[ - 3 f  + 4 f   - f     ]
+!  i     2dx         i      i+v   i+2v
+!
+! where
+!
+!        i
+!      |B |
+! v = -----
+!        i
+!       B
+!
+!-----------------------------------------------------------------------------
+subroutine lopsided(ex,X,Y,Z,f,f_rhs,Sfx,Sfy,Sfz,Symmetry,SoA)
+  implicit none
+
+!~~~~~~> Input parameters:
+
+  integer, intent(in)  :: ex(1:3),Symmetry
+  real*8,  intent(in)  :: X(1:ex(1)),Y(1:ex(2)),Z(1:ex(3))
+  real*8,dimension(ex(1),ex(2),ex(3)),intent(in)   :: f,Sfx,Sfy,Sfz
+
+  real*8,dimension(ex(1),ex(2),ex(3)),intent(inout):: f_rhs
+  real*8,dimension(3),intent(in) ::SoA
+
+!~~~~~~> local variables:
+! note index -1,0, so we have 2 extra points
+  real*8,dimension(-1:ex(1),-1:ex(2),-1:ex(3))   :: fh
+  integer :: imin,jmin,kmin,imax,jmax,kmax,i,j,k
+  real*8 :: dX,dY,dZ
+  real*8 :: d2dx,d2dy,d2dz
+  real*8,  parameter :: ZEO=0.d0,ONE=1.d0,TWO=2.d0,THR=3.d0,FOUR=4.d0
+  integer, parameter :: NO_SYMM = 0, EQ_SYMM = 1, OCTANT = 2
+
+  dX = X(2)-X(1)
+  dY = Y(2)-Y(1)
+  dZ = Z(2)-Z(1)
+
+  d2dx = ONE/TWO/dX
+  d2dy = ONE/TWO/dY
+  d2dz = ONE/TWO/dZ
+
+  imax = ex(1)
+  jmax = ex(2)
+  kmax = ex(3)
+
+  imin = 1
+  jmin = 1
+  kmin = 1
+  if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin = -1
+  if(Symmetry > EQ_SYMM .and. dabs(X(1)) < dX) imin = -1
+  if(Symmetry > EQ_SYMM .and. dabs(Y(1)) < dY) jmin = -1
+
+  call symmetry_bd(2,ex,f,fh,SoA)
+
+! upper bound set ex-1 only for efficiency, 
+! the loop body will set ex 0 also
+  !$omp parallel do collapse(2) private(i,j,k)
+  do k=1,ex(3)-1
+  do j=1,ex(2)-1
+  do i=1,ex(1)-1
+! x direction   
+    if(Sfx(i,j,k) >= ZEO)then
+       if( i+2 <= imax .and. i >= imin)then
+!         v
+! D f = ------[ - 3 f  + 4 f   - f     ]
+!  i     2dx         i      i+v   i+2v
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                           &
+                  Sfx(i,j,k)*d2dx*(-THR*fh(i,j,k)+FOUR*fh(i+1,j,k)-fh(i+2,j,k))
+       elseif(i+1 <= imax .and. i >= imin)then
+!         v
+! D f = ------[ - f  + f   ]
+!  i      dx       i    i+v
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                           &
+                  Sfx(i,j,k)*d2dx*(-fh(i,j,k)+fh(i+1,j,k))
+
+       endif
+
+    elseif(Sfx(i,j,k) <= ZEO)then
+      if( i-2 >= imin .and. i <= imax)then
+     f_rhs(i,j,k)=f_rhs(i,j,k)-                           &
+                  Sfx(i,j,k)*d2dx*(-THR*fh(i,j,k)+FOUR*fh(i-1,j,k)-fh(i-2,j,k))
+      elseif(i-1 >= imin .and. i <= imax)then
+     f_rhs(i,j,k)=f_rhs(i,j,k)-                           &
+                  Sfx(i,j,k)*d2dx*(-fh(i,j,k)+fh(i-1,j,k))
+      endif
+
+! set imax and imin 0
+    endif
+
+! y direction   
+    if(Sfy(i,j,k) >= ZEO)then
+       if( j+2 <= jmax .and. j >= jmin)then
+!         v
+! D f = ------[ - 3 f  + 4 f   - f     ]
+!  i     2dx         i      i+v   i+2v
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                           &
+                  Sfy(i,j,k)*d2dy*(-THR*fh(i,j,k)+FOUR*fh(i,j+1,k)-fh(i,j+2,k))
+       elseif(j+1 <= jmax .and. j >= jmin)then
+!         v
+! D f = ------[ - f  + f   ]
+!  i      dx       i    i+v
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                           &
+                  Sfy(i,j,k)*d2dy*(-fh(i,j,k)+fh(i,j+1,k))
+       endif
+
+    elseif(Sfy(i,j,k) <= ZEO)then
+      if( j-2 >= jmin .and. j <= jmax)then
+     f_rhs(i,j,k)=f_rhs(i,j,k)-                           &
+                  Sfy(i,j,k)*d2dy*(-THR*fh(i,j,k)+FOUR*fh(i,j-1,k)-fh(i,j-2,k))
+      elseif(j-1 >= jmin .and. j <= jmax)then
+     f_rhs(i,j,k)=f_rhs(i,j,k)-                           &
+                  Sfy(i,j,k)*d2dy*(-fh(i,j,k)+fh(i,j-1,k))
+      endif
+
+! set jmin and jmax 0
+     endif
+!! z direction   
+    if(Sfz(i,j,k) >= ZEO)then
+      if( k+2 <= kmax .and. k >= kmin)then
+!         v
+! D f = ------[ - 3 f  + 4 f   - f     ]
+!  i     2dx         i      i+v   i+2v
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                           &
+                  Sfz(i,j,k)*d2dz*(-THR*fh(i,j,k)+FOUR*fh(i,j,k+1)-fh(i,j,k+2))
+       elseif(k+1 <= kmax .and. k >= kmin)then
+!         v
+! D f = ------[ - f  + f   ]
+!  i      dx       i    i+v
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                           &
+                  Sfz(i,j,k)*d2dz*(-fh(i,j,k)+fh(i,j,k+1))
+       endif
+
+    elseif(Sfz(i,j,k) <= ZEO)then
+      if( k-2 >= kmin .and. k <= kmax)then
+     f_rhs(i,j,k)=f_rhs(i,j,k)-                           &
+                  Sfz(i,j,k)*d2dz*(-THR*fh(i,j,k)+FOUR*fh(i,j,k-1)-fh(i,j,k-2))
+      elseif(k-1 >= kmin .and. k <= kmax)then
+     f_rhs(i,j,k)=f_rhs(i,j,k)-                           &
+                  Sfz(i,j,k)*d2dz*(-fh(i,j,k)+fh(i,j,k-1))
+      endif
+
+! set kmin and kmax 0
+     endif
+
+  enddo
+  enddo
+  enddo
+
+  return
+
+  end subroutine lopsided
+
+#elif (ghost_width == 3)
 ! fourth order code

 !-----------------------------------------------------------------------------
@@ -77,10 +234,93 @@ subroutine lopsided(ex,X,Y,Z,f,f_rhs,Sfx,Sfy,Sfz,Symmetry,SoA)

 ! upper bound set ex-1 only for efficiency, 
 ! the loop body will set ex 0 also
+  !$omp parallel do collapse(2) private(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
+#if 0  
+!! old code
+! x direction   
+    if(Sfx(i,j,k) >= ZEO .and. i+3 <= imax .and. i-1 >= imin)then
+!         v
+! D f = ------[ - 3f    - 10f  + 18f    - 6f     + f     ]
+!  i     12dx       i-v      i      i+v     i+2v    i+3v
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
+                  Sfx(i,j,k)*d12dx*(-F3*fh(i-1,j,k)-F10*fh(i,j,k)+F18*fh(i+1,j,k) &
+                                    -F6*fh(i+2,j,k)+    fh(i+3,j,k))

+    elseif(Sfx(i,j,k) <= ZEO .and. i-3 >= imin .and. i+1 <= imax)then
+     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
+                  Sfx(i,j,k)*d12dx*(-F3*fh(i+1,j,k)-F10*fh(i,j,k)+F18*fh(i-1,j,k) &
+                                    -F6*fh(i-2,j,k)+    fh(i-3,j,k))
+
+     elseif(i+2 <= imax .and. i-2 >= imin)then
+!
+!              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
+!  fx(i) = ---------------------------------------------
+!                             12 dx
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
+                  Sfx(i,j,k)*d12dx*(fh(i-2,j,k)-EIT*fh(i-1,j,k)+EIT*fh(i+1,j,k)-fh(i+2,j,k))
+
+     elseif(i+1 <= imax .and. i-1 >= imin)then
+!
+!              - f(i-1) + f(i+1)
+!  fx(i) = --------------------------------
+!                     2 dx
+     f_rhs(i,j,k)=f_rhs(i,j,k) + Sfx(i,j,k)*d2dx*(-fh(i-1,j,k)+fh(i+1,j,k))
+
+! set imax and imin 0
+    endif
+
+! y direction   
+    if(Sfy(i,j,k) >= ZEO .and. j+3 <= jmax .and. j-1 >= jmin)then
+!         v
+! D f = ------[ - 3f    - 10f  + 18f    - 6f     + f     ]
+!  i     12dx       i-v      i      i+v     i+2v    i+3v
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
+                  Sfy(i,j,k)*d12dy*(-F3*fh(i,j-1,k)-F10*fh(i,j,k)+F18*fh(i,j+1,k) &
+                                    -F6*fh(i,j+2,k)+    fh(i,j+3,k))
+
+    elseif(Sfy(i,j,k) <= ZEO .and. j-3 >= jmin .and. j+1 <= jmax)then
+     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
+                  Sfy(i,j,k)*d12dy*(-F3*fh(i,j+1,k)-F10*fh(i,j,k)+F18*fh(i,j-1,k) &
+                                    -F6*fh(i,j-2,k)+    fh(i,j-3,k))
+
+     elseif(j+2 <= jmax .and. j-2 >= jmin)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                            &
+                  Sfy(i,j,k)*d12dy*(fh(i,j-2,k)-EIT*fh(i,j-1,k)+EIT*fh(i,j+1,k)-fh(i,j+2,k))
+
+     elseif(j+1 <= jmax .and. j-1 >= jmin)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k) + Sfy(i,j,k)*d2dy*(-fh(i,j-1,k)+fh(i,j+1,k))
+! set jmin and jmax 0
+     endif
+!! z direction   
+    if(Sfz(i,j,k) >= ZEO .and. k+3 <= kmax .and. k-1 >= kmin)then
+!         v
+! D f = ------[ - 3f    - 10f  + 18f    - 6f     + f     ]
+!  i     12dx       i-v      i      i+v     i+2v    i+3v
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
+                  Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k-1)-F10*fh(i,j,k)+F18*fh(i,j,k+1) &
+                                    -F6*fh(i,j,k+2)+    fh(i,j,k+3))
+
+    elseif(Sfz(i,j,k) <= ZEO .and. k-3 >= kmin .and. k+1 <= kmax)then
+     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
+                  Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k+1)-F10*fh(i,j,k)+F18*fh(i,j,k-1) &
+                                    -F6*fh(i,j,k-2)+    fh(i,j,k-3))
+
+     elseif(k+2 <= kmax .and. k-2 >= kmin)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                            &
+                  Sfz(i,j,k)*d12dz*(fh(i,j,k-2)-EIT*fh(i,j,k-1)+EIT*fh(i,j,k+1)-fh(i,j,k+2))
+
+     elseif(k+1 <= kmax .and. k-1 >= kmin)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k)+Sfz(i,j,k)*d2dz*(-fh(i,j,k-1)+fh(i,j,k+1))
+! set kmin and kmax 0
+     endif
+#else
 !! new code, 2012dec27, based on bam
 ! x direction   
    if(Sfx(i,j,k) > ZEO)then
@@ -240,6 +480,7 @@ subroutine lopsided(ex,X,Y,Z,f,f_rhs,Sfx,Sfy,Sfz,Symmetry,SoA)
 ! set kmax and kmin 0
     endif
   endif
+#endif
  enddo
  enddo
  enddo
@@ -247,3 +488,419 @@ subroutine lopsided(ex,X,Y,Z,f,f_rhs,Sfx,Sfy,Sfz,Symmetry,SoA)
  return

  end subroutine lopsided
+
+#elif (ghost_width == 4)
+! sixth order code
+! Compute advection terms in right hand sides of field equations
+!         v
+! D f = ------[ 2f     - 24f    - 35f  + 80f    - 30f     + 8f     - f    ]
+!  i     60dx     i-2v      i-v      i      i+v      i+2v     i+3v    i+4v
+!
+! where
+!
+!        i
+!      |B |
+! v = -----
+!        i
+!       B
+!
+!-----------------------------------------------------------------------------
+subroutine lopsided(ex,X,Y,Z,f,f_rhs,Sfx,Sfy,Sfz,Symmetry,SoA)
+  implicit none
+
+!~~~~~~> Input parameters:
+
+  integer, intent(in)  :: ex(1:3),Symmetry
+  real*8,  intent(in)  :: X(1:ex(1)),Y(1:ex(2)),Z(1:ex(3))
+  real*8,dimension(ex(1),ex(2),ex(3)),intent(in)   :: f,Sfx,Sfy,Sfz
+
+  real*8,dimension(ex(1),ex(2),ex(3)),intent(inout):: f_rhs
+  real*8,dimension(3),intent(in) ::SoA
+
+!~~~~~~> local variables:
+
+  real*8,dimension(-3:ex(1),-3:ex(2),-3:ex(3))   :: fh
+  integer :: imin,jmin,kmin,imax,jmax,kmax,i,j,k
+  real*8 :: dX,dY,dZ
+  real*8 :: d60dx,d60dy,d60dz,d12dx,d12dy,d12dz,d2dx,d2dy,d2dz
+  real*8,  parameter :: ZEO=0.d0,ONE=1.d0, F60=6.d1
+  real*8,  parameter :: TWO=2.d0,F24=2.4d1,F35=3.5d1,F80=8.d1,F30=3.d1,EIT=8.d0
+  real*8,  parameter ::  F9=9.d0,F45=4.5d1,F12=1.2d1
+  real*8,  parameter ::  F10=1.d1,F77=7.7d1,F150=1.5d2,F100=1.d2,F50=5.d1,F15=1.5d1
+  integer, parameter :: NO_SYMM = 0, EQ_SYMM = 1, OCTANT = 2
+
+  dX = X(2)-X(1)
+  dY = Y(2)-Y(1)
+  dZ = Z(2)-Z(1)
+
+  d60dx = ONE/F60/dX
+  d60dy = ONE/F60/dY
+  d60dz = ONE/F60/dZ
+
+  d12dx = ONE/F12/dX
+  d12dy = ONE/F12/dY
+  d12dz = ONE/F12/dZ
+
+  d2dx = ONE/TWO/dX
+  d2dy = ONE/TWO/dY
+  d2dz = ONE/TWO/dZ
+
+  imax = ex(1)
+  jmax = ex(2)
+  kmax = ex(3)
+
+  imin = 1
+  jmin = 1
+  kmin = 1
+  if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin = -3
+  if(Symmetry > EQ_SYMM .and. dabs(X(1)) < dX) imin = -3
+  if(Symmetry > EQ_SYMM .and. dabs(Y(1)) < dY) jmin = -3
+
+  call symmetry_bd(4,ex,f,fh,SoA)
+
+! upper bound set ex-1 only for efficiency, 
+! the loop body will set ex 0 also
+  !$omp parallel do collapse(2) private(i,j,k)
+  do k=1,ex(3)-1
+  do j=1,ex(2)-1
+  do i=1,ex(1)-1
+! x direction   
+    if(Sfx(i,j,k) >= ZEO .and. i+4 <= imax .and. i-2 >= imin)then
+!         v
+! D f = ------[ 2f     - 24f    - 35f  + 80f    - 30f     + 8f     - f    ]
+!  i     60dx     i-2v      i-v      i      i+v      i+2v     i+3v    i+4v
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                             &
+                  Sfx(i,j,k)*d60dx*(TWO*fh(i-2,j,k)-F24*fh(i-1,j,k)-F35*fh(i,j,k)+F80*fh(i+1,j,k) &
+                                   -F30*fh(i+2,j,k)+EIT*fh(i+3,j,k)-    fh(i+4,j,k))
+    elseif(Sfx(i,j,k) >= ZEO .and. i+5 <= imax .and. i-1 >= imin)then
+!         v
+! D f = ------[-10f    - 77f  + 150f    - 100f     + 50f     -15f     + 2f    ]
+!  i     60dx      i-v      i       i+v       i+2v      i+3v     i+4v    i+5v
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                        &
+                  Sfx(i,j,k)*d60dx*(-F10*fh(i-1,j,k)-F77*fh(i  ,j,k)+F150*fh(i+1,j,k)-F100*fh(i+2,j,k) &
+                                    +F50*fh(i+3,j,k)-F15*fh(i+4,j,k)+ TWO*fh(i+5,j,k))
+
+    elseif(Sfx(i,j,k) <= ZEO .and. i-4 >= imin .and. i+2 <= imax)then
+     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                                   &
+                  Sfx(i,j,k)*d60dx*(TWO*fh(i+2,j,k)-F24*fh(i+1,j,k)-F35*fh(i,j,k)+F80*fh(i-1,j,k) &
+                                   -F30*fh(i-2,j,k)+EIT*fh(i-3,j,k)-    fh(i-4,j,k))
+    elseif(Sfx(i,j,k) <= ZEO .and. i-5 >= imin .and. i+1 <= imax)then
+     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                                        &
+                  Sfx(i,j,k)*d60dx*(-F10*fh(i+1,j,k)-F77*fh(i  ,j,k)+F150*fh(i-1,j,k)-F100*fh(i-2,j,k) &
+                                    +F50*fh(i-3,j,k)-F15*fh(i-4,j,k)+ TWO*fh(i-5,j,k))
+
+     elseif(i+3 <= imax .and. i-3 >= imin)then
+!           - f(i-3) + 9 f(i-2) - 45 f(i-1) + 45 f(i+1) - 9 f(i+2) + f(i+3)
+!  fx(i) = -----------------------------------------------------------------
+!                                        60 dx
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                              &
+                  Sfx(i,j,k)*d60dx*(-fh(i-3,j,k)+F9*fh(i-2,j,k)-F45*fh(i-1,j,k)+F45*fh(i+1,j,k)-F9*fh(i+2,j,k)+fh(i+3,j,k))
+
+     elseif(i+2 <= imax .and. i-2 >= imin)then
+!
+!              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
+!  fx(i) = ---------------------------------------------
+!                             12 dx
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
+                  Sfx(i,j,k)*d12dx*(fh(i-2,j,k)-EIT*fh(i-1,j,k)+EIT*fh(i+1,j,k)-fh(i+2,j,k))
+
+     elseif(i+1 <= imax .and. i-1 >= imin)then
+!
+!              - f(i-1) + f(i+1)
+!  fx(i) = --------------------------------
+!                     2 dx
+     f_rhs(i,j,k)=f_rhs(i,j,k) + Sfx(i,j,k)*d2dx*(-fh(i-1,j,k)+fh(i+1,j,k))
+
+! set imax and imin 0
+    endif
+
+! y direction   
+     if(Sfy(i,j,k) >= ZEO .and. j+4 <= jmax .and. j-2 >= jmin)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                   &
+                  Sfy(i,j,k)*d60dy*(TWO*fh(i,j-2,k)-F24*fh(i,j-1,k)-F35*fh(i,j,k)+F80*fh(i,j+1,k) &
+                                   -F30*fh(i,j+2,k)+EIT*fh(i,j+3,k)-    fh(i,j+4,k))
+     elseif(Sfy(i,j,k) >= ZEO .and. j+5 <= jmax .and. j-1 >= jmin)then
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                        &
+                  Sfy(i,j,k)*d60dy*(-F10*fh(i,j-1,k)-F77*fh(i,j  ,k)+F150*fh(i,j+1,k)-F100*fh(i,j+2,k) &
+                                    +F50*fh(i,j+3,k)-F15*fh(i,j+4,k)+ TWO*fh(i,j+5,k))
+
+     elseif(Sfy(i,j,k) <= ZEO .and. j-4 >= jmin .and. j+2 <= jmax)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                                   &
+                  Sfy(i,j,k)*d60dy*(TWO*fh(i,j+2,k)-F24*fh(i,j+1,k)-F35*fh(i,j,k)+F80*fh(i,j-1,k) &
+                                   -F30*fh(i,j-2,k)+EIT*fh(i,j-3,k)-    fh(i,j-4,k))
+
+     elseif(Sfy(i,j,k) <= ZEO .and. j-5 >= jmin .and. j+1 <= jmax)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                                        &
+                  Sfy(i,j,k)*d60dy*(-F10*fh(i,j+1,k)-F77*fh(i,j  ,k)+F150*fh(i,j-1,k)-F100*fh(i,j-2,k) &
+                                    +F50*fh(i,j-3,k)-F15*fh(i,j-4,k)+ TWO*fh(i,j-5,k))
+
+     elseif(j+3 <= jmax .and. j-3 >= jmin)then
+          
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                                         &
+                  Sfy(i,j,k)*d60dy*(-fh(i,j-3,k)+F9*fh(i,j-2,k)-F45*fh(i,j-1,k)+F45*fh(i,j+1,k)-F9*fh(i,j+2,k)+fh(i,j+3,k))
+
+     elseif(j+2 <= jmax .and. j-2 >= jmin)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                            &
+                  Sfy(i,j,k)*d12dy*(fh(i,j-2,k)-EIT*fh(i,j-1,k)+EIT*fh(i,j+1,k)-fh(i,j+2,k))
+
+     elseif(j+1 <= jmax .and. j-1 >= jmin)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k) + Sfy(i,j,k)*d2dy*(-fh(i,j-1,k)+fh(i,j+1,k))
+! set jmin and jmax 0
+     endif
+!! z direction   
+     if(Sfz(i,j,k) >= ZEO .and. k+4 <= kmax .and. k-2 >= kmin)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                   &
+                  Sfz(i,j,k)*d60dz*(TWO*fh(i,j,k-2)-F24*fh(i,j,k-1)-F35*fh(i,j,k)+F80*fh(i,j,k+1) &
+                                   -F30*fh(i,j,k+2)+EIT*fh(i,j,k+3)-    fh(i,j,k+4))
+     elseif(Sfz(i,j,k) >= ZEO .and. k+5 <= kmax .and. k-1 >= kmin)then
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                        &
+                  Sfz(i,j,k)*d60dz*(-F10*fh(i,j,k-1)-F77*fh(i,j,k  )+F150*fh(i,j,k+1)-F100*fh(i,j,k+2) &
+                                    +F50*fh(i,j,k+3)-F15*fh(i,j,k+4)+ TWO*fh(i,j,k+5))
+
+     elseif(Sfz(i,j,k) <= ZEO .and. k-4 >= kmin .and. k+2 <= kmax)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                                   &
+                  Sfz(i,j,k)*d60dz*(TWO*fh(i,j,k+2)-F24*fh(i,j,k+1)-F35*fh(i,j,k)+F80*fh(i,j,k-1) &
+                                   -F30*fh(i,j,k-2)+EIT*fh(i,j,k-3)-    fh(i,j,k-4))
+
+     elseif(Sfz(i,j,k) <= ZEO .and. k-5 >= kmin .and. k+1 <= kmax)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                                        &
+                  Sfz(i,j,k)*d60dz*(-F10*fh(i,j,k+1)-F77*fh(i,j,k  )+F150*fh(i,j,k-1)-F100*fh(i,j,k-2) &
+                                    +F50*fh(i,j,k-3)-F15*fh(i,j,k-4)+ TWO*fh(i,j,k-5))
+     
+     elseif(k+3 <= kmax .and. k-3 >= kmin)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                                         &
+                  Sfz(i,j,k)*d60dz*(-fh(i,j,k-3)+F9*fh(i,j,k-2)-F45*fh(i,j,k-1)+F45*fh(i,j,k+1)-F9*fh(i,j,k+2)+fh(i,j,k+3))
+
+     elseif(k+2 <= kmax .and. k-2 >= kmin)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                            &
+                  Sfz(i,j,k)*d12dz*(fh(i,j,k-2)-EIT*fh(i,j,k-1)+EIT*fh(i,j,k+1)-fh(i,j,k+2))
+
+     elseif(k+1 <= kmax .and. k-1 >= kmin)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k)+Sfz(i,j,k)*d2dz*(-fh(i,j,k-1)+fh(i,j,k+1))
+! set kmin and kmax 0
+     endif
+
+  enddo
+  enddo
+  enddo
+
+  return
+
+  end subroutine lopsided
+
+#elif (ghost_width == 5)
+! eighth order code
+!-----------------------------------------------------------------------------
+! PRD 77, 024034 (2008)
+! Compute advection terms in right hand sides of field equations
+!        v [ - 5 f(i-3v) + 60 f(i-2v) - 420 f(i-v) - 378 f(i) + 1050 f(i+v) - 420 f(i+2v) + 140 f(i+3v) - 30 f(i+4v) + 3 f(i+5v)]
+! D f = --------------------------------------------------------------------------------------------------------------------------
+!  i                                                             840 dx           
+!
+! where
+!
+!        i
+!      |B |
+! v = -----
+!        i
+!       B
+!
+!-----------------------------------------------------------------------------
+subroutine lopsided(ex,X,Y,Z,f,f_rhs,Sfx,Sfy,Sfz,Symmetry,SoA)
+  implicit none
+
+!~~~~~~> Input parameters:
+
+  integer, intent(in)  :: ex(1:3),Symmetry
+  real*8,  intent(in)  :: X(1:ex(1)),Y(1:ex(2)),Z(1:ex(3))
+  real*8,dimension(ex(1),ex(2),ex(3)),intent(in)   :: f,Sfx,Sfy,Sfz
+
+  real*8,dimension(ex(1),ex(2),ex(3)),intent(inout):: f_rhs
+  real*8,dimension(3),intent(in) ::SoA
+
+!~~~~~~> local variables:
+
+  real*8,dimension(-4:ex(1),-4:ex(2),-4:ex(3))   :: fh
+  integer :: imin,jmin,kmin,imax,jmax,kmax,i,j,k
+  real*8 :: dX,dY,dZ
+  real*8 :: d840dx,d840dy,d840dz,d60dx,d60dy,d60dz,d12dx,d12dy,d12dz,d2dx,d2dy,d2dz
+  real*8,  parameter :: ZEO=0.d0,ONE=1.d0, F60=6.d1
+  real*8,  parameter :: TWO=2.d0,F30=3.d1,EIT=8.d0
+  real*8,  parameter ::  F9=9.d0,F45=4.5d1,F12=1.2d1,F140=1.4d2,THR=3.d0
+  real*8,  parameter :: F840=8.4d2,F5=5.d0,F420=4.2d2,F378=3.78d2,F1050=1.05d3
+  real*8,  parameter :: F32=3.2d1,F168=1.68d2,F672=6.72d2
+  integer, parameter :: NO_SYMM = 0, EQ_SYMM = 1, OCTANT = 2
+
+  dX = X(2)-X(1)
+  dY = Y(2)-Y(1)
+  dZ = Z(2)-Z(1)
+
+  d840dx = ONE/F840/dX
+  d840dy = ONE/F840/dY
+  d840dz = ONE/F840/dZ
+
+  d60dx = ONE/F60/dX
+  d60dy = ONE/F60/dY
+  d60dz = ONE/F60/dZ
+
+  d12dx = ONE/F12/dX
+  d12dy = ONE/F12/dY
+  d12dz = ONE/F12/dZ
+
+  d2dx = ONE/TWO/dX
+  d2dy = ONE/TWO/dY
+  d2dz = ONE/TWO/dZ
+
+  imax = ex(1)
+  jmax = ex(2)
+  kmax = ex(3)
+
+  imin = 1
+  jmin = 1
+  kmin = 1
+  if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin = -4
+  if(Symmetry > EQ_SYMM .and. dabs(X(1)) < dX) imin = -4
+  if(Symmetry > EQ_SYMM .and. dabs(Y(1)) < dY) jmin = -4
+
+  call symmetry_bd(5,ex,f,fh,SoA)
+
+! upper bound set ex-1 only for efficiency, 
+! the loop body will set ex 0 also
+  !$omp parallel do collapse(2) private(i,j,k)
+  do k=1,ex(3)-1
+  do j=1,ex(2)-1
+  do i=1,ex(1)-1
+! x direction   
+    if(Sfx(i,j,k) >= ZEO .and. i+5 <= imax .and. i-3 >= imin)then
+!        v [ - 5 f(i-3v) + 60 f(i-2v) - 420 f(i-v) - 378 f(i) + 1050 f(i+v) - 420 f(i+2v) + 140 f(i+3v) - 30 f(i+4v) + 3 f(i+5v)]
+! D f = --------------------------------------------------------------------------------------------------------------------------
+!  i                                                             840 dx    
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                         &
+                  Sfx(i,j,k)*d840dx*(-F5*fh(i-3,j,k)+F60 *fh(i-2,j,k)-F420*fh(i-1,j,k)-F378*fh(i  ,j,k) &
+                                  +F1050*fh(i+1,j,k)-F420*fh(i+2,j,k)+F140*fh(i+3,j,k)-F30 *fh(i+4,j,k)+THR*fh(i+5,j,k))
+
+    elseif(Sfx(i,j,k) <= ZEO .and. i-5 >= imin .and. i+3 <= imax)then
+     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                                          &
+                  Sfx(i,j,k)*d840dx*(-F5*fh(i+3,j,k)+F60 *fh(i+2,j,k)-F420*fh(i+1,j,k)-F378*fh(i   ,j,k) &
+                                  +F1050*fh(i-1,j,k)-F420*fh(i-2,j,k)+F140*fh(i-3,j,k)- F30*fh(i-4,j,k)+THR*fh(i-5,j,k))
+
+    elseif(i+4 <= imax .and. i-4 >= imin)then
+!           3 f(i-4) - 32 f(i-3) + 168 f(i-2) - 672 f(i-1) + 672 f(i+1) - 168 f(i+2) + 32 f(i+3) - 3 f(i+4)
+!  fx(i) = -------------------------------------------------------------------------------------------------
+!                                                        840 dx
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                              &
+                  Sfx(i,j,k)*d840dx*( THR*fh(i-4,j,k)-F32 *fh(i-3,j,k)+F168*fh(i-2,j,k)-F672*fh(i-1,j,k)+    &
+                                     F672*fh(i+1,j,k)-F168*fh(i+2,j,k)+F32 *fh(i+3,j,k)-THR *fh(i+4,j,k))
+
+     elseif(i+3 <= imax .and. i-3 >= imin)then
+!           - f(i-3) + 9 f(i-2) - 45 f(i-1) + 45 f(i+1) - 9 f(i+2) + f(i+3)
+!  fx(i) = -----------------------------------------------------------------
+!                                        60 dx
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                              &
+                  Sfx(i,j,k)*d60dx*(-fh(i-3,j,k)+F9*fh(i-2,j,k)-F45*fh(i-1,j,k)+F45*fh(i+1,j,k)-F9*fh(i+2,j,k)+fh(i+3,j,k))
+
+     elseif(i+2 <= imax .and. i-2 >= imin)then
+!
+!              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
+!  fx(i) = ---------------------------------------------
+!                             12 dx
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
+                  Sfx(i,j,k)*d12dx*(fh(i-2,j,k)-EIT*fh(i-1,j,k)+EIT*fh(i+1,j,k)-fh(i+2,j,k))
+
+     elseif(i+1 <= imax .and. i-1 >= imin)then
+!
+!              - f(i-1) + f(i+1)
+!  fx(i) = --------------------------------
+!                     2 dx
+     f_rhs(i,j,k)=f_rhs(i,j,k) + Sfx(i,j,k)*d2dx*(-fh(i-1,j,k)+fh(i+1,j,k))
+
+! set imax and imin 0
+    endif
+
+! y direction   
+    if(Sfy(i,j,k) >= ZEO .and. j+5 <= jmax .and. j-3 >= jmin)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                         &
+                  Sfy(i,j,k)*d840dy*(-F5*fh(i,j-3,k)+F60 *fh(i,j-2,k)-F420*fh(i,j-1,k)-F378*fh(i,j  ,k) &
+                                  +F1050*fh(i,j+1,k)-F420*fh(i,j+2,k)+F140*fh(i,j+3,k)-F30 *fh(i,j+4,k)+THR*fh(i,j+5,k))
+
+    elseif(Sfy(i,j,k) <= ZEO .and. j-5 >= jmin .and. j+3 <= jmax)then
+     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                                         &
+                  Sfy(i,j,k)*d840dy*(-F5*fh(i,j+3,k)+F60 *fh(i,j+2,k)-F420*fh(i,j+1,k)-F378*fh(i,j  ,k) &
+                                  +F1050*fh(i,j-1,k)-F420*fh(i,j-2,k)+F140*fh(i,j-3,k)- F30*fh(i,j-4,k)+THR*fh(i,j-5,k))
+
+    elseif(j+4 <= jmax .and. j-4 >= jmin)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                              &
+                  Sfy(i,j,k)*d840dy*( THR*fh(i,j-4,k)-F32 *fh(i,j-3,k)+F168*fh(i,j-2,k)-F672*fh(i,j-1,k)+    &
+                                     F672*fh(i,j+1,k)-F168*fh(i,j+2,k)+F32 *fh(i,j+3,k)-THR *fh(i,j+4,k))
+
+     elseif(j+3 <= jmax .and. j-3 >= jmin)then
+          
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                                         &
+                  Sfy(i,j,k)*d60dy*(-fh(i,j-3,k)+F9*fh(i,j-2,k)-F45*fh(i,j-1,k)+F45*fh(i,j+1,k)-F9*fh(i,j+2,k)+fh(i,j+3,k))
+
+     elseif(j+2 <= jmax .and. j-2 >= jmin)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                            &
+                  Sfy(i,j,k)*d12dy*(fh(i,j-2,k)-EIT*fh(i,j-1,k)+EIT*fh(i,j+1,k)-fh(i,j+2,k))
+
+     elseif(j+1 <= jmax .and. j-1 >= jmin)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k) + Sfy(i,j,k)*d2dy*(-fh(i,j-1,k)+fh(i,j+1,k))
+! set jmin and jmax 0
+     endif
+!! z direction   
+    if(Sfz(i,j,k) >= ZEO .and. k+5 <= kmax .and. k-3 >= kmin)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                         &
+                  Sfz(i,j,k)*d840dz*(-F5*fh(i,j,k-3)+F60 *fh(i,j,k-2)-F420*fh(i,j,k-1)-F378*fh(i,j,k  ) &
+                                  +F1050*fh(i,j,k+1)-F420*fh(i,j,k+2)+F140*fh(i,j,k+3)-F30 *fh(i,j,k+4)+THR*fh(i,j,k+5))
+
+    elseif(Sfz(i,j,k) <= ZEO .and. k-5 >= kmin .and. k+3 <= kmax)then
+     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                                         &
+                  Sfz(i,j,k)*d840dz*(-F5*fh(i,j,k+3)+F60 *fh(i,j,k+2)-F420*fh(i,j,k+1)-F378*fh(i,j,k  ) &
+                                  +F1050*fh(i,j,k-1)-F420*fh(i,j,k-2)+F140*fh(i,j,k-3)- F30*fh(i,j,k-4)+THR*fh(i,j,k-5))
+
+    elseif(k+4 <= kmax .and. k-4 >= kmin)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                              &
+                  Sfz(i,j,k)*d840dz*( THR*fh(i,j,k-4)-F32 *fh(i,j,k-3)+F168*fh(i,j,k-2)-F672*fh(i,j,k-1)+    &
+                                     F672*fh(i,j,k+1)-F168*fh(i,j,k+2)+F32 *fh(i,j,k+3)-THR *fh(i,j,k+4))
+     
+     elseif(k+3 <= kmax .and. k-3 >= kmin)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                                         &
+                  Sfz(i,j,k)*d60dz*(-fh(i,j,k-3)+F9*fh(i,j,k-2)-F45*fh(i,j,k-1)+F45*fh(i,j,k+1)-F9*fh(i,j,k+2)+fh(i,j,k+3))
+
+     elseif(k+2 <= kmax .and. k-2 >= kmin)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                            &
+                  Sfz(i,j,k)*d12dz*(fh(i,j,k-2)-EIT*fh(i,j,k-1)+EIT*fh(i,j,k+1)-fh(i,j,k+2))
+
+     elseif(k+1 <= kmax .and. k-1 >= kmin)then
+
+     f_rhs(i,j,k)=f_rhs(i,j,k)+Sfz(i,j,k)*d2dz*(-fh(i,j,k-1)+fh(i,j,k+1))
+! set kmin and kmax 0
+     endif
+
+  enddo
+  enddo
+  enddo
+
+  return
+
+  end subroutine lopsided
+
+#endif  
--- a/AMSS_NCKU_source/makefile
+++ b/AMSS_NCKU_source/makefile
@@ -96,7 +96,7 @@ misc.o : zbesh.o

 # projects
 ABE: $(C++FILES) $(F90FILES) $(F77FILES) $(AHFDOBJS)
-	$(CLINKER) $(CXXAPPFLAGS) -o $@ $(C++FILES) $(F90FILES) $(F77FILES) $(AHFDOBJS) $(LDLIBS)
+	$(CLINKER) $(CXXAPPFLAGS) -qopenmp -o $@ $(C++FILES) $(F90FILES) $(F77FILES) $(AHFDOBJS) $(LDLIBS)
 	
 ABEGPU: $(C++FILES_GPU) $(F90FILES) $(F77FILES) $(AHFDOBJS) $(CUDAFILES)
 	$(CLINKER) $(CXXAPPFLAGS) -o $@ $(C++FILES_GPU) $(F90FILES) $(F77FILES) $(AHFDOBJS) $(CUDAFILES) $(LDLIBS)
--- a/AMSS_NCKU_source/makefile.inc
+++ b/AMSS_NCKU_source/makefile.inc
@@ -6,24 +6,23 @@
 ## Intel oneAPI version with oneMKL (Optimized for performance)
 filein  = -I/usr/include/ -I${MKLROOT}/include

-## Using sequential MKL (OpenMP disabled for better single-threaded performance)
+## Using OpenMP-threaded MKL for parallel performance
 ## Added -lifcore for Intel Fortran runtime and -limf for Intel math library
-LDLIBS  = -L${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lifcore -limf -lpthread -lm -ldl
+LDLIBS  = -L${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lifcore -limf -lpthread -lm -ldl

-## Aggressive optimization flags:
-## -O3: Maximum optimization
-## -xHost: Optimize for the host CPU architecture (Intel/AMD compatible)
-## -fp-model fast=2: Aggressive floating-point optimizations
-## -fma: Enable fused multiply-add instructions
-CXXAPPFLAGS  = -O3 -xHost -fp-model fast=2 -fma -ipo \
-               -Dfortran3 -Dnewc -I${MKLROOT}/include
-f90appflags  = -O3 -xHost -fp-model fast=2 -fma -ipo \
+## Aggressive optimization flags + PGO Phase 2 (profile-guided optimization)
+## -fprofile-instr-use: use collected profile data to guide optimization decisions
+##   (branch prediction, basic block layout, inlining, loop unrolling)
+PROFDATA     = /home/amss/AMSS-NCKU/pgo_profile/default.profdata
+CXXAPPFLAGS  = -O3 -march=native -fp-model fast=2 -fma -ipo -qopenmp \
+               -DMPI_STUB -Dfortran3 -Dnewc -I${MKLROOT}/include
+f90appflags  = -O3 -march=native -fp-model fast=2 -fma -ipo -qopenmp \
               -align array64byte -fpp -I${MKLROOT}/include
 f90          = ifx
 f77          = ifx
 CXX          = icpx
 CC           = icx
-CLINKER      = mpiicpx 
+CLINKER      = icpx

 Cu = nvcc
 CUDA_LIB_PATH = -L/usr/lib/cuda/lib64 -I/usr/include -I/usr/lib/cuda/include
--- a/AMSS_NCKU_source/misc.C
+++ b/AMSS_NCKU_source/misc.C
@@ -14,7 +14,11 @@ using namespace std;
 #include <string.h>
 #include <math.h>
 #endif
+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 #include "misc.h"
 #include "macrodef.h"
--- a/AMSS_NCKU_source/misc.h
+++ b/AMSS_NCKU_source/misc.h
@@ -24,7 +24,11 @@ using namespace std;
 #include <complex.h>
 #endif

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 namespace misc
 {
--- a/AMSS_NCKU_source/monitor.h
+++ b/AMSS_NCKU_source/monitor.h
@@ -20,7 +20,11 @@ using namespace std;
 #endif
 #include <time.h>

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 class monitor
 {
--- a/AMSS_NCKU_source/mpi_stub.h
+++ b/AMSS_NCKU_source/mpi_stub.h
@@ -0,0 +1,153 @@
+#ifndef MPI_STUB_H
+#define MPI_STUB_H
+
+/*
+ * MPI Stub Header — single-process shim for AMSS-NCKU ABE solver.
+ * Provides all MPI types, constants, and functions used in the codebase
+ * as no-ops or trivial implementations for nprocs=1, myrank=0.
+ */
+
+#include <cstring>
+#include <cstdlib>
+#include <cstdio>
+#include <time.h>
+
+/* ── Types ─────────────────────────────────────────────────────────── */
+
+typedef int MPI_Comm;
+typedef int MPI_Datatype;
+typedef int MPI_Op;
+typedef int MPI_Request;
+typedef int MPI_Group;
+
+typedef struct MPI_Status {
+    int MPI_SOURCE;
+    int MPI_TAG;
+    int MPI_ERROR;
+} MPI_Status;
+
+/* ── Constants ─────────────────────────────────────────────────────── */
+
+#define MPI_COMM_WORLD  0
+
+#define MPI_INT              1
+#define MPI_DOUBLE           2
+#define MPI_DOUBLE_PRECISION 2
+#define MPI_DOUBLE_INT       3
+
+#define MPI_SUM    1
+#define MPI_MAX    2
+#define MPI_MAXLOC 3
+
+#define MPI_STATUS_IGNORE   ((MPI_Status *)0)
+#define MPI_STATUSES_IGNORE ((MPI_Status *)0)
+
+#define MPI_MAX_PROCESSOR_NAME 256
+
+/* ── Helper: sizeof for MPI_Datatype ──────────────────────────────── */
+
+static inline size_t mpi_stub_sizeof(MPI_Datatype type) {
+    switch (type) {
+        case MPI_INT:        return sizeof(int);
+        case MPI_DOUBLE:     return sizeof(double);
+        case MPI_DOUBLE_INT: return sizeof(double) + sizeof(int);
+        default:             return 0;
+    }
+}
+
+/* ── Init / Finalize ──────────────────────────────────────────────── */
+
+static inline int MPI_Init(int *, char ***) { return 0; }
+static inline int MPI_Finalize() { return 0; }
+
+/* ── Communicator queries ─────────────────────────────────────────── */
+
+static inline int MPI_Comm_rank(MPI_Comm, int *rank) { *rank = 0; return 0; }
+static inline int MPI_Comm_size(MPI_Comm, int *size) { *size = 1; return 0; }
+static inline int MPI_Comm_split(MPI_Comm comm, int, int, MPI_Comm *newcomm) {
+    *newcomm = comm;
+    return 0;
+}
+static inline int MPI_Comm_free(MPI_Comm *) { return 0; }
+
+/* ── Group operations ─────────────────────────────────────────────── */
+
+static inline int MPI_Comm_group(MPI_Comm, MPI_Group *group) {
+    *group = 0;
+    return 0;
+}
+static inline int MPI_Group_translate_ranks(MPI_Group, int n,
+        const int *ranks1, MPI_Group, int *ranks2) {
+    for (int i = 0; i < n; ++i) ranks2[i] = ranks1[i];
+    return 0;
+}
+static inline int MPI_Group_free(MPI_Group *) { return 0; }
+
+/* ── Collective operations ────────────────────────────────────────── */
+
+static inline int MPI_Allreduce(const void *sendbuf, void *recvbuf,
+        int count, MPI_Datatype datatype, MPI_Op, MPI_Comm) {
+    std::memcpy(recvbuf, sendbuf, count * mpi_stub_sizeof(datatype));
+    return 0;
+}
+
+static inline int MPI_Iallreduce(const void *sendbuf, void *recvbuf,
+        int count, MPI_Datatype datatype, MPI_Op, MPI_Comm,
+        MPI_Request *request) {
+    std::memcpy(recvbuf, sendbuf, count * mpi_stub_sizeof(datatype));
+    *request = 0;
+    return 0;
+}
+
+static inline int MPI_Bcast(void *, int, MPI_Datatype, int, MPI_Comm) {
+    return 0;
+}
+
+static inline int MPI_Barrier(MPI_Comm) { return 0; }
+
+/* ── Point-to-point (never reached with nprocs=1) ─────────────────── */
+
+static inline int MPI_Send(const void *, int, MPI_Datatype, int, int, MPI_Comm) {
+    return 0;
+}
+static inline int MPI_Recv(void *, int, MPI_Datatype, int, int, MPI_Comm, MPI_Status *) {
+    return 0;
+}
+static inline int MPI_Isend(const void *, int, MPI_Datatype, int, int, MPI_Comm,
+        MPI_Request *req) {
+    *req = 0;
+    return 0;
+}
+static inline int MPI_Irecv(void *, int, MPI_Datatype, int, int, MPI_Comm,
+        MPI_Request *req) {
+    *req = 0;
+    return 0;
+}
+
+/* ── Completion ───────────────────────────────────────────────────── */
+
+static inline int MPI_Wait(MPI_Request *, MPI_Status *) { return 0; }
+static inline int MPI_Waitall(int, MPI_Request *, MPI_Status *) { return 0; }
+
+/* ── Utility ──────────────────────────────────────────────────────── */
+
+static inline int MPI_Abort(MPI_Comm, int error_code) {
+    std::fprintf(stderr, "MPI_Abort called with error code %d\n", error_code);
+    std::exit(error_code);
+    return 0;
+}
+
+static inline double MPI_Wtime() {
+    struct timespec ts;
+    clock_gettime(CLOCK_MONOTONIC, &ts);
+    return (double)ts.tv_sec + (double)ts.tv_nsec * 1.0e-9;
+}
+
+static inline int MPI_Get_processor_name(char *name, int *resultlen) {
+    const char *stub_name = "localhost";
+    std::strcpy(name, stub_name);
+    *resultlen = (int)std::strlen(stub_name);
+    return 0;
+}
+
+#endif /* MPI_STUB_H */
--- a/AMSS_NCKU_source/parameters.h
+++ b/AMSS_NCKU_source/parameters.h
@@ -24,7 +24,11 @@ using namespace std;
 #include <map.h>
 #endif

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 namespace parameters
 {
--- a/AMSS_NCKU_source/perf.h
+++ b/AMSS_NCKU_source/perf.h
@@ -30,7 +30,11 @@ using namespace std;
 #include <sys/time.h>
 #include <sys/resource.h>

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 /* Real time */
 #define TimerSignal SIGALRM
--- a/AMSS_NCKU_source/rungekutta4_rout.f90
+++ b/AMSS_NCKU_source/rungekutta4_rout.f90
@@ -109,23 +109,33 @@

  if( RK4 == 0 ) then

+   !$omp parallel workshare
   f1 = f0 + HLF * dT * f_rhs
+   !$omp end parallel workshare

  elseif(RK4 == 1 ) then

+   !$omp parallel workshare
   f_rhs = f_rhs + TWO * f1
-
+   !$omp end parallel workshare
+   !$omp parallel workshare
   f1 = f0 + HLF * dT * f1
+   !$omp end parallel workshare

  elseif(RK4 == 2 ) then

+   !$omp parallel workshare
   f_rhs = f_rhs + TWO * f1
-
+   !$omp end parallel workshare
+   !$omp parallel workshare
   f1 = f0 +       dT * f1
+   !$omp end parallel workshare

  elseif( RK4 == 3 ) then

+   !$omp parallel workshare
   f1 = f0 +F1o6 * dT *(f1 + f_rhs)
+   !$omp end parallel workshare

  else

@@ -215,11 +225,15 @@

  if( RK4 == 0 ) then

+   !$omp parallel workshare
   f1 = f0 + dT * f_rhs
+   !$omp end parallel workshare

  else

+   !$omp parallel workshare
   f1 = f0 + HLF * dT * (f1+f_rhs)
+   !$omp end parallel workshare

  endif

@@ -239,7 +253,9 @@
  real*8, dimension(ex(1),ex(2),ex(3)),intent(in) ::f_rhs
  real*8, dimension(ex(1),ex(2),ex(3)),intent(out) ::f1

+  !$omp parallel workshare
   f1 = f0 + dT * f_rhs
+  !$omp end parallel workshare

  return

--- a/AMSS_NCKU_source/scalar_class.h
+++ b/AMSS_NCKU_source/scalar_class.h
@@ -19,7 +19,11 @@ using namespace std;
 #include <math.h>
 #endif

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 #include "cgh.h"
 #include "ShellPatch.h"
--- a/AMSS_NCKU_source/scalarwaves.C
+++ b/AMSS_NCKU_source/scalarwaves.C
@@ -18,7 +18,11 @@ using namespace std;
 #include <math.h>
 #endif

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 #include "misc.h"
 #include "microdef.h"
--- a/AMSS_NCKU_source/setup.C
+++ b/AMSS_NCKU_source/setup.C
@@ -3,7 +3,11 @@
 #include <math.h>
 #include <string.h>

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 #include "util_Table.h"
 #include "cctk.h"
--- a/AMSS_NCKU_source/surface_integral.C
+++ b/AMSS_NCKU_source/surface_integral.C
@@ -20,7 +20,11 @@ using namespace std;
 #include <math.h>
 #include <map.h>
 #endif
+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 #include "misc.h"
 #include "cgh.h"
@@ -220,16 +224,9 @@ void surface_integral::surf_Wave(double rex, int lev, cgh *GH, var *Rpsi4, var *
    pox[2][n] = rex * nz_g[n];
  }

-  double *shellf;
-  shellf = new double[n_tot * InList];
-
-  GH->PatL[lev]->data->Interp_Points(DG_List, n_tot, pox, shellf, Symmetry);
-
  int mp, Lp, Nmin, Nmax;
-
  mp = n_tot / cpusize;
  Lp = n_tot - cpusize * mp;
-
  if (Lp > myrank)
  {
    Nmin = myrank * mp + myrank;
@@ -241,6 +238,11 @@ void surface_integral::surf_Wave(double rex, int lev, cgh *GH, var *Rpsi4, var *
    Nmax = Nmin + mp - 1;
  }

+  double *shellf;
+  shellf = new double[n_tot * InList];
+
+  GH->PatL[lev]->data->Interp_Points(DG_List, n_tot, pox, shellf, Symmetry, Nmin, Nmax);
+
  //|~~~~~> Integrate the dot product of Dphi with the surface normal.

  double *RP_out, *IP_out;
@@ -363,8 +365,17 @@ void surface_integral::surf_Wave(double rex, int lev, cgh *GH, var *Rpsi4, var *
  }
  //|------+  Communicate and sum the results from each processor.

-  MPI_Allreduce(RP_out, RP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(IP_out, IP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+  {
+    double *RPIP_out = new double[2 * NN];
+    double *RPIP = new double[2 * NN];
+    memcpy(RPIP_out, RP_out, NN * sizeof(double));
+    memcpy(RPIP_out + NN, IP_out, NN * sizeof(double));
+    MPI_Allreduce(RPIP_out, RPIP, 2 * NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    memcpy(RP, RPIP, NN * sizeof(double));
+    memcpy(IP, RPIP + NN, NN * sizeof(double));
+    delete[] RPIP_out;
+    delete[] RPIP;
+  }

  //|------= Free memory.

@@ -556,8 +567,17 @@ void surface_integral::surf_Wave(double rex, int lev, cgh *GH, var *Rpsi4, var *
  }
  //|------+  Communicate and sum the results from each processor.

-  MPI_Allreduce(RP_out, RP, NN, MPI_DOUBLE, MPI_SUM, Comm_here);
-  MPI_Allreduce(IP_out, IP, NN, MPI_DOUBLE, MPI_SUM, Comm_here);
+  {
+    double *RPIP_out = new double[2 * NN];
+    double *RPIP = new double[2 * NN];
+    memcpy(RPIP_out, RP_out, NN * sizeof(double));
+    memcpy(RPIP_out + NN, IP_out, NN * sizeof(double));
+    MPI_Allreduce(RPIP_out, RPIP, 2 * NN, MPI_DOUBLE, MPI_SUM, Comm_here);
+    memcpy(RP, RPIP, NN * sizeof(double));
+    memcpy(IP, RPIP + NN, NN * sizeof(double));
+    delete[] RPIP_out;
+    delete[] RPIP;
+  }

  //|------= Free memory.

@@ -735,8 +755,17 @@ void surface_integral::surf_Wave(double rex, int lev, ShellPatch *GH, var *Rpsi4
  }
  //|------+  Communicate and sum the results from each processor.

-  MPI_Allreduce(RP_out, RP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(IP_out, IP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+  {
+    double *RPIP_out = new double[2 * NN];
+    double *RPIP = new double[2 * NN];
+    memcpy(RPIP_out, RP_out, NN * sizeof(double));
+    memcpy(RPIP_out + NN, IP_out, NN * sizeof(double));
+    MPI_Allreduce(RPIP_out, RPIP, 2 * NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    memcpy(RP, RPIP, NN * sizeof(double));
+    memcpy(IP, RPIP + NN, NN * sizeof(double));
+    delete[] RPIP_out;
+    delete[] RPIP;
+  }

  //|------= Free memory.

@@ -984,8 +1013,17 @@ void surface_integral::surf_Wave(double rex, int lev, ShellPatch *GH,
  }
  //|------+  Communicate and sum the results from each processor.

-  MPI_Allreduce(RP_out, RP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(IP_out, IP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+  {
+    double *RPIP_out = new double[2 * NN];
+    double *RPIP = new double[2 * NN];
+    memcpy(RPIP_out, RP_out, NN * sizeof(double));
+    memcpy(RPIP_out + NN, IP_out, NN * sizeof(double));
+    MPI_Allreduce(RPIP_out, RPIP, 2 * NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    memcpy(RP, RPIP, NN * sizeof(double));
+    memcpy(IP, RPIP + NN, NN * sizeof(double));
+    delete[] RPIP_out;
+    delete[] RPIP;
+  }

  //|------= Free memory.

@@ -1419,8 +1457,17 @@ void surface_integral::surf_Wave(double rex, int lev, ShellPatch *GH,
  }
  //|------+  Communicate and sum the results from each processor.

-  MPI_Allreduce(RP_out, RP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(IP_out, IP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+  {
+    double *RPIP_out = new double[2 * NN];
+    double *RPIP = new double[2 * NN];
+    memcpy(RPIP_out, RP_out, NN * sizeof(double));
+    memcpy(RPIP_out + NN, IP_out, NN * sizeof(double));
+    MPI_Allreduce(RPIP_out, RPIP, 2 * NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    memcpy(RP, RPIP, NN * sizeof(double));
+    memcpy(IP, RPIP + NN, NN * sizeof(double));
+    delete[] RPIP_out;
+    delete[] RPIP;
+  }

  //|------= Free memory.

@@ -1854,8 +1901,17 @@ void surface_integral::surf_Wave(double rex, int lev, cgh *GH,
  }
  //|------+  Communicate and sum the results from each processor.

-  MPI_Allreduce(RP_out, RP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(IP_out, IP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+  {
+    double *RPIP_out = new double[2 * NN];
+    double *RPIP = new double[2 * NN];
+    memcpy(RPIP_out, RP_out, NN * sizeof(double));
+    memcpy(RPIP_out + NN, IP_out, NN * sizeof(double));
+    MPI_Allreduce(RPIP_out, RPIP, 2 * NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    memcpy(RP, RPIP, NN * sizeof(double));
+    memcpy(IP, RPIP + NN, NN * sizeof(double));
+    delete[] RPIP_out;
+    delete[] RPIP;
+  }

  //|------= Free memory.

@@ -2040,8 +2096,17 @@ void surface_integral::surf_Wave(double rex, int lev, NullShellPatch2 *GH, var *
  }
  //|------+  Communicate and sum the results from each processor.

-  MPI_Allreduce(RP_out, RP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(IP_out, IP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+  {
+    double *RPIP_out = new double[2 * NN];
+    double *RPIP = new double[2 * NN];
+    memcpy(RPIP_out, RP_out, NN * sizeof(double));
+    memcpy(RPIP_out + NN, IP_out, NN * sizeof(double));
+    MPI_Allreduce(RPIP_out, RPIP, 2 * NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    memcpy(RP, RPIP, NN * sizeof(double));
+    memcpy(IP, RPIP + NN, NN * sizeof(double));
+    delete[] RPIP_out;
+    delete[] RPIP;
+  }

  //|------= Free memory.

@@ -2226,8 +2291,17 @@ void surface_integral::surf_Wave(double rex, int lev, NullShellPatch *GH, var *R
  }
  //|------+  Communicate and sum the results from each processor.

-  MPI_Allreduce(RP_out, RP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(IP_out, IP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+  {
+    double *RPIP_out = new double[2 * NN];
+    double *RPIP = new double[2 * NN];
+    memcpy(RPIP_out, RP_out, NN * sizeof(double));
+    memcpy(RPIP_out + NN, IP_out, NN * sizeof(double));
+    MPI_Allreduce(RPIP_out, RPIP, 2 * NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    memcpy(RP, RPIP, NN * sizeof(double));
+    memcpy(IP, RPIP + NN, NN * sizeof(double));
+    delete[] RPIP_out;
+    delete[] RPIP;
+  }

  //|------= Free memory.

@@ -2314,25 +2388,9 @@ void surface_integral::surf_MassPAng(double rex, int lev, cgh *GH, var *chi, var
    pox[2][n] = rex * nz_g[n];
  }

-  double *shellf;
-  shellf = new double[n_tot * InList];
-
-  // we have assumed there is only one box on this level,
-  // so we do not need loop boxes
-  GH->PatL[lev]->data->Interp_Points(DG_List, n_tot, pox, shellf, Symmetry);
-
-  double Mass_out = 0;
-  double ang_outx, ang_outy, ang_outz;
-  double p_outx, p_outy, p_outz;
-  ang_outx = ang_outy = ang_outz = 0.0;
-  p_outx = p_outy = p_outz = 0.0;
-  const double f1o8 = 0.125;
-
  int mp, Lp, Nmin, Nmax;
-
  mp = n_tot / cpusize;
  Lp = n_tot - cpusize * mp;
-
  if (Lp > myrank)
  {
    Nmin = myrank * mp + myrank;
@@ -2344,6 +2402,20 @@ void surface_integral::surf_MassPAng(double rex, int lev, cgh *GH, var *chi, var
    Nmax = Nmin + mp - 1;
  }

+  double *shellf;
+  shellf = new double[n_tot * InList];
+
+  // we have assumed there is only one box on this level,
+  // so we do not need loop boxes
+  GH->PatL[lev]->data->Interp_Points(DG_List, n_tot, pox, shellf, Symmetry, Nmin, Nmax);
+
+  double Mass_out = 0;
+  double ang_outx, ang_outy, ang_outz;
+  double p_outx, p_outy, p_outz;
+  ang_outx = ang_outy = ang_outz = 0.0;
+  p_outx = p_outy = p_outz = 0.0;
+  const double f1o8 = 0.125;
+
  double Chi, Psi;
  double Gxx, Gxy, Gxz, Gyy, Gyz, Gzz;
  double gupxx, gupxy, gupxz, gupyy, gupyz, gupzz;
@@ -2464,15 +2536,13 @@ void surface_integral::surf_MassPAng(double rex, int lev, cgh *GH, var *chi, var
    }
  }

-  MPI_Allreduce(&Mass_out, &mass, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-
-  MPI_Allreduce(&ang_outx, &sx, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(&ang_outy, &sy, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(&ang_outz, &sz, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-
-  MPI_Allreduce(&p_outx, &px, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(&p_outy, &py, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(&p_outz, &pz, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+  {
+    double scalar_out[7] = {Mass_out, ang_outx, ang_outy, ang_outz, p_outx, p_outy, p_outz};
+    double scalar_in[7];
+    MPI_Allreduce(scalar_out, scalar_in, 7, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    mass = scalar_in[0]; sx = scalar_in[1]; sy = scalar_in[2]; sz = scalar_in[3];
+    px = scalar_in[4]; py = scalar_in[5]; pz = scalar_in[6];
+  }

 #ifdef GaussInt
  mass = mass * rex * rex * dphi * factor;
@@ -2735,15 +2805,13 @@ void surface_integral::surf_MassPAng(double rex, int lev, cgh *GH, var *chi, var
    }
  }

-  MPI_Allreduce(&Mass_out, &mass, 1, MPI_DOUBLE, MPI_SUM, Comm_here);
-
-  MPI_Allreduce(&ang_outx, &sx, 1, MPI_DOUBLE, MPI_SUM, Comm_here);
-  MPI_Allreduce(&ang_outy, &sy, 1, MPI_DOUBLE, MPI_SUM, Comm_here);
-  MPI_Allreduce(&ang_outz, &sz, 1, MPI_DOUBLE, MPI_SUM, Comm_here);
-
-  MPI_Allreduce(&p_outx, &px, 1, MPI_DOUBLE, MPI_SUM, Comm_here);
-  MPI_Allreduce(&p_outy, &py, 1, MPI_DOUBLE, MPI_SUM, Comm_here);
-  MPI_Allreduce(&p_outz, &pz, 1, MPI_DOUBLE, MPI_SUM, Comm_here);
+  {
+    double scalar_out[7] = {Mass_out, ang_outx, ang_outy, ang_outz, p_outx, p_outy, p_outz};
+    double scalar_in[7];
+    MPI_Allreduce(scalar_out, scalar_in, 7, MPI_DOUBLE, MPI_SUM, Comm_here);
+    mass = scalar_in[0]; sx = scalar_in[1]; sy = scalar_in[2]; sz = scalar_in[3];
+    px = scalar_in[4]; py = scalar_in[5]; pz = scalar_in[6];
+  }

 #ifdef GaussInt
  mass = mass * rex * rex * dphi * factor;
@@ -3020,15 +3088,13 @@ void surface_integral::surf_MassPAng(double rex, int lev, ShellPatch *GH, var *c
    }
  }

-  MPI_Allreduce(&Mass_out, &mass, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-
-  MPI_Allreduce(&ang_outx, &sx, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(&ang_outy, &sy, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(&ang_outz, &sz, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-
-  MPI_Allreduce(&p_outx, &px, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(&p_outy, &py, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(&p_outz, &pz, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+  {
+    double scalar_out[7] = {Mass_out, ang_outx, ang_outy, ang_outz, p_outx, p_outy, p_outz};
+    double scalar_in[7];
+    MPI_Allreduce(scalar_out, scalar_in, 7, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    mass = scalar_in[0]; sx = scalar_in[1]; sy = scalar_in[2]; sz = scalar_in[3];
+    px = scalar_in[4]; py = scalar_in[5]; pz = scalar_in[6];
+  }

 #ifdef GaussInt
  mass = mass * rex * rex * dphi * factor;
@@ -3607,8 +3673,17 @@ void surface_integral::surf_Wave(double rex, cgh *GH, ShellPatch *SH,
  }
  //|------+  Communicate and sum the results from each processor.

-  MPI_Allreduce(RP_out, RP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(IP_out, IP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+  {
+    double *RPIP_out = new double[2 * NN];
+    double *RPIP = new double[2 * NN];
+    memcpy(RPIP_out, RP_out, NN * sizeof(double));
+    memcpy(RPIP_out + NN, IP_out, NN * sizeof(double));
+    MPI_Allreduce(RPIP_out, RPIP, 2 * NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    memcpy(RP, RPIP, NN * sizeof(double));
+    memcpy(IP, RPIP + NN, NN * sizeof(double));
+    delete[] RPIP_out;
+    delete[] RPIP;
+  }

  //|------= Free memory.

--- a/AMSS_NCKU_source/testNull.C
+++ b/AMSS_NCKU_source/testNull.C
@@ -18,7 +18,11 @@ using namespace std;
 #include <math.h>
 #endif

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 #include "misc.h"
 #include "macrodef.h"
--- a/AMSS_NCKU_source/testNull2.C
+++ b/AMSS_NCKU_source/testNull2.C
@@ -20,7 +20,11 @@ using namespace std;
 #include <map.h>
 #endif

+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 #include "misc.h"
 #include "macrodef.h"
--- a/AMSS_NCKU_source/var.C
+++ b/AMSS_NCKU_source/var.C
@@ -9,7 +9,11 @@
 using namespace std;

 #include <time.h>
+#ifdef MPI_STUB
+#include "mpi_stub.h"
+#else
 #include <mpi.h>
+#endif

 #include "var.h"

--- a/makefile_and_run.py
+++ b/makefile_and_run.py
@@ -11,16 +11,30 @@
 import AMSS_NCKU_Input as input_data
 import subprocess
 import time
+import os
+
+## OpenMP configuration for threaded Fortran kernels
+## OMP_NUM_THREADS: set to number of physical cores (not hyperthreads)
+## OMP_PROC_BIND: bind threads to cores to avoid migration overhead
+## OMP_STACKSIZE: each thread needs stack space for fh arrays (~3.6MB)
+if "OMP_NUM_THREADS" not in os.environ:
+    os.environ["OMP_NUM_THREADS"] = "96"
+os.environ["OMP_STACKSIZE"] = "16M"
+os.environ["OMP_PROC_BIND"] = "close"
+os.environ["OMP_PLACES"] = "cores"
 ## CPU core binding configuration using taskset
 ## taskset ensures all child processes inherit the CPU affinity mask
 ## This forces make and all compiler processes to use only nohz_full cores (4-55, 60-111)
 ## Format: taskset -c 4-55,60-111 ensures processes only run on these cores
-NUMACTL_CPU_BIND = "taskset -c 0-111"
+#NUMACTL_CPU_BIND = "taskset -c 0-111"
+#NUMACTL_CPU_BIND = "taskset -c 16-47,64-95"
+#NUMACTL_CPU_BIND = "taskset -c 8-15"
+NUMACTL_CPU_BIND = ""

 ## Build parallelism configuration
 ## Use nohz_full cores (4-55, 60-111) for compilation: 52 + 52 = 104 cores
 ## Set make -j to utilize available cores for faster builds
-BUILD_JOBS = 104
+BUILD_JOBS = 16


 ##################################################################
@@ -116,26 +130,26 @@ def run_ABE():
    ## Define the command to run; cast other values to strings as needed

    if (input_data.GPU_Calculation == "no"):
-        mpi_command         = NUMACTL_CPU_BIND + " mpirun -np " + str(input_data.MPI_processes) + " ./ABE"
-        mpi_command_outfile = "ABE_out.log"
+        run_command         = NUMACTL_CPU_BIND + " ./ABE"
+        run_command_outfile = "ABE_out.log"
    elif (input_data.GPU_Calculation == "yes"):
-        mpi_command         = NUMACTL_CPU_BIND + " mpirun -np " + str(input_data.MPI_processes) + " ./ABEGPU"
-        mpi_command_outfile = "ABEGPU_out.log"
+        run_command         = NUMACTL_CPU_BIND + " ./ABEGPU"
+        run_command_outfile = "ABEGPU_out.log"

-    ## Execute the MPI command and stream output
-    mpi_process = subprocess.Popen(mpi_command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
+    ## Execute the command and stream output
+    run_process = subprocess.Popen(run_command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)

    ## Write ABE run output to file while printing to stdout
-    with open(mpi_command_outfile, 'w') as file0:  
+    with open(run_command_outfile, 'w') as file0:
        ## Read and print output lines; also write each line to file
-        for line in mpi_process.stdout:
+        for line in run_process.stdout:
            print(line, end='')  # stream output in real time
            file0.write(line)    # write the line to file
            file0.flush()        # flush to ensure each line is written immediately (optional)            
    file0.close()

    ## Wait for the process to finish
-    mpi_return_code = mpi_process.wait()
+    run_return_code = run_process.wait()
    
    print(                                           )
    print( " The ABE/ABEGPU simulation is finished " ) 
@@ -158,7 +172,8 @@ def run_TwoPunctureABE():
    print(                                                          )
    
    ## Define the command to run
-    TwoPuncture_command         = NUMACTL_CPU_BIND + " ./TwoPunctureABE"
+    #TwoPuncture_command         = NUMACTL_CPU_BIND + " ./TwoPunctureABE"
+    TwoPuncture_command         = " ./TwoPunctureABE"
    TwoPuncture_command_outfile = "TwoPunctureABE_out.log"

    ## Execute the command with subprocess.Popen and stream output
--- a/pgo_profile/PGO_Profile_Analysis.md
+++ b/pgo_profile/PGO_Profile_Analysis.md
@@ -0,0 +1,97 @@
+# AMSS-NCKU PGO Profile Analysis Report
+
+## 1. Profiling Environment
+
+| Item | Value |
+|------|-------|
+| Compiler | Intel oneAPI DPC++/C++ 2025.3.0 (icpx/ifx) |
+| Instrumentation Flag | `-fprofile-instr-generate` |
+| Optimization Level (instrumented) | `-O2 -xHost -fma` |
+| MPI Processes | 1 (single process to avoid MPI+instrumentation deadlock) |
+| Profile File | `default_9725750769337483397_0.profraw` (327 KB) |
+| Merged Profile | `default.profdata` (394 KB) |
+| llvm-profdata | `/home/intel/oneapi/compiler/2025.3/bin/compiler/llvm-profdata` |
+
+## 2. Reduced Simulation Parameters (for profiling run)
+
+| Parameter | Production Value | Profiling Value |
+|-----------|-----------------|-----------------|
+| MPI_processes | 64 | 1 |
+| grid_level | 9 | 4 |
+| static_grid_level | 5 | 3 |
+| static_grid_number | 96 | 24 |
+| moving_grid_number | 48 | 16 |
+| largest_box_xyz_max | 320^3 | 160^3 |
+| Final_Evolution_Time | 1000.0 | 10.0 |
+| Evolution_Step_Number | 10,000,000 | 1,000 |
+| Detector_Number | 12 | 2 |
+
+## 3. Profile Summary
+
+| Metric | Value |
+|--------|-------|
+| Total instrumented functions | 1,392 |
+| Functions with non-zero counts | 117 (8.4%) |
+| Functions with zero counts | 1,275 (91.6%) |
+| Maximum function entry count | 386,459,248 |
+| Maximum internal block count | 370,477,680 |
+| Total block count | 4,198,023,118 |
+
+## 4. Top 20 Hotspot Functions
+
+| Rank | Total Count | Max Block Count | Function | Category |
+|------|------------|-----------------|----------|----------|
+| 1 | 1,241,601,732 | 370,477,680 | `polint_` | Interpolation |
+| 2 | 755,994,435 | 230,156,640 | `prolong3_` | Grid prolongation |
+| 3 | 667,964,095 | 3,697,792 | `compute_rhs_bssn_` | BSSN RHS evolution |
+| 4 | 539,736,051 | 386,459,248 | `symmetry_bd_` | Symmetry boundary |
+| 5 | 277,310,808 | 53,170,728 | `lopsided_` | Lopsided FD stencil |
+| 6 | 155,534,488 | 94,535,040 | `decide3d_` | 3D grid decision |
+| 7 | 119,267,712 | 19,266,048 | `rungekutta4_rout_` | RK4 time integrator |
+| 8 | 91,574,616 | 48,824,160 | `kodis_` | Kreiss-Oliger dissipation |
+| 9 | 67,555,389 | 43,243,680 | `fderivs_` | Finite differences |
+| 10 | 55,296,000 | 42,246,144 | `misc::fact(int)` | Factorial utility |
+| 11 | 43,191,071 | 27,663,328 | `fdderivs_` | 2nd-order FD derivatives |
+| 12 | 36,233,965 | 22,429,440 | `restrict3_` | Grid restriction |
+| 13 | 24,698,512 | 17,231,520 | `polin3_` | Polynomial interpolation |
+| 14 | 22,962,942 | 20,968,768 | `copy_` | Data copy |
+| 15 | 20,135,696 | 17,259,168 | `Ansorg::barycentric(...)` | Spectral interpolation |
+| 16 | 14,650,224 | 7,224,768 | `Ansorg::barycentric_omega(...)` | Spectral weights |
+| 17 | 13,242,296 | 2,871,920 | `global_interp_` | Global interpolation |
+| 18 | 12,672,000 | 7,734,528 | `sommerfeld_rout_` | Sommerfeld boundary |
+| 19 | 6,872,832 | 1,880,064 | `sommerfeld_routbam_` | Sommerfeld boundary (BAM) |
+| 20 | 5,709,900 | 2,809,632 | `l2normhelper_` | L2 norm computation |
+
+## 5. Hotspot Category Breakdown
+
+Top 20 functions account for ~98% of total execution counts:
+
+| Category | Functions | Combined Count | Share |
+|----------|-----------|---------------|-------|
+| Interpolation / Prolongation / Restriction | polint_, prolong3_, restrict3_, polin3_, global_interp_, Ansorg::* | ~2,093M | ~50% |
+| BSSN RHS + FD stencils | compute_rhs_bssn_, lopsided_, fderivs_, fdderivs_ | ~1,056M | ~25% |
+| Boundary conditions | symmetry_bd_, sommerfeld_rout_, sommerfeld_routbam_ | ~559M | ~13% |
+| Time integration | rungekutta4_rout_ | ~119M | ~3% |
+| Dissipation | kodis_ | ~92M | ~2% |
+| Utilities | misc::fact, decide3d_, copy_, l2normhelper_ | ~256M | ~6% |
+
+## 6. Conclusions
+
+1. **Profile data is valid**: 1,392 functions instrumented, 117 exercised with ~4.2 billion total counts.
+2. **Hotspot concentration is high**: Top 5 functions alone account for ~76% of all counts, which is ideal for PGO — the compiler has strong branch/layout optimization targets.
+3. **Fortran numerical kernels dominate**: `polint_`, `prolong3_`, `compute_rhs_bssn_`, `symmetry_bd_`, `lopsided_` are all Fortran routines in the inner evolution loop. PGO will optimize their branch prediction and basic block layout.
+4. **91.6% of functions have zero counts**: These are code paths for unused features (GPU, BSSN-EScalar, BSSN-EM, Z4C, etc.). PGO will deprioritize them, improving instruction cache utilization.
+5. **Profile is representative**: Despite the reduced grid size, the code path coverage matches production — the same kernels (RHS, prolongation, restriction, boundary) are exercised. PGO branch probabilities from this profile will transfer well to full-scale runs.
+
+## 7. PGO Phase 2 Usage
+
+To apply the profile, use the following flags in `makefile.inc`:
+
+```makefile
+CXXAPPFLAGS = -O3 -xHost -fp-model fast=2 -fma -ipo \
+              -fprofile-instr-use=/home/amss/AMSS-NCKU/pgo_profile/default.profdata \
+              -Dfortran3 -Dnewc -I${MKLROOT}/include
+f90appflags = -O3 -xHost -fp-model fast=2 -fma -ipo \
+              -fprofile-instr-use=/home/amss/AMSS-NCKU/pgo_profile/default.profdata \
+              -align array64byte -fpp -I${MKLROOT}/include
+```
--- a/pgo_profile/default.profdata
+++ b/pgo_profile/default.profdata
--- a/pgo_profile/default_9725750769337483397_0.profraw
+++ b/pgo_profile/default_9725750769337483397_0.profraw
Author	SHA1	Message	Date
CGH0S7	714c6e90c6	Add OpenMP parallelization to Fortran compute kernels Add !$omp parallel do collapse(2) directives to all triple-loop stencil kernels (fderivs, fdderivs, fdx/fdy/fdz, kodis, lopsided, enforce_ag/enforce_ga) across all ghost_width variants. Add !$omp parallel workshare to RK4/ICN/Euler whole-array update routines. Build system: add -qopenmp to compile and link flags, switch MKL from sequential to threaded (-lmkl_intel_thread -liomp5). Runtime: set OMP_NUM_THREADS=96, OMP_STACKSIZE=16M, OMP_PROC_BIND=close, OMP_PLACES=cores for 96-core server target. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-02-10 23:40:17 +08:00
CGH0S7	caf192b2e4	Remove MPI dependency, replace with single-process stub for non-MPI builds - Add mpi_stub.h providing all MPI types/constants/functions as no-ops (nprocs=1, myrank=0) with memcpy-based Allreduce and clock_gettime Wtime - Replace #include <mpi.h> with conditional #ifdef MPI_STUB in 31 files (19 headers + 12 source files) preserving ability to build with real MPI - Change makefile.inc: CLINKER mpiicpx->icpx, add -DMPI_STUB to CXXAPPFLAGS - Update makefile_and_run.py: run ./ABE directly instead of mpirun -np N - Set MPI_processes=1 in AMSS_NCKU_Input.py Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-02-10 22:51:11 +08:00
CGH0S7	d06d5b4db8	Add targeted point-to-point Interp_Points overload for surface_integral Instead of broadcasting all interpolated point data to every MPI rank, the new overload sends each point only to the one rank that needs it for integration, reducing communication volume by ~nprocs times. The consumer rank is computed deterministically using the same Nmin/Nmax work distribution formula used by surface_integral callers. Two active call sites (surf_Wave and surf_MassPAng with MPI_COMM_WORLD) now use the new overload. Other callers (ShellPatch, Comm_here variants, etc.) remain unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-02-10 19:18:56 +08:00
CGH0S7	50e2a845f8	Replace MPI_Allreduce with owner-rank MPI_Bcast in Patch::Interp_Points The two MPI_Allreduce calls (data + weight) were the #1 hotspot at 38.5% CPU time. Since all ranks traverse the same block list and agree on point ownership, we replace the global reduction with targeted MPI_Bcast from each owner rank. This also eliminates the weight array/Allreduce entirely, removes redundant heap allocations (shellf, weight, DH, llb, uub), and writes interpolation results directly into the output buffer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 22:39:18 +08:00
CGH0S7	738498cb28	Optimize MPI communication in RestrictProlong and surface_integral Cache Sync in RestrictProlong: replace 11 basic Parallel::Sync() calls with Parallel::Sync_cached() across RestrictProlong, RestrictProlong_aux, and ProlongRestrict to avoid rebuilding grid segment lists every call. Merge paired MPI_Allreduce in surface_integral: combine 9 pairs of consecutive RP/IP Allreduce calls into single calls with count=2*NN. Merge scalar MPI_Allreduce in surf_MassPAng: combine 3 groups of 7 scalar Allreduce calls (mass + angular/linear momentum) into single calls with count=7. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 22:07:12 +08:00
CGH0S7	42b9cf1ad9	Optimize MPI Sync with merged transfers, caching, and async overlap Phase 1: Merge N+1 transfer() calls into a single transfer() per Sync(PatchList), reducing N+1 MPI_Waitall barriers to 1 via new Sync_merged() that collects all intra-patch and inter-patch grid segment lists into combined per-rank arrays. Phase 2: Cache grid segment lists and reuse grow-only communication buffers across RK4 substeps via SyncCache struct. Caches are per-level and per-variable-list (predictor/corrector), invalidated on regrid. Eliminates redundant build_ghost_gsl/build_owned_gsl0/build_gstl rebuilds and malloc/free cycles between regrids. Phase 3: Split Sync into async Sync_start/Sync_finish to overlap Cartesian ghost zone exchange (MPI_Isend/Irecv) with Shell patch synchronization. Uses MPI tag 2 to avoid conflicts with SH->Synch() which uses transfer() with tag 1. Also updates makefile.inc paths and flags for local build environment. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 21:03:37 +08:00
CGH0S7	e9d321fd00	Convert MPI_Allreduce error checks to non-blocking MPI_Iallreduce overlapped with Sync Replace all 8 blocking MPI_Allreduce error-check calls with MPI_Iallreduce, overlapping the reduction with subsequent Parallel::Sync/SH->Synch operations. MPI_Wait is called after Sync completes to retrieve the error result. This hides the Allreduce latency (46.5% of CPU time) behind the ghost zone exchange communication that must happen anyway. Safe because Sync only copies existing data to ghost zones and the error check + abort happens before any further computation uses the synced data. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 12:39:29 +08:00
CGH0S7	ed1d86ade9	Merge paired MPI_Allreduce error checks to reduce global sync barriers In the two Step() functions that handle both Patch and Shell Patch, defer the Patch error check until after Shell Patch computation completes, then perform a single combined MPI_Allreduce instead of two separate ones. This eliminates 4 MPI_Allreduce calls per timestep (2 per Step function, Predictor + Corrector phases each). The optimization is mathematically equivalent: in normal execution (no NaN) behavior is identical; on error, both Patch and Shell data are dumped before MPI_Abort. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 12:12:16 +08:00
CGH0S7	471baa5065	PGO supported	2026-02-09 10:59:26 +08:00
CGH0S7	4bb6c03013	makefile setting updated	2026-02-08 16:14:43 +08:00
ianchb	b8e41b2b39	Only enable OpenMP for TwoPunctures	2026-02-08 13:00:37 +08:00
ianchb	133e4f13a2	Use OpenMP's parallel for with schedule(dynamic,1)	2026-02-07 19:48:24 +08:00
ianchb	914c4f4791	Optimize memory allocation in JFD_times_dv This should reduce the pressure on the memory allocator, indirectly improving caching behavior. Co-authored-by: copilot-swe-agent[bot] <198982749+copilot@users.noreply.github.com>	2026-02-07 15:55:45 +08:00