Apply async Sync optimization to Z4c_class using Sync_start/finish pattern

Replaces blocking Parallel::Sync + MPI_Allreduce in Z4c_class Step() with non-blocking MPI_Iallreduce overlapped with Sync_start/Sync_finish, matching the pattern already used in bssn_class on cjy-oneapi-opus-hotfix. Covers both ABEtype==2 and CPBC variants (predictor + corrector = 4 call sites).

Cherry-picked optimization from afd4006, adapted to SyncCache infrastructure instead of the separate SyncPlan API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merge lopsided advection + kodis dissipation to share symmetry_bd buffer
2026-02-20 09:58:26 +08:00 · 2026-02-20 09:57:51 +08:00 · 2026-02-20 08:48:25 +08:00 · 2026-02-11 19:17:35 +08:00 · 2026-02-11 19:15:12 +08:00 · 2026-02-11 18:26:30 +08:00
26 changed files with 3256 additions and 1624 deletions
--- a/AMSS_NCKU_ABEtest.py
+++ b/AMSS_NCKU_ABEtest.py
@@ -1,447 +0,0 @@
 ##################################################################
 ##
 ## AMSS-NCKU ABE Test Program (Skip TwoPuncture if data exists)
 ## Modified from AMSS_NCKU_Program.py
 ## Author: Xiaoqu
 ## Modified: 2026/02/01
 ##
 ##################################################################
 ##################################################################
 ## Print program introduction
 import print_information
 print_information.print_program_introduction()
 ##################################################################
 import AMSS_NCKU_Input as input_data
 ##################################################################
 ## Create directories to store program run data
 import os
 import shutil
 import sys
 import time
 ## Set the output directory according to the input file
 File_directory = os.path.join(input_data.File_directory)
 ## Check if output directory exists and if TwoPuncture data is available
 #skip_twopuncture = False
 skip_twopuncture = True
 output_directory = os.path.join(File_directory, "AMSS_NCKU_output")
 binary_results_directory = os.path.join(output_directory, input_data.Output_directory)
 if os.path.exists(File_directory):
    print( " Output directory already exists." )
    print()
    '''
    # Check if TwoPuncture initial data files exist
    if (input_data.Initial_Data_Method == "Ansorg-TwoPuncture"):
        twopuncture_output = os.path.join(output_directory, "TwoPunctureABE")
        input_par = os.path.join(output_directory, "input.par")
        if os.path.exists(twopuncture_output) and os.path.exists(input_par):
            print( " Found existing TwoPuncture initial data." )
            print( " Do you want to skip TwoPuncture phase and reuse existing data?" )
            print( " Input 'skip' to skip TwoPuncture and start ABE directly" )
            print( " Input 'regenerate' to regenerate everything from scratch" )
            print()
            while True:
                try:
                    inputvalue = input()
                    if ( inputvalue == "skip" ):
                        print( " Skipping TwoPuncture phase, will reuse existing initial data." )
                        print()
                        skip_twopuncture = True
                        break
                    elif ( inputvalue == "regenerate" ):
                        print( " Regenerating everything from scratch." )
                        print()
                        skip_twopuncture = False
                        break
                    else:
                        print( " Please input 'skip' or 'regenerate'." )
                except ValueError:
                    print( " Please input 'skip' or 'regenerate'." )
        else:
            print( " TwoPuncture initial data not found, will regenerate everything." )
            print()
 '''
    # If not skipping, remove and recreate directory
    if not skip_twopuncture:
        shutil.rmtree(File_directory, ignore_errors=True)
        os.mkdir(File_directory)
        os.mkdir(output_directory)
        os.mkdir(binary_results_directory)
        figure_directory = os.path.join(File_directory, "figure")
        os.mkdir(figure_directory)
        shutil.copy("AMSS_NCKU_Input.py", File_directory)
        print( " Output directory has been regenerated." )
        print()
 else:
    # Create fresh directory structure
    os.mkdir(File_directory)
    shutil.copy("AMSS_NCKU_Input.py", File_directory)
    os.mkdir(output_directory)
    os.mkdir(binary_results_directory)
    figure_directory = os.path.join(File_directory, "figure")
    os.mkdir(figure_directory)
    print( " Output directory has been generated." )
    print()
 # Ensure figure directory exists
 figure_directory = os.path.join(File_directory, "figure")
 if not os.path.exists(figure_directory):
    os.mkdir(figure_directory)
 ##################################################################
 ## Output related parameter information
 import setup
 ## Print and save input parameter information
 setup.print_input_data( File_directory )
 if not skip_twopuncture:
    setup.generate_AMSSNCKU_input()
 setup.print_puncture_information()
 ##################################################################
 ## Generate AMSS-NCKU program input files based on the configured parameters
 if not skip_twopuncture:
    print()
    print( " Generating the AMSS-NCKU input parfile for the ABE executable." )
    print()
    ## Generate cgh-related input files from the grid information
    import numerical_grid
    numerical_grid.append_AMSSNCKU_cgh_input()
    print()
    print( " The input parfile for AMSS-NCKU C++ executable file ABE has been generated." )
    print( " However, the input relevant to TwoPuncture need to be appended later." )
    print()
 ##################################################################
 ## Plot the initial grid configuration
 if not skip_twopuncture:
    print()
    print( " Schematically plot the numerical grid structure." )
    print()
    import numerical_grid
    numerical_grid.plot_initial_grid()
 ##################################################################
 ## Generate AMSS-NCKU macro files according to the numerical scheme and parameters
 if not skip_twopuncture:
    print()
    print( " Automatically generating the macro file for AMSS-NCKU C++ executable file ABE " )
    print( " (Based on the finite-difference numerical scheme) " )
    print()
    import generate_macrodef
    generate_macrodef.generate_macrodef_h()
    print( " AMSS-NCKU macro file macrodef.h has been generated. " )
    generate_macrodef.generate_macrodef_fh()
    print( " AMSS-NCKU macro file macrodef.fh has been generated. " )
 ##################################################################
 # Compile the AMSS-NCKU program according to user requirements
 # NOTE: ABE compilation is always performed, even when skipping TwoPuncture
 print()
 print( " Preparing to compile and run the AMSS-NCKU code as requested " )
 print( " Compiling the AMSS-NCKU code based on the generated macro files " )
 print()
 AMSS_NCKU_source_path = "AMSS_NCKU_source"
 AMSS_NCKU_source_copy = os.path.join(File_directory, "AMSS_NCKU_source_copy")
 ## If AMSS_NCKU source folder is missing, create it and prompt the user
 if not os.path.exists(AMSS_NCKU_source_path):
    os.makedirs(AMSS_NCKU_source_path)
    print( " The AMSS-NCKU source files are incomplete; copy all source files into ./AMSS_NCKU_source. " )
    print( " Press Enter to continue. " )
    inputvalue = input()
 # Copy AMSS-NCKU source files to prepare for compilation
 # If skipping TwoPuncture and source_copy already exists, remove it first
 if skip_twopuncture and os.path.exists(AMSS_NCKU_source_copy):
    shutil.rmtree(AMSS_NCKU_source_copy)
 shutil.copytree(AMSS_NCKU_source_path, AMSS_NCKU_source_copy)
 # Copy the generated macro files into the AMSS_NCKU source folder
 if not skip_twopuncture:
    macrodef_h_path  = os.path.join(File_directory, "macrodef.h")
    macrodef_fh_path = os.path.join(File_directory, "macrodef.fh")
 else:
    # When skipping TwoPuncture, use existing macro files from previous run
    macrodef_h_path  = os.path.join(File_directory, "macrodef.h")
    macrodef_fh_path = os.path.join(File_directory, "macrodef.fh")
 shutil.copy2(macrodef_h_path,  AMSS_NCKU_source_copy)
 shutil.copy2(macrodef_fh_path, AMSS_NCKU_source_copy)
 # Compile related programs
 import makefile_and_run
 ## Change working directory to the target source copy
 os.chdir(AMSS_NCKU_source_copy)
 ## Build the main AMSS-NCKU executable (ABE or ABEGPU)
 makefile_and_run.makefile_ABE()
 ## If the initial-data method is Ansorg-TwoPuncture, build the TwoPunctureABE executable
 ## Only build TwoPunctureABE if not skipping TwoPuncture phase
 if (input_data.Initial_Data_Method == "Ansorg-TwoPuncture" ) and not skip_twopuncture:
    makefile_and_run.makefile_TwoPunctureABE()
 ## Change current working directory back up two levels
 os.chdir('..')
 os.chdir('..')
 print()
 ##################################################################
 ## Copy the AMSS-NCKU executable (ABE/ABEGPU) to the run directory
 if (input_data.GPU_Calculation == "no"):
    ABE_file = os.path.join(AMSS_NCKU_source_copy, "ABE")
 elif (input_data.GPU_Calculation == "yes"):
    ABE_file = os.path.join(AMSS_NCKU_source_copy, "ABEGPU")
 if not os.path.exists( ABE_file ):
    print()
    print( " Lack of AMSS-NCKU executable file ABE/ABEGPU; recompile AMSS_NCKU_source manually. " )
    print( " When recompilation is finished, press Enter to continue. " )
    inputvalue = input()
 ## Copy the executable ABE (or ABEGPU) into the run directory
 shutil.copy2(ABE_file, output_directory)
 ## If the initial-data method is TwoPuncture, copy the TwoPunctureABE executable to the run directory
 ## Only copy TwoPunctureABE if not skipping TwoPuncture phase
 if (input_data.Initial_Data_Method == "Ansorg-TwoPuncture" ) and not skip_twopuncture:
    TwoPuncture_file = os.path.join(AMSS_NCKU_source_copy, "TwoPunctureABE")
    if not os.path.exists( TwoPuncture_file ):
        print()
        print( " Lack of AMSS-NCKU executable file TwoPunctureABE; recompile TwoPunctureABE in AMSS_NCKU_source. " )
        print( " When recompilation is finished, press Enter to continue. " )
        inputvalue = input()
    ## Copy the TwoPunctureABE executable into the run directory
    shutil.copy2(TwoPuncture_file, output_directory)
 ##################################################################
 ## If the initial-data method is TwoPuncture, generate the TwoPuncture input files
 if (input_data.Initial_Data_Method == "Ansorg-TwoPuncture" ) and not skip_twopuncture:
    print()
    print( " Initial data is chosen as Ansorg-TwoPuncture" )
    print()
    print()
    print( " Automatically generating the input parfile for the TwoPunctureABE executable " )
    print()
    import generate_TwoPuncture_input
    generate_TwoPuncture_input.generate_AMSSNCKU_TwoPuncture_input()
    print()
    print( " The input parfile for the TwoPunctureABE executable has been generated. " )
    print()
    ## Generated AMSS-NCKU TwoPuncture input filename
    AMSS_NCKU_TwoPuncture_inputfile      = 'AMSS-NCKU-TwoPuncture.input'
    AMSS_NCKU_TwoPuncture_inputfile_path = os.path.join( File_directory, AMSS_NCKU_TwoPuncture_inputfile )
    ## Copy and rename the file
    shutil.copy2( AMSS_NCKU_TwoPuncture_inputfile_path, os.path.join(output_directory, 'TwoPunctureinput.par') )
    ## Run TwoPuncture to generate initial-data files
    start_time = time.time()  # Record start time
    print()
    print()
    ## Change to the output (run) directory
    os.chdir(output_directory)
    ## Run the TwoPuncture executable
    import makefile_and_run
    makefile_and_run.run_TwoPunctureABE()
    ## Change current working directory back up two levels
    os.chdir('..')
    os.chdir('..')
 elif (input_data.Initial_Data_Method == "Ansorg-TwoPuncture" ) and skip_twopuncture:
    print()
    print( " Skipping TwoPuncture execution, using existing initial data." )
    print()
    start_time = time.time()  # Record start time for ABE only
 else:
    start_time = time.time()  # Record start time
 ##################################################################
 ## Update puncture data based on TwoPuncture run results
 if not skip_twopuncture:
    import renew_puncture_parameter
    renew_puncture_parameter.append_AMSSNCKU_BSSN_input(File_directory, output_directory)
    ## Generated AMSS-NCKU input filename
    AMSS_NCKU_inputfile      = 'AMSS-NCKU.input'
    AMSS_NCKU_inputfile_path = os.path.join(File_directory, AMSS_NCKU_inputfile)
    ## Copy and rename the file
    shutil.copy2( AMSS_NCKU_inputfile_path, os.path.join(output_directory, 'input.par') )
    print()
    print( " Successfully copy all AMSS-NCKU input parfile to target dictionary. " )
    print()
 else:
    print()
    print( " Using existing input.par file from previous run." )
    print()
 ##################################################################
 ## Launch the AMSS-NCKU program
 print()
 print()
 ## Change to the run directory
 os.chdir( output_directory )
 import makefile_and_run
 makefile_and_run.run_ABE()
 ## Change current working directory back up two levels
 os.chdir('..')
 os.chdir('..')
 end_time = time.time()
 elapsed_time = end_time - start_time
 ##################################################################
 ## Copy some basic input and log files out to facilitate debugging
 ## Path to the file that stores calculation settings
 AMSS_NCKU_error_file_path = os.path.join(binary_results_directory, "setting.par")
 ## Copy and rename the file for easier inspection
 shutil.copy( AMSS_NCKU_error_file_path, os.path.join(output_directory, "AMSSNCKU_setting_parameter") )
 ## Path to the error log file
 AMSS_NCKU_error_file_path = os.path.join(binary_results_directory, "Error.log")
 ## Copy and rename the error log
 shutil.copy( AMSS_NCKU_error_file_path, os.path.join(output_directory, "Error.log") )
 ## Primary program outputs
 AMSS_NCKU_BH_data         = os.path.join(binary_results_directory, "bssn_BH.dat"        )
 AMSS_NCKU_ADM_data        = os.path.join(binary_results_directory, "bssn_ADMQs.dat"     )
 AMSS_NCKU_psi4_data       = os.path.join(binary_results_directory, "bssn_psi4.dat"      )
 AMSS_NCKU_constraint_data = os.path.join(binary_results_directory, "bssn_constraint.dat")
 ## copy and rename the file
 shutil.copy( AMSS_NCKU_BH_data,         os.path.join(output_directory, "bssn_BH.dat"        ) )
 shutil.copy( AMSS_NCKU_ADM_data,        os.path.join(output_directory, "bssn_ADMQs.dat"     ) )
 shutil.copy( AMSS_NCKU_psi4_data,       os.path.join(output_directory, "bssn_psi4.dat"      ) )
 shutil.copy( AMSS_NCKU_constraint_data, os.path.join(output_directory, "bssn_constraint.dat") )
 ## Additional program outputs
 if (input_data.Equation_Class == "BSSN-EM"):
    AMSS_NCKU_phi1_data = os.path.join(binary_results_directory, "bssn_phi1.dat" )
    AMSS_NCKU_phi2_data = os.path.join(binary_results_directory, "bssn_phi2.dat" )
    shutil.copy( AMSS_NCKU_phi1_data, os.path.join(output_directory, "bssn_phi1.dat" ) )
    shutil.copy( AMSS_NCKU_phi2_data, os.path.join(output_directory, "bssn_phi2.dat" ) )
 elif (input_data.Equation_Class == "BSSN-EScalar"):
    AMSS_NCKU_maxs_data = os.path.join(binary_results_directory, "bssn_maxs.dat" )
    shutil.copy( AMSS_NCKU_maxs_data, os.path.join(output_directory, "bssn_maxs.dat" ) )
 ##################################################################
 ## Plot the AMSS-NCKU program results
 print()
 print( " Plotting the txt and binary results data from the AMSS-NCKU simulation " )
 print()
 import plot_xiaoqu
 import plot_GW_strain_amplitude_xiaoqu
 ## Plot black hole trajectory
 plot_xiaoqu.generate_puncture_orbit_plot(   binary_results_directory, figure_directory )
 plot_xiaoqu.generate_puncture_orbit_plot3D( binary_results_directory, figure_directory )
 ## Plot black hole separation vs. time
 plot_xiaoqu.generate_puncture_distence_plot( binary_results_directory, figure_directory )
 ## Plot gravitational waveforms (psi4 and strain amplitude)
 for i in range(input_data.Detector_Number):
    plot_xiaoqu.generate_gravitational_wave_psi4_plot( binary_results_directory, figure_directory, i )
    plot_GW_strain_amplitude_xiaoqu.generate_gravitational_wave_amplitude_plot( binary_results_directory, figure_directory, i )
 ## Plot ADM mass evolution
 for i in range(input_data.Detector_Number):
    plot_xiaoqu.generate_ADMmass_plot( binary_results_directory, figure_directory, i )
 ## Plot Hamiltonian constraint violation over time
 for i in range(input_data.grid_level):
    plot_xiaoqu.generate_constraint_check_plot( binary_results_directory, figure_directory, i )
 ## Plot stored binary data
 plot_xiaoqu.generate_binary_data_plot( binary_results_directory, figure_directory )
 print()
 print( f" This Program Cost = {elapsed_time} Seconds " )
 print()
 ##################################################################
 print()
 print( " The AMSS-NCKU-Python simulation is successfully finished, thanks for using !!! " )
 print()
 ##################################################################
--- a/AMSS_NCKU_Program.py
+++ b/AMSS_NCKU_Program.py
@@ -8,6 +8,14 @@
 ##
 ##################################################################
 ## Guard against re-execution by multiprocessing child processes.
 ## Without this, using 'spawn' or 'forkserver' context would cause every
 ## worker to re-run the entire script, spawning exponentially more
 ## workers (fork bomb).
 if __name__ != '__main__':
    import sys as _sys
    _sys.exit(0)
 ##################################################################
@@ -424,26 +432,31 @@ print(
 import plot_xiaoqu
 import plot_GW_strain_amplitude_xiaoqu
 from parallel_plot_helper import run_plot_tasks_parallel
 plot_tasks = []
 ## Plot black hole trajectory
-plot_xiaoqu.generate_puncture_orbit_plot(   binary_results_directory, figure_directory )
+plot_tasks.append( ( plot_xiaoqu.generate_puncture_orbit_plot,   (binary_results_directory, figure_directory) ) )
-plot_xiaoqu.generate_puncture_orbit_plot3D( binary_results_directory, figure_directory )
+plot_tasks.append( ( plot_xiaoqu.generate_puncture_orbit_plot3D, (binary_results_directory, figure_directory) ) )
 ## Plot black hole separation vs. time
-plot_xiaoqu.generate_puncture_distence_plot( binary_results_directory, figure_directory )
+plot_tasks.append( ( plot_xiaoqu.generate_puncture_distence_plot, (binary_results_directory, figure_directory) ) )
 ## Plot gravitational waveforms (psi4 and strain amplitude)
 for i in range(input_data.Detector_Number):
-    plot_xiaoqu.generate_gravitational_wave_psi4_plot( binary_results_directory, figure_directory, i )
+    plot_tasks.append( ( plot_xiaoqu.generate_gravitational_wave_psi4_plot, (binary_results_directory, figure_directory, i) ) )
-    plot_GW_strain_amplitude_xiaoqu.generate_gravitational_wave_amplitude_plot( binary_results_directory, figure_directory, i )
+    plot_tasks.append( ( plot_GW_strain_amplitude_xiaoqu.generate_gravitational_wave_amplitude_plot, (binary_results_directory, figure_directory, i) ) )
 ## Plot ADM mass evolution
 for i in range(input_data.Detector_Number):
-    plot_xiaoqu.generate_ADMmass_plot( binary_results_directory, figure_directory, i )
+    plot_tasks.append( ( plot_xiaoqu.generate_ADMmass_plot, (binary_results_directory, figure_directory, i) ) )
 ## Plot Hamiltonian constraint violation over time
 for i in range(input_data.grid_level):
-    plot_xiaoqu.generate_constraint_check_plot( binary_results_directory, figure_directory, i )
+    plot_tasks.append( ( plot_xiaoqu.generate_constraint_check_plot, (binary_results_directory, figure_directory, i) ) )
 run_plot_tasks_parallel(plot_tasks)
 ## Plot stored binary data
 plot_xiaoqu.generate_binary_data_plot( binary_results_directory, figure_directory )
--- a/AMSS_NCKU_source/MPatch.C
+++ b/AMSS_NCKU_source/MPatch.C
@@ -341,8 +341,9 @@ void Patch::Interp_Points(MyList<var> *VarList,
                          double *Shellf, int Symmetry)
 {
  // NOTE: we do not Synchnize variables here, make sure of that before calling this routine
-  int myrank;
+  int myrank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  int ordn = 2 * ghost_width;
  MyList<var> *varl;
@@ -354,24 +355,18 @@ void Patch::Interp_Points(MyList<var> *VarList,
    varl = varl->next;
  }
-  double *shellf;
+  memset(Shellf, 0, sizeof(double) * NN * num_var);
  shellf = new double[NN * num_var];
  memset(shellf, 0, sizeof(double) * NN * num_var);
-  // we use weight to monitor code, later some day we can move it for optimization
+  // owner_rank[j] records which MPI rank owns point j
-  int *weight;
+  // All ranks traverse the same block list so they all agree on ownership
-  weight = new int[NN];
+  int *owner_rank;
-  memset(weight, 0, sizeof(int) * NN);
+  owner_rank = new int[NN];
-
+  for (int j = 0; j < NN; j++)
-  double *DH, *llb, *uub;
+    owner_rank[j] = -1;
  DH = new double[dim];
  double DH[dim], llb[dim], uub[dim];
  for (int i = 0; i < dim; i++)
  {
    DH[i] = getdX(i);
  }
  llb = new double[dim];
  uub = new double[dim];
  for (int j = 0; j < NN; j++) // run along points
  {
@@ -403,12 +398,6 @@ void Patch::Interp_Points(MyList<var> *VarList,
      bool flag = true;
      for (int i = 0; i < dim; i++)
      {
 // NOTE: our dividing structure is (exclude ghost)
 // -1 0
 //       1  2
 // so (0,1) does not belong to any part for vertex structure
 // here we put (0,0.5) to left part and (0.5,1) to right part
 // BUT for cell structure the bbox is (-1.5,0.5) and (0.5,2.5), there is no missing region at all
 #ifdef Vertex
 #ifdef Cell
 #error Both Cell and Vertex are defined
@@ -433,6 +422,7 @@ void Patch::Interp_Points(MyList<var> *VarList,
      if (flag)
      {
        notfind = false;
        owner_rank[j] = BP->rank;
        if (myrank == BP->rank)
        {
          //---> interpolation
@@ -440,14 +430,11 @@ void Patch::Interp_Points(MyList<var> *VarList,
          int k = 0;
          while (varl) // run along variables
          {
-            //              shellf[j*num_var+k] = Parallel::global_interp(dim,BP->shape,BP->X,BP->fgfs[varl->data->sgfn],
+            f_global_interp(BP->shape, BP->X[0], BP->X[1], BP->X[2], BP->fgfs[varl->data->sgfn], Shellf[j * num_var + k],
            //	  		                                    pox,ordn,varl->data->SoA,Symmetry);
            f_global_interp(BP->shape, BP->X[0], BP->X[1], BP->X[2], BP->fgfs[varl->data->sgfn], shellf[j * num_var + k],
                            pox[0], pox[1], pox[2], ordn, varl->data->SoA, Symmetry);
            varl = varl->next;
            k++;
          }
          weight[j] = 1;
        }
      }
      if (Bp == ble)
@@ -456,103 +443,327 @@ void Patch::Interp_Points(MyList<var> *VarList,
    }
  }
-  MPI_Allreduce(shellf, Shellf, NN * num_var, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+  // Replace MPI_Allreduce with per-owner MPI_Bcast:
-  int *Weight;
+  // Group consecutive points by owner rank and broadcast each group.
-  Weight = new int[NN];
+  // Since each point's data is non-zero only on the owner rank,
-  MPI_Allreduce(weight, Weight, NN, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
+  // Bcast from owner is equivalent to Allreduce(MPI_SUM) but much cheaper.
  //  misc::tillherecheck("print me");
  for (int i = 0; i < NN; i++)
  {
-    if (Weight[i] > 1)
+    int j = 0;
    while (j < NN)
    {
-      if (myrank == 0)
+      int cur_owner = owner_rank[j];
-        cout << "WARNING: Patch::Interp_Points meets multiple weight" << endl;
+      if (cur_owner < 0)
-      for (int j = 0; j < num_var; j++)
+      {
-        Shellf[j + i * num_var] = Shellf[j + i * num_var] / Weight[i];
+        if (myrank == 0)
        {
          cout << "ERROR: Patch::Interp_Points fails to find point (";
          for (int d = 0; d < dim; d++)
          {
            cout << XX[d][j];
            if (d < dim - 1)
              cout << ",";
            else
              cout << ")";
          }
          cout << " on Patch (";
          for (int d = 0; d < dim; d++)
          {
            cout << bbox[d] << "+" << lli[d] * DH[d];
            if (d < dim - 1)
              cout << ",";
            else
              cout << ")--";
          }
          cout << "(";
          for (int d = 0; d < dim; d++)
          {
            cout << bbox[dim + d] << "-" << uui[d] * DH[d];
            if (d < dim - 1)
              cout << ",";
            else
              cout << ")" << endl;
          }
          MPI_Abort(MPI_COMM_WORLD, 1);
        }
        j++;
        continue;
      }
      // Find contiguous run of points with the same owner
      int jstart = j;
      while (j < NN && owner_rank[j] == cur_owner)
        j++;
      int count = (j - jstart) * num_var;
      MPI_Bcast(Shellf + jstart * num_var, count, MPI_DOUBLE, cur_owner, MPI_COMM_WORLD);
    }
-    else if (Weight[i] == 0 && myrank == 0)
+  }
  delete[] owner_rank;
 }
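Reviewer note: the broadcast loop above walks `owner_rank` and issues one `MPI_Bcast` per maximal run of consecutive points that share an owner. The grouping step can be sketched without MPI as below. `OwnerRun` and `group_by_owner` are hypothetical names introduced for illustration only; they are not part of MPatch.C. The sketch also assumes every point already has an owner, whereas the diff aborts when it meets an unowned point.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One broadcast segment: points [start, start + length) all owned by `owner`.
// (Hypothetical helper type, not present in the diff.)
struct OwnerRun
{
  int owner;
  std::size_t start;
  std::size_t length;
};

// Split owner_rank into maximal runs of equal owners, mirroring the
// jstart / cur_owner scan in the diff. Each run would map to one
// MPI_Bcast rooted at run.owner.
std::vector<OwnerRun> group_by_owner(const std::vector<int> &owner_rank)
{
  std::vector<OwnerRun> runs;
  std::size_t j = 0;
  while (j < owner_rank.size())
  {
    const int cur_owner = owner_rank[j];
    // Assumption of this sketch: every point has an owner.
    assert(cur_owner >= 0);
    const std::size_t jstart = j;
    while (j < owner_rank.size() && owner_rank[j] == cur_owner)
      j++;
    runs.push_back({cur_owner, jstart, j - jstart});
  }
  return runs;
}
```

For a run of `length` points, the broadcast in the diff covers `length * num_var` doubles starting at `Shellf + start * num_var`. The number of collectives therefore depends on how the points are ordered: points that alternate between owners produce one broadcast per point.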
 void Patch::Interp_Points(MyList<var> *VarList,
                          int NN, double **XX,
                          double *Shellf, int Symmetry,
                          int Nmin_consumer, int Nmax_consumer)
 {
  // Targeted point-to-point overload: each owner sends each point only to
  // the one rank that needs it for integration (consumer), reducing
  // communication volume by ~nprocs times compared to the Bcast version.
  int myrank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  int ordn = 2 * ghost_width;
  MyList<var> *varl;
  int num_var = 0;
  varl = VarList;
  while (varl)
  {
    num_var++;
    varl = varl->next;
  }
  memset(Shellf, 0, sizeof(double) * NN * num_var);
  // owner_rank[j] records which MPI rank owns point j
  int *owner_rank;
  owner_rank = new int[NN];
  for (int j = 0; j < NN; j++)
    owner_rank[j] = -1;
  double DH[dim], llb[dim], uub[dim];
  for (int i = 0; i < dim; i++)
    DH[i] = getdX(i);
  // --- Interpolation phase (identical to original) ---
  for (int j = 0; j < NN; j++)
  {
    double pox[dim];
    for (int i = 0; i < dim; i++)
    {
      pox[i] = XX[i][j];
      if (myrank == 0 && (XX[i][j] < bbox[i] + lli[i] * DH[i] || XX[i][j] > bbox[dim + i] - uui[i] * DH[i]))
      {
        cout << "Patch::Interp_Points: point (";
        for (int k = 0; k < dim; k++)
        {
          cout << XX[k][j];
          if (k < dim - 1)
            cout << ",";
          else
            cout << ") is out of current Patch." << endl;
        }
        MPI_Abort(MPI_COMM_WORLD, 1);
      }
    }
    MyList<Block> *Bp = blb;
    bool notfind = true;
    while (notfind && Bp)
    {
      Block *BP = Bp->data;
      bool flag = true;
      for (int i = 0; i < dim; i++)
      {
 #ifdef Vertex
 #ifdef Cell
 #error Both Cell and Vertex are defined
 #endif
        llb[i] = (feq(BP->bbox[i], bbox[i], DH[i] / 2)) ? BP->bbox[i] + lli[i] * DH[i] : BP->bbox[i] + (ghost_width - 0.5) * DH[i];
        uub[i] = (feq(BP->bbox[dim + i], bbox[dim + i], DH[i] / 2)) ? BP->bbox[dim + i] - uui[i] * DH[i] : BP->bbox[dim + i] - (ghost_width - 0.5) * DH[i];
 #else
 #ifdef Cell
        llb[i] = (feq(BP->bbox[i], bbox[i], DH[i] / 2)) ? BP->bbox[i] + lli[i] * DH[i] : BP->bbox[i] + ghost_width * DH[i];
        uub[i] = (feq(BP->bbox[dim + i], bbox[dim + i], DH[i] / 2)) ? BP->bbox[dim + i] - uui[i] * DH[i] : BP->bbox[dim + i] - ghost_width * DH[i];
 #else
 #error Not define Vertex nor Cell
 #endif
 #endif
        if (XX[i][j] - llb[i] < -DH[i] / 2 || XX[i][j] - uub[i] > DH[i] / 2)
        {
          flag = false;
          break;
        }
      }
      if (flag)
      {
        notfind = false;
        owner_rank[j] = BP->rank;
        if (myrank == BP->rank)
        {
          varl = VarList;
          int k = 0;
          while (varl)
          {
            f_global_interp(BP->shape, BP->X[0], BP->X[1], BP->X[2], BP->fgfs[varl->data->sgfn], Shellf[j * num_var + k],
                            pox[0], pox[1], pox[2], ordn, varl->data->SoA, Symmetry);
            varl = varl->next;
            k++;
          }
        }
      }
      if (Bp == ble)
        break;
      Bp = Bp->next;
    }
  }
  // --- Error check for unfound points ---
  for (int j = 0; j < NN; j++)
  {
    if (owner_rank[j] < 0 && myrank == 0)
    {
      cout << "ERROR: Patch::Interp_Points fails to find point (";
-      for (int j = 0; j < dim; j++)
+      for (int d = 0; d < dim; d++)
      {
-        cout << XX[j][i];
+        cout << XX[d][j];
-        if (j < dim - 1)
+        if (d < dim - 1)
          cout << ",";
        else
          cout << ")";
      }
      cout << " on Patch (";
-      for (int j = 0; j < dim; j++)
+      for (int d = 0; d < dim; d++)
      {
-        cout << bbox[j] << "+" << lli[j] * getdX(j);
+        cout << bbox[d] << "+" << lli[d] * DH[d];
-        if (j < dim - 1)
+        if (d < dim - 1)
          cout << ",";
        else
          cout << ")--";
      }
      cout << "(";
-      for (int j = 0; j < dim; j++)
+      for (int d = 0; d < dim; d++)
      {
-        cout << bbox[dim + j] << "-" << uui[j] * getdX(j);
+        cout << bbox[dim + d] << "-" << uui[d] * DH[d];
-        if (j < dim - 1)
+        if (d < dim - 1)
          cout << ",";
        else
          cout << ")" << endl;
      }
 #if 0
       checkBlock();
 #else
      cout << "splited domains:" << endl;
      {
        MyList<Block> *Bp = blb;
        while (Bp)
        {
          Block *BP = Bp->data;
          for (int i = 0; i < dim; i++)
          {
 #ifdef Vertex
 #ifdef Cell
 #error Both Cell and Vertex are defined
 #endif
            llb[i] = (feq(BP->bbox[i], bbox[i], DH[i] / 2)) ? BP->bbox[i] + lli[i] * DH[i] : BP->bbox[i] + (ghost_width - 0.5) * DH[i];
            uub[i] = (feq(BP->bbox[dim + i], bbox[dim + i], DH[i] / 2)) ? BP->bbox[dim + i] - uui[i] * DH[i] : BP->bbox[dim + i] - (ghost_width - 0.5) * DH[i];
 #else
 #ifdef Cell
            llb[i] = (feq(BP->bbox[i], bbox[i], DH[i] / 2)) ? BP->bbox[i] + lli[i] * DH[i] : BP->bbox[i] + ghost_width * DH[i];
            uub[i] = (feq(BP->bbox[dim + i], bbox[dim + i], DH[i] / 2)) ? BP->bbox[dim + i] - uui[i] * DH[i] : BP->bbox[dim + i] - ghost_width * DH[i];
 #else
 #error Not define Vertex nor Cell
 #endif
 #endif
          }
          cout << "(";
          for (int j = 0; j < dim; j++)
          {
            cout << llb[j] << ":" << uub[j];
            if (j < dim - 1)
              cout << ",";
            else
              cout << ")" << endl;
          }
          if (Bp == ble)
            break;
          Bp = Bp->next;
        }
      }
 #endif
      MPI_Abort(MPI_COMM_WORLD, 1);
    }
  }
-  delete[] shellf;
+  // --- Targeted point-to-point communication phase ---
-  delete[] weight;
+  // Compute consumer_rank[j] using the same deterministic formula as surface_integral
-  delete[] Weight;
+  int *consumer_rank = new int[NN];
-  delete[] DH;
+  {
-  delete[] llb;
+    int mp = NN / nprocs;
-  delete[] uub;
+    int Lp = NN - nprocs * mp;
    for (int j = 0; j < NN; j++)
    {
      if (j < Lp * (mp + 1))
        consumer_rank[j] = j / (mp + 1);
      else
        consumer_rank[j] = Lp + (j - Lp * (mp + 1)) / mp;
    }
  }
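Reviewer note: the `consumer_rank` formula above is a contiguous block partition of the `NN` points over `nprocs` ranks. The first `Lp` ranks take `mp + 1` points each and the remaining ranks take `mp` points each. A standalone sketch follows; the function name `compute_consumer_rank` is hypothetical, and whether it matches surface_integral exactly is taken from the diff's own comment rather than verified here.

```cpp
#include <cassert>
#include <vector>

// Assign NN points to nprocs ranks in contiguous blocks, using the same
// arithmetic as the diff: mp = NN / nprocs, Lp = NN - nprocs * mp.
std::vector<int> compute_consumer_rank(int NN, int nprocs)
{
  assert(NN >= 0 && nprocs > 0);
  const int mp = NN / nprocs;      // base number of points per rank
  const int Lp = NN - nprocs * mp; // number of ranks that take one extra point
  std::vector<int> consumer_rank(NN);
  for (int j = 0; j < NN; j++)
  {
    if (j < Lp * (mp + 1))
      consumer_rank[j] = j / (mp + 1);
    else
      consumer_rank[j] = Lp + (j - Lp * (mp + 1)) / mp;
  }
  return consumer_rank;
}
```

When `NN < nprocs`, `mp` is 0 and `Lp` equals `NN`, so every index takes the first branch and the division by `mp` in the second branch is never reached.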
  // Count sends and recvs per rank
  int *send_count = new int[nprocs];
  int *recv_count = new int[nprocs];
  memset(send_count, 0, sizeof(int) * nprocs);
  memset(recv_count, 0, sizeof(int) * nprocs);
  for (int j = 0; j < NN; j++)
  {
    int own = owner_rank[j];
    int con = consumer_rank[j];
    if (own == con)
      continue; // local — no communication needed
    if (own == myrank)
      send_count[con]++;
    if (con == myrank)
      recv_count[own]++;
  }
  // Build send buffers: for each destination rank, pack (index, data) pairs
  // Each entry: 1 int (point index j) + num_var doubles
  int total_send = 0, total_recv = 0;
  int *send_offset = new int[nprocs];
  int *recv_offset = new int[nprocs];
  for (int r = 0; r < nprocs; r++)
  {
    send_offset[r] = total_send;
    total_send += send_count[r];
    recv_offset[r] = total_recv;
    total_recv += recv_count[r];
  }
  // Pack send buffers: each message contains (j, data[0..num_var-1]) per point
  int stride = 1 + num_var; // 1 double for index + num_var doubles for data
  double *sendbuf = new double[total_send * stride];
  double *recvbuf = new double[total_recv * stride];
  // Temporary counters for packing
  int *pack_pos = new int[nprocs];
  memset(pack_pos, 0, sizeof(int) * nprocs);
  for (int j = 0; j < NN; j++)
  {
    int own = owner_rank[j];
    int con = consumer_rank[j];
    if (own != myrank || con == myrank)
      continue;
    int pos = (send_offset[con] + pack_pos[con]) * stride;
    sendbuf[pos] = (double)j; // point index
    for (int v = 0; v < num_var; v++)
      sendbuf[pos + 1 + v] = Shellf[j * num_var + v];
    pack_pos[con]++;
  }
  // Post non-blocking recvs and sends
  int n_req = 0;
  for (int r = 0; r < nprocs; r++)
  {
    if (recv_count[r] > 0) n_req++;
    if (send_count[r] > 0) n_req++;
  }
  MPI_Request *reqs = new MPI_Request[n_req];
  int req_idx = 0;
  for (int r = 0; r < nprocs; r++)
  {
    if (recv_count[r] > 0)
    {
      MPI_Irecv(recvbuf + recv_offset[r] * stride,
                recv_count[r] * stride, MPI_DOUBLE,
                r, 0, MPI_COMM_WORLD, &reqs[req_idx++]);
    }
  }
  for (int r = 0; r < nprocs; r++)
  {
    if (send_count[r] > 0)
    {
      MPI_Isend(sendbuf + send_offset[r] * stride,
                send_count[r] * stride, MPI_DOUBLE,
                r, 0, MPI_COMM_WORLD, &reqs[req_idx++]);
    }
  }
  if (n_req > 0)
    MPI_Waitall(n_req, reqs, MPI_STATUSES_IGNORE);
  // Unpack recv buffers into Shellf
  for (int i = 0; i < total_recv; i++)
  {
    int pos = i * stride;
    int j = (int)recvbuf[pos];
    for (int v = 0; v < num_var; v++)
      Shellf[j * num_var + v] = recvbuf[pos + 1 + v];
  }
  delete[] reqs;
  delete[] sendbuf;
  delete[] recvbuf;
  delete[] pack_pos;
  delete[] send_offset;
  delete[] recv_offset;
  delete[] send_count;
  delete[] recv_count;
  delete[] consumer_rank;
  delete[] owner_rank;
 }
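The consumer-rank formula used in the point-to-point phase above is a balanced block distribution: the first `Lp` ranks take `mp + 1` points each and the remaining ranks take `mp`. A minimal standalone sketch of that mapping follows, assuming nothing beyond the formula shown in the diff; the helper name `compute_consumer_rank` is hypothetical and does not appear in the patch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of the patch): map each point index j in
// [0, NN) to the rank that consumes it, using the same arithmetic as the
// consumer_rank loop in the diff.
std::vector<int> compute_consumer_rank(int NN, int nprocs)
{
  assert(NN >= 0 && nprocs > 0);
  const int mp = NN / nprocs;      // base number of points per rank
  const int Lp = NN - nprocs * mp; // number of ranks that get one extra point
  std::vector<int> consumer(NN);
  for (int j = 0; j < NN; j++)
  {
    if (j < Lp * (mp + 1))
      consumer[j] = j / (mp + 1);  // inside the "large" blocks
    else
      consumer[j] = Lp + (j - Lp * (mp + 1)) / mp; // inside the "small" blocks
  }
  return consumer;
}
```

When `NN < nprocs`, `mp` is 0 and `Lp` equals `NN`, so every index falls in the first branch and the division by `mp` is never reached.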
 void Patch::Interp_Points(MyList<var> *VarList,
                          int NN, double **XX,
@@ -573,24 +784,22 @@ void Patch::Interp_Points(MyList<var> *VarList,
    varl = varl->next;
  }
-  double *shellf;
+  memset(Shellf, 0, sizeof(double) * NN * num_var);
  shellf = new double[NN * num_var];
  memset(shellf, 0, sizeof(double) * NN * num_var);
-  // we use weight to monitor code, later some day we can move it for optimization
+  // owner_rank[j] stores the global rank that owns point j
-  int *weight;
+  int *owner_rank;
-  weight = new int[NN];
+  owner_rank = new int[NN];
-  memset(weight, 0, sizeof(int) * NN);
+  for (int j = 0; j < NN; j++)
    owner_rank[j] = -1;
-  double *DH, *llb, *uub;
+  // Build global-to-local rank translation for Comm_here
-  DH = new double[dim];
+  MPI_Group world_group, local_group;
  MPI_Comm_group(MPI_COMM_WORLD, &world_group);
  MPI_Comm_group(Comm_here, &local_group);
  double DH[dim], llb[dim], uub[dim];
  for (int i = 0; i < dim; i++)
  {
    DH[i] = getdX(i);
  }
  llb = new double[dim];
  uub = new double[dim];
  for (int j = 0; j < NN; j++) // run along points
  {
@@ -622,12 +831,6 @@ void Patch::Interp_Points(MyList<var> *VarList,
      bool flag = true;
      for (int i = 0; i < dim; i++)
      {
 // NOTE: our dividing structure is (exclude ghost)
 // -1 0
 //       1  2
 // so (0,1) does not belong to any part for vertex structure
 // here we put (0,0.5) to left part and (0.5,1) to right part
 // BUT for cell structure the bbox is (-1.5,0.5) and (0.5,2.5), there is no missing region at all
 #ifdef Vertex
 #ifdef Cell
 #error Both Cell and Vertex are defined
@@ -652,6 +855,7 @@ void Patch::Interp_Points(MyList<var> *VarList,
      if (flag)
      {
        notfind = false;
        owner_rank[j] = BP->rank;
        if (myrank == BP->rank)
        {
          //---> interpolation
@@ -659,14 +863,11 @@ void Patch::Interp_Points(MyList<var> *VarList,
          int k = 0;
          while (varl) // run along variables
          {
-            //              shellf[j*num_var+k] = Parallel::global_interp(dim,BP->shape,BP->X,BP->fgfs[varl->data->sgfn],
+            f_global_interp(BP->shape, BP->X[0], BP->X[1], BP->X[2], BP->fgfs[varl->data->sgfn], Shellf[j * num_var + k],
            //	  		                                    pox,ordn,varl->data->SoA,Symmetry);
            f_global_interp(BP->shape, BP->X[0], BP->X[1], BP->X[2], BP->fgfs[varl->data->sgfn], shellf[j * num_var + k],
                            pox[0], pox[1], pox[2], ordn, varl->data->SoA, Symmetry);
            varl = varl->next;
            k++;
          }
          weight[j] = 1;
        }
      }
      if (Bp == ble)
@@ -675,97 +876,35 @@ void Patch::Interp_Points(MyList<var> *VarList,
    }
  }
-  MPI_Allreduce(shellf, Shellf, NN * num_var, MPI_DOUBLE, MPI_SUM, Comm_here);
+  // Collect unique global owner ranks and translate to local ranks in Comm_here
-  int *Weight;
+  // Then broadcast each owner's points via MPI_Bcast on Comm_here
  Weight = new int[NN];
  MPI_Allreduce(weight, Weight, NN, MPI_INT, MPI_SUM, Comm_here);
  //  misc::tillherecheck("print me");
  //  if(lmyrank == 0) cout<<"myrank = "<<myrank<<"print me"<<endl;
  for (int i = 0; i < NN; i++)
  {
-    if (Weight[i] > 1)
+    int j = 0;
    while (j < NN)
    {
-      if (lmyrank == 0)
+      int cur_owner_global = owner_rank[j];
-        cout << "WARNING: Patch::Interp_Points meets multiple weight" << endl;
+      if (cur_owner_global < 0)
-      for (int j = 0; j < num_var; j++)
+      {
-        Shellf[j + i * num_var] = Shellf[j + i * num_var] / Weight[i];
+        // Point not found — skip (error check disabled for sub-communicator levels)
        j++;
        continue;
      }
      // Translate global rank to local rank in Comm_here
      int cur_owner_local;
      MPI_Group_translate_ranks(world_group, 1, &cur_owner_global, local_group, &cur_owner_local);
      // Find contiguous run of points with the same owner
      int jstart = j;
      while (j < NN && owner_rank[j] == cur_owner_global)
        j++;
      int count = (j - jstart) * num_var;
      MPI_Bcast(Shellf + jstart * num_var, count, MPI_DOUBLE, cur_owner_local, Comm_here);
    }
 #if 0 // for not involved levels, this may fail     
     else if(Weight[i] == 0 && lmyrank == 0)
     {
       cout<<"ERROR: Patch::Interp_Points fails to find point (";
       for(int j=0;j<dim;j++)
       {
 	  cout<<XX[j][i];
 	  if(j<dim-1) cout<<",";
 	  else        cout<<")";
       }
       cout<<" on Patch (";
       for(int j=0;j<dim;j++)
       {
 	  cout<<bbox[j]<<"+"<<lli[j]*getdX(j);
 	  if(j<dim-1) cout<<",";
 	  else        cout<<")--";
       }
       cout<<"(";
       for(int j=0;j<dim;j++)
       {
 	  cout<<bbox[dim+j]<<"-"<<uui[j]*getdX(j);
 	  if(j<dim-1) cout<<",";
 	  else        cout<<")"<<endl;
       }
 #if 0
       checkBlock();
 #else
  cout<<"splited domains:"<<endl;
  {
     MyList<Block> *Bp=blb;
     while(Bp)
     {
 	Block *BP=Bp->data;
 	for(int i=0;i<dim;i++)
 	{
 #ifdef Vertex
 #ifdef Cell
 #error Both Cell and Vertex are defined
 #endif    
          llb[i] = (feq(BP->bbox[i]    ,bbox[i]    ,DH[i]/2)) ? BP->bbox[i]+lli[i]*DH[i]     : BP->bbox[i]    +(ghost_width-0.5)*DH[i];
          uub[i] = (feq(BP->bbox[dim+i],bbox[dim+i],DH[i]/2)) ? BP->bbox[dim+i]-uui[i]*DH[i] : BP->bbox[dim+i]-(ghost_width-0.5)*DH[i];
 #else
 #ifdef Cell
          llb[i] = (feq(BP->bbox[i]    ,bbox[i]    ,DH[i]/2)) ? BP->bbox[i]+lli[i]*DH[i]     : BP->bbox[i]    +ghost_width*DH[i];
          uub[i] = (feq(BP->bbox[dim+i],bbox[dim+i],DH[i]/2)) ? BP->bbox[dim+i]-uui[i]*DH[i] : BP->bbox[dim+i]-ghost_width*DH[i];
 #else
 #error Not define Vertex nor Cell
 #endif
 #endif 
 	}       
       cout<<"(";
       for(int j=0;j<dim;j++)
       {
 	  cout<<llb[j]<<":"<<uub[j];
 	  if(j<dim-1) cout<<",";
 	  else        cout<<")"<<endl;
       }
 	if(Bp == ble) break;
 	Bp=Bp->next;
     }
  }
 #endif       
       MPI_Abort(MPI_COMM_WORLD,1);
     }
 #endif
  }
-  delete[] shellf;
+  MPI_Group_free(&world_group);
-  delete[] weight;
+  MPI_Group_free(&local_group);
-  delete[] Weight;
+  delete[] owner_rank;
  delete[] DH;
  delete[] llb;
  delete[] uub;
 }
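The broadcast loop in the `Comm_here` variant groups consecutive points that share an owner, so that each group needs only one `MPI_Bcast`. The run detection on its own can be sketched without MPI as below. The `OwnerRun` struct and `find_owner_runs` function are hypothetical names introduced for illustration; the real code issues the broadcast inside the loop rather than collecting runs.

```cpp
#include <cassert>
#include <vector>

// Hypothetical record of one contiguous run of points with the same owner.
struct OwnerRun
{
  int owner; // global rank that owns the run
  int start; // index of the first point in the run
  int count; // number of points in the run
};

// Hypothetical helper: split owner_rank into contiguous same-owner runs,
// skipping points whose owner is negative (point not found).
std::vector<OwnerRun> find_owner_runs(const std::vector<int> &owner_rank)
{
  std::vector<OwnerRun> runs;
  const int NN = static_cast<int>(owner_rank.size());
  int j = 0;
  while (j < NN)
  {
    const int cur = owner_rank[j];
    if (cur < 0)
    {
      j++; // unowned point: no broadcast would be issued for it
      continue;
    }
    const int jstart = j;
    while (j < NN && owner_rank[j] == cur)
      j++;
    assert(j > jstart); // every run contains at least one point
    runs.push_back({cur, jstart, j - jstart});
  }
  return runs;
}
```

In the patch, each such run corresponds to a broadcast of `count * num_var` doubles starting at `Shellf + start * num_var`. The number of collectives therefore depends on how the points are ordered: if owners alternate from point to point, there are as many broadcasts as points.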
 void Patch::checkBlock()
 {
--- a/AMSS_NCKU_source/MPatch.h
+++ b/AMSS_NCKU_source/MPatch.h
@@ -39,6 +39,10 @@ public:
   bool Find_Point(double *XX);
   void Interp_Points(MyList<var> *VarList,
                      int NN, double **XX,
                      double *Shellf, int Symmetry,
                      int Nmin_consumer, int Nmax_consumer);
   void Interp_Points(MyList<var> *VarList,
                      int NN, double **XX,
                      double *Shellf, int Symmetry, MPI_Comm Comm_here);
--- a/AMSS_NCKU_source/Parallel.C
+++ b/AMSS_NCKU_source/Parallel.C
@@ -4,8 +4,6 @@
 #include "prolongrestrict.h"
 #include "misc.h"
 #include "parameters.h"
 #include <vector>
 #include <algorithm>
 int Parallel::partition1(int &nx, int split_size, int min_width, int cpusize, int shape) // special for 1 dimension
 {
@@ -74,14 +72,14 @@ int Parallel::partition3(int *nxyz, int split_size, int *min_width, int cpusize,
  int n;
  block_size = shape[0] * shape[1] * shape[2];
-  n =  Mymax(1, (block_size + split_size / 2) / split_size);
+  n = Mymax(1, (block_size + split_size / 2) / split_size);
  maxnx = Mymax(1, shape[0] / min_width[0]);
-  maxnx =  Mymin(cpusize, maxnx);
+  maxnx = Mymin(cpusize, maxnx);
  maxny = Mymax(1, shape[1] / min_width[1]);
-  maxny =  Mymin(cpusize, maxny);
+  maxny = Mymin(cpusize, maxny);
  maxnz = Mymax(1, shape[2] / min_width[2]);
-  maxnz =  Mymin(cpusize, maxnz);
+  maxnz = Mymin(cpusize, maxnz);
  fx = (double)shape[0] / (shape[0] + shape[1] + shape[2]);
  fy = (double)shape[1] / (shape[0] + shape[1] + shape[2]);
  fz = (double)shape[2] / (shape[0] + shape[1] + shape[2]);
@@ -354,73 +352,14 @@ MyList<Block> *Parallel::distribute(MyList<Patch> *PatchLIST, int cpusize, int i
  split_size = Mymax(min_size, block_size / nodes);
  split_size = Mymax(1, split_size);
-  // Pass 1: compute block volumes for greedy rank assignment
+  int n_rank = 0;
  std::vector<long> block_volumes;
  PLi = PatchLIST;
  int reacpu = 0;
  while (PLi)
  {
    Patch *PP = PLi->data;
    reacpu += partition3(nxyz, split_size, mmin_width, nodes, PP->shape);
    int ibbox_here[2 * dim];
    for (int i = 0; i < nxyz[0]; i++)
      for (int j = 0; j < nxyz[1]; j++)
        for (int k = 0; k < nxyz[2]; k++)
        {
          ibbox_here[0] = (PP->shape[0] * i) / nxyz[0];
          ibbox_here[3] = (PP->shape[0] * (i + 1)) / nxyz[0] - 1;
          ibbox_here[1] = (PP->shape[1] * j) / nxyz[1];
          ibbox_here[4] = (PP->shape[1] * (j + 1)) / nxyz[1] - 1;
          ibbox_here[2] = (PP->shape[2] * k) / nxyz[2];
          ibbox_here[5] = (PP->shape[2] * (k + 1)) / nxyz[2] - 1;
          if (periodic)
          {
            for (int d = 0; d < dim; d++) { ibbox_here[d] -= ghost_width; ibbox_here[dim + d] += ghost_width; }
          }
          else
          {
            ibbox_here[0] = Mymax(0, ibbox_here[0] - ghost_width);
            ibbox_here[3] = Mymin(PP->shape[0] - 1, ibbox_here[3] + ghost_width);
            ibbox_here[1] = Mymax(0, ibbox_here[1] - ghost_width);
            ibbox_here[4] = Mymin(PP->shape[1] - 1, ibbox_here[4] + ghost_width);
            ibbox_here[2] = Mymax(0, ibbox_here[2] - ghost_width);
            ibbox_here[5] = Mymin(PP->shape[2] - 1, ibbox_here[5] + ghost_width);
          }
          long vol = 1;
          for (int d = 0; d < dim; d++)
            vol *= (ibbox_here[dim + d] - ibbox_here[d] + 1);
          block_volumes.push_back(vol);
        }
    PLi = PLi->next;
  }
  // Greedy LPT: sort by volume descending, assign each to least-loaded rank
  std::vector<int> assigned_ranks(block_volumes.size());
  {
    std::vector<int> order(block_volumes.size());
    for (int i = 0; i < (int)order.size(); i++) order[i] = i;
    std::sort(order.begin(), order.end(), [&](int a, int b) {
      return block_volumes[a] > block_volumes[b];
    });
    std::vector<long> load(cpusize, 0);
    for (int idx : order)
    {
      int min_r = 0;
      for (int r = 1; r < cpusize; r++)
        if (load[r] < load[min_r]) min_r = r;
      assigned_ranks[idx] = min_r;
      load[min_r] += block_volumes[idx];
    }
  }
  // Pass 2: create blocks with pre-assigned ranks
  int block_idx = 0;
  PLi = PatchLIST;
  while (PLi)
  {
    Patch *PP = PLi->data;
    partition3(nxyz, split_size, mmin_width, nodes, PP->shape);
    Block *ng0, *ng;
    int shape_here[dim], ibbox_here[2 * dim];
@@ -504,7 +443,10 @@ MyList<Block> *Parallel::distribute(MyList<Patch> *PatchLIST, int cpusize, int i
            int shape_res[dim * pices];
            double bbox_res[2 * dim * pices];
            misc::dividBlock(dim, shape_here, bbox_here, pices, picef, shape_res, bbox_res, min_width);
-            ng = ng0 = new Block(dim, shape_res, bbox_res, assigned_ranks[block_idx++], ingfsi, fngfsi, PP->lev, 0); // delete through KillBlocks
+            ng = ng0 = new Block(dim, shape_res, bbox_res, n_rank++, ingfsi, fngfsi, PP->lev, 0); // delete through KillBlocks
            //	       if(n_rank==cpusize) {n_rank=0; cerr<<"place one!!"<<endl;}
            //	       ng->checkBlock();
            if (BlL)
              BlL->insert(ng);
@@ -513,19 +455,22 @@ MyList<Block> *Parallel::distribute(MyList<Patch> *PatchLIST, int cpusize, int i
            for (int i = 1; i < pices; i++)
            {
-              ng = new Block(dim, shape_res + i * dim, bbox_res + i * 2 * dim, assigned_ranks[block_idx++], ingfsi, fngfsi, PP->lev, i); // delete through KillBlocks
+              ng = new Block(dim, shape_res + i * dim, bbox_res + i * 2 * dim, n_rank++, ingfsi, fngfsi, PP->lev, i); // delete through KillBlocks
              //	        if(n_rank==cpusize) {n_rank=0; cerr<<"place two!! "<<i<<endl;}
              //	        ng->checkBlock();
              BlL->insert(ng);
            }
          }
 #else
-          ng = ng0 = new Block(dim, shape_here, bbox_here, assigned_ranks[block_idx++], ingfsi, fngfsi, PP->lev); // delete through KillBlocks
+          ng = ng0 = new Block(dim, shape_here, bbox_here, n_rank++, ingfsi, fngfsi, PP->lev); // delete through KillBlocks
          //	    ng->checkBlock();
          if (BlL)
            BlL->insert(ng);
          else
            BlL = new MyList<Block>(ng); // delete through KillBlocks
 #endif
          if (n_rank == cpusize)
            n_rank = 0;
          // set PP->blb
          if (i == 0 && j == 0 && k == 0)
@@ -3559,7 +3504,7 @@ int Parallel::data_packermix(double *data, MyList<Parallel::gridseg> *src, MyLis
  return size_out;
 }
-
+//
 void Parallel::transfer(MyList<Parallel::gridseg> **src, MyList<Parallel::gridseg> **dst,
                        MyList<var> *VarList1 /* source */, MyList<var> *VarList2 /*target */,
                        int Symmetry)
@@ -3567,20 +3512,13 @@ void Parallel::transfer(MyList<Parallel::gridseg> **src, MyList<Parallel::gridse
  int myrank, cpusize;
  MPI_Comm_size(MPI_COMM_WORLD, &cpusize);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
-/*
+
  // Early exit: if no gridseg pairs exist for any node, skip all work
  {
    bool has_segs = false;
    for (int n = 0; n < cpusize; n++) {
      if (src[n] && dst[n]) { has_segs = true; break; }
    }
    if (!has_segs) return;
  }
 */
  int node;
-  MPI_Request *reqs = new MPI_Request[2 * cpusize];
+  MPI_Request *reqs;
-  MPI_Status *stats = new MPI_Status[2 * cpusize];
+  MPI_Status *stats;
  reqs = new MPI_Request[2 * cpusize];
  stats = new MPI_Status[2 * cpusize];
  int req_no = 0;
  double **send_data, **rec_data;
@@ -3588,42 +3526,50 @@ void Parallel::transfer(MyList<Parallel::gridseg> **src, MyList<Parallel::gridse
  rec_data = new double *[cpusize];
  int length;
  for (node = 0; node < cpusize; node++)
    send_data[node] = rec_data[node] = 0;
  // Step 1: local copy + post all Irecv
  for (node = 0; node < cpusize; node++)
  {
    send_data[node] = rec_data[node] = 0;
    if (node == myrank)
    {
      if (length = data_packer(0, src[myrank], dst[myrank], node, PACK, VarList1, VarList2, Symmetry))
      {
        rec_data[node] = new double[length];
        if (!rec_data[node])
        {
          cout << "out of memory when new in short transfer, place 1" << endl;
          MPI_Abort(MPI_COMM_WORLD, 1);
        }
        data_packer(rec_data[node], src[myrank], dst[myrank], node, PACK, VarList1, VarList2, Symmetry);
      }
    }
    else
    {
      // send from this cpu to cpu#node
      if (length = data_packer(0, src[myrank], dst[myrank], node, PACK, VarList1, VarList2, Symmetry))
      {
        send_data[node] = new double[length];
        if (!send_data[node])
        {
          cout << "out of memory when new in short transfer, place 2" << endl;
          MPI_Abort(MPI_COMM_WORLD, 1);
        }
        data_packer(send_data[node], src[myrank], dst[myrank], node, PACK, VarList1, VarList2, Symmetry);
        MPI_Isend((void *)send_data[node], length, MPI_DOUBLE, node, 1, MPI_COMM_WORLD, reqs + req_no++);
      }
      // receive from cpu#node to this cpu
      if (length = data_packer(0, src[node], dst[node], node, UNPACK, VarList1, VarList2, Symmetry))
      {
        rec_data[node] = new double[length];
        if (!rec_data[node])
        {
          cout << "out of memory when new in short transfer, place 3" << endl;
          MPI_Abort(MPI_COMM_WORLD, 1);
        }
        MPI_Irecv((void *)rec_data[node], length, MPI_DOUBLE, node, 1, MPI_COMM_WORLD, reqs + req_no++);
      }
    }
  }
-
+  // wait for all requests to complete
  // Step 2: pack + Isend
  for (node = 0; node < cpusize; node++)
  {
    if (node == myrank) continue;
    if (length = data_packer(0, src[myrank], dst[myrank], node, PACK, VarList1, VarList2, Symmetry))
    {
      send_data[node] = new double[length];
      data_packer(send_data[node], src[myrank], dst[myrank], node, PACK, VarList1, VarList2, Symmetry);
      MPI_Isend((void *)send_data[node], length, MPI_DOUBLE, node, 1, MPI_COMM_WORLD, reqs + req_no++);
    }
  }
  MPI_Waitall(req_no, reqs, stats);
  for (node = 0; node < cpusize; node++)
@@ -3810,6 +3756,502 @@ void Parallel::Sync(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetry)
  delete[] transfer_src;
  delete[] transfer_dst;
 }
 // Merged Sync: collect all intra-patch and inter-patch grid segment lists,
 // then issue a single transfer() call instead of N+1 separate ones.
 void Parallel::Sync_merged(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetry)
 {
  int cpusize;
  MPI_Comm_size(MPI_COMM_WORLD, &cpusize);
  MyList<Parallel::gridseg> **combined_src = new MyList<Parallel::gridseg> *[cpusize];
  MyList<Parallel::gridseg> **combined_dst = new MyList<Parallel::gridseg> *[cpusize];
  for (int node = 0; node < cpusize; node++)
    combined_src[node] = combined_dst[node] = 0;
  // Phase A: Intra-patch ghost exchange segments
  MyList<Patch> *Pp = PatL;
  while (Pp)
  {
    Patch *Pat = Pp->data;
    MyList<Parallel::gridseg> *dst_ghost = build_ghost_gsl(Pat);
    for (int node = 0; node < cpusize; node++)
    {
      MyList<Parallel::gridseg> *src_owned = build_owned_gsl0(Pat, node);
      MyList<Parallel::gridseg> *tsrc = 0, *tdst = 0;
      build_gstl(src_owned, dst_ghost, &tsrc, &tdst);
      if (tsrc)
      {
        if (combined_src[node])
          combined_src[node]->catList(tsrc);
        else
          combined_src[node] = tsrc;
      }
      if (tdst)
      {
        if (combined_dst[node])
          combined_dst[node]->catList(tdst);
        else
          combined_dst[node] = tdst;
      }
      if (src_owned)
        src_owned->destroyList();
    }
    if (dst_ghost)
      dst_ghost->destroyList();
    Pp = Pp->next;
  }
  // Phase B: Inter-patch buffer exchange segments
  MyList<Parallel::gridseg> *dst_buffer = build_buffer_gsl(PatL);
  for (int node = 0; node < cpusize; node++)
  {
    MyList<Parallel::gridseg> *src_owned = build_owned_gsl(PatL, node, 5, Symmetry);
    MyList<Parallel::gridseg> *tsrc = 0, *tdst = 0;
    build_gstl(src_owned, dst_buffer, &tsrc, &tdst);
    if (tsrc)
    {
      if (combined_src[node])
        combined_src[node]->catList(tsrc);
      else
        combined_src[node] = tsrc;
    }
    if (tdst)
    {
      if (combined_dst[node])
        combined_dst[node]->catList(tdst);
      else
        combined_dst[node] = tdst;
    }
    if (src_owned)
      src_owned->destroyList();
  }
  if (dst_buffer)
    dst_buffer->destroyList();
  // Phase C: Single transfer
  transfer(combined_src, combined_dst, VarList, VarList, Symmetry);
  // Phase D: Cleanup
  for (int node = 0; node < cpusize; node++)
  {
    if (combined_src[node])
      combined_src[node]->destroyList();
    if (combined_dst[node])
      combined_dst[node]->destroyList();
  }
  delete[] combined_src;
  delete[] combined_dst;
 }
 // SyncCache constructor
 Parallel::SyncCache::SyncCache()
    : valid(false), cpusize(0), combined_src(0), combined_dst(0),
      send_lengths(0), recv_lengths(0), send_bufs(0), recv_bufs(0),
      send_buf_caps(0), recv_buf_caps(0), reqs(0), stats(0), max_reqs(0),
      lengths_valid(false)
 {
 }
 // SyncCache invalidate: free grid segment lists but keep buffers
 void Parallel::SyncCache::invalidate()
 {
  if (!valid)
    return;
  for (int i = 0; i < cpusize; i++)
  {
    if (combined_src[i])
      combined_src[i]->destroyList();
    if (combined_dst[i])
      combined_dst[i]->destroyList();
    combined_src[i] = combined_dst[i] = 0;
    send_lengths[i] = recv_lengths[i] = 0;
  }
  valid = false;
  lengths_valid = false;
 }
 // SyncCache destroy: free everything
 void Parallel::SyncCache::destroy()
 {
  invalidate();
  if (combined_src) delete[] combined_src;
  if (combined_dst) delete[] combined_dst;
  if (send_lengths) delete[] send_lengths;
  if (recv_lengths) delete[] recv_lengths;
  if (send_buf_caps) delete[] send_buf_caps;
  if (recv_buf_caps) delete[] recv_buf_caps;
  for (int i = 0; i < cpusize; i++)
  {
    if (send_bufs && send_bufs[i]) delete[] send_bufs[i];
    if (recv_bufs && recv_bufs[i]) delete[] recv_bufs[i];
  }
  if (send_bufs) delete[] send_bufs;
  if (recv_bufs) delete[] recv_bufs;
  if (reqs) delete[] reqs;
  if (stats) delete[] stats;
  combined_src = combined_dst = 0;
  send_lengths = recv_lengths = 0;
  send_buf_caps = recv_buf_caps = 0;
  send_bufs = recv_bufs = 0;
  reqs = 0; stats = 0;
  cpusize = 0; max_reqs = 0;
 }
 // transfer_cached: reuse pre-allocated buffers from SyncCache
 void Parallel::transfer_cached(MyList<Parallel::gridseg> **src, MyList<Parallel::gridseg> **dst,
                               MyList<var> *VarList1, MyList<var> *VarList2,
                               int Symmetry, SyncCache &cache)
 {
  int myrank;
  MPI_Comm_size(MPI_COMM_WORLD, &cache.cpusize);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  int cpusize = cache.cpusize;
  int req_no = 0;
  int node;
  for (node = 0; node < cpusize; node++)
  {
    if (node == myrank)
    {
      int length = data_packer(0, src[myrank], dst[myrank], node, PACK, VarList1, VarList2, Symmetry);
      cache.recv_lengths[node] = length;
      if (length > 0)
      {
        if (length > cache.recv_buf_caps[node])
        {
          if (cache.recv_bufs[node]) delete[] cache.recv_bufs[node];
          cache.recv_bufs[node] = new double[length];
          cache.recv_buf_caps[node] = length;
        }
        data_packer(cache.recv_bufs[node], src[myrank], dst[myrank], node, PACK, VarList1, VarList2, Symmetry);
      }
    }
    else
    {
      // send
      int slength = data_packer(0, src[myrank], dst[myrank], node, PACK, VarList1, VarList2, Symmetry);
      cache.send_lengths[node] = slength;
      if (slength > 0)
      {
        if (slength > cache.send_buf_caps[node])
        {
          if (cache.send_bufs[node]) delete[] cache.send_bufs[node];
          cache.send_bufs[node] = new double[slength];
          cache.send_buf_caps[node] = slength;
        }
        data_packer(cache.send_bufs[node], src[myrank], dst[myrank], node, PACK, VarList1, VarList2, Symmetry);
        MPI_Isend((void *)cache.send_bufs[node], slength, MPI_DOUBLE, node, 1, MPI_COMM_WORLD, cache.reqs + req_no++);
      }
      // recv
      int rlength = data_packer(0, src[node], dst[node], node, UNPACK, VarList1, VarList2, Symmetry);
      cache.recv_lengths[node] = rlength;
      if (rlength > 0)
      {
        if (rlength > cache.recv_buf_caps[node])
        {
          if (cache.recv_bufs[node]) delete[] cache.recv_bufs[node];
          cache.recv_bufs[node] = new double[rlength];
          cache.recv_buf_caps[node] = rlength;
        }
        MPI_Irecv((void *)cache.recv_bufs[node], rlength, MPI_DOUBLE, node, 1, MPI_COMM_WORLD, cache.reqs + req_no++);
      }
    }
  }
  MPI_Waitall(req_no, cache.reqs, cache.stats);
  for (node = 0; node < cpusize; node++)
    if (cache.recv_bufs[node] && cache.recv_lengths[node] > 0)
      data_packer(cache.recv_bufs[node], src[node], dst[node], node, UNPACK, VarList1, VarList2, Symmetry);
 }
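`transfer_cached` reuses the per-node buffers held in `SyncCache` and reallocates only when a message is longer than the recorded capacity. That grow-only policy can be isolated as in the sketch below. `ReusableBuffer` and `ensure_capacity` are hypothetical names for illustration; the patch keeps the pointers and capacities in separate arrays (`send_bufs`/`send_buf_caps`, `recv_bufs`/`recv_buf_caps`) instead of a struct.

```cpp
#include <cassert>

// Hypothetical wrapper pairing a buffer with its allocated capacity.
struct ReusableBuffer
{
  double *data = nullptr;
  int capacity = 0; // number of doubles currently allocated
  ~ReusableBuffer() { delete[] data; }
};

// Hypothetical helper: make sure buf can hold `length` doubles.
// Returns true if a reallocation happened. The old contents are not
// preserved, which matches the diff, where buffers are refilled on every call.
bool ensure_capacity(ReusableBuffer &buf, int length)
{
  assert(length >= 0);
  if (length <= buf.capacity)
    return false; // existing allocation is large enough, reuse it
  delete[] buf.data;
  buf.data = new double[length];
  buf.capacity = length;
  return true;
}
```

Because the capacity never shrinks, repeated syncs with stable message sizes allocate only on the first call, trading some held memory for fewer allocations.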
 // Sync_cached: build grid segment lists on first call, reuse on subsequent calls
 void Parallel::Sync_cached(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetry, SyncCache &cache)
 {
  if (!cache.valid)
  {
    int cpusize;
    MPI_Comm_size(MPI_COMM_WORLD, &cpusize);
    cache.cpusize = cpusize;
    // Allocate cache arrays if needed
    if (!cache.combined_src)
    {
      cache.combined_src = new MyList<Parallel::gridseg> *[cpusize];
      cache.combined_dst = new MyList<Parallel::gridseg> *[cpusize];
      cache.send_lengths = new int[cpusize];
      cache.recv_lengths = new int[cpusize];
      cache.send_bufs = new double *[cpusize];
      cache.recv_bufs = new double *[cpusize];
      cache.send_buf_caps = new int[cpusize];
      cache.recv_buf_caps = new int[cpusize];
      for (int i = 0; i < cpusize; i++)
      {
        cache.send_bufs[i] = cache.recv_bufs[i] = 0;
        cache.send_buf_caps[i] = cache.recv_buf_caps[i] = 0;
      }
      cache.max_reqs = 2 * cpusize;
      cache.reqs = new MPI_Request[cache.max_reqs];
      cache.stats = new MPI_Status[cache.max_reqs];
    }
    for (int node = 0; node < cpusize; node++)
    {
      cache.combined_src[node] = cache.combined_dst[node] = 0;
      cache.send_lengths[node] = cache.recv_lengths[node] = 0;
    }
    // Build intra-patch segments (same as Sync_merged Phase A)
    MyList<Patch> *Pp = PatL;
    while (Pp)
    {
      Patch *Pat = Pp->data;
      MyList<Parallel::gridseg> *dst_ghost = build_ghost_gsl(Pat);
      for (int node = 0; node < cpusize; node++)
      {
        MyList<Parallel::gridseg> *src_owned = build_owned_gsl0(Pat, node);
        MyList<Parallel::gridseg> *tsrc = 0, *tdst = 0;
        build_gstl(src_owned, dst_ghost, &tsrc, &tdst);
        if (tsrc)
        {
          if (cache.combined_src[node])
            cache.combined_src[node]->catList(tsrc);
          else
            cache.combined_src[node] = tsrc;
        }
        if (tdst)
        {
          if (cache.combined_dst[node])
            cache.combined_dst[node]->catList(tdst);
          else
            cache.combined_dst[node] = tdst;
        }
        if (src_owned) src_owned->destroyList();
      }
      if (dst_ghost) dst_ghost->destroyList();
      Pp = Pp->next;
    }
    // Build inter-patch segments (same as Sync_merged Phase B)
    MyList<Parallel::gridseg> *dst_buffer = build_buffer_gsl(PatL);
    for (int node = 0; node < cpusize; node++)
    {
      MyList<Parallel::gridseg> *src_owned = build_owned_gsl(PatL, node, 5, Symmetry);
      MyList<Parallel::gridseg> *tsrc = 0, *tdst = 0;
      build_gstl(src_owned, dst_buffer, &tsrc, &tdst);
      if (tsrc)
      {
        if (cache.combined_src[node])
          cache.combined_src[node]->catList(tsrc);
        else
          cache.combined_src[node] = tsrc;
      }
      if (tdst)
      {
        if (cache.combined_dst[node])
          cache.combined_dst[node]->catList(tdst);
        else
          cache.combined_dst[node] = tdst;
      }
      if (src_owned) src_owned->destroyList();
    }
    if (dst_buffer) dst_buffer->destroyList();
    cache.valid = true;
  }
  // Use cached lists with buffer-reusing transfer
  transfer_cached(cache.combined_src, cache.combined_dst, VarList, VarList, Symmetry, cache);
 }
 // Sync_start: pack and post MPI_Isend/Irecv, return immediately
 void Parallel::Sync_start(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetry,
                          SyncCache &cache, AsyncSyncState &state)
 {
  // Ensure cache is built
  if (!cache.valid)
  {
    // Build cache (same logic as Sync_cached)
    int cpusize;
    MPI_Comm_size(MPI_COMM_WORLD, &cpusize);
    cache.cpusize = cpusize;
    if (!cache.combined_src)
    {
      cache.combined_src = new MyList<Parallel::gridseg> *[cpusize];
      cache.combined_dst = new MyList<Parallel::gridseg> *[cpusize];
      cache.send_lengths = new int[cpusize];
      cache.recv_lengths = new int[cpusize];
      cache.send_bufs = new double *[cpusize];
      cache.recv_bufs = new double *[cpusize];
      cache.send_buf_caps = new int[cpusize];
      cache.recv_buf_caps = new int[cpusize];
      for (int i = 0; i < cpusize; i++)
      {
        cache.send_bufs[i] = cache.recv_bufs[i] = 0;
        cache.send_buf_caps[i] = cache.recv_buf_caps[i] = 0;
      }
      cache.max_reqs = 2 * cpusize;
      cache.reqs = new MPI_Request[cache.max_reqs];
      cache.stats = new MPI_Status[cache.max_reqs];
    }
    for (int node = 0; node < cpusize; node++)
    {
      cache.combined_src[node] = cache.combined_dst[node] = 0;
      cache.send_lengths[node] = cache.recv_lengths[node] = 0;
    }
    MyList<Patch> *Pp = PatL;
    while (Pp)
    {
      Patch *Pat = Pp->data;
      MyList<Parallel::gridseg> *dst_ghost = build_ghost_gsl(Pat);
      for (int node = 0; node < cpusize; node++)
      {
        MyList<Parallel::gridseg> *src_owned = build_owned_gsl0(Pat, node);
        MyList<Parallel::gridseg> *tsrc = 0, *tdst = 0;
        build_gstl(src_owned, dst_ghost, &tsrc, &tdst);
        if (tsrc)
        {
          if (cache.combined_src[node])
            cache.combined_src[node]->catList(tsrc);
          else
            cache.combined_src[node] = tsrc;
        }
        if (tdst)
        {
          if (cache.combined_dst[node])
            cache.combined_dst[node]->catList(tdst);
          else
            cache.combined_dst[node] = tdst;
        }
        if (src_owned) src_owned->destroyList();
      }
      if (dst_ghost) dst_ghost->destroyList();
      Pp = Pp->next;
    }
    MyList<Parallel::gridseg> *dst_buffer = build_buffer_gsl(PatL);
    for (int node = 0; node < cpusize; node++)
    {
      MyList<Parallel::gridseg> *src_owned = build_owned_gsl(PatL, node, 5, Symmetry);
      MyList<Parallel::gridseg> *tsrc = 0, *tdst = 0;
      build_gstl(src_owned, dst_buffer, &tsrc, &tdst);
      if (tsrc)
      {
        if (cache.combined_src[node])
          cache.combined_src[node]->catList(tsrc);
        else
          cache.combined_src[node] = tsrc;
      }
      if (tdst)
      {
        if (cache.combined_dst[node])
          cache.combined_dst[node]->catList(tdst);
        else
          cache.combined_dst[node] = tdst;
      }
      if (src_owned) src_owned->destroyList();
    }
    if (dst_buffer) dst_buffer->destroyList();
    cache.valid = true;
  }
  // Now pack and post async MPI operations
  int myrank;
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  int cpusize = cache.cpusize;
  state.req_no = 0;
  state.active = true;
  MyList<Parallel::gridseg> **src = cache.combined_src;
  MyList<Parallel::gridseg> **dst = cache.combined_dst;
  for (int node = 0; node < cpusize; node++)
  {
    if (node == myrank)
    {
      int length;
      if (!cache.lengths_valid) {
        length = data_packer(0, src[myrank], dst[myrank], node, PACK, VarList, VarList, Symmetry);
        cache.recv_lengths[node] = length;
      } else {
        length = cache.recv_lengths[node];
      }
      if (length > 0)
      {
        if (length > cache.recv_buf_caps[node])
        {
          if (cache.recv_bufs[node]) delete[] cache.recv_bufs[node];
          cache.recv_bufs[node] = new double[length];
          cache.recv_buf_caps[node] = length;
        }
        data_packer(cache.recv_bufs[node], src[myrank], dst[myrank], node, PACK, VarList, VarList, Symmetry);
      }
    }
    else
    {
      int slength;
      if (!cache.lengths_valid) {
        slength = data_packer(0, src[myrank], dst[myrank], node, PACK, VarList, VarList, Symmetry);
        cache.send_lengths[node] = slength;
      } else {
        slength = cache.send_lengths[node];
      }
      if (slength > 0)
      {
        if (slength > cache.send_buf_caps[node])
        {
          if (cache.send_bufs[node]) delete[] cache.send_bufs[node];
          cache.send_bufs[node] = new double[slength];
          cache.send_buf_caps[node] = slength;
        }
        data_packer(cache.send_bufs[node], src[myrank], dst[myrank], node, PACK, VarList, VarList, Symmetry);
        MPI_Isend((void *)cache.send_bufs[node], slength, MPI_DOUBLE, node, 2, MPI_COMM_WORLD, cache.reqs + state.req_no++);
      }
      int rlength;
      if (!cache.lengths_valid) {
        rlength = data_packer(0, src[node], dst[node], node, UNPACK, VarList, VarList, Symmetry);
        cache.recv_lengths[node] = rlength;
      } else {
        rlength = cache.recv_lengths[node];
      }
      if (rlength > 0)
      {
        if (rlength > cache.recv_buf_caps[node])
        {
          if (cache.recv_bufs[node]) delete[] cache.recv_bufs[node];
          cache.recv_bufs[node] = new double[rlength];
          cache.recv_buf_caps[node] = rlength;
        }
        MPI_Irecv((void *)cache.recv_bufs[node], rlength, MPI_DOUBLE, node, 2, MPI_COMM_WORLD, cache.reqs + state.req_no++);
      }
    }
  }
  cache.lengths_valid = true;
 }
 // Sync_finish: wait for async MPI operations and unpack
 void Parallel::Sync_finish(SyncCache &cache, AsyncSyncState &state,
                           MyList<var> *VarList, int Symmetry)
 {
  if (!state.active)
    return;
  MPI_Waitall(state.req_no, cache.reqs, cache.stats);
  int cpusize = cache.cpusize;
  MyList<Parallel::gridseg> **src = cache.combined_src;
  MyList<Parallel::gridseg> **dst = cache.combined_dst;
  for (int node = 0; node < cpusize; node++)
    if (cache.recv_bufs[node] && cache.recv_lengths[node] > 0)
      data_packer(cache.recv_bufs[node], src[node], dst[node], node, UNPACK, VarList, VarList, Symmetry);
  state.active = false;
 }
 // collect buffer grid segments or blocks for the periodic boundary condition of given patch
 // ---------------------------------------------------
 // |con |                                       |con |
@@ -4844,6 +5286,203 @@ void Parallel::OutBdLow2Himix(MyList<Patch> *PatcL, MyList<Patch> *PatfL,
  delete[] transfer_src;
  delete[] transfer_dst;
 }
 // Restrict_cached: cache grid segment lists, reuse buffers via transfer_cached
 void Parallel::Restrict_cached(MyList<Patch> *PatcL, MyList<Patch> *PatfL,
                               MyList<var> *VarList1, MyList<var> *VarList2,
                               int Symmetry, SyncCache &cache)
 {
  if (!cache.valid)
  {
    int cpusize;
    MPI_Comm_size(MPI_COMM_WORLD, &cpusize);
    cache.cpusize = cpusize;
    if (!cache.combined_src)
    {
      cache.combined_src = new MyList<Parallel::gridseg> *[cpusize];
      cache.combined_dst = new MyList<Parallel::gridseg> *[cpusize];
      cache.send_lengths = new int[cpusize];
      cache.recv_lengths = new int[cpusize];
      cache.send_bufs = new double *[cpusize];
      cache.recv_bufs = new double *[cpusize];
      cache.send_buf_caps = new int[cpusize];
      cache.recv_buf_caps = new int[cpusize];
      for (int i = 0; i < cpusize; i++)
      {
        cache.send_bufs[i] = cache.recv_bufs[i] = 0;
        cache.send_buf_caps[i] = cache.recv_buf_caps[i] = 0;
      }
      cache.max_reqs = 2 * cpusize;
      cache.reqs = new MPI_Request[cache.max_reqs];
      cache.stats = new MPI_Status[cache.max_reqs];
    }
    MyList<Parallel::gridseg> *dst = build_complete_gsl(PatcL);
    for (int node = 0; node < cpusize; node++)
    {
      MyList<Parallel::gridseg> *src_owned = build_owned_gsl(PatfL, node, 2, Symmetry);
      build_gstl(src_owned, dst, &cache.combined_src[node], &cache.combined_dst[node]);
      if (src_owned) src_owned->destroyList();
    }
    if (dst) dst->destroyList();
    cache.valid = true;
  }
  transfer_cached(cache.combined_src, cache.combined_dst, VarList1, VarList2, Symmetry, cache);
 }
 // OutBdLow2Hi_cached: cache grid segment lists, reuse buffers via transfer_cached
 void Parallel::OutBdLow2Hi_cached(MyList<Patch> *PatcL, MyList<Patch> *PatfL,
                                  MyList<var> *VarList1, MyList<var> *VarList2,
                                  int Symmetry, SyncCache &cache)
 {
  if (!cache.valid)
  {
    int cpusize;
    MPI_Comm_size(MPI_COMM_WORLD, &cpusize);
    cache.cpusize = cpusize;
    if (!cache.combined_src)
    {
      cache.combined_src = new MyList<Parallel::gridseg> *[cpusize];
      cache.combined_dst = new MyList<Parallel::gridseg> *[cpusize];
      cache.send_lengths = new int[cpusize];
      cache.recv_lengths = new int[cpusize];
      cache.send_bufs = new double *[cpusize];
      cache.recv_bufs = new double *[cpusize];
      cache.send_buf_caps = new int[cpusize];
      cache.recv_buf_caps = new int[cpusize];
      for (int i = 0; i < cpusize; i++)
      {
        cache.send_bufs[i] = cache.recv_bufs[i] = 0;
        cache.send_buf_caps[i] = cache.recv_buf_caps[i] = 0;
      }
      cache.max_reqs = 2 * cpusize;
      cache.reqs = new MPI_Request[cache.max_reqs];
      cache.stats = new MPI_Status[cache.max_reqs];
    }
    MyList<Parallel::gridseg> *dst = build_buffer_gsl(PatfL);
    for (int node = 0; node < cpusize; node++)
    {
      MyList<Parallel::gridseg> *src_owned = build_owned_gsl(PatcL, node, 4, Symmetry);
      build_gstl(src_owned, dst, &cache.combined_src[node], &cache.combined_dst[node]);
      if (src_owned) src_owned->destroyList();
    }
    if (dst) dst->destroyList();
    cache.valid = true;
  }
  transfer_cached(cache.combined_src, cache.combined_dst, VarList1, VarList2, Symmetry, cache);
 }
 // OutBdLow2Himix_cached: same as OutBdLow2Hi_cached but uses transfermix for unpacking
 void Parallel::OutBdLow2Himix_cached(MyList<Patch> *PatcL, MyList<Patch> *PatfL,
                                     MyList<var> *VarList1, MyList<var> *VarList2,
                                     int Symmetry, SyncCache &cache)
 {
  if (!cache.valid)
  {
    int cpusize;
    MPI_Comm_size(MPI_COMM_WORLD, &cpusize);
    cache.cpusize = cpusize;
    if (!cache.combined_src)
    {
      cache.combined_src = new MyList<Parallel::gridseg> *[cpusize];
      cache.combined_dst = new MyList<Parallel::gridseg> *[cpusize];
      cache.send_lengths = new int[cpusize];
      cache.recv_lengths = new int[cpusize];
      cache.send_bufs = new double *[cpusize];
      cache.recv_bufs = new double *[cpusize];
      cache.send_buf_caps = new int[cpusize];
      cache.recv_buf_caps = new int[cpusize];
      for (int i = 0; i < cpusize; i++)
      {
        cache.send_bufs[i] = cache.recv_bufs[i] = 0;
        cache.send_buf_caps[i] = cache.recv_buf_caps[i] = 0;
      }
      cache.max_reqs = 2 * cpusize;
      cache.reqs = new MPI_Request[cache.max_reqs];
      cache.stats = new MPI_Status[cache.max_reqs];
    }
    MyList<Parallel::gridseg> *dst = build_buffer_gsl(PatfL);
    for (int node = 0; node < cpusize; node++)
    {
      MyList<Parallel::gridseg> *src_owned = build_owned_gsl(PatcL, node, 4, Symmetry);
      build_gstl(src_owned, dst, &cache.combined_src[node], &cache.combined_dst[node]);
      if (src_owned) src_owned->destroyList();
    }
    if (dst) dst->destroyList();
    cache.valid = true;
  }
  // Use transfermix instead of transfer for mix-mode interpolation
  int myrank;
  MPI_Comm_size(MPI_COMM_WORLD, &cache.cpusize);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  int cpusize = cache.cpusize;
  int req_no = 0;
  for (int node = 0; node < cpusize; node++)
  {
    if (node == myrank)
    {
      int length = data_packermix(0, cache.combined_src[myrank], cache.combined_dst[myrank], node, PACK, VarList1, VarList2, Symmetry);
      cache.recv_lengths[node] = length;
      if (length > 0)
      {
        if (length > cache.recv_buf_caps[node])
        {
          if (cache.recv_bufs[node]) delete[] cache.recv_bufs[node];
          cache.recv_bufs[node] = new double[length];
          cache.recv_buf_caps[node] = length;
        }
        data_packermix(cache.recv_bufs[node], cache.combined_src[myrank], cache.combined_dst[myrank], node, PACK, VarList1, VarList2, Symmetry);
      }
    }
    else
    {
      int slength = data_packermix(0, cache.combined_src[myrank], cache.combined_dst[myrank], node, PACK, VarList1, VarList2, Symmetry);
      cache.send_lengths[node] = slength;
      if (slength > 0)
      {
        if (slength > cache.send_buf_caps[node])
        {
          if (cache.send_bufs[node]) delete[] cache.send_bufs[node];
          cache.send_bufs[node] = new double[slength];
          cache.send_buf_caps[node] = slength;
        }
        data_packermix(cache.send_bufs[node], cache.combined_src[myrank], cache.combined_dst[myrank], node, PACK, VarList1, VarList2, Symmetry);
        MPI_Isend((void *)cache.send_bufs[node], slength, MPI_DOUBLE, node, 1, MPI_COMM_WORLD, cache.reqs + req_no++);
      }
      int rlength = data_packermix(0, cache.combined_src[node], cache.combined_dst[node], node, UNPACK, VarList1, VarList2, Symmetry);
      cache.recv_lengths[node] = rlength;
      if (rlength > 0)
      {
        if (rlength > cache.recv_buf_caps[node])
        {
          if (cache.recv_bufs[node]) delete[] cache.recv_bufs[node];
          cache.recv_bufs[node] = new double[rlength];
          cache.recv_buf_caps[node] = rlength;
        }
        MPI_Irecv((void *)cache.recv_bufs[node], rlength, MPI_DOUBLE, node, 1, MPI_COMM_WORLD, cache.reqs + req_no++);
      }
    }
  }
  MPI_Waitall(req_no, cache.reqs, cache.stats);
  for (int node = 0; node < cpusize; node++)
    if (cache.recv_bufs[node] && cache.recv_lengths[node] > 0)
      data_packermix(cache.recv_bufs[node], cache.combined_src[node], cache.combined_dst[node], node, UNPACK, VarList1, VarList2, Symmetry);
 }
 // collect all buffer grid segments or blocks for given patch
 MyList<Parallel::gridseg> *Parallel::build_buffer_gsl(Patch *Pat)
 {
--- a/AMSS_NCKU_source/Parallel.h
+++ b/AMSS_NCKU_source/Parallel.h
@@ -81,6 +81,43 @@ namespace Parallel
                   int Symmetry);
  void Sync(Patch *Pat, MyList<var> *VarList, int Symmetry);
  void Sync(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetry);
+  void Sync_merged(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetry);
+  struct SyncCache {
+    bool valid;
+    int cpusize;
+    MyList<gridseg> **combined_src;
+    MyList<gridseg> **combined_dst;
+    int *send_lengths;
+    int *recv_lengths;
+    double **send_bufs;
+    double **recv_bufs;
+    int *send_buf_caps;
+    int *recv_buf_caps;
+    MPI_Request *reqs;
+    MPI_Status *stats;
+    int max_reqs;
+    bool lengths_valid;
+    SyncCache();
+    void invalidate();
+    void destroy();
+  };
+  void Sync_cached(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetry, SyncCache &cache);
+  void transfer_cached(MyList<gridseg> **src, MyList<gridseg> **dst,
+                       MyList<var> *VarList1, MyList<var> *VarList2,
+                       int Symmetry, SyncCache &cache);
+  struct AsyncSyncState {
+    int req_no;
+    bool active;
+    AsyncSyncState() : req_no(0), active(false) {}
+  };
+  void Sync_start(MyList<Patch> *PatL, MyList<var> *VarList, int Symmetry,
+                  SyncCache &cache, AsyncSyncState &state);
+  void Sync_finish(SyncCache &cache, AsyncSyncState &state,
+                   MyList<var> *VarList, int Symmetry);
  void OutBdLow2Hi(Patch *Patc, Patch *Patf,
                   MyList<var> *VarList1 /* source */, MyList<var> *VarList2 /* target */,
                   int Symmetry);
@@ -93,6 +130,15 @@ namespace Parallel
  void OutBdLow2Himix(MyList<Patch> *PatcL, MyList<Patch> *PatfL,
                      MyList<var> *VarList1 /* source */, MyList<var> *VarList2 /* target */,
                      int Symmetry);
+  void Restrict_cached(MyList<Patch> *PatcL, MyList<Patch> *PatfL,
+                       MyList<var> *VarList1, MyList<var> *VarList2,
+                       int Symmetry, SyncCache &cache);
+  void OutBdLow2Hi_cached(MyList<Patch> *PatcL, MyList<Patch> *PatfL,
+                          MyList<var> *VarList1, MyList<var> *VarList2,
+                          int Symmetry, SyncCache &cache);
+  void OutBdLow2Himix_cached(MyList<Patch> *PatcL, MyList<Patch> *PatfL,
+                             MyList<var> *VarList1, MyList<var> *VarList2,
+                             int Symmetry, SyncCache &cache);
  void Prolong(Patch *Patc, Patch *Patf,
               MyList<var> *VarList1 /* source */, MyList<var> *VarList2 /* target */,
               int Symmetry);
--- a/AMSS_NCKU_source/Z4c_class.C
+++ b/AMSS_NCKU_source/Z4c_class.C
@@ -321,22 +321,7 @@ void Z4c_class::Step(int lev, int YN)
    }
    Pp = Pp->next;
  }
-  // check error information
+  // NOTE: error check deferred to after Shell Patch computation to reduce MPI_Allreduce calls
-  {
-    int erh = ERROR;
-    MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
-  }
-  if (ERROR)
-  {
-    Parallel::Dump_Data(GH->PatL[lev], StateList, 0, PhysTime, dT_lev);
-    if (myrank == 0)
-    {
-      if (ErrorMonitor->outfile)
-        ErrorMonitor->outfile << "find NaN in state variables at t = " << PhysTime 
-                              << ", lev = " << lev << endl;
-      MPI_Abort(MPI_COMM_WORLD, 1);
-    }
-  }
 #ifdef WithShell
  // evolve Shell Patches
@@ -468,24 +453,16 @@ void Z4c_class::Step(int lev, int YN)
      sPp = sPp->next;
    }
  }
-  // check error information
+  // Non-blocking error reduction overlapped with Sync to hide Allreduce latency
+  MPI_Request err_req_pre;
  {
    int erh = ERROR;
-    MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
+    MPI_Iallreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &err_req_pre);
  }
-  if (ERROR)
-  {
-    SH->Dump_Data(StateList, 0, PhysTime, dT_lev);
-    if (myrank == 0)
-    {
-      if (ErrorMonitor->outfile)
-        ErrorMonitor->outfile << "find NaN in state variables on Shell Patches at t = " << PhysTime << endl;
-      MPI_Abort(MPI_COMM_WORLD, 1);
-    }
-  }
 #endif
-  Parallel::Sync(GH->PatL[lev], SynchList_pre, Symmetry);
+  Parallel::AsyncSyncState async_pre;
+  Parallel::Sync_start(GH->PatL[lev], SynchList_pre, Symmetry, sync_cache_pre[lev], async_pre);
 #ifdef WithShell
  if (lev == 0)
@@ -504,6 +481,24 @@ void Z4c_class::Step(int lev, int YN)
    }
  }
 #endif
+  Parallel::Sync_finish(sync_cache_pre[lev], async_pre, SynchList_pre, Symmetry);
+#ifdef WithShell
+  // Complete non-blocking error reduction and check
+  MPI_Wait(&err_req_pre, MPI_STATUS_IGNORE);
+  if (ERROR)
+  {
+    Parallel::Dump_Data(GH->PatL[lev], StateList, 0, PhysTime, dT_lev);
+    SH->Dump_Data(StateList, 0, PhysTime, dT_lev);
+    if (myrank == 0)
+    {
+      if (ErrorMonitor->outfile)
+        ErrorMonitor->outfile << "find NaN in state variables at t = " << PhysTime
+                              << ", lev = " << lev << endl;
+      MPI_Abort(MPI_COMM_WORLD, 1);
+    }
+  }
+#endif
  // for black hole position
  if (BH_num > 0 && lev == GH->levels - 1)
@@ -693,23 +688,7 @@ void Z4c_class::Step(int lev, int YN)
      Pp = Pp->next;
    }
-    // check error information
+    // NOTE: error check deferred to after Shell Patch computation to reduce MPI_Allreduce calls
-    {
-      int erh = ERROR;
-      MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
-    }
-    if (ERROR)
-    {
-      Parallel::Dump_Data(GH->PatL[lev], SynchList_pre, 0, PhysTime, dT_lev);
-      if (myrank == 0)
-      {
-        if (ErrorMonitor->outfile)
-          ErrorMonitor->outfile << "find NaN in RK4 substep#" << iter_count 
-                                << " variables at t = " << PhysTime 
-                                << ", lev = " << lev << endl;
-        MPI_Abort(MPI_COMM_WORLD, 1);
-      }
-    }
 #ifdef WithShell
    // evolve Shell Patches
@@ -850,25 +829,16 @@ void Z4c_class::Step(int lev, int YN)
        sPp = sPp->next;
      }
    }
-    // check error information
+    // Non-blocking error reduction overlapped with Sync to hide Allreduce latency
+    MPI_Request err_req_cor;
    {
      int erh = ERROR;
-      MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
+      MPI_Iallreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &err_req_cor);
    }
-    if (ERROR)
-    {
-      SH->Dump_Data(SynchList_pre, 0, PhysTime, dT_lev);
-      if (myrank == 0)
-      {
-        if (ErrorMonitor->outfile)
-          ErrorMonitor->outfile << "find NaN on Shell Patches in RK4 substep#" << iter_count 
-                                << " variables at t = " << PhysTime << endl;
-        MPI_Abort(MPI_COMM_WORLD, 1);
-      }
-    }
 #endif
-    Parallel::Sync(GH->PatL[lev], SynchList_cor, Symmetry);
+    Parallel::AsyncSyncState async_cor;
+    Parallel::Sync_start(GH->PatL[lev], SynchList_cor, Symmetry, sync_cache_cor[lev], async_cor);
 #ifdef WithShell
    if (lev == 0)
@@ -886,6 +856,25 @@ void Z4c_class::Step(int lev, int YN)
             << " seconds! " << endl;
      }
    }
 #endif
+    Parallel::Sync_finish(sync_cache_cor[lev], async_cor, SynchList_cor, Symmetry);
+#ifdef WithShell
+    // Complete non-blocking error reduction and check
+    MPI_Wait(&err_req_cor, MPI_STATUS_IGNORE);
+    if (ERROR)
+    {
+      Parallel::Dump_Data(GH->PatL[lev], SynchList_pre, 0, PhysTime, dT_lev);
+      SH->Dump_Data(SynchList_pre, 0, PhysTime, dT_lev);
+      if (myrank == 0)
+      {
+        if (ErrorMonitor->outfile)
+          ErrorMonitor->outfile << "find NaN in RK4 substep#" << iter_count
+                                << " variables at t = " << PhysTime
+                                << ", lev = " << lev << endl;
+        MPI_Abort(MPI_COMM_WORLD, 1);
+      }
+    }
+#endif
    // for black hole position
    if (BH_num > 0 && lev == GH->levels - 1)
@@ -1252,22 +1241,7 @@ void Z4c_class::Step(int lev, int YN)
 	 }
  }
 #endif
-  // check error information
+  // NOTE: error check deferred to after Shell Patch computation to reduce MPI_Allreduce calls
-  {
-    int erh = ERROR;
-    MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
-  }
-  if (ERROR)
-  {
-    Parallel::Dump_Data(GH->PatL[lev], StateList, 0, PhysTime, dT_lev);
-    if (myrank == 0)
-    {
-      if (ErrorMonitor->outfile)
-        ErrorMonitor->outfile << "find NaN in state variables at t = " << PhysTime 
-                              << ", lev = " << lev << endl;
-      MPI_Abort(MPI_COMM_WORLD, 1);
-    }
-  }
  // evolve Shell Patches
  if (lev == 0)
@@ -1542,23 +1516,15 @@ void Z4c_class::Step(int lev, int YN)
  }
 #endif
  }
-  // check error information
+  // Non-blocking error reduction overlapped with Sync to hide Allreduce latency
+  MPI_Request err_req_pre;
  {
    int erh = ERROR;
-    MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
+    MPI_Iallreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &err_req_pre);
  }
-  if (ERROR)
-  {
-    SH->Dump_Data(StateList, 0, PhysTime, dT_lev);
-    if (myrank == 0)
-    {
-      if (ErrorMonitor->outfile)
-        ErrorMonitor->outfile << "find NaN in state variables on Shell Patches at t = " << PhysTime << endl;
-      MPI_Abort(MPI_COMM_WORLD, 1);
-    }
-  }
-  Parallel::Sync(GH->PatL[lev], SynchList_pre, Symmetry);
+  Parallel::AsyncSyncState async_pre;
+  Parallel::Sync_start(GH->PatL[lev], SynchList_pre, Symmetry, sync_cache_pre[lev], async_pre);
  if (lev == 0)
  {
@@ -1620,6 +1586,22 @@ void Z4c_class::Step(int lev, int YN)
  }
 #endif
  }
+  Parallel::Sync_finish(sync_cache_pre[lev], async_pre, SynchList_pre, Symmetry);
+  // Complete non-blocking error reduction and check
+  MPI_Wait(&err_req_pre, MPI_STATUS_IGNORE);
+  if (ERROR)
+  {
+    Parallel::Dump_Data(GH->PatL[lev], StateList, 0, PhysTime, dT_lev);
+    SH->Dump_Data(StateList, 0, PhysTime, dT_lev);
+    if (myrank == 0)
+    {
+      if (ErrorMonitor->outfile)
+        ErrorMonitor->outfile << "find NaN in state variables at t = " << PhysTime
+                              << ", lev = " << lev << endl;
+      MPI_Abort(MPI_COMM_WORLD, 1);
+    }
+  }
  // for black hole position
  if (BH_num > 0 && lev == GH->levels - 1)
@@ -1841,23 +1823,7 @@ void Z4c_class::Step(int lev, int YN)
      Pp = Pp->next;
    }
-    // check error information
+    // NOTE: error check deferred to after Shell Patch computation to reduce MPI_Allreduce calls
-    {
-      int erh = ERROR;
-      MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
-    }
-    if (ERROR)
-    {
-      Parallel::Dump_Data(GH->PatL[lev], SynchList_pre, 0, PhysTime, dT_lev);
-      if (myrank == 0)
-      {
-        if (ErrorMonitor->outfile)
-          ErrorMonitor->outfile << "find NaN in RK4 substep#" << iter_count 
-                                << " variables at t = " << PhysTime 
-                                << ", lev = " << lev << endl;
-        MPI_Abort(MPI_COMM_WORLD, 1);
-      }
-    }
    // evolve Shell Patches
    if (lev == 0)
@@ -2103,24 +2069,15 @@ void Z4c_class::Step(int lev, int YN)
        sPp = sPp->next;
      }
    }
-    // check error information
+    // Non-blocking error reduction overlapped with Sync to hide Allreduce latency
+    MPI_Request err_req_cor;
    {
      int erh = ERROR;
-      MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
+      MPI_Iallreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &err_req_cor);
    }
-    if (ERROR)
-    {
-      SH->Dump_Data(SynchList_pre, 0, PhysTime, dT_lev);
-      if (myrank == 0)
-      {
-        if (ErrorMonitor->outfile)
-          ErrorMonitor->outfile << "find NaN on Shell Patches in RK4 substep#" << iter_count 
-                                << " variables at t = " << PhysTime << endl;
-        MPI_Abort(MPI_COMM_WORLD, 1);
-      }
-    }
-    Parallel::Sync(GH->PatL[lev], SynchList_cor, Symmetry);
+    Parallel::AsyncSyncState async_cor;
+    Parallel::Sync_start(GH->PatL[lev], SynchList_cor, Symmetry, sync_cache_cor[lev], async_cor);
    if (lev == 0)
    {
@@ -2170,6 +2127,23 @@ void Z4c_class::Step(int lev, int YN)
    }
 // end smooth
 #endif
+    Parallel::Sync_finish(sync_cache_cor[lev], async_cor, SynchList_cor, Symmetry);
+    // Complete non-blocking error reduction and check
+    MPI_Wait(&err_req_cor, MPI_STATUS_IGNORE);
+    if (ERROR)
+    {
+      Parallel::Dump_Data(GH->PatL[lev], SynchList_pre, 0, PhysTime, dT_lev);
+      SH->Dump_Data(SynchList_pre, 0, PhysTime, dT_lev);
+      if (myrank == 0)
+      {
+        if (ErrorMonitor->outfile)
+          ErrorMonitor->outfile << "find NaN in RK4 substep#" << iter_count
+                                << " variables at t = " << PhysTime
+                                << ", lev = " << lev << endl;
+        MPI_Abort(MPI_COMM_WORLD, 1);
+      }
+    }
    // for black hole position
    if (BH_num > 0 && lev == GH->levels - 1)
--- a/AMSS_NCKU_source/bssn_class.C
+++ b/AMSS_NCKU_source/bssn_class.C
@@ -730,6 +730,12 @@ void bssn_class::Initialize()
    PhysTime = StartTime;
    Setup_Black_Hole_position();
  }
+  // Initialize sync caches (per-level, for predictor and corrector)
+  sync_cache_pre = new Parallel::SyncCache[GH->levels];
+  sync_cache_cor = new Parallel::SyncCache[GH->levels];
+  sync_cache_rp_coarse = new Parallel::SyncCache[GH->levels];
+  sync_cache_rp_fine = new Parallel::SyncCache[GH->levels];
 }
 //================================================================================================
@@ -981,6 +987,32 @@ bssn_class::~bssn_class()
  delete Azzz;
 #endif
+  // Destroy sync caches before GH
+  if (sync_cache_pre)
+  {
+    for (int i = 0; i < GH->levels; i++)
+      sync_cache_pre[i].destroy();
+    delete[] sync_cache_pre;
+  }
+  if (sync_cache_cor)
+  {
+    for (int i = 0; i < GH->levels; i++)
+      sync_cache_cor[i].destroy();
+    delete[] sync_cache_cor;
+  }
+  if (sync_cache_rp_coarse)
+  {
+    for (int i = 0; i < GH->levels; i++)
+      sync_cache_rp_coarse[i].destroy();
+    delete[] sync_cache_rp_coarse;
+  }
+  if (sync_cache_rp_fine)
+  {
+    for (int i = 0; i < GH->levels; i++)
+      sync_cache_rp_fine[i].destroy();
+    delete[] sync_cache_rp_fine;
+  }
  delete GH;
 #ifdef WithShell
  delete SH;
@@ -2181,6 +2213,7 @@ void bssn_class::Evolve(int Steps)
    GH->Regrid(Symmetry, BH_num, Porgbr, Porg0,
               SynchList_cor, OldStateList, StateList, SynchList_pre,
               fgt(PhysTime - dT_mon, StartTime, dT_mon / 2), ErrorMonitor);
+    for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); }
 #endif
 #if (REGLEV == 0 && (PSTR == 1 || PSTR == 2))
@@ -2396,6 +2429,7 @@ void bssn_class::RecursiveStep(int lev)
  GH->Regrid_Onelevel(lev, Symmetry, BH_num, Porgbr, Porg0,
                      SynchList_cor, OldStateList, StateList, SynchList_pre,
                      fgt(PhysTime - dT_lev, StartTime, dT_lev / 2), ErrorMonitor);
+  for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); }
 #endif
 }
@@ -2574,6 +2608,7 @@ void bssn_class::ParallelStep()
  GH->Regrid_Onelevel(GH->mylev, Symmetry, BH_num, Porgbr, Porg0,
                      SynchList_cor, OldStateList, StateList, SynchList_pre,
                      fgt(PhysTime - dT_lev, StartTime, dT_lev / 2), ErrorMonitor);
+  for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); }
 #endif
 }
@@ -2740,6 +2775,7 @@ void bssn_class::ParallelStep()
        GH->Regrid_Onelevel(lev + 1, Symmetry, BH_num, Porgbr, Porg0,
                            SynchList_cor, OldStateList, StateList, SynchList_pre,
                            fgt(PhysTime - dT_levp1, StartTime, dT_levp1 / 2), ErrorMonitor);
+        for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); }
        //               a_stream.clear();
        //               a_stream.str("");
@@ -2754,6 +2790,7 @@ void bssn_class::ParallelStep()
      GH->Regrid_Onelevel(lev, Symmetry, BH_num, Porgbr, Porg0,
                          SynchList_cor, OldStateList, StateList, SynchList_pre,
                          fgt(PhysTime - dT_lev, StartTime, dT_lev / 2), ErrorMonitor);
+      for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); }
      //               a_stream.clear();
      //               a_stream.str("");
@@ -2772,6 +2809,7 @@ void bssn_class::ParallelStep()
          GH->Regrid_Onelevel(lev - 1, Symmetry, BH_num, Porgbr, Porg0,
                              SynchList_cor, OldStateList, StateList, SynchList_pre,
                              fgt(PhysTime - dT_lev, StartTime, dT_levm1 / 2), ErrorMonitor);
+          for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); }
          //               a_stream.clear();
          //               a_stream.str("");
@@ -2787,6 +2825,7 @@ void bssn_class::ParallelStep()
          GH->Regrid_Onelevel(lev - 1, Symmetry, BH_num, Porgbr, Porg0,
                              SynchList_cor, OldStateList, StateList, SynchList_pre,
                              fgt(PhysTime - dT_lev, StartTime, dT_levm1 / 2), ErrorMonitor);
+          for (int il = 0; il < GH->levels; il++) { sync_cache_pre[il].invalidate(); sync_cache_cor[il].invalidate(); sync_cache_rp_coarse[il].invalidate(); sync_cache_rp_fine[il].invalidate(); }
          //               a_stream.clear();
          //               a_stream.str("");
@@ -3158,21 +3197,7 @@ void bssn_class::Step(int lev, int YN)
    }
    Pp = Pp->next;
  }
-  // check error information
+  // NOTE: error check deferred to after Shell Patch computation to reduce MPI_Allreduce calls
-  {
-    int erh = ERROR;
-    MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
-  }
-  if (ERROR)
-  {
-    Parallel::Dump_Data(GH->PatL[lev], StateList, 0, PhysTime, dT_lev);
-    if (myrank == 0)
-    {
-      if (ErrorMonitor->outfile)
-        ErrorMonitor->outfile << "find NaN in state variables at t = " << PhysTime << ", lev = " << lev << endl;
-      MPI_Abort(MPI_COMM_WORLD, 1);
-    }
-  }
 #ifdef WithShell
  // evolve Shell Patches
@@ -3316,25 +3341,16 @@ void bssn_class::Step(int lev, int YN)
 #endif
  }
-  // check error information
+  // Non-blocking error reduction overlapped with Sync to hide Allreduce latency
+  MPI_Request err_req;
  {
    int erh = ERROR;
-    MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
+    MPI_Iallreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &err_req);
  }
-  if (ERROR)
-  {
-    SH->Dump_Data(StateList, 0, PhysTime, dT_lev);
-    if (myrank == 0)
-    {
-      if (ErrorMonitor->outfile)
-        ErrorMonitor->outfile << "find NaN in state variables on Shell Patches at t = " << PhysTime << endl;
-      MPI_Abort(MPI_COMM_WORLD, 1);
-    }
-  }
 #endif
-  Parallel::Sync(GH->PatL[lev], SynchList_pre, Symmetry);
+  Parallel::AsyncSyncState async_pre;
+  Parallel::Sync_start(GH->PatL[lev], SynchList_pre, Symmetry, sync_cache_pre[lev], async_pre);
 #ifdef WithShell
  if (lev == 0)
@@ -3353,6 +3369,23 @@ void bssn_class::Step(int lev, int YN)
    }
  }
 #endif
+  Parallel::Sync_finish(sync_cache_pre[lev], async_pre, SynchList_pre, Symmetry);
+#ifdef WithShell
+  // Complete non-blocking error reduction and check
+  MPI_Wait(&err_req, MPI_STATUS_IGNORE);
+  if (ERROR)
+  {
+    Parallel::Dump_Data(GH->PatL[lev], StateList, 0, PhysTime, dT_lev);
+    SH->Dump_Data(StateList, 0, PhysTime, dT_lev);
+    if (myrank == 0)
+    {
+      if (ErrorMonitor->outfile)
+        ErrorMonitor->outfile << "find NaN in state variables at t = " << PhysTime << ", lev = " << lev << endl;
+      MPI_Abort(MPI_COMM_WORLD, 1);
+    }
+  }
+#endif
 #if (MAPBH == 0)
  // for black hole position
@@ -3528,24 +3561,7 @@ void bssn_class::Step(int lev, int YN)
      Pp = Pp->next;
    }
-    // check error information
+    // NOTE: error check deferred to after Shell Patch computation to reduce MPI_Allreduce calls
-    {
-      int erh = ERROR;
-      MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
-    }
-    if (ERROR)
-    {
-      Parallel::Dump_Data(GH->PatL[lev], SynchList_pre, 0, PhysTime, dT_lev);
-      if (myrank == 0)
-      {
-        if (ErrorMonitor->outfile)
-          ErrorMonitor->outfile << "find NaN in RK4 substep#" << iter_count 
-                                << " variables at t = " << PhysTime 
-                                << ", lev = " << lev << endl;
-        MPI_Abort(MPI_COMM_WORLD, 1);
-      }
-    }
 #ifdef WithShell
    // evolve Shell Patches
@@ -3685,26 +3701,16 @@ void bssn_class::Step(int lev, int YN)
        sPp = sPp->next;
      }
    }
-    // check error information
+    // Non-blocking error reduction overlapped with Sync to hide Allreduce latency
    MPI_Request err_req_cor;
    {
      int erh = ERROR;
-      MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
+      MPI_Iallreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &err_req_cor);
    }
-    if (ERROR)
-    {
-      SH->Dump_Data(SynchList_pre, 0, PhysTime, dT_lev);
-      if (myrank == 0)
-      {
-        if (ErrorMonitor->outfile)
-          ErrorMonitor->outfile << "find NaN on Shell Patches in RK4 substep#"
-                                << iter_count << " variables at t = "
-                                << PhysTime << endl;
-        MPI_Abort(MPI_COMM_WORLD, 1);
-      }
-    }
 #endif
-    Parallel::Sync(GH->PatL[lev], SynchList_cor, Symmetry);
+    Parallel::AsyncSyncState async_cor;
+    Parallel::Sync_start(GH->PatL[lev], SynchList_cor, Symmetry, sync_cache_cor[lev], async_cor);
 #ifdef WithShell
    if (lev == 0)
@@ -3723,6 +3729,25 @@ void bssn_class::Step(int lev, int YN)
      }
    }
 #endif
+    Parallel::Sync_finish(sync_cache_cor[lev], async_cor, SynchList_cor, Symmetry);
+#ifdef WithShell
+    // Complete non-blocking error reduction and check
+    MPI_Wait(&err_req_cor, MPI_STATUS_IGNORE);
+    if (ERROR)
+    {
+      Parallel::Dump_Data(GH->PatL[lev], SynchList_pre, 0, PhysTime, dT_lev);
+      SH->Dump_Data(SynchList_pre, 0, PhysTime, dT_lev);
+      if (myrank == 0)
+      {
+        if (ErrorMonitor->outfile)
+          ErrorMonitor->outfile << "find NaN in RK4 substep#" << iter_count
+                                << " variables at t = " << PhysTime
+                                << ", lev = " << lev << endl;
+        MPI_Abort(MPI_COMM_WORLD, 1);
+      }
+    }
+#endif
 #if (MAPBH == 0)
    // for black hole position
@@ -4034,22 +4059,7 @@ void bssn_class::Step(int lev, int YN)
    }
    Pp = Pp->next;
  }
-  // check error information
+  // NOTE: error check deferred to after Shell Patch computation to reduce MPI_Allreduce calls
-  {
-    int erh = ERROR;
-    MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
-  }
-  if (ERROR)
-  {
-    Parallel::Dump_Data(GH->PatL[lev], StateList, 0, PhysTime, dT_lev);
-    if (myrank == 0)
-    {
-      if (ErrorMonitor->outfile)
-        ErrorMonitor->outfile << "find NaN in state variables at t = " << PhysTime
-                              << ", lev = " << lev << endl;
-      MPI_Abort(MPI_COMM_WORLD, 1);
-    }
-  }
 #ifdef WithShell
  // evolve Shell Patches
@@ -4190,25 +4200,16 @@ void bssn_class::Step(int lev, int YN)
  }
 #endif
  }
-  // check error information
+  // Non-blocking error reduction overlapped with Sync to hide Allreduce latency
+  MPI_Request err_req;
  {
    int erh = ERROR;
-    MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
+    MPI_Iallreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &err_req);
  }
-  if (ERROR)
-  {
-    SH->Dump_Data(StateList, 0, PhysTime, dT_lev);
-    if (myrank == 0)
-    {
-      if (ErrorMonitor->outfile)
-        ErrorMonitor->outfile << "find NaN in state variables on Shell Patches at t = "
-                              << PhysTime << endl;
-      MPI_Abort(MPI_COMM_WORLD, 1);
-    }
-  }
 #endif
-  Parallel::Sync(GH->PatL[lev], SynchList_pre, Symmetry);
+  Parallel::AsyncSyncState async_pre;
+  Parallel::Sync_start(GH->PatL[lev], SynchList_pre, Symmetry, sync_cache_pre[lev], async_pre);
 #ifdef WithShell
  if (lev == 0)
@@ -4222,8 +4223,26 @@ void bssn_class::Step(int lev, int YN)
      prev_clock = curr_clock;
      curr_clock = clock();
      cout << " Shell stuff synchronization used "
-      << (double)(curr_clock - prev_clock) / ((double)CLOCKS_PER_SEC) 
+           << (double)(curr_clock - prev_clock) / ((double)CLOCKS_PER_SEC)
-      << " seconds! " << endl;
+           << " seconds! " << endl;
    }
  }
 #endif
+  Parallel::Sync_finish(sync_cache_pre[lev], async_pre, SynchList_pre, Symmetry);
+#ifdef WithShell
+  // Complete non-blocking error reduction and check
+  MPI_Wait(&err_req, MPI_STATUS_IGNORE);
+  if (ERROR)
+  {
+    Parallel::Dump_Data(GH->PatL[lev], StateList, 0, PhysTime, dT_lev);
+    SH->Dump_Data(StateList, 0, PhysTime, dT_lev);
+    if (myrank == 0)
+    {
+      if (ErrorMonitor->outfile)
+        ErrorMonitor->outfile << "find NaN in state variables at t = " << PhysTime
+                              << ", lev = " << lev << endl;
+      MPI_Abort(MPI_COMM_WORLD, 1);
+    }
+  }
+#endif
@@ -4386,23 +4405,7 @@ void bssn_class::Step(int lev, int YN)
      Pp = Pp->next;
    }
-    // check error information
+    // NOTE: error check deferred to after Shell Patch computation to reduce MPI_Allreduce calls
-    {
-      int erh = ERROR;
-      MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
-    }
-    if (ERROR)
-    {
-      Parallel::Dump_Data(GH->PatL[lev], SynchList_pre, 0, PhysTime, dT_lev);
-      if (myrank == 0)
-      {
-        if (ErrorMonitor->outfile)
-          ErrorMonitor->outfile << "find NaN in RK4 substep#" << iter_count
-                                << " variables at t = " << PhysTime
-                                << ", lev = " << lev << endl;
-        MPI_Abort(MPI_COMM_WORLD, 1);
-      }
-    }
 #ifdef WithShell
    // evolve Shell Patches
@@ -4542,25 +4545,16 @@ void bssn_class::Step(int lev, int YN)
        sPp = sPp->next;
      }
    }
-    // check error information
+    // Non-blocking error reduction overlapped with Sync to hide Allreduce latency
+    MPI_Request err_req_cor;
    {
      int erh = ERROR;
-      MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
+      MPI_Iallreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &err_req_cor);
    }
-    if (ERROR)
-    {
-      SH->Dump_Data(SynchList_pre, 0, PhysTime, dT_lev);
-      if (myrank == 0)
-      {
-        if (ErrorMonitor->outfile)
-          ErrorMonitor->outfile << "find NaN on Shell Patches in RK4 substep#" << iter_count
-                                << " variables at t = " << PhysTime << endl;
-        MPI_Abort(MPI_COMM_WORLD, 1);
-      }
-    }
 #endif
-    Parallel::Sync(GH->PatL[lev], SynchList_cor, Symmetry);
+    Parallel::AsyncSyncState async_cor;
+    Parallel::Sync_start(GH->PatL[lev], SynchList_cor, Symmetry, sync_cache_cor[lev], async_cor);
 #ifdef WithShell
    if (lev == 0)
@@ -4578,6 +4572,25 @@ void bssn_class::Step(int lev, int YN)
             << " seconds! " << endl;
      }
    }
 #endif
+    Parallel::Sync_finish(sync_cache_cor[lev], async_cor, SynchList_cor, Symmetry);
+#ifdef WithShell
+    // Complete non-blocking error reduction and check
+    MPI_Wait(&err_req_cor, MPI_STATUS_IGNORE);
+    if (ERROR)
+    {
+      Parallel::Dump_Data(GH->PatL[lev], SynchList_pre, 0, PhysTime, dT_lev);
+      SH->Dump_Data(SynchList_pre, 0, PhysTime, dT_lev);
+      if (myrank == 0)
+      {
+        if (ErrorMonitor->outfile)
+          ErrorMonitor->outfile << "find NaN in RK4 substep#" << iter_count
+                                << " variables at t = " << PhysTime
+                                << ", lev = " << lev << endl;
+        MPI_Abort(MPI_COMM_WORLD, 1);
+      }
+    }
+#endif
    // for black hole position
    if (BH_num > 0 && lev == GH->levels - 1)
@@ -4943,11 +4956,19 @@ void bssn_class::Step(int lev, int YN)
  //   misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"after Predictor rhs calculation");
-  // check error information
+  // Non-blocking error reduction overlapped with Sync to hide Allreduce latency
+  MPI_Request err_req;
  {
    int erh = ERROR;
-    MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, GH->Commlev[lev]);
+    MPI_Iallreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, GH->Commlev[lev], &err_req);
  }
+  //   misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"before Predictor sync");
+  Parallel::Sync_cached(GH->PatL[lev], SynchList_pre, Symmetry, sync_cache_pre[lev]);
+  // Complete non-blocking error reduction and check
+  MPI_Wait(&err_req, MPI_STATUS_IGNORE);
  if (ERROR)
  {
    Parallel::Dump_Data(GH->PatL[lev], StateList, 0, PhysTime, dT_lev);
@@ -4959,10 +4980,6 @@ void bssn_class::Step(int lev, int YN)
    }
  }
-  //   misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"before Predictor sync");
-  Parallel::Sync(GH->PatL[lev], SynchList_pre, Symmetry);
 #if (MAPBH == 0)
  // for black hole position
  if (BH_num > 0 && lev == GH->levels - 1)
@@ -5140,11 +5157,21 @@ void bssn_class::Step(int lev, int YN)
    //   misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"before Corrector error check");
-    // check error information
+    // Non-blocking error reduction overlapped with Sync to hide Allreduce latency
+    MPI_Request err_req_cor;
    {
      int erh = ERROR;
-      MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, GH->Commlev[lev]);
+      MPI_Iallreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, GH->Commlev[lev], &err_req_cor);
    }
+    //    misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"before Corrector sync");
+    Parallel::Sync_cached(GH->PatL[lev], SynchList_cor, Symmetry, sync_cache_cor[lev]);
+    //    misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"after Corrector sync");
+    // Complete non-blocking error reduction and check
+    MPI_Wait(&err_req_cor, MPI_STATUS_IGNORE);
    if (ERROR)
    {
      Parallel::Dump_Data(GH->PatL[lev], SynchList_pre, 0, PhysTime, dT_lev);
@@ -5158,12 +5185,6 @@ void bssn_class::Step(int lev, int YN)
      }
    }
-    //    misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"before Corrector sync");
-    Parallel::Sync(GH->PatL[lev], SynchList_cor, Symmetry);
-    //    misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"after Corrector sync");
 #if (MAPBH == 0)
    // for black hole position
    if (BH_num > 0 && lev == GH->levels - 1)
@@ -5447,21 +5468,11 @@ void bssn_class::SHStep()
 #if (PSTR == 1 || PSTR == 2)
 //   misc::tillherecheck(GH->Commlev[lev],GH->start_rank[lev],"before Predictor's error check");
 #endif
-  // check error information
+  // Non-blocking error reduction overlapped with Synch to hide Allreduce latency
+  MPI_Request err_req;
  {
    int erh = ERROR;
-    MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
+    MPI_Iallreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &err_req);
  }
-  if (ERROR)
-  {
-    SH->Dump_Data(StateList, 0, PhysTime, dT_lev);
-    if (myrank == 0)
-    {
-      if (ErrorMonitor->outfile)
-        ErrorMonitor->outfile << "find NaN in state variables on Shell Patches at t = " << PhysTime << endl;
-      MPI_Abort(MPI_COMM_WORLD, 1);
-    }
-  }
  {
@@ -5479,6 +5490,19 @@ void bssn_class::SHStep()
    }
  }
+  // Complete non-blocking error reduction and check
+  MPI_Wait(&err_req, MPI_STATUS_IGNORE);
+  if (ERROR)
+  {
+    SH->Dump_Data(StateList, 0, PhysTime, dT_lev);
+    if (myrank == 0)
+    {
+      if (ErrorMonitor->outfile)
+        ErrorMonitor->outfile << "find NaN in state variables on Shell Patches at t = " << PhysTime << endl;
+      MPI_Abort(MPI_COMM_WORLD, 1);
+    }
+  }
  // corrector
  for (iter_count = 1; iter_count < 4; iter_count++)
  {
@@ -5621,21 +5645,11 @@ void bssn_class::SHStep()
        sPp = sPp->next;
      }
    }
-    // check error information
+    // Non-blocking error reduction overlapped with Synch to hide Allreduce latency
+    MPI_Request err_req_cor;
    {
      int erh = ERROR;
-      MPI_Allreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
+      MPI_Iallreduce(&erh, &ERROR, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &err_req_cor);
    }
-    if (ERROR)
-    {
-      SH->Dump_Data(SynchList_pre, 0, PhysTime, dT_lev);
-      if (myrank == 0)
-      {
-        if (ErrorMonitor->outfile)
-          ErrorMonitor->outfile << "find NaN on Shell Patches in RK4 substep#" << iter_count
-                                << " variables at t = " << PhysTime << endl;
-        MPI_Abort(MPI_COMM_WORLD, 1);
-      }
-    }
    {
@@ -5653,6 +5667,20 @@ void bssn_class::SHStep()
      }
    }
+    // Complete non-blocking error reduction and check
+    MPI_Wait(&err_req_cor, MPI_STATUS_IGNORE);
+    if (ERROR)
+    {
+      SH->Dump_Data(SynchList_pre, 0, PhysTime, dT_lev);
+      if (myrank == 0)
+      {
+        if (ErrorMonitor->outfile)
+          ErrorMonitor->outfile << "find NaN on Shell Patches in RK4 substep#" << iter_count
+                                << " variables at t = " << PhysTime << endl;
+        MPI_Abort(MPI_COMM_WORLD, 1);
+      }
+    }
    sPp = SH->PatL;
    while (sPp)
    {
@@ -5781,7 +5809,7 @@ void bssn_class::RestrictProlong(int lev, int YN, bool BB,
 //       misc::tillherecheck(GH->Commlev[GH->mylev],GH->start_rank[GH->mylev],a_stream.str());
 #endif
-      Parallel::Sync(GH->PatL[lev - 1], SynchList_pre, Symmetry);
+      Parallel::Sync_cached(GH->PatL[lev - 1], SynchList_pre, Symmetry, sync_cache_rp_coarse[lev]);
 #if (PSTR == 1 || PSTR == 2)
 //       a_stream.clear();
@@ -5791,21 +5819,11 @@ void bssn_class::RestrictProlong(int lev, int YN, bool BB,
 #endif
 #if (RPB == 0)
-      Ppc = GH->PatL[lev - 1];
-      while (Ppc)
-      {
-        Pp = GH->PatL[lev];
-        while (Pp)
-        {
 #if (MIXOUTB == 0)
-          Parallel::OutBdLow2Hi(Ppc->data, Pp->data, SynchList_pre, SL, Symmetry);
+      Parallel::OutBdLow2Hi(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SL, Symmetry);
 #elif (MIXOUTB == 1)
-          Parallel::OutBdLow2Himix(Ppc->data, Pp->data, SynchList_pre, SL, Symmetry);
+      Parallel::OutBdLow2Himix(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SL, Symmetry);
 #endif
-          Pp = Pp->next;
-        }
-        Ppc = Ppc->next;
-      }
 #elif (RPB == 1)
      //       Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],SynchList_pre,SL,Symmetry);
      Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SL, GH->bdsul[lev], Symmetry);
@@ -5842,7 +5860,7 @@ void bssn_class::RestrictProlong(int lev, int YN, bool BB,
 //       misc::tillherecheck(GH->Commlev[GH->mylev],GH->start_rank[GH->mylev],a_stream.str());
 #endif
-      Parallel::Sync(GH->PatL[lev - 1], SL, Symmetry);
+      Parallel::Sync_cached(GH->PatL[lev - 1], SL, Symmetry, sync_cache_rp_coarse[lev]);
 #if (PSTR == 1 || PSTR == 2)
 //       a_stream.clear();
@@ -5852,21 +5870,11 @@ void bssn_class::RestrictProlong(int lev, int YN, bool BB,
 #endif
 #if (RPB == 0)
-      Ppc = GH->PatL[lev - 1];
-      while (Ppc)
-      {
-        Pp = GH->PatL[lev];
-        while (Pp)
-        {
 #if (MIXOUTB == 0)
-          Parallel::OutBdLow2Hi(Ppc->data, Pp->data, SL, SL, Symmetry);
+      Parallel::OutBdLow2Hi(GH->PatL[lev - 1], GH->PatL[lev], SL, SL, Symmetry);
 #elif (MIXOUTB == 1)
-          Parallel::OutBdLow2Himix(Ppc->data, Pp->data, SL, SL, Symmetry);
+      Parallel::OutBdLow2Himix(GH->PatL[lev - 1], GH->PatL[lev], SL, SL, Symmetry);
 #endif
-          Pp = Pp->next;
-        }
-        Ppc = Ppc->next;
-      }
 #elif (RPB == 1)
      //       Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],SL,SL,Symmetry);
      Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], SL, SL, GH->bdsul[lev], Symmetry);
@@ -5880,7 +5888,7 @@ void bssn_class::RestrictProlong(int lev, int YN, bool BB,
 #endif
    }
-    Parallel::Sync(GH->PatL[lev], SL, Symmetry);
+    Parallel::Sync_cached(GH->PatL[lev], SL, Symmetry, sync_cache_rp_fine[lev]);
 #if (PSTR == 1 || PSTR == 2)
 //    a_stream.clear();
@@ -5938,24 +5946,14 @@ void bssn_class::RestrictProlong_aux(int lev, int YN, bool BB,
      Parallel::Restrict_bam(GH->PatL[lev - 1], GH->PatL[lev], SL, SynchList_pre, GH->rsul[lev], Symmetry);
 #endif
-      Parallel::Sync(GH->PatL[lev - 1], SynchList_pre, Symmetry);
+      Parallel::Sync_cached(GH->PatL[lev - 1], SynchList_pre, Symmetry, sync_cache_rp_coarse[lev]);
 #if (RPB == 0)
-      Ppc = GH->PatL[lev - 1];
-      while (Ppc)
-      {
-        Pp = GH->PatL[lev];
-        while (Pp)
-        {
 #if (MIXOUTB == 0)
-          Parallel::OutBdLow2Hi(Ppc->data, Pp->data, SynchList_pre, SL, Symmetry);
+      Parallel::OutBdLow2Hi(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SL, Symmetry);
 #elif (MIXOUTB == 1)
-          Parallel::OutBdLow2Himix(Ppc->data, Pp->data, SynchList_pre, SL, Symmetry);
+      Parallel::OutBdLow2Himix(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SL, Symmetry);
 #endif
-          Pp = Pp->next;
-        }
-        Ppc = Ppc->next;
-      }
 #elif (RPB == 1)
      //       Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],SynchList_pre,SL,Symmetry);
      Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SL, GH->bdsul[lev], Symmetry);
@@ -5970,31 +5968,21 @@ void bssn_class::RestrictProlong_aux(int lev, int YN, bool BB,
      Parallel::Restrict_bam(GH->PatL[lev - 1], GH->PatL[lev], SL, SL, GH->rsul[lev], Symmetry);
 #endif
-      Parallel::Sync(GH->PatL[lev - 1], SL, Symmetry);
+      Parallel::Sync_cached(GH->PatL[lev - 1], SL, Symmetry, sync_cache_rp_coarse[lev]);
 #if (RPB == 0)
-      Ppc = GH->PatL[lev - 1];
-      while (Ppc)
-      {
-        Pp = GH->PatL[lev];
-        while (Pp)
-        {
 #if (MIXOUTB == 0)
-          Parallel::OutBdLow2Hi(Ppc->data, Pp->data, SL, SL, Symmetry);
+      Parallel::OutBdLow2Hi(GH->PatL[lev - 1], GH->PatL[lev], SL, SL, Symmetry);
 #elif (MIXOUTB == 1)
-          Parallel::OutBdLow2Himix(Ppc->data, Pp->data, SL, SL, Symmetry);
+      Parallel::OutBdLow2Himix(GH->PatL[lev - 1], GH->PatL[lev], SL, SL, Symmetry);
 #endif
-          Pp = Pp->next;
-        }
-        Ppc = Ppc->next;
-      }
 #elif (RPB == 1)
      //       Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],SL,SL,Symmetry);
      Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], SL, SL, GH->bdsul[lev], Symmetry);
 #endif
    }
-    Parallel::Sync(GH->PatL[lev], SL, Symmetry);
+    Parallel::Sync_cached(GH->PatL[lev], SL, Symmetry, sync_cache_rp_fine[lev]);
  }
 }
@@ -6045,24 +6033,14 @@ void bssn_class::RestrictProlong(int lev, int YN, bool BB)
      Parallel::Restrict_bam(GH->PatL[lev - 1], GH->PatL[lev], SynchList_cor, SynchList_pre, GH->rsul[lev], Symmetry);
 #endif
-      Parallel::Sync(GH->PatL[lev - 1], SynchList_pre, Symmetry);
+      Parallel::Sync_cached(GH->PatL[lev - 1], SynchList_pre, Symmetry, sync_cache_rp_coarse[lev]);
 #if (RPB == 0)
-      Ppc = GH->PatL[lev - 1];
-      while (Ppc)
-      {
-        Pp = GH->PatL[lev];
-        while (Pp)
-        {
 #if (MIXOUTB == 0)
-          Parallel::OutBdLow2Hi(Ppc->data, Pp->data, SynchList_pre, SynchList_cor, Symmetry);
+      Parallel::OutBdLow2Hi(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SynchList_cor, Symmetry);
 #elif (MIXOUTB == 1)
-          Parallel::OutBdLow2Himix(Ppc->data, Pp->data, SynchList_pre, SynchList_cor, Symmetry);
+      Parallel::OutBdLow2Himix(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SynchList_cor, Symmetry);
 #endif
-          Pp = Pp->next;
-        }
-        Ppc = Ppc->next;
-      }
 #elif (RPB == 1)
      //       Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],SynchList_pre,SynchList_cor,Symmetry);
      Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SynchList_cor, GH->bdsul[lev], Symmetry);
@@ -6079,31 +6057,21 @@ void bssn_class::RestrictProlong(int lev, int YN, bool BB)
      Parallel::Restrict_bam(GH->PatL[lev - 1], GH->PatL[lev], SynchList_cor, StateList, GH->rsul[lev], Symmetry);
 #endif
-      Parallel::Sync(GH->PatL[lev - 1], StateList, Symmetry);
+      Parallel::Sync_cached(GH->PatL[lev - 1], StateList, Symmetry, sync_cache_rp_coarse[lev]);
 #if (RPB == 0)
-      Ppc = GH->PatL[lev - 1];
-      while (Ppc)
-      {
-        Pp = GH->PatL[lev];
-        while (Pp)
-        {
 #if (MIXOUTB == 0)
-          Parallel::OutBdLow2Hi(Ppc->data, Pp->data, StateList, SynchList_cor, Symmetry);
+      Parallel::OutBdLow2Hi(GH->PatL[lev - 1], GH->PatL[lev], StateList, SynchList_cor, Symmetry);
 #elif (MIXOUTB == 1)
-          Parallel::OutBdLow2Himix(Ppc->data, Pp->data, StateList, SynchList_cor, Symmetry);
+      Parallel::OutBdLow2Himix(GH->PatL[lev - 1], GH->PatL[lev], StateList, SynchList_cor, Symmetry);
 #endif
-          Pp = Pp->next;
-        }
-        Ppc = Ppc->next;
-      }
 #elif (RPB == 1)
      //       Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],StateList,SynchList_cor,Symmetry);
      Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], StateList, SynchList_cor, GH->bdsul[lev], Symmetry);
 #endif
    }
-    Parallel::Sync(GH->PatL[lev], SynchList_cor, Symmetry);
+    Parallel::Sync_cached(GH->PatL[lev], SynchList_cor, Symmetry, sync_cache_rp_fine[lev]);
  }
 }
@@ -6133,21 +6101,11 @@ void bssn_class::ProlongRestrict(int lev, int YN, bool BB)
      }
 #if (RPB == 0)
-      Ppc = GH->PatL[lev - 1];
-      while (Ppc)
-      {
-        Pp = GH->PatL[lev];
-        while (Pp)
-        {
 #if (MIXOUTB == 0)
-          Parallel::OutBdLow2Hi(Ppc->data, Pp->data, SynchList_pre, SynchList_cor, Symmetry);
+      Parallel::OutBdLow2Hi(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SynchList_cor, Symmetry);
 #elif (MIXOUTB == 1)
-          Parallel::OutBdLow2Himix(Ppc->data, Pp->data, SynchList_pre, SynchList_cor, Symmetry);
+      Parallel::OutBdLow2Himix(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SynchList_cor, Symmetry);
 #endif
-          Pp = Pp->next;
-        }
-        Ppc = Ppc->next;
-      }
 #elif (RPB == 1)
      //       Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],SynchList_pre,SynchList_cor,Symmetry);
      Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], SynchList_pre, SynchList_cor, GH->bdsul[lev], Symmetry);
@@ -6156,21 +6114,11 @@ void bssn_class::ProlongRestrict(int lev, int YN, bool BB)
    else // no time refinement levels and for all same time levels
    {
 #if (RPB == 0)
-      Ppc = GH->PatL[lev - 1];
-      while (Ppc)
-      {
-        Pp = GH->PatL[lev];
-        while (Pp)
-        {
 #if (MIXOUTB == 0)
-          Parallel::OutBdLow2Hi(Ppc->data, Pp->data, StateList, SynchList_cor, Symmetry);
+      Parallel::OutBdLow2Hi(GH->PatL[lev - 1], GH->PatL[lev], StateList, SynchList_cor, Symmetry);
 #elif (MIXOUTB == 1)
-          Parallel::OutBdLow2Himix(Ppc->data, Pp->data, StateList, SynchList_cor, Symmetry);
+      Parallel::OutBdLow2Himix(GH->PatL[lev - 1], GH->PatL[lev], StateList, SynchList_cor, Symmetry);
 #endif
-          Pp = Pp->next;
-        }
-        Ppc = Ppc->next;
-      }
 #elif (RPB == 1)
      //       Parallel::OutBdLow2Hi_bam(GH->PatL[lev-1],GH->PatL[lev],StateList,SynchList_cor,Symmetry);
      Parallel::OutBdLow2Hi_bam(GH->PatL[lev - 1], GH->PatL[lev], StateList, SynchList_cor, GH->bdsul[lev], Symmetry);
@@ -6186,10 +6134,10 @@ void bssn_class::ProlongRestrict(int lev, int YN, bool BB)
 #else
      Parallel::Restrict_after(GH->PatL[lev - 1], GH->PatL[lev], SynchList_cor, StateList, Symmetry);
 #endif
-      Parallel::Sync(GH->PatL[lev - 1], StateList, Symmetry);
+      Parallel::Sync_cached(GH->PatL[lev - 1], StateList, Symmetry, sync_cache_rp_coarse[lev]);
    }
-    Parallel::Sync(GH->PatL[lev], SynchList_cor, Symmetry);
+    Parallel::Sync_cached(GH->PatL[lev], SynchList_cor, Symmetry, sync_cache_rp_fine[lev]);
  }
 }
 #undef MIXOUTB
--- a/AMSS_NCKU_source/bssn_class.h
+++ b/AMSS_NCKU_source/bssn_class.h
@@ -126,6 +126,11 @@ public:
       MyList<var> *OldStateList, *DumpList;
       MyList<var> *ConstraintList;
+       Parallel::SyncCache *sync_cache_pre;  // per-level cache for predictor sync
+       Parallel::SyncCache *sync_cache_cor;  // per-level cache for corrector sync
+       Parallel::SyncCache *sync_cache_rp_coarse;  // RestrictProlong sync on PatL[lev-1]
+       Parallel::SyncCache *sync_cache_rp_fine;    // RestrictProlong sync on PatL[lev]
       monitor *ErrorMonitor, *Psi4Monitor, *BHMonitor, *MAPMonitor;
       monitor *ConVMonitor;
       surface_integral *Waveshell;
--- a/AMSS_NCKU_source/bssn_rhs.f90
+++ b/AMSS_NCKU_source/bssn_rhs.f90
@@ -161,8 +161,36 @@
  chi_rhs = F2o3 *chin1*( alpn1 * trK - div_beta ) !rhs for chi
+  call fderivs(ex,dxx,gxxx,gxxy,gxxz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,Lev)
+  call fderivs(ex,gxy,gxyx,gxyy,gxyz,X,Y,Z,ANTI,ANTI,SYM ,Symmetry,Lev)
+  call fderivs(ex,gxz,gxzx,gxzy,gxzz,X,Y,Z,ANTI,SYM ,ANTI,Symmetry,Lev)
+  call fderivs(ex,dyy,gyyx,gyyy,gyyz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,Lev)
+  call fderivs(ex,gyz,gyzx,gyzy,gyzz,X,Y,Z,SYM ,ANTI,ANTI,Symmetry,Lev)
+  call fderivs(ex,dzz,gzzx,gzzy,gzzz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,Lev)
+  gxx_rhs = - TWO * alpn1 * Axx    -  F2o3 * gxx * div_beta          + &
+              TWO *(  gxx * betaxx +   gxy * betayx +   gxz * betazx)
+  gyy_rhs = - TWO * alpn1 * Ayy    -  F2o3 * gyy * div_beta          + &
+              TWO *(  gxy * betaxy +   gyy * betayy +   gyz * betazy)
+  gzz_rhs = - TWO * alpn1 * Azz    -  F2o3 * gzz * div_beta          + &
+              TWO *(  gxz * betaxz +   gyz * betayz +   gzz * betazz)
+  gxy_rhs = - TWO * alpn1 * Axy    +  F1o3 * gxy    * div_beta       + &
+                      gxx * betaxy                  +   gxz * betazy + &
+                                       gyy * betayx +   gyz * betazx   &
+                                                    -   gxy * betazz
+  gyz_rhs = - TWO * alpn1 * Ayz    +  F1o3 * gyz    * div_beta       + &
+                      gxy * betaxz +   gyy * betayz                  + &
+                      gxz * betaxy                  +   gzz * betazy   &
+                                                    -   gyz * betaxx
+  gxz_rhs = - TWO * alpn1 * Axz    +  F1o3 * gxz    * div_beta       + &
+                      gxx * betaxz +   gxy * betayz                  + &
+                                       gyz * betayx +   gzz * betazx   &
+                                                    -   gxz * betayy     !rhs for gij
 ! invert tilted metric
  gupzz =  gxx * gyy * gzz + gxy * gyz * gxz + gxz * gxy * gyz - &
@@ -173,12 +201,7 @@
  gupyy =   ( gxx * gzz - gxz * gxz ) / gupzz
  gupyz = - ( gxx * gyz - gxy * gxz ) / gupzz
  gupzz =   ( gxx * gyy - gxy * gxy ) / gupzz
-  call fderivs(ex,dxx,gxxx,gxxy,gxxz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,Lev)
+
-  call fderivs(ex,gxy,gxyx,gxyy,gxyz,X,Y,Z,ANTI,ANTI,SYM ,Symmetry,Lev)
-  call fderivs(ex,gxz,gxzx,gxzy,gxzz,X,Y,Z,ANTI,SYM ,ANTI,Symmetry,Lev)
-  call fderivs(ex,dyy,gyyx,gyyy,gyyz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,Lev)
-  call fderivs(ex,gyz,gyzx,gyzy,gyzz,X,Y,Z,SYM ,ANTI,ANTI,Symmetry,Lev)
-  call fderivs(ex,dzz,gzzx,gzzy,gzzz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,Lev)
  if(co == 0)then
 ! Gam^i_Res = Gam^i + gup^ij_,j
  Gmx_Res = Gamx - (gupxx*(gupxx*gxxx+gupxy*gxyx+gupxz*gxzx)&
@@ -922,103 +945,60 @@
  SSA(2)=SYM
  SSA(3)=ANTI
-!!!!!!!!!advection term part
+!!!!!!!!!advection term + Kreiss-Oliger dissipation (merged for cache efficiency)
+! lopsided_kodis shares the symmetry_bd buffer between advection and
+! dissipation, eliminating redundant full-grid copies. For metric variables
+! gxx/gyy/gzz (=dxx/dyy/dzz+1): kodis stencil coefficients sum to zero,
+! so the constant offset has no effect on dissipation.
-  gxx_rhs = - TWO * alpn1 * Axx    -  F2o3 * gxx * div_beta          + &
+  call lopsided_kodis(ex,X,Y,Z,gxx,gxx_rhs,betax,betay,betaz,Symmetry,SSS,eps)
-              TWO *(  gxx * betaxx +   gxy * betayx +   gxz * betazx)
+  call lopsided_kodis(ex,X,Y,Z,gxy,gxy_rhs,betax,betay,betaz,Symmetry,AAS,eps)
+  call lopsided_kodis(ex,X,Y,Z,gxz,gxz_rhs,betax,betay,betaz,Symmetry,ASA,eps)
+  call lopsided_kodis(ex,X,Y,Z,gyy,gyy_rhs,betax,betay,betaz,Symmetry,SSS,eps)
+  call lopsided_kodis(ex,X,Y,Z,gyz,gyz_rhs,betax,betay,betaz,Symmetry,SAA,eps)
+  call lopsided_kodis(ex,X,Y,Z,gzz,gzz_rhs,betax,betay,betaz,Symmetry,SSS,eps)
-  gyy_rhs = - TWO * alpn1 * Ayy    -  F2o3 * gyy * div_beta          + &
+  call lopsided_kodis(ex,X,Y,Z,Axx,Axx_rhs,betax,betay,betaz,Symmetry,SSS,eps)
-              TWO *(  gxy * betaxy +   gyy * betayy +   gyz * betazy)
+  call lopsided_kodis(ex,X,Y,Z,Axy,Axy_rhs,betax,betay,betaz,Symmetry,AAS,eps)
+  call lopsided_kodis(ex,X,Y,Z,Axz,Axz_rhs,betax,betay,betaz,Symmetry,ASA,eps)
+  call lopsided_kodis(ex,X,Y,Z,Ayy,Ayy_rhs,betax,betay,betaz,Symmetry,SSS,eps)
+  call lopsided_kodis(ex,X,Y,Z,Ayz,Ayz_rhs,betax,betay,betaz,Symmetry,SAA,eps)
+  call lopsided_kodis(ex,X,Y,Z,Azz,Azz_rhs,betax,betay,betaz,Symmetry,SSS,eps)
-  gzz_rhs = - TWO * alpn1 * Azz    -  F2o3 * gzz * div_beta          + &
+  call lopsided_kodis(ex,X,Y,Z,chi,chi_rhs,betax,betay,betaz,Symmetry,SSS,eps)
-              TWO *(  gxz * betaxz +   gyz * betayz +   gzz * betazz)
+  call lopsided_kodis(ex,X,Y,Z,trK,trK_rhs,betax,betay,betaz,Symmetry,SSS,eps)
-  gxy_rhs = - TWO * alpn1 * Axy    +  F1o3 * gxy    * div_beta       + &
+  call lopsided_kodis(ex,X,Y,Z,Gamx,Gamx_rhs,betax,betay,betaz,Symmetry,ASS,eps)
-                      gxx * betaxy                  +   gxz * betazy + &
+  call lopsided_kodis(ex,X,Y,Z,Gamy,Gamy_rhs,betax,betay,betaz,Symmetry,SAS,eps)
-                                        gyy * betayx +   gyz * betazx   &
+  call lopsided_kodis(ex,X,Y,Z,Gamz,Gamz_rhs,betax,betay,betaz,Symmetry,SSA,eps)
-                                                    -   gxy * betazz
-  gyz_rhs = - TWO * alpn1 * Ayz    +  F1o3 * gyz    * div_beta       + &
+#if 1 
-                      gxy * betaxz +   gyy * betayz                  + &
+!! bam does not apply dissipation on gauge variables
-                      gxz * betaxy                  +   gzz * betazy   &
+  call lopsided_kodis(ex,X,Y,Z,Lap,Lap_rhs,betax,betay,betaz,Symmetry,SSS,eps)
-                                                    -   gyz * betaxx
+#if (GAUGE == 0 || GAUGE == 1 || GAUGE == 2 || GAUGE == 3 || GAUGE == 4 || GAUGE == 5 || GAUGE == 6 || GAUGE == 7)
-
+  call lopsided_kodis(ex,X,Y,Z,betax,betax_rhs,betax,betay,betaz,Symmetry,ASS,eps)
-  gxz_rhs = - TWO * alpn1 * Axz    +  F1o3 * gxz    * div_beta       + &
+  call lopsided_kodis(ex,X,Y,Z,betay,betay_rhs,betax,betay,betaz,Symmetry,SAS,eps)
-                      gxx * betaxz +   gxy * betayz                  + &
+  call lopsided_kodis(ex,X,Y,Z,betaz,betaz_rhs,betax,betay,betaz,Symmetry,SSA,eps)
-                                        gyz * betayx +   gzz * betazx   &
+#endif
-                                                    -   gxz * betayy     !rhs for gij
+#if (GAUGE == 0 || GAUGE == 2 || GAUGE == 3 || GAUGE == 6 || GAUGE == 7)
-
+  call lopsided_kodis(ex,X,Y,Z,dtSfx,dtSfx_rhs,betax,betay,betaz,Symmetry,ASS,eps)
-
+  call lopsided_kodis(ex,X,Y,Z,dtSfy,dtSfy_rhs,betax,betay,betaz,Symmetry,SAS,eps)
-
+  call lopsided_kodis(ex,X,Y,Z,dtSfz,dtSfz_rhs,betax,betay,betaz,Symmetry,SSA,eps)
-
+#endif
-
+#else
-  if(eps>0)then 
+! No dissipation on gauge variables (advection only)
 ! usual Kreiss-Oliger dissipation     
  call merge_lopsided_kodis(ex,X,Y,Z,chi,chi_rhs,betax,betay,betaz,Symmetry,SSS,eps) 
  call merge_lopsided_kodis(ex,X,Y,Z,gxx,gxx_rhs,betax,betay,betaz,Symmetry,SSS,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,gxy,gxy_rhs,betax,betay,betaz,Symmetry,AAS,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,gxz,gxz_rhs,betax,betay,betaz,Symmetry,ASA,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,gyy,gyy_rhs,betax,betay,betaz,Symmetry,SSS,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,gyz,gyz_rhs,betax,betay,betaz,Symmetry,SAA,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,gzz,gzz_rhs,betax,betay,betaz,Symmetry,SSS,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,Axx,Axx_rhs,betax,betay,betaz,Symmetry,SSS,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,Axy,Axy_rhs,betax,betay,betaz,Symmetry,AAS,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,Axz,Axz_rhs,betax,betay,betaz,Symmetry,ASA,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,Ayy,Ayy_rhs,betax,betay,betaz,Symmetry,SSS,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,Ayz,Ayz_rhs,betax,betay,betaz,Symmetry,SAA,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,Azz,Azz_rhs,betax,betay,betaz,Symmetry,SSS,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,chi,chi_rhs,betax,betay,betaz,Symmetry,SSS,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,trK,trK_rhs,betax,betay,betaz,Symmetry,SSS,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,Gamx,Gamx_rhs,betax,betay,betaz,Symmetry,ASS,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,Gamy,Gamy_rhs,betax,betay,betaz,Symmetry,SAS,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,Gamz,Gamz_rhs,betax,betay,betaz,Symmetry,SSA,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,Lap,Lap_rhs,betax,betay,betaz,Symmetry,SSS,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,betax,betax_rhs,betax,betay,betaz,Symmetry,ASS,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,betay,betay_rhs,betax,betay,betaz,Symmetry,SAS,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,betaz,betaz_rhs,betax,betay,betaz,Symmetry,SSA,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,dtSfx,dtSfx_rhs,betax,betay,betaz,Symmetry,ASS,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,dtSfy,dtSfy_rhs,betax,betay,betaz,Symmetry,SAS,eps)
  call merge_lopsided_kodis(ex,X,Y,Z,dtSfz,dtSfz_rhs,betax,betay,betaz,Symmetry,SSA,eps)
  else 
  call lopsided(ex,X,Y,Z,gxx,gxx_rhs,betax,betay,betaz,Symmetry,SSS)
  call lopsided(ex,X,Y,Z,gxy,gxy_rhs,betax,betay,betaz,Symmetry,AAS)
  call lopsided(ex,X,Y,Z,gxz,gxz_rhs,betax,betay,betaz,Symmetry,ASA)
  call lopsided(ex,X,Y,Z,gyy,gyy_rhs,betax,betay,betaz,Symmetry,SSS)
  call lopsided(ex,X,Y,Z,gyz,gyz_rhs,betax,betay,betaz,Symmetry,SAA)
  call lopsided(ex,X,Y,Z,gzz,gzz_rhs,betax,betay,betaz,Symmetry,SSS)
  call lopsided(ex,X,Y,Z,Axx,Axx_rhs,betax,betay,betaz,Symmetry,SSS)
  call lopsided(ex,X,Y,Z,Axy,Axy_rhs,betax,betay,betaz,Symmetry,AAS)
  call lopsided(ex,X,Y,Z,Axz,Axz_rhs,betax,betay,betaz,Symmetry,ASA)
  call lopsided(ex,X,Y,Z,Ayy,Ayy_rhs,betax,betay,betaz,Symmetry,SSS)
  call lopsided(ex,X,Y,Z,Ayz,Ayz_rhs,betax,betay,betaz,Symmetry,SAA)
  call lopsided(ex,X,Y,Z,Azz,Azz_rhs,betax,betay,betaz,Symmetry,SSS)
  call lopsided(ex,X,Y,Z,chi,chi_rhs,betax,betay,betaz,Symmetry,SSS)
  call lopsided(ex,X,Y,Z,trK,trK_rhs,betax,betay,betaz,Symmetry,SSS)
  call lopsided(ex,X,Y,Z,Gamx,Gamx_rhs,betax,betay,betaz,Symmetry,ASS)
  call lopsided(ex,X,Y,Z,Gamy,Gamy_rhs,betax,betay,betaz,Symmetry,SAS)
  call lopsided(ex,X,Y,Z,Gamz,Gamz_rhs,betax,betay,betaz,Symmetry,SSA)
  call lopsided(ex,X,Y,Z,Lap,Lap_rhs,betax,betay,betaz,Symmetry,SSS)
 #if (GAUGE == 0 || GAUGE == 1 || GAUGE == 2 || GAUGE == 3 || GAUGE == 4 || GAUGE == 5 || GAUGE == 6 || GAUGE == 7)
  call lopsided(ex,X,Y,Z,betax,betax_rhs,betax,betay,betaz,Symmetry,ASS)
  call lopsided(ex,X,Y,Z,betay,betay_rhs,betax,betay,betaz,Symmetry,SAS)
  call lopsided(ex,X,Y,Z,betaz,betaz_rhs,betax,betay,betaz,Symmetry,SSA)
 #endif
 #if (GAUGE == 0 || GAUGE == 2 || GAUGE == 3 || GAUGE == 6 || GAUGE == 7)
  call lopsided(ex,X,Y,Z,dtSfx,dtSfx_rhs,betax,betay,betaz,Symmetry,ASS)
  call lopsided(ex,X,Y,Z,dtSfy,dtSfy_rhs,betax,betay,betaz,Symmetry,SAS)
  call lopsided(ex,X,Y,Z,dtSfz,dtSfz_rhs,betax,betay,betaz,Symmetry,SSA)
-
+#endif
-
+#endif
  endif
  if(co == 0)then
 ! ham_Res = trR + 2/3 * K^2 - A_ij * A^ij - 16 * PI * rho
@@ -1163,265 +1143,3 @@ endif
  return
  end function compute_rhs_bssn
  subroutine merge_lopsided_kodis(ex,X,Y,Z,f,f_rhs,Sfx,Sfy,Sfz,Symmetry,SoA,eps)
    implicit none
  !~~~~~~> Input parameters:
    integer, intent(in)  :: ex(1:3),Symmetry
    real*8,  intent(in)  :: X(1:ex(1)),Y(1:ex(2)),Z(1:ex(3))
    real*8,dimension(ex(1),ex(2),ex(3)),intent(in)   :: f,Sfx,Sfy,Sfz
    real*8,dimension(ex(1),ex(2),ex(3)),intent(inout):: f_rhs
    real*8,dimension(3),intent(in) ::SoA
  !~~~~~~> local variables:
  ! note index -2,-1,0, so we have 3 extra points
    real*8,dimension(-2:ex(1),-2:ex(2),-2:ex(3))   :: fh
    integer :: imin_lopsided,jmin_lopsided,kmin_lopsided,imin_kodis,jmin_kodis,kmin_kodis,imax,jmax,kmax,i,j,k
    real*8 :: dX,dY,dZ
    real*8 :: d12dx,d12dy,d12dz,d2dx,d2dy,d2dz
    real*8,  parameter :: ZEO=0.d0,ONE=1.d0, F3=3.d0
    real*8,  parameter :: TWO=2.d0,F6=6.0d0,F18=1.8d1
    real*8,  parameter :: F12=1.2d1, F10=1.d1,EIT=8.d0
    integer, parameter :: NO_SYMM = 0, EQ_SYMM = 1, OCTANT = 2
    real*8, parameter :: SIX=6.d0,FIT=1.5d1,TWT=2.d1
    real*8,parameter::cof=6.4d1   ! 2^6
    real*8,intent(in) :: eps
    dX = X(2)-X(1)
    dY = Y(2)-Y(1)
    dZ = Z(2)-Z(1)
    d12dx = ONE/F12/dX
    d12dy = ONE/F12/dY
    d12dz = ONE/F12/dZ
    d2dx = ONE/TWO/dX
    d2dy = ONE/TWO/dY
    d2dz = ONE/TWO/dZ
    imax = ex(1)
    jmax = ex(2)
    kmax = ex(3)
    imin_lopsided = 1
    jmin_lopsided = 1
    kmin_lopsided = 1
    if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin_lopsided = -2
    if(Symmetry > EQ_SYMM .and. dabs(X(1)) < dX) imin_lopsided = -2
    if(Symmetry > EQ_SYMM .and. dabs(Y(1)) < dY) jmin_lopsided = -2
    imin_kodis = 1
    jmin_kodis = 1
    kmin_kodis = 1
    if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin_kodis = -2
    if(Symmetry == OCTANT .and. dabs(X(1)) < dX) imin_kodis = -2
    if(Symmetry == OCTANT .and. dabs(Y(1)) < dY) jmin_kodis = -2
    call symmetry_bd(3,ex,f,fh,SoA)
  ! upper bound is set to ex-1 only for efficiency;
  ! the loop body would set the points at ex to 0 as well
    do k=1,ex(3)-1
    do j=1,ex(2)-1
    do i=1,ex(1)-1
  !! new code, 2012dec27, based on bam
  ! x direction   
      if(Sfx(i,j,k) > ZEO)then
        if(i+3 <= imax)then
  !         v
  ! D f = ------[ - 3f    - 10f  + 18f    - 6f     + f     ]
  !  i     12dx       i-v      i      i+v     i+2v    i+3v
      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
                    Sfx(i,j,k)*d12dx*(-F3*fh(i-1,j,k)-F10*fh(i,j,k)+F18*fh(i+1,j,k) &
                                      -F6*fh(i+2,j,k)+    fh(i+3,j,k))
      elseif(i+2 <= imax)then
  !
  !              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
  !  fx(i) = ---------------------------------------------
  !                             12 dx
      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
                    Sfx(i,j,k)*d12dx*(fh(i-2,j,k)-EIT*fh(i-1,j,k)+EIT*fh(i+1,j,k)-fh(i+2,j,k))
      elseif(i+1 <= imax)then
  !         v
  ! D f = ------[   3f    + 10f  - 18f    + 6f     - f     ]
  !  i     12dx       i+v      i      i-v     i-2v    i-3v
      f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
                    Sfx(i,j,k)*d12dx*(-F3*fh(i+1,j,k)-F10*fh(i,j,k)+F18*fh(i-1,j,k) &
                                      -F6*fh(i-2,j,k)+    fh(i-3,j,k))
  ! set imax and imin_lopsided 0
      endif
    elseif(Sfx(i,j,k) < ZEO)then
        if(i-3 >= imin_lopsided)then
  !         v
  ! D f = ------[ - 3f    - 10f  + 18f    - 6f     + f     ]
  !  i     12dx       i-v      i      i+v     i+2v    i+3v
      f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
                    Sfx(i,j,k)*d12dx*(-F3*fh(i+1,j,k)-F10*fh(i,j,k)+F18*fh(i-1,j,k) &
                                      -F6*fh(i-2,j,k)+    fh(i-3,j,k))
      elseif(i-2 >= imin_lopsided)then
  !
  !              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
  !  fx(i) = ---------------------------------------------
  !                             12 dx
      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
                    Sfx(i,j,k)*d12dx*(fh(i-2,j,k)-EIT*fh(i-1,j,k)+EIT*fh(i+1,j,k)-fh(i+2,j,k))
      elseif(i-1 >= imin_lopsided)then
  !         v
  ! D f = ------[   3f    + 10f  - 18f    + 6f     - f     ]
  !  i     12dx       i+v      i      i-v     i-2v    i-3v
      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
                    Sfx(i,j,k)*d12dx*(-F3*fh(i-1,j,k)-F10*fh(i,j,k)+F18*fh(i+1,j,k) &
                                      -F6*fh(i+2,j,k)+    fh(i+3,j,k))
  ! set imax and imin_lopsided 0
      endif
    endif
  ! y direction   
      if(Sfy(i,j,k) > ZEO)then
        if(j+3 <= jmax)then
  !         v
  ! D f = ------[ - 3f    - 10f  + 18f    - 6f     + f     ]
  !  i     12dx       i-v      i      i+v     i+2v    i+3v
      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
                    Sfy(i,j,k)*d12dy*(-F3*fh(i,j-1,k)-F10*fh(i,j,k)+F18*fh(i,j+1,k) &
                                      -F6*fh(i,j+2,k)+    fh(i,j+3,k))
      elseif(j+2 <= jmax)then
  !
  !              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
  !  fx(i) = ---------------------------------------------
  !                             12 dx
      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
                    Sfy(i,j,k)*d12dy*(fh(i,j-2,k)-EIT*fh(i,j-1,k)+EIT*fh(i,j+1,k)-fh(i,j+2,k))
      elseif(j+1 <= jmax)then
  !         v
  ! D f = ------[   3f    + 10f  - 18f    + 6f     - f     ]
  !  i     12dx       i+v      i      i-v     i-2v    i-3v
      f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
                    Sfy(i,j,k)*d12dy*(-F3*fh(i,j+1,k)-F10*fh(i,j,k)+F18*fh(i,j-1,k) &
                                      -F6*fh(i,j-2,k)+    fh(i,j-3,k))
  ! set jmax and jmin_lopsided 0
      endif
    elseif(Sfy(i,j,k) < ZEO)then
        if(j-3 >= jmin_lopsided)then
  !         v
  ! D f = ------[ - 3f    - 10f  + 18f    - 6f     + f     ]
  !  i     12dx       i-v      i      i+v     i+2v    i+3v
      f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
                    Sfy(i,j,k)*d12dy*(-F3*fh(i,j+1,k)-F10*fh(i,j,k)+F18*fh(i,j-1,k) &
                                      -F6*fh(i,j-2,k)+    fh(i,j-3,k))
      elseif(j-2 >= jmin_lopsided)then
  !
  !              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
  !  fx(i) = ---------------------------------------------
  !                             12 dx
      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
                    Sfy(i,j,k)*d12dy*(fh(i,j-2,k)-EIT*fh(i,j-1,k)+EIT*fh(i,j+1,k)-fh(i,j+2,k))
      elseif(j-1 >= jmin_lopsided)then
  !         v
  ! D f = ------[   3f    + 10f  - 18f    + 6f     - f     ]
  !  i     12dx       i+v      i      i-v     i-2v    i-3v
      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
                    Sfy(i,j,k)*d12dy*(-F3*fh(i,j-1,k)-F10*fh(i,j,k)+F18*fh(i,j+1,k) &
                                      -F6*fh(i,j+2,k)+    fh(i,j+3,k))
  ! set jmax and jmin_lopsided 0
      endif
    endif
  ! z direction   
      if(Sfz(i,j,k) > ZEO)then
        if(k+3 <= kmax)then
  !         v
  ! D f = ------[ - 3f    - 10f  + 18f    - 6f     + f     ]
  !  i     12dx       i-v      i      i+v     i+2v    i+3v
      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
                    Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k-1)-F10*fh(i,j,k)+F18*fh(i,j,k+1) &
                                      -F6*fh(i,j,k+2)+    fh(i,j,k+3))
      elseif(k+2 <= kmax)then
  !
  !              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
  !  fx(i) = ---------------------------------------------
  !                             12 dx
      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
                    Sfz(i,j,k)*d12dz*(fh(i,j,k-2)-EIT*fh(i,j,k-1)+EIT*fh(i,j,k+1)-fh(i,j,k+2))
      elseif(k+1 <= kmax)then
  !         v
  ! D f = ------[   3f    + 10f  - 18f    + 6f     - f     ]
  !  i     12dx       i+v      i      i-v     i-2v    i-3v
      f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
                    Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k+1)-F10*fh(i,j,k)+F18*fh(i,j,k-1) &
                                      -F6*fh(i,j,k-2)+    fh(i,j,k-3))
  ! set kmax and kmin_lopsided 0
      endif
    elseif(Sfz(i,j,k) < ZEO)then
        if(k-3 >= kmin_lopsided)then
  !         v
  ! D f = ------[ - 3f    - 10f  + 18f    - 6f     + f     ]
  !  i     12dx       i-v      i      i+v     i+2v    i+3v
      f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
                    Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k+1)-F10*fh(i,j,k)+F18*fh(i,j,k-1) &
                                      -F6*fh(i,j,k-2)+    fh(i,j,k-3))
      elseif(k-2 >= kmin_lopsided)then
  !
  !              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
  !  fx(i) = ---------------------------------------------
  !                             12 dx
      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
                    Sfz(i,j,k)*d12dz*(fh(i,j,k-2)-EIT*fh(i,j,k-1)+EIT*fh(i,j,k+1)-fh(i,j,k+2))
      elseif(k-1 >= kmin_lopsided)then
  !         v
  ! D f = ------[   3f    + 10f  - 18f    + 6f     - f     ]
  !  i     12dx       i+v      i      i-v     i-2v    i-3v
      f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
                    Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k-1)-F10*fh(i,j,k)+F18*fh(i,j,k+1) &
                                      -F6*fh(i,j,k+2)+    fh(i,j,k+3))
  ! set kmax and kmin_lopsided 0
      endif
    endif
    if(i-3 >= imin_kodis .and. i+3 <= imax .and. &
      j-3 >= jmin_kodis .and. j+3 <= jmax .and. &
      k-3 >= kmin_kodis .and. k+3 <= kmax) then
  ! is the calculation order important?
    f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/cof *( (     &
                                (fh(i-3,j,k)+fh(i+3,j,k)) - &
                            SIX*(fh(i-2,j,k)+fh(i+2,j,k)) + &
                            FIT*(fh(i-1,j,k)+fh(i+1,j,k)) - &
                            TWT* fh(i,j,k)            )/dX + &
                                                    (     &
                                (fh(i,j-3,k)+fh(i,j+3,k)) - &
                            SIX*(fh(i,j-2,k)+fh(i,j+2,k)) + &
                            FIT*(fh(i,j-1,k)+fh(i,j+1,k)) - &
                            TWT* fh(i,j,k)            )/dY + &
                                                    (     &
                                (fh(i,j,k-3)+fh(i,j,k+3)) - &
                            SIX*(fh(i,j,k-2)+fh(i,j,k+2)) + &
                            FIT*(fh(i,j,k-1)+fh(i,j,k+1)) - &
                            TWT* fh(i,j,k)            )/dZ )
    endif
    enddo
    enddo
    enddo
    return
  end subroutine merge_lopsided_kodis
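As a reading aid for `merge_lopsided_kodis`, here is a minimal 1D Python sketch of the merge: the fourth-order lopsided advection stencil and the sixth-difference Kreiss-Oliger term are applied in one pass over a single buffer, and the result matches two separate passes. It is not the Fortran routine. It assumes no symmetry boundary, only touches points where the full seven-point stencil fits (the near-boundary fallback stencils are omitted), and the names `upwind_term`, `ko_term`, `two_pass` and `merged` are hypothetical.

```python
def upwind_term(f, beta, dx, p):
    """Fourth-order lopsided stencil for beta * df/dx at index p."""
    c = beta[p] / (12.0 * dx)
    if beta[p] > 0.0:
        return c * (-3.0 * f[p - 1] - 10.0 * f[p] + 18.0 * f[p + 1]
                    - 6.0 * f[p + 2] + f[p + 3])
    if beta[p] < 0.0:
        return -c * (-3.0 * f[p + 1] - 10.0 * f[p] + 18.0 * f[p - 1]
                     - 6.0 * f[p - 2] + f[p - 3])
    return 0.0


def ko_term(f, dx, eps, p):
    """Kreiss-Oliger dissipation for a fourth-order scheme (cof = 2^6)."""
    return eps / 64.0 * (((f[p - 3] + f[p + 3])
                          - 6.0 * (f[p - 2] + f[p + 2])
                          + 15.0 * (f[p - 1] + f[p + 1])
                          - 20.0 * f[p]) / dx)


def two_pass(f, beta, dx, eps, rhs):
    """Advection pass followed by a separate dissipation pass."""
    out = list(rhs)
    for p in range(3, len(f) - 3):
        out[p] += upwind_term(f, beta, dx, p)
    for p in range(3, len(f) - 3):
        out[p] += ko_term(f, dx, eps, p)
    return out


def merged(f, beta, dx, eps, rhs):
    """Single pass: both terms read the same buffer f at each point."""
    out = list(rhs)
    for p in range(3, len(f) - 3):
        out[p] += upwind_term(f, beta, dx, p)
        out[p] += ko_term(f, dx, eps, p)
    return out
```

Because each point receives the same two additions in the same order, the merged loop reproduces the two-pass result exactly in this sketch; the saving in the Fortran version comes from calling `symmetry_bd` once per field instead of twice.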
--- a/AMSS_NCKU_source/diff_new.f90
+++ b/AMSS_NCKU_source/diff_new.f90
@@ -1000,7 +1000,86 @@
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
 #if 0  
 ! x direction   
        if(i+2 <= imax .and. i-2 >= imin)then
 !
 !              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
 !  fx(i) = ---------------------------------------------
 !                             12 dx
      fx(i,j,k)=d12dx*(fh(i-2,j,k)-EIT*fh(i-1,j,k)+EIT*fh(i+1,j,k)-fh(i+2,j,k))
    elseif(i+1 <= imax .and. i-1 >= imin)then
 !
 !              - f(i-1) + f(i+1)
 !  fx(i) = --------------------------------
 !                     2 dx
      fx(i,j,k)=d2dx*(-fh(i-1,j,k)+fh(i+1,j,k))
 ! set imax and imin 0
    endif
 ! y direction   
        if(j+2 <= jmax .and. j-2 >= jmin)then
      fy(i,j,k)=d12dy*(fh(i,j-2,k)-EIT*fh(i,j-1,k)+EIT*fh(i,j+1,k)-fh(i,j+2,k))
    elseif(j+1 <= jmax .and. j-1 >= jmin)then
     fy(i,j,k)=d2dy*(-fh(i,j-1,k)+fh(i,j+1,k))
 ! set jmax and jmin 0
    endif
 ! z direction   
        if(k+2 <= kmax .and. k-2 >= kmin)then
      fz(i,j,k)=d12dz*(fh(i,j,k-2)-EIT*fh(i,j,k-1)+EIT*fh(i,j,k+1)-fh(i,j,k+2))
    elseif(k+1 <= kmax .and. k-1 >= kmin)then
      fz(i,j,k)=d2dz*(-fh(i,j,k-1)+fh(i,j,k+1))
 ! set kmax and kmin 0
    endif
 #elif 0
 ! x direction   
        if(i+2 <= imax .and. i-2 >= imin)then
 !
 !              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
 !  fx(i) = ---------------------------------------------
 !                             12 dx
      fx(i,j,k)=d12dx*(fh(i-2,j,k)-EIT*fh(i-1,j,k)+EIT*fh(i+1,j,k)-fh(i+2,j,k))
    elseif(i+3 <= imax .and. i-1 >= imin)then
      fx(i,j,k)=d12dx*(-3.d0*fh(i-1,j,k)-1.d1*fh(i,j,k)+1.8d1*fh(i+1,j,k)-6.d0*fh(i+2,j,k)+fh(i+3,j,k))
    elseif(i+1 <= imax .and. i-3 >= imin)then
      fx(i,j,k)=d12dx*( 3.d0*fh(i+1,j,k)+1.d1*fh(i,j,k)-1.8d1*fh(i-1,j,k)+6.d0*fh(i-2,j,k)-fh(i-3,j,k))
 ! set imax and imin 0
    endif
 ! y direction   
        if(j+2 <= jmax .and. j-2 >= jmin)then
      fy(i,j,k)=d12dy*(fh(i,j-2,k)-EIT*fh(i,j-1,k)+EIT*fh(i,j+1,k)-fh(i,j+2,k))
    elseif(j+3 <= jmax .and. j-1 >= jmin)then
      fy(i,j,k)=d12dy*(-3.d0*fh(i,j-1,k)-1.d1*fh(i,j,k)+1.8d1*fh(i,j+1,k)-6.d0*fh(i,j+2,k)+fh(i,j+3,k))
    elseif(j+1 <= jmax .and. j-3 >= jmin)then
      fy(i,j,k)=d12dy*( 3.d0*fh(i,j+1,k)+1.d1*fh(i,j,k)-1.8d1*fh(i,j-1,k)+6.d0*fh(i,j-2,k)-fh(i,j-3,k))
 ! set jmax and jmin 0
    endif
 ! z direction   
        if(k+2 <= kmax .and. k-2 >= kmin)then
      fz(i,j,k)=d12dz*(fh(i,j,k-2)-EIT*fh(i,j,k-1)+EIT*fh(i,j,k+1)-fh(i,j,k+2))
    elseif(k+3 <= kmax .and. k-1 >= kmin)then
      fz(i,j,k)=d12dz*(-3.d0*fh(i,j,k-1)-1.d1*fh(i,j,k)+1.8d1*fh(i,j,k+1)-6.d0*fh(i,j,k+2)+fh(i,j,k+3))
    elseif(k+1 <= kmax .and. k-3 >= kmin)then
      fz(i,j,k)=d12dz*( 3.d0*fh(i,j,k+1)+1.d1*fh(i,j,k)-1.8d1*fh(i,j,k-1)+6.d0*fh(i,j,k-2)-fh(i,j,k-3))
 ! set kmax and kmin 0
    endif
 #else
 ! for bam comparison
   if(i+2 <= imax .and. i-2 >= imin .and. &
      j+2 <= jmax .and. j-2 >= jmin .and. &
@@ -1015,7 +1094,7 @@
      fy(i,j,k)=d2dy*(-fh(i,j-1,k)+fh(i,j+1,k))
      fz(i,j,k)=d2dz*(-fh(i,j,k-1)+fh(i,j,k+1))
   endif
-
+#endif
  enddo
  enddo
  enddo
@@ -1325,7 +1404,85 @@
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
 #if 0  
 !~~~~~~ fxx
        if(i+2 <= imax .and. i-2 >= imin)then
 !
 !               - f(i-2) + 16 f(i-1) - 30 f(i) + 16 f(i+1) - f(i+2)
 !  fxx(i) = ----------------------------------------------------------
 !                                  12 dx^2 
   fxx(i,j,k) = Fdxdx*(-fh(i-2,j,k)+F16*fh(i-1,j,k)-F30*fh(i,j,k) &
                       -fh(i+2,j,k)+F16*fh(i+1,j,k)              )
   elseif(i+1 <= imax .and. i-1 >= imin)then
 !
 !               f(i-1) - 2 f(i) + f(i+1)
 !  fxx(i) = --------------------------------
 !                         dx^2 
   fxx(i,j,k) = Sdxdx*(fh(i-1,j,k)-TWO*fh(i,j,k) &
                      +fh(i+1,j,k)              )
   endif
 !~~~~~~ fyy
        if(j+2 <= jmax .and. j-2 >= jmin)then
   fyy(i,j,k) = Fdydy*(-fh(i,j-2,k)+F16*fh(i,j-1,k)-F30*fh(i,j,k) &
                       -fh(i,j+2,k)+F16*fh(i,j+1,k)              )
   elseif(j+1 <= jmax .and. j-1 >= jmin)then
   fyy(i,j,k) = Sdydy*(fh(i,j-1,k)-TWO*fh(i,j,k) &
                      +fh(i,j+1,k)              )
   endif
 !~~~~~~ fzz
        if(k+2 <= kmax .and. k-2 >= kmin)then
   fzz(i,j,k) = Fdzdz*(-fh(i,j,k-2)+F16*fh(i,j,k-1)-F30*fh(i,j,k) &
                       -fh(i,j,k+2)+F16*fh(i,j,k+1)              )
   elseif(k+1 <= kmax .and. k-1 >= kmin)then
   fzz(i,j,k) = Sdzdz*(fh(i,j,k-1)-TWO*fh(i,j,k) &
                      +fh(i,j,k+1)              )
   endif
 !~~~~~~ fxy
       if(i+2 <= imax .and. i-2 >= imin .and. j+2 <= jmax .and. j-2 >= jmin)then
 !
 !                 ( f(i-2,j-2) - 8 f(i-1,j-2) + 8 f(i+1,j-2) - f(i+2,j-2) )
 !             - 8 ( f(i-2,j-1) - 8 f(i-1,j-1) + 8 f(i+1,j-1) - f(i+2,j-1) )
 !             + 8 ( f(i-2,j+1) - 8 f(i-1,j+1) + 8 f(i+1,j+1) - f(i+2,j+1) )
 !             -   ( f(i-2,j+2) - 8 f(i-1,j+2) + 8 f(i+1,j+2) - f(i+2,j+2) )
 !  fxy(i,j) = ----------------------------------------------------------------
 !                                  144 dx dy
   fxy(i,j,k) = Fdxdy*(     (fh(i-2,j-2,k)-F8*fh(i-1,j-2,k)+F8*fh(i+1,j-2,k)-fh(i+2,j-2,k))  &
                       -F8 *(fh(i-2,j-1,k)-F8*fh(i-1,j-1,k)+F8*fh(i+1,j-1,k)-fh(i+2,j-1,k))  &
                       +F8 *(fh(i-2,j+1,k)-F8*fh(i-1,j+1,k)+F8*fh(i+1,j+1,k)-fh(i+2,j+1,k))  &
                       -    (fh(i-2,j+2,k)-F8*fh(i-1,j+2,k)+F8*fh(i+1,j+2,k)-fh(i+2,j+2,k)))
   elseif(i+1 <= imax .and. i-1 >= imin .and. j+1 <= jmax .and. j-1 >= jmin)then
 !                 f(i-1,j-1) - f(i+1,j-1) - f(i-1,j+1) + f(i+1,j+1) 
 !  fxy(i,j) = -----------------------------------------------------------
 !                                      4 dx dy
   fxy(i,j,k) = Sdxdy*(fh(i-1,j-1,k)-fh(i+1,j-1,k)-fh(i-1,j+1,k)+fh(i+1,j+1,k))
   endif
 !~~~~~~ fxz
       if(i+2 <= imax .and. i-2 >= imin .and. k+2 <= kmax .and. k-2 >= kmin)then
   fxz(i,j,k) = Fdxdz*(     (fh(i-2,j,k-2)-F8*fh(i-1,j,k-2)+F8*fh(i+1,j,k-2)-fh(i+2,j,k-2))  &
                       -F8 *(fh(i-2,j,k-1)-F8*fh(i-1,j,k-1)+F8*fh(i+1,j,k-1)-fh(i+2,j,k-1))  &
                       +F8 *(fh(i-2,j,k+1)-F8*fh(i-1,j,k+1)+F8*fh(i+1,j,k+1)-fh(i+2,j,k+1))  &
                       -    (fh(i-2,j,k+2)-F8*fh(i-1,j,k+2)+F8*fh(i+1,j,k+2)-fh(i+2,j,k+2)))
   elseif(i+1 <= imax .and. i-1 >= imin .and. k+1 <= kmax .and. k-1 >= kmin)then
   fxz(i,j,k) = Sdxdz*(fh(i-1,j,k-1)-fh(i+1,j,k-1)-fh(i-1,j,k+1)+fh(i+1,j,k+1))
   endif
 !~~~~~~ fyz
       if(j+2 <= jmax .and. j-2 >= jmin .and. k+2 <= kmax .and. k-2 >= kmin)then
   fyz(i,j,k) = Fdydz*(     (fh(i,j-2,k-2)-F8*fh(i,j-1,k-2)+F8*fh(i,j+1,k-2)-fh(i,j+2,k-2))  &
                       -F8 *(fh(i,j-2,k-1)-F8*fh(i,j-1,k-1)+F8*fh(i,j+1,k-1)-fh(i,j+2,k-1))  &
                       +F8 *(fh(i,j-2,k+1)-F8*fh(i,j-1,k+1)+F8*fh(i,j+1,k+1)-fh(i,j+2,k+1))  &
                       -    (fh(i,j-2,k+2)-F8*fh(i,j-1,k+2)+F8*fh(i,j+1,k+2)-fh(i,j+2,k+2)))
   elseif(j+1 <= jmax .and. j-1 >= jmin .and. k+1 <= kmax .and. k-1 >= kmin)then
   fyz(i,j,k) = Sdydz*(fh(i,j-1,k-1)-fh(i,j+1,k-1)-fh(i,j-1,k+1)+fh(i,j+1,k+1))
   endif 
 #else
 ! for bam comparison
   if(i+2 <= imax .and. i-2 >= imin .and. &
      j+2 <= jmax .and. j-2 >= jmin .and. &
@@ -1361,7 +1518,7 @@
   fxz(i,j,k) = Sdxdz*(fh(i-1,j,k-1)-fh(i+1,j,k-1)-fh(i-1,j,k+1)+fh(i+1,j,k+1))
   fyz(i,j,k) = Sdydz*(fh(i,j-1,k-1)-fh(i,j+1,k-1)-fh(i,j-1,k+1)+fh(i,j+1,k+1))
   endif
-
+#endif
   enddo
   enddo
   enddo
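The per-direction branch kept under `#if 0` in `diff_new.f90` picks a fourth-order centered stencil where two neighbours exist on each side and falls back to a second-order one otherwise. A minimal 1D Python sketch of that selection follows. It assumes no symmetry ghost points (the lower bound is index 0 rather than `imin`), treats the "set ... 0" comments as meaning the derivative is zero where no stencil fits, and uses the hypothetical names `d1` and `d2`.

```python
def d1(f, dx, p):
    """First derivative at index p: 4th order if possible, else 2nd order."""
    n = len(f)
    if p - 2 >= 0 and p + 2 <= n - 1:
        return (f[p - 2] - 8.0 * f[p - 1] + 8.0 * f[p + 1] - f[p + 2]) / (12.0 * dx)
    if p - 1 >= 0 and p + 1 <= n - 1:
        return (f[p + 1] - f[p - 1]) / (2.0 * dx)
    return 0.0  # no stencil fits: assumed to stay zero


def d2(f, dx, p):
    """Second derivative at index p: 4th order if possible, else 2nd order."""
    n = len(f)
    if p - 2 >= 0 and p + 2 <= n - 1:
        return (-f[p - 2] + 16.0 * f[p - 1] - 30.0 * f[p]
                + 16.0 * f[p + 1] - f[p + 2]) / (12.0 * dx * dx)
    if p - 1 >= 0 and p + 1 <= n - 1:
        return (f[p - 1] - 2.0 * f[p] + f[p + 1]) / (dx * dx)
    return 0.0  # no stencil fits: assumed to stay zero
```

On a cubic sampled at integer points, the fourth-order branch of `d1` is exact while the second-order fallback is not, which is one way to see which branch a given point used.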
--- a/AMSS_NCKU_source/fmisc.f90
+++ b/AMSS_NCKU_source/fmisc.f90
@@ -326,8 +326,7 @@ subroutine symmetry_bd(ord,extc,func,funcc,SoA)
  funcc(1:extc(1),1:extc(2),1:extc(3)) = func
   do i=0,ord-1
-      
+      funcc(-i,1:extc(2),1:extc(3)) = funcc(i+2,1:extc(2),1:extc(3))*SoA(1)
    funcc(-i,1:extc(2),1:extc(3)) = funcc(i+2,1:extc(2),1:extc(3))*SoA(1)
   enddo
   do i=0,ord-1
      funcc(:,-i,1:extc(3)) = funcc(:,i+2,1:extc(3))*SoA(2)
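For readers unfamiliar with `symmetry_bd`, the lines in this hunk fill `ord` ghost slots below index 1 by mirroring interior values and multiplying by the parity factor `SoA`. A minimal 1D Python sketch of the variant visible here, `funcc(-i) = funcc(i+2)*SoA`, is given below. The function name `symmetry_bd_1d` and the list layout are assumptions, and whether this is the vertex- or cell-centered branch of the routine is not shown in the hunk.

```python
def symmetry_bd_1d(order, func, soa):
    """Return func extended by `order` mirrored ghost values on the low side.

    Fortran index q (from -(order-1) to n) is stored at list position
    q + order - 1, so func(1) sits at position `order`.
    Requires len(func) >= order + 1.
    """
    funcc = [0.0] * order + list(func)
    for i in range(order):
        # funcc(-i) = funcc(i+2) * SoA in the Fortran indexing
        funcc[order - 1 - i] = funcc[i + order + 1] * soa
    return funcc
```

With `order=3` this yields the `-2:ex` extent used by `fh` in the fourth-order routines; `soa=1.0` gives an even reflection and `soa=-1.0` an odd one.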
--- a/AMSS_NCKU_source/kodiss.f90
+++ b/AMSS_NCKU_source/kodiss.f90
@@ -6,6 +6,101 @@
 ! Vertex or Cell is distinguished in routine symmetry_bd, which is located in
 ! file "fmisc.f90"
 #if (ghost_width == 2)
 ! second order code
 !------------------------------------------------------------------------------------------------------------------------------
 !usual Kreiss-Oliger type numerical dissipation
 !We support cell center only
 !  (D_+D_-)^2 =
 !   f(i-2) - 4 f(i-1) + 6 f(i) - 4 f(i+1) + f(i+2)
 ! ------------------------------------------------------
 !                       dx^4
 !------------------------------------------------------------------------------------------------------------------------------
 ! do not add dissipation near boundary
 subroutine kodis(ex,X,Y,Z,f,f_rhs,SoA,Symmetry,eps)
 implicit none
 ! argument variables
 integer,intent(in) :: Symmetry
 integer,dimension(3),intent(in)::ex
 real*8, dimension(1:3), intent(in) :: SoA
 double precision,intent(in),dimension(ex(1))::X
 double precision,intent(in),dimension(ex(2))::Y
 double precision,intent(in),dimension(ex(3))::Z
 double precision,intent(in),dimension(ex(1),ex(2),ex(3))::f
 double precision,intent(inout),dimension(ex(1),ex(2),ex(3))::f_rhs
 real*8,intent(in) :: eps
 !~~~~~~ other variables
  real*8 :: dX,dY,dZ
  real*8,dimension(-1:ex(1),-1:ex(2),-1:ex(3))   :: fh
  integer :: imin,jmin,kmin,imax,jmax,kmax
  integer, parameter :: NO_SYMM = 0, EQ_SYMM = 1, OCTANT = 2
  real*8,parameter   :: cof = 1.6d1 ! 2^4
  real*8,  parameter :: F4=4.d0,F6=6.d0
  integer::i,j,k
  dX = X(2)-X(1)
  dY = Y(2)-Y(1)
  dZ = Z(2)-Z(1)
  imax = ex(1)
  jmax = ex(2)
  kmax = ex(3)
  imin = 1
  jmin = 1
  kmin = 1
  if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin = -1
  if(Symmetry > EQ_SYMM .and. dabs(X(1)) < dX) imin = -1
  if(Symmetry > EQ_SYMM .and. dabs(Y(1)) < dY) jmin = -1
  call symmetry_bd(2,ex,f,fh,SoA)
 !   f(i-2) - 4 f(i-1) + 6 f(i) - 4 f(i+1) + f(i+2)
 ! ------------------------------------------------------
 !                       dx^4
 !  note the sign (-1)^(r-1), now r=2
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
  if(i-2 >= imin .and. i+2 <= imax .and. &
     j-2 >= jmin .and. j+2 <= jmax .and. &
     k-2 >= kmin .and. k+2 <= kmax) then
 ! x direction
   f_rhs(i,j,k)       = f_rhs(i,j,k) - eps/dX/cof * (     &
                                (fh(i-2,j,k)+fh(i+2,j,k)) &
                         - F4 * (fh(i-1,j,k)+fh(i+1,j,k)) &
                         + F6 *  fh(i,j,k) )
 ! y direction
   f_rhs(i,j,k)       = f_rhs(i,j,k) - eps/dY/cof * (     &
                                (fh(i,j-2,k)+fh(i,j+2,k)) &
                         - F4 * (fh(i,j-1,k)+fh(i,j+1,k)) &
                         + F6 *  fh(i,j,k) )
 ! z direction
   f_rhs(i,j,k)       = f_rhs(i,j,k) - eps/dZ/cof * (     &
                                (fh(i,j,k-2)+fh(i,j,k+2)) &
                         - F4 * (fh(i,j,k-1)+fh(i,j,k+1)) &
                         + F6 *  fh(i,j,k) )
  endif
  enddo
  enddo
  enddo
  return
 end subroutine kodis
 #elif (ghost_width == 3)
 ! fourth order code
 !---------------------------------------------------------------------------------------------
@@ -61,7 +156,7 @@ integer, parameter :: NO_SYMM=0, OCTANT=2
  if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin = -2
  if(Symmetry == OCTANT .and. dabs(X(1)) < dX) imin = -2
  if(Symmetry == OCTANT .and. dabs(Y(1)) < dY) jmin = -2
-  !print*,'imin,jmin,kmin=',imin,jmin,kmin
+
  call symmetry_bd(3,ex,f,fh,SoA)
  do k=1,ex(3)
@@ -71,7 +166,28 @@ integer, parameter :: NO_SYMM=0, OCTANT=2
  if(i-3 >= imin .and. i+3 <= imax .and. &
     j-3 >= jmin .and. j+3 <= jmax .and. &
     k-3 >= kmin .and. k+3 <= kmax) then
 #if 0     
 ! x direction
   f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/dX/cof * (     &
                              (fh(i-3,j,k)+fh(i+3,j,k)) - &
                          SIX*(fh(i-2,j,k)+fh(i+2,j,k)) + &
                          FIT*(fh(i-1,j,k)+fh(i+1,j,k)) - &
                          TWT* fh(i,j,k)            )
 ! y direction
   f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/dY/cof * (     &
                              (fh(i,j-3,k)+fh(i,j+3,k)) - &
                          SIX*(fh(i,j-2,k)+fh(i,j+2,k)) + &
                          FIT*(fh(i,j-1,k)+fh(i,j+1,k)) - &
                          TWT* fh(i,j,k)            )
 ! z direction
   f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/dZ/cof * (     &
                              (fh(i,j,k-3)+fh(i,j,k+3)) - &
                          SIX*(fh(i,j,k-2)+fh(i,j,k+2)) + &
                          FIT*(fh(i,j,k-1)+fh(i,j,k+1)) - &
                          TWT* fh(i,j,k)            )
 #else
 ! is the calculation order important?
   f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/cof *( (     &
                              (fh(i-3,j,k)+fh(i+3,j,k)) - &
@@ -88,7 +204,7 @@ integer, parameter :: NO_SYMM=0, OCTANT=2
                          SIX*(fh(i,j,k-2)+fh(i,j,k+2)) + &
                          FIT*(fh(i,j,k-1)+fh(i,j,k+1)) - &
                          TWT* fh(i,j,k)            )/dZ )
-
+#endif
  endif
  enddo
@@ -99,6 +215,218 @@ integer, parameter :: NO_SYMM=0, OCTANT=2
  end subroutine kodis
 #elif (ghost_width == 4)
 ! sixth order code
 !------------------------------------------------------------------------------------------------------------------------------
 !usual Kreiss-Oliger type numerical dissipation
 !We support cell center only
 !  (D_+D_-)^4 =
 !   f(i-4) - 8 f(i-3) + 28 f(i-2) - 56 f(i-1) + 70 f(i) - 56 f(i+1) + 28 f(i+2) - 8 f(i+3) + f(i+4)
 ! ----------------------------------------------------------------------------------------------------------
 !                                              dx^8
 !------------------------------------------------------------------------------------------------------------------------------
 ! do not add dissipation near boundary
 subroutine kodis(ex,X,Y,Z,f,f_rhs,SoA,Symmetry,eps)
 implicit none
 ! argument variables
 integer,intent(in) :: Symmetry
 integer,dimension(3),intent(in)::ex
 real*8, dimension(1:3), intent(in) :: SoA
 double precision,intent(in),dimension(ex(1))::X
 double precision,intent(in),dimension(ex(2))::Y
 double precision,intent(in),dimension(ex(3))::Z
 double precision,intent(in),dimension(ex(1),ex(2),ex(3))::f
 double precision,intent(inout),dimension(ex(1),ex(2),ex(3))::f_rhs
 real*8,intent(in) :: eps
 !~~~~~~ other variables
  real*8 :: dX,dY,dZ
  real*8,dimension(-3:ex(1),-3:ex(2),-3:ex(3))   :: fh
  integer :: imin,jmin,kmin,imax,jmax,kmax
  integer, parameter :: NO_SYMM = 0, EQ_SYMM = 1, OCTANT = 2
  real*8,parameter   :: cof = 2.56d2 ! 2^8
  real*8,  parameter :: F8=8.d0,F28=2.8d1,F56=5.6d1,F70=7.d1
  integer::i,j,k
  dX = X(2)-X(1)
  dY = Y(2)-Y(1)
  dZ = Z(2)-Z(1)
  imax = ex(1)
  jmax = ex(2)
  kmax = ex(3)
  imin = 1
  jmin = 1
  kmin = 1
  if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin = -3
  if(Symmetry > EQ_SYMM .and. dabs(X(1)) < dX) imin = -3
  if(Symmetry > EQ_SYMM .and. dabs(Y(1)) < dY) jmin = -3
  call symmetry_bd(4,ex,f,fh,SoA)
 !   f(i-4) - 8 f(i-3) + 28 f(i-2) - 56 f(i-1) + 70 f(i) - 56 f(i+1) + 28 f(i+2) - 8 f(i+3) + f(i+4)
 ! ----------------------------------------------------------------------------------------------------------
 !                                              dx^8
 !  note the sign (-1)^(r-1), now r=4
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
  if(i>imin+3 .and. i < imax-3 .and. &
     j>jmin+3 .and. j < jmax-3 .and. &
     k>kmin+3 .and. k < kmax-3) then
 ! x direction
   f_rhs(i,j,k)       = f_rhs(i,j,k) - eps/dX/cof * (     &
                                (fh(i-4,j,k)+fh(i+4,j,k)) &
                         - F8 * (fh(i-3,j,k)+fh(i+3,j,k)) &
                         +F28 * (fh(i-2,j,k)+fh(i+2,j,k)) &
                         -F56 * (fh(i-1,j,k)+fh(i+1,j,k)) &
                         +F70 *  fh(i,j,k) )
 ! y direction
   f_rhs(i,j,k)       = f_rhs(i,j,k) - eps/dY/cof * (     &
                                (fh(i,j-4,k)+fh(i,j+4,k)) &
                         - F8 * (fh(i,j-3,k)+fh(i,j+3,k)) &
                         +F28 * (fh(i,j-2,k)+fh(i,j+2,k)) &
                         -F56 * (fh(i,j-1,k)+fh(i,j+1,k)) &
                         +F70 *  fh(i,j,k) )
 ! z direction
   f_rhs(i,j,k)       = f_rhs(i,j,k) - eps/dZ/cof * (     &
                                (fh(i,j,k-4)+fh(i,j,k+4)) &
                         - F8 * (fh(i,j,k-3)+fh(i,j,k+3)) &
                         +F28 * (fh(i,j,k-2)+fh(i,j,k+2)) &
                         -F56 * (fh(i,j,k-1)+fh(i,j,k+1)) &
                         +F70 *  fh(i,j,k) )
  endif
  enddo
  enddo
  enddo
  return
 end subroutine kodis
 #elif (ghost_width == 5)
 ! eighth order code
 !------------------------------------------------------------------------------------------------------------------------------
 !usual Kreiss-Oliger type numerical dissipation
 !We support cell center only
 ! Note the notation D_+ and D_- [P240 of B. Gustafsson, H.-O. Kreiss, and J. Oliger, Time
 ! Dependent Problems and Difference Methods (Wiley, New York, 1995).]
 ! D_+ = (f(i+1) - f(i))/h
 ! D_- = (f(i) - f(i-1))/h
 ! then we have D_+D_- = D_-D_+ = (f(i+1) - 2f(i) + f(i-1))/h^2
 ! for an nth-order accurate finite difference code, we need r = n/2+1
 !              D_+^rD_-^r = (D_+D_-)^r
 ! following the tradition of PRD 77, 024027 (BB's calibration paper, Eq.(64),
 !  with some typos corrected according to the book above):
 ! + eps*(-1)^(r-1)*h^(2r-1)/2^(2r)*(D_+D_-)^r 
 !
 !
 ! this is for 8th order accurate finite difference scheme
 !  (D_+D_-)^5 =
 !  f(i-5) - 10 f(i-4) + 45 f(i-3) - 120 f(i-2) + 210 f(i-1) - 252 f(i) + 210 f(i+1) - 120 f(i+2) + 45 f(i+3) - 10 f(i+4) + f(i+5)
 ! -------------------------------------------------------------------------------------------------------------------------------
 !                                                              dx^10
 !---------------------------------------------------------------------------------------------------------------------------------
 ! do not add dissipation near boundary
 subroutine kodis(ex,X,Y,Z,f,f_rhs,SoA,Symmetry,eps)
 implicit none
 ! argument variables
 integer,intent(in) :: Symmetry
 integer,dimension(3),intent(in)::ex
 real*8, dimension(1:3), intent(in) :: SoA
 double precision,intent(in),dimension(ex(1))::X
 double precision,intent(in),dimension(ex(2))::Y
 double precision,intent(in),dimension(ex(3))::Z
 double precision,intent(in),dimension(ex(1),ex(2),ex(3))::f
 double precision,intent(inout),dimension(ex(1),ex(2),ex(3))::f_rhs
 real*8,intent(in) :: eps
 !~~~~~~ other variables
  real*8 :: dX,dY,dZ
  real*8,dimension(-4:ex(1),-4:ex(2),-4:ex(3))   :: fh
  integer :: imin,jmin,kmin,imax,jmax,kmax
  integer, parameter :: NO_SYMM = 0, EQ_SYMM = 1, OCTANT = 2
  real*8,parameter   :: cof = 1.024d3 ! 2^2r = 2^10
  real*8,  parameter :: F10=1.d1,F45=4.5d1,F120=1.2d2,F210=2.1d2,F252=2.52d2
  integer::i,j,k
  dX = X(2)-X(1)
  dY = Y(2)-Y(1)
  dZ = Z(2)-Z(1)
  imax = ex(1)
  jmax = ex(2)
  kmax = ex(3)
  imin = 1
  jmin = 1
  kmin = 1
  if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin = -4
  if(Symmetry > EQ_SYMM .and. dabs(X(1)) < dX) imin = -4
  if(Symmetry > EQ_SYMM .and. dabs(Y(1)) < dY) jmin = -4
  call symmetry_bd(5,ex,f,fh,SoA)
 !  f(i-5) - 10 f(i-4) + 45 f(i-3) - 120 f(i-2) + 210 f(i-1) - 252 f(i) + 210 f(i+1) - 120 f(i+2) + 45 f(i+3) - 10 f(i+4) + f(i+5)
 ! -------------------------------------------------------------------------------------------------------------------------------
 !                                                              dx^10
 !  note the sign (-1)^(r-1), now r=5
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
  if(i>imin+4 .and. i < imax-4 .and. &
     j>jmin+4 .and. j < jmax-4 .and. &
     k>kmin+4 .and. k < kmax-4) then
 ! x direction
   f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/dX/cof * (      &
                                 (fh(i-5,j,k)+fh(i+5,j,k)) &
                         - F10 * (fh(i-4,j,k)+fh(i+4,j,k)) &
                         + F45 * (fh(i-3,j,k)+fh(i+3,j,k)) &
                         - F120* (fh(i-2,j,k)+fh(i+2,j,k)) &
                         + F210* (fh(i-1,j,k)+fh(i+1,j,k)) &
                         - F252 * fh(i,j,k) )
 ! y direction
   f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/dY/cof * (      &
                                 (fh(i,j-5,k)+fh(i,j+5,k)) &
                         - F10 * (fh(i,j-4,k)+fh(i,j+4,k)) &
                         + F45 * (fh(i,j-3,k)+fh(i,j+3,k)) &
                         - F120* (fh(i,j-2,k)+fh(i,j+2,k)) &
                         + F210* (fh(i,j-1,k)+fh(i,j+1,k)) &
                         - F252 * fh(i,j,k) )
 ! z direction
   f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/dZ/cof * (      &
                                 (fh(i,j,k-5)+fh(i,j,k+5)) &
                         - F10 * (fh(i,j,k-4)+fh(i,j,k+4)) &
                         + F45 * (fh(i,j,k-3)+fh(i,j,k+3)) &
                         - F120* (fh(i,j,k-2)+fh(i,j,k+2)) &
                         + F210* (fh(i,j,k-1)+fh(i,j,k+1)) &
                         - F252 * fh(i,j,k) )
  endif
  enddo
  enddo
  enddo
  return
 end subroutine kodis
 #endif  
--- a/AMSS_NCKU_source/lopsidediff.f90
+++ b/AMSS_NCKU_source/lopsidediff.f90
@@ -7,7 +7,163 @@
 ! Vertex or Cell is distinguished in routine symmetry_bd, which is located in
 ! file "fmisc.f90"
 #if (ghost_width == 2)
 ! second order code
 !-----------------------------------------------------------------------------
 !         v
 ! D f = ------[ - 3 f  + 4 f   - f     ]
 !  i     2dx         i      i+v   i+2v
 !
 ! where
 !
 !        i
 !      |B |
 ! v = -----
 !        i
 !       B
 !
 !-----------------------------------------------------------------------------
 subroutine lopsided(ex,X,Y,Z,f,f_rhs,Sfx,Sfy,Sfz,Symmetry,SoA)
  implicit none
 !~~~~~~> Input parameters:
  integer, intent(in)  :: ex(1:3),Symmetry
  real*8,  intent(in)  :: X(1:ex(1)),Y(1:ex(2)),Z(1:ex(3))
  real*8,dimension(ex(1),ex(2),ex(3)),intent(in)   :: f,Sfx,Sfy,Sfz
  real*8,dimension(ex(1),ex(2),ex(3)),intent(inout):: f_rhs
  real*8,dimension(3),intent(in) ::SoA
 !~~~~~~> local variables:
 ! note index -1,0, so we have 2 extra points
  real*8,dimension(-1:ex(1),-1:ex(2),-1:ex(3))   :: fh
  integer :: imin,jmin,kmin,imax,jmax,kmax,i,j,k
  real*8 :: dX,dY,dZ
  real*8 :: d2dx,d2dy,d2dz
  real*8,  parameter :: ZEO=0.d0,ONE=1.d0,TWO=2.d0,THR=3.d0,FOUR=4.d0
  integer, parameter :: NO_SYMM = 0, EQ_SYMM = 1, OCTANT = 2
  dX = X(2)-X(1)
  dY = Y(2)-Y(1)
  dZ = Z(2)-Z(1)
  d2dx = ONE/TWO/dX
  d2dy = ONE/TWO/dY
  d2dz = ONE/TWO/dZ
  imax = ex(1)
  jmax = ex(2)
  kmax = ex(3)
  imin = 1
  jmin = 1
  kmin = 1
  if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin = -1
  if(Symmetry > EQ_SYMM .and. dabs(X(1)) < dX) imin = -1
  if(Symmetry > EQ_SYMM .and. dabs(Y(1)) < dY) jmin = -1
  call symmetry_bd(2,ex,f,fh,SoA)
 ! the upper bound is set to ex-1 only for efficiency;
 ! the loop body would set the ex points to 0 as well
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
 ! x direction   
    if(Sfx(i,j,k) >= ZEO)then
       if( i+2 <= imax .and. i >= imin)then
 !         v
 ! D f = ------[ - 3 f  + 4 f   - f     ]
 !  i     2dx         i      i+v   i+2v
     f_rhs(i,j,k)=f_rhs(i,j,k)+                           &
                  Sfx(i,j,k)*d2dx*(-THR*fh(i,j,k)+FOUR*fh(i+1,j,k)-fh(i+2,j,k))
       elseif(i+1 <= imax .and. i >= imin)then
 !         v
 ! D f = ------[ - f  + f   ]
 !  i      dx       i    i+v
     f_rhs(i,j,k)=f_rhs(i,j,k)+                           &
                  Sfx(i,j,k)*d2dx*(-fh(i,j,k)+fh(i+1,j,k))
       endif
    elseif(Sfx(i,j,k) <= ZEO)then
      if( i-2 >= imin .and. i <= imax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                           &
                  Sfx(i,j,k)*d2dx*(-THR*fh(i,j,k)+FOUR*fh(i-1,j,k)-fh(i-2,j,k))
      elseif(i-1 >= imin .and. i <= imax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                           &
                  Sfx(i,j,k)*d2dx*(-fh(i,j,k)+fh(i-1,j,k))
      endif
 ! set imax and imin 0
    endif
 ! y direction   
    if(Sfy(i,j,k) >= ZEO)then
       if( j+2 <= jmax .and. j >= jmin)then
 !         v
 ! D f = ------[ - 3 f  + 4 f   - f     ]
 !  i     2dx         i      i+v   i+2v
     f_rhs(i,j,k)=f_rhs(i,j,k)+                           &
                  Sfy(i,j,k)*d2dy*(-THR*fh(i,j,k)+FOUR*fh(i,j+1,k)-fh(i,j+2,k))
       elseif(j+1 <= jmax .and. j >= jmin)then
 !         v
 ! D f = ------[ - f  + f   ]
 !  i      dx       i    i+v
     f_rhs(i,j,k)=f_rhs(i,j,k)+                           &
                  Sfy(i,j,k)*d2dy*(-fh(i,j,k)+fh(i,j+1,k))
       endif
    elseif(Sfy(i,j,k) <= ZEO)then
      if( j-2 >= jmin .and. j <= jmax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                           &
                  Sfy(i,j,k)*d2dy*(-THR*fh(i,j,k)+FOUR*fh(i,j-1,k)-fh(i,j-2,k))
      elseif(j-1 >= jmin .and. j <= jmax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                           &
                  Sfy(i,j,k)*d2dy*(-fh(i,j,k)+fh(i,j-1,k))
      endif
 ! set jmin and jmax 0
     endif
 !! z direction   
    if(Sfz(i,j,k) >= ZEO)then
      if( k+2 <= kmax .and. k >= kmin)then
 !         v
 ! D f = ------[ - 3 f  + 4 f   - f     ]
 !  i     2dx         i      i+v   i+2v
     f_rhs(i,j,k)=f_rhs(i,j,k)+                           &
                  Sfz(i,j,k)*d2dz*(-THR*fh(i,j,k)+FOUR*fh(i,j,k+1)-fh(i,j,k+2))
       elseif(k+1 <= kmax .and. k >= kmin)then
 !         v
 ! D f = ------[ - f  + f   ]
 !  i      dx       i    i+v
     f_rhs(i,j,k)=f_rhs(i,j,k)+                           &
                  Sfz(i,j,k)*d2dz*(-fh(i,j,k)+fh(i,j,k+1))
       endif
    elseif(Sfz(i,j,k) <= ZEO)then
      if( k-2 >= kmin .and. k <= kmax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                           &
                  Sfz(i,j,k)*d2dz*(-THR*fh(i,j,k)+FOUR*fh(i,j,k-1)-fh(i,j,k-2))
      elseif(k-1 >= kmin .and. k <= kmax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                           &
                  Sfz(i,j,k)*d2dz*(-fh(i,j,k)+fh(i,j,k-1))
      endif
 ! set kmin and kmax 0
     endif
  enddo
  enddo
  enddo
  return
  end subroutine lopsided
 #elif (ghost_width == 3)
 ! fourth order code
 !-----------------------------------------------------------------------------
@@ -80,7 +236,89 @@ subroutine lopsided(ex,X,Y,Z,f,f_rhs,Sfx,Sfy,Sfz,Symmetry,SoA)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
 #if 0  
 !! old code
 ! x direction   
    if(Sfx(i,j,k) >= ZEO .and. i+3 <= imax .and. i-1 >= imin)then
 !         v
 ! D f = ------[ - 3f    - 10f  + 18f    - 6f     + f     ]
 !  i     12dx       i-v      i      i+v     i+2v    i+3v
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
                  Sfx(i,j,k)*d12dx*(-F3*fh(i-1,j,k)-F10*fh(i,j,k)+F18*fh(i+1,j,k) &
                                    -F6*fh(i+2,j,k)+    fh(i+3,j,k))
    elseif(Sfx(i,j,k) <= ZEO .and. i-3 >= imin .and. i+1 <= imax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
                  Sfx(i,j,k)*d12dx*(-F3*fh(i+1,j,k)-F10*fh(i,j,k)+F18*fh(i-1,j,k) &
                                    -F6*fh(i-2,j,k)+    fh(i-3,j,k))
     elseif(i+2 <= imax .and. i-2 >= imin)then
 !
 !              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
 !  fx(i) = ---------------------------------------------
 !                             12 dx
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
                  Sfx(i,j,k)*d12dx*(fh(i-2,j,k)-EIT*fh(i-1,j,k)+EIT*fh(i+1,j,k)-fh(i+2,j,k))
     elseif(i+1 <= imax .and. i-1 >= imin)then
 !
 !              - f(i-1) + f(i+1)
 !  fx(i) = --------------------------------
 !                     2 dx
     f_rhs(i,j,k)=f_rhs(i,j,k) + Sfx(i,j,k)*d2dx*(-fh(i-1,j,k)+fh(i+1,j,k))
 ! set imax and imin 0
    endif
 ! y direction   
    if(Sfy(i,j,k) >= ZEO .and. j+3 <= jmax .and. j-1 >= jmin)then
 !         v
 ! D f = ------[ - 3f    - 10f  + 18f    - 6f     + f     ]
 !  i     12dx       i-v      i      i+v     i+2v    i+3v
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
                  Sfy(i,j,k)*d12dy*(-F3*fh(i,j-1,k)-F10*fh(i,j,k)+F18*fh(i,j+1,k) &
                                    -F6*fh(i,j+2,k)+    fh(i,j+3,k))
    elseif(Sfy(i,j,k) <= ZEO .and. j-3 >= jmin .and. j+1 <= jmax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
                  Sfy(i,j,k)*d12dy*(-F3*fh(i,j+1,k)-F10*fh(i,j,k)+F18*fh(i,j-1,k) &
                                    -F6*fh(i,j-2,k)+    fh(i,j-3,k))
     elseif(j+2 <= jmax .and. j-2 >= jmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                            &
                  Sfy(i,j,k)*d12dy*(fh(i,j-2,k)-EIT*fh(i,j-1,k)+EIT*fh(i,j+1,k)-fh(i,j+2,k))
     elseif(j+1 <= jmax .and. j-1 >= jmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k) + Sfy(i,j,k)*d2dy*(-fh(i,j-1,k)+fh(i,j+1,k))
 ! set jmin and jmax 0
     endif
 !! z direction   
    if(Sfz(i,j,k) >= ZEO .and. k+3 <= kmax .and. k-1 >= kmin)then
 !         v
 ! D f = ------[ - 3f    - 10f  + 18f    - 6f     + f     ]
 !  i     12dx       i-v      i      i+v     i+2v    i+3v
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
                  Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k-1)-F10*fh(i,j,k)+F18*fh(i,j,k+1) &
                                    -F6*fh(i,j,k+2)+    fh(i,j,k+3))
    elseif(Sfz(i,j,k) <= ZEO .and. k-3 >= kmin .and. k+1 <= kmax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
                  Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k+1)-F10*fh(i,j,k)+F18*fh(i,j,k-1) &
                                    -F6*fh(i,j,k-2)+    fh(i,j,k-3))
     elseif(k+2 <= kmax .and. k-2 >= kmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                            &
                  Sfz(i,j,k)*d12dz*(fh(i,j,k-2)-EIT*fh(i,j,k-1)+EIT*fh(i,j,k+1)-fh(i,j,k+2))
     elseif(k+1 <= kmax .and. k-1 >= kmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+Sfz(i,j,k)*d2dz*(-fh(i,j,k-1)+fh(i,j,k+1))
 ! set kmin and kmax 0
     endif
 #else
 !! new code, 2012dec27, based on bam
 ! x direction   
    if(Sfx(i,j,k) > ZEO)then
@@ -240,6 +478,7 @@ subroutine lopsided(ex,X,Y,Z,f,f_rhs,Sfx,Sfy,Sfz,Symmetry,SoA)
 ! set kmax and kmin 0
     endif
   endif
 #endif
  enddo
  enddo
  enddo
@@ -247,3 +486,612 @@ subroutine lopsided(ex,X,Y,Z,f,f_rhs,Sfx,Sfy,Sfz,Symmetry,SoA)
  return
  end subroutine lopsided
 !-----------------------------------------------------------------------------
 ! Combined advection (lopsided) + Kreiss-Oliger dissipation (kodis)
 ! Shares the symmetry_bd buffer fh, eliminating one full-grid copy per call.
 ! Mathematically identical to calling lopsided then kodis separately.
 !-----------------------------------------------------------------------------
 subroutine lopsided_kodis(ex,X,Y,Z,f,f_rhs,Sfx,Sfy,Sfz,Symmetry,SoA,eps)
  implicit none
 !~~~~~~> Input parameters:
  integer, intent(in)  :: ex(1:3),Symmetry
  real*8,  intent(in)  :: X(1:ex(1)),Y(1:ex(2)),Z(1:ex(3))
  real*8,dimension(ex(1),ex(2),ex(3)),intent(in)   :: f,Sfx,Sfy,Sfz
  real*8,dimension(ex(1),ex(2),ex(3)),intent(inout):: f_rhs
  real*8,dimension(3),intent(in) ::SoA
  real*8,intent(in) :: eps
 !~~~~~~> local variables:
 ! note index -2,-1,0, so we have 3 extra points
  real*8,dimension(-2:ex(1),-2:ex(2),-2:ex(3))   :: fh
  integer :: imin,jmin,kmin,imax,jmax,kmax,i,j,k
  real*8 :: dX,dY,dZ
  real*8 :: d12dx,d12dy,d12dz,d2dx,d2dy,d2dz
  real*8,  parameter :: ZEO=0.d0,ONE=1.d0, F3=3.d0
  real*8,  parameter :: TWO=2.d0,F6=6.0d0,F18=1.8d1
  real*8,  parameter :: F12=1.2d1, F10=1.d1,EIT=8.d0
  integer, parameter :: NO_SYMM = 0, EQ_SYMM = 1, OCTANT = 2
 ! kodis parameters
  real*8, parameter :: SIX=6.d0,FIT=1.5d1,TWT=2.d1
  real*8, parameter :: cof=6.4d1   ! 2^6
  dX = X(2)-X(1)
  dY = Y(2)-Y(1)
  dZ = Z(2)-Z(1)
  d12dx = ONE/F12/dX
  d12dy = ONE/F12/dY
  d12dz = ONE/F12/dZ
  d2dx = ONE/TWO/dX
  d2dy = ONE/TWO/dY
  d2dz = ONE/TWO/dZ
  imax = ex(1)
  jmax = ex(2)
  kmax = ex(3)
  imin = 1
  jmin = 1
  kmin = 1
  if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin = -2
  if(Symmetry > EQ_SYMM .and. dabs(X(1)) < dX) imin = -2
  if(Symmetry > EQ_SYMM .and. dabs(Y(1)) < dY) jmin = -2
 ! Single symmetry_bd call shared by both advection and dissipation
  call symmetry_bd(3,ex,f,fh,SoA)
 ! ---- Advection (lopsided) loop ----
 ! the upper bound is set to ex-1 only for efficiency;
 ! the loop body would set the ex points to 0 as well
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
 ! x direction   
    if(Sfx(i,j,k) > ZEO)then
      if(i+3 <= imax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
                  Sfx(i,j,k)*d12dx*(-F3*fh(i-1,j,k)-F10*fh(i,j,k)+F18*fh(i+1,j,k) &
                                    -F6*fh(i+2,j,k)+    fh(i+3,j,k))
     elseif(i+2 <= imax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
                  Sfx(i,j,k)*d12dx*(fh(i-2,j,k)-EIT*fh(i-1,j,k)+EIT*fh(i+1,j,k)-fh(i+2,j,k))
     elseif(i+1 <= imax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
                  Sfx(i,j,k)*d12dx*(-F3*fh(i+1,j,k)-F10*fh(i,j,k)+F18*fh(i-1,j,k) &
                                    -F6*fh(i-2,j,k)+    fh(i-3,j,k))
     endif
   elseif(Sfx(i,j,k) < ZEO)then
      if(i-3 >= imin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
                  Sfx(i,j,k)*d12dx*(-F3*fh(i+1,j,k)-F10*fh(i,j,k)+F18*fh(i-1,j,k) &
                                    -F6*fh(i-2,j,k)+    fh(i-3,j,k))
     elseif(i-2 >= imin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
                  Sfx(i,j,k)*d12dx*(fh(i-2,j,k)-EIT*fh(i-1,j,k)+EIT*fh(i+1,j,k)-fh(i+2,j,k))
     elseif(i-1 >= imin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
                  Sfx(i,j,k)*d12dx*(-F3*fh(i-1,j,k)-F10*fh(i,j,k)+F18*fh(i+1,j,k) &
                                    -F6*fh(i+2,j,k)+    fh(i+3,j,k))
     endif
   endif
 ! y direction   
    if(Sfy(i,j,k) > ZEO)then
      if(j+3 <= jmax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
                  Sfy(i,j,k)*d12dy*(-F3*fh(i,j-1,k)-F10*fh(i,j,k)+F18*fh(i,j+1,k) &
                                    -F6*fh(i,j+2,k)+    fh(i,j+3,k))
     elseif(j+2 <= jmax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
                  Sfy(i,j,k)*d12dy*(fh(i,j-2,k)-EIT*fh(i,j-1,k)+EIT*fh(i,j+1,k)-fh(i,j+2,k))
     elseif(j+1 <= jmax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
                  Sfy(i,j,k)*d12dy*(-F3*fh(i,j+1,k)-F10*fh(i,j,k)+F18*fh(i,j-1,k) &
                                    -F6*fh(i,j-2,k)+    fh(i,j-3,k))
     endif
   elseif(Sfy(i,j,k) < ZEO)then
      if(j-3 >= jmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
                  Sfy(i,j,k)*d12dy*(-F3*fh(i,j+1,k)-F10*fh(i,j,k)+F18*fh(i,j-1,k) &
                                    -F6*fh(i,j-2,k)+    fh(i,j-3,k))
     elseif(j-2 >= jmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
                  Sfy(i,j,k)*d12dy*(fh(i,j-2,k)-EIT*fh(i,j-1,k)+EIT*fh(i,j+1,k)-fh(i,j+2,k))
     elseif(j-1 >= jmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
                  Sfy(i,j,k)*d12dy*(-F3*fh(i,j-1,k)-F10*fh(i,j,k)+F18*fh(i,j+1,k) &
                                    -F6*fh(i,j+2,k)+    fh(i,j+3,k))
     endif
   endif
 ! z direction   
    if(Sfz(i,j,k) > ZEO)then
      if(k+3 <= kmax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
                  Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k-1)-F10*fh(i,j,k)+F18*fh(i,j,k+1) &
                                    -F6*fh(i,j,k+2)+    fh(i,j,k+3))
     elseif(k+2 <= kmax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
                  Sfz(i,j,k)*d12dz*(fh(i,j,k-2)-EIT*fh(i,j,k-1)+EIT*fh(i,j,k+1)-fh(i,j,k+2))
     elseif(k+1 <= kmax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
                  Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k+1)-F10*fh(i,j,k)+F18*fh(i,j,k-1) &
                                    -F6*fh(i,j,k-2)+    fh(i,j,k-3))
     endif
   elseif(Sfz(i,j,k) < ZEO)then
      if(k-3 >= kmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                   &
                  Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k+1)-F10*fh(i,j,k)+F18*fh(i,j,k-1) &
                                    -F6*fh(i,j,k-2)+    fh(i,j,k-3))
     elseif(k-2 >= kmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
                  Sfz(i,j,k)*d12dz*(fh(i,j,k-2)-EIT*fh(i,j,k-1)+EIT*fh(i,j,k+1)-fh(i,j,k+2))
     elseif(k-1 >= kmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                   &
                  Sfz(i,j,k)*d12dz*(-F3*fh(i,j,k-1)-F10*fh(i,j,k)+F18*fh(i,j,k+1) &
                                    -F6*fh(i,j,k+2)+    fh(i,j,k+3))
     endif
   endif
  enddo
  enddo
  enddo
 ! ---- Dissipation (kodis) loop ----
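 ! Sketch of the prefactor, assuming the Kreiss-Oliger form
 !   eps*(-1)^(r-1)*h^(2r-1)/2^(2r)*(D_+D_-)^r  with r = 3 (fourth order code):
 !     = eps*h^5/2^6 * [f(i-3) - 6 f(i-2) + 15 f(i-1) - 20 f(i) + 15 f(i+1) - 6 f(i+2) + f(i+3)]/h^6
 !     = eps/(64*h) * [ ... ]
 ! i.e. eps/cof/dX per direction with cof = 6.4d1, summed over x, y and z.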
  if(eps > ZEO) then
  do k=1,ex(3)
  do j=1,ex(2)
  do i=1,ex(1)
  if(i-3 >= imin .and. i+3 <= imax .and. &
     j-3 >= jmin .and. j+3 <= jmax .and. &
     k-3 >= kmin .and. k+3 <= kmax) then
   f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/cof *( (     &
                              (fh(i-3,j,k)+fh(i+3,j,k)) - &
                          SIX*(fh(i-2,j,k)+fh(i+2,j,k)) + &
                          FIT*(fh(i-1,j,k)+fh(i+1,j,k)) - &
                          TWT* fh(i,j,k)            )/dX + &
                                                  (     &
                              (fh(i,j-3,k)+fh(i,j+3,k)) - &
                          SIX*(fh(i,j-2,k)+fh(i,j+2,k)) + &
                          FIT*(fh(i,j-1,k)+fh(i,j+1,k)) - &
                          TWT* fh(i,j,k)            )/dY + &
                                                  (     &
                              (fh(i,j,k-3)+fh(i,j,k+3)) - &
                          SIX*(fh(i,j,k-2)+fh(i,j,k+2)) + &
                          FIT*(fh(i,j,k-1)+fh(i,j,k+1)) - &
                          TWT* fh(i,j,k)            )/dZ )
  endif
  enddo
  enddo
  enddo
  endif
  return
  end subroutine lopsided_kodis
 #elif (ghost_width == 4)
 ! sixth order code
 ! Compute advection terms in right hand sides of field equations
 !         v
 ! D f = ------[ 2f     - 24f    - 35f  + 80f    - 30f     + 8f     - f    ]
 !  i     60dx     i-2v      i-v      i      i+v      i+2v     i+3v    i+4v
 !
 ! where
 !
 !        i
 !      |B |
 ! v = -----
 !        i
 !       B
 !
 !-----------------------------------------------------------------------------
 subroutine lopsided(ex,X,Y,Z,f,f_rhs,Sfx,Sfy,Sfz,Symmetry,SoA)
  implicit none
 !~~~~~~> Input parameters:
  integer, intent(in)  :: ex(1:3),Symmetry
  real*8,  intent(in)  :: X(1:ex(1)),Y(1:ex(2)),Z(1:ex(3))
  real*8,dimension(ex(1),ex(2),ex(3)),intent(in)   :: f,Sfx,Sfy,Sfz
  real*8,dimension(ex(1),ex(2),ex(3)),intent(inout):: f_rhs
  real*8,dimension(3),intent(in) ::SoA
 !~~~~~~> local variables:
  real*8,dimension(-3:ex(1),-3:ex(2),-3:ex(3))   :: fh
  integer :: imin,jmin,kmin,imax,jmax,kmax,i,j,k
  real*8 :: dX,dY,dZ
  real*8 :: d60dx,d60dy,d60dz,d12dx,d12dy,d12dz,d2dx,d2dy,d2dz
  real*8,  parameter :: ZEO=0.d0,ONE=1.d0, F60=6.d1
  real*8,  parameter :: TWO=2.d0,F24=2.4d1,F35=3.5d1,F80=8.d1,F30=3.d1,EIT=8.d0
  real*8,  parameter ::  F9=9.d0,F45=4.5d1,F12=1.2d1
  real*8,  parameter ::  F10=1.d1,F77=7.7d1,F150=1.5d2,F100=1.d2,F50=5.d1,F15=1.5d1
  integer, parameter :: NO_SYMM = 0, EQ_SYMM = 1, OCTANT = 2
  dX = X(2)-X(1)
  dY = Y(2)-Y(1)
  dZ = Z(2)-Z(1)
  d60dx = ONE/F60/dX
  d60dy = ONE/F60/dY
  d60dz = ONE/F60/dZ
  d12dx = ONE/F12/dX
  d12dy = ONE/F12/dY
  d12dz = ONE/F12/dZ
  d2dx = ONE/TWO/dX
  d2dy = ONE/TWO/dY
  d2dz = ONE/TWO/dZ
  imax = ex(1)
  jmax = ex(2)
  kmax = ex(3)
  imin = 1
  jmin = 1
  kmin = 1
  if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin = -3
  if(Symmetry > EQ_SYMM .and. dabs(X(1)) < dX) imin = -3
  if(Symmetry > EQ_SYMM .and. dabs(Y(1)) < dY) jmin = -3
  call symmetry_bd(4,ex,f,fh,SoA)
 ! the upper bound is set to ex-1 only for efficiency;
 ! the loop body would set the ex points to 0 as well
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
 ! x direction   
    if(Sfx(i,j,k) >= ZEO .and. i+4 <= imax .and. i-2 >= imin)then
 !         v
 ! D f = ------[ 2f     - 24f    - 35f  + 80f    - 30f     + 8f     - f    ]
 !  i     60dx     i-2v      i-v      i      i+v      i+2v     i+3v    i+4v
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                             &
                  Sfx(i,j,k)*d60dx*(TWO*fh(i-2,j,k)-F24*fh(i-1,j,k)-F35*fh(i,j,k)+F80*fh(i+1,j,k) &
                                   -F30*fh(i+2,j,k)+EIT*fh(i+3,j,k)-    fh(i+4,j,k))
    elseif(Sfx(i,j,k) >= ZEO .and. i+5 <= imax .and. i-1 >= imin)then
 !         v
 ! D f = ------[-10f    - 77f  + 150f    - 100f     + 50f     -15f     + 2f    ]
 !  i     60dx      i-v      i       i+v       i+2v      i+3v     i+4v    i+5v
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                        &
                  Sfx(i,j,k)*d60dx*(-F10*fh(i-1,j,k)-F77*fh(i  ,j,k)+F150*fh(i+1,j,k)-F100*fh(i+2,j,k) &
                                    +F50*fh(i+3,j,k)-F15*fh(i+4,j,k)+ TWO*fh(i+5,j,k))
    elseif(Sfx(i,j,k) <= ZEO .and. i-4 >= imin .and. i+2 <= imax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                                   &
                  Sfx(i,j,k)*d60dx*(TWO*fh(i+2,j,k)-F24*fh(i+1,j,k)-F35*fh(i,j,k)+F80*fh(i-1,j,k) &
                                   -F30*fh(i-2,j,k)+EIT*fh(i-3,j,k)-    fh(i-4,j,k))
    elseif(Sfx(i,j,k) <= ZEO .and. i-5 >= imin .and. i+1 <= imax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                                        &
                  Sfx(i,j,k)*d60dx*(-F10*fh(i+1,j,k)-F77*fh(i  ,j,k)+F150*fh(i-1,j,k)-F100*fh(i-2,j,k) &
                                    +F50*fh(i-3,j,k)-F15*fh(i-4,j,k)+ TWO*fh(i-5,j,k))
     elseif(i+3 <= imax .and. i-3 >= imin)then
 !           - f(i-3) + 9 f(i-2) - 45 f(i-1) + 45 f(i+1) - 9 f(i+2) + f(i+3)
 !  fx(i) = -----------------------------------------------------------------
 !                                        60 dx
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                              &
                  Sfx(i,j,k)*d60dx*(-fh(i-3,j,k)+F9*fh(i-2,j,k)-F45*fh(i-1,j,k)+F45*fh(i+1,j,k)-F9*fh(i+2,j,k)+fh(i+3,j,k))
     elseif(i+2 <= imax .and. i-2 >= imin)then
 !
 !              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
 !  fx(i) = ---------------------------------------------
 !                             12 dx
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
                  Sfx(i,j,k)*d12dx*(fh(i-2,j,k)-EIT*fh(i-1,j,k)+EIT*fh(i+1,j,k)-fh(i+2,j,k))
     elseif(i+1 <= imax .and. i-1 >= imin)then
 !
 !              - f(i-1) + f(i+1)
 !  fx(i) = --------------------------------
 !                     2 dx
     f_rhs(i,j,k)=f_rhs(i,j,k) + Sfx(i,j,k)*d2dx*(-fh(i-1,j,k)+fh(i+1,j,k))
 ! set imax and imin 0
    endif
 ! y direction   
     if(Sfy(i,j,k) >= ZEO .and. j+4 <= jmax .and. j-2 >= jmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                   &
                  Sfy(i,j,k)*d60dy*(TWO*fh(i,j-2,k)-F24*fh(i,j-1,k)-F35*fh(i,j,k)+F80*fh(i,j+1,k) &
                                   -F30*fh(i,j+2,k)+EIT*fh(i,j+3,k)-    fh(i,j+4,k))
     elseif(Sfy(i,j,k) >= ZEO .and. j+5 <= jmax .and. j-1 >= jmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                        &
                  Sfy(i,j,k)*d60dy*(-F10*fh(i,j-1,k)-F77*fh(i,j  ,k)+F150*fh(i,j+1,k)-F100*fh(i,j+2,k) &
                                    +F50*fh(i,j+3,k)-F15*fh(i,j+4,k)+ TWO*fh(i,j+5,k))
     elseif(Sfy(i,j,k) <= ZEO .and. j-4 >= jmin .and. j+2 <= jmax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                                   &
                  Sfy(i,j,k)*d60dy*(TWO*fh(i,j+2,k)-F24*fh(i,j+1,k)-F35*fh(i,j,k)+F80*fh(i,j-1,k) &
                                   -F30*fh(i,j-2,k)+EIT*fh(i,j-3,k)-    fh(i,j-4,k))
     elseif(Sfy(i,j,k) <= ZEO .and. j-5 >= jmin .and. j+1 <= jmax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                                        &
                  Sfy(i,j,k)*d60dy*(-F10*fh(i,j+1,k)-F77*fh(i,j  ,k)+F150*fh(i,j-1,k)-F100*fh(i,j-2,k) &
                                    +F50*fh(i,j-3,k)-F15*fh(i,j-4,k)+ TWO*fh(i,j-5,k))
     elseif(j+3 <= jmax .and. j-3 >= jmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                                         &
                  Sfy(i,j,k)*d60dy*(-fh(i,j-3,k)+F9*fh(i,j-2,k)-F45*fh(i,j-1,k)+F45*fh(i,j+1,k)-F9*fh(i,j+2,k)+fh(i,j+3,k))
     elseif(j+2 <= jmax .and. j-2 >= jmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                            &
                  Sfy(i,j,k)*d12dy*(fh(i,j-2,k)-EIT*fh(i,j-1,k)+EIT*fh(i,j+1,k)-fh(i,j+2,k))
     elseif(j+1 <= jmax .and. j-1 >= jmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k) + Sfy(i,j,k)*d2dy*(-fh(i,j-1,k)+fh(i,j+1,k))
 ! set jmin and jmax 0
     endif
 !! z direction   
     if(Sfz(i,j,k) >= ZEO .and. k+4 <= kmax .and. k-2 >= kmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                   &
                  Sfz(i,j,k)*d60dz*(TWO*fh(i,j,k-2)-F24*fh(i,j,k-1)-F35*fh(i,j,k)+F80*fh(i,j,k+1) &
                                   -F30*fh(i,j,k+2)+EIT*fh(i,j,k+3)-    fh(i,j,k+4))
     elseif(Sfz(i,j,k) >= ZEO .and. k+5 <= kmax .and. k-1 >= kmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                        &
                  Sfz(i,j,k)*d60dz*(-F10*fh(i,j,k-1)-F77*fh(i,j,k  )+F150*fh(i,j,k+1)-F100*fh(i,j,k+2) &
                                    +F50*fh(i,j,k+3)-F15*fh(i,j,k+4)+ TWO*fh(i,j,k+5))
     elseif(Sfz(i,j,k) <= ZEO .and. k-4 >= kmin .and. k+2 <= kmax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                                   &
                  Sfz(i,j,k)*d60dz*(TWO*fh(i,j,k+2)-F24*fh(i,j,k+1)-F35*fh(i,j,k)+F80*fh(i,j,k-1) &
                                   -F30*fh(i,j,k-2)+EIT*fh(i,j,k-3)-    fh(i,j,k-4))
     elseif(Sfz(i,j,k) <= ZEO .and. k-5 >= kmin .and. k+1 <= kmax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                                        &
                  Sfz(i,j,k)*d60dz*(-F10*fh(i,j,k+1)-F77*fh(i,j,k  )+F150*fh(i,j,k-1)-F100*fh(i,j,k-2) &
                                    +F50*fh(i,j,k-3)-F15*fh(i,j,k-4)+ TWO*fh(i,j,k-5))
     elseif(k+3 <= kmax .and. k-3 >= kmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                                         &
                  Sfz(i,j,k)*d60dz*(-fh(i,j,k-3)+F9*fh(i,j,k-2)-F45*fh(i,j,k-1)+F45*fh(i,j,k+1)-F9*fh(i,j,k+2)+fh(i,j,k+3))
     elseif(k+2 <= kmax .and. k-2 >= kmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                            &
                  Sfz(i,j,k)*d12dz*(fh(i,j,k-2)-EIT*fh(i,j,k-1)+EIT*fh(i,j,k+1)-fh(i,j,k+2))
     elseif(k+1 <= kmax .and. k-1 >= kmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+Sfz(i,j,k)*d2dz*(-fh(i,j,k-1)+fh(i,j,k+1))
 ! set kmin and kmax 0
     endif
  enddo
  enddo
  enddo
  return
  end subroutine lopsided
 #elif (ghost_width == 5)
 ! eighth order code
 !-----------------------------------------------------------------------------
 ! PRD 77, 024034 (2008)
 ! Compute advection terms in right hand sides of field equations
 !        v [ - 5 f(i-3v) + 60 f(i-2v) - 420 f(i-v) - 378 f(i) + 1050 f(i+v) - 420 f(i+2v) + 140 f(i+3v) - 30 f(i+4v) + 3 f(i+5v)]
 ! D f = --------------------------------------------------------------------------------------------------------------------------
 !  i                                                             840 dx           
 !
 ! where
 !
 !        i
 !      |B |
 ! v = -----
 !        i
 !       B
 !
 !-----------------------------------------------------------------------------
 subroutine lopsided(ex,X,Y,Z,f,f_rhs,Sfx,Sfy,Sfz,Symmetry,SoA)
  implicit none
 !~~~~~~> Input parameters:
  integer, intent(in)  :: ex(1:3),Symmetry
  real*8,  intent(in)  :: X(1:ex(1)),Y(1:ex(2)),Z(1:ex(3))
  real*8,dimension(ex(1),ex(2),ex(3)),intent(in)   :: f,Sfx,Sfy,Sfz
  real*8,dimension(ex(1),ex(2),ex(3)),intent(inout):: f_rhs
  real*8,dimension(3),intent(in) ::SoA
 !~~~~~~> local variables:
  real*8,dimension(-4:ex(1),-4:ex(2),-4:ex(3))   :: fh
  integer :: imin,jmin,kmin,imax,jmax,kmax,i,j,k
  real*8 :: dX,dY,dZ
  real*8 :: d840dx,d840dy,d840dz,d60dx,d60dy,d60dz,d12dx,d12dy,d12dz,d2dx,d2dy,d2dz
  real*8,  parameter :: ZEO=0.d0,ONE=1.d0, F60=6.d1
  real*8,  parameter :: TWO=2.d0,F30=3.d1,EIT=8.d0
  real*8,  parameter ::  F9=9.d0,F45=4.5d1,F12=1.2d1,F140=1.4d2,THR=3.d0
  real*8,  parameter :: F840=8.4d2,F5=5.d0,F420=4.2d2,F378=3.78d2,F1050=1.05d3
  real*8,  parameter :: F32=3.2d1,F168=1.68d2,F672=6.72d2
  integer, parameter :: NO_SYMM = 0, EQ_SYMM = 1, OCTANT = 2
  dX = X(2)-X(1)
  dY = Y(2)-Y(1)
  dZ = Z(2)-Z(1)
  d840dx = ONE/F840/dX
  d840dy = ONE/F840/dY
  d840dz = ONE/F840/dZ
  d60dx = ONE/F60/dX
  d60dy = ONE/F60/dY
  d60dz = ONE/F60/dZ
  d12dx = ONE/F12/dX
  d12dy = ONE/F12/dY
  d12dz = ONE/F12/dZ
  d2dx = ONE/TWO/dX
  d2dy = ONE/TWO/dY
  d2dz = ONE/TWO/dZ
  imax = ex(1)
  jmax = ex(2)
  kmax = ex(3)
  imin = 1
  jmin = 1
  kmin = 1
  if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin = -4
  if(Symmetry > EQ_SYMM .and. dabs(X(1)) < dX) imin = -4
  if(Symmetry > EQ_SYMM .and. dabs(Y(1)) < dY) jmin = -4
  call symmetry_bd(5,ex,f,fh,SoA)
 ! the upper bound is set to ex-1 only for efficiency;
 ! the loop body would set the ex points to 0 as well
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
 ! x direction   
    if(Sfx(i,j,k) >= ZEO .and. i+5 <= imax .and. i-3 >= imin)then
 !        v [ - 5 f(i-3v) + 60 f(i-2v) - 420 f(i-v) - 378 f(i) + 1050 f(i+v) - 420 f(i+2v) + 140 f(i+3v) - 30 f(i+4v) + 3 f(i+5v)]
 ! D f = --------------------------------------------------------------------------------------------------------------------------
 !  i                                                             840 dx    
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                         &
                  Sfx(i,j,k)*d840dx*(-F5*fh(i-3,j,k)+F60 *fh(i-2,j,k)-F420*fh(i-1,j,k)-F378*fh(i  ,j,k) &
                                  +F1050*fh(i+1,j,k)-F420*fh(i+2,j,k)+F140*fh(i+3,j,k)-F30 *fh(i+4,j,k)+THR*fh(i+5,j,k))
    elseif(Sfx(i,j,k) <= ZEO .and. i-5 >= imin .and. i+3 <= imax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                                          &
                  Sfx(i,j,k)*d840dx*(-F5*fh(i+3,j,k)+F60 *fh(i+2,j,k)-F420*fh(i+1,j,k)-F378*fh(i   ,j,k) &
                                  +F1050*fh(i-1,j,k)-F420*fh(i-2,j,k)+F140*fh(i-3,j,k)- F30*fh(i-4,j,k)+THR*fh(i-5,j,k))
    elseif(i+4 <= imax .and. i-4 >= imin)then
 !           3 f(i-4) - 32 f(i-3) + 168 f(i-2) - 672 f(i-1) + 672 f(i+1) - 168 f(i+2) + 32 f(i+3) - 3 f(i+4)
 !  fx(i) = -------------------------------------------------------------------------------------------------
 !                                                        840 dx
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                              &
                  Sfx(i,j,k)*d840dx*( THR*fh(i-4,j,k)-F32 *fh(i-3,j,k)+F168*fh(i-2,j,k)-F672*fh(i-1,j,k)+    &
                                     F672*fh(i+1,j,k)-F168*fh(i+2,j,k)+F32 *fh(i+3,j,k)-THR *fh(i+4,j,k))
     elseif(i+3 <= imax .and. i-3 >= imin)then
 !           - f(i-3) + 9 f(i-2) - 45 f(i-1) + 45 f(i+1) - 9 f(i+2) + f(i+3)
 !  fx(i) = -----------------------------------------------------------------
 !                                        60 dx
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                              &
                  Sfx(i,j,k)*d60dx*(-fh(i-3,j,k)+F9*fh(i-2,j,k)-F45*fh(i-1,j,k)+F45*fh(i+1,j,k)-F9*fh(i+2,j,k)+fh(i+3,j,k))
     elseif(i+2 <= imax .and. i-2 >= imin)then
 !
 !              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
 !  fx(i) = ---------------------------------------------
 !                             12 dx
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                           &
                  Sfx(i,j,k)*d12dx*(fh(i-2,j,k)-EIT*fh(i-1,j,k)+EIT*fh(i+1,j,k)-fh(i+2,j,k))
     elseif(i+1 <= imax .and. i-1 >= imin)then
 !
 !              - f(i-1) + f(i+1)
 !  fx(i) = --------------------------------
 !                     2 dx
     f_rhs(i,j,k)=f_rhs(i,j,k) + Sfx(i,j,k)*d2dx*(-fh(i-1,j,k)+fh(i+1,j,k))
 ! set imax and imin 0
    endif
 ! y direction   
    if(Sfy(i,j,k) >= ZEO .and. j+5 <= jmax .and. j-3 >= jmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                         &
                  Sfy(i,j,k)*d840dy*(-F5*fh(i,j-3,k)+F60 *fh(i,j-2,k)-F420*fh(i,j-1,k)-F378*fh(i,j  ,k) &
                                  +F1050*fh(i,j+1,k)-F420*fh(i,j+2,k)+F140*fh(i,j+3,k)-F30 *fh(i,j+4,k)+THR*fh(i,j+5,k))
    elseif(Sfy(i,j,k) <= ZEO .and. j-5 >= jmin .and. j+3 <= jmax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                                         &
                  Sfy(i,j,k)*d840dy*(-F5*fh(i,j+3,k)+F60 *fh(i,j+2,k)-F420*fh(i,j+1,k)-F378*fh(i,j  ,k) &
                                  +F1050*fh(i,j-1,k)-F420*fh(i,j-2,k)+F140*fh(i,j-3,k)- F30*fh(i,j-4,k)+THR*fh(i,j-5,k))
    elseif(j+4 <= jmax .and. j-4 >= jmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                              &
                  Sfy(i,j,k)*d840dy*( THR*fh(i,j-4,k)-F32 *fh(i,j-3,k)+F168*fh(i,j-2,k)-F672*fh(i,j-1,k)+    &
                                     F672*fh(i,j+1,k)-F168*fh(i,j+2,k)+F32 *fh(i,j+3,k)-THR *fh(i,j+4,k))
     elseif(j+3 <= jmax .and. j-3 >= jmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                                         &
                  Sfy(i,j,k)*d60dy*(-fh(i,j-3,k)+F9*fh(i,j-2,k)-F45*fh(i,j-1,k)+F45*fh(i,j+1,k)-F9*fh(i,j+2,k)+fh(i,j+3,k))
     elseif(j+2 <= jmax .and. j-2 >= jmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                            &
                  Sfy(i,j,k)*d12dy*(fh(i,j-2,k)-EIT*fh(i,j-1,k)+EIT*fh(i,j+1,k)-fh(i,j+2,k))
     elseif(j+1 <= jmax .and. j-1 >= jmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k) + Sfy(i,j,k)*d2dy*(-fh(i,j-1,k)+fh(i,j+1,k))
 ! set jmin and jmax 0
     endif
 !! z direction   
    if(Sfz(i,j,k) >= ZEO .and. k+5 <= kmax .and. k-3 >= kmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                         &
                  Sfz(i,j,k)*d840dz*(-F5*fh(i,j,k-3)+F60 *fh(i,j,k-2)-F420*fh(i,j,k-1)-F378*fh(i,j,k  ) &
                                  +F1050*fh(i,j,k+1)-F420*fh(i,j,k+2)+F140*fh(i,j,k+3)-F30 *fh(i,j,k+4)+THR*fh(i,j,k+5))
    elseif(Sfz(i,j,k) <= ZEO .and. k-5 >= kmin .and. k+3 <= kmax)then
     f_rhs(i,j,k)=f_rhs(i,j,k)-                                                                         &
                  Sfz(i,j,k)*d840dz*(-F5*fh(i,j,k+3)+F60 *fh(i,j,k+2)-F420*fh(i,j,k+1)-F378*fh(i,j,k  ) &
                                  +F1050*fh(i,j,k-1)-F420*fh(i,j,k-2)+F140*fh(i,j,k-3)- F30*fh(i,j,k-4)+THR*fh(i,j,k-5))
    elseif(k+4 <= kmax .and. k-4 >= kmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                              &
                  Sfz(i,j,k)*d840dz*( THR*fh(i,j,k-4)-F32 *fh(i,j,k-3)+F168*fh(i,j,k-2)-F672*fh(i,j,k-1)+    &
                                     F672*fh(i,j,k+1)-F168*fh(i,j,k+2)+F32 *fh(i,j,k+3)-THR *fh(i,j,k+4))
     elseif(k+3 <= kmax .and. k-3 >= kmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                                                         &
                  Sfz(i,j,k)*d60dz*(-fh(i,j,k-3)+F9*fh(i,j,k-2)-F45*fh(i,j,k-1)+F45*fh(i,j,k+1)-F9*fh(i,j,k+2)+fh(i,j,k+3))
     elseif(k+2 <= kmax .and. k-2 >= kmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+                                                            &
                  Sfz(i,j,k)*d12dz*(fh(i,j,k-2)-EIT*fh(i,j,k-1)+EIT*fh(i,j,k+1)-fh(i,j,k+2))
     elseif(k+1 <= kmax .and. k-1 >= kmin)then
     f_rhs(i,j,k)=f_rhs(i,j,k)+Sfz(i,j,k)*d2dz*(-fh(i,j,k-1)+fh(i,j,k+1))
 ! set kmin and kmax 0
     endif
  enddo
  enddo
  enddo
  return
  end subroutine lopsided
 #endif  
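As a quick sanity check on the upwind-biased stencil used in `lopsided` above, the sketch below (plain Python, not part of the patch) checks with exact integer arithmetic that the nine weights over offsets -3..+5 form a consistent first-derivative stencil. The helper name `stencil_moments` is hypothetical; the weights and the 840 denominator are read off the Fortran comment and code.

```python
# Weights of the positive-shift lopsided stencil, read off the Fortran source:
# offsets i-3 .. i+5, common denominator 840*dx.
OFFSETS = list(range(-3, 6))
WEIGHTS = [-5, 60, -420, -378, 1050, -420, 140, -30, 3]
DENOMINATOR = 840


def stencil_moments(offsets, weights, max_power):
    """Return sum_k w_k * k**p for p = 0..max_power, as exact integers."""
    return [sum(w * k**p for k, w in zip(offsets, weights))
            for p in range(max_power + 1)]


# A first-derivative stencil needs the first moment to equal the denominator
# and every other moment up to its design order to vanish.
moments = stencil_moments(OFFSETS, WEIGHTS, 8)
print(moments)
```

The first moment equals 840 and the remaining moments through degree 8 are zero, so on a uniform grid the stencil differentiates polynomials up to degree 8 exactly.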
--- a/AMSS_NCKU_source/makefile.inc
+++ b/AMSS_NCKU_source/makefile.inc
@@ -10,14 +10,15 @@ filein  = -I/usr/include/ -I${MKLROOT}/include
 ## Added -lifcore for Intel Fortran runtime and -limf for Intel math library
 LDLIBS  = -L${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lifcore -limf -lpthread -lm -ldl
-## Aggressive optimization flags:
-## -O3: Maximum optimization
-## -xHost: Optimize for the host CPU architecture (Intel/AMD compatible)
-## -fp-model fast=2: Aggressive floating-point optimizations
+## Aggressive optimization flags + PGO Phase 2 (profile-guided optimization)
+## -fprofile-instr-use: use collected profile data to guide optimization decisions
+##   (branch prediction, basic block layout, inlining, loop unrolling)
+PROFDATA     = ../../pgo_profile/default.profdata
 ## -fma: Enable fused multiply-add instructions
 CXXAPPFLAGS  = -O3 -xHost -fp-model fast=2 -fma -ipo \
               -fprofile-instr-use=$(PROFDATA) \
               -Dfortran3 -Dnewc -I${MKLROOT}/include
 f90appflags  = -O3 -xHost -fp-model fast=2 -fma -ipo \
               -fprofile-instr-use=$(PROFDATA) \
               -align array64byte -fpp -I${MKLROOT}/include
 f90          = ifx
 f77          = ifx
--- a/AMSS_NCKU_source/surface_integral.C
+++ b/AMSS_NCKU_source/surface_integral.C
@@ -220,16 +220,9 @@ void surface_integral::surf_Wave(double rex, int lev, cgh *GH, var *Rpsi4, var *
    pox[2][n] = rex * nz_g[n];
  }
-  double *shellf;
-  shellf = new double[n_tot * InList];
-  GH->PatL[lev]->data->Interp_Points(DG_List, n_tot, pox, shellf, Symmetry);
  int mp, Lp, Nmin, Nmax;
  mp = n_tot / cpusize;
  Lp = n_tot - cpusize * mp;
  if (Lp > myrank)
  {
    Nmin = myrank * mp + myrank;
@@ -241,6 +234,11 @@ void surface_integral::surf_Wave(double rex, int lev, cgh *GH, var *Rpsi4, var *
    Nmax = Nmin + mp - 1;
  }
+  double *shellf;
+  shellf = new double[n_tot * InList];
+  GH->PatL[lev]->data->Interp_Points(DG_List, n_tot, pox, shellf, Symmetry, Nmin, Nmax);
  //|~~~~~> Integrate the dot product of Dphi with the surface normal.
  double *RP_out, *IP_out;
@@ -363,8 +361,17 @@ void surface_integral::surf_Wave(double rex, int lev, cgh *GH, var *Rpsi4, var *
  }
  //|------+  Communicate and sum the results from each processor.
-  MPI_Allreduce(RP_out, RP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(IP_out, IP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+  {
+    double *RPIP_out = new double[2 * NN];
+    double *RPIP = new double[2 * NN];
+    memcpy(RPIP_out, RP_out, NN * sizeof(double));
+    memcpy(RPIP_out + NN, IP_out, NN * sizeof(double));
+    MPI_Allreduce(RPIP_out, RPIP, 2 * NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    memcpy(RP, RPIP, NN * sizeof(double));
+    memcpy(IP, RPIP + NN, NN * sizeof(double));
+    delete[] RPIP_out;
+    delete[] RPIP;
+  }
  //|------= Free memory.
@@ -556,8 +563,17 @@ void surface_integral::surf_Wave(double rex, int lev, cgh *GH, var *Rpsi4, var *
  }
  //|------+  Communicate and sum the results from each processor.
-  MPI_Allreduce(RP_out, RP, NN, MPI_DOUBLE, MPI_SUM, Comm_here);
-  MPI_Allreduce(IP_out, IP, NN, MPI_DOUBLE, MPI_SUM, Comm_here);
+  {
+    double *RPIP_out = new double[2 * NN];
+    double *RPIP = new double[2 * NN];
+    memcpy(RPIP_out, RP_out, NN * sizeof(double));
+    memcpy(RPIP_out + NN, IP_out, NN * sizeof(double));
+    MPI_Allreduce(RPIP_out, RPIP, 2 * NN, MPI_DOUBLE, MPI_SUM, Comm_here);
+    memcpy(RP, RPIP, NN * sizeof(double));
+    memcpy(IP, RPIP + NN, NN * sizeof(double));
+    delete[] RPIP_out;
+    delete[] RPIP;
+  }
  //|------= Free memory.
@@ -735,8 +751,17 @@ void surface_integral::surf_Wave(double rex, int lev, ShellPatch *GH, var *Rpsi4
  }
  //|------+  Communicate and sum the results from each processor.
-  MPI_Allreduce(RP_out, RP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(IP_out, IP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+  {
+    double *RPIP_out = new double[2 * NN];
+    double *RPIP = new double[2 * NN];
+    memcpy(RPIP_out, RP_out, NN * sizeof(double));
+    memcpy(RPIP_out + NN, IP_out, NN * sizeof(double));
+    MPI_Allreduce(RPIP_out, RPIP, 2 * NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    memcpy(RP, RPIP, NN * sizeof(double));
+    memcpy(IP, RPIP + NN, NN * sizeof(double));
+    delete[] RPIP_out;
+    delete[] RPIP;
+  }
  //|------= Free memory.
@@ -984,8 +1009,17 @@ void surface_integral::surf_Wave(double rex, int lev, ShellPatch *GH,
  }
  //|------+  Communicate and sum the results from each processor.
-  MPI_Allreduce(RP_out, RP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(IP_out, IP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+  {
+    double *RPIP_out = new double[2 * NN];
+    double *RPIP = new double[2 * NN];
+    memcpy(RPIP_out, RP_out, NN * sizeof(double));
+    memcpy(RPIP_out + NN, IP_out, NN * sizeof(double));
+    MPI_Allreduce(RPIP_out, RPIP, 2 * NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    memcpy(RP, RPIP, NN * sizeof(double));
+    memcpy(IP, RPIP + NN, NN * sizeof(double));
+    delete[] RPIP_out;
+    delete[] RPIP;
+  }
  //|------= Free memory.
@@ -1419,8 +1453,17 @@ void surface_integral::surf_Wave(double rex, int lev, ShellPatch *GH,
  }
  //|------+  Communicate and sum the results from each processor.
-  MPI_Allreduce(RP_out, RP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(IP_out, IP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+  {
+    double *RPIP_out = new double[2 * NN];
+    double *RPIP = new double[2 * NN];
+    memcpy(RPIP_out, RP_out, NN * sizeof(double));
+    memcpy(RPIP_out + NN, IP_out, NN * sizeof(double));
+    MPI_Allreduce(RPIP_out, RPIP, 2 * NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    memcpy(RP, RPIP, NN * sizeof(double));
+    memcpy(IP, RPIP + NN, NN * sizeof(double));
+    delete[] RPIP_out;
+    delete[] RPIP;
+  }
  //|------= Free memory.
@@ -1854,8 +1897,17 @@ void surface_integral::surf_Wave(double rex, int lev, cgh *GH,
  }
  //|------+  Communicate and sum the results from each processor.
-  MPI_Allreduce(RP_out, RP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(IP_out, IP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+  {
+    double *RPIP_out = new double[2 * NN];
+    double *RPIP = new double[2 * NN];
+    memcpy(RPIP_out, RP_out, NN * sizeof(double));
+    memcpy(RPIP_out + NN, IP_out, NN * sizeof(double));
+    MPI_Allreduce(RPIP_out, RPIP, 2 * NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    memcpy(RP, RPIP, NN * sizeof(double));
+    memcpy(IP, RPIP + NN, NN * sizeof(double));
+    delete[] RPIP_out;
+    delete[] RPIP;
+  }
  //|------= Free memory.
@@ -2040,8 +2092,17 @@ void surface_integral::surf_Wave(double rex, int lev, NullShellPatch2 *GH, var *
  }
  //|------+  Communicate and sum the results from each processor.
-  MPI_Allreduce(RP_out, RP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(IP_out, IP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+  {
+    double *RPIP_out = new double[2 * NN];
+    double *RPIP = new double[2 * NN];
+    memcpy(RPIP_out, RP_out, NN * sizeof(double));
+    memcpy(RPIP_out + NN, IP_out, NN * sizeof(double));
+    MPI_Allreduce(RPIP_out, RPIP, 2 * NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    memcpy(RP, RPIP, NN * sizeof(double));
+    memcpy(IP, RPIP + NN, NN * sizeof(double));
+    delete[] RPIP_out;
+    delete[] RPIP;
+  }
  //|------= Free memory.
@@ -2226,8 +2287,17 @@ void surface_integral::surf_Wave(double rex, int lev, NullShellPatch *GH, var *R
  }
  //|------+  Communicate and sum the results from each processor.
-  MPI_Allreduce(RP_out, RP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(IP_out, IP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+  {
+    double *RPIP_out = new double[2 * NN];
+    double *RPIP = new double[2 * NN];
+    memcpy(RPIP_out, RP_out, NN * sizeof(double));
+    memcpy(RPIP_out + NN, IP_out, NN * sizeof(double));
+    MPI_Allreduce(RPIP_out, RPIP, 2 * NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    memcpy(RP, RPIP, NN * sizeof(double));
+    memcpy(IP, RPIP + NN, NN * sizeof(double));
+    delete[] RPIP_out;
+    delete[] RPIP;
+  }
  //|------= Free memory.
@@ -2314,25 +2384,9 @@ void surface_integral::surf_MassPAng(double rex, int lev, cgh *GH, var *chi, var
    pox[2][n] = rex * nz_g[n];
  }
-  double *shellf;
-  shellf = new double[n_tot * InList];
-  // we have assumed there is only one box on this level,
-  // so we do not need loop boxes
-  GH->PatL[lev]->data->Interp_Points(DG_List, n_tot, pox, shellf, Symmetry);
-  double Mass_out = 0;
-  double ang_outx, ang_outy, ang_outz;
-  double p_outx, p_outy, p_outz;
-  ang_outx = ang_outy = ang_outz = 0.0;
-  p_outx = p_outy = p_outz = 0.0;
-  const double f1o8 = 0.125;
  int mp, Lp, Nmin, Nmax;
  mp = n_tot / cpusize;
  Lp = n_tot - cpusize * mp;
  if (Lp > myrank)
  {
    Nmin = myrank * mp + myrank;
@@ -2344,6 +2398,20 @@ void surface_integral::surf_MassPAng(double rex, int lev, cgh *GH, var *chi, var
    Nmax = Nmin + mp - 1;
  }
+  double *shellf;
+  shellf = new double[n_tot * InList];
+  // we have assumed there is only one box on this level,
+  // so we do not need loop boxes
+  GH->PatL[lev]->data->Interp_Points(DG_List, n_tot, pox, shellf, Symmetry, Nmin, Nmax);
+  double Mass_out = 0;
+  double ang_outx, ang_outy, ang_outz;
+  double p_outx, p_outy, p_outz;
+  ang_outx = ang_outy = ang_outz = 0.0;
+  p_outx = p_outy = p_outz = 0.0;
+  const double f1o8 = 0.125;
  double Chi, Psi;
  double Gxx, Gxy, Gxz, Gyy, Gyz, Gzz;
  double gupxx, gupxy, gupxz, gupyy, gupyz, gupzz;
@@ -2464,15 +2532,13 @@ void surface_integral::surf_MassPAng(double rex, int lev, cgh *GH, var *chi, var
    }
  }
-  MPI_Allreduce(&Mass_out, &mass, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-
-  MPI_Allreduce(&ang_outx, &sx, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(&ang_outy, &sy, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(&ang_outz, &sz, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-
-  MPI_Allreduce(&p_outx, &px, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(&p_outy, &py, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(&p_outz, &pz, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+  {
+    double scalar_out[7] = {Mass_out, ang_outx, ang_outy, ang_outz, p_outx, p_outy, p_outz};
+    double scalar_in[7];
+    MPI_Allreduce(scalar_out, scalar_in, 7, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    mass = scalar_in[0]; sx = scalar_in[1]; sy = scalar_in[2]; sz = scalar_in[3];
+    px = scalar_in[4]; py = scalar_in[5]; pz = scalar_in[6];
+  }
 #ifdef GaussInt
  mass = mass * rex * rex * dphi * factor;
@@ -2735,15 +2801,13 @@ void surface_integral::surf_MassPAng(double rex, int lev, cgh *GH, var *chi, var
    }
  }
-  MPI_Allreduce(&Mass_out, &mass, 1, MPI_DOUBLE, MPI_SUM, Comm_here);
-
-  MPI_Allreduce(&ang_outx, &sx, 1, MPI_DOUBLE, MPI_SUM, Comm_here);
-  MPI_Allreduce(&ang_outy, &sy, 1, MPI_DOUBLE, MPI_SUM, Comm_here);
-  MPI_Allreduce(&ang_outz, &sz, 1, MPI_DOUBLE, MPI_SUM, Comm_here);
-
-  MPI_Allreduce(&p_outx, &px, 1, MPI_DOUBLE, MPI_SUM, Comm_here);
-  MPI_Allreduce(&p_outy, &py, 1, MPI_DOUBLE, MPI_SUM, Comm_here);
-  MPI_Allreduce(&p_outz, &pz, 1, MPI_DOUBLE, MPI_SUM, Comm_here);
+  {
+    double scalar_out[7] = {Mass_out, ang_outx, ang_outy, ang_outz, p_outx, p_outy, p_outz};
+    double scalar_in[7];
+    MPI_Allreduce(scalar_out, scalar_in, 7, MPI_DOUBLE, MPI_SUM, Comm_here);
+    mass = scalar_in[0]; sx = scalar_in[1]; sy = scalar_in[2]; sz = scalar_in[3];
+    px = scalar_in[4]; py = scalar_in[5]; pz = scalar_in[6];
+  }
 #ifdef GaussInt
  mass = mass * rex * rex * dphi * factor;
@@ -3020,15 +3084,13 @@ void surface_integral::surf_MassPAng(double rex, int lev, ShellPatch *GH, var *c
    }
  }
-  MPI_Allreduce(&Mass_out, &mass, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-
-  MPI_Allreduce(&ang_outx, &sx, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(&ang_outy, &sy, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(&ang_outz, &sz, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-
-  MPI_Allreduce(&p_outx, &px, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(&p_outy, &py, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(&p_outz, &pz, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+  {
+    double scalar_out[7] = {Mass_out, ang_outx, ang_outy, ang_outz, p_outx, p_outy, p_outz};
+    double scalar_in[7];
+    MPI_Allreduce(scalar_out, scalar_in, 7, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    mass = scalar_in[0]; sx = scalar_in[1]; sy = scalar_in[2]; sz = scalar_in[3];
+    px = scalar_in[4]; py = scalar_in[5]; pz = scalar_in[6];
+  }
 #ifdef GaussInt
  mass = mass * rex * rex * dphi * factor;
@@ -3607,8 +3669,17 @@ void surface_integral::surf_Wave(double rex, cgh *GH, ShellPatch *SH,
  }
  //|------+  Communicate and sum the results from each processor.
-  MPI_Allreduce(RP_out, RP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
-  MPI_Allreduce(IP_out, IP, NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+  {
+    double *RPIP_out = new double[2 * NN];
+    double *RPIP = new double[2 * NN];
+    memcpy(RPIP_out, RP_out, NN * sizeof(double));
+    memcpy(RPIP_out + NN, IP_out, NN * sizeof(double));
+    MPI_Allreduce(RPIP_out, RPIP, 2 * NN, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
+    memcpy(RP, RPIP, NN * sizeof(double));
+    memcpy(IP, RPIP + NN, NN * sizeof(double));
+    delete[] RPIP_out;
+    delete[] RPIP;
+  }
  //|------= Free memory.
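The `surface_integral.C` hunks above all apply the same idea: pack several buffers into one, issue a single sum-reduction, then split the result. The sketch below illustrates why that is equivalent to reducing each buffer separately. It is plain Python with no MPI involved; `allreduce_sum` is a hypothetical stand-in that sums element-wise over a list of simulated ranks, and `packed_allreduce` is likewise an illustrative name.

```python
def allreduce_sum(per_rank_buffers):
    """Stand-in for an MPI sum all-reduce: element-wise sum over all ranks."""
    return [sum(values) for values in zip(*per_rank_buffers)]


def packed_allreduce(rp_out_per_rank, ip_out_per_rank):
    """Pack RP_out and IP_out on each rank, reduce once, then split the result."""
    nn = len(rp_out_per_rank[0])
    packed = [rp + ip for rp, ip in zip(rp_out_per_rank, ip_out_per_rank)]
    reduced = allreduce_sum(packed)  # one collective instead of two
    return reduced[:nn], reduced[nn:]


# Two simulated ranks with NN = 3 entries each.
rp_ranks = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
ip_ranks = [[0.5, 0.5, 0.5], [1.5, 1.5, 1.5]]
rp, ip = packed_allreduce(rp_ranks, ip_ranks)
print(rp, ip)
```

Because the sum is element-wise, the packed result matches two separate reductions entry by entry. The gain in the real code would come from paying the collective's latency once instead of twice (or seven times, in the `surf_MassPAng` case).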
--- a/makefile_and_run.py
+++ b/makefile_and_run.py
@@ -15,12 +15,13 @@ import time
 ## taskset ensures all child processes inherit the CPU affinity mask
 ## This forces make and all compiler processes to use only nohz_full cores (4-55, 60-111)
 ## Format: taskset -c 4-55,60-111 ensures processes only run on these cores
-NUMACTL_CPU_BIND = "taskset -c 0-111"
+#NUMACTL_CPU_BIND = "taskset -c 0-111"
+NUMACTL_CPU_BIND = "taskset -c 16-47,64-95"
 ## Build parallelism configuration
 ## Use nohz_full cores (4-55, 60-111) for compilation: 52 + 52 = 104 cores
 ## Set make -j to utilize available cores for faster builds
-BUILD_JOBS = 104
+BUILD_JOBS = 96
 ##################################################################
@@ -117,6 +118,7 @@ def run_ABE():
    if (input_data.GPU_Calculation == "no"):
        mpi_command         = NUMACTL_CPU_BIND + " mpirun -np " + str(input_data.MPI_processes) + " ./ABE"
+        #mpi_command         = " mpirun -np " + str(input_data.MPI_processes) + " ./ABE"
        mpi_command_outfile = "ABE_out.log"
    elif (input_data.GPU_Calculation == "yes"):
        mpi_command         = NUMACTL_CPU_BIND + " mpirun -np " + str(input_data.MPI_processes) + " ./ABEGPU"
@@ -158,7 +160,8 @@ def run_TwoPunctureABE():
    print(                                                          )
    ## Define the command to run
-    TwoPuncture_command         = NUMACTL_CPU_BIND + " ./TwoPunctureABE"
+    #TwoPuncture_command         = NUMACTL_CPU_BIND + " ./TwoPunctureABE"
+    TwoPuncture_command         = " ./TwoPunctureABE"
    TwoPuncture_command_outfile = "TwoPunctureABE_out.log"
    ## Execute the command with subprocess.Popen and stream output
--- a/parallel_plot_helper.py
+++ b/parallel_plot_helper.py
@@ -0,0 +1,29 @@
 import multiprocessing
 def run_plot_task(task):
    """Execute a single plotting task.
    Parameters
    ----------
    task : tuple
        A tuple of (function, args_tuple) where function is a callable
        plotting function and args_tuple contains its arguments.
    """
    func, args = task
    return func(*args)
 def run_plot_tasks_parallel(plot_tasks):
    """Execute a list of independent plotting tasks in parallel.
    Uses the 'fork' context to create worker processes so that the main
    script is NOT re-imported/re-executed in child processes.
    Parameters
    ----------
    plot_tasks : list of tuples
        Each element is (function, args_tuple).
    """
    ctx = multiprocessing.get_context('fork')
    with ctx.Pool() as pool:
        pool.map(run_plot_task, plot_tasks)
--- a/pgo_profile/PGO_Profile_Analysis.md
+++ b/pgo_profile/PGO_Profile_Analysis.md
@@ -0,0 +1,97 @@
 # AMSS-NCKU PGO Profile Analysis Report
 ## 1. Profiling Environment
 | Item | Value |
 |------|-------|
 | Compiler | Intel oneAPI DPC++/C++ 2025.3.0 (icpx/ifx) |
 | Instrumentation Flag | `-fprofile-instr-generate` |
 | Optimization Level (instrumented) | `-O2 -xHost -fma` |
 | MPI Processes | 1 (single process to avoid MPI+instrumentation deadlock) |
 | Profile File | `default_9725750769337483397_0.profraw` (327 KB) |
 | Merged Profile | `default.profdata` (394 KB) |
 | llvm-profdata | `/home/intel/oneapi/compiler/2025.3/bin/compiler/llvm-profdata` |
 ## 2. Reduced Simulation Parameters (for profiling run)
 | Parameter | Production Value | Profiling Value |
 |-----------|-----------------|-----------------|
 | MPI_processes | 64 | 1 |
 | grid_level | 9 | 4 |
 | static_grid_level | 5 | 3 |
 | static_grid_number | 96 | 24 |
 | moving_grid_number | 48 | 16 |
 | largest_box_xyz_max | 320^3 | 160^3 |
 | Final_Evolution_Time | 1000.0 | 10.0 |
 | Evolution_Step_Number | 10,000,000 | 1,000 |
 | Detector_Number | 12 | 2 |
 ## 3. Profile Summary
 | Metric | Value |
 |--------|-------|
 | Total instrumented functions | 1,392 |
 | Functions with non-zero counts | 117 (8.4%) |
 | Functions with zero counts | 1,275 (91.6%) |
 | Maximum function entry count | 386,459,248 |
 | Maximum internal block count | 370,477,680 |
 | Total block count | 4,198,023,118 |
 ## 4. Top 20 Hotspot Functions
 | Rank | Total Count | Max Block Count | Function | Category |
 |------|------------|-----------------|----------|----------|
 | 1 | 1,241,601,732 | 370,477,680 | `polint_` | Interpolation |
 | 2 | 755,994,435 | 230,156,640 | `prolong3_` | Grid prolongation |
 | 3 | 667,964,095 | 3,697,792 | `compute_rhs_bssn_` | BSSN RHS evolution |
 | 4 | 539,736,051 | 386,459,248 | `symmetry_bd_` | Symmetry boundary |
 | 5 | 277,310,808 | 53,170,728 | `lopsided_` | Lopsided FD stencil |
 | 6 | 155,534,488 | 94,535,040 | `decide3d_` | 3D grid decision |
 | 7 | 119,267,712 | 19,266,048 | `rungekutta4_rout_` | RK4 time integrator |
 | 8 | 91,574,616 | 48,824,160 | `kodis_` | Kreiss-Oliger dissipation |
 | 9 | 67,555,389 | 43,243,680 | `fderivs_` | Finite differences |
 | 10 | 55,296,000 | 42,246,144 | `misc::fact(int)` | Factorial utility |
 | 11 | 43,191,071 | 27,663,328 | `fdderivs_` | 2nd-order FD derivatives |
 | 12 | 36,233,965 | 22,429,440 | `restrict3_` | Grid restriction |
 | 13 | 24,698,512 | 17,231,520 | `polin3_` | Polynomial interpolation |
 | 14 | 22,962,942 | 20,968,768 | `copy_` | Data copy |
 | 15 | 20,135,696 | 17,259,168 | `Ansorg::barycentric(...)` | Spectral interpolation |
 | 16 | 14,650,224 | 7,224,768 | `Ansorg::barycentric_omega(...)` | Spectral weights |
 | 17 | 13,242,296 | 2,871,920 | `global_interp_` | Global interpolation |
 | 18 | 12,672,000 | 7,734,528 | `sommerfeld_rout_` | Sommerfeld boundary |
 | 19 | 6,872,832 | 1,880,064 | `sommerfeld_routbam_` | Sommerfeld boundary (BAM) |
 | 20 | 5,709,900 | 2,809,632 | `l2normhelper_` | L2 norm computation |
 ## 5. Hotspot Category Breakdown
 Top 20 functions account for ~98% of total execution counts:
 | Category | Functions | Combined Count | Share |
 |----------|-----------|---------------|-------|
 | Interpolation / Prolongation / Restriction | polint_, prolong3_, restrict3_, polin3_, global_interp_, Ansorg::* | ~2,093M | ~50% |
 | BSSN RHS + FD stencils | compute_rhs_bssn_, lopsided_, fderivs_, fdderivs_ | ~1,056M | ~25% |
 | Boundary conditions | symmetry_bd_, sommerfeld_rout_, sommerfeld_routbam_ | ~559M | ~13% |
 | Time integration | rungekutta4_rout_ | ~119M | ~3% |
 | Dissipation | kodis_ | ~92M | ~2% |
 | Utilities | misc::fact, decide3d_, copy_, l2normhelper_ | ~256M | ~6% |
 ## 6. Conclusions
 1. **Profile data is valid**: 1,392 functions instrumented, 117 exercised with ~4.2 billion total counts.
 2. **Hotspot concentration is high**: The top 4 functions alone account for ~76% of all counts (~83% for the top 5), which is ideal for PGO: the compiler has strong branch/layout optimization targets.
 3. **Fortran numerical kernels dominate**: `polint_`, `prolong3_`, `compute_rhs_bssn_`, `symmetry_bd_`, `lopsided_` are all Fortran routines in the inner evolution loop. PGO will optimize their branch prediction and basic block layout.
 4. **91.6% of functions have zero counts**: These are code paths for unused features (GPU, BSSN-EScalar, BSSN-EM, Z4C, etc.). PGO will deprioritize them, improving instruction cache utilization.
 5. **Profile is representative**: Despite the reduced grid size, the code path coverage matches production — the same kernels (RHS, prolongation, restriction, boundary) are exercised. PGO branch probabilities from this profile will transfer well to full-scale runs.
 ## 7. PGO Phase 2 Usage
 To apply the profile, use the following flags in `makefile.inc`:
 ```makefile
 CXXAPPFLAGS = -O3 -xHost -fp-model fast=2 -fma -ipo \
              -fprofile-instr-use=/home/amss/AMSS-NCKU/pgo_profile/default.profdata \
              -Dfortran3 -Dnewc -I${MKLROOT}/include
 f90appflags = -O3 -xHost -fp-model fast=2 -fma -ipo \
              -fprofile-instr-use=/home/amss/AMSS-NCKU/pgo_profile/default.profdata \
              -align array64byte -fpp -I${MKLROOT}/include
 ```
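The concentration figures quoted in the report can be rechecked directly from its own tables. The sketch below recomputes the cumulative share of the top-ranked functions. It assumes the per-function "Total Count" column and the "Total block count" summary value are in the same units; `cumulative_share` is a hypothetical helper name.

```python
# Values copied from sections 3 and 4 of the report.
TOTAL_BLOCK_COUNT = 4_198_023_118
TOP_COUNTS = [
    ("polint_", 1_241_601_732),
    ("prolong3_", 755_994_435),
    ("compute_rhs_bssn_", 667_964_095),
    ("symmetry_bd_", 539_736_051),
    ("lopsided_", 277_310_808),
]


def cumulative_share(counts, total, n):
    """Fraction of `total` covered by the first `n` entries of `counts`."""
    return sum(count for _, count in counts[:n]) / total


top4 = cumulative_share(TOP_COUNTS, TOTAL_BLOCK_COUNT, 4)
top5 = cumulative_share(TOP_COUNTS, TOTAL_BLOCK_COUNT, 5)
print(f"top 4: {top4:.1%}, top 5: {top5:.1%}")
```

Under that assumption the top four functions cover roughly 76% of all counts and the top five roughly 83%.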
--- a/pgo_profile/default.profdata
+++ b/pgo_profile/default.profdata
--- a/pgo_profile/default.profdata.backup
+++ b/pgo_profile/default.profdata.backup
--- a/pgo_profile/default_15874826282416242821_0_58277.profraw
+++ b/pgo_profile/default_15874826282416242821_0_58277.profraw
--- a/pgo_profile/default_9725750769337483397_0.profraw
+++ b/pgo_profile/default_9725750769337483397_0.profraw
--- a/plot_GW_strain_amplitude_xiaoqu.py
+++ b/plot_GW_strain_amplitude_xiaoqu.py
@@ -11,6 +11,8 @@
 import numpy                               ## numpy for array operations
 import scipy                               ## scipy for interpolation and signal processing
 import math
+import matplotlib
+matplotlib.use('Agg')                      ## use non-interactive backend for multiprocessing safety
 import matplotlib.pyplot    as     plt     ## matplotlib for plotting
 import os                                  ## os for system/file operations
--- a/plot_binary_data.py
+++ b/plot_binary_data.py
@@ -8,16 +8,23 @@
 ##
 #################################################
 ## Restrict OpenMP to one thread per process so that running
 ## many workers in parallel does not create an O(workers * BLAS_threads)
 ## thread explosion.  The variable MUST be set before numpy/scipy
 ## are imported, because the BLAS library reads them only at load time.
 import os
 os.environ.setdefault("OMP_NUM_THREADS",        "1")
 import numpy
 import scipy
 import matplotlib
 matplotlib.use('Agg')                      ## use non-interactive backend for multiprocessing safety
 import matplotlib.pyplot    as     plt
 from   matplotlib.colors    import LogNorm
 from   mpl_toolkits.mplot3d import Axes3D
 ## import torch
 import AMSS_NCKU_Input      as input_data
 import os
 #########################################################################################
@@ -192,3 +199,19 @@ def get_data_xy( Rmin, Rmax, n, data0, time, figure_title, figure_outdir ):
 ####################################################################################
 ####################################################################################
+## Allow this module to be run as a standalone script so that each
+## binary-data plot can be executed in a fresh subprocess whose BLAS
+## environment variables (set above) take effect before numpy loads.
+##
+## Usage:  python3 plot_binary_data.py <filename> <binary_outdir> <figure_outdir>
+####################################################################################
+if __name__ == '__main__':
+    import sys
+    if len(sys.argv) != 4:
+        print(f"Usage: {sys.argv[0]} <filename> <binary_outdir> <figure_outdir>")
+        sys.exit(1)
+    plot_binary_data(sys.argv[1], sys.argv[2], sys.argv[3])
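One detail of the `os.environ.setdefault("OMP_NUM_THREADS", "1")` call in `plot_binary_data.py` is worth spelling out: `setdefault` only fills in a missing value and never overrides one already exported by the caller. The sketch below shows this on ordinary dictionaries, which share the mapping interface of `os.environ`; `pin_thread_count` is a hypothetical helper, not something in the patch.

```python
def pin_thread_count(env, variable="OMP_NUM_THREADS", value="1"):
    """Set `variable` only if it is absent; return the value now in effect."""
    return env.setdefault(variable, value)


# Case 1: nothing exported, so the default of one thread is applied.
unset_env = {}
effective_unset = pin_thread_count(unset_env)

# Case 2: the caller already exported a value, which is left untouched.
preset_env = {"OMP_NUM_THREADS": "8"}
effective_preset = pin_thread_count(preset_env)

print(effective_unset, effective_preset)
```

If the intent is to force one thread per worker regardless of the caller's environment, a plain assignment would be needed instead. Whether that is desirable here is a design choice the patch leaves to the user's environment.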
--- a/plot_xiaoqu.py
+++ b/plot_xiaoqu.py
@@ -8,6 +8,8 @@
 #################################################
 import numpy                               ## numpy for array operations
+import matplotlib
+matplotlib.use('Agg')                      ## use non-interactive backend for multiprocessing safety
 import matplotlib.pyplot    as     plt     ## matplotlib for plotting
 from   mpl_toolkits.mplot3d import Axes3D  ## needed for 3D plots
 import glob
@@ -15,6 +17,9 @@ import os                                  ## operating system utilities
 import plot_binary_data
 import AMSS_NCKU_Input as input_data
+import subprocess
+import sys
+import multiprocessing
 # plt.rcParams['text.usetex'] = True  ## enable LaTeX fonts in plots
@@ -50,10 +55,40 @@ def generate_binary_data_plot( binary_outdir, figure_outdir ):
        file_list.append(x)
        print(x)
-    ## Plot each file in the list
+    ## Plot each file in parallel using subprocesses.
+    ## Each subprocess is a fresh Python process where the BLAS thread-count
+    ## environment variables (set at the top of plot_binary_data.py) take
+    ## effect before numpy is imported.  This avoids the thread explosion
+    ## that occurs when multiprocessing.Pool with 'fork' context inherits
+    ## already-initialized multi-threaded BLAS from the parent.
+    script = os.path.join( os.path.dirname(__file__), "plot_binary_data.py" )
+    max_workers = min( multiprocessing.cpu_count(), len(file_list) ) if file_list else 0
+    running = []
+    failed  = []
    for filename in file_list:
        print(filename)
-        plot_binary_data.plot_binary_data(filename, binary_outdir, figure_outdir)
+        proc = subprocess.Popen(
+            [sys.executable, script, filename, binary_outdir, figure_outdir],
+        )
+        running.append( (proc, filename) )
+        ## Keep at most max_workers subprocesses active at a time
+        if len(running) >= max_workers:
+            p, fn = running.pop(0)
+            p.wait()
+            if p.returncode != 0:
+                failed.append(fn)
+    ## Wait for all remaining subprocesses to finish
+    for p, fn in running:
+        p.wait()
+        if p.returncode != 0:
+            failed.append(fn)
+    if failed:
+        print( " WARNING: the following binary data plots failed:" )
+        for fn in failed:
+            print( "   ", fn )
    print(                        )
    print( " Binary Data Plot Has been Finished " )
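The bounded-subprocess pattern in the hunk above can be sketched in isolation as follows. This is a minimal illustration, not the project's code: the helper name `run_bounded` is hypothetical, and passing `OMP_NUM_THREADS=1` through the child's environment is an assumption made here for self-containment (the diff instead relies on `plot_binary_data.py` setting the thread-count variables itself before importing numpy).

```python
import os
import subprocess
import sys
from collections import deque


def run_bounded(commands, max_workers):
    """Run each command as a fresh subprocess, keeping at most
    max_workers alive at once. Returns the commands that exited non-zero.

    Hypothetical helper for illustration only."""
    max_workers = max(1, max_workers)            # guard against a zero cap
    # Assumption: cap BLAS/OpenMP threads via the child's environment.
    env = dict(os.environ, OMP_NUM_THREADS="1")
    running = deque()                            # (process, command) in launch order
    failed = []
    for cmd in commands:
        # If the cap is reached, wait for the oldest child before launching.
        while len(running) >= max_workers:
            proc, started = running.popleft()
            if proc.wait() != 0:
                failed.append(started)
        running.append((subprocess.Popen(cmd, env=env), cmd))
    # Drain whatever is still running.
    for proc, started in running:
        if proc.wait() != 0:
            failed.append(started)
    return failed


if __name__ == "__main__":
    ok = [sys.executable, "-c", "import sys; sys.exit(0)"]
    bad = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_bounded([ok, bad, ok], max_workers=2))
```

Like the loop in the diff, this waits on the oldest child rather than on whichever finishes first, so one slow plot can briefly hold a slot idle; that keeps the bookkeeping simple at the cost of some throughput.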
Author	SHA1	Message	Date
ianchb	cc06e30404	Apply async Sync optimization to Z4c_class using Sync_start/finish pattern Replaces blocking Parallel::Sync + MPI_Allreduce in Z4c_class Step() with non-blocking MPI_Iallreduce overlapped with Sync_start/Sync_finish, matching the pattern already used in bssn_class on cjy-oneapi-opus-hotfix. Covers both ABEtype==2 and CPBC variants (predictor + corrector = 4 call sites). Cherry-picked optimization from `afd4006`, adapted to SyncCache infrastructure instead of the separate SyncPlan API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-20 09:58:26 +08:00
ianchb	25c79dc7cd	Merge lopsided advection + kodis dissipation to share symmetry_bd buffer Cherry-picked from `38c2c30`.	2026-02-20 09:57:51 +08:00
ianchb	a725d34dd3	Don't hardcode pgo profile path	2026-02-20 08:48:25 +08:00
gh0s7	2791d2e225	Merge pull request 'PGO updated' (#1) from cjy-oneapi-opus-hotfix into main Reviewed-on: #1	2026-02-11 19:17:35 +08:00
CGH0S7	72ce153e48	Merge cjy-oneapi-opus-hotfix into main	2026-02-11 19:15:12 +08:00
CGH0S7	5b7e05cd32	PGO updated	2026-02-11 18:26:30 +08:00
CGH0S7	85afe00fc5	Merge plotting optimizations from chb-copilot-test - Implement multiprocessing-based parallel plotting - Add parallel_plot_helper.py for concurrent plot task execution - Use matplotlib 'Agg' backend for multiprocessing safety - Set OMP_NUM_THREADS=1 to prevent BLAS thread explosion - Use subprocess for binary data plots to avoid thread conflicts - Add fork bomb protection in main program This merge only includes plotting improvements and excludes MPI communication changes to preserve existing optimizations. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-11 16:19:17 +08:00
CGH0S7	5c1790277b	Replace nested OutBdLow2Hi loops with batch calls in RestrictProlong The 8 nested while(Ppc){while(Pp){OutBdLow2Hi(single,single,...)}} loops across RestrictProlong (3 overloads) and ProlongRestrict each produced N_c × N_f separate transfer() → MPI_Waitall barriers. Replace with the existing batch OutBdLow2Hi(MyList<Patch>*,...) which merges all patch pairs into a single transfer() call with 1 MPI_Waitall. Also add Restrict_cached, OutBdLow2Hi_cached, OutBdLow2Himix_cached to Parallel (unused for now — kept as infrastructure for future use). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-02-11 16:09:08 +08:00
CGH0S7	e09ae438a2	Cache data_packer lengths in Sync_start to skip redundant buffer-size traversals The data_packer(NULL, ...) calls that compute send/recv buffer lengths traverse all grid segments × variables × nprocs on every Sync_start invocation, even though lengths never change once the cache is built. Add a lengths_valid flag to SyncCache so these length computations are done once and reused on subsequent calls (4× per RK4 step). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-02-10 21:39:22 +08:00
CGH0S7	d06d5b4db8	Add targeted point-to-point Interp_Points overload for surface_integral Instead of broadcasting all interpolated point data to every MPI rank, the new overload sends each point only to the one rank that needs it for integration, reducing communication volume by ~nprocs times. The consumer rank is computed deterministically using the same Nmin/Nmax work distribution formula used by surface_integral callers. Two active call sites (surf_Wave and surf_MassPAng with MPI_COMM_WORLD) now use the new overload. Other callers (ShellPatch, Comm_here variants, etc.) remain unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-02-10 19:18:56 +08:00
CGH0S7	50e2a845f8	Replace MPI_Allreduce with owner-rank MPI_Bcast in Patch::Interp_Points The two MPI_Allreduce calls (data + weight) were the #1 hotspot at 38.5% CPU time. Since all ranks traverse the same block list and agree on point ownership, we replace the global reduction with targeted MPI_Bcast from each owner rank. This also eliminates the weight array/Allreduce entirely, removes redundant heap allocations (shellf, weight, DH, llb, uub), and writes interpolation results directly into the output buffer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 22:39:18 +08:00
CGH0S7	738498cb28	Optimize MPI communication in RestrictProlong and surface_integral Cache Sync in RestrictProlong: replace 11 basic Parallel::Sync() calls with Parallel::Sync_cached() across RestrictProlong, RestrictProlong_aux, and ProlongRestrict to avoid rebuilding grid segment lists every call. Merge paired MPI_Allreduce in surface_integral: combine 9 pairs of consecutive RP/IP Allreduce calls into single calls with count=2*NN. Merge scalar MPI_Allreduce in surf_MassPAng: combine 3 groups of 7 scalar Allreduce calls (mass + angular/linear momentum) into single calls with count=7. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 22:07:12 +08:00
CGH0S7	42b9cf1ad9	Optimize MPI Sync with merged transfers, caching, and async overlap Phase 1: Merge N+1 transfer() calls into a single transfer() per Sync(PatchList), reducing N+1 MPI_Waitall barriers to 1 via new Sync_merged() that collects all intra-patch and inter-patch grid segment lists into combined per-rank arrays. Phase 2: Cache grid segment lists and reuse grow-only communication buffers across RK4 substeps via SyncCache struct. Caches are per-level and per-variable-list (predictor/corrector), invalidated on regrid. Eliminates redundant build_ghost_gsl/build_owned_gsl0/build_gstl rebuilds and malloc/free cycles between regrids. Phase 3: Split Sync into async Sync_start/Sync_finish to overlap Cartesian ghost zone exchange (MPI_Isend/Irecv) with Shell patch synchronization. Uses MPI tag 2 to avoid conflicts with SH->Synch() which uses transfer() with tag 1. Also updates makefile.inc paths and flags for local build environment. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 21:03:37 +08:00
CGH0S7	e9d321fd00	Convert MPI_Allreduce error checks to non-blocking MPI_Iallreduce overlapped with Sync Replace all 8 blocking MPI_Allreduce error-check calls with MPI_Iallreduce, overlapping the reduction with subsequent Parallel::Sync/SH->Synch operations. MPI_Wait is called after Sync completes to retrieve the error result. This hides the Allreduce latency (46.5% of CPU time) behind the ghost zone exchange communication that must happen anyway. Safe because Sync only copies existing data to ghost zones and the error check + abort happens before any further computation uses the synced data. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 12:39:29 +08:00
CGH0S7	ed1d86ade9	Merge paired MPI_Allreduce error checks to reduce global sync barriers In the two Step() functions that handle both Patch and Shell Patch, defer the Patch error check until after Shell Patch computation completes, then perform a single combined MPI_Allreduce instead of two separate ones. This eliminates 4 MPI_Allreduce calls per timestep (2 per Step function, Predictor + Corrector phases each). The optimization is mathematically equivalent: in normal execution (no NaN) behavior is identical; on error, both Patch and Shell data are dumped before MPI_Abort. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 12:12:16 +08:00
CGH0S7	471baa5065	PGO supported	2026-02-09 10:59:26 +08:00
CGH0S7	4bb6c03013	makefile setting updated	2026-02-08 16:14:43 +08:00
ianchb	b8e41b2b39	Only enable OpenMP for TwoPunctures	2026-02-08 13:00:37 +08:00
ianchb	133e4f13a2	Use OpenMP's parallel for with schedule(dynamic,1)	2026-02-07 19:48:24 +08:00
ianchb	914c4f4791	Optimize memory allocation in JFD_times_dv This should reduce the pressure on the memory allocator, indirectly improving caching behavior. Co-authored-by: copilot-swe-agent[bot] <198982749+copilot@users.noreply.github.com>	2026-02-07 15:55:45 +08:00
CGH0S7	79af79d471	baseline updated	2026-02-05 19:53:55 +08:00