Add OpenMP parallelization to BSSN RHS hot-path stencil routines

Enable OpenMP threading for the dominant computational kernels:
- makefile.inc: add -qopenmp to f90appflags
- diff_new.f90: split fderivs/fdderivs into OpenMP interior + serial boundary
- kodiss.f90: split kodis into OpenMP interior + serial boundary
- lopsidediff.f90: add OMP PARALLEL DO COLLAPSE(2) to lopsided
- fmisc.f90: parallelize symmetry_bd bulk array copy
- bssn_rhs.f90: add OMP WORKSHARE to array-syntax operations

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
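The "OpenMP interior + serial boundary" split named in the commit message can be illustrated with a minimal C++ sketch. This is a 1D analogue written for illustration only, not the Fortran code from diff_new.f90 or kodiss.f90 (which is not shown in this diff); the function name first_derivative and the second-order stencils are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1D analogue of the interior/boundary split (not the fderivs code).
// Second-order centered differences inside, one-sided stencils at the two ends.
std::vector<double> first_derivative(const std::vector<double> &f, double h)
{
  const long n = static_cast<long>(f.size());
  assert(n >= 3);
  std::vector<double> df(n);

  // Interior: every iteration writes only df[i] and reads f, so the
  // iterations are independent and can be shared among threads.
#pragma omp parallel for
  for (long i = 1; i < n - 1; ++i)
    df[i] = (f[i + 1] - f[i - 1]) / (2.0 * h);

  // Boundary: only two points with special-case stencils, kept serial.
  df[0] = (-3.0 * f[0] + 4.0 * f[1] - f[2]) / (2.0 * h);
  df[n - 1] = (3.0 * f[n - 1] - 4.0 * f[n - 2] + f[n - 3]) / (2.0 * h);
  return df;
}
```

If the file is compiled without OpenMP support, the pragma is ignored and the interior loop runs serially with the same result, which makes it easy to compare threaded and unthreaded output.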
Eliminate hot-path heap allocations in TwoPunctures spectral solver
2026-02-07 13:58:55 +08:00 · 2026-02-06 21:20:35 +08:00 · 2026-02-06 19:00:35 +08:00 · 2026-02-06 18:36:29 +08:00 · 2026-02-06 17:13:39 +08:00 · 2026-02-06 13:57:48 +08:00
16 changed files with 1696 additions and 2347 deletions
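The allocation change in TwoPunctures.C follows a common pattern: allocate scratch buffers once in the constructor and reuse them on every call, instead of allocating and freeing inside the hot path. A minimal sketch of that pattern, assuming a hypothetical TridiagSolver class that is not part of TwoPunctures.C, and using std::vector members where the diff below uses raw new[] and dvector buffers:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical class for illustration; names are not taken from TwoPunctures.C.
class TridiagSolver
{
public:
  // Scratch buffers are allocated once here instead of on every solve() call.
  explicit TridiagSolver(std::size_t n) : n_(n), ws_c_(n), ws_d_(n)
  {
    assert(n >= 2);
  }

  // Thomas algorithm for a tridiagonal system A x = rhs.
  // lower and upper hold n-1 entries; diag, rhs and x hold n entries.
  // No pivoting: assumes the elimination never produces a zero pivot.
  void solve(const double *lower, const double *diag, const double *upper,
             const double *rhs, double *x)
  {
    double *c = ws_c_.data(); // modified super-diagonal
    double *d = ws_d_.data(); // modified right-hand side

    // Forward elimination.
    c[0] = upper[0] / diag[0];
    d[0] = rhs[0] / diag[0];
    for (std::size_t i = 1; i < n_; ++i)
    {
      const double m = diag[i] - lower[i - 1] * c[i - 1];
      if (i < n_ - 1)
        c[i] = upper[i] / m;
      d[i] = (rhs[i] - lower[i - 1] * d[i - 1]) / m;
    }

    // Back substitution.
    x[n_ - 1] = d[n_ - 1];
    for (std::size_t i = n_ - 1; i-- > 0;)
      x[i] = d[i] - c[i] * x[i + 1];
  }

private:
  std::size_t n_;
  std::vector<double> ws_c_, ws_d_; // reused workspace, never reallocated
};
```

One consequence of this design is that the workspace is shared state: two concurrent solve() calls on the same object would overwrite each other's scratch data, so an object like this should be used by one thread at a time.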
--- a/.gitignore
+++ b/.gitignore
@@ -1,7 +1,6 @@
 __pycache__
 GW150914
 GW150914-origin
 GW150914-mini
 docs
 *.tmp
--- a/AMSS_NCKU_Input.py
+++ b/AMSS_NCKU_Input.py
@@ -16,14 +16,12 @@ import numpy
 File_directory   = "GW150914"                    ## output file directory
 Output_directory = "binary_output"               ## binary data file directory
                                                 ## The file directory name should not be too long
-MPI_processes    = 8                             ## number of mpi processes used in the simulation
+MPI_processes    = 64                             ## number of mpi processes used in the simulation
 GPU_Calculation  = "no"                          ## Use GPU or not 
                                                 ## (prefer "no" in the current version, because the GPU part may have bugs when integrated in this Python interface)
 CPU_Part         = 1.0
 GPU_Part         = 0.0
 Debug_NaN_Check          = 0                       ## enable NaN checks in compute_rhs_bssn: 0 (off) or 1 (on)
 #################################################
--- a/AMSS_NCKU_Input_Mini.py
+++ b/AMSS_NCKU_Input_Mini.py
@@ -1,233 +0,0 @@
 #################################################
 ##
 ## This file provides the input parameters required for numerical relativity.
 ## XIAOQU
 ## 2024/03/19 --- 2025/09/14
 ## Modified for GW150914-mini test case
 ##
 #################################################
 import numpy    
 #################################################
 ## Setting MPI processes and the output file directory
 File_directory   = "GW150914-mini"               ## output file directory
 Output_directory = "binary_output"               ## binary data file directory
                                                 ## The file directory name should not be too long
 MPI_processes    = 4                             ## number of mpi processes used in the simulation (Reduced for laptop)
 GPU_Calculation  = "no"                          ## Use GPU or not 
                                                 ## (prefer "no" in the current version, because the GPU part may have bugs when integrated in this Python interface)
 CPU_Part         = 1.0
 GPU_Part         = 0.0
 #################################################
 #################################################
 ## Setting the physical system and numerical method
 Symmetry                 = "equatorial-symmetry"   ## Symmetry of System: choose equatorial-symmetry、no-symmetry、octant-symmetry
 Equation_Class           = "BSSN"                  ## Evolution Equation: choose "BSSN", "BSSN-EScalar", "BSSN-EM", "Z4C" 
                                                   ## If "BSSN-EScalar" is chosen, it is necessary to set other parameters below
 Initial_Data_Method      = "Ansorg-TwoPuncture"    ## initial data method: choose "Ansorg-TwoPuncture", "Lousto-Analytical", "Cao-Analytical", "KerrSchild-Analytical"
 Time_Evolution_Method    = "runge-kutta-45"        ## time evolution method: choose "runge-kutta-45"
 Finite_Diffenence_Method = "4th-order"             ## finite-difference method: choose "2nd-order", "4th-order", "6th-order", "8th-order"
 Debug_NaN_Check          = 0                       ## enable NaN checks in compute_rhs_bssn: 0 (off) or 1 (on)
 #################################################
 #################################################
 ## Setting the time evolutionary information
 Start_Evolution_Time     = 0.0                    ## start evolution time t0
 Final_Evolution_Time     = 100.0                  ## final evolution time t1 (Reduced for quick test)
 Check_Time               = 10.0
 Dump_Time                = 10.0                   ## time inteval dT for dumping binary data
 D2_Dump_Time             = 10.0                   ## dump the ascii data for 2d surface after dT'
 Analysis_Time            = 1.0                    ## dump the puncture position and GW psi4 after dT"
 Evolution_Step_Number    = 10000000               ## stop the calculation after the maximal step number
 Courant_Factor           = 0.5                    ## Courant Factor
 Dissipation              = 0.15                   ## Kreiss-Oliger Dissipation Strength
 #################################################
 #################################################
 ## Setting the grid structure
 basic_grid_set    = "Patch"                          ## grid structure: choose "Patch" or "Shell-Patch"
 grid_center_set   = "Cell"                           ## grid center: chose "Cell" or "Vertex"
 grid_level        = 7                                ## total number of AMR grid levels (Reduced from 9)
 static_grid_level = 4                                ## number of AMR static grid levels (Reduced from 5)
 moving_grid_level = grid_level - static_grid_level   ## number of AMR moving grid levels
 analysis_level    = 0
 refinement_level  = 3                                ## time refinement start from this grid level
 largest_box_xyz_max = [320.0, 320.0, 320.0]          ## scale of the largest box
                                                     ## not necessary to be cubic for "Patch" grid structure
                                                     ## need to be a cubic box for "Shell-Patch" grid structure
 largest_box_xyz_min = - numpy.array(largest_box_xyz_max)  
 static_grid_number = 48                              ## grid points of each static AMR grid (in x direction) (Reduced from 96)
                                                     ## (grid points in y and z directions are automatically adjusted)
 moving_grid_number = 24                              ## grid points of each moving AMR grid (Reduced from 48)
 shell_grid_number  = [32, 32, 100]                   ## grid points of Shell-Patch grid
                                                     ## in (phi, theta, r) direction
 devide_factor      = 2.0                             ## resolution between different grid levels dh0/dh1, only support 2.0 now
 static_grid_type   = 'Linear'                        ## AMR static grid structure , only supports "Linear"
 moving_grid_type   = 'Linear'                        ## AMR moving grid structure , only supports "Linear"
 quarter_sphere_number = 48                           ## grid number of 1/4 spherical surface (Reduced from 96)
                                                     ## (which is needed for evaluating the spherical surface integral)
 #################################################
 #################################################
 ## Setting the puncture information
 puncture_number       = 2                                     
 position_BH           = numpy.zeros( (puncture_number, 3) )   
 parameter_BH          = numpy.zeros( (puncture_number, 3) )   
 dimensionless_spin_BH = numpy.zeros( (puncture_number, 3) )   
 momentum_BH           = numpy.zeros( (puncture_number, 3) )   
 puncture_data_set     = "Manually"                       ## Method to give Puncture’s positions and momentum
                                                         ## choose "Manually" or "Automatically-BBH"
                                                         ## Prefer to choose "Manually", because "Automatically-BBH" is developing now
 ## initial orbital distance and ellipticity for BBHs system
 ## ( needed for "Automatically-BBH" case , not affect the "Manually" case )
 Distance = 10.0
 e0       = 0.0
 ## black hole parameter (M Q* a*)
 parameter_BH[0] = [ 36.0/(36.0+29.0),  0.0,  +0.31 ]   
 parameter_BH[1] = [ 29.0/(36.0+29.0),  0.0,  -0.46 ]  
 ## dimensionless spin in each direction
 dimensionless_spin_BH[0] = [ 0.0,  0.0,  +0.31 ]   
 dimensionless_spin_BH[1] = [ 0.0,  0.0,  -0.46 ]  
 ## use Brugmann's convention
 ##  -----0-----> y
 ##   -      +     
 #---------------------------------------------
 ## If puncture_data_set is chosen to be "Manually", it is necessary to set the position and momentum of each puncture manually
 ## initial position for each puncture
 position_BH[0]  = [  0.0,  10.0*29.0/(36.0+29.0), 0.0 ]  
 position_BH[1]  = [  0.0, -10.0*36.0/(36.0+29.0), 0.0 ] 
 ## initial mumentum for each puncture
 ## (needed for "Manually" case, does not affect the "Automatically-BBH" case)
 momentum_BH[0]  = [ -0.09530152296974252,  -0.00084541526517121,   0.0 ]
 momentum_BH[1]  = [ +0.09530152296974252,  +0.00084541526517121,   0.0 ]
 #################################################
 #################################################
 ## Setting the gravitational wave information
 GW_L_max        = 4                      ## maximal L number in gravitational wave
 GW_M_max        = 4                      ## maximal M number in gravitational wave
 Detector_Number = 12                     ## number of dector
 Detector_Rmin   = 50.0                   ## nearest dector distance
 Detector_Rmax   = 160.0                  ## farest dector distance
 #################################################
 #################################################
 ## Setting the apprent horizon
 AHF_Find       = "no"                    ## whether to find the apparent horizon: choose "yes" or "no"
 AHF_Find_Every = 24
 AHF_Dump_Time  = 20.0
 #################################################
 #################################################
 ## Other parameters (testing)
 ## Only influence the Equation_Class = "BSSN-EScalar" case
 FR_a2     = 3.0        ## f(R) = R + a2 * R^2    
 FR_l2     = 10000.0
 FR_phi0   = 0.00005
 FR_r0     = 120.0
 FR_sigma0 = 8.0
 FR_Choice = 2          ## Choice options: 1 2 3 4 5
                       ## 1: phi(r) = phi0 * Exp(-(r-r0)**2/sigma0)   
                       ##    V(r)   = 0
                       ## 2: phi(r) =  phi0 * a2^2/(1+a2^2)  
                       ##    V(r)   = Exp(-8*Sqrt(PI/3)*phi(r)) * (1-Exp(4*Sqrt(PI/3)*phi(r)))**2 / (32*PI*a2)
                       ## 3: Schrodinger-Newton gived by system phi(r) 
                       ##    V(r)   = Exp(-8*Sqrt(PI/3)*phi(r)) * (1-Exp(4*Sqrt(PI/3)*phi(r)))**2 / (32*PI*a2)
                       ## 4: phi(r) = phi0 * 0.5 * ( tanh((r+r0)/sigma0) - tanh((r-r0)/sigma0) )  
                       ##    V(r)   = 0
                       ##    f(R)   = R + a2*R^2  with a2 = +oo
                       ## 5: phi(r) = phi0 * Exp(-(r-r0)**2/sigma)   
                       ##    V(r)   = 0
 #################################################
 #################################################
 ## Other parameters (testing)
 ## (please do not change if not necessary)
 boundary_choice = "BAM-choice"     ## Sommerfeld boundary condition : choose "BAM-choice" or "Shibata-choice" 
                                   ## prefer "BAM-choice"
 gauge_choice  = 0                  ## gauge choice
                                   ## 0: B^i gauge
                                   ## 1: David's puncture gauge
                                   ## 2: MB B^i gauge               
                                   ## 3: RIT B^i gauge
                                   ## 4: MB beta gauge 
                                   ## 5: RIT beta gauge 
                                   ## 6: MGB1 B^i gauge
                                   ## 7: MGB2 B^i gauge
                                   ## prefer 0 or 1
 tetrad_type  = 2                   ## tetradtype 
                                   ##  v:r; u: phi; w: theta
                                   ##      v^a = (x,y,z)
                                   ## 0: orthonormal order: v,u,w
                                   ##    v^a = (x,y,z)   
                                   ##    m = (phi - i theta)/sqrt(2) 
                                   ##    following Frans, Eq.(8) of  PRD 75, 124018(2007)
                                   ## 1: orthonormal order: w,u,v
                                   ##    m = (theta + i phi)/sqrt(2) 
                                   ##    following Sperhake, Eq.(3.2) of  PRD 85, 124062(2012)    
                                   ## 2: orthonormal order: v,u,w
                                   ##    v_a = (x,y,z)
                                   ##    m = (phi - i theta)/sqrt(2) 
                                   ##    following Frans, Eq.(8) of  PRD 75, 124018(2007)
                                   ## this version recommend set to 2
                                   ## prefer 2
 #################################################
--- a/AMSS_NCKU_MiniProgram.py
+++ b/AMSS_NCKU_MiniProgram.py
@@ -1,224 +0,0 @@
 ##################################################################
 ##
 ## AMSS-NCKU Numerical Relativity Mini Test Program
 ## Author: Assistant (based on Xiaoqu's code)
 ## 2026/01/20
 ##
 ## This script runs a scaled-down version of the GW150914 test case
 ## suitable for laptop testing.
 ##
 ##################################################################
 import os
 import shutil
 import sys
 import time
 # --- Context Manager for Input File Swapping ---
 class InputFileSwapper:
    def __init__(self, mini_file="AMSS_NCKU_Input_Mini.py", target_file="AMSS_NCKU_Input.py"):
        self.mini_file = mini_file
        self.target_file = target_file
        self.backup_file = target_file + ".bak"
        self.swapped = False
    def __enter__(self):
        print(f"[MiniProgram] Swapping {self.target_file} with {self.mini_file}...")
        if os.path.exists(self.target_file):
            shutil.move(self.target_file, self.backup_file)
        shutil.copy(self.mini_file, self.target_file)
        self.swapped = True
        return self
    def __exit__(self, exc_type, exc_value, traceback):
        if self.swapped:
            print(f"[MiniProgram] Restoring original {self.target_file}...")
            os.remove(self.target_file)
            if os.path.exists(self.backup_file):
                shutil.move(self.backup_file, self.target_file)
 def main():
    # Use the swapper to ensure all imported modules see the mini configuration
    with InputFileSwapper():
        # Import modules AFTER swapping input file
        try:
            import AMSS_NCKU_Input as input_data
            import print_information
            import setup
            import numerical_grid
            import generate_macrodef
            import makefile_and_run
            import generate_TwoPuncture_input
            import renew_puncture_parameter
            import plot_xiaoqu
            import plot_GW_strain_amplitude_xiaoqu
        except ImportError as e:
            print(f"Error importing modules: {e}")
            return
        print_information.print_program_introduction()
        print("\n" + "#"*60)
        print(" RUNNING MINI TEST CASE: GW150914-mini")
        print("#"*60 + "\n")
        # --- Directory Setup ---
        File_directory = os.path.join(input_data.File_directory)
        if os.path.exists(File_directory):
            print(f" Output directory '{File_directory}' exists. Removing for mini test...")
            shutil.rmtree(File_directory, ignore_errors=True)
        os.mkdir(File_directory)
        shutil.copy("AMSS_NCKU_Input.py", File_directory) # Copies the current (mini) input
        output_directory = os.path.join(File_directory, "AMSS_NCKU_output")
        os.mkdir(output_directory)
        binary_results_directory = os.path.join(output_directory, input_data.Output_directory)
        os.mkdir(binary_results_directory)
        figure_directory = os.path.join(File_directory, "figure")
        os.mkdir(figure_directory)
        print(" Output directories generated.\n")
        # --- Setup and Input Generation ---
        setup.print_input_data(File_directory)
        setup.generate_AMSSNCKU_input()
        setup.print_puncture_information()
        print("\n Generating AMSS-NCKU input parfile...")
        numerical_grid.append_AMSSNCKU_cgh_input()
        print("\n Plotting initial grid...")
        numerical_grid.plot_initial_grid()
        print("\n Generating macro files...")
        generate_macrodef.generate_macrodef_h()
        generate_macrodef.generate_macrodef_fh()
        # --- Compilation Preparation ---
        print("\n Preparing to compile and run...")
        AMSS_NCKU_source_path = "AMSS_NCKU_source"
        AMSS_NCKU_source_copy = os.path.join(File_directory, "AMSS_NCKU_source_copy")
        if not os.path.exists(AMSS_NCKU_source_path):
             print(" Error: AMSS_NCKU_source not found! Please run in the project root.")
             return
        shutil.copytree(AMSS_NCKU_source_path, AMSS_NCKU_source_copy)
        macrodef_h_path  = os.path.join(File_directory, "macrodef.h") 
        macrodef_fh_path = os.path.join(File_directory, "macrodef.fh") 
        shutil.copy2(macrodef_h_path,  AMSS_NCKU_source_copy)
        shutil.copy2(macrodef_fh_path, AMSS_NCKU_source_copy)
        # --- Compilation ---
        cwd = os.getcwd()
        os.chdir(AMSS_NCKU_source_copy)
        print(" Compiling ABE...")
        makefile_and_run.makefile_ABE()
        if (input_data.Initial_Data_Method == "Ansorg-TwoPuncture" ): 
            print(" Compiling TwoPunctureABE...")
            makefile_and_run.makefile_TwoPunctureABE()
        os.chdir(cwd)
        # --- Copy Executables ---
        if (input_data.GPU_Calculation == "no"):
            ABE_file = os.path.join(AMSS_NCKU_source_copy, "ABE")
        else:
            ABE_file = os.path.join(AMSS_NCKU_source_copy, "ABEGPU")
        if not os.path.exists(ABE_file):
            print(" Error: ABE executable compilation failed.")
            return
        shutil.copy2(ABE_file, output_directory)
        TwoPuncture_file = os.path.join(AMSS_NCKU_source_copy, "TwoPunctureABE")
        if (input_data.Initial_Data_Method == "Ansorg-TwoPuncture" ):
            if not os.path.exists(TwoPuncture_file):
                print(" Error: TwoPunctureABE compilation failed.")
                return
            shutil.copy2(TwoPuncture_file, output_directory)
        # --- Execution ---
        start_time = time.time()
        if (input_data.Initial_Data_Method == "Ansorg-TwoPuncture" ):
             print("\n Generating TwoPuncture input...")
             generate_TwoPuncture_input.generate_AMSSNCKU_TwoPuncture_input()
             AMSS_NCKU_TwoPuncture_inputfile = 'AMSS-NCKU-TwoPuncture.input'
             AMSS_NCKU_TwoPuncture_inputfile_path = os.path.join( File_directory, AMSS_NCKU_TwoPuncture_inputfile )
             shutil.copy2( AMSS_NCKU_TwoPuncture_inputfile_path, os.path.join(output_directory, 'TwoPunctureinput.par') )
             print(" Running TwoPunctureABE...")
             os.chdir(output_directory)
             makefile_and_run.run_TwoPunctureABE()
             os.chdir(cwd)
        # Update Puncture Parameter
        renew_puncture_parameter.append_AMSSNCKU_BSSN_input(File_directory, output_directory)
        AMSS_NCKU_inputfile = 'AMSS-NCKU.input'
        AMSS_NCKU_inputfile_path = os.path.join(File_directory, AMSS_NCKU_inputfile)
        shutil.copy2( AMSS_NCKU_inputfile_path, os.path.join(output_directory, 'input.par') )
        print("\n Input files ready. Launching ABE...")
        os.chdir(output_directory)
        makefile_and_run.run_ABE()
        os.chdir(cwd)
        end_time = time.time()
        elapsed_time = end_time - start_time
        # --- Post-processing ---
        print("\n Copying output files for inspection...")
        AMSS_NCKU_error_file_path = os.path.join(binary_results_directory, "setting.par")
        if os.path.exists(AMSS_NCKU_error_file_path):
            shutil.copy( AMSS_NCKU_error_file_path, os.path.join(output_directory, "AMSSNCKU_setting_parameter") )
        AMSS_NCKU_error_file_path = os.path.join(binary_results_directory, "Error.log")
        if os.path.exists(AMSS_NCKU_error_file_path):
            shutil.copy( AMSS_NCKU_error_file_path, os.path.join(output_directory, "Error.log") )
        for fname in ["bssn_BH.dat", "bssn_ADMQs.dat", "bssn_psi4.dat", "bssn_constraint.dat"]:
            fpath = os.path.join(binary_results_directory, fname)
            if os.path.exists(fpath):
                shutil.copy(fpath, os.path.join(output_directory, fname))
        # --- Plotting ---
        print("\n Plotting results...")
        try:
            plot_xiaoqu.generate_puncture_orbit_plot(   binary_results_directory, figure_directory )
            plot_xiaoqu.generate_puncture_orbit_plot3D( binary_results_directory, figure_directory )
            plot_xiaoqu.generate_puncture_distence_plot( binary_results_directory, figure_directory )
            for i in range(input_data.Detector_Number):
                plot_xiaoqu.generate_gravitational_wave_psi4_plot( binary_results_directory, figure_directory, i )
                plot_GW_strain_amplitude_xiaoqu.generate_gravitational_wave_amplitude_plot( binary_results_directory, figure_directory, i )
            for i in range(input_data.Detector_Number):
                plot_xiaoqu.generate_ADMmass_plot( binary_results_directory, figure_directory, i )
            for i in range(input_data.grid_level):
                plot_xiaoqu.generate_constraint_check_plot( binary_results_directory, figure_directory, i )
            plot_xiaoqu.generate_binary_data_plot( binary_results_directory, figure_directory )
        except Exception as e:
            print(f"Warning: Plotting failed: {e}")
        print(f"\n Program Cost = {elapsed_time:.2f} Seconds \n")
        print(" AMSS-NCKU-Python simulation finished (Mini Test).\n")
 if __name__ == "__main__":
    main()
--- a/AMSS_NCKU_source/TwoPunctures.C
+++ b/AMSS_NCKU_source/TwoPunctures.C
@@ -5,6 +5,7 @@
 #include <cstdio>
 #include <cstdlib>
 #include <string>
+#include <cstring>
 #include <iostream>
 #include <iomanip>
 #include <fstream>
@@ -60,13 +61,110 @@ TwoPunctures::TwoPunctures(double mp, double mm, double b,
  F = dvector(0, ntotal - 1);
  allocate_derivs(&u, ntotal);
  allocate_derivs(&v, ntotal);
+  // Allocate workspace buffers for hot-path allocation elimination
+  int N = maximum3(n1, n2, n3);
+  int maxn = maximum2(n1, n2);
+  // LineRelax_be workspace (sized for n2)
+  ws_diag_be = new double[n2];
+  ws_e_be = new double[n2 - 1];
+  ws_f_be = new double[n2 - 1];
+  ws_b_be = new double[n2];
+  ws_x_be = new double[n2];
+  // LineRelax_al workspace (sized for n1)
+  ws_diag_al = new double[n1];
+  ws_e_al = new double[n1 - 1];
+  ws_f_al = new double[n1 - 1];
+  ws_b_al = new double[n1];
+  ws_x_al = new double[n1];
+  // ThomasAlgorithm workspace (sized for max(n1,n2))
+  ws_thomas_y = new double[maxn];
+  // JFD_times_dv workspace (sized for nvar)
+  ws_jfd_values = dvector(0, nvar - 1);
+  allocate_derivs(&ws_jfd_dU, nvar);
+  allocate_derivs(&ws_jfd_U, nvar);
+  // chebft_Zeros workspace (sized for N+1)
+  ws_cheb_c = dvector(0, N);
+  // fourft workspace (sized for N/2+1 each)
+  ws_four_a = dvector(0, N / 2);
+  ws_four_b = dvector(0, N / 2);
+  // Derivatives_AB3 workspace
+  ws_deriv_p = dvector(0, N);
+  ws_deriv_dp = dvector(0, N);
+  ws_deriv_d2p = dvector(0, N);
+  ws_deriv_q = dvector(0, N);
+  ws_deriv_dq = dvector(0, N);
+  ws_deriv_r = dvector(0, N);
+  ws_deriv_dr = dvector(0, N);
+  ws_deriv_indx = ivector(0, N);
+  // F_of_v workspace
+  ws_fov_sources = new double[n1 * n2 * n3];
+  ws_fov_values = dvector(0, nvar - 1);
+  allocate_derivs(&ws_fov_U, nvar);
+  // J_times_dv workspace
+  ws_jtdv_values = dvector(0, nvar - 1);
+  allocate_derivs(&ws_jtdv_dU, nvar);
+  allocate_derivs(&ws_jtdv_U, nvar);
 }
 TwoPunctures::~TwoPunctures()
 {
+  int const nvar = 1, n1 = npoints_A, n2 = npoints_B, n3 = npoints_phi;
+  int N = maximum3(n1, n2, n3);
  free_dvector(F, 0, ntotal - 1);
  free_derivs(&u, ntotal);
  free_derivs(&v, ntotal);
+  // Free workspace buffers
+  delete[] ws_diag_be;
+  delete[] ws_e_be;
+  delete[] ws_f_be;
+  delete[] ws_b_be;
+  delete[] ws_x_be;
+  delete[] ws_diag_al;
+  delete[] ws_e_al;
+  delete[] ws_f_al;
+  delete[] ws_b_al;
+  delete[] ws_x_al;
+  delete[] ws_thomas_y;
+  free_dvector(ws_jfd_values, 0, nvar - 1);
+  free_derivs(&ws_jfd_dU, nvar);
+  free_derivs(&ws_jfd_U, nvar);
+  free_dvector(ws_cheb_c, 0, N);
+  free_dvector(ws_four_a, 0, N / 2);
+  free_dvector(ws_four_b, 0, N / 2);
+  free_dvector(ws_deriv_p, 0, N);
+  free_dvector(ws_deriv_dp, 0, N);
+  free_dvector(ws_deriv_d2p, 0, N);
+  free_dvector(ws_deriv_q, 0, N);
+  free_dvector(ws_deriv_dq, 0, N);
+  free_dvector(ws_deriv_r, 0, N);
+  free_dvector(ws_deriv_dr, 0, N);
+  free_ivector(ws_deriv_indx, 0, N);
+  delete[] ws_fov_sources;
+  free_dvector(ws_fov_values, 0, nvar - 1);
+  free_derivs(&ws_fov_U, nvar);
+  free_dvector(ws_jtdv_values, 0, nvar - 1);
+  free_derivs(&ws_jtdv_dU, nvar);
+  free_derivs(&ws_jtdv_U, nvar);
 }
 void TwoPunctures::Solve()
@@ -655,7 +753,7 @@ void TwoPunctures::chebft_Zeros(double u[], int n, int inv)
  int k, j, isignum;
  double fac, sum, Pion, *c;
-  c = dvector(0, n);
+  c = ws_cheb_c;
  Pion = Pi / n;
  if (inv == 0)
  {
@@ -686,7 +784,6 @@ void TwoPunctures::chebft_Zeros(double u[], int n, int inv)
  }
  for (j = 0; j < n; j++)
    u[j] = c[j];
-  free_dvector(c, 0, n);
 }
 /* --------------------------------------------------------------------------*/
@@ -774,8 +871,8 @@ void TwoPunctures::fourft(double *u, int N, int inv)
  double x, x1, fac, Pi_fac, *a, *b;
  M = N / 2;
-  a = dvector(0, M);
+  a = ws_four_a;
-  b = dvector(1, M); /* Actually: b=vector(1,M-1) but this is problematic if M=1*/
+  b = ws_four_b - 1; /* offset to match dvector(1,M) indexing */
  fac = 1. / M;
  Pi_fac = Pi * fac;
  if (inv == 0)
@@ -824,8 +921,6 @@ void TwoPunctures::fourft(double *u, int N, int inv)
      iy = -iy;
    }
  }
-  free_dvector(a, 0, M);
-  free_dvector(b, 1, M);
 }
 /* -----------------------------------------*/
@@ -1118,14 +1213,14 @@ void TwoPunctures::Derivatives_AB3(int nvar, int n1, int n2, int n3, derivs v)
  double *p, *dp, *d2p, *q, *dq, *r, *dr;
  N = maximum3(n1, n2, n3);
-  p = dvector(0, N);
+  p = ws_deriv_p;
-  dp = dvector(0, N);
+  dp = ws_deriv_dp;
-  d2p = dvector(0, N);
+  d2p = ws_deriv_d2p;
-  q = dvector(0, N);
+  q = ws_deriv_q;
-  dq = dvector(0, N);
+  dq = ws_deriv_dq;
-  r = dvector(0, N);
+  r = ws_deriv_r;
-  dr = dvector(0, N);
+  dr = ws_deriv_dr;
-  indx = ivector(0, N);
+  indx = ws_deriv_indx;
  for (ivar = 0; ivar < nvar; ivar++)
  {
@@ -1208,14 +1303,6 @@ void TwoPunctures::Derivatives_AB3(int nvar, int n1, int n2, int n3, derivs v)
      }
    }
  }
-  free_dvector(p, 0, N);
-  free_dvector(dp, 0, N);
-  free_dvector(d2p, 0, N);
-  free_dvector(q, 0, N);
-  free_dvector(dq, 0, N);
-  free_dvector(r, 0, N);
-  free_dvector(dr, 0, N);
-  free_ivector(indx, 0, N);
 }
 /* --------------------------------------------------------------------------*/
 void TwoPunctures::Newton(int const nvar, int const n1, int const n2, int const n3,
@@ -1284,10 +1371,11 @@ void TwoPunctures::F_of_v(int nvar, int n1, int n2, int n3, derivs v, double *F,
  derivs U;
  double *sources;
-  values = dvector(0, nvar - 1);
+  values = ws_fov_values;
-  allocate_derivs(&U, nvar);
+  U = ws_fov_U;
-  sources = (double *)calloc(n1 * n2 * n3, sizeof(double));
+  sources = ws_fov_sources;
+  memset(sources, 0, n1 * n2 * n3 * sizeof(double));
  if (0)
  {
    double *s_x, *s_y, *s_z;
@@ -1442,9 +1530,6 @@ void TwoPunctures::F_of_v(int nvar, int n1, int n2, int n3, derivs v, double *F,
  {
    fclose(debugfile);
  }
-  free(sources);
-  free_dvector(values, 0, nvar - 1);
-  free_derivs(&U, nvar);
 }
 /* --------------------------------------------------------------------------*/
 double TwoPunctures::norm_inf(double const *F, int const ntotal)
@@ -1850,11 +1935,12 @@ void TwoPunctures::J_times_dv(int nvar, int n1, int n2, int n3, derivs dv, doubl
  Derivatives_AB3(nvar, n1, n2, n3, dv);
+  values = ws_jtdv_values;
+  dU = ws_jtdv_dU;
+  U = ws_jtdv_U;
  for (i = 0; i < n1; i++)
  {
-    values = dvector(0, nvar - 1);
-    allocate_derivs(&dU, nvar);
-    allocate_derivs(&U, nvar);
    for (j = 0; j < n2; j++)
    {
      for (k = 0; k < n3; k++)
@@ -1908,9 +1994,6 @@ void TwoPunctures::J_times_dv(int nvar, int n1, int n2, int n3, derivs dv, doubl
        }
      }
    }
-    free_dvector(values, 0, nvar - 1);
-    free_derivs(&dU, nvar);
-    free_derivs(&U, nvar);
  }
 }
 /* --------------------------------------------------------------------------*/
@@ -1957,17 +2040,11 @@ void TwoPunctures::LineRelax_be(double *dv,
 {
  int j, m, Ic, Ip, Im, col, ivar;
-  double *diag = new double[n2];
+  double *diag = ws_diag_be;
-  double *e = new double[n2 - 1]; /* above diagonal */
+  double *e = ws_e_be;     /* above diagonal */
-  double *f = new double[n2 - 1]; /* below diagonal */
+  double *f = ws_f_be;     /* below diagonal */
-  double *b = new double[n2];     /* rhs */
+  double *b = ws_b_be;     /* rhs */
-  double *x = new double[n2];     /* solution vector */
+  double *x = ws_x_be;     /* solution vector */
-  //  gsl_vector *diag = gsl_vector_alloc(n2);
-  //  gsl_vector *e = gsl_vector_alloc(n2-1); /* above diagonal */
-  //  gsl_vector *f = gsl_vector_alloc(n2-1); /* below diagonal */
-  //  gsl_vector *b = gsl_vector_alloc(n2);   /* rhs */
-  //  gsl_vector *x = gsl_vector_alloc(n2);   /* solution vector */
  for (ivar = 0; ivar < nvar; ivar++)
  {
@@ -1977,62 +2054,35 @@ void TwoPunctures::LineRelax_be(double *dv,
    }
    diag[n2 - 1] = 0;
-    //    gsl_vector_set_zero(diag);
-    //    gsl_vector_set_zero(e);
-    //    gsl_vector_set_zero(f);
    for (j = 0; j < n2; j++)
    {
      Ip = Index(ivar, i, j + 1, k, nvar, n1, n2, n3);
      Ic = Index(ivar, i, j, k, nvar, n1, n2, n3);
      Im = Index(ivar, i, j - 1, k, nvar, n1, n2, n3);
      b[j] = rhs[Ic];
-      //      gsl_vector_set(b,j,rhs[Ic]);
      for (m = 0; m < ncols[Ic]; m++)
      {
        col = cols[Ic][m];
        if (col != Ip && col != Ic && col != Im)
          b[j] -= JFD[Ic][m] * dv[col];
-        //          *gsl_vector_ptr(b, j) -= JFD[Ic][m] * dv[col];
        else
        {
          if (col == Im && j > 0)
            f[j - 1] = JFD[Ic][m];
-          //            gsl_vector_set(f,j-1,JFD[Ic][m]);
          if (col == Ic)
            diag[j] = JFD[Ic][m];
-          //            gsl_vector_set(diag,j,JFD[Ic][m]);
          if (col == Ip && j < n2 - 1)
            e[j] = JFD[Ic][m];
-          //            gsl_vector_set(e,j,JFD[Ic][m]);
        }
      }
    }
-    //          A x = b
-    //          A = ( d_0 e_0  0   0  )
-    //              ( f_0 d_1 e_1  0  )
-    //              (  0  f_1 d_2 e_2 )
-    //              (  0   0  f_2 d_3 )
-    //
    ThomasAlgorithm(n2, f, diag, e, x, b);
-    //    gsl_linalg_solve_tridiag(diag, e, f, b, x);
    for (j = 0; j < n2; j++)
    {
      Ic = Index(ivar, i, j, k, nvar, n1, n2, n3);
      dv[Ic] = x[j];
-      //      dv[Ic] = gsl_vector_get(x, j);
    }
  }
-  delete[] diag;
-  delete[] e;
-  delete[] f;
-  delete[] b;
-  delete[] x;
-  //  gsl_vector_free(diag);
-  //  gsl_vector_free(e);
-  //  gsl_vector_free(f);
-  //  gsl_vector_free(b);
-  //  gsl_vector_free(x);
 }
 /* --------------------------------------------------------------------------*/
 void TwoPunctures::JFD_times_dv(int i, int j, int k, int nvar, int n1, int n2,
@@ -2049,8 +2099,8 @@ void TwoPunctures::JFD_times_dv(int i, int j, int k, int nvar, int n1, int n2,
      ha, ga, ga2, hb, gb, gb2, hp, gp, gp2, gagb, gagp, gbgp;
  derivs dU, U;
-  allocate_derivs(&dU, nvar);
+  dU = ws_jfd_dU;
-  allocate_derivs(&U, nvar);
+  U = ws_jfd_U;
  if (k < 0)
    k = k + n3;
@@ -2168,9 +2218,6 @@ void TwoPunctures::JFD_times_dv(int i, int j, int k, int nvar, int n1, int n2,
  LinEquations(A, B, X, R, x, r, phi, y, z, dU, U, values);
  for (ivar = 0; ivar < nvar; ivar++)
    values[ivar] *= FAC;
-  free_derivs(&dU, nvar);
-  free_derivs(&U, nvar);
 }
 #undef FAC
 /*-----------------------------------------------------------*/
@@ -2202,17 +2249,11 @@ void TwoPunctures::LineRelax_al(double *dv,
 {
  int i, m, Ic, Ip, Im, col, ivar;
-  double *diag = new double[n1];
+  double *diag = ws_diag_al;
-  double *e = new double[n1 - 1]; /* above diagonal */
+  double *e = ws_e_al;     /* above diagonal */
-  double *f = new double[n1 - 1]; /* below diagonal */
+  double *f = ws_f_al;     /* below diagonal */
-  double *b = new double[n1];     /* rhs */
+  double *b = ws_b_al;     /* rhs */
-  double *x = new double[n1];     /* solution vector */
+  double *x = ws_x_al;     /* solution vector */
-  //  gsl_vector *diag = gsl_vector_alloc(n1);
-  //  gsl_vector *e = gsl_vector_alloc(n1-1); /* above diagonal */
-  //  gsl_vector *f = gsl_vector_alloc(n1-1); /* below diagonal */
-  //  gsl_vector *b = gsl_vector_alloc(n1);   /* rhs */
-  //  gsl_vector *x = gsl_vector_alloc(n1);   /* solution vector */
  for (ivar = 0; ivar < nvar; ivar++)
  {
@@ -2222,57 +2263,35 @@ void TwoPunctures::LineRelax_al(double *dv,
    }
    diag[n1 - 1] = 0;
-    //    gsl_vector_set_zero(diag);
-    //    gsl_vector_set_zero(e);
-    //    gsl_vector_set_zero(f);
    for (i = 0; i < n1; i++)
    {
      Ip = Index(ivar, i + 1, j, k, nvar, n1, n2, n3);
      Ic = Index(ivar, i, j, k, nvar, n1, n2, n3);
      Im = Index(ivar, i - 1, j, k, nvar, n1, n2, n3);
      b[i] = rhs[Ic];
-      //      gsl_vector_set(b,i,rhs[Ic]);
      for (m = 0; m < ncols[Ic]; m++)
      {
        col = cols[Ic][m];
        if (col != Ip && col != Ic && col != Im)
          b[i] -= JFD[Ic][m] * dv[col];
-        //          *gsl_vector_ptr(b, i) -= JFD[Ic][m] * dv[col];
        else
        {
          if (col == Im && i > 0)
            f[i - 1] = JFD[Ic][m];
-          //            gsl_vector_set(f,i-1,JFD[Ic][m]);
          if (col == Ic)
            diag[i] = JFD[Ic][m];
-          //            gsl_vector_set(diag,i,JFD[Ic][m]);
          if (col == Ip && i < n1 - 1)
            e[i] = JFD[Ic][m];
-          //            gsl_vector_set(e,i,JFD[Ic][m]);
        }
      }
    }
    ThomasAlgorithm(n1, f, diag, e, x, b);
-    //    gsl_linalg_solve_tridiag(diag, e, f, b, x);
    for (i = 0; i < n1; i++)
    {
      Ic = Index(ivar, i, j, k, nvar, n1, n2, n3);
      dv[Ic] = x[i];
-      //      dv[Ic] = gsl_vector_get(x, i);
    }
  }
-  delete[] diag;
-  delete[] e;
-  delete[] f;
-  delete[] b;
-  delete[] x;
-  //  gsl_vector_free(diag);
-  //  gsl_vector_free(e);
-  //  gsl_vector_free(f);
-  //  gsl_vector_free(b);
-  //  gsl_vector_free(x);
 }
 /* -------------------------------------------------------------------------*/
 // a[N], b[N-1], c[N-1], x[N], q[N]
@@ -2284,44 +2303,29 @@ void TwoPunctures::LineRelax_al(double *dv,
 //"Parallel Scientific Computing in C++ and MPI" P361
 void TwoPunctures::ThomasAlgorithm(int N, double *b, double *a, double *c, double *x, double *q)
 {
+  // In-place Thomas algorithm: uses a[] as d workspace, b[] as l workspace.
+  // c[] is already u (above-diagonal). ws_thomas_y is pre-allocated workspace.
  int i;
-  double *l, *u, *d, *y;
+  double *y = ws_thomas_y;
-  l = new double[N - 1];
-  u = new double[N - 1];
-  d = new double[N];
-  y = new double[N];
-  /* LU Decomposition */
-  d[0] = a[0];
-  u[0] = c[0];
+  /* LU Decomposition (in-place: a becomes d, b becomes l) */
  for (i = 0; i < N - 2; i++)
  {
-    l[i] = b[i] / d[i];
+    b[i] = b[i] / a[i];
-    d[i + 1] = a[i + 1] - l[i] * u[i];
+    a[i + 1] = a[i + 1] - b[i] * c[i];
-    u[i + 1] = c[i + 1];
  }
-
+  b[N - 2] = b[N - 2] / a[N - 2];
-  l[N - 2] = b[N - 2] / d[N - 2];
+  a[N - 1] = a[N - 1] - b[N - 2] * c[N - 2];
-  d[N - 1] = a[N - 1] - l[N - 2] * u[N - 2];
  /* Forward Substitution [L][y] = [q] */
  y[0] = q[0];
  for (i = 1; i < N; i++)
-    y[i] = q[i] - l[i - 1] * y[i - 1];
+    y[i] = q[i] - b[i - 1] * y[i - 1];
  /* Backward Substitution [U][x] = [y] */
-  x[N - 1] = y[N - 1] / d[N - 1];
+  x[N - 1] = y[N - 1] / a[N - 1];
  for (i = N - 2; i >= 0; i--)
-    x[i] = (y[i] - u[i] * x[i + 1]) / d[i];
+    x[i] = (y[i] - c[i] * x[i + 1]) / a[i];
-  delete[] l;
-  delete[] u;
-  delete[] d;
-  delete[] y;
  return;
 }
 // --------------------------------------------------------------------------*/
 // Calculates the value of v at an arbitrary position (x,y,z) if the spectral coefficients are know*/*/
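Review note on the ThomasAlgorithm hunk above: the in-place variant can be sketched as a free function to make the overwrite semantics explicit. This is a minimal sketch, not the member function from the patch. The name `thomas_in_place` and its parameter names are hypothetical, and the scratch line `y` is passed in by the caller instead of coming from `ws_thomas_y`. The two-stage loop in the patch is folded into a single loop here, which performs the same arithmetic.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical free-function version of an in-place tridiagonal solve.
// lower[N-1]: sub-diagonal, overwritten with the L multipliers.
// diag[N]:    main diagonal, overwritten with the U diagonal.
// upper[N-1]: super-diagonal, read only.
// rhs[N]:     right-hand side, read only.
// y[N]:       caller-provided scratch line (assumption: pre-allocated once).
// x[N]:       solution on return.
void thomas_in_place(int N, double *lower, double *diag, const double *upper,
                     double *x, const double *rhs, double *y)
{
  // LU decomposition, reusing lower[] and diag[] as storage
  for (int i = 0; i < N - 1; i++)
  {
    lower[i] = lower[i] / diag[i];
    diag[i + 1] = diag[i + 1] - lower[i] * upper[i];
  }
  // Forward substitution [L][y] = [rhs]
  y[0] = rhs[0];
  for (int i = 1; i < N; i++)
    y[i] = rhs[i] - lower[i - 1] * y[i - 1];
  // Backward substitution [U][x] = [y]
  x[N - 1] = y[N - 1] / diag[N - 1];
  for (int i = N - 2; i >= 0; i--)
    x[i] = (y[i] - upper[i] * x[i + 1]) / diag[i];
}
```

Because `lower` and `diag` are destroyed by the solve, a caller has to refill them before every call. The LineRelax routines in this patch do so, since they rebuild `f`, `diag` and `e` for each line.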
--- a/AMSS_NCKU_source/TwoPunctures.h
+++ b/AMSS_NCKU_source/TwoPunctures.h
@@ -42,6 +42,33 @@ private:
       int ntotal;
       // Pre-allocated workspace buffers for hot-path allocation elimination
       // LineRelax_be workspace (sized for n2)
       double *ws_diag_be, *ws_e_be, *ws_f_be, *ws_b_be, *ws_x_be;
       // LineRelax_al workspace (sized for n1)
       double *ws_diag_al, *ws_e_al, *ws_f_al, *ws_b_al, *ws_x_al;
       // ThomasAlgorithm workspace (sized for max(n1,n2))
       double *ws_thomas_y;
       // JFD_times_dv workspace (sized for nvar)
       double *ws_jfd_values;
       derivs ws_jfd_dU, ws_jfd_U;
       // chebft_Zeros workspace (sized for max(n1,n2,n3)+1)
       double *ws_cheb_c;
       // fourft workspace (sized for max(n1,n2,n3)/2+1 each)
       double *ws_four_a, *ws_four_b;
       // Derivatives_AB3 workspace
       double *ws_deriv_p, *ws_deriv_dp, *ws_deriv_d2p;
       double *ws_deriv_q, *ws_deriv_dq;
       double *ws_deriv_r, *ws_deriv_dr;
       int *ws_deriv_indx;
       // F_of_v workspace
       double *ws_fov_sources;
       double *ws_fov_values;
       derivs ws_fov_U;
       // J_times_dv workspace
       double *ws_jtdv_values;
       derivs ws_jtdv_dU, ws_jtdv_U;
       struct parameters
       {
              int nvar, n1, n2, n3;
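Review note on the workspace members added above: the pattern is to size scratch storage once and hand the same memory to every hot-path call. A minimal sketch of that pattern follows. The class `LineWorkspace` and the routine `last_prefix_sum` are hypothetical illustrations and do not appear in TwoPunctures; the sketch uses `std::vector` for ownership, whereas the patch uses raw `double *` members.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical class (not from TwoPunctures): scratch storage is sized once
// in the constructor and reused by every hot-path call.
class LineWorkspace
{
public:
  LineWorkspace(int n1, int n2)
      : ws_y(static_cast<std::size_t>(std::max(n1, n2))) {}

  // Returns a scratch line that can hold n entries, without heap traffic.
  double *scratch(int n)
  {
    assert(n >= 0 && static_cast<std::size_t>(n) <= ws_y.size());
    return ws_y.data();
  }

  std::size_t capacity() const { return ws_y.size(); }

private:
  std::vector<double> ws_y; // sized for max(n1, n2), analogous to ws_thomas_y
};

// Example hot-path routine (requires n >= 1): writes running prefix sums
// into the shared scratch line and returns the last one.
double last_prefix_sum(LineWorkspace &ws, const double *v, int n)
{
  double *y = ws.scratch(n);
  y[0] = v[0];
  for (int i = 1; i < n; i++)
    y[i] = y[i - 1] + v[i];
  return y[n - 1];
}
```

One design consequence worth checking in review: a single shared buffer makes the routine non-reentrant, so if such a routine were ever called from several OpenMP threads at once, each thread would need its own workspace.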
--- a/AMSS_NCKU_source/bssn_rhs.f90
+++ b/AMSS_NCKU_source/bssn_rhs.f90
@@ -61,9 +61,7 @@
  real*8, dimension(ex(1),ex(2),ex(3)),intent(inout) :: ham_Res, movx_Res, movy_Res, movz_Res
  real*8, dimension(ex(1),ex(2),ex(3)),intent(inout) :: Gmx_Res, Gmy_Res, Gmz_Res
 !  gont = 0: success; gont = 1: something wrong
-  integer::gont,i,j,k
+  integer::gont
-  real*8 :: val1, val2
-  real*8 :: det, t_gupxx, t_gupxy, t_gupxz, t_gupyy, t_gupyz, t_gupzz
 !~~~~~~> Other variables:
@@ -86,10 +84,7 @@
  real*8, dimension(ex(1),ex(2),ex(3)) :: gupyy,gupyz,gupzz
  real*8,dimension(3) ::SSS,AAS,ASA,SAA,ASS,SAS,SSA
-  real*8            :: PI
+  real*8            :: dX, dY, dZ, PI
-#if (DEBUG_NAN_CHECK)
-  real*8            :: dX
-#endif
  real*8, parameter :: ZEO = 0.d0,ONE = 1.D0, TWO = 2.D0, FOUR = 4.D0
  real*8, parameter :: EIGHT = 8.D0, HALF = 0.5D0, THR = 3.d0
  real*8, parameter :: SYM = 1.D0, ANTI= - 1.D0
@@ -111,8 +106,8 @@
  call getpbh(BHN,Porg,Mass)
 #endif
-#if (DEBUG_NAN_CHECK)
+!!! sanity check (disabled in production builds for performance)
-!!! sanity check
+#ifdef DEBUG
  dX = sum(chi)+sum(trK)+sum(dxx)+sum(gxy)+sum(gxz)+sum(dyy)+sum(gyz)+sum(dzz) &
      +sum(Axx)+sum(Axy)+sum(Axz)+sum(Ayy)+sum(Ayz)+sum(Azz)                   &
      +sum(Gamx)+sum(Gamy)+sum(Gamz)                                           &
@@ -146,6 +141,10 @@
  PI = dacos(-ONE)
+  dX = X(2) - X(1)
+  dY = Y(2) - Y(1)
+  dZ = Z(2) - Z(1)
  alpn1 = Lap + ONE
  chin1 = chi + ONE
  gxx = dxx + ONE
@@ -159,16 +158,18 @@
  div_beta = betaxx + betayy + betazz
  call fderivs(ex,chi,chix,chiy,chiz,X,Y,Z,SYM,SYM,SYM,symmetry,Lev)
-  call fderivs(ex,dxx,gxxx,gxxy,gxxz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,Lev)
-  call fderivs(ex,dyy,gyyx,gyyy,gyyz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,Lev)
-  call fderivs(ex,dzz,gzzx,gzzy,gzzz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,Lev)
-  call fderivs(ex,gxy,gxyx,gxyy,gxyz,X,Y,Z,ANTI,ANTI,SYM ,Symmetry,Lev)
-  call fderivs(ex,gxz,gxzx,gxzy,gxzz,X,Y,Z,ANTI,SYM ,ANTI,Symmetry,Lev)
-  call fderivs(ex,gyz,gyzx,gyzy,gyzz,X,Y,Z,SYM ,ANTI,ANTI,Symmetry,Lev)
  chi_rhs = F2o3 *chin1*( alpn1 * trK - div_beta ) !rhs for chi
+  call fderivs(ex,dxx,gxxx,gxxy,gxxz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,Lev)
+  call fderivs(ex,gxy,gxyx,gxyy,gxyz,X,Y,Z,ANTI,ANTI,SYM ,Symmetry,Lev)
+  call fderivs(ex,gxz,gxzx,gxzy,gxzz,X,Y,Z,ANTI,SYM ,ANTI,Symmetry,Lev)
+  call fderivs(ex,dyy,gyyx,gyyy,gyyz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,Lev)
+  call fderivs(ex,gyz,gyzx,gyzy,gyzz,X,Y,Z,SYM ,ANTI,ANTI,Symmetry,Lev)
+  call fderivs(ex,dzz,gzzx,gzzy,gzzz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,Lev)
+  !$OMP PARALLEL
+  !$OMP WORKSHARE
  gxx_rhs = - TWO * alpn1 * Axx    -  F2o3 * gxx * div_beta          + &
              TWO *(  gxx * betaxx +   gxy * betayx +   gxz * betazx)
@@ -193,99 +194,75 @@
                                       gyz * betayx +   gzz * betazx   &
                                                    -   gxz * betayy     !rhs for gij
-! fused loop for metric inversion and connections
+! invert tilted metric
-  !DIR$ SIMD
+  gupzz =  gxx * gyy * gzz + gxy * gyz * gxz + gxz * gxy * gyz - &
-  do k=1,ex(3)
+           gxz * gyy * gxz - gxy * gxy * gzz - gxx * gyz * gyz
-  do j=1,ex(2)
+  gupxx =   ( gyy * gzz - gyz * gyz ) / gupzz
-  do i=1,ex(1)
+  gupxy = - ( gxy * gzz - gyz * gxz ) / gupzz
-     ! 1. Metric Inversion
+  gupxz =   ( gxy * gyz - gyy * gxz ) / gupzz
-     det = ONE / ( &
+  gupyy =   ( gxx * gzz - gxz * gxz ) / gupzz
-            gxx(i,j,k) * gyy(i,j,k) * gzz(i,j,k) + gxy(i,j,k) * gyz(i,j,k) * gxz(i,j,k) + &
+  gupyz = - ( gxx * gyz - gxy * gxz ) / gupzz
-            gxz(i,j,k) * gxy(i,j,k) * gyz(i,j,k) - gxz(i,j,k) * gyy(i,j,k) * gxz(i,j,k) - &
+  gupzz =   ( gxx * gyy - gxy * gxy ) / gupzz
-            gxy(i,j,k) * gxy(i,j,k) * gzz(i,j,k) - gxx(i,j,k) * gyz(i,j,k) * gyz(i,j,k) )
+  !$OMP END WORKSHARE
-     
+  !$OMP END PARALLEL
-     t_gupxx =   ( gyy(i,j,k) * gzz(i,j,k) - gyz(i,j,k) * gyz(i,j,k) ) * det
-     t_gupxy = - ( gxy(i,j,k) * gzz(i,j,k) - gyz(i,j,k) * gxz(i,j,k) ) * det
-     t_gupxz =   ( gxy(i,j,k) * gyz(i,j,k) - gyy(i,j,k) * gxz(i,j,k) ) * det
-     t_gupyy =   ( gxx(i,j,k) * gzz(i,j,k) - gxz(i,j,k) * gxz(i,j,k) ) * det
-     t_gupyz = - ( gxx(i,j,k) * gyz(i,j,k) - gxy(i,j,k) * gxz(i,j,k) ) * det
-     t_gupzz =   ( gxx(i,j,k) * gyy(i,j,k) - gxy(i,j,k) * gxy(i,j,k) ) * det
-     gupxx(i,j,k) = t_gupxx
-     gupxy(i,j,k) = t_gupxy
-     gupxz(i,j,k) = t_gupxz
-     gupyy(i,j,k) = t_gupyy
-     gupyz(i,j,k) = t_gupyz
-     gupzz(i,j,k) = t_gupzz
  if(co == 0)then
-        Gmx_Res(i,j,k) = Gamx(i,j,k) - (t_gupxx*(t_gupxx*gxxx(i,j,k)+t_gupxy*gxyx(i,j,k)+t_gupxz*gxzx(i,j,k))&
+! Gam^i_Res = Gam^i + gup^ij_,j
-                         +t_gupxy*(t_gupxx*gxyx(i,j,k)+t_gupxy*gyyx(i,j,k)+t_gupxz*gyzx(i,j,k))&
+  Gmx_Res = Gamx - (gupxx*(gupxx*gxxx+gupxy*gxyx+gupxz*gxzx)&
-                         +t_gupxz*(t_gupxx*gxzx(i,j,k)+t_gupxy*gyzx(i,j,k)+t_gupxz*gzzx(i,j,k))&
+                   +gupxy*(gupxx*gxyx+gupxy*gyyx+gupxz*gyzx)&
-                         +t_gupxx*(t_gupxy*gxxy(i,j,k)+t_gupyy*gxyy(i,j,k)+t_gupyz*gxzy(i,j,k))&
+                   +gupxz*(gupxx*gxzx+gupxy*gyzx+gupxz*gzzx)&
-                         +t_gupxy*(t_gupxy*gxyy(i,j,k)+t_gupyy*gyyy(i,j,k)+t_gupyz*gyzy(i,j,k))&
+                   +gupxx*(gupxy*gxxy+gupyy*gxyy+gupyz*gxzy)&
-                         +t_gupxz*(t_gupxy*gxzy(i,j,k)+t_gupyy*gyzy(i,j,k)+t_gupyz*gzzy(i,j,k))&
+                   +gupxy*(gupxy*gxyy+gupyy*gyyy+gupyz*gyzy)&
-                         +t_gupxx*(t_gupxz*gxxz(i,j,k)+t_gupyz*gxyz(i,j,k)+t_gupzz*gxzz(i,j,k))&
+                   +gupxz*(gupxy*gxzy+gupyy*gyzy+gupyz*gzzy)&
-                         +t_gupxy*(t_gupxz*gxyz(i,j,k)+t_gupyz*gyyz(i,j,k)+t_gupzz*gyzz(i,j,k))&
+                   +gupxx*(gupxz*gxxz+gupyz*gxyz+gupzz*gxzz)&
-                         +t_gupxz*(t_gupxz*gxzz(i,j,k)+t_gupyz*gyzz(i,j,k)+t_gupzz*gzzz(i,j,k)))
+                   +gupxy*(gupxz*gxyz+gupyz*gyyz+gupzz*gyzz)&
-        Gmy_Res(i,j,k) = Gamy(i,j,k) - (t_gupxx*(t_gupxy*gxxx(i,j,k)+t_gupyy*gxyx(i,j,k)+t_gupyz*gxzx(i,j,k))&
+                   +gupxz*(gupxz*gxzz+gupyz*gyzz+gupzz*gzzz))
-                         +t_gupxy*(t_gupxy*gxyx(i,j,k)+t_gupyy*gyyx(i,j,k)+t_gupyz*gyzx(i,j,k))&
+  Gmy_Res = Gamy - (gupxx*(gupxy*gxxx+gupyy*gxyx+gupyz*gxzx)&
-                         +t_gupxz*(t_gupxy*gxzx(i,j,k)+t_gupyy*gyzx(i,j,k)+t_gupyz*gzzx(i,j,k))&
+                   +gupxy*(gupxy*gxyx+gupyy*gyyx+gupyz*gyzx)&
-                         +t_gupxy*(t_gupxy*gxxy(i,j,k)+t_gupyy*gxyy(i,j,k)+t_gupyz*gxzy(i,j,k))&
+                   +gupxz*(gupxy*gxzx+gupyy*gyzx+gupyz*gzzx)&
-                         +t_gupyy*(t_gupxy*gxyy(i,j,k)+t_gupyy*gyyy(i,j,k)+t_gupyz*gyzy(i,j,k))&
+                   +gupxy*(gupxy*gxxy+gupyy*gxyy+gupyz*gxzy)&
-                         +t_gupyz*(t_gupxy*gxzy(i,j,k)+t_gupyy*gyzy(i,j,k)+t_gupyz*gzzy(i,j,k))&
+                   +gupyy*(gupxy*gxyy+gupyy*gyyy+gupyz*gyzy)&
-                         +t_gupxy*(t_gupxz*gxxz(i,j,k)+t_gupyz*gxyz(i,j,k)+t_gupzz*gxzz(i,j,k))&
+                   +gupyz*(gupxy*gxzy+gupyy*gyzy+gupyz*gzzy)&
-                         +t_gupyy*(t_gupxz*gxyz(i,j,k)+t_gupyz*gyyz(i,j,k)+t_gupzz*gyzz(i,j,k))&
+                   +gupxy*(gupxz*gxxz+gupyz*gxyz+gupzz*gxzz)&
-                         +t_gupyz*(t_gupxz*gxzz(i,j,k)+t_gupyz*gyzz(i,j,k)+t_gupzz*gzzz(i,j,k)))
+                   +gupyy*(gupxz*gxyz+gupyz*gyyz+gupzz*gyzz)&
-        Gmz_Res(i,j,k) = Gamz(i,j,k) - (t_gupxx*(t_gupxz*gxxx(i,j,k)+t_gupyz*gxyx(i,j,k)+t_gupzz*gxzx(i,j,k))&
+                   +gupyz*(gupxz*gxzz+gupyz*gyzz+gupzz*gzzz))
-                         +t_gupxy*(t_gupxz*gxyx(i,j,k)+t_gupyz*gyyx(i,j,k)+t_gupzz*gyzx(i,j,k))&
+  Gmz_Res = Gamz - (gupxx*(gupxz*gxxx+gupyz*gxyx+gupzz*gxzx)&
-                         +t_gupxz*(t_gupxz*gxzx(i,j,k)+t_gupyz*gyzx(i,j,k)+t_gupzz*gzzx(i,j,k))&
+                   +gupxy*(gupxz*gxyx+gupyz*gyyx+gupzz*gyzx)&
-                         +t_gupxy*(t_gupxz*gxxy(i,j,k)+t_gupyz*gxyy(i,j,k)+t_gupzz*gxzy(i,j,k))&
+                   +gupxz*(gupxz*gxzx+gupyz*gyzx+gupzz*gzzx)&
-                         +t_gupyy*(t_gupxz*gxyy(i,j,k)+t_gupyz*gyyy(i,j,k)+t_gupzz*gyzy(i,j,k))&
+                   +gupxy*(gupxz*gxxy+gupyz*gxyy+gupzz*gxzy)&
-                         +t_gupyz*(t_gupxz*gxzy(i,j,k)+t_gupyz*gyzy(i,j,k)+t_gupzz*gzzy(i,j,k))&
+                   +gupyy*(gupxz*gxyy+gupyz*gyyy+gupzz*gyzy)&
-                         +t_gupxz*(t_gupxz*gxxz(i,j,k)+t_gupyz*gxyz(i,j,k)+t_gupzz*gxzz(i,j,k))&
+                   +gupyz*(gupxz*gxzy+gupyz*gyzy+gupzz*gzzy)&
-                         +t_gupyz*(t_gupxz*gxyz(i,j,k)+t_gupyz*gyyz(i,j,k)+t_gupzz*gyzz(i,j,k))&
+                   +gupxz*(gupxz*gxxz+gupyz*gxyz+gupzz*gxzz)&
-                         +t_gupzz*(t_gupxz*gxzz(i,j,k)+t_gupyz*gyzz(i,j,k)+t_gupzz*gzzz(i,j,k)))
+                   +gupyz*(gupxz*gxyz+gupyz*gyyz+gupzz*gyzz)&
+                   +gupzz*(gupxz*gxzz+gupyz*gyzz+gupzz*gzzz))
  endif
-     ! 2. Christoffel Symbols
+! second kind of connection
-     val1 = TWO * gxyx(i,j,k) - gxxy(i,j,k)
+  !$OMP PARALLEL
-     val2 = TWO * gxzx(i,j,k) - gxxz(i,j,k)
+  !$OMP WORKSHARE
-     Gamxxx(i,j,k) =HALF*( t_gupxx*gxxx(i,j,k) + t_gupxy*val1 + t_gupxz*val2 )
+  Gamxxx =HALF*( gupxx*gxxx + gupxy*(TWO*gxyx - gxxy ) + gupxz*(TWO*gxzx - gxxz ))
-     Gamyxx(i,j,k) =HALF*( t_gupxy*gxxx(i,j,k) + t_gupyy*val1 + t_gupyz*val2 )
+  Gamyxx =HALF*( gupxy*gxxx + gupyy*(TWO*gxyx - gxxy ) + gupyz*(TWO*gxzx - gxxz ))
-     Gamzxx(i,j,k) =HALF*( t_gupxz*gxxx(i,j,k) + t_gupyz*val1 + t_gupzz*val2 )
+  Gamzxx =HALF*( gupxz*gxxx + gupyz*(TWO*gxyx - gxxy ) + gupzz*(TWO*gxzx - gxxz ))
-     val1 = TWO * gxyy(i,j,k) - gyyx(i,j,k)
+  Gamxyy =HALF*( gupxx*(TWO*gxyy - gyyx ) + gupxy*gyyy + gupxz*(TWO*gyzy - gyyz ))
-     val2 = TWO * gyzy(i,j,k) - gyyz(i,j,k)
+  Gamyyy =HALF*( gupxy*(TWO*gxyy - gyyx ) + gupyy*gyyy + gupyz*(TWO*gyzy - gyyz ))
-     Gamxyy(i,j,k) =HALF*( t_gupxx*val1 + t_gupxy*gyyy(i,j,k) + t_gupxz*val2 )
+  Gamzyy =HALF*( gupxz*(TWO*gxyy - gyyx ) + gupyz*gyyy + gupzz*(TWO*gyzy - gyyz ))
-     Gamyyy(i,j,k) =HALF*( t_gupxy*val1 + t_gupyy*gyyy(i,j,k) + t_gupyz*val2 )
-     Gamzyy(i,j,k) =HALF*( t_gupxz*val1 + t_gupyz*gyyy(i,j,k) + t_gupzz*val2 )
-     val1 = TWO * gxzz(i,j,k) - gzzx(i,j,k)
+  Gamxzz =HALF*( gupxx*(TWO*gxzz - gzzx ) + gupxy*(TWO*gyzz - gzzy ) + gupxz*gzzz)
-     val2 = TWO * gyzz(i,j,k) - gzzy(i,j,k)
+  Gamyzz =HALF*( gupxy*(TWO*gxzz - gzzx ) + gupyy*(TWO*gyzz - gzzy ) + gupyz*gzzz)
-     Gamxzz(i,j,k) =HALF*( t_gupxx*val1 + t_gupxy*val2 + t_gupxz*gzzz(i,j,k) )
+  Gamzzz =HALF*( gupxz*(TWO*gxzz - gzzx ) + gupyz*(TWO*gyzz - gzzy ) + gupzz*gzzz)
-     Gamyzz(i,j,k) =HALF*( t_gupxy*val1 + t_gupyy*val2 + t_gupyz*gzzz(i,j,k) )
-     Gamzzz(i,j,k) =HALF*( t_gupxz*val1 + t_gupyz*val2 + t_gupzz*gzzz(i,j,k) )
-     val1 = gxzy(i,j,k) + gyzx(i,j,k) - gxyz(i,j,k)
+  Gamxxy =HALF*( gupxx*gxxy + gupxy*gyyx + gupxz*( gxzy + gyzx - gxyz ) )
-     Gamxxy(i,j,k) =HALF*( t_gupxx*gxxy(i,j,k) + t_gupxy*gyyx(i,j,k) + t_gupxz*val1 )
+  Gamyxy =HALF*( gupxy*gxxy + gupyy*gyyx + gupyz*( gxzy + gyzx - gxyz ) )
-     Gamyxy(i,j,k) =HALF*( t_gupxy*gxxy(i,j,k) + t_gupyy*gyyx(i,j,k) + t_gupyz*val1 )
+  Gamzxy =HALF*( gupxz*gxxy + gupyz*gyyx + gupzz*( gxzy + gyzx - gxyz ) )
-     Gamzxy(i,j,k) =HALF*( t_gupxz*gxxy(i,j,k) + t_gupyz*gyyx(i,j,k) + t_gupzz*val1 )
-     val1 = gxyz(i,j,k) + gyzx(i,j,k) - gxzy(i,j,k)
-     Gamxxz(i,j,k) =HALF*( t_gupxx*gxxz(i,j,k) + t_gupxy*val1 + t_gupxz*gzzx(i,j,k) )
-     Gamyxz(i,j,k) =HALF*( t_gupxy*gxxz(i,j,k) + t_gupyy*val1 + t_gupyz*gzzx(i,j,k) )
-     Gamzxz(i,j,k) =HALF*( t_gupxz*gxxz(i,j,k) + t_gupyz*val1 + t_gupzz*gzzx(i,j,k) )
-     val1 = gxyz(i,j,k) + gxzy(i,j,k) - gyzx(i,j,k)
-     Gamxyz(i,j,k) =HALF*( t_gupxx*val1 + t_gupxy*gyyz(i,j,k) + t_gupxz*gzzy(i,j,k) )
-     Gamyyz(i,j,k) =HALF*( t_gupxy*val1 + t_gupyy*gyyz(i,j,k) + t_gupyz*gzzy(i,j,k) )
-     Gamzyz(i,j,k) =HALF*( t_gupxz*val1 + t_gupyz*gyyz(i,j,k) + t_gupzz*gzzy(i,j,k) )
-  enddo
-  enddo
-  enddo
+  Gamxxz =HALF*( gupxx*gxxz + gupxy*( gxyz + gyzx - gxzy ) + gupxz*gzzx )
+  Gamyxz =HALF*( gupxy*gxxz + gupyy*( gxyz + gyzx - gxzy ) + gupyz*gzzx )
+  Gamzxz =HALF*( gupxz*gxxz + gupyz*( gxyz + gyzx - gxzy ) + gupzz*gzzx )
+  Gamxyz =HALF*( gupxx*( gxyz + gxzy - gyzx ) + gupxy*gyyz + gupxz*gzzy )
+  Gamyyz =HALF*( gupxy*( gxyz + gxzy - gyzx ) + gupyy*gyyz + gupyz*gzzy )
+  Gamzyz =HALF*( gupxz*( gxyz + gxzy - gyzx ) + gupyz*gyyz + gupzz*gzzy )
 ! Raise indices of \tilde A_{ij} and store in R_ij
  Rxx =    gupxx * gupxx * Axx + gupxy * gupxy * Ayy + gupxz * gupxz * Azz + &
@@ -311,45 +288,37 @@
          (gupxy * gupyz       + gupyy * gupxz)* Axy                       + &
          (gupxy * gupzz       + gupyz * gupxz)* Axz                       + &
          (gupyy * gupzz       + gupyz * gupyz)* Ayz
+  !$OMP END WORKSHARE
+  !$OMP END PARALLEL
 ! Right hand side for Gam^i without shift terms...
  call fderivs(ex,Lap,Lapx,Lapy,Lapz,X,Y,Z,SYM,SYM,SYM,Symmetry,Lev)
  call fderivs(ex,trK,Kx,Ky,Kz,X,Y,Z,SYM,SYM,SYM,symmetry,Lev)
-  ! reuse fxx/fxy/fxz as temporaries for matter-source combinations
-  fxx = F2o3 * Kx + EIGHT * PI * Sx
-  fxy = F2o3 * Ky + EIGHT * PI * Sy
-  fxz = F2o3 * Kz + EIGHT * PI * Sz
-  ! reuse Gamxa/Gamya/Gamza as temporaries for chix*R combinations
-  Gamxa = chix * Rxx + chiy * Rxy + chiz * Rxz
-  Gamya = chix * Rxy + chiy * Ryy + chiz * Ryz
-  Gamza = chix * Rxz + chiy * Ryz + chiz * Rzz
   Gamx_rhs = - TWO * (   Lapx * Rxx +   Lapy * Rxy +   Lapz * Rxz ) + &
        TWO * alpn1 * (                                                &
-        -F3o2 * ONE/chin1 * Gamxa - &
+        -F3o2/chin1 * (   chix * Rxx +   chiy * Rxy +   chiz * Rxz ) - &
-              gupxx * fxx - &
+              gupxx * (   F2o3 * Kx  +  EIGHT * PI * Sx            ) - &
-              gupxy * fxy - &
+              gupxy * (   F2o3 * Ky  +  EIGHT * PI * Sy            ) - &
-              gupxz * fxz + &
+              gupxz * (   F2o3 * Kz  +  EIGHT * PI * Sz            ) + &
                        Gamxxx * Rxx + Gamxyy * Ryy + Gamxzz * Rzz   + &
                TWO * ( Gamxxy * Rxy + Gamxxz * Rxz + Gamxyz * Ryz ) )
   Gamy_rhs = - TWO * (   Lapx * Rxy +   Lapy * Ryy +   Lapz * Ryz ) + &
        TWO * alpn1 * (                                                &
-        -F3o2 * ONE/chin1 * Gamya - &
+        -F3o2/chin1 * (   chix * Rxy +  chiy * Ryy +    chiz * Ryz ) - &
-              gupxy * fxx - &
+              gupxy * (   F2o3 * Kx  +  EIGHT * PI * Sx            ) - &
-              gupyy * fxy - &
+              gupyy * (   F2o3 * Ky  +  EIGHT * PI * Sy            ) - &
-              gupyz * fxz + &
+              gupyz * (   F2o3 * Kz  +  EIGHT * PI * Sz            ) + &
                        Gamyxx * Rxx + Gamyyy * Ryy + Gamyzz * Rzz   + &
                TWO * ( Gamyxy * Rxy + Gamyxz * Rxz + Gamyyz * Ryz ) )
   Gamz_rhs = - TWO * (   Lapx * Rxz +   Lapy * Ryz +   Lapz * Rzz ) + &
        TWO * alpn1 * (                                                &
-        -F3o2 * ONE/chin1 * Gamza - &
+        -F3o2/chin1 * (   chix * Rxz +  chiy * Ryz +    chiz * Rzz ) - &
-              gupxz * fxx - &
+              gupxz * (   F2o3 * Kx  +  EIGHT * PI * Sx            ) - &
-              gupyz * fxy - &
+              gupyz * (   F2o3 * Ky  +  EIGHT * PI * Sy            ) - &
-              gupzz * fxz + &
+              gupzz * (   F2o3 * Kz  +  EIGHT * PI * Sz            ) + &
                        Gamzxx * Rxx + Gamzyy * Ryy + Gamzzz * Rzz   + &
                TWO * ( Gamzxy * Rxy + Gamzxz * Rxz + Gamzyz * Ryz ) )
@@ -375,6 +344,8 @@
  call fderivs(ex,Gamy,Gamyx,Gamyy,Gamyz,X,Y,Z,SYM ,ANTI,SYM ,Symmetry,Lev)
  call fderivs(ex,Gamz,Gamzx,Gamzy,Gamzz,X,Y,Z,SYM ,SYM ,ANTI,Symmetry,Lev)
+  !$OMP PARALLEL
+  !$OMP WORKSHARE
  Gamx_rhs =               Gamx_rhs +  F2o3 *  Gamxa * div_beta        - &
                     Gamxa * betaxx - Gamya * betaxy - Gamza * betaxz  + &
             F1o3 * (gupxx * fxx    + gupxy * fxy    + gupxz * fxz    ) + &
@@ -414,6 +385,8 @@
  gyyz = gxz * Gamxyy + gyz * Gamyyy + gzz * Gamzyy
  gyzz = gxz * Gamxyz + gyz * Gamyyz + gzz * Gamzyz
  gzzz = gxz * Gamxzz + gyz * Gamyzz + gzz * Gamzzz
+  !$OMP END WORKSHARE
+  !$OMP END PARALLEL
 !compute Ricci tensor for tilted metric
   call fdderivs(ex,dxx,fxx,fxy,fxz,fyy,fyz,fzz,X,Y,Z,SYM ,SYM ,SYM ,symmetry,Lev)
@@ -440,6 +413,8 @@
   Ryz =   gupxx * fxx + gupyy * fyy + gupzz * fzz + &
         ( gupxy * fxy + gupxz * fxz + gupyz * fyz ) * TWO
+  !$OMP PARALLEL
+  !$OMP WORKSHARE
  Rxx =     - HALF * Rxx                                   + &
               gxx * Gamxx+ gxy * Gamyx   +    gxz * Gamzx + &
             Gamxa * gxxx +  Gamya * gxyx +  Gamza * gxzx  + &
@@ -640,9 +615,13 @@
            Gamxyz * gxzz + Gamyyz * gyzz + Gamzyz * gzzz  + &
            Gamxzz * gxzy + Gamyzz * gyzy + Gamzzz * gzzy  + &
            Gamxyz * gzzx + Gamyyz * gzzy + Gamzyz * gzzz )
+  !$OMP END WORKSHARE
+  !$OMP END PARALLEL
 !covariant second derivative of chi respect to tilted metric
  call fdderivs(ex,chi,fxx,fxy,fxz,fyy,fyz,fzz,X,Y,Z,SYM,SYM,SYM,Symmetry,Lev)
+  !$OMP PARALLEL
+  !$OMP WORKSHARE
  fxx = fxx - Gamxxx * chix - Gamyxx * chiy - Gamzxx * chiz
  fxy = fxy - Gamxxy * chix - Gamyxy * chiy - Gamzxy * chiz
  fxz = fxz - Gamxxz * chix - Gamyxz * chiy - Gamzxz * chiz
@@ -651,47 +630,51 @@
  fzz = fzz - Gamxzz * chix - Gamyzz * chiy - Gamzzz * chiz
 ! Store D^l D_l chi - 3/(2*chi) D^l chi D_l chi in f
-  f =        gupxx * ( fxx - F3o2 * ONE/chin1 * chix * chix ) + &
+  f =        gupxx * ( fxx - F3o2/chin1 * chix * chix ) + &
-             gupyy * ( fyy - F3o2 * ONE/chin1 * chiy * chiy ) + &
+             gupyy * ( fyy - F3o2/chin1 * chiy * chiy ) + &
-             gupzz * ( fzz - F3o2 * ONE/chin1 * chiz * chiz ) + &
+             gupzz * ( fzz - F3o2/chin1 * chiz * chiz ) + &
-       TWO * gupxy * ( fxy - F3o2 * ONE/chin1 * chix * chiy ) + &
+       TWO * gupxy * ( fxy - F3o2/chin1 * chix * chiy ) + &
-       TWO * gupxz * ( fxz - F3o2 * ONE/chin1 * chix * chiz ) + &
+       TWO * gupxz * ( fxz - F3o2/chin1 * chix * chiz ) + &
-       TWO * gupyz * ( fyz - F3o2 * ONE/chin1 * chiy * chiz ) 
+       TWO * gupyz * ( fyz - F3o2/chin1 * chiy * chiz ) 
 ! Add chi part to Ricci tensor:
-  Rxx = Rxx + (fxx - chix*chix*ONE/chin1*HALF + gxx * f) * ONE/chin1 * HALF
+  Rxx = Rxx + (fxx - chix*chix/chin1/TWO + gxx * f)/chin1/TWO
-  Ryy = Ryy + (fyy - chiy*chiy*ONE/chin1*HALF + gyy * f) * ONE/chin1 * HALF
+  Ryy = Ryy + (fyy - chiy*chiy/chin1/TWO + gyy * f)/chin1/TWO
-  Rzz = Rzz + (fzz - chiz*chiz*ONE/chin1*HALF + gzz * f) * ONE/chin1 * HALF
+  Rzz = Rzz + (fzz - chiz*chiz/chin1/TWO + gzz * f)/chin1/TWO
-  Rxy = Rxy + (fxy - chix*chiy*ONE/chin1*HALF + gxy * f) * ONE/chin1 * HALF
+  Rxy = Rxy + (fxy - chix*chiy/chin1/TWO + gxy * f)/chin1/TWO
-  Rxz = Rxz + (fxz - chix*chiz*ONE/chin1*HALF + gxz * f) * ONE/chin1 * HALF
+  Rxz = Rxz + (fxz - chix*chiz/chin1/TWO + gxz * f)/chin1/TWO
-  Ryz = Ryz + (fyz - chiy*chiz*ONE/chin1*HALF + gyz * f) * ONE/chin1 * HALF
+  Ryz = Ryz + (fyz - chiy*chiz/chin1/TWO + gyz * f)/chin1/TWO
+  !$OMP END WORKSHARE
+  !$OMP END PARALLEL
 ! covariant second derivatives of the lapse respect to physical metric
  call fdderivs(ex,Lap,fxx,fxy,fxz,fyy,fyz,fzz,X,Y,Z, &
                SYM,SYM,SYM,symmetry,Lev)
-  gxxx = (gupxx * chix + gupxy * chiy + gupxz * chiz) * ONE/chin1
+  !$OMP PARALLEL
-  gxxy = (gupxy * chix + gupyy * chiy + gupyz * chiz) * ONE/chin1
+  !$OMP WORKSHARE
-  gxxz = (gupxz * chix + gupyz * chiy + gupzz * chiz) * ONE/chin1
+  gxxx = (gupxx * chix + gupxy * chiy + gupxz * chiz)/chin1
+  gxxy = (gupxy * chix + gupyy * chiy + gupyz * chiz)/chin1
+  gxxz = (gupxz * chix + gupyz * chiy + gupzz * chiz)/chin1
 ! now get physical second kind of connection
-  Gamxxx = Gamxxx - ( TWO * chix * ONE/chin1 - gxx * gxxx )*HALF
+  Gamxxx = Gamxxx - ( (chix + chix)/chin1 - gxx * gxxx )*HALF
  Gamyxx = Gamyxx - (                     - gxx * gxxy )*HALF
  Gamzxx = Gamzxx - (                     - gxx * gxxz )*HALF
  Gamxyy = Gamxyy - (                     - gyy * gxxx )*HALF
-  Gamyyy = Gamyyy - ( TWO * chiy * ONE/chin1 - gyy * gxxy )*HALF
+  Gamyyy = Gamyyy - ( (chiy + chiy)/chin1 - gyy * gxxy )*HALF
  Gamzyy = Gamzyy - (                     - gyy * gxxz )*HALF
  Gamxzz = Gamxzz - (                     - gzz * gxxx )*HALF
  Gamyzz = Gamyzz - (                     - gzz * gxxy )*HALF
-  Gamzzz = Gamzzz - ( TWO * chiz * ONE/chin1 - gzz * gxxz )*HALF
+  Gamzzz = Gamzzz - ( (chiz + chiz)/chin1 - gzz * gxxz )*HALF
-  Gamxxy = Gamxxy - (  chiy * ONE/chin1 - gxy * gxxx )*HALF
+  Gamxxy = Gamxxy - (  chiy        /chin1 - gxy * gxxx )*HALF
-  Gamyxy = Gamyxy - (  chix * ONE/chin1 - gxy * gxxy )*HALF
+  Gamyxy = Gamyxy - (         chix /chin1 - gxy * gxxy )*HALF
  Gamzxy = Gamzxy - (                     - gxy * gxxz )*HALF
-  Gamxxz = Gamxxz - (  chiz * ONE/chin1 - gxz * gxxx )*HALF
+  Gamxxz = Gamxxz - (  chiz        /chin1 - gxz * gxxx )*HALF
  Gamyxz = Gamyxz - (                     - gxz * gxxy )*HALF
-  Gamzxz = Gamzxz - (  chix * ONE/chin1 - gxz * gxxz )*HALF
+  Gamzxz = Gamzxz - (         chix /chin1 - gxz * gxxz )*HALF
  Gamxyz = Gamxyz - (                     - gyz * gxxx )*HALF
-  Gamyyz = Gamyyz - (  chiz * ONE/chin1 - gyz * gxxy )*HALF
+  Gamyyz = Gamyyz - (  chiz        /chin1 - gyz * gxxy )*HALF
-  Gamzyz = Gamzyz - (  chiy * ONE/chin1 - gyz * gxxz )*HALF
+  Gamzyz = Gamzyz - (         chiy /chin1 - gyz * gxxz )*HALF
  fxx = fxx - Gamxxx*Lapx - Gamyxx*Lapy - Gamzxx*Lapz
  fyy = fyy - Gamxyy*Lapx - Gamyyy*Lapy - Gamzyy*Lapz
@@ -734,7 +717,7 @@
       gupxz * (Axy * Azz + Ayz * Axz) + &
       gupyz * (Ayy * Azz + Ayz * Ayz) ) )) -1.6d1*PI*rho + EIGHT * PI * S
  f = - F1o3 *(  gupxx * fxx + gupyy * fyy + gupzz * fzz + &
-        TWO* ( gupxy * fxy + gupxz * fxz + gupyz * fyz ) + alpn1 * ONE/chin1 * f)
+        TWO* ( gupxy * fxy + gupxz * fxz + gupyz * fyz ) + alpn1/chin1*f)
  fxx = alpn1 * (Rxx - EIGHT * PI * Sxx) - fxx
  fxy = alpn1 * (Rxy - EIGHT * PI * Sxy) - fxy
@@ -830,6 +813,8 @@
 !!!! gauge variable part
  Lap_rhs = -TWO*alpn1*trK
+  !$OMP END WORKSHARE
+  !$OMP END PARALLEL
 #if (GAUGE == 0)
  betax_rhs = FF*dtSfx
  betay_rhs = FF*dtSfy
@@ -854,8 +839,7 @@
  call fderivs(ex,chi,dtSfx_rhs,dtSfy_rhs,dtSfz_rhs,X,Y,Z,SYM,SYM,SYM,Symmetry,Lev)
  reta = gupxx * dtSfx_rhs * dtSfx_rhs + gupyy * dtSfy_rhs * dtSfy_rhs + gupzz * dtSfz_rhs * dtSfz_rhs + &
       TWO * (gupxy * dtSfx_rhs * dtSfy_rhs + gupxz * dtSfx_rhs * dtSfz_rhs + gupyz * dtSfy_rhs * dtSfz_rhs)
-  fxx = dsqrt(chin1)
+  reta = 1.31d0/2*dsqrt(reta/chin1)/(1-dsqrt(chin1))**2
-  reta = 1.31d0/2*dsqrt(reta*ONE/chin1)/(ONE-fxx)**2
  dtSfx_rhs = Gamx_rhs - reta*dtSfx
  dtSfy_rhs = Gamy_rhs - reta*dtSfy
  dtSfz_rhs = Gamz_rhs - reta*dtSfz
@@ -867,7 +851,7 @@
  call fderivs(ex,chi,dtSfx_rhs,dtSfy_rhs,dtSfz_rhs,X,Y,Z,SYM,SYM,SYM,Symmetry,Lev)
  reta = gupxx * dtSfx_rhs * dtSfx_rhs + gupyy * dtSfy_rhs * dtSfy_rhs + gupzz * dtSfz_rhs * dtSfz_rhs + &
       TWO * (gupxy * dtSfx_rhs * dtSfy_rhs + gupxz * dtSfx_rhs * dtSfz_rhs + gupyz * dtSfy_rhs * dtSfz_rhs)
-  reta = 1.31d0/2*dsqrt(reta*ONE/chin1)/(ONE-chin1)**2
+  reta = 1.31d0/2*dsqrt(reta/chin1)/(1-chin1)**2
  dtSfx_rhs = Gamx_rhs - reta*dtSfx
  dtSfy_rhs = Gamy_rhs - reta*dtSfy
  dtSfz_rhs = Gamz_rhs - reta*dtSfz
@@ -875,8 +859,7 @@
  call fderivs(ex,chi,dtSfx_rhs,dtSfy_rhs,dtSfz_rhs,X,Y,Z,SYM,SYM,SYM,Symmetry,Lev)
  reta = gupxx * dtSfx_rhs * dtSfx_rhs + gupyy * dtSfy_rhs * dtSfy_rhs + gupzz * dtSfz_rhs * dtSfz_rhs + &
       TWO * (gupxy * dtSfx_rhs * dtSfy_rhs + gupxz * dtSfx_rhs * dtSfz_rhs + gupyz * dtSfy_rhs * dtSfz_rhs)
-  fxx = dsqrt(chin1)
+  reta = 1.31d0/2*dsqrt(reta/chin1)/(1-dsqrt(chin1))**2
-  reta = 1.31d0/2*dsqrt(reta*ONE/chin1)/(ONE-fxx)**2
  betax_rhs = FF*Gamx - reta*betax
  betay_rhs = FF*Gamy - reta*betay
  betaz_rhs = FF*Gamz - reta*betaz
@@ -888,7 +871,7 @@
  call fderivs(ex,chi,dtSfx_rhs,dtSfy_rhs,dtSfz_rhs,X,Y,Z,SYM,SYM,SYM,Symmetry,Lev)
  reta = gupxx * dtSfx_rhs * dtSfx_rhs + gupyy * dtSfy_rhs * dtSfy_rhs + gupzz * dtSfz_rhs * dtSfz_rhs + &
       TWO * (gupxy * dtSfx_rhs * dtSfy_rhs + gupxz * dtSfx_rhs * dtSfz_rhs + gupyz * dtSfy_rhs * dtSfz_rhs)
-  reta = 1.31d0/2*dsqrt(reta*ONE/chin1)/(ONE-chin1)**2
+  reta = 1.31d0/2*dsqrt(reta/chin1)/(1-chin1)**2
  betax_rhs = FF*Gamx - reta*betax
  betay_rhs = FF*Gamy - reta*betay
  betaz_rhs = FF*Gamz - reta*betaz
@@ -1120,48 +1103,48 @@ endif
 ! mov_Res_j = gupkj*(-1/chi d_k chi*A_ij + D_k A_ij) - 2/3 d_j trK - 8 PI s_j where D respect to physical metric
 ! store D_i A_jk - 1/chi d_i chi*A_jk in gjk_i
  call fderivs(ex,Axx,gxxx,gxxy,gxxz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,0)
-  call fderivs(ex,Ayy,gyyx,gyyy,gyyz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,0)
-  call fderivs(ex,Azz,gzzx,gzzy,gzzz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,0)
  call fderivs(ex,Axy,gxyx,gxyy,gxyz,X,Y,Z,ANTI,ANTI,SYM ,Symmetry,0)
  call fderivs(ex,Axz,gxzx,gxzy,gxzz,X,Y,Z,ANTI,SYM ,ANTI,Symmetry,0)
+  call fderivs(ex,Ayy,gyyx,gyyy,gyyz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,0)
  call fderivs(ex,Ayz,gyzx,gyzy,gyzz,X,Y,Z,SYM ,ANTI,ANTI,Symmetry,0)
+  call fderivs(ex,Azz,gzzx,gzzy,gzzz,X,Y,Z,SYM ,SYM ,SYM ,Symmetry,0)
  gxxx = gxxx - (  Gamxxx * Axx + Gamyxx * Axy + Gamzxx * Axz &
-                 + Gamxxx * Axx + Gamyxx * Axy + Gamzxx * Axz) - chix*Axx*ONE/chin1
+                 + Gamxxx * Axx + Gamyxx * Axy + Gamzxx * Axz) - chix*Axx/chin1
  gxyx = gxyx - (  Gamxxy * Axx + Gamyxy * Axy + Gamzxy * Axz &
-                 + Gamxxx * Axy + Gamyxx * Ayy + Gamzxx * Ayz) - chix*Axy*ONE/chin1
+                 + Gamxxx * Axy + Gamyxx * Ayy + Gamzxx * Ayz) - chix*Axy/chin1
  gxzx = gxzx - (  Gamxxz * Axx + Gamyxz * Axy + Gamzxz * Axz &
-                 + Gamxxx * Axz + Gamyxx * Ayz + Gamzxx * Azz) - chix*Axz*ONE/chin1
+                 + Gamxxx * Axz + Gamyxx * Ayz + Gamzxx * Azz) - chix*Axz/chin1
  gyyx = gyyx - (  Gamxxy * Axy + Gamyxy * Ayy + Gamzxy * Ayz &
-                 + Gamxxy * Axy + Gamyxy * Ayy + Gamzxy * Ayz) - chix*Ayy*ONE/chin1
+                 + Gamxxy * Axy + Gamyxy * Ayy + Gamzxy * Ayz) - chix*Ayy/chin1
  gyzx = gyzx - (  Gamxxz * Axy + Gamyxz * Ayy + Gamzxz * Ayz &
-                 + Gamxxy * Axz + Gamyxy * Ayz + Gamzxy * Azz) - chix*Ayz*ONE/chin1
+                 + Gamxxy * Axz + Gamyxy * Ayz + Gamzxy * Azz) - chix*Ayz/chin1
  gzzx = gzzx - (  Gamxxz * Axz + Gamyxz * Ayz + Gamzxz * Azz &
-                 + Gamxxz * Axz + Gamyxz * Ayz + Gamzxz * Azz) - chix*Azz*ONE/chin1
+                 + Gamxxz * Axz + Gamyxz * Ayz + Gamzxz * Azz) - chix*Azz/chin1
  gxxy = gxxy - (  Gamxxy * Axx + Gamyxy * Axy + Gamzxy * Axz &
-                 + Gamxxy * Axx + Gamyxy * Axy + Gamzxy * Axz) - chiy*Axx*ONE/chin1
+                 + Gamxxy * Axx + Gamyxy * Axy + Gamzxy * Axz) - chiy*Axx/chin1
  gxyy = gxyy - (  Gamxyy * Axx + Gamyyy * Axy + Gamzyy * Axz &
-                 + Gamxxy * Axy + Gamyxy * Ayy + Gamzxy * Ayz) - chiy*Axy*ONE/chin1
+                 + Gamxxy * Axy + Gamyxy * Ayy + Gamzxy * Ayz) - chiy*Axy/chin1
  gxzy = gxzy - (  Gamxyz * Axx + Gamyyz * Axy + Gamzyz * Axz &
-                 + Gamxxy * Axz + Gamyxy * Ayz + Gamzxy * Azz) - chiy*Axz*ONE/chin1
+                 + Gamxxy * Axz + Gamyxy * Ayz + Gamzxy * Azz) - chiy*Axz/chin1
  gyyy = gyyy - (  Gamxyy * Axy + Gamyyy * Ayy + Gamzyy * Ayz &
-                 + Gamxyy * Axy + Gamyyy * Ayy + Gamzyy * Ayz) - chiy*Ayy*ONE/chin1
+                 + Gamxyy * Axy + Gamyyy * Ayy + Gamzyy * Ayz) - chiy*Ayy/chin1
  gyzy = gyzy - (  Gamxyz * Axy + Gamyyz * Ayy + Gamzyz * Ayz &
-                 + Gamxyy * Axz + Gamyyy * Ayz + Gamzyy * Azz) - chiy*Ayz*ONE/chin1
+                 + Gamxyy * Axz + Gamyyy * Ayz + Gamzyy * Azz) - chiy*Ayz/chin1
  gzzy = gzzy - (  Gamxyz * Axz + Gamyyz * Ayz + Gamzyz * Azz &
-                 + Gamxyz * Axz + Gamyyz * Ayz + Gamzyz * Azz) - chiy*Azz*ONE/chin1
+                 + Gamxyz * Axz + Gamyyz * Ayz + Gamzyz * Azz) - chiy*Azz/chin1
  gxxz = gxxz - (  Gamxxz * Axx + Gamyxz * Axy + Gamzxz * Axz &
-                 + Gamxxz * Axx + Gamyxz * Axy + Gamzxz * Axz) - chiz*Axx*ONE/chin1
+                 + Gamxxz * Axx + Gamyxz * Axy + Gamzxz * Axz) - chiz*Axx/chin1
  gxyz = gxyz - (  Gamxyz * Axx + Gamyyz * Axy + Gamzyz * Axz &
-                 + Gamxxz * Axy + Gamyxz * Ayy + Gamzxz * Ayz) - chiz*Axy*ONE/chin1
+                 + Gamxxz * Axy + Gamyxz * Ayy + Gamzxz * Ayz) - chiz*Axy/chin1
  gxzz = gxzz - (  Gamxzz * Axx + Gamyzz * Axy + Gamzzz * Axz &
-                 + Gamxxz * Axz + Gamyxz * Ayz + Gamzxz * Azz) - chiz*Axz*ONE/chin1
+                 + Gamxxz * Axz + Gamyxz * Ayz + Gamzxz * Azz) - chiz*Axz/chin1
  gyyz = gyyz - (  Gamxyz * Axy + Gamyyz * Ayy + Gamzyz * Ayz &
-                 + Gamxyz * Axy + Gamyyz * Ayy + Gamzyz * Ayz) - chiz*Ayy*ONE/chin1
+                 + Gamxyz * Axy + Gamyyz * Ayy + Gamzyz * Ayz) - chiz*Ayy/chin1
  gyzz = gyzz - (  Gamxzz * Axy + Gamyzz * Ayy + Gamzzz * Ayz &
-                 + Gamxyz * Axz + Gamyyz * Ayz + Gamzyz * Azz) - chiz*Ayz*ONE/chin1
+                 + Gamxyz * Axz + Gamyyz * Ayz + Gamzyz * Azz) - chiz*Ayz/chin1
  gzzz = gzzz - (  Gamxzz * Axz + Gamyzz * Ayz + Gamzzz * Azz &
-                 + Gamxzz * Axz + Gamyzz * Ayz + Gamzzz * Azz) - chiz*Azz*ONE/chin1
+                 + Gamxzz * Axz + Gamyzz * Ayz + Gamzzz * Azz) - chiz*Azz/chin1
 movx_Res = gupxx*gxxx + gupyy*gxyy + gupzz*gxzz &
          +gupxy*gxyx + gupxz*gxzx + gupyz*gxzy &
          +gupxy*gxxy + gupxz*gxxz + gupyz*gxyz
--- a/AMSS_NCKU_source/diff_new.f90
+++ b/AMSS_NCKU_source/diff_new.f90
@@ -997,10 +997,10 @@
  fy = ZEO
  fz = ZEO
+#if 0
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
-#if 0  
 ! x direction
        if(i+2 <= imax .and. i-2 >= imin)then
 !
@@ -1040,9 +1040,13 @@
 ! set kmax and kmin 0
    endif
+  enddo
+  enddo
+  enddo
 #elif 0
-! x direction   
-        if(i+2 <= imax .and. i-2 >= imin)then
+  do k=1,ex(3)-1
+  do j=1,ex(2)-1
+  do i=1,ex(1)-1
 !
 !              f(i-2) - 8 f(i-1) + 8 f(i+1) - f(i+2)
 !  fx(i) = ---------------------------------------------
@@ -1079,8 +1083,32 @@
 ! set kmax and kmin 0
    endif
+  enddo
+  enddo
+  enddo
 #else
-! for bam comparison
+! for bam comparison — split into branch-free interior + serial boundary
+! Interior: all stencil points guaranteed in-bounds, no branches needed
+  !$OMP PARALLEL DO COLLAPSE(2) SCHEDULE(static) PRIVATE(i,j,k)
+  do k=max(3,1),min(ex(3)-1,kmax-2)
+  do j=max(3,1),min(ex(2)-1,jmax-2)
+  !DIR$ IVDEP
+  do i=max(3,1),min(ex(1)-1,imax-2)
+      fx(i,j,k)=d12dx*(fh(i-2,j,k)-EIT*fh(i-1,j,k)+EIT*fh(i+1,j,k)-fh(i+2,j,k))
+      fy(i,j,k)=d12dy*(fh(i,j-2,k)-EIT*fh(i,j-1,k)+EIT*fh(i,j+1,k)-fh(i,j+2,k))
+      fz(i,j,k)=d12dz*(fh(i,j,k-2)-EIT*fh(i,j,k-1)+EIT*fh(i,j,k+1)-fh(i,j,k+2))
+  enddo
+  enddo
+  enddo
+  !$OMP END PARALLEL DO
+! Boundary shell: original branching logic for points near edges
+  do k=1,ex(3)-1
+  do j=1,ex(2)-1
+  do i=1,ex(1)-1
+   if(i >= 3 .and. i <= imax-2 .and. &
+      j >= 3 .and. j <= jmax-2 .and. &
+      k >= 3 .and. k <= kmax-2) cycle
   if(i+2 <= imax .and. i-2 >= imin .and. &
      j+2 <= jmax .and. j-2 >= jmin .and. &
      k+2 <= kmax .and. k-2 >= kmin) then
@@ -1094,10 +1122,10 @@
      fy(i,j,k)=d2dy*(-fh(i,j-1,k)+fh(i,j+1,k))
      fz(i,j,k)=d2dz*(-fh(i,j,k-1)+fh(i,j,k+1))
   endif
+  enddo
+  enddo
+  enddo
 #endif
-  enddo
-  enddo
-  enddo
  return
@@ -1401,10 +1429,10 @@
  fxz = ZEO
  fyz = ZEO
+#if 0
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
-#if 0  
 !~~~~~~ fxx
        if(i+2 <= imax .and. i-2 >= imin)then
 !
@@ -1482,8 +1510,47 @@
   elseif(j+1 <= jmax .and. j-1 >= jmin .and. k+1 <= kmax .and. k-1 >= kmin)then
   fyz(i,j,k) = Sdydz*(fh(i,j-1,k-1)-fh(i,j+1,k-1)-fh(i,j-1,k+1)+fh(i,j+1,k+1))
   endif
+  enddo
+  enddo
+  enddo
 #else
-! for bam comparison
+! for bam comparison — split into branch-free interior + serial boundary
+! Interior: all stencil points guaranteed in-bounds, no branches needed
+  !$OMP PARALLEL DO COLLAPSE(2) SCHEDULE(static) PRIVATE(i,j,k)
+  do k=max(3,1),min(ex(3)-1,kmax-2)
+  do j=max(3,1),min(ex(2)-1,jmax-2)
+  !DIR$ IVDEP
+  do i=max(3,1),min(ex(1)-1,imax-2)
+   fxx(i,j,k) = Fdxdx*(-fh(i-2,j,k)+F16*fh(i-1,j,k)-F30*fh(i,j,k) &
+                       -fh(i+2,j,k)+F16*fh(i+1,j,k)              )
+   fyy(i,j,k) = Fdydy*(-fh(i,j-2,k)+F16*fh(i,j-1,k)-F30*fh(i,j,k) &
+                       -fh(i,j+2,k)+F16*fh(i,j+1,k)              )
+   fzz(i,j,k) = Fdzdz*(-fh(i,j,k-2)+F16*fh(i,j,k-1)-F30*fh(i,j,k) &
+                       -fh(i,j,k+2)+F16*fh(i,j,k+1)              )
+   fxy(i,j,k) = Fdxdy*(     (fh(i-2,j-2,k)-F8*fh(i-1,j-2,k)+F8*fh(i+1,j-2,k)-fh(i+2,j-2,k))  &
+                       -F8 *(fh(i-2,j-1,k)-F8*fh(i-1,j-1,k)+F8*fh(i+1,j-1,k)-fh(i+2,j-1,k))  &
+                       +F8 *(fh(i-2,j+1,k)-F8*fh(i-1,j+1,k)+F8*fh(i+1,j+1,k)-fh(i+2,j+1,k))  &
+                       -    (fh(i-2,j+2,k)-F8*fh(i-1,j+2,k)+F8*fh(i+1,j+2,k)-fh(i+2,j+2,k)))
+   fxz(i,j,k) = Fdxdz*(     (fh(i-2,j,k-2)-F8*fh(i-1,j,k-2)+F8*fh(i+1,j,k-2)-fh(i+2,j,k-2))  &
+                       -F8 *(fh(i-2,j,k-1)-F8*fh(i-1,j,k-1)+F8*fh(i+1,j,k-1)-fh(i+2,j,k-1))  &
+                       +F8 *(fh(i-2,j,k+1)-F8*fh(i-1,j,k+1)+F8*fh(i+1,j,k+1)-fh(i+2,j,k+1))  &
+                       -    (fh(i-2,j,k+2)-F8*fh(i-1,j,k+2)+F8*fh(i+1,j,k+2)-fh(i+2,j,k+2)))
+   fyz(i,j,k) = Fdydz*(     (fh(i,j-2,k-2)-F8*fh(i,j-1,k-2)+F8*fh(i,j+1,k-2)-fh(i,j+2,k-2))  &
+                       -F8 *(fh(i,j-2,k-1)-F8*fh(i,j-1,k-1)+F8*fh(i,j+1,k-1)-fh(i,j+2,k-1))  &
+                       +F8 *(fh(i,j-2,k+1)-F8*fh(i,j-1,k+1)+F8*fh(i,j+1,k+1)-fh(i,j+2,k+1))  &
+                       -    (fh(i,j-2,k+2)-F8*fh(i,j-1,k+2)+F8*fh(i,j+1,k+2)-fh(i,j+2,k+2)))
+  enddo
+  enddo
+  enddo
+  !$OMP END PARALLEL DO
+! Boundary shell: original branching logic for points near edges
+  do k=1,ex(3)-1
+  do j=1,ex(2)-1
+  do i=1,ex(1)-1
+   if(i >= 3 .and. i <= imax-2 .and. &
+      j >= 3 .and. j <= jmax-2 .and. &
+      k >= 3 .and. k <= kmax-2) cycle
   if(i+2 <= imax .and. i-2 >= imin .and. &
      j+2 <= jmax .and. j-2 >= jmin .and. &
      k+2 <= kmax .and. k-2 >= kmin) then
@@ -1518,10 +1585,10 @@
   fxz(i,j,k) = Sdxdz*(fh(i-1,j,k-1)-fh(i+1,j,k-1)-fh(i-1,j,k+1)+fh(i+1,j,k+1))
   fyz(i,j,k) = Sdydz*(fh(i,j-1,k-1)-fh(i,j+1,k-1)-fh(i,j-1,k+1)+fh(i,j+1,k+1))
   endif
+  enddo
+  enddo
+  enddo
 #endif
-   enddo
-   enddo
-   enddo
  return
@@ -1939,309 +2006,6 @@
  return
  end subroutine fddyz
-  subroutine fderivs_batch4(ex,f1,f2,f3,f4, &
-                            f1x,f1y,f1z,f2x,f2y,f2z,f3x,f3y,f3z,f4x,f4y,f4z, &
-                            X,Y,Z,SYM1,SYM2,SYM3,symmetry,onoff)
-  implicit none
-  integer,                               intent(in ):: ex(1:3),symmetry,onoff
-  real*8,  dimension(ex(1),ex(2),ex(3)), intent(in ):: f1,f2,f3,f4
-  real*8,  dimension(ex(1),ex(2),ex(3)), intent(out):: f1x,f1y,f1z
-  real*8,  dimension(ex(1),ex(2),ex(3)), intent(out):: f2x,f2y,f2z
-  real*8,  dimension(ex(1),ex(2),ex(3)), intent(out):: f3x,f3y,f3z
-  real*8,  dimension(ex(1),ex(2),ex(3)), intent(out):: f4x,f4y,f4z
-  real*8,                                intent(in) :: X(ex(1)),Y(ex(2)),Z(ex(3))
-  real*8,                                intent(in ):: SYM1,SYM2,SYM3
-!~~~~~~ other variables
-  real*8 :: dX,dY,dZ
-  real*8,dimension(-1:ex(1),-1:ex(2),-1:ex(3)) :: fh1,fh2,fh3,fh4
-  real*8, dimension(3) :: SoA
-  integer :: imin,jmin,kmin,imax,jmax,kmax,i,j,k
-  real*8 :: d12dx,d12dy,d12dz,d2dx,d2dy,d2dz
-  integer, parameter :: NO_SYMM = 0, EQ_SYMM = 1, OCTANT = 2
-  real*8,  parameter :: ZEO=0.d0,ONE=1.d0
-  real*8,  parameter :: TWO=2.d0,EIT=8.d0
-  real*8,  parameter :: F12=1.2d1
-  dX = X(2)-X(1)
-  dY = Y(2)-Y(1)
-  dZ = Z(2)-Z(1)
-  imax = ex(1)
-  jmax = ex(2)
-  kmax = ex(3)
-  imin = 1
-  jmin = 1
-  kmin = 1
-  if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin = -1
-  if(Symmetry > EQ_SYMM .and. dabs(X(1)) < dX) imin = -1
-  if(Symmetry > EQ_SYMM .and. dabs(Y(1)) < dY) jmin = -1
-  SoA(1) = SYM1
-  SoA(2) = SYM2
-  SoA(3) = SYM3
-  call symmetry_bd(2,ex,f1,fh1,SoA)
-  call symmetry_bd(2,ex,f2,fh2,SoA)
-  call symmetry_bd(2,ex,f3,fh3,SoA)
-  call symmetry_bd(2,ex,f4,fh4,SoA)
-  d12dx = ONE/F12/dX
-  d12dy = ONE/F12/dY
-  d12dz = ONE/F12/dZ
-  d2dx = ONE/TWO/dX
-  d2dy = ONE/TWO/dY
-  d2dz = ONE/TWO/dZ
-  f1x = ZEO; f1y = ZEO; f1z = ZEO
-  f2x = ZEO; f2y = ZEO; f2z = ZEO
-  f3x = ZEO; f3y = ZEO; f3z = ZEO
-  f4x = ZEO; f4y = ZEO; f4z = ZEO
-  do k=1,ex(3)-1
-  do j=1,ex(2)-1
-  do i=1,ex(1)-1
-   if(i+2 <= imax .and. i-2 >= imin .and. &
-      j+2 <= jmax .and. j-2 >= jmin .and. &
-      k+2 <= kmax .and. k-2 >= kmin) then
-      f1x(i,j,k)=d12dx*(fh1(i-2,j,k)-EIT*fh1(i-1,j,k)+EIT*fh1(i+1,j,k)-fh1(i+2,j,k))
-      f1y(i,j,k)=d12dy*(fh1(i,j-2,k)-EIT*fh1(i,j-1,k)+EIT*fh1(i,j+1,k)-fh1(i,j+2,k))
-      f1z(i,j,k)=d12dz*(fh1(i,j,k-2)-EIT*fh1(i,j,k-1)+EIT*fh1(i,j,k+1)-fh1(i,j,k+2))
-      f2x(i,j,k)=d12dx*(fh2(i-2,j,k)-EIT*fh2(i-1,j,k)+EIT*fh2(i+1,j,k)-fh2(i+2,j,k))
-      f2y(i,j,k)=d12dy*(fh2(i,j-2,k)-EIT*fh2(i,j-1,k)+EIT*fh2(i,j+1,k)-fh2(i,j+2,k))
-      f2z(i,j,k)=d12dz*(fh2(i,j,k-2)-EIT*fh2(i,j,k-1)+EIT*fh2(i,j,k+1)-fh2(i,j,k+2))
-      f3x(i,j,k)=d12dx*(fh3(i-2,j,k)-EIT*fh3(i-1,j,k)+EIT*fh3(i+1,j,k)-fh3(i+2,j,k))
-      f3y(i,j,k)=d12dy*(fh3(i,j-2,k)-EIT*fh3(i,j-1,k)+EIT*fh3(i,j+1,k)-fh3(i,j+2,k))
-      f3z(i,j,k)=d12dz*(fh3(i,j,k-2)-EIT*fh3(i,j,k-1)+EIT*fh3(i,j,k+1)-fh3(i,j,k+2))
-      f4x(i,j,k)=d12dx*(fh4(i-2,j,k)-EIT*fh4(i-1,j,k)+EIT*fh4(i+1,j,k)-fh4(i+2,j,k))
-      f4y(i,j,k)=d12dy*(fh4(i,j-2,k)-EIT*fh4(i,j-1,k)+EIT*fh4(i,j+1,k)-fh4(i,j+2,k))
-      f4z(i,j,k)=d12dz*(fh4(i,j,k-2)-EIT*fh4(i,j,k-1)+EIT*fh4(i,j,k+1)-fh4(i,j,k+2))
-   elseif(i+1 <= imax .and. i-1 >= imin .and. &
-          j+1 <= jmax .and. j-1 >= jmin .and. &
-          k+1 <= kmax .and. k-1 >= kmin) then
-      f1x(i,j,k)=d2dx*(-fh1(i-1,j,k)+fh1(i+1,j,k))
-      f1y(i,j,k)=d2dy*(-fh1(i,j-1,k)+fh1(i,j+1,k))
-      f1z(i,j,k)=d2dz*(-fh1(i,j,k-1)+fh1(i,j,k+1))
-      f2x(i,j,k)=d2dx*(-fh2(i-1,j,k)+fh2(i+1,j,k))
-      f2y(i,j,k)=d2dy*(-fh2(i,j-1,k)+fh2(i,j+1,k))
-      f2z(i,j,k)=d2dz*(-fh2(i,j,k-1)+fh2(i,j,k+1))
-      f3x(i,j,k)=d2dx*(-fh3(i-1,j,k)+fh3(i+1,j,k))
-      f3y(i,j,k)=d2dy*(-fh3(i,j-1,k)+fh3(i,j+1,k))
-      f3z(i,j,k)=d2dz*(-fh3(i,j,k-1)+fh3(i,j,k+1))
-      f4x(i,j,k)=d2dx*(-fh4(i-1,j,k)+fh4(i+1,j,k))
-      f4y(i,j,k)=d2dy*(-fh4(i,j-1,k)+fh4(i,j+1,k))
-      f4z(i,j,k)=d2dz*(-fh4(i,j,k-1)+fh4(i,j,k+1))
-   endif
-  enddo
-  enddo
-  enddo
-  return
-  end subroutine fderivs_batch4
-!-----------------------------------------------------------------------------
-! batch first derivatives (3 fields), same symmetry setup
-!-----------------------------------------------------------------------------
-  subroutine fderivs_batch3(ex,f1,f2,f3, &
-                            f1x,f1y,f1z,f2x,f2y,f2z,f3x,f3y,f3z, &
-                            X,Y,Z,SYM1,SYM2,SYM3,symmetry,onoff)
-  implicit none
-  integer,                               intent(in ):: ex(1:3),symmetry,onoff
-  real*8,  dimension(ex(1),ex(2),ex(3)), intent(in ):: f1,f2,f3
-  real*8,  dimension(ex(1),ex(2),ex(3)), intent(out):: f1x,f1y,f1z
-  real*8,  dimension(ex(1),ex(2),ex(3)), intent(out):: f2x,f2y,f2z
-  real*8,  dimension(ex(1),ex(2),ex(3)), intent(out):: f3x,f3y,f3z
-  real*8,                                intent(in) :: X(ex(1)),Y(ex(2)),Z(ex(3))
-  real*8,                                intent(in ):: SYM1,SYM2,SYM3
-!~~~~~~ other variables
-  real*8 :: dX,dY,dZ
-  real*8,dimension(-1:ex(1),-1:ex(2),-1:ex(3)) :: fh1,fh2,fh3
-  real*8, dimension(3) :: SoA
-  integer :: imin,jmin,kmin,imax,jmax,kmax,i,j,k
-  real*8 :: d12dx,d12dy,d12dz,d2dx,d2dy,d2dz
-  integer, parameter :: NO_SYMM = 0, EQ_SYMM = 1, OCTANT = 2
-  real*8,  parameter :: ZEO=0.d0,ONE=1.d0
-  real*8,  parameter :: TWO=2.d0,EIT=8.d0
-  real*8,  parameter :: F12=1.2d1
-  dX = X(2)-X(1)
-  dY = Y(2)-Y(1)
-  dZ = Z(2)-Z(1)
-  imax = ex(1)
-  jmax = ex(2)
-  kmax = ex(3)
-  imin = 1
-  jmin = 1
-  kmin = 1
-  if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin = -1
-  if(Symmetry > EQ_SYMM .and. dabs(X(1)) < dX) imin = -1
-  if(Symmetry > EQ_SYMM .and. dabs(Y(1)) < dY) jmin = -1
-  SoA(1) = SYM1
-  SoA(2) = SYM2
-  SoA(3) = SYM3
-  call symmetry_bd(2,ex,f1,fh1,SoA)
-  call symmetry_bd(2,ex,f2,fh2,SoA)
-  call symmetry_bd(2,ex,f3,fh3,SoA)
-  d12dx = ONE/F12/dX
-  d12dy = ONE/F12/dY
-  d12dz = ONE/F12/dZ
-  d2dx = ONE/TWO/dX
-  d2dy = ONE/TWO/dY
-  d2dz = ONE/TWO/dZ
-  f1x = ZEO; f1y = ZEO; f1z = ZEO
-  f2x = ZEO; f2y = ZEO; f2z = ZEO
-  f3x = ZEO; f3y = ZEO; f3z = ZEO
-  do k=1,ex(3)-1
-  do j=1,ex(2)-1
-  do i=1,ex(1)-1
-   if(i+2 <= imax .and. i-2 >= imin .and. &
-      j+2 <= jmax .and. j-2 >= jmin .and. &
-      k+2 <= kmax .and. k-2 >= kmin) then
-      f1x(i,j,k)=d12dx*(fh1(i-2,j,k)-EIT*fh1(i-1,j,k)+EIT*fh1(i+1,j,k)-fh1(i+2,j,k))
-      f1y(i,j,k)=d12dy*(fh1(i,j-2,k)-EIT*fh1(i,j-1,k)+EIT*fh1(i,j+1,k)-fh1(i,j+2,k))
-      f1z(i,j,k)=d12dz*(fh1(i,j,k-2)-EIT*fh1(i,j,k-1)+EIT*fh1(i,j,k+1)-fh1(i,j,k+2))
-      f2x(i,j,k)=d12dx*(fh2(i-2,j,k)-EIT*fh2(i-1,j,k)+EIT*fh2(i+1,j,k)-fh2(i+2,j,k))
-      f2y(i,j,k)=d12dy*(fh2(i,j-2,k)-EIT*fh2(i,j-1,k)+EIT*fh2(i,j+1,k)-fh2(i,j+2,k))
-      f2z(i,j,k)=d12dz*(fh2(i,j,k-2)-EIT*fh2(i,j,k-1)+EIT*fh2(i,j,k+1)-fh2(i,j,k+2))
-      f3x(i,j,k)=d12dx*(fh3(i-2,j,k)-EIT*fh3(i-1,j,k)+EIT*fh3(i+1,j,k)-fh3(i+2,j,k))
-      f3y(i,j,k)=d12dy*(fh3(i,j-2,k)-EIT*fh3(i,j-1,k)+EIT*fh3(i,j+1,k)-fh3(i,j+2,k))
-      f3z(i,j,k)=d12dz*(fh3(i,j,k-2)-EIT*fh3(i,j,k-1)+EIT*fh3(i,j,k+1)-fh3(i,j,k+2))
-   elseif(i+1 <= imax .and. i-1 >= imin .and. &
-          j+1 <= jmax .and. j-1 >= jmin .and. &
-          k+1 <= kmax .and. k-1 >= kmin) then
-      f1x(i,j,k)=d2dx*(-fh1(i-1,j,k)+fh1(i+1,j,k))
-      f1y(i,j,k)=d2dy*(-fh1(i,j-1,k)+fh1(i,j+1,k))
-      f1z(i,j,k)=d2dz*(-fh1(i,j,k-1)+fh1(i,j,k+1))
-      f2x(i,j,k)=d2dx*(-fh2(i-1,j,k)+fh2(i+1,j,k))
-      f2y(i,j,k)=d2dy*(-fh2(i,j-1,k)+fh2(i,j+1,k))
-      f2z(i,j,k)=d2dz*(-fh2(i,j,k-1)+fh2(i,j,k+1))
-      f3x(i,j,k)=d2dx*(-fh3(i-1,j,k)+fh3(i+1,j,k))
-      f3y(i,j,k)=d2dy*(-fh3(i,j-1,k)+fh3(i,j+1,k))
-      f3z(i,j,k)=d2dz*(-fh3(i,j,k-1)+fh3(i,j,k+1))
-   endif
-  enddo
-  enddo
-  enddo
-  return
-  end subroutine fderivs_batch3
-!-----------------------------------------------------------------------------
-! batch first derivatives (2 fields), same symmetry setup
-!-----------------------------------------------------------------------------
-  subroutine fderivs_batch2(ex,f1,f2, &
-                            f1x,f1y,f1z,f2x,f2y,f2z, &
-                            X,Y,Z,SYM1,SYM2,SYM3,symmetry,onoff)
-  implicit none
-  integer,                               intent(in ):: ex(1:3),symmetry,onoff
-  real*8,  dimension(ex(1),ex(2),ex(3)), intent(in ):: f1,f2
-  real*8,  dimension(ex(1),ex(2),ex(3)), intent(out):: f1x,f1y,f1z
-  real*8,  dimension(ex(1),ex(2),ex(3)), intent(out):: f2x,f2y,f2z
-  real*8,                                intent(in) :: X(ex(1)),Y(ex(2)),Z(ex(3))
-  real*8,                                intent(in ):: SYM1,SYM2,SYM3
-!~~~~~~ other variables
-  real*8 :: dX,dY,dZ
-  real*8,dimension(-1:ex(1),-1:ex(2),-1:ex(3)) :: fh1,fh2
-  real*8, dimension(3) :: SoA
-  integer :: imin,jmin,kmin,imax,jmax,kmax,i,j,k
-  real*8 :: d12dx,d12dy,d12dz,d2dx,d2dy,d2dz
-  integer, parameter :: NO_SYMM = 0, EQ_SYMM = 1, OCTANT = 2
-  real*8,  parameter :: ZEO=0.d0,ONE=1.d0
-  real*8,  parameter :: TWO=2.d0,EIT=8.d0
-  real*8,  parameter :: F12=1.2d1
-  dX = X(2)-X(1)
-  dY = Y(2)-Y(1)
-  dZ = Z(2)-Z(1)
-  imax = ex(1)
-  jmax = ex(2)
-  kmax = ex(3)
-  imin = 1
-  jmin = 1
-  kmin = 1
-  if(Symmetry > NO_SYMM .and. dabs(Z(1)) < dZ) kmin = -1
-  if(Symmetry > EQ_SYMM .and. dabs(X(1)) < dX) imin = -1
-  if(Symmetry > EQ_SYMM .and. dabs(Y(1)) < dY) jmin = -1
-  SoA(1) = SYM1
-  SoA(2) = SYM2
-  SoA(3) = SYM3
-  call symmetry_bd(2,ex,f1,fh1,SoA)
-  call symmetry_bd(2,ex,f2,fh2,SoA)
-  d12dx = ONE/F12/dX
-  d12dy = ONE/F12/dY
-  d12dz = ONE/F12/dZ
-  d2dx = ONE/TWO/dX
-  d2dy = ONE/TWO/dY
-  d2dz = ONE/TWO/dZ
-  f1x = ZEO; f1y = ZEO; f1z = ZEO
-  f2x = ZEO; f2y = ZEO; f2z = ZEO
-  do k=1,ex(3)-1
-  do j=1,ex(2)-1
-  do i=1,ex(1)-1
-   if(i+2 <= imax .and. i-2 >= imin .and. &
-      j+2 <= jmax .and. j-2 >= jmin .and. &
-      k+2 <= kmax .and. k-2 >= kmin) then
-      f1x(i,j,k)=d12dx*(fh1(i-2,j,k)-EIT*fh1(i-1,j,k)+EIT*fh1(i+1,j,k)-fh1(i+2,j,k))
-      f1y(i,j,k)=d12dy*(fh1(i,j-2,k)-EIT*fh1(i,j-1,k)+EIT*fh1(i,j+1,k)-fh1(i,j+2,k))
-      f1z(i,j,k)=d12dz*(fh1(i,j,k-2)-EIT*fh1(i,j,k-1)+EIT*fh1(i,j,k+1)-fh1(i,j,k+2))
-      f2x(i,j,k)=d12dx*(fh2(i-2,j,k)-EIT*fh2(i-1,j,k)+EIT*fh2(i+1,j,k)-fh2(i+2,j,k))
-      f2y(i,j,k)=d12dy*(fh2(i,j-2,k)-EIT*fh2(i,j-1,k)+EIT*fh2(i,j+1,k)-fh2(i,j+2,k))
-      f2z(i,j,k)=d12dz*(fh2(i,j,k-2)-EIT*fh2(i,j,k-1)+EIT*fh2(i,j,k+1)-fh2(i,j,k+2))
-   elseif(i+1 <= imax .and. i-1 >= imin .and. &
-          j+1 <= jmax .and. j-1 >= jmin .and. &
-          k+1 <= kmax .and. k-1 >= kmin) then
-      f1x(i,j,k)=d2dx*(-fh1(i-1,j,k)+fh1(i+1,j,k))
-      f1y(i,j,k)=d2dy*(-fh1(i,j-1,k)+fh1(i,j+1,k))
-      f1z(i,j,k)=d2dz*(-fh1(i,j,k-1)+fh1(i,j,k+1))
-      f2x(i,j,k)=d2dx*(-fh2(i-1,j,k)+fh2(i+1,j,k))
-      f2y(i,j,k)=d2dy*(-fh2(i,j-1,k)+fh2(i,j+1,k))
-      f2z(i,j,k)=d2dz*(-fh2(i,j,k-1)+fh2(i,j,k+1))
-   endif
-  enddo
-  enddo
-  enddo
-  return
-  end subroutine fderivs_batch2
 #elif (ghost_width == 4)
 ! sixth order code
@@ -2380,9 +2144,6 @@
  end subroutine fderivs
-!-----------------------------------------------------------------------------
-! batch first derivatives (4 fields), same symmetry setup
-!-----------------------------------------------------------------------------
 !-----------------------------------------------------------------------------
 !
 ! single derivatives dx
 !
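The interior/boundary split used in the fderivs and fdderivs hunks above can be checked in isolation. The following is a minimal sketch, not the repository's code: it assumes no symmetry ghost zones (the lower index bound is 1 in every direction), computes only the x-derivative, and uses hypothetical names (`split_stencil_check`, `fits`, `n`, `h`). It compares a single branching loop against a branch-free interior loop plus a serial boundary shell. The OpenMP directive is an ordinary comment unless OpenMP is enabled at compile time, so the program should run either way.

```fortran
program split_stencil_check
  implicit none
  integer, parameter :: n = 12          ! hypothetical grid size per direction
  real*8,  parameter :: h = 0.1d0       ! hypothetical uniform grid spacing
  real*8,  parameter :: EIT = 8.d0
  real*8 :: f(n,n,n), fx_ref(n,n,n), fx_new(n,n,n)
  real*8 :: d12dx, d2dx
  integer :: i, j, k

  d12dx = 1.d0/(12.d0*h)
  d2dx  = 1.d0/(2.d0*h)

  ! Smooth test field
  do k = 1, n
  do j = 1, n
  do i = 1, n
     f(i,j,k) = sin(h*i)*cos(h*j) + 0.5d0*h*k
  enddo
  enddo
  enddo

  fx_ref = 0.d0
  fx_new = 0.d0

  ! Reference: one loop that branches at every point
  do k = 1, n-1
  do j = 1, n-1
  do i = 1, n-1
     if (fits(i,j,k,2)) then
        fx_ref(i,j,k) = d12dx*(f(i-2,j,k)-EIT*f(i-1,j,k)+EIT*f(i+1,j,k)-f(i+2,j,k))
     elseif (fits(i,j,k,1)) then
        fx_ref(i,j,k) = d2dx*(-f(i-1,j,k)+f(i+1,j,k))
     endif
  enddo
  enddo
  enddo

  ! Interior: the full 4th-order stencil is in bounds, so no branch is needed
  !$OMP PARALLEL DO COLLAPSE(2) SCHEDULE(static) PRIVATE(i,j,k)
  do k = 3, n-2
  do j = 3, n-2
  do i = 3, n-2
     fx_new(i,j,k) = d12dx*(f(i-2,j,k)-EIT*f(i-1,j,k)+EIT*f(i+1,j,k)-f(i+2,j,k))
  enddo
  enddo
  enddo
  !$OMP END PARALLEL DO

  ! Boundary shell: skip interior points, fall back to 2nd order where it fits
  do k = 1, n-1
  do j = 1, n-1
  do i = 1, n-1
     if (fits(i,j,k,2)) cycle
     if (fits(i,j,k,1)) fx_new(i,j,k) = d2dx*(-f(i-1,j,k)+f(i+1,j,k))
  enddo
  enddo
  enddo

  if (maxval(abs(fx_new - fx_ref)) > 1.d-12) error stop 'split loops differ from reference'
  print *, 'max difference:', maxval(abs(fx_new - fx_ref))

contains

  ! True when a stencil of half-width w around (ii,jj,kk) stays inside 1..n
  logical function fits(ii, jj, kk, w)
    integer, intent(in) :: ii, jj, kk, w
    fits = (min(ii,jj,kk) - w >= 1) .and. (max(ii,jj,kk) + w <= n)
  end function fits

end program split_stencil_check
```

The comparison uses a small tolerance rather than exact equality because a compiler may vectorize or contract the two loops differently. With symmetry ghost zones the boundary shell also has to keep the 4th-order branch, which is why the commit retains the full branching logic there.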
--- a/AMSS_NCKU_source/enforce_algebra.f90
+++ b/AMSS_NCKU_source/enforce_algebra.f90
@@ -19,48 +19,60 @@
 !~~~~~~~> Local variable:
-  real*8, dimension(ex(1),ex(2),ex(3)) :: trA,detg
-  real*8, dimension(ex(1),ex(2),ex(3)) :: gxx,gyy,gzz 
-  real*8, dimension(ex(1),ex(2),ex(3)) :: gupxx,gupxy,gupxz,gupyy,gupyz,gupzz
+  integer :: i,j,k
+  real*8 :: lgxx,lgyy,lgzz,ldetg
+  real*8 :: lgupxx,lgupxy,lgupxz,lgupyy,lgupyz,lgupzz
+  real*8 :: ltrA,lscale
  real*8, parameter :: F1o3 = 1.D0 / 3.D0, ONE = 1.D0, TWO = 2.D0
 !~~~~~~>
-  gxx = dxx + ONE
-  gyy = dyy + ONE
-  gzz = dzz + ONE
-  detg =  gxx * gyy * gzz + gxy * gyz * gxz + gxz * gxy * gyz - &
-          gxz * gyy * gxz - gxy * gxy * gzz - gxx * gyz * gyz
-  gupxx =   ( gyy * gzz - gyz * gyz ) / detg
-  gupxy = - ( gxy * gzz - gyz * gxz ) / detg
-  gupxz =   ( gxy * gyz - gyy * gxz ) / detg
-  gupyy =   ( gxx * gzz - gxz * gxz ) / detg
-  gupyz = - ( gxx * gyz - gxy * gxz ) / detg
-  gupzz =   ( gxx * gyy - gxy * gxy ) / detg
-  trA =         gupxx * Axx + gupyy * Ayy + gupzz * Azz &
-       + TWO * (gupxy * Axy + gupxz * Axz + gupyz * Ayz)
-  Axx = Axx - F1o3 * gxx * trA
-  Axy = Axy - F1o3 * gxy * trA
-  Axz = Axz - F1o3 * gxz * trA
-  Ayy = Ayy - F1o3 * gyy * trA
-  Ayz = Ayz - F1o3 * gyz * trA
-  Azz = Azz - F1o3 * gzz * trA
-  detg = ONE / ( detg ** F1o3 ) 
-  gxx = gxx * detg
-  gxy = gxy * detg
-  gxz = gxz * detg
-  gyy = gyy * detg
-  gyz = gyz * detg
-  gzz = gzz * detg
-  dxx = gxx - ONE
-  dyy = gyy - ONE
-  dzz = gzz - ONE
+  do k=1,ex(3)
+  do j=1,ex(2)
+  do i=1,ex(1)
+    lgxx = dxx(i,j,k) + ONE
+    lgyy = dyy(i,j,k) + ONE
+    lgzz = dzz(i,j,k) + ONE
+    ldetg =  lgxx * lgyy * lgzz &
+           + gxy(i,j,k) * gyz(i,j,k) * gxz(i,j,k) &
+           + gxz(i,j,k) * gxy(i,j,k) * gyz(i,j,k) &
+           - gxz(i,j,k) * lgyy * gxz(i,j,k) &
+           - gxy(i,j,k) * gxy(i,j,k) * lgzz &
+           - lgxx * gyz(i,j,k) * gyz(i,j,k)
+    lgupxx =   ( lgyy * lgzz - gyz(i,j,k) * gyz(i,j,k) ) / ldetg
+    lgupxy = - ( gxy(i,j,k) * lgzz - gyz(i,j,k) * gxz(i,j,k) ) / ldetg
+    lgupxz =   ( gxy(i,j,k) * gyz(i,j,k) - lgyy * gxz(i,j,k) ) / ldetg
+    lgupyy =   ( lgxx * lgzz - gxz(i,j,k) * gxz(i,j,k) ) / ldetg
+    lgupyz = - ( lgxx * gyz(i,j,k) - gxy(i,j,k) * gxz(i,j,k) ) / ldetg
+    lgupzz =   ( lgxx * lgyy - gxy(i,j,k) * gxy(i,j,k) ) / ldetg
+    ltrA =         lgupxx * Axx(i,j,k) + lgupyy * Ayy(i,j,k) &
+                 + lgupzz * Azz(i,j,k) &
+         + TWO * (lgupxy * Axy(i,j,k) + lgupxz * Axz(i,j,k) &
+                 + lgupyz * Ayz(i,j,k))
+    Axx(i,j,k) = Axx(i,j,k) - F1o3 * lgxx * ltrA
+    Axy(i,j,k) = Axy(i,j,k) - F1o3 * gxy(i,j,k) * ltrA
+    Axz(i,j,k) = Axz(i,j,k) - F1o3 * gxz(i,j,k) * ltrA
+    Ayy(i,j,k) = Ayy(i,j,k) - F1o3 * lgyy * ltrA
+    Ayz(i,j,k) = Ayz(i,j,k) - F1o3 * gyz(i,j,k) * ltrA
+    Azz(i,j,k) = Azz(i,j,k) - F1o3 * lgzz * ltrA
+    lscale = ONE / ( ldetg ** F1o3 )
+
+    dxx(i,j,k) = lgxx * lscale - ONE
+    gxy(i,j,k) = gxy(i,j,k) * lscale
+    gxz(i,j,k) = gxz(i,j,k) * lscale
+    dyy(i,j,k) = lgyy * lscale - ONE
+    gyz(i,j,k) = gyz(i,j,k) * lscale
+    dzz(i,j,k) = lgzz * lscale - ONE
+  enddo
+  enddo
+  enddo
  return
@@ -83,50 +95,70 @@
 !~~~~~~~> Local variable:
-  real*8, dimension(ex(1),ex(2),ex(3)) :: trA
-  real*8, dimension(ex(1),ex(2),ex(3)) :: gxx,gyy,gzz 
-  real*8, dimension(ex(1),ex(2),ex(3)) :: gupxx,gupxy,gupxz,gupyy,gupyz,gupzz
+  integer :: i,j,k
+  real*8 :: lgxx,lgyy,lgzz,lscale
+  real*8 :: lgxy,lgxz,lgyz
+  real*8 :: lgupxx,lgupxy,lgupxz,lgupyy,lgupyz,lgupzz
+  real*8 :: ltrA
  real*8, parameter :: F1o3 = 1.D0 / 3.D0, ONE = 1.D0, TWO = 2.D0
 !~~~~~~>
-  gxx = dxx + ONE
-  gyy = dyy + ONE
-  gzz = dzz + ONE
-! for g
-  gupzz =  gxx * gyy * gzz + gxy * gyz * gxz + gxz * gxy * gyz - &
-           gxz * gyy * gxz - gxy * gxy * gzz - gxx * gyz * gyz
-  gupzz = ONE / ( gupzz ** F1o3 ) 
-  gxx = gxx * gupzz
-  gxy = gxy * gupzz
-  gxz = gxz * gupzz
-  gyy = gyy * gupzz
-  gyz = gyz * gupzz
-  gzz = gzz * gupzz
-  dxx = gxx - ONE
-  dyy = gyy - ONE
-  dzz = gzz - ONE
-! for A  
-  gupxx =   ( gyy * gzz - gyz * gyz )
-  gupxy = - ( gxy * gzz - gyz * gxz )
-  gupxz =   ( gxy * gyz - gyy * gxz )
-  gupyy =   ( gxx * gzz - gxz * gxz )
-  gupyz = - ( gxx * gyz - gxy * gxz )
-  gupzz =   ( gxx * gyy - gxy * gxy )
-  trA =         gupxx * Axx + gupyy * Ayy + gupzz * Azz &
-       + TWO * (gupxy * Axy + gupxz * Axz + gupyz * Ayz)
-  Axx = Axx - F1o3 * gxx * trA
-  Axy = Axy - F1o3 * gxy * trA
-  Axz = Axz - F1o3 * gxz * trA
-  Ayy = Ayy - F1o3 * gyy * trA
-  Ayz = Ayz - F1o3 * gyz * trA
-  Azz = Azz - F1o3 * gzz * trA
+  do k=1,ex(3)
+  do j=1,ex(2)
+  do i=1,ex(1)
+! for g: normalize determinant first
+    lgxx = dxx(i,j,k) + ONE
+    lgyy = dyy(i,j,k) + ONE
+    lgzz = dzz(i,j,k) + ONE
+    lgxy = gxy(i,j,k)
+    lgxz = gxz(i,j,k)
+    lgyz = gyz(i,j,k)
+    lscale =  lgxx * lgyy * lgzz + lgxy * lgyz * lgxz &
+            + lgxz * lgxy * lgyz - lgxz * lgyy * lgxz &
+            - lgxy * lgxy * lgzz - lgxx * lgyz * lgyz
+    lscale = ONE / ( lscale ** F1o3 )
+    lgxx = lgxx * lscale
+    lgxy = lgxy * lscale
+    lgxz = lgxz * lscale
+    lgyy = lgyy * lscale
+    lgyz = lgyz * lscale
+    lgzz = lgzz * lscale
+    dxx(i,j,k) = lgxx - ONE
+    gxy(i,j,k) = lgxy
+    gxz(i,j,k) = lgxz
+    dyy(i,j,k) = lgyy - ONE
+    gyz(i,j,k) = lgyz
+    dzz(i,j,k) = lgzz - ONE
+! for A: trace-free using normalized metric (det=1, no division needed)
+    lgupxx =   ( lgyy * lgzz - lgyz * lgyz )
+    lgupxy = - ( lgxy * lgzz - lgyz * lgxz )
+    lgupxz =   ( lgxy * lgyz - lgyy * lgxz )
+    lgupyy =   ( lgxx * lgzz - lgxz * lgxz )
+    lgupyz = - ( lgxx * lgyz - lgxy * lgxz )
+    lgupzz =   ( lgxx * lgyy - lgxy * lgxy )
+    ltrA =         lgupxx * Axx(i,j,k) + lgupyy * Ayy(i,j,k) &
+                 + lgupzz * Azz(i,j,k) &
+         + TWO * (lgupxy * Axy(i,j,k) + lgupxz * Axz(i,j,k) &
+                 + lgupyz * Ayz(i,j,k))
+    Axx(i,j,k) = Axx(i,j,k) - F1o3 * lgxx * ltrA
+    Axy(i,j,k) = Axy(i,j,k) - F1o3 * lgxy * ltrA
+    Axz(i,j,k) = Axz(i,j,k) - F1o3 * lgxz * ltrA
+    Ayy(i,j,k) = Ayy(i,j,k) - F1o3 * lgyy * ltrA
+    Ayz(i,j,k) = Ayz(i,j,k) - F1o3 * lgyz * ltrA
+    Azz(i,j,k) = Azz(i,j,k) - F1o3 * lgzz * ltrA
+  enddo
+  enddo
+  enddo
  return
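The comment "det=1, no division needed" in the hunk above rests on the fact that, once a symmetric 3x3 metric is rescaled to unit determinant, its cofactor matrix is already its inverse. The following is a minimal single-point sketch of that step, not the repository's routine; the program name, the `u**` variable names, and the sample metric values are assumptions chosen only for illustration.

```fortran
program unit_det_inverse
  implicit none
  real*8, parameter :: F1o3 = 1.d0/3.d0
  real*8 :: gxx, gxy, gxz, gyy, gyz, gzz, det, s
  real*8 :: uxx, uxy, uxz, uyy, uyz, uzz

  ! Hypothetical positive-definite sample metric at one grid point
  gxx = 1.2d0;  gxy = 0.1d0;  gxz = -0.05d0
  gyy = 0.9d0;  gyz = 0.2d0;  gzz = 1.1d0

  det = gxx*gyy*gzz + gxy*gyz*gxz + gxz*gxy*gyz &
      - gxz*gyy*gxz - gxy*gxy*gzz - gxx*gyz*gyz

  ! Rescale so that the determinant becomes one
  s = 1.d0/(det**F1o3)
  gxx = gxx*s;  gxy = gxy*s;  gxz = gxz*s
  gyy = gyy*s;  gyz = gyz*s;  gzz = gzz*s

  det = gxx*gyy*gzz + gxy*gyz*gxz + gxz*gxy*gyz &
      - gxz*gyy*gxz - gxy*gxy*gzz - gxx*gyz*gyz

  ! Cofactors of the rescaled metric, with no division by det
  uxx =   ( gyy*gzz - gyz*gyz )
  uxy = - ( gxy*gzz - gyz*gxz )
  uxz =   ( gxy*gyz - gyy*gxz )
  uyy =   ( gxx*gzz - gxz*gxz )
  uyz = - ( gxx*gyz - gxy*gxz )
  uzz =   ( gxx*gyy - gxy*gxy )

  ! First row of g times the cofactor matrix should be (1, 0, 0)
  if (abs(det - 1.d0) > 1.d-12) error stop 'determinant not normalized'
  if (abs(gxx*uxx + gxy*uxy + gxz*uxz - 1.d0) > 1.d-12) error stop 'diagonal check failed'
  if (abs(gxx*uxy + gxy*uyy + gxz*uyz) > 1.d-12) error stop 'off-diagonal xy check failed'
  if (abs(gxx*uxz + gxy*uyz + gxz*uzz) > 1.d-12) error stop 'off-diagonal xz check failed'
  print *, 'det after rescaling:', det
end program unit_det_inverse
```

This only holds if the metric is rescaled before the cofactors are formed, which is the order the second routine in the hunk follows.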
--- a/AMSS_NCKU_source/fmisc.f90
+++ b/AMSS_NCKU_source/fmisc.f90
@@ -324,7 +324,6 @@ subroutine symmetry_bd(ord,extc,func,funcc,SoA)
  integer::i
  funcc = 0.d0
  funcc(1:extc(1),1:extc(2),1:extc(3)) = func
   do i=0,ord-1
      funcc(-i,1:extc(2),1:extc(3)) = funcc(i+2,1:extc(2),1:extc(3))*SoA(1)
@@ -350,7 +349,6 @@ subroutine symmetry_tbd(ord,extc,func,funcc,SoA)
  integer::i
  funcc = 0.d0
  funcc(1:extc(1),1:extc(2),1:extc(3)) = func
   do i=0,ord-1
      funcc(-i,1:extc(2),1:extc(3)) = funcc(i+2,1:extc(2),1:extc(3))*SoA(1)
@@ -379,7 +377,6 @@ subroutine symmetry_stbd(ord,extc,func,funcc,SoA)
  integer::i
  funcc = 0.d0
  funcc(1:extc(1),1:extc(2),1:extc(3)) = func
   do i=0,ord-1
      funcc(-i,1:extc(2),1:extc(3)) = funcc(i+2,1:extc(2),1:extc(3))*SoA(1)
@@ -884,10 +881,18 @@ subroutine symmetry_bd(ord,extc,func,funcc,SoA)
  real*8, dimension(-ord+1:extc(1),-ord+1:extc(2),-ord+1:extc(3)),intent(out):: funcc
  real*8, dimension(1:3), intent(in) :: SoA
-  integer::i
-  funcc = 0.d0
-  funcc(1:extc(1),1:extc(2),1:extc(3)) = func
+  integer::i,j,k
+  !$OMP PARALLEL DO COLLAPSE(2) SCHEDULE(static) PRIVATE(i,j,k)
+  do k=1,extc(3)
+  do j=1,extc(2)
+  do i=1,extc(1)
+     funcc(i,j,k) = func(i,j,k)
+  enddo
+  enddo
+  enddo
+  !$OMP END PARALLEL DO
   do i=0,ord-1
      funcc(-i,1:extc(2),1:extc(3)) = funcc(i+1,1:extc(2),1:extc(3))*SoA(1)
   enddo
@@ -912,7 +917,6 @@ subroutine symmetry_tbd(ord,extc,func,funcc,SoA)
  integer::i
  funcc = 0.d0
  funcc(1:extc(1),1:extc(2),1:extc(3)) = func
   do i=0,ord-1
      funcc(-i,1:extc(2),1:extc(3)) = funcc(i+1,1:extc(2),1:extc(3))*SoA(1)
@@ -941,7 +945,6 @@ subroutine symmetry_stbd(ord,extc,func,funcc,SoA)
  integer::i
  funcc = 0.d0
  funcc(1:extc(1),1:extc(2),1:extc(3)) = func
   do i=0,ord-1
      funcc(-i,1:extc(2),1:extc(3)) = funcc(i+1,1:extc(2),1:extc(3))*SoA(1)
@@ -1119,25 +1122,16 @@ end subroutine d2dump
 !------------------------------------------------------------------------------
  subroutine polint(xa, ya, x, y, dy, ordn)
  implicit none
 !~~~~~~> Input Parameter:
  integer, intent(in) :: ordn
  real*8, dimension(ordn), intent(in) :: xa, ya
  real*8, intent(in) :: x
  real*8, intent(out) :: y, dy
-!~~~~~~> Other parameter:
-
-  integer :: m,n,ns
-  real*8, dimension(ordn) :: c,d,den,ho
-  real*8 :: dif,dift
+  integer :: i, m, ns, n_m
+  real*8, dimension(ordn) :: c, d, ho
+  real*8 :: dif, dift, hp, h, den_val
 !~~~~~~>
-  n=ordn
-  m=ordn
  c = ya
  d = ya
@@ -1145,27 +1139,38 @@ end subroutine d2dump
  ns = 1
  dif = abs(x - xa(1))
-  do m=1,n
-   dift=abs(x-xa(m))
+
+  do i = 2, ordn
+    dift = abs(x - xa(i))
    if (dift < dif) then
-    ns=m
+      ns = i
      dif = dift
    end if
  end do
  y = ya(ns)
  ns = ns - 1
-  do m=1,n-1
-    den(1:n-m)=ho(1:n-m)-ho(1+m:n)
-    if (any(den(1:n-m) == 0.0))then
+
+  do m = 1, ordn - 1
+    n_m = ordn - m
+    do i = 1, n_m
+      hp = ho(i)
+      h  = ho(i+m)
+      den_val = hp - h
+      if (den_val == 0.0d0) then
        write(*,*) 'failure in polint for point',x
        write(*,*) 'with input points: ',xa
        stop
      end if
-    den(1:n-m)=(c(2:n-m+1)-d(1:n-m))/den(1:n-m)
-    d(1:n-m)=ho(1+m:n)*den(1:n-m)
-    c(1:n-m)=ho(1:n-m)*den(1:n-m)
-    if (2*ns < n-m) then
+
+      den_val = (c(i+1) - d(i)) / den_val
+
+      d(i) = h * den_val
+      c(i) = hp * den_val
+    end do
+    if (2 * ns < n_m) then
      dy = c(ns + 1)
    else
      dy = d(ns)
@@ -1175,7 +1180,6 @@ end subroutine d2dump
  end do
  return
  end subroutine polint
 !------------------------------------------------------------------------------
 !
@@ -1183,35 +1187,37 @@ end subroutine d2dump
 !
 !------------------------------------------------------------------------------
  subroutine polin2(x1a,x2a,ya,x1,x2,y,dy,ordn)
  implicit none
 !~~~~~~> Input parameters:
  integer,intent(in) :: ordn
  real*8, dimension(1:ordn), intent(in) :: x1a,x2a
  real*8, dimension(1:ordn,1:ordn), intent(in) :: ya
  real*8, intent(in) :: x1,x2
  real*8, intent(out) :: y,dy
-!~~~~~~> Other parameters:
+#ifdef POLINT_LEGACY_ORDER
  integer  :: i,m
  real*8, dimension(ordn) :: ymtmp
  real*8, dimension(ordn) :: yntmp
  m=size(x1a)
  do i=1,m
    yntmp=ya(i,:)
    call polint(x2a,yntmp,x2,ymtmp(i),dy,ordn)
  end do
  call polint(x1a,ymtmp,x1,y,dy,ordn)
+#else
+  integer  :: j
+  real*8, dimension(ordn) :: ymtmp
+  real*8 :: dy_temp
+  do j=1,ordn
+    call polint(x1a, ya(:,j), x1, ymtmp(j), dy_temp, ordn)
+  end do
+  call polint(x2a, ymtmp, x2, y, dy, ordn)
+#endif
  return
  end subroutine polin2
 !------------------------------------------------------------------------------
 !
@@ -1219,18 +1225,15 @@ end subroutine d2dump
 !
 !------------------------------------------------------------------------------
  subroutine polin3(x1a,x2a,x3a,ya,x1,x2,x3,y,dy,ordn)
  implicit none
 !~~~~~~> Input parameters:
  integer,intent(in) :: ordn
  real*8, dimension(1:ordn), intent(in) :: x1a,x2a,x3a
  real*8, dimension(1:ordn,1:ordn,1:ordn), intent(in) :: ya
  real*8, intent(in) :: x1,x2,x3
  real*8, intent(out) :: y,dy
-!~~~~~~> Other parameters:
+#ifdef POLINT_LEGACY_ORDER
  integer  :: i,j,m,n
  real*8, dimension(ordn,ordn) :: yatmp
  real*8, dimension(ordn) :: ymtmp
@@ -1239,24 +1242,33 @@ end subroutine d2dump
  m=size(x1a)
  n=size(x2a)
  do i=1,m
   do j=1,n
    yqtmp=ya(i,j,:)
    call polint(x3a,yqtmp,x3,yatmp(i,j),dy,ordn)
   end do
    yntmp=yatmp(i,:)
    call polint(x2a,yntmp,x2,ymtmp(i),dy,ordn)
  end do
  call polint(x1a,ymtmp,x1,y,dy,ordn)
+#else
+  integer  :: j, k
+  real*8, dimension(ordn,ordn) :: yatmp
+  real*8, dimension(ordn) :: ymtmp
+  real*8 :: dy_temp
+  do k=1,ordn
+    do j=1,ordn
+      call polint(x1a, ya(:,j,k), x1, yatmp(j,k), dy_temp, ordn)
+    end do
+  end do
+  do k=1,ordn
+    call polint(x2a, yatmp(:,k), x2, ymtmp(k), dy_temp, ordn)
+  end do
+  call polint(x3a, ymtmp, x3, y, dy, ordn)
+#endif
  return
  end subroutine polin3
 !--------------------------------------------------------------------------------------
 ! calculate L2norm
--- a/AMSS_NCKU_source/kodiss.f90
+++ b/AMSS_NCKU_source/kodiss.f90
@@ -159,36 +159,12 @@ integer, parameter :: NO_SYMM=0, OCTANT=2
  call symmetry_bd(3,ex,f,fh,SoA)
-  do k=1,ex(3)
-  do j=1,ex(2)
-  do i=1,ex(1)
-
-  if(i-3 >= imin .and. i+3 <= imax .and. &
-     j-3 >= jmin .and. j+3 <= jmax .and. &
-     k-3 >= kmin .and. k+3 <= kmax) then
-#if 0     
-! x direction
-   f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/dX/cof * (     &
-                              (fh(i-3,j,k)+fh(i+3,j,k)) - &
-                          SIX*(fh(i-2,j,k)+fh(i+2,j,k)) + &
-                          FIT*(fh(i-1,j,k)+fh(i+1,j,k)) - &
-                          TWT* fh(i,j,k)            )
-! y direction
-   f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/dY/cof * (     &
-                              (fh(i,j-3,k)+fh(i,j+3,k)) - &
-                          SIX*(fh(i,j-2,k)+fh(i,j+2,k)) + &
-                          FIT*(fh(i,j-1,k)+fh(i,j+1,k)) - &
-                          TWT* fh(i,j,k)            )
-! z direction
-   f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/dZ/cof * (     &
-                              (fh(i,j,k-3)+fh(i,j,k+3)) - &
-                          SIX*(fh(i,j,k-2)+fh(i,j,k+2)) + &
-                          FIT*(fh(i,j,k-1)+fh(i,j,k+1)) - &
-                          TWT* fh(i,j,k)            )
-#else
+! Interior: all stencil points guaranteed in-bounds
+  !$OMP PARALLEL DO COLLAPSE(2) SCHEDULE(static) PRIVATE(i,j,k)
+  do k=4,ex(3)-3
+  do j=4,ex(2)-3
+  !DIR$ IVDEP
+  do i=4,ex(1)-3
 ! calculation order if important ?
   f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/cof *( (     &
                              (fh(i-3,j,k)+fh(i+3,j,k)) - &
                          SIX*(fh(i-2,j,k)+fh(i+2,j,k)) + &
@@ -204,9 +180,37 @@ integer, parameter :: NO_SYMM=0, OCTANT=2
                          SIX*(fh(i,j,k-2)+fh(i,j,k+2)) + &
                          FIT*(fh(i,j,k-1)+fh(i,j,k+1)) - &
                          TWT* fh(i,j,k)            )/dZ )
-#endif
-  endif
+  enddo
+  enddo
  enddo
+  !$OMP END PARALLEL DO
+! Boundary shell: original branching logic for points near edges
+  do k=1,ex(3)
+  do j=1,ex(2)
+  do i=1,ex(1)
+  if(i >= 4 .and. i <= ex(1)-3 .and. &
+     j >= 4 .and. j <= ex(2)-3 .and. &
+     k >= 4 .and. k <= ex(3)-3) cycle
+  if(i-3 >= imin .and. i+3 <= imax .and. &
+     j-3 >= jmin .and. j+3 <= jmax .and. &
+     k-3 >= kmin .and. k+3 <= kmax) then
+   f_rhs(i,j,k)       = f_rhs(i,j,k) + eps/cof *( (     &
+                              (fh(i-3,j,k)+fh(i+3,j,k)) - &
+                          SIX*(fh(i-2,j,k)+fh(i+2,j,k)) + &
+                          FIT*(fh(i-1,j,k)+fh(i+1,j,k)) - &
+                          TWT* fh(i,j,k)            )/dX + &
+                                                  (     &
+                              (fh(i,j-3,k)+fh(i,j+3,k)) - &
+                          SIX*(fh(i,j-2,k)+fh(i,j+2,k)) + &
+                          FIT*(fh(i,j-1,k)+fh(i,j+1,k)) - &
+                          TWT* fh(i,j,k)            )/dY + &
+                                                  (     &
+                              (fh(i,j,k-3)+fh(i,j,k+3)) - &
+                          SIX*(fh(i,j,k-2)+fh(i,j,k+2)) + &
+                          FIT*(fh(i,j,k-1)+fh(i,j,k+1)) - &
+                          TWT* fh(i,j,k)            )/dZ )
+  endif
+  enddo
  enddo
  enddo
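The `kodiss.f90` change above replaces one guarded triple loop with an unguarded interior loop plus a guarded shell loop. The sketch below is a small pure-Python model of that split, intended only to check that both variants touch the same points with the same values. It makes several assumptions that are not taken from the source: indices are 0-based, `lo` plays the role of `imin/jmin/kmin` (with `-3` standing for a symmetry ghost zone), the coefficients 6, 15, 20 stand in for `SIX`, `FIT`, `TWT`, `scale` stands in for `eps/cof`, and the function returns the increment instead of accumulating into `f_rhs`.

```python
import itertools
import random


def ko_update(fh, p, h, scale):
    """Dissipation increment at point p: scale * sum over axes of stencil / h[axis]."""
    total = 0.0
    for axis in range(3):
        def at(off):
            q = list(p)
            q[axis] += off
            return fh[tuple(q)]
        total += ((at(-3) + at(3)) - 6.0 * (at(-2) + at(2))
                  + 15.0 * (at(-1) + at(1)) - 20.0 * at(0)) / h[axis]
    return scale * total


def in_bounds(p, lo, ex):
    """The original guard: the full 7-point stencil fits in every direction."""
    return all(p[a] - 3 >= lo[a] and p[a] + 3 <= ex[a] - 1 for a in range(3))


def is_interior(p, ex):
    """0-based equivalent of 4 <= index <= ex-3 in the Fortran loops."""
    return all(3 <= p[a] <= ex[a] - 4 for a in range(3))


def guarded(fh, lo, ex, h, scale):
    """Single loop over the whole block with the bounds test on every point."""
    rhs = {}
    for p in itertools.product(*(range(n) for n in ex)):
        if in_bounds(p, lo, ex):
            rhs[p] = ko_update(fh, p, h, scale)
    return rhs


def split(fh, lo, ex, h, scale):
    """Interior loop without a bounds test, then a shell loop that keeps it."""
    rhs = {}
    for p in itertools.product(*(range(3, n - 3) for n in ex)):
        rhs[p] = ko_update(fh, p, h, scale)
    for p in itertools.product(*(range(n) for n in ex)):
        if is_interior(p, ex):
            continue
        if in_bounds(p, lo, ex):
            rhs[p] = ko_update(fh, p, h, scale)
    return rhs


# Hypothetical block: ghost zones (lo = -3) in x and z, none in y.
ex = (8, 9, 10)
lo = (-3, 0, -3)
h = (0.1, 0.1, 0.1)
random.seed(0)
fh = {p: random.random()
      for p in itertools.product(*(range(lo[a], ex[a]) for a in range(3)))}

rhs_guarded = guarded(fh, lo, ex, h, 0.01)
rhs_split = split(fh, lo, ex, h, 0.01)
print(len(rhs_split), rhs_guarded == rhs_split)
```

Each interior iteration writes only its own point and reads only `fh`, which is what makes the interior loop a reasonable candidate for `PARALLEL DO`. Note that the shell loop in this form still visits every point in the block and skips the interior ones, so its loop overhead scales with the full block size.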
--- a/AMSS_NCKU_source/lopsidediff.f90
+++ b/AMSS_NCKU_source/lopsidediff.f90
@@ -233,6 +233,7 @@ subroutine lopsided(ex,X,Y,Z,f,f_rhs,Sfx,Sfy,Sfz,Symmetry,SoA)
 ! upper bound set ex-1 only for efficiency,
 ! the loop body will set ex 0 also
+ !$OMP PARALLEL DO COLLAPSE(2) SCHEDULE(static) PRIVATE(i,j,k)
  do k=1,ex(3)-1
  do j=1,ex(2)-1
  do i=1,ex(1)-1
@@ -482,6 +483,7 @@ subroutine lopsided(ex,X,Y,Z,f,f_rhs,Sfx,Sfy,Sfz,Symmetry,SoA)
  enddo
  enddo
  enddo
+ !$OMP END PARALLEL DO
  return
--- a/AMSS_NCKU_source/makefile.inc
+++ b/AMSS_NCKU_source/makefile.inc
@@ -8,7 +8,7 @@ filein  = -I/usr/include/ -I${MKLROOT}/include
 ## Using sequential MKL (OpenMP disabled for better single-threaded performance)
 ## Added -lifcore for Intel Fortran runtime and -limf for Intel math library
-LDLIBS  = -L${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lifcore -limf -lpthread -lm -ldl
+LDLIBS  = -L${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lifcore -limf -lpthread -lm -ldl -qopenmp
 ## Aggressive optimization flags:
 ## -O3: Maximum optimization
@@ -16,10 +16,10 @@ LDLIBS  = -L${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lifcore
 ## -fp-model fast=2: Aggressive floating-point optimizations
 ## -fma: Enable fused multiply-add instructions
 ## Note: OpenMP has been disabled (-qopenmp removed) due to performance issues
-CXXAPPFLAGS  = -O3 -xHost -fp-model fast=2 -fma \
+CXXAPPFLAGS  = -O3 -xHost -fp-model fast=2 -fma -ipo -qopenmp \
               -Dfortran3 -Dnewc -I${MKLROOT}/include
-f90appflags  = -O3 -xHost -fp-model fast=2 -fma \
+f90appflags  = -O3 -xHost -fp-model fast=2 -fma -ipo -qopenmp \
-               -fpp -I${MKLROOT}/include
+               -align array64byte -fpp -I${MKLROOT}/include
 f90          = ifx
 f77          = ifx
 CXX          = icpx
--- a/generate_macrodef.py
+++ b/generate_macrodef.py
@@ -392,17 +392,6 @@ def generate_macrodef_fh():
        print( "# Finite_Difference_Method #define ghost_width setting error!!!",   file=file1 )
        print(                                                   file=file1 )
-   # Define macro DEBUG_NAN_CHECK
-   # 0: off (default), 1: on
-   debug_nan_check = getattr(input_data, "Debug_NaN_Check", 0)
-   if debug_nan_check:
-       print( "#define DEBUG_NAN_CHECK 1", file=file1 )
-       print(                             file=file1 )
-   else:
-       print( "#define DEBUG_NAN_CHECK 0", file=file1 )
-       print(                             file=file1 )
    # Whether to use a shell-patch grid
    # use shell or not
@@ -525,9 +514,6 @@ def generate_macrodef_fh():
    print( "    6th order: 4",                                                                      file=file1 )
    print( "    8th order: 5",                                                                      file=file1 )
    print(                                                                                          file=file1 )
-   print( "define DEBUG_NAN_CHECK",                                                                file=file1 )
-   print( "    0: off (default), 1: on",                                                           file=file1 )
-   print(                                                                                          file=file1 )
    print( "define WithShell",                                                                      file=file1 )
    print( "    use shell or not",                                                                  file=file1 )
    print(                                                                                          file=file1 )
--- a/inputfile_example/AMSS_NCKU_Input.py
+++ b/inputfile_example/AMSS_NCKU_Input.py
@@ -36,7 +36,6 @@ Equation_Class           = "BSSN"                  ## Evolution Equation: choose
 Initial_Data_Method      = "Ansorg-TwoPuncture"    ## initial data method: choose "Ansorg-TwoPuncture", "Lousto-Analytical", "Cao-Analytical", "KerrSchild-Analytical"
 Time_Evolution_Method    = "runge-kutta-45"        ## time evolution method: choose "runge-kutta-45"
 Finite_Diffenence_Method = "4th-order"             ## finite-difference method: choose "2nd-order", "4th-order", "6th-order", "8th-order"
-Debug_NaN_Check          = 0                       ## enable NaN checks in compute_rhs_bssn: 0 (off) or 1 (on)
 #################################################
--- a/makefile_and_run.py
+++ b/makefile_and_run.py
@@ -15,13 +15,12 @@ import subprocess
 ## taskset ensures all child processes inherit the CPU affinity mask
 ## This forces make and all compiler processes to use only nohz_full cores (4-55, 60-111)
 ## Format: taskset -c 4-55,60-111 ensures processes only run on these cores
-#NUMACTL_CPU_BIND = "taskset -c 4-55,60-111"
+NUMACTL_CPU_BIND = "taskset -c 0-111"
 NUMACTL_CPU_BIND = ""
 ## Build parallelism configuration
 ## Use nohz_full cores (4-55, 60-111) for compilation: 52 + 52 = 104 cores
 ## Set make -j to utilize available cores for faster builds
-BUILD_JOBS = 14
+BUILD_JOBS = 104
 ##################################################################
Author	SHA1	Message	Date
CGH0S7	c6e4d4ab71	Add OpenMP parallelization to BSSN RHS hot-path stencil routines Enable OpenMP threading for the dominant computational kernels: - makefile.inc: add -qopenmp to f90appflags - diff_new.f90: split fderivs/fdderivs into OpenMP interior + serial boundary - kodiss.f90: split kodis into OpenMP interior + serial boundary - lopsidediff.f90: add OMP PARALLEL DO COLLAPSE(2) to lopsided - fmisc.f90: parallelize symmetry_bd bulk array copy - bssn_rhs.f90: add OMP WORKSHARE to array-syntax operations Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-07 13:58:55 +08:00
CGH0S7	09ffdb553d	Eliminate hot-path heap allocations in TwoPunctures spectral solver Pre-allocate workspace buffers as class members to remove ~8M malloc/free pairs per Newton iteration from LineRelax, ThomasAlgorithm, JFD_times_dv, J_times_dv, chebft_Zeros, fourft, Derivatives_AB3, and F_of_v. Rewrite ThomasAlgorithm to operate in-place on input arrays. Results are bit-identical; no algorithmic changes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-06 21:20:35 +08:00
CGH0S7	699e443c7a	Optimize polint/polin2/polin3 interpolation for cache locality Changes: - polint: Rewrite Neville algorithm from array-slice operations to scalar loop. Mathematically identical, avoids temporary array allocations for den(1:n-m) slices. (Credit: yx-fmisc branch) - polin2: Swap interpolation order so inner loop accesses ya(:,j) (contiguous in Fortran column-major) instead of ya(i,:) (strided). Tensor product interpolation is commutative; all call sites pass identical coordinate arrays for all dimensions. - polin3: Swap interpolation order to process contiguous first dimension first: ya(:,j,k) -> yatmp(:,k) -> ymtmp(:). Same commutativity argument as polin2. Compile-time safety switch: -DPOLINT_LEGACY_ORDER restores original dimension ordering Default (no flag): uses optimized contiguous-memory ordering Usage: # Production (optimized order): make clean && make -j ABE # Fallback if results differ (original order): Add -DPOLINT_LEGACY_ORDER to f90appflags in makefile.inc Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-06 19:00:35 +08:00
CGH0S7	24bfa44911	Disable NaN sanity check in bssn_rhs.f90 for production builds Wrap the NaN sanity check (21 sum() full-array traversals per RHS call) with #ifdef DEBUG so it is compiled out in production builds. This eliminates 84 redundant full-array scans per timestep (21 per RHS call × 4 RK4 substages) that serve no purpose when input data is valid. Usage: - Production build (default): NaN check is disabled, no changes needed. - Debug build: add -DDEBUG to f90appflags in makefile.inc, e.g. f90appflags = -O3 ... -DDEBUG -fpp ... to re-enable the NaN sanity check. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-06 18:36:29 +08:00
CGH0S7	6738854a9d	Compiler-level and hot-path optimizations for GW150914 - makefile.inc: add -ipo (interprocedural optimization) and -align array64byte (64-byte array alignment for vectorization) - fmisc.f90: remove redundant funcc=0.d0 zeroing from symmetry_bd, symmetry_tbd, symmetry_stbd (~328+ full-array memsets eliminated per timestep) - enforce_algebra.f90: rewrite enforce_ag and enforce_ga as point-wise loops, replacing 12 stack-allocated 3D temporary arrays with scalar locals for better cache locality All changes are mathematically equivalent — no algorithmic modifications. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-06 17:13:39 +08:00
CGH0S7	223ec17a54	input updated	2026-02-06 13:57:48 +08:00
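Commit 09ffdb553d in the table above mentions rewriting `ThomasAlgorithm` to operate in place on its input arrays. The TwoPunctures code itself is not shown on this page, so the following is only a generic sketch of an in-place Thomas (tridiagonal) solve. The function name, argument layout, and the absence of pivoting are assumptions for illustration, not the solver's actual interface.

```python
def thomas_in_place(a, b, c, d):
    """Solve a tridiagonal system without allocating temporaries.

    a: sub-diagonal (a[0] unused), b: diagonal, c: super-diagonal
    (c[-1] unused), d: right-hand side. b and d are overwritten;
    on return d holds the solution. No pivoting is done, so this
    assumes a well-conditioned (e.g. diagonally dominant) matrix.
    """
    n = len(d)
    # Forward elimination, reusing b and d as workspace.
    for i in range(1, n):
        w = a[i] / b[i - 1]
        b[i] -= w * c[i - 1]
        d[i] -= w * d[i - 1]
    # Back substitution, writing the solution into d.
    d[n - 1] /= b[n - 1]
    for i in range(n - 2, -1, -1):
        d[i] = (d[i] - c[i] * d[i + 1]) / b[i]
    return d


# Hypothetical 4x4 example: diagonal 2, off-diagonals -1,
# right-hand side built from the known solution [1, 2, 3, 4].
sub = [0.0, -1.0, -1.0, -1.0]
diag = [2.0, 2.0, 2.0, 2.0]
sup = [-1.0, -1.0, -1.0, 0.0]
rhs = [0.0, 0.0, 0.0, 5.0]
solution = thomas_in_place(sub, diag, sup, rhs)
print(solution)
```

Because the inputs are overwritten, a caller that needs the original diagonal or right-hand side afterwards has to keep its own copy. That is the trade-off for avoiding a per-call allocation.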