Add !$omp parallel do collapse(2) directives to all triple-loop
stencil kernels (fderivs, fdderivs, fdx/fdy/fdz, kodis, lopsided,
enforce_ag/enforce_ga) across all ghost_width variants. Add !$omp
parallel workshare to RK4/ICN/Euler whole-array update routines.
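
As a minimal sketch of the two directive patterns described above (not the
actual kernel code; the program name demo_omp and the variables f, fx, dx,
and dt are hypothetical, and a simple centered difference stands in for the
real stencils):

```fortran
! Hypothetical sketch: names and the stencil are illustrative only,
! not taken from fderivs/fdderivs or the RK4/ICN/Euler routines.
program demo_omp
  implicit none
  integer, parameter :: nx = 32, ny = 32, nz = 32
  real(8), parameter :: dx = 0.1d0, dt = 0.01d0
  real(8), allocatable :: f(:,:,:), fx(:,:,:)
  integer :: i, j, k

  allocate(f(nx,ny,nz), fx(nx,ny,nz))
  fx = 0.d0

  ! Fill f with a linear function of x.
  !$omp parallel do collapse(2) private(i)
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        f(i,j,k) = dx * real(i, 8)
      end do
    end do
  end do
  !$omp end parallel do

  ! collapse(2) merges the k and j loops into a single iteration space
  ! that is divided among threads; the innermost i loop stays contiguous
  ! in memory so the compiler can still vectorize it.
  !$omp parallel do collapse(2) private(i)
  do k = 1, nz
    do j = 1, ny
      do i = 2, nx - 1
        fx(i,j,k) = (f(i+1,j,k) - f(i-1,j,k)) / (2.d0 * dx)
      end do
    end do
  end do
  !$omp end parallel do

  ! Whole-array update (an Euler-style step) split among threads
  ! with the workshare construct.
  !$omp parallel workshare
  f = f + dt * fx
  !$omp end parallel workshare

  print *, 'fx(2,2,2) =', fx(2,2,2)
end program demo_omp
```

With the Intel compiler this sketch would be built with -qopenmp; without
that flag the directives are treated as comments and the program runs
serially.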
Build system: add -qopenmp to compile and link flags, switch MKL
from sequential to threaded (-lmkl_intel_thread -liomp5).
Runtime: set OMP_NUM_THREADS=96, OMP_STACKSIZE=16M, OMP_PROC_BIND=close,
OMP_PLACES=cores for the 96-core server target.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- makefile.inc: add -ipo (interprocedural optimization) and
  -align array64byte (64-byte array alignment for vectorization)
- fmisc.f90: remove redundant funcc=0.d0 zeroing from symmetry_bd,
  symmetry_tbd, symmetry_stbd (roughly 328 or more full-array memsets
  eliminated per timestep)
- enforce_algebra.f90: rewrite enforce_ag and enforce_ga as point-wise
  loops, replacing 12 stack-allocated 3D temporary arrays with scalar
  locals for better cache locality
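
The point-wise rewrite can be illustrated with a minimal sketch. It does not
reproduce the constraints enforced by enforce_ag/enforce_ga; the arrays a, b,
c and the trace-removal operation are hypothetical stand-ins chosen only to
show a scalar local replacing a 3D temporary:

```fortran
! Hypothetical sketch: a, b, c and the trace removal are illustrative,
! not the actual contents of enforce_algebra.f90.
program demo_pointwise
  implicit none
  integer, parameter :: nx = 16, ny = 16, nz = 16
  real(8), allocatable :: a(:,:,:), b(:,:,:), c(:,:,:)
  real(8) :: tr          ! scalar local instead of a full 3D temporary
  integer :: i, j, k

  allocate(a(nx,ny,nz), b(nx,ny,nz), c(nx,ny,nz))
  a = 1.d0
  b = 2.d0
  c = 3.d0

  ! The whole-array form would need a 3D temporary tr3d(nx,ny,nz):
  !   tr3d = (a + b + c) / 3.d0
  !   a = a - tr3d;  b = b - tr3d;  c = c - tr3d
  ! The point-wise form below keeps the intermediate value in a scalar,
  ! so each grid point is loaded and stored once per array.
  !$omp parallel do collapse(2) private(i, tr)
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        tr = (a(i,j,k) + b(i,j,k) + c(i,j,k)) / 3.d0
        a(i,j,k) = a(i,j,k) - tr
        b(i,j,k) = b(i,j,k) - tr
        c(i,j,k) = c(i,j,k) - tr
      end do
    end do
  end do
  !$omp end parallel do

  ! After the update the three arrays sum to zero at every point.
  print *, 'max |a+b+c| =', maxval(abs(a + b + c))
end program demo_pointwise
```

The scalar tr must appear in the private clause; otherwise threads would
share it and race on the intermediate value.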
All changes are mathematically equivalent; there are no algorithmic modifications.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>