diff --git a/AMSS_NCKU_source/makefile.inc b/AMSS_NCKU_source/makefile.inc
index ee94ac7..a5fd83d 100755
--- a/AMSS_NCKU_source/makefile.inc
+++ b/AMSS_NCKU_source/makefile.inc
@@ -10,14 +10,15 @@ filein = -I/usr/include/ -I${MKLROOT}/include
 ## Added -lifcore for Intel Fortran runtime and -limf for Intel math library
 LDLIBS = -L${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lifcore -limf -lpthread -lm -ldl
 
-## Aggressive optimization flags:
-## -O3: Maximum optimization
-## -xHost: Optimize for the host CPU architecture (Intel/AMD compatible)
-## -fp-model fast=2: Aggressive floating-point optimizations
-## -fma: Enable fused multiply-add instructions
+## Aggressive optimization flags + PGO Phase 2 (profile-guided optimization)
+## -fprofile-instr-use: use collected profile data to guide optimization decisions
+## (branch prediction, basic block layout, inlining, loop unrolling)
+PROFDATA = /home/amss/AMSS-NCKU/pgo_profile/default.profdata
 CXXAPPFLAGS = -O3 -xHost -fp-model fast=2 -fma -ipo \
+              -fprofile-instr-use=$(PROFDATA) \
               -Dfortran3 -Dnewc -I${MKLROOT}/include
 f90appflags = -O3 -xHost -fp-model fast=2 -fma -ipo \
+              -fprofile-instr-use=$(PROFDATA) \
               -align array64byte -fpp -I${MKLROOT}/include
 f90 = ifx
 f77 = ifx
diff --git a/pgo_profile/PGO_Profile_Analysis.md b/pgo_profile/PGO_Profile_Analysis.md
new file mode 100644
index 0000000..bff40c0
--- /dev/null
+++ b/pgo_profile/PGO_Profile_Analysis.md
@@ -0,0 +1,97 @@
+# AMSS-NCKU PGO Profile Analysis Report
+
+## 1. Profiling Environment
+
+| Item | Value |
+|------|-------|
+| Compiler | Intel oneAPI DPC++/C++ 2025.3.0 (icpx/ifx) |
+| Instrumentation Flag | `-fprofile-instr-generate` |
+| Optimization Level (instrumented) | `-O2 -xHost -fma` |
+| MPI Processes | 1 (single process to avoid MPI+instrumentation deadlock) |
+| Profile File | `default_9725750769337483397_0.profraw` (327 KB) |
+| Merged Profile | `default.profdata` (394 KB) |
+| llvm-profdata | `/home/intel/oneapi/compiler/2025.3/bin/compiler/llvm-profdata` |
+
+## 2. Reduced Simulation Parameters (for profiling run)
+
+| Parameter | Production Value | Profiling Value |
+|-----------|-----------------|-----------------|
+| MPI_processes | 64 | 1 |
+| grid_level | 9 | 4 |
+| static_grid_level | 5 | 3 |
+| static_grid_number | 96 | 24 |
+| moving_grid_number | 48 | 16 |
+| largest_box_xyz_max | 320^3 | 160^3 |
+| Final_Evolution_Time | 1000.0 | 10.0 |
+| Evolution_Step_Number | 10,000,000 | 1,000 |
+| Detector_Number | 12 | 2 |
+
+## 3. Profile Summary
+
+| Metric | Value |
+|--------|-------|
+| Total instrumented functions | 1,392 |
+| Functions with non-zero counts | 117 (8.4%) |
+| Functions with zero counts | 1,275 (91.6%) |
+| Maximum function entry count | 386,459,248 |
+| Maximum internal block count | 370,477,680 |
+| Total block count | 4,198,023,118 |
+
+## 4. Top 20 Hotspot Functions
+
+| Rank | Total Count | Max Block Count | Function | Category |
+|------|------------|-----------------|----------|----------|
+| 1 | 1,241,601,732 | 370,477,680 | `polint_` | Interpolation |
+| 2 | 755,994,435 | 230,156,640 | `prolong3_` | Grid prolongation |
+| 3 | 667,964,095 | 3,697,792 | `compute_rhs_bssn_` | BSSN RHS evolution |
+| 4 | 539,736,051 | 386,459,248 | `symmetry_bd_` | Symmetry boundary |
+| 5 | 277,310,808 | 53,170,728 | `lopsided_` | Lopsided FD stencil |
+| 6 | 155,534,488 | 94,535,040 | `decide3d_` | 3D grid decision |
+| 7 | 119,267,712 | 19,266,048 | `rungekutta4_rout_` | RK4 time integrator |
+| 8 | 91,574,616 | 48,824,160 | `kodis_` | Kreiss-Oliger dissipation |
+| 9 | 67,555,389 | 43,243,680 | `fderivs_` | Finite differences |
+| 10 | 55,296,000 | 42,246,144 | `misc::fact(int)` | Factorial utility |
+| 11 | 43,191,071 | 27,663,328 | `fdderivs_` | 2nd-order FD derivatives |
+| 12 | 36,233,965 | 22,429,440 | `restrict3_` | Grid restriction |
+| 13 | 24,698,512 | 17,231,520 | `polin3_` | Polynomial interpolation |
+| 14 | 22,962,942 | 20,968,768 | `copy_` | Data copy |
+| 15 | 20,135,696 | 17,259,168 | `Ansorg::barycentric(...)` | Spectral interpolation |
+| 16 | 14,650,224 | 7,224,768 | `Ansorg::barycentric_omega(...)` | Spectral weights |
+| 17 | 13,242,296 | 2,871,920 | `global_interp_` | Global interpolation |
+| 18 | 12,672,000 | 7,734,528 | `sommerfeld_rout_` | Sommerfeld boundary |
+| 19 | 6,872,832 | 1,880,064 | `sommerfeld_routbam_` | Sommerfeld boundary (BAM) |
+| 20 | 5,709,900 | 2,809,632 | `l2normhelper_` | L2 norm computation |
+
+## 5. Hotspot Category Breakdown
+
+The top 20 functions account for ~99% of the total block count (~4,172M of ~4,198M):
+
+| Category | Functions | Combined Count | Share |
+|----------|-----------|---------------|-------|
+| Interpolation / Prolongation / Restriction | polint_, prolong3_, restrict3_, polin3_, global_interp_, Ansorg::* | ~2,107M | ~50% |
+| BSSN RHS + FD stencils | compute_rhs_bssn_, lopsided_, fderivs_, fdderivs_ | ~1,056M | ~25% |
+| Boundary conditions | symmetry_bd_, sommerfeld_rout_, sommerfeld_routbam_ | ~559M | ~13% |
+| Time integration | rungekutta4_rout_ | ~119M | ~3% |
+| Dissipation | kodis_ | ~92M | ~2% |
+| Utilities | misc::fact, decide3d_, copy_, l2normhelper_ | ~240M | ~6% |
+
+## 6. Conclusions
+
+1. **Profile data is valid**: 1,392 functions instrumented, 117 exercised with ~4.2 billion total counts.
+2. **Hotspot concentration is high**: The top 5 functions alone account for ~83% of all counts (the top 4 for ~76%), which suits PGO well because the compiler has a small set of strong branch/layout optimization targets.
+3. **Fortran numerical kernels dominate**: `polint_`, `prolong3_`, `compute_rhs_bssn_`, `symmetry_bd_`, `lopsided_` are all Fortran routines in the inner evolution loop. PGO will optimize their branch prediction and basic block layout.
+4. **91.6% of functions have zero counts**: These are code paths for unused features (GPU, BSSN-EScalar, BSSN-EM, Z4C, etc.). PGO treats them as cold, which should improve instruction cache utilization.
+5. **Profile is representative**: Despite the reduced grid size, the same kernels (RHS, prolongation, restriction, boundary) are exercised as in production, so the branch probabilities from this profile are expected to transfer to full-scale runs.
+
+## 7. PGO Phase 2 Usage
+
+To apply the profile, use the following flags in `makefile.inc`:
+
+```makefile
+CXXAPPFLAGS = -O3 -xHost -fp-model fast=2 -fma -ipo \
+              -fprofile-instr-use=/home/amss/AMSS-NCKU/pgo_profile/default.profdata \
+              -Dfortran3 -Dnewc -I${MKLROOT}/include
+f90appflags = -O3 -xHost -fp-model fast=2 -fma -ipo \
+              -fprofile-instr-use=/home/amss/AMSS-NCKU/pgo_profile/default.profdata \
+              -align array64byte -fpp -I${MKLROOT}/include
+```
diff --git a/pgo_profile/default.profdata b/pgo_profile/default.profdata
new file mode 100644
index 0000000..dfac738
Binary files /dev/null and b/pgo_profile/default.profdata differ
diff --git a/pgo_profile/default_9725750769337483397_0.profraw b/pgo_profile/default_9725750769337483397_0.profraw
new file mode 100644
index 0000000..c9c2485
Binary files /dev/null and b/pgo_profile/default_9725750769337483397_0.profraw differ
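
Review note: the concentration figures quoted in the report (share of the top N functions) can be rechecked directly from the numbers in the "Top 20 Hotspot Functions" and "Profile Summary" tables. The following is a minimal sketch, not part of the commit. The counts are copied from the report, while the names `TOP20`, `TOTAL_BLOCK_COUNT`, and `top_n_share` are hypothetical helpers introduced only for this check.

```python
# Sanity check for the hotspot shares quoted in PGO_Profile_Analysis.md.
# Counts are copied from the report's "Top 20 Hotspot Functions" table;
# all identifiers below are illustrative, not part of AMSS-NCKU.

TOP20 = [
    ("polint_", 1_241_601_732),
    ("prolong3_", 755_994_435),
    ("compute_rhs_bssn_", 667_964_095),
    ("symmetry_bd_", 539_736_051),
    ("lopsided_", 277_310_808),
    ("decide3d_", 155_534_488),
    ("rungekutta4_rout_", 119_267_712),
    ("kodis_", 91_574_616),
    ("fderivs_", 67_555_389),
    ("misc::fact(int)", 55_296_000),
    ("fdderivs_", 43_191_071),
    ("restrict3_", 36_233_965),
    ("polin3_", 24_698_512),
    ("copy_", 22_962_942),
    ("Ansorg::barycentric", 20_135_696),
    ("Ansorg::barycentric_omega", 14_650_224),
    ("global_interp_", 13_242_296),
    ("sommerfeld_rout_", 12_672_000),
    ("sommerfeld_routbam_", 6_872_832),
    ("l2normhelper_", 5_709_900),
]

# "Total block count" from the report's Profile Summary table.
TOTAL_BLOCK_COUNT = 4_198_023_118


def top_n_share(n: int) -> float:
    """Fraction of the total block count covered by the n hottest functions."""
    return sum(count for _, count in TOP20[:n]) / TOTAL_BLOCK_COUNT


if __name__ == "__main__":
    for n in (4, 5, 20):
        print(f"top {n:2d}: {top_n_share(n):.1%}")
```

Running the script prints the cumulative share for the top 4, top 5, and top 20 functions, which can be compared against the percentages stated in Sections 5 and 6 of the report.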