# AMSS-NCKU PGO Profile Analysis Report

## 1. Profiling Environment

| Item | Value |
|------|-------|
| Compiler | Intel oneAPI DPC++/C++ 2025.3.0 (icpx/ifx) |
| Instrumentation Flag | `-fprofile-instr-generate` |
| Optimization Level (instrumented) | `-O2 -xHost -fma` |
| MPI Processes | 1 (single process to avoid MPI+instrumentation deadlock) |
| Profile File | `default_9725750769337483397_0.profraw` (327 KB) |
| Merged Profile | `default.profdata` (394 KB) |
| llvm-profdata | `/home/intel/oneapi/compiler/2025.3/bin/compiler/llvm-profdata` |

## 2. Reduced Simulation Parameters (for profiling run)

| Parameter | Production Value | Profiling Value |
|-----------|-----------------|-----------------|
| MPI_processes | 64 | 1 |
| grid_level | 9 | 4 |
| static_grid_level | 5 | 3 |
| static_grid_number | 96 | 24 |
| moving_grid_number | 48 | 16 |
| largest_box_xyz_max | 320^3 | 160^3 |
| Final_Evolution_Time | 1000.0 | 10.0 |
| Evolution_Step_Number | 10,000,000 | 1,000 |
| Detector_Number | 12 | 2 |

## 3. Profile Summary

| Metric | Value |
|--------|-------|
| Total instrumented functions | 1,392 |
| Functions with non-zero counts | 117 (8.4%) |
| Functions with zero counts | 1,275 (91.6%) |
| Maximum function entry count | 386,459,248 |
| Maximum internal block count | 370,477,680 |
| Total block count | 4,198,023,118 |

## 4. Top 20 Hotspot Functions

| Rank | Total Count | Max Block Count | Function | Category |
|------|------------|-----------------|----------|----------|
| 1 | 1,241,601,732 | 370,477,680 | `polint_` | Interpolation |
| 2 | 755,994,435 | 230,156,640 | `prolong3_` | Grid prolongation |
| 3 | 667,964,095 | 3,697,792 | `compute_rhs_bssn_` | BSSN RHS evolution |
| 4 | 539,736,051 | 386,459,248 | `symmetry_bd_` | Symmetry boundary |
| 5 | 277,310,808 | 53,170,728 | `lopsided_` | Lopsided FD stencil |
| 6 | 155,534,488 | 94,535,040 | `decide3d_` | 3D grid decision |
| 7 | 119,267,712 | 19,266,048 | `rungekutta4_rout_` | RK4 time integrator |
| 8 | 91,574,616 | 48,824,160 | `kodis_` | Kreiss-Oliger dissipation |
| 9 | 67,555,389 | 43,243,680 | `fderivs_` | Finite differences |
| 10 | 55,296,000 | 42,246,144 | `misc::fact(int)` | Factorial utility |
| 11 | 43,191,071 | 27,663,328 | `fdderivs_` | 2nd-order FD derivatives |
| 12 | 36,233,965 | 22,429,440 | `restrict3_` | Grid restriction |
| 13 | 24,698,512 | 17,231,520 | `polin3_` | Polynomial interpolation |
| 14 | 22,962,942 | 20,968,768 | `copy_` | Data copy |
| 15 | 20,135,696 | 17,259,168 | `Ansorg::barycentric(...)` | Spectral interpolation |
| 16 | 14,650,224 | 7,224,768 | `Ansorg::barycentric_omega(...)` | Spectral weights |
| 17 | 13,242,296 | 2,871,920 | `global_interp_` | Global interpolation |
| 18 | 12,672,000 | 7,734,528 | `sommerfeld_rout_` | Sommerfeld boundary |
| 19 | 6,872,832 | 1,880,064 | `sommerfeld_routbam_` | Sommerfeld boundary (BAM) |
| 20 | 5,709,900 | 2,809,632 | `l2normhelper_` | L2 norm computation |

## 5. Hotspot Category Breakdown

The top 20 functions account for ~99% of total execution counts:

| Category | Functions | Combined Count | Share |
|----------|-----------|---------------|-------|
| Interpolation / Prolongation / Restriction | polint_, prolong3_, restrict3_, polin3_, global_interp_, Ansorg::* | ~2,107M | ~50% |
| BSSN RHS + FD stencils | compute_rhs_bssn_, lopsided_, fderivs_, fdderivs_ | ~1,056M | ~25% |
| Boundary conditions | symmetry_bd_, sommerfeld_rout_, sommerfeld_routbam_ | ~559M | ~13% |
| Time integration | rungekutta4_rout_ | ~119M | ~3% |
| Dissipation | kodis_ | ~92M | ~2% |
| Utilities | misc::fact, decide3d_, copy_, l2normhelper_ | ~240M | ~6% |

## 6. Conclusions

1. **Profile data is valid**: 1,392 functions were instrumented and 117 were exercised, with ~4.2 billion total counts.
2. **Hotspot concentration is high**: The top 5 functions alone account for ~83% of all counts, which is ideal for PGO because it gives the compiler strong branch and layout optimization targets.
3. **Fortran numerical kernels dominate**: `polint_`, `prolong3_`, `compute_rhs_bssn_`, `symmetry_bd_`, and `lopsided_` are all Fortran routines in the inner evolution loop. PGO will optimize their branch prediction and basic block layout.
4. **91.6% of functions have zero counts**: These are code paths for unused features (GPU, BSSN-EScalar, BSSN-EM, Z4C, etc.). PGO will deprioritize them, improving instruction cache utilization.
5. **Profile is representative**: Despite the reduced grid size, the code path coverage matches production: the same kernels (RHS, prolongation, restriction, boundary) are exercised. PGO branch probabilities from this profile will transfer well to full-scale runs.

## 7. PGO Phase 2 Usage

To apply the profile, use the following flags in `makefile.inc`:

```makefile
CXXAPPFLAGS = -O3 -xHost -fp-model fast=2 -fma -ipo \
              -fprofile-instr-use=/home/amss/AMSS-NCKU/pgo_profile/default.profdata \
              -Dfortran3 -Dnewc -I${MKLROOT}/include

f90appflags = -O3 -xHost -fp-model fast=2 -fma -ipo \
              -fprofile-instr-use=/home/amss/AMSS-NCKU/pgo_profile/default.profdata \
              -align array64byte -fpp -I${MKLROOT}/include
```
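
As a cross-check, the shares quoted in Sections 5 and 6 can be recomputed from the per-function counts in Section 4 and the total block count in Section 3. The following is a minimal Python sketch of that arithmetic, not part of the AMSS-NCKU build. The helper names (`top_n_share`, `category_totals`) and the short category keys are hypothetical, and the counts are copied by hand from the tables in this report rather than read from `default.profdata`.

```python
from collections import defaultdict

# "Total block count" from the Section 3 summary table.
TOTAL_BLOCK_COUNT = 4_198_023_118

# (function, total count, category key) in rank order, copied from Section 4.
# The category keys are illustrative groupings following Section 5.
HOTSPOTS = [
    ("polint_", 1_241_601_732, "interp"),
    ("prolong3_", 755_994_435, "interp"),
    ("compute_rhs_bssn_", 667_964_095, "rhs_fd"),
    ("symmetry_bd_", 539_736_051, "boundary"),
    ("lopsided_", 277_310_808, "rhs_fd"),
    ("decide3d_", 155_534_488, "utility"),
    ("rungekutta4_rout_", 119_267_712, "time_integration"),
    ("kodis_", 91_574_616, "dissipation"),
    ("fderivs_", 67_555_389, "rhs_fd"),
    ("misc::fact(int)", 55_296_000, "utility"),
    ("fdderivs_", 43_191_071, "rhs_fd"),
    ("restrict3_", 36_233_965, "interp"),
    ("polin3_", 24_698_512, "interp"),
    ("copy_", 22_962_942, "utility"),
    ("Ansorg::barycentric", 20_135_696, "interp"),
    ("Ansorg::barycentric_omega", 14_650_224, "interp"),
    ("global_interp_", 13_242_296, "interp"),
    ("sommerfeld_rout_", 12_672_000, "boundary"),
    ("sommerfeld_routbam_", 6_872_832, "boundary"),
    ("l2normhelper_", 5_709_900, "utility"),
]


def top_n_share(n):
    """Fraction of the total block count covered by the n highest-ranked functions."""
    covered = sum(count for _, count, _ in HOTSPOTS[:n])
    return covered / TOTAL_BLOCK_COUNT


def category_totals():
    """Sum the per-function counts for each category key."""
    totals = defaultdict(int)
    for _, count, category in HOTSPOTS:
        totals[category] += count
    return dict(totals)


print(f"top 5 share:  {top_n_share(5):.1%}")
print(f"top 20 share: {top_n_share(20):.1%}")
for category, total in sorted(category_totals().items(), key=lambda kv: -kv[1]):
    print(f"{category:18s} {total / 1e6:8.0f}M  {total / TOTAL_BLOCK_COUNT:.1%}")
```

Keeping the counts as exact integers and dividing only at print time avoids accumulating rounding error, so the printed percentages can be compared directly against the approximate figures in the tables.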