AMSS-NCKU PGO Profile Analysis Report
1. Profiling Environment
| Item | Value |
|---|---|
| Compiler | Intel oneAPI DPC++/C++ 2025.3.0 (icpx/ifx) |
| Instrumentation Flag | -fprofile-instr-generate |
| Optimization Level (instrumented) | -O2 -xHost -fma |
| MPI Processes | 1 (single process to avoid MPI+instrumentation deadlock) |
| Profile File | default_9725750769337483397_0.profraw (327 KB) |
| Merged Profile | default.profdata (394 KB) |
| llvm-profdata | /home/intel/oneapi/compiler/2025.3/bin/compiler/llvm-profdata |
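
The step from the raw profile to the merged profile listed above can be sketched as follows. This is a minimal illustration, not the exact procedure used for this report: the helper names `merge_command` and `show_command` are hypothetical, and the use of `llvm-profdata show` to inspect the result is an assumption. The functions only build the command lines; nothing is executed unless you pass the result to `subprocess.run` yourself.

```python
from pathlib import Path

# Tool path as listed in the environment table; adjust for your installation.
LLVM_PROFDATA = "/home/intel/oneapi/compiler/2025.3/bin/compiler/llvm-profdata"


def merge_command(raw_profiles, output="default.profdata", tool=LLVM_PROFDATA):
    """Build the argv for `llvm-profdata merge` (hypothetical helper).

    `merge` combines one or more .profraw files into a single indexed
    .profdata file that the compiler can consume in the PGO use phase.
    """
    raw = [str(Path(p)) for p in raw_profiles]
    if not raw:
        raise ValueError("at least one .profraw file is required")
    return [tool, "merge", "-o", output, *raw]


def show_command(profile="default.profdata", top_n=20, tool=LLVM_PROFDATA):
    """Build the argv for `llvm-profdata show` with a hottest-functions list.

    Assumption: this is one way to obtain per-function counts; the report
    does not state which command produced its tables.
    """
    return [tool, "show", "--all-functions", f"--topn={top_n}", profile]


if __name__ == "__main__":
    # Print the commands instead of running them (dry run).
    print(" ".join(merge_command(["default_9725750769337483397_0.profraw"])))
    print(" ".join(show_command()))
```

To actually run a command, pass the returned list to `subprocess.run(cmd, check=True)`.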
2. Reduced Simulation Parameters (for profiling run)
| Parameter | Production Value | Profiling Value |
|---|---|---|
| MPI_processes | 64 | 1 |
| grid_level | 9 | 4 |
| static_grid_level | 5 | 3 |
| static_grid_number | 96 | 24 |
| moving_grid_number | 48 | 16 |
| largest_box_xyz_max | 320^3 | 160^3 |
| Final_Evolution_Time | 1000.0 | 10.0 |
| Evolution_Step_Number | 10,000,000 | 1,000 |
| Detector_Number | 12 | 2 |
3. Profile Summary
| Metric | Value |
|---|---|
| Total instrumented functions | 1,392 |
| Functions with non-zero counts | 117 (8.4%) |
| Functions with zero counts | 1,275 (91.6%) |
| Maximum function entry count | 386,459,248 |
| Maximum internal block count | 370,477,680 |
| Total block count | 4,198,023,118 |
4. Top 20 Hotspot Functions
| Rank | Total Count | Max Block Count | Function | Category |
|---|---|---|---|---|
| 1 | 1,241,601,732 | 370,477,680 | polint_ | Interpolation |
| 2 | 755,994,435 | 230,156,640 | prolong3_ | Grid prolongation |
| 3 | 667,964,095 | 3,697,792 | compute_rhs_bssn_ | BSSN RHS evolution |
| 4 | 539,736,051 | 386,459,248 | symmetry_bd_ | Symmetry boundary |
| 5 | 277,310,808 | 53,170,728 | lopsided_ | Lopsided FD stencil |
| 6 | 155,534,488 | 94,535,040 | decide3d_ | 3D grid decision |
| 7 | 119,267,712 | 19,266,048 | rungekutta4_rout_ | RK4 time integrator |
| 8 | 91,574,616 | 48,824,160 | kodis_ | Kreiss-Oliger dissipation |
| 9 | 67,555,389 | 43,243,680 | fderivs_ | Finite differences |
| 10 | 55,296,000 | 42,246,144 | misc::fact(int) | Factorial utility |
| 11 | 43,191,071 | 27,663,328 | fdderivs_ | 2nd-order FD derivatives |
| 12 | 36,233,965 | 22,429,440 | restrict3_ | Grid restriction |
| 13 | 24,698,512 | 17,231,520 | polin3_ | Polynomial interpolation |
| 14 | 22,962,942 | 20,968,768 | copy_ | Data copy |
| 15 | 20,135,696 | 17,259,168 | Ansorg::barycentric(...) | Spectral interpolation |
| 16 | 14,650,224 | 7,224,768 | Ansorg::barycentric_omega(...) | Spectral weights |
| 17 | 13,242,296 | 2,871,920 | global_interp_ | Global interpolation |
| 18 | 12,672,000 | 7,734,528 | sommerfeld_rout_ | Sommerfeld boundary |
| 19 | 6,872,832 | 1,880,064 | sommerfeld_routbam_ | Sommerfeld boundary (BAM) |
| 20 | 5,709,900 | 2,809,632 | l2normhelper_ | L2 norm computation |
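
How concentrated the hotspots are can be checked directly from the table. The sketch below is illustrative only: the helper name `cumulative_share` is hypothetical, and the counts are copied by hand from the Top 20 table and the "Total block count" row of the profile summary. It computes the fraction of the total covered by the N hottest functions.

```python
# "Total block count" from the profile summary.
TOTAL_BLOCK_COUNT = 4_198_023_118

# Per-function total counts in rank order, copied from the Top 20 table.
TOP20_COUNTS = [
    1_241_601_732, 755_994_435, 667_964_095, 539_736_051, 277_310_808,
    155_534_488, 119_267_712, 91_574_616, 67_555_389, 55_296_000,
    43_191_071, 36_233_965, 24_698_512, 22_962_942, 20_135_696,
    14_650_224, 13_242_296, 12_672_000, 6_872_832, 5_709_900,
]


def cumulative_share(counts, n, total=TOTAL_BLOCK_COUNT):
    """Return the fraction of `total` covered by the `n` largest counts."""
    hottest = sorted(counts, reverse=True)[:n]
    return sum(hottest) / total


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(f"top {n:2d}: {cumulative_share(TOP20_COUNTS, n):.1%}")
```

Running the script prints the cumulative share for the top 1, 5, 10, and 20 functions, which makes it easy to re-check the concentration figures after a new profiling run.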
5. Hotspot Category Breakdown
The top 20 functions account for ~99% of the total block count:
| Category | Functions | Combined Count | Share |
|---|---|---|---|
| Interpolation / Prolongation / Restriction | polint_, prolong3_, restrict3_, polin3_, global_interp_, Ansorg::* | ~2,107M | ~50% |
| BSSN RHS + FD stencils | compute_rhs_bssn_, lopsided_, fderivs_, fdderivs_ | ~1,056M | ~25% |
| Boundary conditions | symmetry_bd_, sommerfeld_rout_, sommerfeld_routbam_ | ~559M | ~13% |
| Time integration | rungekutta4_rout_ | ~119M | ~3% |
| Dissipation | kodis_ | ~92M | ~2% |
| Utilities | misc::fact, decide3d_, copy_, l2normhelper_ | ~240M | ~6% |
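
The category totals can be recomputed from the per-function counts. The following sketch assumes the grouping shown in the table above, with `Ansorg::*` taken to mean both `Ansorg::barycentric` and `Ansorg::barycentric_omega`. The helper name `category_totals` is hypothetical, and the counts are copied by hand from the Top 20 table.

```python
# "Total block count" from the profile summary.
TOTAL_BLOCK_COUNT = 4_198_023_118

# Function counts grouped by category, copied from the Top 20 table.
CATEGORIES = {
    "Interpolation / Prolongation / Restriction": {
        "polint_": 1_241_601_732,
        "prolong3_": 755_994_435,
        "restrict3_": 36_233_965,
        "polin3_": 24_698_512,
        "global_interp_": 13_242_296,
        "Ansorg::barycentric": 20_135_696,
        "Ansorg::barycentric_omega": 14_650_224,
    },
    "BSSN RHS + FD stencils": {
        "compute_rhs_bssn_": 667_964_095,
        "lopsided_": 277_310_808,
        "fderivs_": 67_555_389,
        "fdderivs_": 43_191_071,
    },
    "Boundary conditions": {
        "symmetry_bd_": 539_736_051,
        "sommerfeld_rout_": 12_672_000,
        "sommerfeld_routbam_": 6_872_832,
    },
    "Time integration": {"rungekutta4_rout_": 119_267_712},
    "Dissipation": {"kodis_": 91_574_616},
    "Utilities": {
        "misc::fact": 55_296_000,
        "decide3d_": 155_534_488,
        "copy_": 22_962_942,
        "l2normhelper_": 5_709_900,
    },
}


def category_totals(categories):
    """Sum the per-function counts within each category."""
    return {name: sum(funcs.values()) for name, funcs in categories.items()}


if __name__ == "__main__":
    for name, total in category_totals(CATEGORIES).items():
        share = total / TOTAL_BLOCK_COUNT
        print(f"{name:45s} {total / 1e6:8.0f}M  {share:5.1%}")
```

Because the shares are taken against the total block count rather than the top-20 sum, they add up to slightly less than 100%.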
6. Conclusions
- Profile data is valid: 1,392 functions instrumented, 117 exercised with ~4.2 billion total counts.
- Hotspot concentration is high: The top 5 functions alone account for ~83% of all counts (the top 4 for ~76%), which is ideal for PGO because the compiler has strong branch and layout optimization targets.
- Fortran numerical kernels dominate: polint_, prolong3_, compute_rhs_bssn_, symmetry_bd_, lopsided_ are all Fortran routines in the inner evolution loop. PGO will optimize their branch prediction and basic block layout.
- 91.6% of functions have zero counts: These are code paths for unused features (GPU, BSSN-EScalar, BSSN-EM, Z4C, etc.). PGO will deprioritize them, improving instruction cache utilization.
- Profile is representative: Despite the reduced grid size, the code path coverage matches production, since the same kernels (RHS, prolongation, restriction, boundary) are exercised. PGO branch probabilities from this profile should therefore transfer well to full-scale runs.
7. PGO Phase 2 Usage
To apply the profile, use the following flags in makefile.inc: