Performance profiling

2026-01-18 00:31:39 +08:00
parent 25c6fc04db
commit c86c2be5ff
26 changed files with 272 additions and 54 deletions

@@ -0,0 +1,85 @@
================================================================================
PERFORMANCE PROFILING REPORT
================================================================================
----------------------------------------
MACRO-LEVEL TIMING SUMMARY
----------------------------------------
Section                          Count       Total(ms)         Avg(ms)    CUDA Avg(ms)
--------------------------------------------------------------------------------------
action_generation                   11       399707.47        36337.04        36336.85
data_loading                         1           52.85           52.85           52.88
get_latent_z/encode                 22          901.39           40.97           41.01
iteration_total                     11       836793.23        76072.11        76071.63
load_transitions                     1            2.24            2.24            2.28
model_loading/checkpoint             1        11833.31        11833.31        11833.43
model_loading/config                 1        49774.19        49774.19        49774.16
model_to_cuda                        1         8909.30         8909.30         8909.33
prepare_init_input                   1           10.52           10.52           10.55
prepare_observation                 11            5.41            0.49            0.53
prepare_wm_observation              11            2.12            0.19            0.22
save_results                        11        38668.06         3515.28         3515.32
synthesis/conditioning_prep         22         2916.63          132.57          132.61
synthesis/ddim_sampling             22       782695.01        35577.05        35576.86
synthesis/decode_first_stage        22        12444.31          565.65          565.70
update_action_queues                11            6.85            0.62            0.65
update_state_queues                 11           17.67            1.61            1.64
world_model_interaction             11       398375.58        36215.96        36215.75
--------------------------------------------------------------------------------------
TOTAL                                       2543116.13
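
The TOTAL row equals the plain sum of the Total(ms) column, so any section that runs inside another is counted more than once. The Python sketch below re-adds the column and estimates how much of the loop time the sampler accounts for. It assumes that synthesis/ddim_sampling runs inside iteration_total, which is inferred from the section names and magnitudes and is not stated in the report. The names `total_ms` and `share` are hypothetical.

```python
# Total(ms) values copied from the MACRO-LEVEL TIMING SUMMARY table.
total_ms = {
    "action_generation": 399707.47,
    "data_loading": 52.85,
    "get_latent_z/encode": 901.39,
    "iteration_total": 836793.23,
    "load_transitions": 2.24,
    "model_loading/checkpoint": 11833.31,
    "model_loading/config": 49774.19,
    "model_to_cuda": 8909.30,
    "prepare_init_input": 10.52,
    "prepare_observation": 5.41,
    "prepare_wm_observation": 2.12,
    "save_results": 38668.06,
    "synthesis/conditioning_prep": 2916.63,
    "synthesis/ddim_sampling": 782695.01,
    "synthesis/decode_first_stage": 12444.31,
    "update_action_queues": 6.85,
    "update_state_queues": 17.67,
    "world_model_interaction": 398375.58,
}

# Plain sum of every row, nested sections included.
column_sum = sum(total_ms.values())


def share(part: str, whole: str) -> float:
    """Percentage of `whole` spent in `part`; assumes `part` is nested in `whole`."""
    return 100.0 * total_ms[part] / total_ms[whole]


# Assumption: the DDIM sampler runs inside the per-iteration loop.
ddim_share = share("synthesis/ddim_sampling", "iteration_total")

print(f"column sum: {column_sum:.2f} ms")
print(f"ddim_sampling / iteration_total: {ddim_share:.1f}%")
```

If the nesting assumption holds, iteration_total (836793.23 ms over 11 iterations) is a better measure of end-to-end loop time than the TOTAL row.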
----------------------------------------
GPU MEMORY SUMMARY
----------------------------------------
Peak allocated: 17890.50 MB
Average allocated: 16129.98 MB
----------------------------------------
TOP 30 OPERATORS BY CUDA TIME
----------------------------------------
Operator                                             Count    CUDA(ms)     CPU(ms) Self CUDA(ms)
------------------------------------------------------------------------------------------------
ProfilerStep*                                            6   443804.16   237696.98     237689.25
aten::linear                                        171276   112286.23    13179.82          0.00
aten::addmm                                          81456    79537.36     3799.84      79296.37
ampere_sgemm_128x64_tn                               26400    52052.10        0.00      52052.10
aten::matmul                                         90468    34234.05     6281.32          0.00
aten::_convolution                                  100242    33623.79    13105.89          0.00
aten::mm                                             89820    33580.74     3202.22      33253.18
aten::convolution                                   100242    33575.23    13714.47          0.00
aten::cudnn_convolution                              98430    30932.19     8640.50      29248.12
ampere_sgemm_32x128_tn                               42348    20394.52        0.00      20394.52
aten::conv2d                                         42042    18115.35     5932.30          0.00
ampere_sgemm_128x32_tn                               40938    16429.81        0.00      16429.81
xformers::efficient_attention_forward_cutlass        24000    15222.23     2532.93      15120.44
fmha_cutlassF_f32_aligned_64x64_rf_sm80(Attenti...   24000    15121.31        0.00      15121.31
ampere_sgemm_64x64_tn                                21000    14627.12        0.00      14627.12
aten::copy_                                         231819    14504.87   127056.51      14038.39
aten::group_norm                                     87144    12033.73    10659.57          0.00
aten::native_group_norm                              87144    11473.40     9449.36      11002.02
aten::conv3d                                         26400     8852.13     3365.43          0.00
void at::native::(anonymous namespace)::Rowwise...   87144     8714.68        0.00       8714.68
void cudnn::ops::nchwToNhwcKernel<float, float,...  169824     8525.44        0.00       8525.44
aten::clone                                         214314     8200.26     8568.82          0.00
void at::native::elementwise_kernel<128, 2, at:...  220440     8109.62        0.00       8109.62
void cutlass::Kernel<cutlass_80_simt_sgemm_128x...   15000     7919.30        0.00       7919.30
aten::_to_copy                                       12219     5963.43   122411.53          0.00
aten::to                                             58101     5952.65   122443.72          0.00
aten::conv1d                                         30000     5878.95     4556.48          0.00
Memcpy HtoD (Pageable -> Device)                      6696     5856.39        0.00       5856.39
aten::reshape                                       671772     5124.03     9636.01          0.00
sm80_xmma_fprop_implicit_gemm_indexed_tf32f32_t...   16272     5097.70        0.00       5097.70
----------------------------------------
OPERATOR CATEGORY BREAKDOWN
----------------------------------------
Category                    CUDA Time(ms)      Percentage
---------------------------------------------------------
Other                           481950.47           41.9%
Linear/GEMM                     342333.09           29.8%
Convolution                     159920.77           13.9%
Elementwise                      54682.93            4.8%
Memory                           36883.36            3.2%
Attention                        34736.13            3.0%
Normalization                    32081.19            2.8%
Activation                        6449.19            0.6%
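
The report does not say how the Percentage column is derived. The Python sketch below assumes each value is the category's CUDA time divided by the sum of the eight categories listed, and recomputes the column under that assumption. The names `cuda_ms` and `percent` are hypothetical.

```python
# CUDA Time(ms) values copied from the OPERATOR CATEGORY BREAKDOWN table.
cuda_ms = {
    "Other": 481950.47,
    "Linear/GEMM": 342333.09,
    "Convolution": 159920.77,
    "Elementwise": 54682.93,
    "Memory": 36883.36,
    "Attention": 34736.13,
    "Normalization": 32081.19,
    "Activation": 6449.19,
}

# Assumption: the denominator is the sum of the listed categories.
total = sum(cuda_ms.values())
percent = {name: 100.0 * ms / total for name, ms in cuda_ms.items()}

# Print the categories from largest to smallest share.
for name, pct in sorted(percent.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<15}{cuda_ms[name]:>14.2f}{pct:>9.1f}%")
```

Rounded to one decimal place, the recomputed shares agree with the Percentage column, which is consistent with that assumption.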