Print inference weight precision info

2026-01-18 11:19:10 +08:00
parent c86c2be5ff
commit 7b499284bf
9 changed files with 256 additions and 143 deletions
--- a/model_architecture_analysis.md
+++ b/model_architecture_analysis.md
@@ -125,10 +125,11 @@ def ddim_sampling(self, cond, shape, x_T=None, ddim_steps=50, ...):
    return x
 ```

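-**Performance data** (from profiling):
-- Total time per denoising step: 10.71s - 11.06s (22 calls)
-- Model forward: 325.30s (660 calls, avg 0.493s/call)
-- DDIM update: 0.21s (660 calls, avg 0.0003s/call)
+**Performance data** (from the profiling report, `--ddim_steps 50`):
+- DDIM sampling calls: 22 (11 each for action_generation and world_model_interaction)
+- Average time per sampling call (50 steps): 35.58s (782.70s total)
+- Average time per step: ~0.712s (35.58s / 50)
+- With the current `unconditional_guidance_scale=1.0`, each step runs 1 UNet forward pass; with CFG enabled, each step runs 2 forward passes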

 ### 3.2 DiffusionWrapper (condition router)

@@ -205,14 +206,16 @@ z = model.encode_first_stage(img)
 video = model.decode_first_stage(samples)
 ```

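-**Performance data**:
-- VAE encode: 1.03s (22 calls, avg 0.047s/call)
-- VAE decode: 15.53s (22 calls, avg 0.706s/call)
-- Compression ratio: 8×8 = 64× spatial compression
+**Performance data**:
+- VAE encode: 0.90s (22 calls, avg 0.041s/call)
+- VAE decode: 12.44s (22 calls, avg 0.566s/call)
+- Compression ratio: 8×8 = 64× spatial compression

 **Detailed architecture**: see Appendix A.4

-### 3.5 Condition encoders
+### 3.5 Condition encoders
+
+**Performance note**: This profiling run did not time each condition encoder separately; they are all counted under `synthesis/conditioning_prep`, 2.92s in total (22 calls, avg 0.133s/call).

 #### 3.5.1 CLIP image encoder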

@@ -237,9 +240,7 @@ Resampler (image projector):

 **Data flow**: image [B, 3, H, W] → CLIP → [B, 1280] → Resampler → [B, 16, 1024]

-**Performance**: 0.71s (22 calls, avg 0.032s/call)
-
-#### 3.5.2 Text encoder
+#### 3.5.2 Text encoder

 **Code location**: [src/unifolm_wma/modules/encoders/condition.py](src/unifolm_wma/modules/encoders/condition.py) - `FrozenOpenCLIPEmbedder`

@@ -253,9 +254,7 @@ FrozenOpenCLIPEmbedder:
  # Output: [B, seq_len, 1024]
 ```

-**Performance**: 0.13s (22 calls, avg 0.006s/call)
-
-#### 3.5.3 State projector
+#### 3.5.3 State projector

 **Code location**: [src/unifolm_wma/models/ddpms.py:2014-2026](src/unifolm_wma/models/ddpms.py) - `MLPProjector`

@@ -273,9 +272,7 @@ class MLPProjector(nn.Module):

 **Data flow**: state [B, T_obs, 16] → MLPProjector → [B, T_obs, 1024] + agent_state_pos_emb

-**Performance**: 0.006s (22 calls, avg 0.0003s/call)
-
-#### 3.5.4 Action projector
+#### 3.5.4 Action projector

 **Code location**: [src/unifolm_wma/models/ddpms.py:2020-2024](src/unifolm_wma/models/ddpms.py) - `MLPProjector`

@@ -288,47 +285,49 @@ self.agent_action_pos_emb = nn.Parameter(torch.randn(1, 16, 1024))
 self.agent_state_pos_emb = nn.Parameter(torch.randn(1, n_obs_steps, 1024))
 ```

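-**Performance**: 0.003s (22 calls, avg 0.0001s/call)
-
----
-
-## 4. Performance bottleneck analysis
-
-### 4.1 Time breakdown (412.39s total)
-
-According to the profiling data, time is distributed as follows:
-
-| Stage | Total time | Share | Notes |
-|------|--------|------|------|
-| Stage 1: action generation | 171.52s | 41.6% | DDIM sampling, 30 steps |
-| Stage 2: world-model interaction | 171.65s | 41.6% | DDIM sampling, 30 steps |
-| Model loading | 47.56s | 11.5% | One-time cost |
-| Saving video | 13.91s | 3.4% | I/O |
-| Saving full video | 7.22s | 1.8% | I/O |
-| Dataset loading | 0.51s | 0.1% | One-time cost |
-
-### 4.2 Detailed DDIM sampling analysis
-
-**DDIM sampling is the absolute bottleneck, accounting for 94.9% of total time**
-
-```
-Total DDIM sampling time: 325.74s
-├── Model forward pass: 325.30s (99.86%)  ← core bottleneck
-├── DDIM update formula: 0.21s (0.06%)
-└── Action/State scheduling: 0.13s (0.04%)
-```
-
-**Per-step timing analysis**:
-- 30 denoising steps, average time per step: 10.86s
-- Each step calls the model forward twice (once each for stage 1 and stage 2)
-- Each forward pass: ~0.493s
-
-### 4.3 Bottleneck summary
-
-**Key findings**:
-1. **The model forward pass accounts for 99.86% of DDIM time** - this is the core optimization target
-2. VAE decode accounts for 4.5% of total time - secondary optimization target
-3. Other operations (condition encoding, DDIM update) take negligible time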
+---
+
+## 4. Performance bottleneck analysis
+
+### 4.1 Time breakdown (profiling report, 11 iterations / 22 sampling calls)
+
+Note: `profile_section` calls are nested, so the sum of the top-level totals is not wall time; rely on each section's own total/avg below.
+
+| Section | Count | Total(s) | Avg(s) | Notes |
+|---------|-------|----------|--------|------|
+| iteration_total | 11 | 836.79 | 76.07 | Total time per iteration |
+| action_generation | 11 | 399.71 | 36.34 | Action generation (DDIM, 50 steps) |
+| world_model_interaction | 11 | 398.38 | 36.22 | World-model interaction (DDIM, 50 steps) |
+| synthesis/ddim_sampling | 22 | 782.70 | 35.58 | One sampling call |
+| synthesis/conditioning_prep | 22 | 2.92 | 0.13 | All condition encoding combined |
+| synthesis/decode_first_stage | 22 | 12.44 | 0.57 | VAE decode |
+| save_results | 11 | 38.67 | 3.52 | I/O (saving results) |
+| model_loading/config | 1 | 49.77 | 49.77 | One-time cost |
+| model_loading/checkpoint | 1 | 11.83 | 11.83 | One-time cost |
+| model_to_cuda | 1 | 8.91 | 8.91 | One-time cost |
+
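The nesting caveat in section 4.1 can be illustrated with a minimal sketch. The `profile_section` below is a hypothetical stand-in written only for illustration (the project's real helper is not shown in this diff); it demonstrates why summing nested sections counts the same time more than once.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical stand-in for the project's profile_section helper.
totals = defaultdict(float)
counts = defaultdict(int)

@contextmanager
def profile_section(name):
    """Accumulate elapsed time and call count under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[name] += time.perf_counter() - start
        counts[name] += 1

wall_start = time.perf_counter()
with profile_section("iteration_total"):
    with profile_section("action_generation"):
        with profile_section("synthesis/ddim_sampling"):
            time.sleep(0.02)  # stands in for the sampling work
wall_time = time.perf_counter() - wall_start

# The same sleep is recorded in all three sections, so the naive sum
# is roughly three times the wall time.
naive_sum = sum(totals.values())
print(f"wall={wall_time:.3f}s naive_sum={naive_sum:.3f}s")
```

This is why the table is read row by row rather than summed.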
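+### 4.2 Detailed DDIM sampling analysis
+
+**DDIM sampling is the main bottleneck** (based on 50-step sampling):
+
+- Sampling calls: 22 (11 iterations × 2 stages)
+- Total sampling time: 782.70s, avg 35.58s/call
+- Average time per step: ~0.712s (35.58s / 50)
+- With `unconditional_guidance_scale=1.0`, each step runs 1 UNet forward pass; with CFG enabled, each step runs 2 forward passes
+- Within `conditioning_prep + ddim_sampling + decode_first_stage`, ddim_sampling accounts for about 98%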
+
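The derived figures in sections 4.2 and 4.3 follow directly from the table in section 4.1. The short check below reproduces them; the variable names are illustrative, and the inputs are copied from that table.

```python
# Figures copied from the profiling table in section 4.1.
ddim_total_s = 782.70
cond_prep_total_s = 2.92
decode_total_s = 12.44
iteration_total_s = 836.79
n_iterations = 11
n_sampling_calls = 22
ddim_steps = 50

avg_per_call = ddim_total_s / n_sampling_calls         # ~35.58 s
avg_per_step = avg_per_call / ddim_steps               # ~0.712 s
sampling_per_iteration = ddim_total_s / n_iterations   # ~71.15 s
share_of_iteration = ddim_total_s / iteration_total_s  # ~0.935
synthesis_total_s = cond_prep_total_s + ddim_total_s + decode_total_s
share_of_synthesis = ddim_total_s / synthesis_total_s  # ~0.981

print(f"{avg_per_call:.2f}s/call, {avg_per_step:.3f}s/step")
print(f"{share_of_iteration:.1%} of an iteration, {share_of_synthesis:.1%} of synthesis")
```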
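+### 4.3 Bottleneck summary
+
+**Key findings**:
+1. **DDIM sampling has the largest share** - an iteration averages 76.07s, of which sampling takes about 71.15s (≈93%)
+2. **CUDA operator time is concentrated in Linear/GEMM (29.8%) and Convolution (13.9%)**; Attention is about 3.0%
+3. **CPU-side copy/to is still significant** (`aten::copy_` and `aten::to/_to_copy` rank near the top of the report by time)
+4. VAE decode is a secondary bottleneck (0.57s/call)
+
+### 4.4 GPU memory overview
+
+- Peak allocated: 17890.50 MB
+- Average allocated: 16129.98 MB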

 ---

@@ -404,7 +403,7 @@ self.out_layers = nn.Sequential(
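 2. The `SiLU + Linear` in `emb_layers` can be fused
 3. The residual add can be fused with the next layer's GroupNorm

-**Actual bottleneck**: 16 ResBlocks × 30 steps × 2 stages = **960 ResBlock calls**
+**Actual bottleneck**: 16 ResBlocks × 50 steps × 2 stages = **1600 ResBlock calls**

 **Expected benefit**: saves 50-60% of the kernel-launch overhead per ResBlock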

@@ -415,7 +414,7 @@ self.out_layers = nn.Sequential(
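 **Actual configuration**:
 - SpatialTransformer: attention over the spatial dimension
 - TemporalTransformer: attention over the temporal dimension
-- Total: 32 Transformers × 30 steps × 2 stages = **1920 attention calls**
+- Total: 32 Transformers × 50 steps × 2 stages = **3200 attention calls**

 **Optimization**:
 Use PyTorch's built-in Flash Attention: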
@@ -432,7 +431,7 @@ out = scaled_dot_product_attention(Q, K, V, is_causal=False)

 **Code location**: [src/unifolm_wma/models/autoencoder.py](src/unifolm_wma/models/autoencoder.py)

-**Current performance**: 15.53s (22 calls, avg 0.706s/call)
+**Current performance**: 12.44s (22 calls, avg 0.566s/call)

 **Optimization**:
 1. **Mixed precision**: decode in FP16
@@ -719,9 +718,9 @@ with torch.cuda.amp.autocast():

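 ### 9.1 Key findings

-1. **The model forward pass accounts for 99.86% of DDIM sampling time** - this is the absolute core of the optimization effort
-2. **30-step DDIM sampling accounts for 83% of total time** - reducing the step count or speeding up each step is key
-3. **VAE decode accounts for 4.5%** - secondary optimization target
+1. **DDIM sampling remains the main bottleneck** - one sampling call (50 steps) averages 35.58s
+2. **Linear/GEMM and Convolution are the main sources of CUDA time** - Attention's share is comparatively small
+3. **VAE decode is a secondary optimization target** - 0.57s/call

 ### 9.2 Optimization priorities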

@@ -740,14 +739,14 @@ with torch.cuda.amp.autocast():

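 ### 9.3 Expected results

-With systematic optimization, inference time is expected to drop from **412s to 140-200s**, a **2-3× speedup**.
+With systematic optimization, a **1.5-3× speedup** is expected (depending on the number of sampling steps and the compile/mixed-precision strategy).

 ---

-**Document version**: v1.1
-**Created**: 2026-01-17
-**Last updated**: 2026-01-17
-**Changes**: Verified against the actual code and corrected file paths, line numbers, component locations, and implementation details
+**Document version**: v1.2
+**Created**: 2026-01-17
+**Last updated**: 2026-01-18
+**Changes**: Calibrated the DDIM step count to 50 and replaced the figures with the latest profiling data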


 ---
@@ -1121,9 +1120,9 @@ c_concat = [cond_latent]  # channel-wise concatenation
 c_crossattn = [cond_text_emb, cond_img_emb, cond_state_emb, cond_action_emb]
 c_crossattn = torch.cat(c_crossattn, dim=1)  # [B, total_tokens, 1024]

-# 3. DDIM sampling loop (30 steps)
-x = torch.randn([B, 4, 16, 40, 64])  # initial noise
-for step in range(30):
+# 3. DDIM sampling loop (ddim_steps defaults to 50; the actual value is set by --ddim_steps)
+x = torch.randn([B, 4, 16, 40, 64])  # initial noise
+for step in range(ddim_steps):
     # 3.1 Timestep embedding
    t_emb = timestep_embedding(t, 320) → Linear → [B, 1280]
    
@@ -1143,62 +1142,93 @@ video = vae.decode(x) → [B, 3, 16, 320, 512]
 ```


-### A.9 Optimization recommendations updated for the actual architecture
-
-#### Optimization 1: ResBlock fusion (high priority)
-
-**Actual bottleneck**:
-- Each DDIM step calls the UNet once
-- The UNet contains: 4 downsampling stages + 1 middle block + 3 upsampling stages = 8 stages
-- Each stage has 2 ResBlocks
-- Total: 16 ResBlocks × 30 steps × 2 passes (stage 1 + stage 2) = **960 ResBlock calls**
-
-**Fusion opportunity**:
-```python
-# Current: 6 kernel launches
-h = group_norm(x)      # kernel 1
-h = silu(h)            # kernel 2
-h = conv2d(h)          # kernel 3
-h = group_norm(h)      # kernel 4
-h = silu(h)            # kernel 5
-h = conv2d(h)          # kernel 6
-out = x + h            # kernel 7
-
-# After optimization: 2-3 kernel launches
-h = fused_norm_silu_conv(x)     # kernel 1 (fused)
-h = fused_norm_silu_conv(h)     # kernel 2 (fused)
-out = fused_residual_add(x, h)  # kernel 3 (fused)
-```
-
-**Expected benefit**: saves 50-60% of the kernel-launch overhead per ResBlock
-
-
-#### Optimization 2: Attention optimization (high priority)
-
-**Actual configuration**:
-- SpatialTransformer: after every ResBlock in every stage
-- TemporalTransformer: after every ResBlock in every stage
-- Total: 16 Spatial + 16 Temporal = **32 Transformers × 30 steps × 2 passes = 1920 attention calls**
-
-**The current implementation already supports xformers**:
-The code at `attention.py:8-13` checks whether xformers is available:
-```python
-try:
-    import xformers
-    import xformers.ops
-    XFORMERS_IS_AVAILBLE = True
-except:
-    XFORMERS_IS_AVAILBLE = False
-```
-
-When xformers is available, `CrossAttention` automatically uses the `efficient_forward` method (attention.py:90-91).
-
-**Further optimization** (if xformers is unavailable):
-```python
-# Use PyTorch's built-in Flash Attention
-from torch.nn.functional import scaled_dot_product_attention
-out = scaled_dot_product_attention(Q, K, V, is_causal=False)
-```
-
-**Expected benefit**: if xformers is already enabled, the attention layers are already optimized; otherwise Flash Attention can give a 2-3× speedup
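+### A.9 Optimization recommendations updated for the actual architecture
+
+**My understanding**:
+This profiling run shows that sampling dominates: one sampling call (50 steps) averages 35.58s, and each iteration runs one sampling call for action_generation and one for world_model_interaction. In other words, any small per-step improvement is multiplied by 50 steps and 2 stages, so the most effective optimizations either reduce the number of steps or substantially speed up the UNet forward pass. CUDA time is concentrated in Linear/GEMM (29.8%) and Convolution (13.9%), while Attention is about 3.0%, so at the operator level, optimizing the matmul/convolution paths offers the more reliable payoff. CPU-side `aten::copy_/to/_to_copy` is also significant, which indicates that data movement inside the loop still carries avoidable cost.
+
+#### Optimization 1: Sampling steps and sampler (highest priority)
+
+**Rationale**:
+- 50-step sampling averages 35.58s (0.712s/step); reducing the step count yields a near-linear gain
+- An iteration takes about 76.07s, of which sampling is about 93%
+
+**Recommendations**:
+- Where quality allows, reduce `--ddim_steps` from 50 to 20-30
+- Evaluate faster samplers (e.g. DPM-Solver++/UniPC) to reduce the step count
+- If CFG is used, note that `unconditional_guidance_scale > 1.0` doubles the forward passes per step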
+
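As a rough sanity check on the "near-linear gain" claim, the sketch below estimates per-iteration time for a few step counts. It assumes, as a simplification not verified here, that per-step cost stays at the measured 0.712s regardless of step count and that the non-sampling remainder of an iteration (76.07s - 71.15s) is fixed. The function name and parameters are illustrative.

```python
def estimate_iteration_time(ddim_steps, per_step_s=0.712,
                            calls_per_iteration=2, other_s=76.07 - 71.15):
    """Estimate seconds per iteration under a linear-in-steps assumption."""
    sampling_s = ddim_steps * per_step_s * calls_per_iteration
    return sampling_s + other_s

baseline = estimate_iteration_time(50)
for steps in (50, 30, 20):
    est = estimate_iteration_time(steps)
    print(f"{steps:>2} steps: ~{est:.1f}s per iteration, "
          f"~{baseline / est:.2f}x vs 50 steps")
```

Under these assumptions, 30 steps gives roughly a 1.6x speedup and 20 steps roughly 2.3x. Actual gains depend on whether quality holds at the lower step count.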
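+#### Optimization 2: Accelerating the GEMM/Conv-dominated path (high priority)
+
+**Rationale**:
+- CUDA time comes mainly from Linear/GEMM and Convolution
+
+**Recommendations**:
+- Wrap only the UNet backbone in `torch.compile()` to get fusion benefits
+- Enable mixed precision (`autocast`) + TF32 (`torch.backends.cuda.matmul.allow_tf32 = True`)
+- When input shapes are fixed, enable `torch.backends.cudnn.benchmark = True`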
+
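The settings above might be combined as in the following sketch. The tiny `nn.Sequential` is a toy stand-in for the UNet backbone, and `configure_fast_inference` is a hypothetical helper name, not part of the project. The sketch falls back to CPU with bfloat16 autocast when no GPU is present, purely so that it runs anywhere.

```python
import torch
import torch.nn as nn

def configure_fast_inference(model: nn.Module, use_compile: bool = False) -> nn.Module:
    """Hypothetical helper: set global backend flags and optionally compile."""
    # Allow TF32 for matmuls and cuDNN convolutions (only has an effect
    # on GPUs that support TF32).
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    # Let cuDNN autotune convolution algorithms; useful with fixed shapes.
    torch.backends.cudnn.benchmark = True
    model.eval()
    if use_compile:
        # Compilation is lazy: it happens on the first forward pass.
        model = torch.compile(model)
    return model

# Toy stand-in for the UNet backbone (assumption, not the real model).
backbone = nn.Sequential(
    nn.Conv2d(4, 8, 3, padding=1), nn.SiLU(), nn.Conv2d(8, 4, 3, padding=1)
)
backbone = configure_fast_inference(backbone, use_compile=False)

device = "cuda" if torch.cuda.is_available() else "cpu"
backbone = backbone.to(device)
x = torch.randn(1, 4, 16, 16, device=device)
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
with torch.inference_mode(), torch.autocast(device_type=device, dtype=amp_dtype):
    y = backbone(x)
print(y.shape, y.dtype)
```

Passing `use_compile=True` wraps the module with `torch.compile`; the first forward pass then pays a one-time compilation cost, so it is worth measuring over several sampling calls.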
+#### Optimization 3: ResBlock fusion (medium-high priority)
+
+**Actual bottleneck**:
+- Each DDIM step calls the UNet once
+- The UNet contains: 4 downsampling stages + 1 middle block + 3 upsampling stages = 8 stages
+- Each stage has 2 ResBlocks
+- Total: 16 ResBlocks × 50 steps × 2 passes (stage 1 + stage 2) = **1600 ResBlock calls**
+
+**Fusion opportunity**:
+```python
+# Current: 7 kernel launches (including the residual add)
+h = group_norm(x)      # kernel 1
+h = silu(h)            # kernel 2
+h = conv2d(h)          # kernel 3
+h = group_norm(h)      # kernel 4
+h = silu(h)            # kernel 5
+h = conv2d(h)          # kernel 6
+out = x + h            # kernel 7
+
+# After optimization: 2-3 kernel launches
+h = fused_norm_silu_conv(x)     # kernel 1 (fused)
+h = fused_norm_silu_conv(h)     # kernel 2 (fused)
+out = fused_residual_add(x, h)  # kernel 3 (fused)
+```
+
+**Expected benefit**: saves 50-60% of the kernel-launch overhead per ResBlock
+
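+#### Optimization 4: Attention optimization (medium priority)
+
+**Actual configuration**:
+- SpatialTransformer: after every ResBlock in every stage
+- TemporalTransformer: after every ResBlock in every stage
+- Total: 16 Spatial + 16 Temporal = **32 Transformers × 50 steps × 2 passes = 3200 attention calls**
+
+**Interpretation**:
+Attention accounts for only about 3% of operator time, so it is not the main bottleneck right now; still, if an efficient implementation is not yet enabled, switching to one gives a reliable gain.
+
+**Optimization**:
+- Confirm that xformers is enabled (`XFORMERS_IS_AVAILBLE` is True)
+- Without xformers, switch to PyTorch SDPA:
+```python
+from torch.nn.functional import scaled_dot_product_attention
+out = scaled_dot_product_attention(Q, K, V, is_causal=False)
+```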
+
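+#### Optimization 5: Data movement and CPU overhead (medium priority)
+
+**Rationale**:
+- `aten::copy_` and `aten::to/_to_copy` stand out in CPU-side time
+
+**Recommendations**:
+- Avoid repeated `.to(device)` / `.float()` / `.half()` inside the DDIM loop
+- Move constant tensors (e.g. timestep, sigma) to the GPU ahead of time
+- Minimize temporary tensor creation and `clone()`, especially at per-step granularity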
+
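The recommendations above can be sketched as follows. The schedule tensors and the update rule are placeholders invented for illustration (the real sampler's tensors and DDIM formula are not shown in this diff); the point is only that device transfers and dtype casts happen once, before the loop.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
ddim_steps = 50

# Hypothetical per-step schedule, created on the CPU.
timesteps_cpu = torch.linspace(999, 0, ddim_steps).long()
sigmas_cpu = torch.linspace(1.0, 0.0, ddim_steps)

# Move and cast the schedule once, before the loop.
timesteps = timesteps_cpu.to(device)
sigmas = sigmas_cpu.to(device=device, dtype=torch.float32)

x = torch.randn(1, 4, 8, 8, device=device)
noise = torch.empty_like(x)  # preallocated buffer, reused every step

for i in range(ddim_steps):
    t = timesteps[i]       # already on the device: no per-step transfer
    sigma = sigmas[i]
    noise.normal_()        # refill in place instead of allocating a new tensor
    x = x + sigma * noise  # placeholder update, not the DDIM formula

print(x.shape, x.device)
```

Whether this removes the `aten::copy_` / `aten::to` hot spots in the actual sampler would need to be confirmed by re-running the profiler.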
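+#### Optimization 6: VAE decode (low priority)
+
+**Rationale**:
+- VAE decode takes 0.57s/call, secondary to the sampling bottleneck
+
+**Recommendations**:
+- Decode under `autocast` consistently
+- If a slight quality loss is acceptable, reduce the decode frequency or resolution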