Optimize small-tensor allocation inside the DDIM loop and cache the attention mask on the GPU; roughly 30 s faster

2026-01-18 22:37:55 +08:00
parent a90efc6718
commit cb334f308b
9 changed files with 103 additions and 49 deletions
--- a/useful.sh
+++ b/useful.sh
@@ -106,4 +106,16 @@ embedder：

  1. Add --encoder_mode {fp32, autocast, bf16_full}
  2. bf16_full = BF16 weights + BF16 forward pass
-  3. autocast = FP32 weights + autocast on the backbone only (current implementation)
+  3. autocast = FP32 weights + autocast on the backbone only (current implementation)
+
+
+
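+    1. Small-tensor allocation optimization inside the DDIM loop (done)
+
+  - Replace the per-step torch.full(...) with tensors built ahead of time and broadcast, reducing allocations inside the loop
+  - Location: src/unifolm_wma/models/samplers/ddim.py
+
+  2. Cache the attention mask on the GPU (done)
+
+  - _get_attn_mask_aa now builds the mask directly on the target device and caches it, avoiding a CPU→GPU copy at every step
+  - Location: src/unifolm_wma/modules/attention.py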
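
The two optimizations noted in this hunk could look roughly like the sketch below. It is not the code from ddim.py or attention.py: the helper names `make_timestep_batches` and `get_causal_mask` are hypothetical, and a plain lower-triangular mask stands in for whatever `_get_attn_mask_aa` actually builds.

```python
import torch

# Hypothetical helpers for illustration; not the repository's real functions.

def make_timestep_batches(timesteps, batch_size, device):
    """Build every per-step timestep tensor once, before the sampling loop."""
    ts = torch.as_tensor(timesteps, dtype=torch.long, device=device)
    # expand() returns a broadcast view, so the loop allocates nothing per step.
    return ts.unsqueeze(1).expand(-1, batch_size)  # shape: (num_steps, batch_size)

_MASK_CACHE = {}

def get_causal_mask(seq_len, device):
    """Return a boolean lower-triangular mask, built once per (seq_len, device)."""
    key = (seq_len, str(device))
    mask = _MASK_CACHE.get(key)
    if mask is None:
        # Construct on the target device directly instead of CPU + .to(device).
        mask = torch.ones(seq_len, seq_len, device=device).tril().bool()
        _MASK_CACHE[key] = mask
    return mask

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
steps = make_timestep_batches([999, 666, 333, 0], batch_size=2, device=device)
for ts in steps:
    # ts has shape (batch_size,), the same as torch.full((batch_size,), step).
    mask = get_causal_mask(4, device)  # cache hit after the first iteration
```

Rows of the expanded tensor are stride-0 views; if a downstream op requires contiguous memory, calling `.contiguous()` on the row would bring back one small allocation for that step.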