Optimize small-tensor allocation inside the DDIM loop and cache the attention mask on the GPU (~30 s faster)
useful.sh
@@ -106,4 +106,16 @@ embedder:
1. Added --encoder_mode {fp32, autocast, bf16_full}
2. bf16_full = BF16 weights + BF16 forward pass
3. autocast = FP32 weights + autocast on the backbone only (the current implementation); a sketch of the three modes follows this list
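A minimal sketch of how the three modes could be wired; the argument parsing and the model name below are placeholders, not the repo's actual code, and the real implementation autocasts only the backbone rather than the whole module:

    import argparse
    import torch

    parser = argparse.ArgumentParser()
    parser.add_argument("--encoder_mode",
                        choices=["fp32", "autocast", "bf16_full"],
                        default="autocast")
    args = parser.parse_args()

    def run_encoder(model, x, mode):
        # "model" stands in for the encoder module.
        if mode == "bf16_full":
            # BF16 weights + BF16 forward
            return model.to(torch.bfloat16)(x.to(torch.bfloat16))
        if mode == "autocast":
            # FP32 weights; compute under BF16 autocast
            with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
                return model(x)
        return model(x)  # fp32: plain full-precision forward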
1. Small-tensor allocation optimization inside the DDIM loop (done; see the first sketch after this list)
- Replaced the per-step torch.full(...) calls with tensors constructed/broadcast up front, cutting allocations inside the loop
- Location: src/unifolm_wma/models/samplers/ddim.py
2. Attention mask cached on the GPU (done; see the second sketch after this list)
- _get_attn_mask_aa now builds and caches the mask directly on the target device, avoiding a CPU→GPU copy on every step
- Location: src/unifolm_wma/modules/attention.py
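First sketch, for item 1: the per-step torch.full(...) becomes a single buffer filled in place, so the loop body allocates nothing new for the timestep tensor. The names denoise and timesteps are placeholders for the sampler's actual model call and schedule:

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    b, timesteps = 4, list(range(50))
    x = torch.randn(b, 8, device=device)

    def denoise(x, ts):       # stand-in for the sampler's model call
        return x

    # Before: ts = torch.full((b,), step, device=device, dtype=torch.long)
    # allocated a fresh tensor on every DDIM step.
    ts = torch.empty(b, device=device, dtype=torch.long)
    for step in reversed(timesteps):
        ts.fill_(step)        # reuse the same preallocated buffer
        x = denoise(x, ts)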
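Second sketch, for item 2: a cache keyed by shape and device, built directly on the target device so no per-step CPU→GPU copy is needed. The function name and the causal (tril) mask contents are assumptions; the real _get_attn_mask_aa may build a different pattern:

    import torch

    _mask_cache = {}

    def get_attn_mask(seq_len, device):
        # Build once per (seq_len, device) pair, directly on the target
        # device -- no CPU construction followed by .to(device) each step.
        key = (seq_len, str(device))
        mask = _mask_cache.get(key)
        if mask is None:
            mask = torch.tril(torch.ones(seq_len, seq_len,
                                         dtype=torch.bool, device=device))
            _mask_cache[key] = mask
        return mask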