7 Commits

Author SHA1 Message Date
qhy
43ab0f71b0 Update all results after the write optimization 2026-02-19 20:18:31 +08:00
qhy
5e0e21d91b Restore the sh scripts to their original versions 2026-02-18 14:11:55 +08:00
qhy
d5bec53f61 All results after optimization 2026-02-11 19:21:06 +08:00
qhy
508b91f5a2 Defer decoding: only decode the 1 frame CLIP needs
- world-model call passes decode_video=False, skipping the full 16-frame decode
- only the last frame is decoded for the CLIP embedding / observation queue
- raw latents are stored; one batch decode after the loop produces the final video
- saves 15 VAE decodes per iteration, 120 across 8 iterations
- skips wm tensorboard/mp4 saving for intermediate iterations
Slight PSNR drop
2026-02-11 17:07:33 +08:00
qhy
3101252c25 No noticeable speed change; PSNR improves significantly 2026-02-11 16:38:21 +08:00
qhy
f386a5810b Follow-up to the previous commit 2026-02-11 16:24:40 +08:00
qhy
352a79035f Backbone in FP16; most sensitive case PSNR=25.21; consider reverting the most FP16-sensitive backbone parts to FP32 2026-02-11 16:23:21 +08:00
70 changed files with 8007 additions and 217 deletions


@@ -9,7 +9,13 @@
"Bash(nvidia-smi:*)", "Bash(nvidia-smi:*)",
"Bash(conda activate unifolm-wma)", "Bash(conda activate unifolm-wma)",
"Bash(conda info:*)", "Bash(conda info:*)",
"Bash(direnv allow:*)" "Bash(direnv allow:*)",
"Bash(ls:*)",
"Bash(for scenario in unitree_g1_pack_camera unitree_z1_dual_arm_cleanup_pencils unitree_z1_dual_arm_stackbox unitree_z1_dual_arm_stackbox_v2 unitree_z1_stackbox)",
"Bash(do for case in case1 case2 case3 case4)",
"Bash(done)",
"Bash(chmod:*)",
"Bash(ln:*)"
] ]
} }
} }


@@ -0,0 +1,122 @@
== Task Comprehension: Diffusion Model and UnifoLM-WMA
This section provides a comprehensive overview of the UnifoLM-WMA-0 deep learning architecture, serving as a practical foundation for the optimization strategies discussed in subsequent sections.
=== Overall Inference Pipeline
UnifoLM-WMA-0 is Unitree Robotics' open-source World-Model-Action framework. Its core task is to predict future video frame sequences, along with the corresponding robot action and state trajectories, given a current observation image and a text instruction. The model operates in an interactive simulation mode: each iteration consumes the previous prediction as input and generates the next segment of video and actions, forming a closed-loop rollout. A single iteration of this pipeline decomposes into four sequential stages: condition encoding, VAE encoding, DDIM diffusion sampling, and VAE decoding, each described below.
==== Condition Encoding
The condition encoding stage transforms raw multi-modal inputs into a unified context vector that guides the diffusion denoising process, through three parallel encoding paths. On the image side, the input observation image (320#sym.times 512) is processed by a frozen OpenCLIP ViT-H-14 vision encoder, then compressed through a Resampler, a Perceiver-based cross-attention module (4 layers, 12 heads, dim\_head=64, embed\_dim 1280 #sym.arrow 1024), into 16 image condition tokens per frame, yielding $16 times T = 256$ image tokens for T=16 frames.
On the text side, the instruction is encoded by a frozen OpenCLIP text encoder (`FrozenOpenCLIPEmbedder`, penultimate layer output) into 77 tokens of dimension 1024, computed once and reused across all DDIM steps. On the state side, the robot proprioceptive state (dim 16) is mapped through a SATokenProjector (Perceiver Attention, 1 layer, 16 heads, dim\_head=64, 16 learnable queries) into 16 tokens of dimension 1024.
Together with the agent-action tokens produced by an analogous action projector, these token sets are concatenated to form the unified context vector: `[agent_state(2) | agent_action(16) | text(77) | image(256)]`, totaling 351 tokens per cross-attention operation.
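To make the token bookkeeping concrete, the sketch below assembles the context from dummy tensors; all counts follow the description above, and the variable names are illustrative rather than identifiers from the codebase.

```python
import torch

B = 1                                       # batch size
state_tokens  = torch.randn(B, 2,   1024)   # agent state  (2 tokens)
action_tokens = torch.randn(B, 16,  1024)   # agent action (16 tokens)
text_tokens   = torch.randn(B, 77,  1024)   # frozen OpenCLIP text encoder
image_tokens  = torch.randn(B, 256, 1024)   # Resampler: 16 tokens x T=16 frames

# Unified context fed to every cross-attention call:
# [agent_state(2) | agent_action(16) | text(77) | image(256)] = 351 tokens.
context = torch.cat([state_tokens, action_tokens, text_tokens, image_tokens], dim=1)
assert context.shape == (B, 351, 1024)
```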
==== VAE Encoding
The observation images are encoded into a compact latent space through an AutoencoderKL (`autoencoder.py`) — a variational autoencoder regularized by KL divergence. The encoder follows a convolutional architecture with 4-level channel multipliers [1, 2, 4, 4] (base channels ch=128, yielding channel widths [128, 256, 512, 512]), 2 residual blocks per level, and a latent channel count of z\_channels=4. The input RGB frames at resolution 320#sym.times 512 are encoded into latent representations at 1/8 spatial resolution, producing tensors of shape `(B, 4, T, 40, 64)`.
A critical configuration parameter is `perframe_ae=True`, which means the VAE processes each of the T=16 frames independently rather than as a 3D volume. While this per-frame strategy avoids the memory overhead of volumetric convolutions, it introduces a sequential loop of T forward passes through the encoder, a point worth noting for latency optimization. The latent representations are scaled by a fixed factor of `scale_factor=0.18215` before being fed into the diffusion process.
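A minimal sketch of that loop, with a dummy callable standing in for the AutoencoderKL encode-and-sample path (function and names are hypothetical):

```python
import torch

def encode_per_frame(vae_encoder, frames, scale_factor=0.18215):
    """Per-frame VAE encoding as implied by `perframe_ae=True`.

    frames: (B, 3, T, 320, 512) RGB video -> latents (B, 4, T, 40, 64).
    """
    latents = [vae_encoder(frames[:, :, t]) for t in range(frames.shape[2])]
    z = torch.stack(latents, dim=2)      # sequential loop: T forward passes
    return z * scale_factor              # fixed latent scaling

# Dummy encoder: 8x spatial downsample to 4 channels, just to check shapes.
dummy = lambda x: torch.nn.functional.avg_pool2d(x, 8)[:, :1].repeat(1, 4, 1, 1)
z = encode_per_frame(dummy, torch.randn(1, 3, 16, 320, 512))
assert z.shape == (1, 4, 16, 40, 64)
```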
==== DDIM Diffusion Sampling
This stage dominates inference time. A DDIM (Denoising Diffusion Implicit Models) sampler (`ddim.py`) is employed with a default of 50 denoising steps. The diffusion process is parameterized with v-prediction (`parameterization="v"`), 1000 training timesteps, and a linear beta schedule from `linear_start=0.00085` to `linear_end=0.012`, with zero-SNR terminal rescaling enabled (`rescale_betas_zero_snr=True`) and dynamic rescaling applied at `base_scale=0.7` to stabilize generation quality.
Unlike standard video diffusion models that only predict denoised video latents, UnifoLM-WMA simultaneously produces three outputs per step: a video latent prediction `y` of shape `(B, 4, T, 40, 64)`, an action trajectory prediction `a_y` of shape `(B, T, 16)`, and a state trajectory prediction `s_y` of shape `(B, T, 16)`. The three predictions share the same diffusion timestep but employ heterogeneous noise schedules: the video stream uses the DDPM schedule with v-prediction, while the action and state streams use a `DDIMScheduler` from the `diffusers` library with epsilon-prediction and a `squaredcos_cap_v2` beta schedule. This design allows each modality to adopt its optimal denoising strategy.
The sampler also supports classifier-free guidance with `unconditional_guidance_scale` and guidance rescaling, applied only to the video stream to balance generation quality and diversity.
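To make the split concrete, the sketch below instantiates both schedule families. The `DDIMScheduler` uses only the hyperparameters quoted above (remaining kwargs left at `diffusers` defaults), the video-side betas follow the LDM convention in which the "linear" schedule is linearly spaced in sqrt space (zero-SNR rescaling omitted), and `x0_from_v` is an illustrative helper, not a function from the codebase.

```python
import torch
from diffusers import DDIMScheduler

# Action/state streams: epsilon-prediction DDIM over a squaredcos_cap_v2
# beta schedule, per the description above.
traj_scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_schedule="squaredcos_cap_v2",
    prediction_type="epsilon",
)
traj_scheduler.set_timesteps(50)

# Video stream: v-prediction over "linear" betas in the LDM sense,
# i.e. linearly spaced in sqrt space between linear_start and linear_end.
betas = torch.linspace(0.00085 ** 0.5, 0.012 ** 0.5, 1000) ** 2
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def x0_from_v(x_t, v, t):
    """Recover x0 from a v-prediction: x0 = sqrt(a_t)*x_t - sqrt(1-a_t)*v."""
    a_t = alphas_cumprod[t]
    return a_t.sqrt() * x_t - (1.0 - a_t).sqrt() * v
```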
==== VAE Decoding
After the DDIM sampling loop completes, the denoised video latent tensor $x_0$ of shape `(B, 4, T, 40, 64)` is decoded back to RGB pixel space through the AutoencoderKL decoder. Due to the `perframe_ae=True` configuration, decoding is likewise performed frame-by-frame: each of the T=16 latent frames is individually inverse-scaled by $1 slash "scale_factor"$, passed through the decoder's convolutional transpose layers, and reconstructed to a 320#sym.times 512 RGB frame.
In the interactive simulation mode, the decoded video serves a dual purpose: it provides the observation image for the next iteration's condition encoding (only the first `exe_steps` frames are needed), and it produces the final output video for visualization and evaluation. The action and state trajectories predicted by the DDIM loop are used directly for robot control without further decoding.
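Mirroring the encoder, the decode path can be sketched as follows; the function and the `vae_decoder` stand-in are hypothetical, and only the first `exe_steps` frames would need decoding when the output merely feeds the next iteration:

```python
import torch

def decode_per_frame(vae_decoder, z, scale_factor=0.18215):
    """Frame-by-frame VAE decoding, mirroring `perframe_ae=True`.

    z: denoised latents (B, 4, T, 40, 64) -> RGB video (B, 3, T, 320, 512).
    """
    z = z / scale_factor                     # inverse latent scaling
    frames = [vae_decoder(z[:, :, t]) for t in range(z.shape[2])]
    return torch.stack(frames, dim=2)

# Hypothetical partial decode when only the next observation is needed:
# next_obs = decode_per_frame(vae_decoder, z[:, :, :exe_steps])
```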
=== WMAModel Backbone: Dual-UNet Collaborative Architecture
The WMAModel (`wma_model.py:326`) is the core neural network invoked at every DDIM step, employing a unique dual-UNet collaborative architecture that jointly predicts video, actions, and states within a single forward pass. This tightly-coupled design enables the action and state predictions to directly leverage the rich spatiotemporal features extracted by the video generation backbone, rather than treating them as independent prediction heads.
==== Video UNet
The primary backbone is a 2D convolution-based UNet with temporal extensions. Its key configuration is summarized in the following table:
#figure(
table(
columns: (3fr, 5fr),
[*Parameter*], [*Value*],
[Input / Output channels], [8 (4 latent + 4 conditioning) / 4],
[Base model channels], [320],
[Channel multipliers], [\[1, 2, 4, 4\] #sym.arrow widths \[320, 640, 1280, 1280\]],
[Residual blocks per level], [2],
[Attention resolutions], [\[4, 2, 1\] (3 of 4 resolution levels)],
[Attention head channels], [64],
[Transformer depth], [1 per attention resolution],
[Context dimension], [1024],
[Temporal length], [16 frames],
),
caption: [Video UNet configuration parameters.],
)
The UNet follows the classic encoder-middle-decoder structure with skip connections. At each attention-enabled resolution level, every ResBlock is followed by two transformer modules: a SpatialTransformer that performs spatial self-attention among all $H times W$ tokens within each frame followed by cross-attention with the 351-token context vector, and a TemporalTransformer that performs self-attention among T=16 time-step tokens at each spatial position (configured with `temporal_selfatt_only=True`, i.e., no cross-attention).
During the forward pass, intermediate feature maps are collected after each Downsample layer and the middle block, reshaped from $(B times T, C, H, W)$ to $(B, T, C, H, W)$, accumulating 10 multi-scale feature maps in `hs_a`, the bridge to the Action/State UNets.
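A sketch of that collection step, assuming each captured feature map arrives as a $(B times T, C, H, W)$ tensor (helper name hypothetical):

```python
import torch
from einops import rearrange

def collect_hs_a(feature_maps, batch_size):
    """Reshape Video UNet features for the Action/State UNets.

    Each entry of `feature_maps` is (B*T, C, H, W); the result restores the
    time axis so downstream 1D UNets can condition on it per time step.
    """
    return [rearrange(f, "(b t) c h w -> b t c h w", b=batch_size)
            for f in feature_maps]
```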
==== Action UNet and State UNet
The Action UNet (`conditional_unet1d.py`) is a 1D convolutional UNet specifically designed for predicting robot action trajectories. Its configuration is as follows:
#figure(
table(
columns: (3fr, 5fr),
[*Parameter*], [*Value*],
[Input dimension], [16 (agent\_action\_dim)],
[Down channel widths], [\[256, 512, 1024, 2048\]],
[Kernel size], [5],
[GroupNorm groups], [8],
[Diffusion step embedding dim], [128],
[Horizon], [16],
[Action projection dim], [32],
),
caption: [Action UNet (ConditionalUnet1D) configuration parameters.],
)
The Action UNet receives the 10 `hs_a` feature maps from the Video UNet as visual conditioning. The conditioning pipeline involves three stages: (1) SpatialSoftmax compresses each 2D feature map into keypoint coordinates $(B times T, C, 2)$; (2) the compressed features are concatenated with the diffusion timestep embedding and observation encoding (ResNet-18 `MultiImageObsEncoder`), then injected via FiLM modulation to produce per-channel scale/bias for the 1D convolution blocks; (3) `ActionLatentImageCrossAttention` enables action tokens to cross-attend to the Video UNet's spatiotemporal features, allowing visually-grounded action planning.
The input action tensor $(B, T, 16)$ is projected to act\_proj\_dim=32, processed through the 1D UNet, then projected back to $(B, T, 16)$.
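Stage (2) is ordinary FiLM conditioning; the generic sketch below treats the conditioning vector as an opaque input (in the real model it concatenates SpatialSoftmax keypoints, the diffusion timestep embedding, and the observation encoding; the class here is illustrative):

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Per-channel scale/bias modulation for a 1D conv block."""

    def __init__(self, cond_dim: int, channels: int):
        super().__init__()
        self.proj = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x, cond):          # x: (B, C, T), cond: (B, cond_dim)
        scale, bias = self.proj(cond).chunk(2, dim=-1)
        return x * (1 + scale.unsqueeze(-1)) + bias.unsqueeze(-1)
```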
The State UNet is an identical `ConditionalUnet1D` instance with the same hyperparameters, operating on the state tensor `x_state` $(B, T, 16)$ instead of the action tensor.
A critical optimization observation: the Action and State UNets are computationally independent, sharing read-only inputs with no data dependencies between them. The original code executes them sequentially, leaving significant room for CUDA stream parallelization, as sketched below.
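A sketch of that parallelization with two CUDA streams; the module handles and call signatures are placeholders for whatever the real heads take:

```python
import torch

def run_heads_parallel(action_unet, state_unet, x_action, x_state, cond):
    """Overlap the independent Action and State UNets on two CUDA streams."""
    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    cur = torch.cuda.current_stream()
    s1.wait_stream(cur)                  # inputs ready before either launch
    s2.wait_stream(cur)
    with torch.cuda.stream(s1):
        a_out = action_unet(x_action, cond)
    with torch.cuda.stream(s2):
        s_out = state_unet(x_state, cond)
    cur.wait_stream(s1)                  # sync before consuming the outputs
    cur.wait_stream(s2)
    return a_out, s_out
```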
=== Multi-Level Design of Attention Mechanisms
The attention mechanisms in UnifoLM-WMA constitute the core computational bottleneck of inference. Their design encompasses four distinct levels, each serving a different purpose in the model's spatiotemporal reasoning, and understanding their structure is essential for identifying optimization opportunities.
The first level is *spatial self-attention* within the SpatialTransformer. For a latent frame at resolution $H times W$, the token count is $H times W$ (e.g., $40 times 64 = 2560$ at the highest resolution). It is implemented via xformers' `memory_efficient_attention`, which reduces peak attention memory from $O(N^2)$ to $O(N)$. Q/K/V use bias-free linear layers, with head count = channel\_dim / num\_head\_channels (e.g., 1280/64 = 20 heads).
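A sketch of one such call (requires `xformers`; the projection layers are passed in explicitly rather than read from the real module):

```python
import torch
import xformers.ops as xops
from einops import rearrange

def spatial_self_attention(x, to_q, to_k, to_v, num_head_channels=64):
    """Spatial self-attention over the H*W tokens of each frame.

    x: (B*T, C, H, W); to_q/to_k/to_v are bias-free nn.Linear(C, C) layers.
    """
    bt, c, h, w = x.shape
    heads = c // num_head_channels                 # e.g. 1280 / 64 = 20 heads
    tokens = rearrange(x, "b c h w -> b (h w) c")  # e.g. 40 * 64 = 2560 tokens
    q, k, v = (rearrange(p(tokens), "b n (h d) -> b n h d", h=heads)
               for p in (to_q, to_k, to_v))
    out = xops.memory_efficient_attention(q, k, v)  # O(N) peak memory
    return rearrange(out, "b (hh ww) h d -> b (h d) hh ww", hh=h, ww=w)
```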
The second level is *multi-source cross-attention*, the most distinctive design in UnifoLM-WMA. The unified context vector is split into four semantic sources, each with dedicated K/V projection layers:
#figure(
table(
columns: (2fr, 1fr, 3fr, 2fr),
[*Source*], [*Tokens*], [*K/V Projections*], [*Scale*],
[Text], [77], [`to_k` / `to_v` (shared base)], [1.0],
[Image], [16#sym.times T], [`to_k_ip` / `to_v_ip`], [`image_cross_attention_scale`],
[Agent state], [2], [`to_k_as` / `to_v_as`], [`agent_state_cross_attention_scale`],
[Agent action], [16], [`to_k_aa` / `to_v_aa`], [`agent_action_cross_attention_scale`],
),
caption: [Multi-source cross-attention configuration.],
)
The Query vector Q is always derived from the video latent features via `to_q`. For each of the four sources, independent attention scores are computed — $"softmax"(Q dot K_i^T \/ sqrt(d)) dot V_i$ — producing four separate attention outputs. These outputs are then combined via weighted summation:
$ "out" = "out"_"text" + alpha_"img" dot "out"_"ip" + alpha_"state" dot "out"_"as" + alpha_"action" dot "out"_"aa" $
In the current configuration, `cross_attention_scale_learnable=False` (fixed scales). This decoupled design adds 8 extra linear layers versus standard single-source cross-attention, creating opportunities for KV fusion optimization.
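A single-head sketch of the decoupled computation (head splitting and the xformers dispatch are omitted; `proj` maps each source name to its dedicated K/V `nn.Linear` pair, and `scales` holds the fixed weights with the text scale at 1.0):

```python
import torch
import torch.nn.functional as F

def multi_source_cross_attention(q, context, proj, scales):
    """q: (B, N, d) video queries; context: (B, 351, d) unified context."""
    sources = dict(zip(("state", "action", "text", "image"),
                       context.split((2, 16, 77, 256), dim=1)))
    out = torch.zeros_like(q)
    for name, tokens in sources.items():
        to_k, to_v = proj[name]                      # dedicated K/V per source
        k, v = to_k(tokens), to_v(tokens)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        out = out + scales[name] * (attn @ v)        # weighted summation
    return out
```

Because the per-source K projections (and likewise V) consume disjoint slices of the same context, they can in principle be fused into a single matmul per role; the "✓ KV fused: 66 attention layers" line in the run logs above plausibly refers to such a fusion.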
The third level is *temporal self-attention* within the TemporalTransformer. The input $(B, C, T, H, W)$ is reshaped to $(B times H times W, C, T)$, so each spatial position becomes an independent batch element and the T=16 time steps form the token sequence. It supports relative position encoding via a `RelativePosition` module and optional causal masks; the current configuration uses bidirectional temporal attention.
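The reshape itself is a single rearrange; whether the per-position sequence is laid out $(C, T)$ or $(T, C)$ is an implementation detail, and the sketch below uses the $(T, C)$ attention convention:

```python
import torch
from einops import rearrange

x = torch.randn(2, 320, 16, 40, 64)                # (B, C, T, H, W)
tokens = rearrange(x, "b c t h w -> (b h w) t c")  # spatial positions -> batch
assert tokens.shape == (2 * 40 * 64, 16, 320)      # T=16 tokens per position
```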
The fourth level is *action-latent-image cross-attention* in the `ActionLatentImageCrossAttention` module. Action tokens of shape $(B, "action_dim", "act_proj_dim")$ act as Query and cross-attend to Video UNet features reshaped to $(B, T times H times W, C)$ as Key/Value. A `BasicTransformerBlock` (depth=1) performs action self-attention followed by cross-attention to the video features, with a zero-initialized `proj_out` and a residual connection. This mechanism is the key bridge that lets the action head access the visual world model's internal representations.

run_all_case.sh (new file, 114 lines)

@@ -0,0 +1,114 @@
#!/bin/bash
# Automatically run all cases for all scenarios.
# 5 scenarios x 4 cases each = 20 cases in total.
# Environment variables (offline mode)
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# All scenarios
SCENARIOS=(
"unitree_g1_pack_camera"
"unitree_z1_dual_arm_cleanup_pencils"
"unitree_z1_dual_arm_stackbox"
"unitree_z1_dual_arm_stackbox_v2"
"unitree_z1_stackbox"
)
# Case list
CASES=(1 2 3 4)
# Record start time
START_TIME=$(date +%s)
LOG_FILE="run_all_cases_$(date +%Y%m%d_%H%M%S).log"
echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE}开始执行所有场景的case${NC}"
echo -e "${BLUE}总共: ${#SCENARIOS[@]} 个场景 x ${#CASES[@]} 个case = $((${#SCENARIOS[@]} * ${#CASES[@]})) 个任务${NC}"
echo -e "${BLUE}日志文件: ${LOG_FILE}${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""
# Initialize counters
TOTAL_CASES=$((${#SCENARIOS[@]} * ${#CASES[@]}))
CURRENT_CASE=0
SUCCESS_COUNT=0
FAIL_COUNT=0
# Track failed cases
declare -a FAILED_CASES
# Iterate over all scenarios
for scenario in "${SCENARIOS[@]}"; do
echo -e "${YELLOW}>>> 场景: ${scenario}${NC}"
# 遍历所有case
for case_num in "${CASES[@]}"; do
CURRENT_CASE=$((CURRENT_CASE + 1))
case_dir="${scenario}/case${case_num}"
script_path="${case_dir}/run_world_model_interaction.sh"
echo -e "${BLUE}[${CURRENT_CASE}/${TOTAL_CASES}] 执行: ${case_dir}${NC}"
# 检查脚本是否存在
if [ ! -f "${script_path}" ]; then
echo -e "${RED}错误: 脚本不存在 ${script_path}${NC}"
FAIL_COUNT=$((FAIL_COUNT + 1))
FAILED_CASES+=("${case_dir} (脚本不存在)")
continue
fi
# Run the script
echo "Start time: $(date '+%Y-%m-%d %H:%M:%S')"
if bash "${script_path}" >> "${LOG_FILE}" 2>&1; then
echo -e "${GREEN}✓ 成功: ${case_dir}${NC}"
SUCCESS_COUNT=$((SUCCESS_COUNT + 1))
else
echo -e "${RED}✗ 失败: ${case_dir}${NC}"
FAIL_COUNT=$((FAIL_COUNT + 1))
FAILED_CASES+=("${case_dir}")
fi
echo "结束时间: $(date '+%Y-%m-%d %H:%M:%S')"
echo ""
done
echo ""
done
# Compute total elapsed time
END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))
HOURS=$((DURATION / 3600))
MINUTES=$(((DURATION % 3600) / 60))
SECS=$((DURATION % 60))  # avoid bash's special SECONDS variable
# Print summary
echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE}执行完成!${NC}"
echo -e "${BLUE}========================================${NC}"
echo -e "总任务数: ${TOTAL_CASES}"
echo -e "${GREEN}成功: ${SUCCESS_COUNT}${NC}"
echo -e "${RED}失败: ${FAIL_COUNT}${NC}"
echo -e "总耗时: ${HOURS}小时 ${MINUTES}分钟 ${SECONDS}"
echo -e "详细日志: ${LOG_FILE}"
echo ""
# List failed cases, if any
if [ ${FAIL_COUNT} -gt 0 ]; then
echo -e "${RED}失败的case列表:${NC}"
for failed_case in "${FAILED_CASES[@]}"; do
echo -e "${RED} - ${failed_case}${NC}"
done
echo ""
fi
echo -e "${BLUE}========================================${NC}"

File diff suppressed because it is too large.

@@ -0,0 +1,37 @@
2026-02-11 17:34:29.188470: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-02-11 17:34:29.238296: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2026-02-11 17:34:29.238342: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-02-11 17:34:29.239649: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2026-02-11 17:34:29.247152: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-02-11 17:34:30.172640: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Global seed set to 123
>>> Loading prepared model from ckpts/unifolm_wma_dual.ckpt.prepared.pt ...
>>> Prepared model loaded.
>>> Diffusion backbone (model.model) converted to FP16.
>>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.
>>> Encoders (cond_stage_model, embedder) converted to FP16.
INFO:root:***** Configing Data *****
>>> unitree_z1_stackbox: 1 data samples loaded.
>>> unitree_z1_stackbox: data stats loaded.
>>> unitree_z1_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox: data stats loaded.
>>> unitree_z1_dual_arm_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox_v2: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox_v2: data stats loaded.
>>> unitree_z1_dual_arm_stackbox_v2: normalizer initiated.
>>> unitree_z1_dual_arm_cleanup_pencils: 1 data samples loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: data stats loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: normalizer initiated.
>>> unitree_g1_pack_camera: 1 data samples loaded.
>>> unitree_g1_pack_camera: data stats loaded.
>>> unitree_g1_pack_camera: normalizer initiated.
>>> Dataset is successfully loaded ...
✓ KV fused: 66 attention layers
>>> Generate 16 frames under each generation ...
DEBUG:h5py._conv:Creating converter from 3 to 5
DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'pHYs' 41 9
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 62 4096

File diff suppressed because it is too large.

File diff suppressed because it is too large.

run_all_psnr.sh (new executable file, 61 lines)

@@ -0,0 +1,61 @@
#!/bin/bash
set -e
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
cd "$SCRIPT_DIR"
SCENARIOS=(
unitree_g1_pack_camera
unitree_z1_dual_arm_cleanup_pencils
unitree_z1_dual_arm_stackbox
unitree_z1_dual_arm_stackbox_v2
unitree_z1_stackbox
)
CASES=(case1 case2 case3 case4)
total=0
success=0
fail=0
for scenario in "${SCENARIOS[@]}"; do
for case in "${CASES[@]}"; do
case_dir="${scenario}/${case}"
gt_video="${case_dir}/${scenario}_${case}.mp4"
pred_video=$(ls "${case_dir}"/output/inference/*_full_fs*.mp4 2>/dev/null | head -1)
output_file="${case_dir}/psnr_result.json"
total=$((total + 1))
echo "=========================================="
echo "[${total}/20] ${case_dir}"
if [ ! -f "$gt_video" ]; then
echo " SKIP: GT video not found: $gt_video"
fail=$((fail + 1))
continue
fi
if [ -z "$pred_video" ]; then
echo " SKIP: pred video not found in ${case_dir}/output/inference/"
fail=$((fail + 1))
continue
fi
echo " GT: $gt_video"
echo " Pred: $pred_video"
echo " Out: $output_file"
if python3 psnr_score_for_challenge.py \
--gt_video "$gt_video" \
--pred_video "$pred_video" \
--output_file "$output_file"; then
success=$((success + 1))
echo " DONE"
else
fail=$((fail + 1))
echo " FAILED"
fi
done
done
echo "=========================================="
echo "Finished: ${success} success, ${fail} fail, ${total} total"


@@ -10,6 +10,7 @@ import einops
 import warnings
 import imageio
 import atexit
+import multiprocessing as mp
 from concurrent.futures import ThreadPoolExecutor
 from pytorch_lightning import seed_everything
@@ -231,6 +232,32 @@ def log_to_tensorboard_async(writer, data: Tensor, tag: str, fps: int = 10) -> N
     _io_futures.append(fut)

+
+def _video_tensor_to_frames(video: Tensor) -> np.ndarray:
+    video = torch.clamp(video.float(), -1., 1.)
+    n = video.shape[0]
+    video = video.permute(2, 0, 1, 3, 4)
+    frame_grids = [
+        torchvision.utils.make_grid(f, nrow=int(n), padding=0) for f in video
+    ]
+    grid = torch.stack(frame_grids, dim=0)
+    grid = ((grid + 1.0) / 2.0 * 255).to(torch.uint8).permute(0, 2, 3, 1)
+    return grid.numpy()[:, :, :, ::-1]
+
+
+def _video_writer_process(q: mp.Queue, filename: str, fps: int):
+    frames = []
+    while True:
+        item = q.get()
+        if item is None:
+            break
+        frames.append(_video_tensor_to_frames(item))
+    if frames:
+        grid = np.concatenate(frames, axis=0)
+        grid = torch.from_numpy(grid[:, :, :, ::-1].copy())  # BGR → RGB
+        torchvision.io.write_video(filename, grid, fps=fps,
+                                   video_codec='h264', options={'crf': '10'})
+
+
 def get_init_frame_path(data_dir: str, sample: dict) -> str:
     """Construct the init_frame path from directory and sample metadata.
@@ -450,6 +477,7 @@ def image_guided_synthesis_sim_mode(
     img = observation['observation.images.top'].permute(0, 2, 1, 3, 4)
     cond_img = rearrange(img, 'b o c h w -> (b o) c h w')[-1:]
-    cond_img_emb = model.embedder(cond_img)
-    cond_img_emb = model.image_proj_model(cond_img_emb)
+    with torch.cuda.amp.autocast(dtype=torch.float16):
+        cond_img_emb = model.embedder(cond_img)
+        cond_img_emb = model.image_proj_model(cond_img_emb)
@@ -465,6 +493,7 @@ def image_guided_synthesis_sim_mode(
prompts = [""] * batch_size prompts = [""] * batch_size
cond_ins_emb = model.get_learned_conditioning(prompts) cond_ins_emb = model.get_learned_conditioning(prompts)
with torch.cuda.amp.autocast(dtype=torch.float16):
cond_state_emb = model.state_projector(observation['observation.state']) cond_state_emb = model.state_projector(observation['observation.state'])
cond_state_emb = cond_state_emb + model.agent_state_pos_emb cond_state_emb = cond_state_emb + model.agent_state_pos_emb
@@ -492,6 +521,7 @@ def image_guided_synthesis_sim_mode(
     cond_mask = None
     cond_z0 = None
     batch_variants = None
+    samples = None
     if ddim_sampler is not None:
         samples, actions, states, intermedia = ddim_sampler.sample(
             S=ddim_steps,
@@ -515,7 +545,7 @@ def image_guided_synthesis_sim_mode(
             batch_images = model.decode_first_stage(samples)
             batch_variants = batch_images
-    return batch_variants, actions, states
+    return batch_variants, actions, states, samples


 def run_inference(args: argparse.Namespace, gpu_num: int, gpu_no: int) -> None:
@@ -571,6 +601,22 @@ def run_inference(args: argparse.Namespace, gpu_num: int, gpu_no: int) -> None:
         torch.save(model, prepared_path)
         print(f">>> Prepared model saved ({os.path.getsize(prepared_path) / 1024**3:.1f} GB).")

+    # ---- FP16: convert diffusion backbone + conditioning modules ----
+    model.model.to(torch.float16)
+    model.model.diffusion_model.dtype = torch.float16
+    print(">>> Diffusion backbone (model.model) converted to FP16.")
+
+    # Projectors / MLP → FP16
+    model.image_proj_model.half()
+    model.state_projector.half()
+    model.action_projector.half()
+    print(">>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.")
+
+    # Text/image encoders → FP16
+    model.cond_stage_model.half()
+    model.embedder.half()
+    print(">>> Encoders (cond_stage_model, embedder) converted to FP16.")
+
     # Build normalizer (always needed, independent of model loading path)
     logging.info("***** Configing Data *****")
     data = instantiate_from_config(config.data)
@@ -629,8 +675,13 @@ def run_inference(args: argparse.Namespace, gpu_num: int, gpu_no: int) -> None:
         # For saving environmental changes in world-model
         sample_save_dir = f'{video_save_dir}/wm/{fs}'
         os.makedirs(sample_save_dir, exist_ok=True)
-        # For collecting interaction videos
-        wm_video = []
+        # Writer process for incremental video saving
+        sample_full_video_file = f"{video_save_dir}/../{sample['videoid']}_full_fs{fs}.mp4"
+        write_q = mp.Queue()
+        writer_proc = mp.Process(
+            target=_video_writer_process,
+            args=(write_q, sample_full_video_file, args.save_fps))
+        writer_proc.start()

         # Initialize observation queues
         cond_obs_queues = {
             "observation.images.top":
@@ -686,7 +737,7 @@ def run_inference(args: argparse.Namespace, gpu_num: int, gpu_no: int) -> None:
             # Use world-model in policy to generate action
             print(f'>>> Step {itr}: generating actions ...')
-            pred_videos_0, pred_actions, _ = image_guided_synthesis_sim_mode(
+            pred_videos_0, pred_actions, _, _ = image_guided_synthesis_sim_mode(
                 model,
                 sample['instruction'],
                 observation,
@@ -728,7 +779,7 @@ def run_inference(args: argparse.Namespace, gpu_num: int, gpu_no: int) -> None:
             # Interaction with the world-model
             print(f'>>> Step {itr}: interacting with world model ...')
-            pred_videos_1, _, pred_states = image_guided_synthesis_sim_mode(
+            pred_videos_1, _, pred_states, wm_samples = image_guided_synthesis_sim_mode(
                 model,
                 "",
                 observation,
@@ -741,12 +792,16 @@ def run_inference(args: argparse.Namespace, gpu_num: int, gpu_no: int) -> None:
                 fs=model_input_fs,
                 text_input=False,
                 timestep_spacing=args.timestep_spacing,
-                guidance_rescale=args.guidance_rescale)
+                guidance_rescale=args.guidance_rescale,
+                decode_video=False)
+
+            # Decode only the last frame for CLIP embedding in next iteration
+            last_frame_pixel = model.decode_first_stage(wm_samples[:, :, -1:, :, :])

             for idx in range(args.exe_steps):
                 observation = {
                     'observation.images.top':
-                        pred_videos_1[0][:, idx:idx + 1].permute(1, 0, 2, 3),
+                        last_frame_pixel[0, :, 0:1].permute(1, 0, 2, 3),
                     'observation.state':
                         torch.zeros_like(pred_states[0][idx:idx + 1]) if
                         args.zero_pred_state else pred_states[0][idx:idx + 1],
@@ -764,37 +819,16 @@ def run_inference(args: argparse.Namespace, gpu_num: int, gpu_no: int) -> None:
                     pred_videos_0,
                     sample_tag,
                     fps=args.save_fps)
-                # Save videos environment changes via world-model interaction
-                sample_tag = f"{args.dataset}-vid{sample['videoid']}-wd-fs-{fs}/itr-{itr}"
-                log_to_tensorboard_async(writer,
-                                         pred_videos_1,
-                                         sample_tag,
-                                         fps=args.save_fps)
-                # Save the imagen videos for decision-making
-                if pred_videos_0 is not None:
-                    sample_video_file = f'{video_save_dir}/dm/{fs}/itr-{itr}.mp4'
-                    save_results_async(pred_videos_0,
-                                       sample_video_file,
-                                       fps=args.save_fps)
-                # Save videos environment changes via world-model interaction
-                sample_video_file = f'{video_save_dir}/wm/{fs}/itr-{itr}.mp4'
-                save_results_async(pred_videos_1,
-                                   sample_video_file,
-                                   fps=args.save_fps)
             print('>' * 24)

-            # Collect the result of world-model interactions
-            wm_video.append(pred_videos_1[:, :, :args.exe_steps].cpu())
+            # Decode segment and send to writer process
+            seg_video = model.decode_first_stage(
+                wm_samples[:, :, :args.exe_steps]).detach().cpu()
+            write_q.put(seg_video)

-        full_video = torch.cat(wm_video, dim=2)
-        sample_tag = f"{args.dataset}-vid{sample['videoid']}-wd-fs-{fs}/full"
-        log_to_tensorboard_async(writer,
-                                 full_video,
-                                 sample_tag,
-                                 fps=args.save_fps)
-        sample_full_video_file = f"{video_save_dir}/../{sample['videoid']}_full_fs{fs}.mp4"
-        save_results_async(full_video, sample_full_video_file, fps=args.save_fps)
+        # Stop writer process
+        write_q.put(None)
+        writer_proc.join()

         # Wait for all async I/O to complete
         _flush_io()
@@ -915,7 +949,7 @@ def get_parser():
     parser.add_argument(
         "--fast_policy_no_decode",
         action='store_true',
-        default=False,
+        default=True,
         help="Speed mode: policy pass only predicts actions, skip policy video decode/log/save.")
     parser.add_argument("--save_fps",
                         type=int,


@@ -988,7 +988,7 @@ class LatentDiffusion(DDPM):
     def instantiate_cond_stage(self, config: OmegaConf) -> None:
         """
-        Build the conditioning stage model.
+        Build the conditioning stage model. Frozen models are converted to FP16.

         Args:
             config: OmegaConf config describing the conditioning model to instantiate.
@@ -1000,6 +1000,7 @@ class LatentDiffusion(DDPM):
             self.cond_stage_model.train = disabled_train
             for param in self.cond_stage_model.parameters():
                 param.requires_grad = False
+            self.cond_stage_model.half()
         else:
             model = instantiate_from_config(config)
             self.cond_stage_model = model
@@ -1014,6 +1015,7 @@ class LatentDiffusion(DDPM):
         Returns:
             Conditioning embedding as a tensor (shape depends on cond model).
         """
-        if self.cond_stage_forward is None:
-            if hasattr(self.cond_stage_model, 'encode') and callable(
-                    self.cond_stage_model.encode):
+        with torch.cuda.amp.autocast(dtype=torch.float16):
+            if self.cond_stage_forward is None:
+                if hasattr(self.cond_stage_model, 'encode') and callable(
+                        self.cond_stage_model.encode):
@@ -1957,6 +1959,7 @@ class LatentVisualDiffusion(LatentDiffusion):
             self.image_proj_model.train = disabled_train
             for param in self.image_proj_model.parameters():
                 param.requires_grad = False
+            self.image_proj_model.half()

     def _init_embedder(self, config: OmegaConf, freeze: bool = True) -> None:
         """
@@ -1972,6 +1975,7 @@ class LatentVisualDiffusion(LatentDiffusion):
             self.embedder.train = disabled_train
             for param in self.embedder.parameters():
                 param.requires_grad = False
+            self.embedder.half()
     def init_normalizers(self, normalize_config: OmegaConf,
                          dataset_stats: Mapping[str, Any]) -> None:

@@ -2175,6 +2179,7 @@ class LatentVisualDiffusion(LatentDiffusion):
             (random_num < 3 * self.uncond_prob).float(), "n -> n 1 1 1")
         cond_img = input_mask * img
-        cond_img_emb = self.embedder(cond_img)
-        cond_img_emb = self.image_proj_model(cond_img_emb)
+        with torch.cuda.amp.autocast(dtype=torch.float16):
+            cond_img_emb = self.embedder(cond_img)
+            cond_img_emb = self.image_proj_model(cond_img_emb)
@@ -2191,6 +2196,7 @@ class LatentVisualDiffusion(LatentDiffusion):
                 repeat=z.shape[2])
             cond["c_concat"] = [img_cat_cond]
-        cond_action = self.action_projector(action)
-        cond_action_emb = self.agent_action_pos_emb + cond_action
+        with torch.cuda.amp.autocast(dtype=torch.float16):
+            cond_action = self.action_projector(action)
+            cond_action_emb = self.agent_action_pos_emb + cond_action
         # Get conditioning states
@@ -2457,7 +2463,17 @@ class DiffusionWrapper(pl.LightningModule):
         Returns:
             Output from the inner diffusion model (tensor or tuple, depending on the model).
         """
+        with torch.cuda.amp.autocast(dtype=torch.float16):
+            return self._forward_impl(x, x_action, x_state, t,
+                                      c_concat, c_crossattn, c_crossattn_action,
+                                      c_adm, s, mask, **kwargs)
+
+    def _forward_impl(
+        self,
+        x, x_action, x_state, t,
+        c_concat=None, c_crossattn=None, c_crossattn_action=None,
+        c_adm=None, s=None, mask=None, **kwargs,
+    ):
         if self.conditioning_key is None:
             out = self.diffusion_model(x, t)
         elif self.conditioning_key == 'concat':


@@ -0,0 +1,74 @@
2026-02-19 18:55:32.160020: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-02-19 18:55:32.207538: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2026-02-19 18:55:32.207581: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-02-19 18:55:32.208613: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2026-02-19 18:55:32.215249: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-02-19 18:55:33.121466: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Global seed set to 123
>>> Loading prepared model from ckpts/unifolm_wma_dual.ckpt.prepared.pt ...
>>> Prepared model loaded.
>>> Diffusion backbone (model.model) converted to FP16.
>>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.
>>> Encoders (cond_stage_model, embedder) converted to FP16.
INFO:root:***** Configing Data *****
>>> unitree_z1_stackbox: 1 data samples loaded.
>>> unitree_z1_stackbox: data stats loaded.
>>> unitree_z1_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox: data stats loaded.
>>> unitree_z1_dual_arm_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox_v2: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox_v2: data stats loaded.
>>> unitree_z1_dual_arm_stackbox_v2: normalizer initiated.
>>> unitree_z1_dual_arm_cleanup_pencils: 1 data samples loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: data stats loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: normalizer initiated.
>>> unitree_g1_pack_camera: 1 data samples loaded.
>>> unitree_g1_pack_camera: data stats loaded.
>>> unitree_g1_pack_camera: normalizer initiated.
>>> Dataset is successfully loaded ...
✓ KV fused: 66 attention layers
>>> Generate 16 frames under each generation ...
DEBUG:h5py._conv:Creating converter from 3 to 5
DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'pHYs' 41 9
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 62 4096
0%| | 0/11 [00:00<?, ?it/s]
9%|▉ | 1/11 [00:23<03:56, 23.63s/it]
18%|█▊ | 2/11 [00:46<03:29, 23.24s/it]
27%|██▋ | 3/11 [01:09<03:05, 23.25s/it]
36%|███▋ | 4/11 [01:33<02:43, 23.31s/it]
45%|████▌ | 5/11 [01:56<02:19, 23.32s/it]
55%|█████▍ | 6/11 [02:19<01:56, 23.33s/it]
64%|██████▎ | 7/11 [02:43<01:33, 23.32s/it]
73%|███████▎ | 8/11 [03:06<01:09, 23.32s/it]
82%|████████▏ | 9/11 [03:29<00:46, 23.31s/it]
91%|█████████ | 10/11 [03:53<00:23, 23.30s/it]
100%|██████████| 11/11 [04:16<00:00, 23.29s/it]
100%|██████████| 11/11 [04:16<00:00, 23.31s/it]
>>> Step 0: generating actions ...
>>> Step 0: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 1: generating actions ...
>>> Step 1: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 2: generating actions ...
>>> Step 2: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 3: generating actions ...
>>> Step 3: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 4: generating actions ...
>>> Step 4: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 5: generating actions ...
>>> Step 5: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 6: generating actions ...
>>> Step 6: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 7: generating actions ...
>>> Step 7: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>


@@ -0,0 +1,5 @@
{
"gt_video": "unitree_g1_pack_camera/case1/unitree_g1_pack_camera_case1.mp4",
"pred_video": "unitree_g1_pack_camera/case1/output/inference/0_full_fs6.mp4",
"psnr": 32.34126103448495
}


@@ -20,5 +20,6 @@ dataset="unitree_g1_pack_camera"
     --n_iter 11 \
     --timestep_spacing 'uniform_trailing' \
     --guidance_rescale 0.7 \
-    --perframe_ae
+    --perframe_ae \
+    --fast_policy_no_decode
 } 2>&1 | tee "${res_dir}/output.log"


@@ -0,0 +1,74 @@
2026-02-19 19:00:05.944067: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-02-19 19:00:05.991354: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2026-02-19 19:00:05.991392: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-02-19 19:00:05.992414: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2026-02-19 19:00:05.999050: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-02-19 19:00:06.916175: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Global seed set to 123
>>> Loading prepared model from ckpts/unifolm_wma_dual.ckpt.prepared.pt ...
>>> Prepared model loaded.
>>> Diffusion backbone (model.model) converted to FP16.
>>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.
>>> Encoders (cond_stage_model, embedder) converted to FP16.
INFO:root:***** Configing Data *****
>>> unitree_z1_stackbox: 1 data samples loaded.
>>> unitree_z1_stackbox: data stats loaded.
>>> unitree_z1_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox: data stats loaded.
>>> unitree_z1_dual_arm_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox_v2: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox_v2: data stats loaded.
>>> unitree_z1_dual_arm_stackbox_v2: normalizer initiated.
>>> unitree_z1_dual_arm_cleanup_pencils: 1 data samples loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: data stats loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: normalizer initiated.
>>> unitree_g1_pack_camera: 1 data samples loaded.
>>> unitree_g1_pack_camera: data stats loaded.
>>> unitree_g1_pack_camera: normalizer initiated.
>>> Dataset is successfully loaded ...
✓ KV fused: 66 attention layers
>>> Generate 16 frames under each generation ...
DEBUG:h5py._conv:Creating converter from 3 to 5
DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'pHYs' 41 9
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 62 4096
0%| | 0/11 [00:00<?, ?it/s]
9%|▉ | 1/11 [00:24<04:00, 24.04s/it]
18%|█▊ | 2/11 [00:47<03:31, 23.55s/it]
27%|██▋ | 3/11 [01:10<03:07, 23.43s/it]
36%|███▋ | 4/11 [01:33<02:43, 23.42s/it]
45%|████▌ | 5/11 [01:57<02:20, 23.39s/it]
55%|█████▍ | 6/11 [02:20<01:56, 23.37s/it]
64%|██████▎ | 7/11 [02:43<01:33, 23.34s/it]
73%|███████▎ | 8/11 [03:07<01:09, 23.33s/it]
82%|████████▏ | 9/11 [03:30<00:46, 23.33s/it]
91%|█████████ | 10/11 [03:53<00:23, 23.33s/it]
100%|██████████| 11/11 [04:17<00:00, 23.31s/it]
100%|██████████| 11/11 [04:17<00:00, 23.38s/it]
>>> Step 0: generating actions ...
>>> Step 0: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 1: generating actions ...
>>> Step 1: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 2: generating actions ...
>>> Step 2: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 3: generating actions ...
>>> Step 3: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 4: generating actions ...
>>> Step 4: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 5: generating actions ...
>>> Step 5: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 6: generating actions ...
>>> Step 6: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 7: generating actions ...
>>> Step 7: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>


@@ -0,0 +1,5 @@
{
"gt_video": "unitree_g1_pack_camera/case2/unitree_g1_pack_camera_case2.mp4",
"pred_video": "unitree_g1_pack_camera/case2/output/inference/50_full_fs6.mp4",
"psnr": 37.49178506869336
}


@@ -20,5 +20,6 @@ dataset="unitree_g1_pack_camera"
     --n_iter 11 \
     --timestep_spacing 'uniform_trailing' \
     --guidance_rescale 0.7 \
-    --perframe_ae
+    --perframe_ae \
+    --fast_policy_no_decode
 } 2>&1 | tee "${res_dir}/output.log"


@@ -0,0 +1,74 @@
2026-02-19 19:04:41.036634: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-02-19 19:04:41.084414: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2026-02-19 19:04:41.084452: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-02-19 19:04:41.085481: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2026-02-19 19:04:41.092287: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-02-19 19:04:42.000614: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Global seed set to 123
>>> Loading prepared model from ckpts/unifolm_wma_dual.ckpt.prepared.pt ...
>>> Prepared model loaded.
>>> Diffusion backbone (model.model) converted to FP16.
>>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.
>>> Encoders (cond_stage_model, embedder) converted to FP16.
INFO:root:***** Configing Data *****
>>> unitree_z1_stackbox: 1 data samples loaded.
>>> unitree_z1_stackbox: data stats loaded.
>>> unitree_z1_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox: data stats loaded.
>>> unitree_z1_dual_arm_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox_v2: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox_v2: data stats loaded.
>>> unitree_z1_dual_arm_stackbox_v2: normalizer initiated.
>>> unitree_z1_dual_arm_cleanup_pencils: 1 data samples loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: data stats loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: normalizer initiated.
>>> unitree_g1_pack_camera: 1 data samples loaded.
>>> unitree_g1_pack_camera: data stats loaded.
>>> unitree_g1_pack_camera: normalizer initiated.
>>> Dataset is successfully loaded ...
✓ KV fused: 66 attention layers
>>> Generate 16 frames under each generation ...
DEBUG:h5py._conv:Creating converter from 3 to 5
DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'pHYs' 41 9
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 62 4096
0%| | 0/11 [00:00<?, ?it/s]
9%|▉ | 1/11 [00:24<04:01, 24.19s/it]
18%|█▊ | 2/11 [00:47<03:32, 23.64s/it]
27%|██▋ | 3/11 [01:10<03:08, 23.50s/it]
36%|███▋ | 4/11 [01:34<02:44, 23.47s/it]
45%|████▌ | 5/11 [01:57<02:20, 23.45s/it]
55%|█████▍ | 6/11 [02:21<01:57, 23.44s/it]
64%|██████▎ | 7/11 [02:44<01:33, 23.41s/it]
73%|███████▎ | 8/11 [03:07<01:10, 23.38s/it]
82%|████████▏ | 9/11 [03:31<00:46, 23.36s/it]
91%|█████████ | 10/11 [03:54<00:23, 23.35s/it]
100%|██████████| 11/11 [04:17<00:00, 23.33s/it]
100%|██████████| 11/11 [04:17<00:00, 23.42s/it]
>>> Step 0: generating actions ...
>>> Step 0: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 1: generating actions ...
>>> Step 1: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 2: generating actions ...
>>> Step 2: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 3: generating actions ...
>>> Step 3: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 4: generating actions ...
>>> Step 4: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 5: generating actions ...
>>> Step 5: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 6: generating actions ...
>>> Step 6: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 7: generating actions ...
>>> Step 7: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>


@@ -0,0 +1,5 @@
{
"gt_video": "unitree_g1_pack_camera/case3/unitree_g1_pack_camera_case3.mp4",
"pred_video": "unitree_g1_pack_camera/case3/output/inference/100_full_fs6.mp4",
"psnr": 29.88155122131729
}


@@ -20,5 +20,6 @@ dataset="unitree_g1_pack_camera"
     --n_iter 11 \
     --timestep_spacing 'uniform_trailing' \
     --guidance_rescale 0.7 \
-    --perframe_ae
+    --perframe_ae \
+    --fast_policy_no_decode
 } 2>&1 | tee "${res_dir}/output.log"


@@ -0,0 +1,74 @@
2026-02-19 19:09:16.122268: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-02-19 19:09:16.170290: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2026-02-19 19:09:16.170331: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-02-19 19:09:16.171349: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2026-02-19 19:09:16.177993: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-02-19 19:09:17.087425: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Global seed set to 123
>>> Loading prepared model from ckpts/unifolm_wma_dual.ckpt.prepared.pt ...
>>> Prepared model loaded.
>>> Diffusion backbone (model.model) converted to FP16.
>>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.
>>> Encoders (cond_stage_model, embedder) converted to FP16.
INFO:root:***** Configing Data *****
>>> unitree_z1_stackbox: 1 data samples loaded.
>>> unitree_z1_stackbox: data stats loaded.
>>> unitree_z1_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox: data stats loaded.
>>> unitree_z1_dual_arm_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox_v2: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox_v2: data stats loaded.
>>> unitree_z1_dual_arm_stackbox_v2: normalizer initiated.
>>> unitree_z1_dual_arm_cleanup_pencils: 1 data samples loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: data stats loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: normalizer initiated.
>>> unitree_g1_pack_camera: 1 data samples loaded.
>>> unitree_g1_pack_camera: data stats loaded.
>>> unitree_g1_pack_camera: normalizer initiated.
>>> Dataset is successfully loaded ...
✓ KV fused: 66 attention layers
>>> Generate 16 frames under each generation ...
DEBUG:h5py._conv:Creating converter from 3 to 5
DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'pHYs' 41 9
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 62 4096
0%| | 0/11 [00:00<?, ?it/s]
9%|▉ | 1/11 [00:24<04:01, 24.17s/it]
18%|█▊ | 2/11 [00:47<03:32, 23.62s/it]
27%|██▋ | 3/11 [01:10<03:07, 23.49s/it]
36%|███▋ | 4/11 [01:34<02:44, 23.46s/it]
45%|████▌ | 5/11 [01:57<02:20, 23.42s/it]
55%|█████▍ | 6/11 [02:20<01:56, 23.40s/it]
64%|██████▎ | 7/11 [02:44<01:33, 23.38s/it]
73%|███████▎ | 8/11 [03:07<01:10, 23.36s/it]
82%|████████▏ | 9/11 [03:30<00:46, 23.35s/it]
91%|█████████ | 10/11 [03:54<00:23, 23.34s/it]
100%|██████████| 11/11 [04:17<00:00, 23.33s/it]
100%|██████████| 11/11 [04:17<00:00, 23.40s/it]
>>> Step 0: generating actions ...
>>> Step 0: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 1: generating actions ...
>>> Step 1: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 2: generating actions ...
>>> Step 2: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 3: generating actions ...
>>> Step 3: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 4: generating actions ...
>>> Step 4: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 5: generating actions ...
>>> Step 5: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 6: generating actions ...
>>> Step 6: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 7: generating actions ...
>>> Step 7: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>


@@ -0,0 +1,5 @@
{
"gt_video": "unitree_g1_pack_camera/case4/unitree_g1_pack_camera_case4.mp4",
"pred_video": "unitree_g1_pack_camera/case4/output/inference/200_full_fs6.mp4",
"psnr": 35.62512454155058
}


@@ -20,5 +20,6 @@ dataset="unitree_g1_pack_camera"
     --n_iter 11 \
     --timestep_spacing 'uniform_trailing' \
     --guidance_rescale 0.7 \
-    --perframe_ae
+    --perframe_ae \
+    --fast_policy_no_decode
 } 2>&1 | tee "${res_dir}/output.log"


@@ -1,24 +1,16 @@
-2026-02-10 15:38:28.973314: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
-2026-02-10 15:38:29.023024: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
-2026-02-10 15:38:29.023070: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
-2026-02-10 15:38:29.024393: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
-2026-02-10 15:38:29.031901: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
+2026-02-19 19:13:51.554194: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
+2026-02-19 19:13:51.601580: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
+2026-02-19 19:13:51.601622: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
+2026-02-19 19:13:51.602646: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
+2026-02-19 19:13:51.609297: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
 To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
-2026-02-10 15:38:29.955454: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
+2026-02-19 19:13:52.517676: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
 Global seed set to 123
-INFO:mainlogger:LatentVisualDiffusion: Running in v-prediction mode
-INFO:unifolm_wma.models.diffusion_head.conditional_unet1d:number of parameters: 5.010531e+08
-INFO:unifolm_wma.models.diffusion_head.conditional_unet1d:number of parameters: 5.010531e+08
-AE working on z of shape (1, 4, 32, 32) = 4096 dimensions.
-INFO:root:Loaded ViT-H-14 model config.
-DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): hf-mirror.com:443
-DEBUG:urllib3.connectionpool:https://hf-mirror.com:443 "HEAD /laion/CLIP-ViT-H-14-laion2B-s32B-b79K/resolve/main/open_clip_pytorch_model.bin HTTP/1.1" 302 0
-INFO:root:Loading pretrained ViT-H-14 weights (laion2b_s32b_b79k).
-INFO:root:Loaded ViT-H-14 model config.
-DEBUG:urllib3.connectionpool:https://hf-mirror.com:443 "HEAD /laion/CLIP-ViT-H-14-laion2B-s32B-b79K/resolve/main/open_clip_pytorch_model.bin HTTP/1.1" 302 0
-INFO:root:Loading pretrained ViT-H-14 weights (laion2b_s32b_b79k).
->>> model checkpoint loaded.
->>> Load pre-trained model ...
+>>> Loading prepared model from ckpts/unifolm_wma_dual.ckpt.prepared.pt ...
+>>> Prepared model loaded.
+>>> Diffusion backbone (model.model) converted to FP16.
+>>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.
+>>> Encoders (cond_stage_model, embedder) converted to FP16.
 INFO:root:***** Configing Data *****
 >>> unitree_z1_stackbox: 1 data samples loaded.
 >>> unitree_z1_stackbox: data stats loaded.
@@ -36,63 +28,15 @@ INFO:root:***** Configing Data *****
 >>> unitree_g1_pack_camera: data stats loaded.
 >>> unitree_g1_pack_camera: normalizer initiated.
 >>> Dataset is successfully loaded ...
+✓ KV fused: 66 attention layers
 >>> Generate 16 frames under each generation ...
 DEBUG:h5py._conv:Creating converter from 3 to 5
 DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
 DEBUG:PIL.PngImagePlugin:STREAM b'pHYs' 41 9
 DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 62 4096
-0%| | 0/8 [00:00<?, ?it/s]
 12%|█▎ | 1/8 [00:24<02:49, 24.16s/it]
->>> Step 0: interacting with world model ...
-DEBUG:PIL.Image:Importing BlpImagePlugin
-DEBUG:PIL.Image:Importing BmpImagePlugin
-DEBUG:PIL.Image:Importing BufrStubImagePlugin
-DEBUG:PIL.Image:Importing CurImagePlugin
-DEBUG:PIL.Image:Importing DcxImagePlugin
-DEBUG:PIL.Image:Importing DdsImagePlugin
-DEBUG:PIL.Image:Importing EpsImagePlugin
-DEBUG:PIL.Image:Importing FitsImagePlugin
-DEBUG:PIL.Image:Importing FitsStubImagePlugin
-DEBUG:PIL.Image:Importing FliImagePlugin
-DEBUG:PIL.Image:Importing FpxImagePlugin
-DEBUG:PIL.Image:Image: failed to import FpxImagePlugin: No module named 'olefile'
-DEBUG:PIL.Image:Importing FtexImagePlugin
-DEBUG:PIL.Image:Importing GbrImagePlugin
-DEBUG:PIL.Image:Importing GifImagePlugin
-DEBUG:PIL.Image:Importing GribStubImagePlugin
-DEBUG:PIL.Image:Importing Hdf5StubImagePlugin
-DEBUG:PIL.Image:Importing IcnsImagePlugin
-DEBUG:PIL.Image:Importing IcoImagePlugin
-DEBUG:PIL.Image:Importing ImImagePlugin
-DEBUG:PIL.Image:Importing ImtImagePlugin
-DEBUG:PIL.Image:Importing IptcImagePlugin
-DEBUG:PIL.Image:Importing JpegImagePlugin
-DEBUG:PIL.Image:Importing Jpeg2KImagePlugin
-DEBUG:PIL.Image:Importing McIdasImagePlugin
-DEBUG:PIL.Image:Importing MicImagePlugin
-DEBUG:PIL.Image:Image: failed to import MicImagePlugin: No module named 'olefile'
-DEBUG:PIL.Image:Importing MpegImagePlugin
-DEBUG:PIL.Image:Importing MpoImagePlugin
-DEBUG:PIL.Image:Importing MspImagePlugin
-DEBUG:PIL.Image:Importing PalmImagePlugin
-DEBUG:PIL.Image:Importing PcdImagePlugin
-DEBUG:PIL.Image:Importing PcxImagePlugin
-DEBUG:PIL.Image:Importing PdfImagePlugin
-DEBUG:PIL.Image:Importing PixarImagePlugin
-DEBUG:PIL.Image:Importing PngImagePlugin
-DEBUG:PIL.Image:Importing PpmImagePlugin
-DEBUG:PIL.Image:Importing PsdImagePlugin
-DEBUG:PIL.Image:Importing QoiImagePlugin
-DEBUG:PIL.Image:Importing SgiImagePlugin
-DEBUG:PIL.Image:Importing SpiderImagePlugin
-DEBUG:PIL.Image:Importing SunImagePlugin
-DEBUG:PIL.Image:Importing TgaImagePlugin
-DEBUG:PIL.Image:Importing TiffImagePlugin
-DEBUG:PIL.Image:Importing WebPImagePlugin
-DEBUG:PIL.Image:Importing WmfImagePlugin
-DEBUG:PIL.Image:Importing XbmImagePlugin
-DEBUG:PIL.Image:Importing XpmImagePlugin
-DEBUG:PIL.Image:Importing XVThumbImagePlugin
 25%|██▌ | 2/8 [00:47<02:22, 23.67s/it]
 38%|███▊ | 3/8 [01:10<01:57, 23.55s/it]
 50%|█████ | 4/8 [01:34<01:34, 23.51s/it]
@@ -116,6 +60,6 @@ DEBUG:PIL.Image:Importing XVThumbImagePlugin
 >>> Step 4: generating actions ...
 >>> Step 4: interacting with world model ...
 >>>>>>>>>>>>>>>>>>>>>>>>
 >>> Step 5: generating actions ...
 >>> Step 5: interacting with world model ...
 >>>>>>>>>>>>>>>>>>>>>>>>

View File

@@ -1,5 +1,5 @@
 {
 "gt_video": "unitree_z1_dual_arm_cleanup_pencils/case1/unitree_z1_dual_arm_cleanup_pencils_case1.mp4",
 "pred_video": "unitree_z1_dual_arm_cleanup_pencils/case1/output/inference/0_full_fs4.mp4",
-"psnr": 47.911564449209735
+"psnr": 38.269577028444445
 }

View File

@@ -20,5 +20,6 @@ dataset="unitree_z1_dual_arm_cleanup_pencils"
     --n_iter 8 \
     --timestep_spacing 'uniform_trailing' \
     --guidance_rescale 0.7 \
-    --perframe_ae
+    --perframe_ae \
+    --fast_policy_no_decode
 } 2>&1 | tee "${res_dir}/output.log"

View File

@@ -0,0 +1,65 @@
2026-02-19 19:17:16.282875: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-02-19 19:17:16.330519: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2026-02-19 19:17:16.330561: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-02-19 19:17:16.331631: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2026-02-19 19:17:16.338413: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-02-19 19:17:17.250653: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Global seed set to 123
>>> Loading prepared model from ckpts/unifolm_wma_dual.ckpt.prepared.pt ...
>>> Prepared model loaded.
>>> Diffusion backbone (model.model) converted to FP16.
>>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.
>>> Encoders (cond_stage_model, embedder) converted to FP16.
INFO:root:***** Configing Data *****
>>> unitree_z1_stackbox: 1 data samples loaded.
>>> unitree_z1_stackbox: data stats loaded.
>>> unitree_z1_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox: data stats loaded.
>>> unitree_z1_dual_arm_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox_v2: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox_v2: data stats loaded.
>>> unitree_z1_dual_arm_stackbox_v2: normalizer initiated.
>>> unitree_z1_dual_arm_cleanup_pencils: 1 data samples loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: data stats loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: normalizer initiated.
>>> unitree_g1_pack_camera: 1 data samples loaded.
>>> unitree_g1_pack_camera: data stats loaded.
>>> unitree_g1_pack_camera: normalizer initiated.
>>> Dataset is successfully loaded ...
✓ KV fused: 66 attention layers
>>> Generate 16 frames under each generation ...
DEBUG:h5py._conv:Creating converter from 3 to 5
DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'pHYs' 41 9
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 62 4096
0%| | 0/8 [00:00<?, ?it/s]
12%|█▎ | 1/8 [00:24<02:48, 24.06s/it]
25%|██▌ | 2/8 [00:47<02:21, 23.61s/it]
38%|███▊ | 3/8 [01:10<01:57, 23.47s/it]
50%|█████ | 4/8 [01:34<01:33, 23.44s/it]
62%|██████▎ | 5/8 [01:57<01:10, 23.41s/it]
75%|███████▌ | 6/8 [02:20<00:46, 23.37s/it]
88%|████████▊ | 7/8 [02:44<00:23, 23.35s/it]
100%|██████████| 8/8 [03:07<00:00, 23.34s/it]
100%|██████████| 8/8 [03:07<00:00, 23.42s/it]
>>> Step 0: generating actions ...
>>> Step 0: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 1: generating actions ...
>>> Step 1: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 2: generating actions ...
>>> Step 2: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 3: generating actions ...
>>> Step 3: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 4: generating actions ...
>>> Step 4: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 5: generating actions ...
>>> Step 5: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
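
The three `converted to FP16` lines that now appear at startup indicate selective half-precision casting: a few named submodules are cast to FP16 while the rest of the model stays in FP32. A minimal PyTorch sketch (the attribute names mirror the log messages; that the real code casts exactly these attributes is an assumption):

```python
import torch.nn as nn

def selective_fp16(model: nn.Module) -> nn.Module:
    """Cast only the named submodules to half precision; leave the rest FP32."""
    groups = {
        "Diffusion backbone": ["model"],
        "Projectors": ["image_proj_model", "state_projector", "action_projector"],
        "Encoders": ["cond_stage_model", "embedder"],
    }
    for label, names in groups.items():
        for name in names:
            module = getattr(model, name, None)
            if module is not None:
                module.half()  # in-place cast of parameters and buffers
        print(f">>> {label} converted to FP16.")
    return model
```

Call sites then need matching dtype casts on the inputs to these submodules, and any precision-sensitive pieces can be kept in (or cast back to) FP32.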

View File

@@ -0,0 +1,5 @@
{
"gt_video": "unitree_z1_dual_arm_cleanup_pencils/case2/unitree_z1_dual_arm_cleanup_pencils_case2.mp4",
"pred_video": "unitree_z1_dual_arm_cleanup_pencils/case2/output/inference/50_full_fs4.mp4",
"psnr": 44.50028075962896
}

View File

@@ -20,5 +20,6 @@ dataset="unitree_z1_dual_arm_cleanup_pencils"
     --n_iter 8 \
     --timestep_spacing 'uniform_trailing' \
     --guidance_rescale 0.7 \
-    --perframe_ae
+    --perframe_ae \
+    --fast_policy_no_decode
 } 2>&1 | tee "${res_dir}/output.log"

View File

@@ -0,0 +1,65 @@
2026-02-19 19:20:40.444703: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-02-19 19:20:40.492237: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2026-02-19 19:20:40.492278: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-02-19 19:20:40.493360: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2026-02-19 19:20:40.500130: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-02-19 19:20:41.414718: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Global seed set to 123
>>> Loading prepared model from ckpts/unifolm_wma_dual.ckpt.prepared.pt ...
>>> Prepared model loaded.
>>> Diffusion backbone (model.model) converted to FP16.
>>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.
>>> Encoders (cond_stage_model, embedder) converted to FP16.
INFO:root:***** Configing Data *****
>>> unitree_z1_stackbox: 1 data samples loaded.
>>> unitree_z1_stackbox: data stats loaded.
>>> unitree_z1_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox: data stats loaded.
>>> unitree_z1_dual_arm_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox_v2: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox_v2: data stats loaded.
>>> unitree_z1_dual_arm_stackbox_v2: normalizer initiated.
>>> unitree_z1_dual_arm_cleanup_pencils: 1 data samples loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: data stats loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: normalizer initiated.
>>> unitree_g1_pack_camera: 1 data samples loaded.
>>> unitree_g1_pack_camera: data stats loaded.
>>> unitree_g1_pack_camera: normalizer initiated.
>>> Dataset is successfully loaded ...
✓ KV fused: 66 attention layers
>>> Generate 16 frames under each generation ...
DEBUG:h5py._conv:Creating converter from 3 to 5
DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'pHYs' 41 9
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 62 4096
0%| | 0/8 [00:00<?, ?it/s]
12%|█▎ | 1/8 [00:24<02:48, 24.06s/it]
25%|██▌ | 2/8 [00:47<02:21, 23.58s/it]
38%|███▊ | 3/8 [01:10<01:57, 23.45s/it]
50%|█████ | 4/8 [01:33<01:33, 23.41s/it]
62%|██████▎ | 5/8 [01:57<01:10, 23.38s/it]
75%|███████▌ | 6/8 [02:20<00:46, 23.37s/it]
88%|████████▊ | 7/8 [02:43<00:23, 23.36s/it]
100%|██████████| 8/8 [03:07<00:00, 23.34s/it]
100%|██████████| 8/8 [03:07<00:00, 23.41s/it]
>>> Step 0: generating actions ...
>>> Step 0: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 1: generating actions ...
>>> Step 1: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 2: generating actions ...
>>> Step 2: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 3: generating actions ...
>>> Step 3: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 4: generating actions ...
>>> Step 4: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 5: generating actions ...
>>> Step 5: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
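
`✓ KV fused: 66 attention layers` suggests that, in each attention layer, the separate key and value projections over the same context tensor were merged into one matrix multiply, halving the number of GEMM launches per layer. A sketch under the assumption of bias-free `to_k`/`to_v` linear layers of equal width (the attribute names are assumptions):

```python
import torch
import torch.nn as nn

def fuse_kv(attn: nn.Module) -> None:
    """Merge attn.to_k and attn.to_v into a single fused projection."""
    to_k, to_v = attn.to_k, attn.to_v
    fused = nn.Linear(
        to_k.in_features,
        to_k.out_features + to_v.out_features,
        bias=False,
        device=to_k.weight.device,
        dtype=to_k.weight.dtype,
    )
    with torch.no_grad():
        # Concatenating along dim 0 stacks output features: the first rows
        # reproduce K, the remaining rows reproduce V.
        fused.weight.copy_(torch.cat([to_k.weight, to_v.weight], dim=0))
    attn.to_kv = fused

def kv_projection(attn: nn.Module, context: torch.Tensor):
    # One GEMM instead of two, then split the result back into K and V
    # (chunk(2) assumes to_k and to_v have the same output width).
    k, v = attn.to_kv(context).chunk(2, dim=-1)
    return k, v
```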

View File

@@ -0,0 +1,5 @@
{
"gt_video": "unitree_z1_dual_arm_cleanup_pencils/case3/unitree_z1_dual_arm_cleanup_pencils_case3.mp4",
"pred_video": "unitree_z1_dual_arm_cleanup_pencils/case3/output/inference/100_full_fs4.mp4",
"psnr": 32.29959078097713
}

View File

@@ -20,5 +20,6 @@ dataset="unitree_z1_dual_arm_cleanup_pencils"
     --n_iter 8 \
     --timestep_spacing 'uniform_trailing' \
     --guidance_rescale 0.7 \
-    --perframe_ae
+    --perframe_ae \
+    --fast_policy_no_decode
 } 2>&1 | tee "${res_dir}/output.log"

View File

@@ -0,0 +1,65 @@
2026-02-19 19:24:05.230366: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-02-19 19:24:05.278058: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2026-02-19 19:24:05.278100: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-02-19 19:24:05.279133: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2026-02-19 19:24:05.285789: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-02-19 19:24:06.199101: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Global seed set to 123
>>> Loading prepared model from ckpts/unifolm_wma_dual.ckpt.prepared.pt ...
>>> Prepared model loaded.
>>> Diffusion backbone (model.model) converted to FP16.
>>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.
>>> Encoders (cond_stage_model, embedder) converted to FP16.
INFO:root:***** Configing Data *****
>>> unitree_z1_stackbox: 1 data samples loaded.
>>> unitree_z1_stackbox: data stats loaded.
>>> unitree_z1_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox: data stats loaded.
>>> unitree_z1_dual_arm_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox_v2: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox_v2: data stats loaded.
>>> unitree_z1_dual_arm_stackbox_v2: normalizer initiated.
>>> unitree_z1_dual_arm_cleanup_pencils: 1 data samples loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: data stats loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: normalizer initiated.
>>> unitree_g1_pack_camera: 1 data samples loaded.
>>> unitree_g1_pack_camera: data stats loaded.
>>> unitree_g1_pack_camera: normalizer initiated.
>>> Dataset is successfully loaded ...
✓ KV fused: 66 attention layers
>>> Generate 16 frames under each generation ...
DEBUG:h5py._conv:Creating converter from 3 to 5
DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'pHYs' 41 9
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 62 4096
0%| | 0/8 [00:00<?, ?it/s]
12%|█▎ | 1/8 [00:24<02:48, 24.06s/it]
25%|██▌ | 2/8 [00:47<02:21, 23.56s/it]
38%|███▊ | 3/8 [01:10<01:57, 23.45s/it]
50%|█████ | 4/8 [01:33<01:33, 23.43s/it]
62%|██████▎ | 5/8 [01:57<01:10, 23.38s/it]
75%|███████▌ | 6/8 [02:20<00:46, 23.35s/it]
88%|████████▊ | 7/8 [02:43<00:23, 23.33s/it]
100%|██████████| 8/8 [03:07<00:00, 23.33s/it]
100%|██████████| 8/8 [03:07<00:00, 23.40s/it]
>>> Step 0: generating actions ...
>>> Step 0: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 1: generating actions ...
>>> Step 1: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 2: generating actions ...
>>> Step 2: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 3: generating actions ...
>>> Step 3: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 4: generating actions ...
>>> Step 4: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 5: generating actions ...
>>> Step 5: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
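
These runs also load a `*.ckpt.prepared.pt` file instead of the original checkpoint, which points at a cache-the-prepared-model pattern: do the expensive construction and conversion work once, serialize the result, and load that on every subsequent run. A minimal sketch (the caching scheme and `build_fn` are assumptions; only the file name comes from the logs):

```python
import os
import torch

PREPARED = "ckpts/unifolm_wma_dual.ckpt.prepared.pt"  # path from the logs

def load_or_prepare(build_fn, raw_ckpt: str = "ckpts/unifolm_wma_dual.ckpt"):
    """Load a cached, fully prepared model if present; otherwise build and cache it."""
    if os.path.exists(PREPARED):
        print(f">>> Loading prepared model from {PREPARED} ...")
        model = torch.load(PREPARED, map_location="cpu")
        print(">>> Prepared model loaded.")
        return model
    model = build_fn(raw_ckpt)   # expensive one-time construction/conversion
    torch.save(model, PREPARED)  # pickle the whole prepared module
    return model
```

This moves the per-run setup cost visible in the older logs (e.g., fetching and loading the ViT-H-14 weights) out of every evaluation invocation.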

View File

@@ -0,0 +1,5 @@
{
"gt_video": "unitree_z1_dual_arm_cleanup_pencils/case4/unitree_z1_dual_arm_cleanup_pencils_case4.mp4",
"pred_video": "unitree_z1_dual_arm_cleanup_pencils/case4/output/inference/200_full_fs4.mp4",
"psnr": 45.051241961122535
}

View File

@@ -20,5 +20,6 @@ dataset="unitree_z1_dual_arm_cleanup_pencils"
     --n_iter 8 \
     --timestep_spacing 'uniform_trailing' \
     --guidance_rescale 0.7 \
-    --perframe_ae
+    --perframe_ae \
+    --fast_policy_no_decode
 } 2>&1 | tee "${res_dir}/output.log"

View File

@@ -0,0 +1,62 @@
2026-02-19 19:27:29.317502: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-02-19 19:27:29.365030: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2026-02-19 19:27:29.365079: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-02-19 19:27:29.366111: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2026-02-19 19:27:29.372733: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-02-19 19:27:30.291220: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Global seed set to 123
>>> Loading prepared model from ckpts/unifolm_wma_dual.ckpt.prepared.pt ...
>>> Prepared model loaded.
>>> Diffusion backbone (model.model) converted to FP16.
>>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.
>>> Encoders (cond_stage_model, embedder) converted to FP16.
INFO:root:***** Configing Data *****
>>> unitree_z1_stackbox: 1 data samples loaded.
>>> unitree_z1_stackbox: data stats loaded.
>>> unitree_z1_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox: data stats loaded.
>>> unitree_z1_dual_arm_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox_v2: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox_v2: data stats loaded.
>>> unitree_z1_dual_arm_stackbox_v2: normalizer initiated.
>>> unitree_z1_dual_arm_cleanup_pencils: 1 data samples loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: data stats loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: normalizer initiated.
>>> unitree_g1_pack_camera: 1 data samples loaded.
>>> unitree_g1_pack_camera: data stats loaded.
>>> unitree_g1_pack_camera: normalizer initiated.
>>> Dataset is successfully loaded ...
✓ KV fused: 66 attention layers
>>> Generate 16 frames under each generation ...
DEBUG:h5py._conv:Creating converter from 3 to 5
DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'pHYs' 41 9
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 62 4096
0%| | 0/7 [00:00<?, ?it/s]
14%|█▍ | 1/7 [00:24<02:24, 24.09s/it]
29%|██▊ | 2/7 [00:47<01:57, 23.59s/it]
43%|████▎ | 3/7 [01:10<01:33, 23.46s/it]
57%|█████▋ | 4/7 [01:33<01:10, 23.42s/it]
71%|███████▏ | 5/7 [01:57<00:46, 23.39s/it]
86%|████████▌ | 6/7 [02:20<00:23, 23.37s/it]
100%|██████████| 7/7 [02:43<00:00, 23.35s/it]
100%|██████████| 7/7 [02:43<00:00, 23.42s/it]
>>> Step 0: generating actions ...
>>> Step 0: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 1: generating actions ...
>>> Step 1: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 2: generating actions ...
>>> Step 2: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 3: generating actions ...
>>> Step 3: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 4: generating actions ...
>>> Step 4: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 5: generating actions ...

View File

@@ -0,0 +1,5 @@
{
"gt_video": "unitree_z1_dual_arm_stackbox/case1/unitree_z1_dual_arm_stackbox_case1.mp4",
"pred_video": "unitree_z1_dual_arm_stackbox/case1/output/inference/5_full_fs4.mp4",
"psnr": 42.717688631296596
}

View File

@@ -20,5 +20,6 @@ dataset="unitree_z1_dual_arm_stackbox"
     --n_iter 7 \
     --timestep_spacing 'uniform_trailing' \
     --guidance_rescale 0.7 \
-    --perframe_ae
+    --perframe_ae \
+    --fast_policy_no_decode
 } 2>&1 | tee "${res_dir}/output.log"
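
Two of the unchanged sampler flags are worth a note: `--timestep_spacing 'uniform_trailing'` and `--guidance_rescale 0.7` correspond to the fixes proposed in "Common Diffusion Noise Schedules and Sample Steps Are Flawed" (Lin et al., 2023). Guidance rescaling shrinks the standard deviation of the classifier-free-guidance output back toward that of the conditional prediction; a sketch matching the paper's formula (tensor names are illustrative):

```python
import torch

def rescale_noise_cfg(noise_cfg: torch.Tensor,
                      noise_pred_text: torch.Tensor,
                      guidance_rescale: float = 0.7) -> torch.Tensor:
    """Rescale the CFG output toward the conditional prediction's std."""
    dims = list(range(1, noise_cfg.ndim))
    std_text = noise_pred_text.std(dim=dims, keepdim=True)
    std_cfg = noise_cfg.std(dim=dims, keepdim=True)
    rescaled = noise_cfg * (std_text / std_cfg)  # match the conditional std
    # Blend: 1.0 fully rescales, 0.0 keeps the plain CFG output.
    return guidance_rescale * rescaled + (1.0 - guidance_rescale) * noise_cfg
```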

View File

@@ -0,0 +1,62 @@
2026-02-19 19:30:30.058862: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-02-19 19:30:30.106200: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2026-02-19 19:30:30.106243: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-02-19 19:30:30.107276: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2026-02-19 19:30:30.113917: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-02-19 19:30:31.026240: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Global seed set to 123
>>> Loading prepared model from ckpts/unifolm_wma_dual.ckpt.prepared.pt ...
>>> Prepared model loaded.
>>> Diffusion backbone (model.model) converted to FP16.
>>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.
>>> Encoders (cond_stage_model, embedder) converted to FP16.
INFO:root:***** Configing Data *****
>>> unitree_z1_stackbox: 1 data samples loaded.
>>> unitree_z1_stackbox: data stats loaded.
>>> unitree_z1_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox: data stats loaded.
>>> unitree_z1_dual_arm_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox_v2: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox_v2: data stats loaded.
>>> unitree_z1_dual_arm_stackbox_v2: normalizer initiated.
>>> unitree_z1_dual_arm_cleanup_pencils: 1 data samples loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: data stats loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: normalizer initiated.
>>> unitree_g1_pack_camera: 1 data samples loaded.
>>> unitree_g1_pack_camera: data stats loaded.
>>> unitree_g1_pack_camera: normalizer initiated.
>>> Dataset is successfully loaded ...
✓ KV fused: 66 attention layers
>>> Generate 16 frames under each generation ...
DEBUG:h5py._conv:Creating converter from 3 to 5
DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'pHYs' 41 9
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 62 4096
0%| | 0/7 [00:00<?, ?it/s]
14%|█▍ | 1/7 [00:24<02:24, 24.09s/it]
29%|██▊ | 2/7 [00:47<01:58, 23.60s/it]
43%|████▎ | 3/7 [01:10<01:33, 23.48s/it]
57%|█████▋ | 4/7 [01:34<01:10, 23.43s/it]
71%|███████▏ | 5/7 [01:57<00:46, 23.43s/it]
86%|████████▌ | 6/7 [02:20<00:23, 23.42s/it]
100%|██████████| 7/7 [02:44<00:00, 23.40s/it]
100%|██████████| 7/7 [02:44<00:00, 23.46s/it]
>>> Step 0: generating actions ...
>>> Step 0: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 1: generating actions ...
>>> Step 1: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 2: generating actions ...
>>> Step 2: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 3: generating actions ...
>>> Step 3: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 4: generating actions ...
>>> Step 4: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 5: generating actions ...

View File

@@ -0,0 +1,5 @@
{
"gt_video": "unitree_z1_dual_arm_stackbox/case2/unitree_z1_dual_arm_stackbox_case2.mp4",
"pred_video": "unitree_z1_dual_arm_stackbox/case2/output/inference/15_full_fs4.mp4",
"psnr": 44.90750363879194
}

View File

@@ -20,5 +20,6 @@ dataset="unitree_z1_dual_arm_stackbox"
     --n_iter 7 \
     --timestep_spacing 'uniform_trailing' \
     --guidance_rescale 0.7 \
-    --perframe_ae
+    --perframe_ae \
+    --fast_policy_no_decode
 } 2>&1 | tee "${res_dir}/output.log"

View File

@@ -0,0 +1,62 @@
2026-02-19 19:33:31.235859: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-02-19 19:33:31.283866: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2026-02-19 19:33:31.283908: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-02-19 19:33:31.284941: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2026-02-19 19:33:31.291610: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-02-19 19:33:32.199716: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Global seed set to 123
>>> Loading prepared model from ckpts/unifolm_wma_dual.ckpt.prepared.pt ...
>>> Prepared model loaded.
>>> Diffusion backbone (model.model) converted to FP16.
>>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.
>>> Encoders (cond_stage_model, embedder) converted to FP16.
INFO:root:***** Configing Data *****
>>> unitree_z1_stackbox: 1 data samples loaded.
>>> unitree_z1_stackbox: data stats loaded.
>>> unitree_z1_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox: data stats loaded.
>>> unitree_z1_dual_arm_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox_v2: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox_v2: data stats loaded.
>>> unitree_z1_dual_arm_stackbox_v2: normalizer initiated.
>>> unitree_z1_dual_arm_cleanup_pencils: 1 data samples loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: data stats loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: normalizer initiated.
>>> unitree_g1_pack_camera: 1 data samples loaded.
>>> unitree_g1_pack_camera: data stats loaded.
>>> unitree_g1_pack_camera: normalizer initiated.
>>> Dataset is successfully loaded ...
✓ KV fused: 66 attention layers
>>> Generate 16 frames under each generation ...
DEBUG:h5py._conv:Creating converter from 3 to 5
DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'pHYs' 41 9
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 62 4096
0%| | 0/7 [00:00<?, ?it/s]
14%|█▍ | 1/7 [00:24<02:24, 24.10s/it]
29%|██▊ | 2/7 [00:47<01:58, 23.62s/it]
43%|████▎ | 3/7 [01:10<01:34, 23.51s/it]
57%|█████▋ | 4/7 [01:34<01:10, 23.46s/it]
71%|███████▏ | 5/7 [01:57<00:46, 23.44s/it]
86%|████████▌ | 6/7 [02:20<00:23, 23.43s/it]
100%|██████████| 7/7 [02:44<00:00, 23.40s/it]
100%|██████████| 7/7 [02:44<00:00, 23.47s/it]
>>> Step 0: generating actions ...
>>> Step 0: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 1: generating actions ...
>>> Step 1: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 2: generating actions ...
>>> Step 2: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 3: generating actions ...
>>> Step 3: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 4: generating actions ...
>>> Step 4: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 5: generating actions ...

View File

@@ -0,0 +1,5 @@
{
"gt_video": "unitree_z1_dual_arm_stackbox/case3/unitree_z1_dual_arm_stackbox_case3.mp4",
"pred_video": "unitree_z1_dual_arm_stackbox/case3/output/inference/25_full_fs4.mp4",
"psnr": 39.63695040491171
}

View File

@@ -20,5 +20,6 @@ dataset="unitree_z1_dual_arm_stackbox"
     --n_iter 7 \
     --timestep_spacing 'uniform_trailing' \
     --guidance_rescale 0.7 \
-    --perframe_ae
+    --perframe_ae \
+    --fast_policy_no_decode
 } 2>&1 | tee "${res_dir}/output.log"

View File

@@ -0,0 +1,62 @@
2026-02-19 19:36:32.251051: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-02-19 19:36:32.298464: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2026-02-19 19:36:32.298506: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-02-19 19:36:32.299538: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2026-02-19 19:36:32.306168: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-02-19 19:36:33.213503: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Global seed set to 123
>>> Loading prepared model from ckpts/unifolm_wma_dual.ckpt.prepared.pt ...
>>> Prepared model loaded.
>>> Diffusion backbone (model.model) converted to FP16.
>>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.
>>> Encoders (cond_stage_model, embedder) converted to FP16.
INFO:root:***** Configing Data *****
>>> unitree_z1_stackbox: 1 data samples loaded.
>>> unitree_z1_stackbox: data stats loaded.
>>> unitree_z1_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox: data stats loaded.
>>> unitree_z1_dual_arm_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox_v2: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox_v2: data stats loaded.
>>> unitree_z1_dual_arm_stackbox_v2: normalizer initiated.
>>> unitree_z1_dual_arm_cleanup_pencils: 1 data samples loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: data stats loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: normalizer initiated.
>>> unitree_g1_pack_camera: 1 data samples loaded.
>>> unitree_g1_pack_camera: data stats loaded.
>>> unitree_g1_pack_camera: normalizer initiated.
>>> Dataset is successfully loaded ...
✓ KV fused: 66 attention layers
>>> Generate 16 frames under each generation ...
DEBUG:h5py._conv:Creating converter from 3 to 5
DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'pHYs' 41 9
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 62 4096
0%| | 0/7 [00:00<?, ?it/s]
14%|█▍ | 1/7 [00:24<02:24, 24.05s/it]
29%|██▊ | 2/7 [00:47<01:57, 23.58s/it]
43%|████▎ | 3/7 [01:10<01:33, 23.45s/it]
57%|█████▋ | 4/7 [01:33<01:10, 23.43s/it]
71%|███████▏ | 5/7 [01:57<00:46, 23.41s/it]
86%|████████▌ | 6/7 [02:20<00:23, 23.38s/it]
100%|██████████| 7/7 [02:43<00:00, 23.35s/it]
100%|██████████| 7/7 [02:43<00:00, 23.42s/it]
>>> Step 0: generating actions ...
>>> Step 0: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 1: generating actions ...
>>> Step 1: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 2: generating actions ...
>>> Step 2: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 3: generating actions ...
>>> Step 3: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 4: generating actions ...
>>> Step 4: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 5: generating actions ...

View File

@@ -0,0 +1,5 @@
{
"gt_video": "unitree_z1_dual_arm_stackbox/case4/unitree_z1_dual_arm_stackbox_case4.mp4",
"pred_video": "unitree_z1_dual_arm_stackbox/case4/output/inference/35_full_fs4.mp4",
"psnr": 42.34177660061245
}

View File

@@ -20,5 +20,6 @@ dataset="unitree_z1_dual_arm_stackbox"
     --n_iter 7 \
     --timestep_spacing 'uniform_trailing' \
     --guidance_rescale 0.7 \
-    --perframe_ae
+    --perframe_ae \
+    --fast_policy_no_decode
 } 2>&1 | tee "${res_dir}/output.log"

View File

@@ -1,13 +1,16 @@
-2026-02-11 11:59:27.241485: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
-2026-02-11 11:59:27.291755: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
-2026-02-11 11:59:27.291807: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
-2026-02-11 11:59:27.293169: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
-2026-02-11 11:59:27.300838: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
+2026-02-19 19:39:32.908698: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
+2026-02-19 19:39:32.956378: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
+2026-02-19 19:39:32.956417: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
+2026-02-19 19:39:32.957459: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
+2026-02-19 19:39:32.964104: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
 To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
-2026-02-11 11:59:28.228009: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
+2026-02-19 19:39:33.875854: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
 Global seed set to 123
 >>> Loading prepared model from ckpts/unifolm_wma_dual.ckpt.prepared.pt ...
 >>> Prepared model loaded.
+>>> Diffusion backbone (model.model) converted to FP16.
+>>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.
+>>> Encoders (cond_stage_model, embedder) converted to FP16.
 INFO:root:***** Configing Data *****
 >>> unitree_z1_stackbox: 1 data samples loaded.
 >>> unitree_z1_stackbox: data stats loaded.
@@ -31,60 +34,11 @@ DEBUG:h5py._conv:Creating converter from 3 to 5
 DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
 DEBUG:PIL.PngImagePlugin:STREAM b'pHYs' 41 9
 DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 62 4096
-0%| | 0/11 [00:00<?, ?it/s]
 9%|▉ | 1/11 [00:24<04:01, 24.10s/it]
 18%|█▊ | 2/11 [00:47<03:32, 23.61s/it]
 27%|██▋ | 3/11 [01:10<03:08, 23.50s/it]
->>>>>>>>>>>>>>>>>>>>>>>>
->>> Step 1: generating actions ...
-DEBUG:PIL.Image:Importing BlpImagePlugin
-DEBUG:PIL.Image:Importing BmpImagePlugin
-DEBUG:PIL.Image:Importing BufrStubImagePlugin
-DEBUG:PIL.Image:Importing CurImagePlugin
-DEBUG:PIL.Image:Importing DcxImagePlugin
-DEBUG:PIL.Image:Importing DdsImagePlugin
-DEBUG:PIL.Image:Importing EpsImagePlugin
-DEBUG:PIL.Image:Importing FitsImagePlugin
-DEBUG:PIL.Image:Importing FitsStubImagePlugin
-DEBUG:PIL.Image:Importing FliImagePlugin
-DEBUG:PIL.Image:Importing FpxImagePlugin
-DEBUG:PIL.Image:Image: failed to import FpxImagePlugin: No module named 'olefile'
-DEBUG:PIL.Image:Importing FtexImagePlugin
-DEBUG:PIL.Image:Importing GbrImagePlugin
-DEBUG:PIL.Image:Importing GifImagePlugin
-DEBUG:PIL.Image:Importing GribStubImagePlugin
-DEBUG:PIL.Image:Importing Hdf5StubImagePlugin
-DEBUG:PIL.Image:Importing IcnsImagePlugin
-DEBUG:PIL.Image:Importing IcoImagePlugin
-DEBUG:PIL.Image:Importing ImImagePlugin
-DEBUG:PIL.Image:Importing ImtImagePlugin
-DEBUG:PIL.Image:Importing IptcImagePlugin
-DEBUG:PIL.Image:Importing JpegImagePlugin
-DEBUG:PIL.Image:Importing Jpeg2KImagePlugin
-DEBUG:PIL.Image:Importing McIdasImagePlugin
-DEBUG:PIL.Image:Importing MicImagePlugin
-DEBUG:PIL.Image:Image: failed to import MicImagePlugin: No module named 'olefile'
-DEBUG:PIL.Image:Importing MpegImagePlugin
-DEBUG:PIL.Image:Importing MpoImagePlugin
-DEBUG:PIL.Image:Importing MspImagePlugin
-DEBUG:PIL.Image:Importing PalmImagePlugin
-DEBUG:PIL.Image:Importing PcdImagePlugin
-DEBUG:PIL.Image:Importing PcxImagePlugin
-DEBUG:PIL.Image:Importing PdfImagePlugin
-DEBUG:PIL.Image:Importing PixarImagePlugin
-DEBUG:PIL.Image:Importing PngImagePlugin
-DEBUG:PIL.Image:Importing PpmImagePlugin
-DEBUG:PIL.Image:Importing PsdImagePlugin
-DEBUG:PIL.Image:Importing QoiImagePlugin
-DEBUG:PIL.Image:Importing SgiImagePlugin
-DEBUG:PIL.Image:Importing SpiderImagePlugin
-DEBUG:PIL.Image:Importing SunImagePlugin
-DEBUG:PIL.Image:Importing TgaImagePlugin
-DEBUG:PIL.Image:Importing TiffImagePlugin
-DEBUG:PIL.Image:Importing WebPImagePlugin
-DEBUG:PIL.Image:Importing WmfImagePlugin
-DEBUG:PIL.Image:Importing XbmImagePlugin
-DEBUG:PIL.Image:Importing XpmImagePlugin
 36%|███▋ | 4/11 [01:34<02:44, 23.45s/it]
 45%|████▌ | 5/11 [01:57<02:20, 23.41s/it]
 55%|█████▍ | 6/11 [02:20<01:56, 23.39s/it]
@@ -115,6 +69,6 @@ DEBUG:PIL.Image:Importing XVThumbImagePlugin
 >>> Step 6: generating actions ...
 >>> Step 6: interacting with world model ...
 >>>>>>>>>>>>>>>>>>>>>>>>
 >>> Step 7: generating actions ...
 >>> Step 7: interacting with world model ...
 >>>>>>>>>>>>>>>>>>>>>>>>

View File

@@ -1,5 +1,5 @@
 {
-"gt_video": "/home/qhy/unifolm-world-model-action/unitree_z1_dual_arm_stackbox_v2/case1/unitree_z1_dual_arm_stackbox_v2_case1.mp4",
-"pred_video": "/home/qhy/unifolm-world-model-action/unitree_z1_dual_arm_stackbox_v2/case1/output/inference/5_full_fs4.mp4",
-"psnr": 28.167025381705358
+"gt_video": "unitree_z1_dual_arm_stackbox_v2/case1/unitree_z1_dual_arm_stackbox_v2_case1.mp4",
+"pred_video": "unitree_z1_dual_arm_stackbox_v2/case1/output/inference/5_full_fs4.mp4",
+"psnr": 26.68301835085306
 }

View File

@@ -0,0 +1,74 @@
2026-02-19 19:44:07.724109: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-02-19 19:44:07.771461: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2026-02-19 19:44:07.771505: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-02-19 19:44:07.772537: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2026-02-19 19:44:07.779172: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-02-19 19:44:08.688975: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Global seed set to 123
>>> Loading prepared model from ckpts/unifolm_wma_dual.ckpt.prepared.pt ...
>>> Prepared model loaded.
>>> Diffusion backbone (model.model) converted to FP16.
>>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.
>>> Encoders (cond_stage_model, embedder) converted to FP16.
INFO:root:***** Configing Data *****
>>> unitree_z1_stackbox: 1 data samples loaded.
>>> unitree_z1_stackbox: data stats loaded.
>>> unitree_z1_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox: data stats loaded.
>>> unitree_z1_dual_arm_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox_v2: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox_v2: data stats loaded.
>>> unitree_z1_dual_arm_stackbox_v2: normalizer initiated.
>>> unitree_z1_dual_arm_cleanup_pencils: 1 data samples loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: data stats loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: normalizer initiated.
>>> unitree_g1_pack_camera: 1 data samples loaded.
>>> unitree_g1_pack_camera: data stats loaded.
>>> unitree_g1_pack_camera: normalizer initiated.
>>> Dataset is successfully loaded ...
✓ KV fused: 66 attention layers
>>> Generate 16 frames under each generation ...
DEBUG:h5py._conv:Creating converter from 3 to 5
DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'pHYs' 41 9
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 62 4096
0%| | 0/11 [00:00<?, ?it/s]
9%|▉ | 1/11 [00:24<04:00, 24.03s/it]
18%|█▊ | 2/11 [00:47<03:31, 23.54s/it]
27%|██▋ | 3/11 [01:10<03:07, 23.42s/it]
36%|███▋ | 4/11 [01:33<02:43, 23.40s/it]
45%|████▌ | 5/11 [01:57<02:20, 23.39s/it]
55%|█████▍ | 6/11 [02:20<01:56, 23.37s/it]
64%|██████▎ | 7/11 [02:43<01:33, 23.34s/it]
73%|███████▎ | 8/11 [03:07<01:09, 23.32s/it]
82%|████████▏ | 9/11 [03:30<00:46, 23.31s/it]
91%|█████████ | 10/11 [03:53<00:23, 23.29s/it]
100%|██████████| 11/11 [04:16<00:00, 23.28s/it]
100%|██████████| 11/11 [04:16<00:00, 23.36s/it]
>>> Step 0: generating actions ...
>>> Step 0: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 1: generating actions ...
>>> Step 1: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 2: generating actions ...
>>> Step 2: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 3: generating actions ...
>>> Step 3: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 4: generating actions ...
>>> Step 4: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 5: generating actions ...
>>> Step 5: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 6: generating actions ...
>>> Step 6: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 7: generating actions ...
>>> Step 7: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>

View File

@@ -0,0 +1,5 @@
{
"gt_video": "unitree_z1_dual_arm_stackbox_v2/case2/unitree_z1_dual_arm_stackbox_v2_case2.mp4",
"pred_video": "unitree_z1_dual_arm_stackbox_v2/case2/output/inference/15_full_fs4.mp4",
"psnr": 27.46347145461597
}

View File

@@ -20,5 +20,6 @@ dataset="unitree_z1_dual_arm_stackbox_v2"
     --n_iter 11 \
     --timestep_spacing 'uniform_trailing' \
     --guidance_rescale 0.7 \
-    --perframe_ae
+    --perframe_ae \
+    --fast_policy_no_decode
 } 2>&1 | tee "${res_dir}/output.log"

View File

@@ -0,0 +1,74 @@
2026-02-19 19:48:42.460586: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-02-19 19:48:42.508096: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2026-02-19 19:48:42.508140: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-02-19 19:48:42.509152: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2026-02-19 19:48:42.515865: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-02-19 19:48:43.425699: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Global seed set to 123
>>> Loading prepared model from ckpts/unifolm_wma_dual.ckpt.prepared.pt ...
>>> Prepared model loaded.
>>> Diffusion backbone (model.model) converted to FP16.
>>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.
>>> Encoders (cond_stage_model, embedder) converted to FP16.
INFO:root:***** Configing Data *****
>>> unitree_z1_stackbox: 1 data samples loaded.
>>> unitree_z1_stackbox: data stats loaded.
>>> unitree_z1_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox: data stats loaded.
>>> unitree_z1_dual_arm_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox_v2: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox_v2: data stats loaded.
>>> unitree_z1_dual_arm_stackbox_v2: normalizer initiated.
>>> unitree_z1_dual_arm_cleanup_pencils: 1 data samples loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: data stats loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: normalizer initiated.
>>> unitree_g1_pack_camera: 1 data samples loaded.
>>> unitree_g1_pack_camera: data stats loaded.
>>> unitree_g1_pack_camera: normalizer initiated.
>>> Dataset is successfully loaded ...
✓ KV fused: 66 attention layers
>>> Generate 16 frames under each generation ...
DEBUG:h5py._conv:Creating converter from 3 to 5
DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'pHYs' 41 9
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 62 4096
0%| | 0/11 [00:00<?, ?it/s]
9%|▉ | 1/11 [00:24<04:00, 24.07s/it]
18%|█▊ | 2/11 [00:47<03:32, 23.62s/it]
27%|██▋ | 3/11 [01:10<03:08, 23.51s/it]
36%|███▋ | 4/11 [01:34<02:44, 23.46s/it]
45%|████▌ | 5/11 [01:57<02:20, 23.42s/it]
55%|█████▍ | 6/11 [02:20<01:56, 23.39s/it]
64%|██████▎ | 7/11 [02:44<01:33, 23.37s/it]
73%|███████▎ | 8/11 [03:07<01:10, 23.36s/it]
82%|████████▏ | 9/11 [03:30<00:46, 23.36s/it]
91%|█████████ | 10/11 [03:54<00:23, 23.36s/it]
100%|██████████| 11/11 [04:17<00:00, 23.36s/it]
100%|██████████| 11/11 [04:17<00:00, 23.42s/it]
>>> Step 0: generating actions ...
>>> Step 0: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 1: generating actions ...
>>> Step 1: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 2: generating actions ...
>>> Step 2: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 3: generating actions ...
>>> Step 3: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 4: generating actions ...
>>> Step 4: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 5: generating actions ...
>>> Step 5: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 6: generating actions ...
>>> Step 6: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 7: generating actions ...
>>> Step 7: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>

View File

@@ -0,0 +1,5 @@
{
"gt_video": "unitree_z1_dual_arm_stackbox_v2/case3/unitree_z1_dual_arm_stackbox_v2_case3.mp4",
"pred_video": "unitree_z1_dual_arm_stackbox_v2/case3/output/inference/25_full_fs4.mp4",
"psnr": 28.604047286947512
}

View File

@@ -20,5 +20,6 @@ dataset="unitree_z1_dual_arm_stackbox_v2"
     --n_iter 11 \
     --timestep_spacing 'uniform_trailing' \
     --guidance_rescale 0.7 \
-    --perframe_ae
+    --perframe_ae \
+    --fast_policy_no_decode
 } 2>&1 | tee "${res_dir}/output.log"

View File

@@ -0,0 +1,74 @@
2026-02-19 19:53:17.574354: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-02-19 19:53:17.621335: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2026-02-19 19:53:17.621388: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-02-19 19:53:17.622415: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2026-02-19 19:53:17.629050: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-02-19 19:53:18.537233: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Global seed set to 123
>>> Loading prepared model from ckpts/unifolm_wma_dual.ckpt.prepared.pt ...
>>> Prepared model loaded.
>>> Diffusion backbone (model.model) converted to FP16.
>>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.
>>> Encoders (cond_stage_model, embedder) converted to FP16.
INFO:root:***** Configing Data *****
>>> unitree_z1_stackbox: 1 data samples loaded.
>>> unitree_z1_stackbox: data stats loaded.
>>> unitree_z1_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox: data stats loaded.
>>> unitree_z1_dual_arm_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox_v2: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox_v2: data stats loaded.
>>> unitree_z1_dual_arm_stackbox_v2: normalizer initiated.
>>> unitree_z1_dual_arm_cleanup_pencils: 1 data samples loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: data stats loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: normalizer initiated.
>>> unitree_g1_pack_camera: 1 data samples loaded.
>>> unitree_g1_pack_camera: data stats loaded.
>>> unitree_g1_pack_camera: normalizer initiated.
>>> Dataset is successfully loaded ...
✓ KV fused: 66 attention layers
>>> Generate 16 frames under each generation ...
DEBUG:h5py._conv:Creating converter from 3 to 5
DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'pHYs' 41 9
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 62 4096
0%| | 0/11 [00:00<?, ?it/s]
9%|▉ | 1/11 [00:24<04:00, 24.09s/it]
18%|█▊ | 2/11 [00:47<03:32, 23.62s/it]
27%|██▋ | 3/11 [01:10<03:07, 23.49s/it]
36%|███▋ | 4/11 [01:34<02:44, 23.47s/it]
45%|████▌ | 5/11 [01:57<02:20, 23.45s/it]
55%|█████▍ | 6/11 [02:20<01:57, 23.42s/it]
64%|██████▎ | 7/11 [02:44<01:33, 23.39s/it]
73%|███████▎ | 8/11 [03:07<01:10, 23.37s/it]
82%|████████▏ | 9/11 [03:30<00:46, 23.35s/it]
91%|█████████ | 10/11 [03:54<00:23, 23.35s/it]
100%|██████████| 11/11 [04:17<00:00, 23.33s/it]
100%|██████████| 11/11 [04:17<00:00, 23.41s/it]
>>> Step 0: generating actions ...
>>> Step 0: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 1: generating actions ...
>>> Step 1: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 2: generating actions ...
>>> Step 2: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 3: generating actions ...
>>> Step 3: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 4: generating actions ...
>>> Step 4: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 5: generating actions ...
>>> Step 5: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 6: generating actions ...
>>> Step 6: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 7: generating actions ...
>>> Step 7: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
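
Every run above also logs "✓ KV fused: 66 attention layers" before sampling begins. The fusion code is not part of this excerpt; the message is consistent with the standard trick of concatenating a cross-attention layer's to_k and to_v weights so both projections come out of a single GEMM. A minimal self-checking sketch (module names and dimensions are illustrative):

```python
# Hedged sketch: fuse separate to_k / to_v projections into one matmul.
import torch
import torch.nn as nn

class FusedKV(nn.Module):
    def __init__(self, to_k: nn.Linear, to_v: nn.Linear):
        super().__init__()
        d_ctx, d_k, d_v = to_k.in_features, to_k.out_features, to_v.out_features
        self.proj = nn.Linear(d_ctx, d_k + d_v, bias=False)
        with torch.no_grad():  # stack the original weights row-wise
            self.proj.weight.copy_(torch.cat([to_k.weight, to_v.weight], dim=0))
        self.d_k = d_k

    def forward(self, context: torch.Tensor):
        kv = self.proj(context)               # one GEMM instead of two
        return kv[..., :self.d_k], kv[..., self.d_k:]

to_k = nn.Linear(1024, 640, bias=False)
to_v = nn.Linear(1024, 640, bias=False)
fused = FusedKV(to_k, to_v)
ctx = torch.randn(2, 77, 1024)                # dummy context tokens
k, v = fused(ctx)
assert torch.allclose(k, to_k(ctx), atol=1e-5)
assert torch.allclose(v, to_v(ctx), atol=1e-5)
```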

View File

@@ -0,0 +1,5 @@
{
"gt_video": "unitree_z1_dual_arm_stackbox_v2/case4/unitree_z1_dual_arm_stackbox_v2_case4.mp4",
"pred_video": "unitree_z1_dual_arm_stackbox_v2/case4/output/inference/35_full_fs4.mp4",
"psnr": 25.578757174083307
}

View File

@@ -20,5 +20,6 @@ dataset="unitree_z1_dual_arm_stackbox_v2"
     --n_iter 11 \
     --timestep_spacing 'uniform_trailing' \
     --guidance_rescale 0.7 \
-    --perframe_ae
+    --perframe_ae \
+    --fast_policy_no_decode
 } 2>&1 | tee "${res_dir}/output.log"
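
These scripts also pin `--timestep_spacing 'uniform_trailing'`. Assuming the name follows the common "trailing" convention (uniform step spacing anchored at the last training timestep, as popularized by diffusers), the selected DDIM timesteps can be sketched as:

```python
# Hedged sketch of trailing uniform timestep spacing; whether the repo's
# 'uniform_trailing' matches this exactly is an assumption.
import numpy as np

def trailing_timesteps(num_train_timesteps: int, num_inference_steps: int) -> np.ndarray:
    step = num_train_timesteps / num_inference_steps
    # Walk back from T so the schedule ends exactly at timestep T-1.
    return np.arange(num_train_timesteps, 0, -step).round().astype(int) - 1

print(trailing_timesteps(1000, 50)[:5])  # [999 979 959 939 919]
```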

View File

@@ -0,0 +1,77 @@
2026-02-19 19:57:52.488339: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-02-19 19:57:52.536176: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2026-02-19 19:57:52.536222: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-02-19 19:57:52.537285: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2026-02-19 19:57:52.544051: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-02-19 19:57:53.469912: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Global seed set to 123
>>> Loading prepared model from ckpts/unifolm_wma_dual.ckpt.prepared.pt ...
>>> Prepared model loaded.
>>> Diffusion backbone (model.model) converted to FP16.
>>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.
>>> Encoders (cond_stage_model, embedder) converted to FP16.
INFO:root:***** Configing Data *****
>>> unitree_z1_stackbox: 1 data samples loaded.
>>> unitree_z1_stackbox: data stats loaded.
>>> unitree_z1_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox: data stats loaded.
>>> unitree_z1_dual_arm_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox_v2: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox_v2: data stats loaded.
>>> unitree_z1_dual_arm_stackbox_v2: normalizer initiated.
>>> unitree_z1_dual_arm_cleanup_pencils: 1 data samples loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: data stats loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: normalizer initiated.
>>> unitree_g1_pack_camera: 1 data samples loaded.
>>> unitree_g1_pack_camera: data stats loaded.
>>> unitree_g1_pack_camera: normalizer initiated.
>>> Dataset is successfully loaded ...
✓ KV fused: 66 attention layers
>>> Generate 16 frames under each generation ...
DEBUG:h5py._conv:Creating converter from 3 to 5
DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'pHYs' 41 9
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 62 4096
0%| | 0/12 [00:00<?, ?it/s]
8%|▊ | 1/12 [00:24<04:24, 24.06s/it]
17%|█▋ | 2/12 [00:47<03:55, 23.56s/it]
25%|██▌ | 3/12 [01:10<03:31, 23.46s/it]
33%|███▎ | 4/12 [01:33<03:07, 23.43s/it]
42%|████▏ | 5/12 [01:57<02:43, 23.41s/it]
50%|█████ | 6/12 [02:20<02:20, 23.42s/it]
58%|█████▊ | 7/12 [02:44<01:56, 23.39s/it]
67%|██████▋ | 8/12 [03:07<01:33, 23.36s/it]
75%|███████▌ | 9/12 [03:30<01:09, 23.33s/it]
83%|████████▎ | 10/12 [03:53<00:46, 23.31s/it]
92%|█████████▏| 11/12 [04:17<00:23, 23.30s/it]
100%|██████████| 12/12 [04:40<00:00, 23.29s/it]
100%|██████████| 12/12 [04:40<00:00, 23.37s/it]
>>> Step 0: generating actions ...
>>> Step 0: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 1: generating actions ...
>>> Step 1: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 2: generating actions ...
>>> Step 2: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 3: generating actions ...
>>> Step 3: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 4: generating actions ...
>>> Step 4: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 5: generating actions ...
>>> Step 5: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 6: generating actions ...
>>> Step 6: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 7: generating actions ...
>>> Step 7: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 8: generating actions ...
>>> Step 8: interacting with world model ...

View File

@@ -0,0 +1,5 @@
{
"gt_video": "unitree_z1_stackbox/case1/unitree_z1_stackbox_case1.mp4",
"pred_video": "unitree_z1_stackbox/case1/output/inference/5_full_fs4.mp4",
"psnr": 46.05271283048069
}

View File

@@ -20,5 +20,6 @@ dataset="unitree_z1_stackbox"
     --n_iter 12 \
     --timestep_spacing 'uniform_trailing' \
     --guidance_rescale 0.7 \
-    --perframe_ae
+    --perframe_ae \
+    --fast_policy_no_decode
 } 2>&1 | tee "${res_dir}/output.log"

View File

@@ -0,0 +1,77 @@
2026-02-19 20:02:50.975402: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-02-19 20:02:51.023211: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2026-02-19 20:02:51.023253: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-02-19 20:02:51.024328: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2026-02-19 20:02:51.031176: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-02-19 20:02:51.947400: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Global seed set to 123
>>> Loading prepared model from ckpts/unifolm_wma_dual.ckpt.prepared.pt ...
>>> Prepared model loaded.
>>> Diffusion backbone (model.model) converted to FP16.
>>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.
>>> Encoders (cond_stage_model, embedder) converted to FP16.
INFO:root:***** Configing Data *****
>>> unitree_z1_stackbox: 1 data samples loaded.
>>> unitree_z1_stackbox: data stats loaded.
>>> unitree_z1_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox: data stats loaded.
>>> unitree_z1_dual_arm_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox_v2: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox_v2: data stats loaded.
>>> unitree_z1_dual_arm_stackbox_v2: normalizer initiated.
>>> unitree_z1_dual_arm_cleanup_pencils: 1 data samples loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: data stats loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: normalizer initiated.
>>> unitree_g1_pack_camera: 1 data samples loaded.
>>> unitree_g1_pack_camera: data stats loaded.
>>> unitree_g1_pack_camera: normalizer initiated.
>>> Dataset is successfully loaded ...
✓ KV fused: 66 attention layers
>>> Generate 16 frames under each generation ...
DEBUG:h5py._conv:Creating converter from 3 to 5
DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'pHYs' 41 9
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 62 4096
0%| | 0/12 [00:00<?, ?it/s]
8%|▊ | 1/12 [00:24<04:24, 24.08s/it]
17%|█▋ | 2/12 [00:47<03:56, 23.62s/it]
25%|██▌ | 3/12 [01:10<03:31, 23.51s/it]
33%|███▎ | 4/12 [01:34<03:07, 23.48s/it]
42%|████▏ | 5/12 [01:57<02:44, 23.46s/it]
50%|█████ | 6/12 [02:20<02:20, 23.43s/it]
58%|█████▊ | 7/12 [02:44<01:57, 23.40s/it]
67%|██████▋ | 8/12 [03:07<01:33, 23.39s/it]
75%|███████▌ | 9/12 [03:31<01:10, 23.39s/it]
83%|████████▎ | 10/12 [03:54<00:46, 23.38s/it]
92%|█████████▏| 11/12 [04:17<00:23, 23.37s/it]
100%|██████████| 12/12 [04:41<00:00, 23.35s/it]
100%|██████████| 12/12 [04:41<00:00, 23.42s/it]
>>> Step 0: generating actions ...
>>> Step 0: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 1: generating actions ...
>>> Step 1: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 2: generating actions ...
>>> Step 2: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 3: generating actions ...
>>> Step 3: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 4: generating actions ...
>>> Step 4: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 5: generating actions ...
>>> Step 5: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 6: generating actions ...
>>> Step 6: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 7: generating actions ...
>>> Step 7: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 8: generating actions ...
>>> Step 8: interacting with world model ...

View File

@@ -0,0 +1,5 @@
{
"gt_video": "unitree_z1_stackbox/case2/unitree_z1_stackbox_case2.mp4",
"pred_video": "unitree_z1_stackbox/case2/output/inference/15_full_fs4.mp4",
"psnr": 43.005233352958804
}

View File

@@ -20,5 +20,6 @@ dataset="unitree_z1_stackbox"
     --n_iter 12 \
     --timestep_spacing 'uniform_trailing' \
     --guidance_rescale 0.7 \
-    --perframe_ae
+    --perframe_ae \
+    --fast_policy_no_decode
 } 2>&1 | tee "${res_dir}/output.log"

View File

@@ -0,0 +1,77 @@
2026-02-19 20:07:49.410622: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-02-19 20:07:49.457896: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2026-02-19 20:07:49.457948: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-02-19 20:07:49.458967: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2026-02-19 20:07:49.465632: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-02-19 20:07:50.373326: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Global seed set to 123
>>> Loading prepared model from ckpts/unifolm_wma_dual.ckpt.prepared.pt ...
>>> Prepared model loaded.
>>> Diffusion backbone (model.model) converted to FP16.
>>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.
>>> Encoders (cond_stage_model, embedder) converted to FP16.
INFO:root:***** Configing Data *****
>>> unitree_z1_stackbox: 1 data samples loaded.
>>> unitree_z1_stackbox: data stats loaded.
>>> unitree_z1_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox: data stats loaded.
>>> unitree_z1_dual_arm_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox_v2: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox_v2: data stats loaded.
>>> unitree_z1_dual_arm_stackbox_v2: normalizer initiated.
>>> unitree_z1_dual_arm_cleanup_pencils: 1 data samples loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: data stats loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: normalizer initiated.
>>> unitree_g1_pack_camera: 1 data samples loaded.
>>> unitree_g1_pack_camera: data stats loaded.
>>> unitree_g1_pack_camera: normalizer initiated.
>>> Dataset is successfully loaded ...
✓ KV fused: 66 attention layers
>>> Generate 16 frames under each generation ...
DEBUG:h5py._conv:Creating converter from 3 to 5
DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'pHYs' 41 9
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 62 4096
0%| | 0/12 [00:00<?, ?it/s]
8%|▊ | 1/12 [00:24<04:25, 24.17s/it]
17%|█▋ | 2/12 [00:47<03:56, 23.64s/it]
25%|██▌ | 3/12 [01:10<03:31, 23.53s/it]
33%|███▎ | 4/12 [01:34<03:07, 23.49s/it]
42%|████▏ | 5/12 [01:57<02:44, 23.45s/it]
50%|█████ | 6/12 [02:21<02:20, 23.43s/it]
58%|█████▊ | 7/12 [02:44<01:57, 23.41s/it]
67%|██████▋ | 8/12 [03:07<01:33, 23.39s/it]
75%|███████▌ | 9/12 [03:31<01:10, 23.37s/it]
83%|████████▎ | 10/12 [03:54<00:46, 23.37s/it]
92%|█████████▏| 11/12 [04:17<00:23, 23.36s/it]
100%|██████████| 12/12 [04:41<00:00, 23.34s/it]
100%|██████████| 12/12 [04:41<00:00, 23.43s/it]
>>> Step 0: generating actions ...
>>> Step 0: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 1: generating actions ...
>>> Step 1: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 2: generating actions ...
>>> Step 2: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 3: generating actions ...
>>> Step 3: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 4: generating actions ...
>>> Step 4: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 5: generating actions ...
>>> Step 5: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 6: generating actions ...
>>> Step 6: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 7: generating actions ...
>>> Step 7: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 8: generating actions ...
>>> Step 8: interacting with world model ...

View File

@@ -0,0 +1,5 @@
{
"gt_video": "unitree_z1_stackbox/case3/unitree_z1_stackbox_case3.mp4",
"pred_video": "unitree_z1_stackbox/case3/output/inference/25_full_fs4.mp4",
"psnr": 49.489774674892764
}

View File

@@ -20,5 +20,6 @@ dataset="unitree_z1_stackbox"
     --n_iter 12 \
     --timestep_spacing 'uniform_trailing' \
     --guidance_rescale 0.7 \
-    --perframe_ae
+    --perframe_ae \
+    --fast_policy_no_decode
 } 2>&1 | tee "${res_dir}/output.log"

View File

@@ -0,0 +1,77 @@
2026-02-19 20:12:48.029611: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-02-19 20:12:48.076914: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2026-02-19 20:12:48.076957: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2026-02-19 20:12:48.077981: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2026-02-19 20:12:48.084620: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-02-19 20:12:49.004753: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Global seed set to 123
>>> Loading prepared model from ckpts/unifolm_wma_dual.ckpt.prepared.pt ...
>>> Prepared model loaded.
>>> Diffusion backbone (model.model) converted to FP16.
>>> Projectors (image_proj_model, state_projector, action_projector) converted to FP16.
>>> Encoders (cond_stage_model, embedder) converted to FP16.
INFO:root:***** Configing Data *****
>>> unitree_z1_stackbox: 1 data samples loaded.
>>> unitree_z1_stackbox: data stats loaded.
>>> unitree_z1_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox: data stats loaded.
>>> unitree_z1_dual_arm_stackbox: normalizer initiated.
>>> unitree_z1_dual_arm_stackbox_v2: 1 data samples loaded.
>>> unitree_z1_dual_arm_stackbox_v2: data stats loaded.
>>> unitree_z1_dual_arm_stackbox_v2: normalizer initiated.
>>> unitree_z1_dual_arm_cleanup_pencils: 1 data samples loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: data stats loaded.
>>> unitree_z1_dual_arm_cleanup_pencils: normalizer initiated.
>>> unitree_g1_pack_camera: 1 data samples loaded.
>>> unitree_g1_pack_camera: data stats loaded.
>>> unitree_g1_pack_camera: normalizer initiated.
>>> Dataset is successfully loaded ...
✓ KV fused: 66 attention layers
>>> Generate 16 frames under each generation ...
DEBUG:h5py._conv:Creating converter from 3 to 5
DEBUG:PIL.PngImagePlugin:STREAM b'IHDR' 16 13
DEBUG:PIL.PngImagePlugin:STREAM b'pHYs' 41 9
DEBUG:PIL.PngImagePlugin:STREAM b'IDAT' 62 4096
0%| | 0/12 [00:00<?, ?it/s]
8%|▊ | 1/12 [00:24<04:24, 24.06s/it]
17%|█▋ | 2/12 [00:47<03:55, 23.59s/it]
25%|██▌ | 3/12 [01:10<03:31, 23.49s/it]
33%|███▎ | 4/12 [01:34<03:07, 23.44s/it]
42%|████▏ | 5/12 [01:57<02:43, 23.41s/it]
50%|█████ | 6/12 [02:20<02:20, 23.40s/it]
58%|█████▊ | 7/12 [02:44<01:56, 23.38s/it]
67%|██████▋ | 8/12 [03:07<01:33, 23.37s/it]
75%|███████▌ | 9/12 [03:30<01:10, 23.36s/it]
83%|████████▎ | 10/12 [03:54<00:46, 23.35s/it]
92%|█████████▏| 11/12 [04:17<00:23, 23.33s/it]
100%|██████████| 12/12 [04:40<00:00, 23.32s/it]
100%|██████████| 12/12 [04:40<00:00, 23.39s/it]
>>> Step 0: generating actions ...
>>> Step 0: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 1: generating actions ...
>>> Step 1: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 2: generating actions ...
>>> Step 2: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 3: generating actions ...
>>> Step 3: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 4: generating actions ...
>>> Step 4: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 5: generating actions ...
>>> Step 5: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 6: generating actions ...
>>> Step 6: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 7: generating actions ...
>>> Step 7: interacting with world model ...
>>>>>>>>>>>>>>>>>>>>>>>>
>>> Step 8: generating actions ...
>>> Step 8: interacting with world model ...

View File

@@ -0,0 +1,5 @@
{
"gt_video": "unitree_z1_stackbox/case4/unitree_z1_stackbox_case4.mp4",
"pred_video": "unitree_z1_stackbox/case4/output/inference/35_full_fs4.mp4",
"psnr": 47.18724378194084
}

View File

@@ -20,5 +20,6 @@ dataset="unitree_z1_stackbox"
     --n_iter 12 \
     --timestep_spacing 'uniform_trailing' \
     --guidance_rescale 0.7 \
-    --perframe_ae
+    --perframe_ae \
+    --fast_policy_no_decode
 } 2>&1 | tee "${res_dir}/output.log"