[Paper Notes] DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

This post supports English / 中文 switching via the site language toggle in the top navigation.

TL;DR

DreamDojo is a foundation action-conditioned world model (AC-WM) that learns diverse interaction physics from 44k hours of egocentric human videos — the largest video dataset to date for world model pretraining. To overcome the scarcity of action labels in human videos, it introduces continuous latent actions as unified proxy actions extracted via a self-supervised VAE. After post-training on small-scale robot data, DreamDojo demonstrates:

  • Strong OOD generalization to unseen objects, skills, and environments
  • Real-time inference at 10.81 FPS via an autoregressive distillation pipeline
  • Downstream applications: policy evaluation (Pearson r=0.995 with real-world), model-based planning (2x success rate improvement), and live teleoperation

This is an AC-WM (actions-in) — the counterpart to WAMs like DreamZero. See also A Tale of Two World Models for the WAM vs. AC-WM debate.

Paper Info

  • Title: DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
  • Authors: Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, et al.
  • Affiliation: NVIDIA, HKUST, UC Berkeley, UW, Stanford, KAIST, UofT, UCSD, UT Austin
  • Date: 2026-02-06
  • arXiv: 2602.06949
  • Project page: dreamdojo-world.github.io

1. Motivation

Existing robot world models are trained on limited robot data and confined to in-distribution settings. The key bottleneck:

  • Robot data is scarce and expensive — hardware variability, teleoperation cost, mostly expert demonstrations
  • Real-world diversity is nearly infinite — objects, scenes, skills far exceed any robot dataset
  • Expert-only data lacks stochasticity — models don’t learn to respond to counterfactual actions

The insight: human videos capture the same underlying physics as robot interactions, despite the embodiment gap. And human videos are available at massive scale.

2. DreamDojo-HV Dataset

The paper curates the largest egocentric human video dataset for world model pretraining:

| Dataset | Type | Hours | Trajectories | Skills | Scenes |
|---|---|---|---|---|---|
| DROID | Robot | 350 | 76k | 86 | 564 |
| AgiBot-World | Robot | 2.9k | 1,000k | 87 | 106 |
| In-lab | Human | 55 | 13.9k | 35 | 1 |
| EgoDex | Human | 829 | 30k | 194 | 5 |
| DreamDojo-HV | Human | 43,827 | 1,135k | 6,015 | 1,135k |
| Total mixture | Human | 44,711 | 1,179k | >6,015 | >1,135k |

Compared to the largest prior robot datasets: 15x longer duration, 96x more skills, 2,000x more scenes.

DreamDojo-HV covers home, retail, transport, food, repair, and many other daily scenarios, collected via crowdsourcing with text annotations for each episode.

3. Approach

3.1 Latent Actions as Proxy Actions

The central technical challenge: human videos don’t have fine-grained action labels. Three options considered:

| Method | Pros | Cons |
|---|---|---|
| Action-free pretraining | Simple | Ignores causality → poor controllability |
| Hand pose extraction (HaMeR/MANO) | Precise for hands | Can't capture arm/locomotion; fails under occlusion |
| Latent actions (proposed) | Self-supervised, cross-embodiment, captures all motions | Proxy, not ground truth |

The latent action model is a 700M spatiotemporal Transformer VAE:

  • Encoder: takes two consecutive frames $f_t, f_{t+1}$, extracts a compact latent vector $\hat{a}_t$ (dim=32) representing the action between frames
  • Decoder: reconstructs $f_{t+1}$ from $\hat{a}_t$ and $f_t$
  • Information bottleneck: forces the model to disentangle the most critical motion information
\[\mathcal{L}_{\theta,\varphi}(f_{t+1}) = \mathbb{E}_{q_\varphi(\hat{a}|f_{t:t+1})} \log p_\theta(f_{t+1}|\hat{a}, f_t) - \beta D_{KL}(q_\varphi(\hat{a}|f_{t:t+1}) \| p(\hat{a}))\]
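With a Gaussian posterior and a Gaussian decoder (so the log-likelihood reduces to MSE up to constants), this objective is the familiar β-VAE negative ELBO. A minimal numpy sketch — all names and shapes are illustrative, with plain vectors standing in for the paper's 700M Transformer encoder/decoder:

```python
import numpy as np

def latent_action_elbo(f_next, f_next_recon, mu, logvar, beta=0.01):
    """Negative ELBO for a toy latent-action VAE (to be minimized).

    f_next:       ground-truth next frame, flattened to a vector
    f_next_recon: decoder output from (a_hat, f_t)
    mu, logvar:   encoder's Gaussian posterior q(a_hat | f_t, f_{t+1})
    beta:         KL weight enforcing the information bottleneck
    """
    # Gaussian log-likelihood term reduces to MSE up to constants.
    recon = np.mean((f_next - f_next_recon) ** 2)
    # KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch.
    kl = -0.5 * np.mean(np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=-1))
    return recon + beta * kl
```

A small β keeps the bottleneck tight without collapsing reconstruction; the latent $\hat{a}_t$ (dim=32) is forced to carry only the motion between frames.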

Key finding: the learned latent actions transfer across embodiments — frames with similar latent actions show the same motion regardless of whether performed by a human or robot (see Fig. 3 in the paper).

3.2 World Model Architecture

Built on Cosmos-Predict2.5 (latent video diffusion model with DiT blocks):

  • Action injection: actions are chunked to match the temporal compression ratio of the video tokenizer (4 frames per latent). Each chunk of 4 consecutive actions conditions the corresponding latent frame.
  • Relative actions: transform absolute actions to relative for better generalization.
  • Causal chunked injection: future actions don’t condition current predictions — respects causality.
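The three bullets above can be sketched in a few lines of numpy — a toy version that assumes `chunk=4` to match the tokenizer's temporal compression, and uses first differences as a stand-in for the paper's relative-action transform (all names are hypothetical):

```python
import numpy as np

def prepare_action_chunks(actions, chunk=4):
    """Toy relative + chunked action conditioning.

    actions: (T, D) absolute actions; T must be divisible by `chunk`.
    Returns (T // chunk, chunk * D): one conditioning vector per latent
    frame, built only from actions inside that chunk, so no future
    action leaks into an earlier latent frame (causal by construction).
    """
    T, D = actions.shape
    assert T % chunk == 0
    # Relative actions: per-step deltas generalize better than absolutes.
    rel = np.diff(actions, axis=0, prepend=actions[:1])
    # Group every `chunk` consecutive actions into one latent-frame condition.
    return rel.reshape(T // chunk, chunk * D)
```

Each row of the output conditions exactly one latent frame, which is what makes the injection causal: latent frame *i* never sees actions from chunk *i+1*.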

3.3 Training Objective

Standard flow matching loss + a temporal consistency loss:

\[\mathcal{L}_{\text{temporal}}(\theta) = \mathbb{E}\left[\sum_{i=1}^{K-1} \|(z_{i+1} - z_i) - (v_{i+1} - v_i)\|^2\right]\]

This supervises the transitions between frames, not just individual frames — directly encourages learning object dynamics and action following. Found to accelerate action controllability learning and improve object completeness.
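A direct numpy transcription of the loss above, with illustrative shapes (`z` are the target latent frames, `v` the model's predictions):

```python
import numpy as np

def temporal_consistency_loss(z, v):
    """Penalize mismatch between true and predicted frame-to-frame deltas.

    z: (K, D) target latent frames
    v: (K, D) predicted latent frames
    Implements sum_i ||(z_{i+1} - z_i) - (v_{i+1} - v_i)||^2.
    """
    dz = np.diff(z, axis=0)  # true transitions
    dv = np.diff(v, axis=0)  # predicted transitions
    return np.sum((dz - dv) ** 2)
```

Because the loss only sees differences, shifting every prediction by a constant leaves it unchanged — it supervises dynamics rather than per-frame appearance.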

3.4 Three-Phase Training

  1. Pretraining on human videos (In-lab + EgoDex + DreamDojo-HV) with latent action conditioning
  2. Post-training on target robot data — reset action conditioning layer, learn new action space
  3. Distillation — convert to autoregressive, few-step model for real-time inference

3.5 Distillation Pipeline

Based on Self Forcing (Huang et al., 2025):

  1. Warmup: regress student predictions to teacher’s ODE solutions (teacher forcing)
  2. Distillation: student generates from its own previous outputs, supervised by KL divergence between teacher and student distributions — minimizes train-test mismatch

Key innovation: student generates $N' > N$ frames (longer than the teacher horizon) during training to simulate long rollouts and reduce compounding error.
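A toy sketch of this self-forcing rollout, with a generic single-step `step_fn` standing in for the few-step diffusion student; the sliding 12-frame window mirrors the distilled model's context, and all names are illustrative:

```python
import numpy as np

def student_rollout(step_fn, z0, n_prime, context=12):
    """Roll the student forward on its OWN outputs (self-forcing style).

    step_fn: maps a (k, D) context window to the next (D,) frame
    z0:      (D,) initial latent frame
    n_prime: rollout length, deliberately longer than the teacher
             horizon N so training sees compounding error
    context: sliding window size
    """
    frames = [z0]
    for _ in range(n_prime):
        window = np.stack(frames[-context:])  # last `context` frames
        frames.append(step_fn(window))        # condition on own outputs
    return np.stack(frames[1:])               # (n_prime, D) generated frames
```

The point of the longer horizon is that any per-step drift accumulates over `n_prime` steps, so the KL supervision against the teacher directly penalizes compounding error.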

Result: 35 denoising steps → 4 steps, bidirectional → causal attention, enabling 10.81 FPS real-time inference.

4. Key Results

Scaling data improves everything

Adding more human data consistently improves OOD performance:

| Pretraining Data | In-lab PSNR | Counterfactual PSNR |
|---|---|---|
| No pretraining | 20.576 | 20.472 |
| In-lab only | 20.913 | 20.755 |
| In-lab + EgoDex | 20.972 | 20.797 |
| In-lab + EgoDex + DreamDojo-HV | 21.016 | 20.852 |

Latent actions match ground-truth actions

| Conditioning Method | In-lab PSNR | EgoDex PSNR |
|---|---|---|
| No pretraining | 20.576 | 19.952 |
| Action-free pretraining | 20.797 | 19.924 |
| Latent action | 20.913 | 20.344 |
| Ground-truth action (ideal) | 20.960 | 20.474 |

Latent actions close most of the gap to ground-truth labels — and are infinitely more scalable.

Human preference: scaling model helps

| Comparison | Physics Correctness | Action Following |
|---|---|---|
| DreamDojo-2B > Cosmos-Predict2.5 | 62.5% | 63.5% |
| DreamDojo-14B > Cosmos-Predict2.5 | 73.5% | 72.6% |
| DreamDojo-14B > DreamDojo-2B | 72.5% | 65.5% |

Distillation: real-time with minimal degradation

| Model | FPS | Predict Len | Context Len |
|---|---|---|---|
| Teacher | 2.72 | 12 frames | 1 frame |
| Student (distilled) | 10.81 | 4 frames | 12 frames |

The student is 4x faster and has better context awareness (12-frame sliding window vs. 1-frame conditioning).

Downstream applications

Policy evaluation: Pearson correlation r=0.995 between DreamDojo-predicted success rates and real-world success rates across 6 policy checkpoints. Near-perfect ranking.

Model-based planning: Sample N action proposals from a policy ensemble, simulate all with DreamDojo, select best via a value model. Result: ~2x improvement in success rate over uniform sampling.
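This planning loop reduces to best-of-N selection. A minimal sketch with hypothetical `world_model` and `value_fn` callables (the paper's policy ensemble and value model are not specified here):

```python
import numpy as np

def plan_best_of_n(world_model, value_fn, proposals, obs):
    """Best-of-N planning: simulate each action proposal, keep the best.

    world_model: (obs, actions) -> predicted final observation
    value_fn:    obs -> scalar score (higher is better)
    proposals:   list of (T, D) candidate action sequences
    """
    scores = [value_fn(world_model(obs, a)) for a in proposals]
    return proposals[int(np.argmax(scores))]
```

The world model plays the role of a learned simulator: candidates are ranked entirely in imagination, and only the winning sequence is executed on the real robot.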

Live teleoperation: Real-time teleoperation of a virtual G1 robot using PICO VR controller on a single RTX 5090.

Architecture ablations

| Modification | GR-1 Val PSNR | Counterfactual PSNR |
|---|---|---|
| Baseline | 16.199 | 19.448 |
| + Relative actions | 16.522 | 19.482 |
| + Chunked injection | 17.626 | 20.783 |
| + Temporal consistency loss | 17.630 | 20.980 |

Chunked injection is the biggest single improvement — respecting causality matters a lot.

5. Strengths

  • Massive scale: 44k hours of human video pretraining — by far the largest for any robot world model
  • Elegant latent action design: self-supervised, cross-embodiment, nearly matches ground-truth actions
  • Consistent scaling: more data, bigger model → better OOD generalization on all benchmarks
  • Practical applications demonstrated: policy evaluation with r=0.995, 2x planning improvement, live teleoperation
  • Distillation pipeline: 10.81 FPS with improved context consistency

6. Limitations

  • Uncommon actions: struggles with fast/unusual motions (slapping, fast waving)
  • Optimistic simulator: absolute success rates in DreamDojo are often higher than real-world — doesn’t accurately generate nuanced failures
  • Single-view only: no multi-view simulation support (important for SOTA policies)
  • Post-training forgetting: retaining pretrained knowledge during fine-tuning not deeply studied
  • Pixel-space generation: computationally heavier than latent-space world models (V-JEPA2, Dreamer)

7. DreamDojo vs. DreamZero: Two Sides of the Same Coin

Both from NVIDIA, released weeks apart, representing the two world model paradigms:

| | DreamZero (WAM) | DreamDojo (AC-WM) |
|---|---|---|
| Input | Image + text instruction | Image + future actions |
| Output | Video + actions | Video |
| Pretraining data | Robot teleoperation (500 hrs) | Human videos (44k hrs) |
| Data scaling | Limited to success demos | All data, including failures & play |
| Cross-embodiment | Video-only demos (no action labels) | Latent actions as unified proxy |
| Planning | Best-of-N via text prompt variation | Gradient-based / action optimization |
| RL in model | Not possible | Possible (counterfactual simulation) |
| Policy evaluation | Not possible | Yes (r=0.995 correlation) |
| Real-time speed | 7 Hz (38x speedup) | 10.81 FPS (distilled) |

These two papers together make a compelling case for the “flexibly conditioned” world model predicted in the Tale of Two World Models discussion.

8. Takeaways

  1. Human videos are a goldmine for robot world models — the physics transfers despite the embodiment gap, and the data is orders of magnitude more diverse than robot data
  2. Latent actions solve the label scarcity problem — self-supervised, cross-embodiment, nearly as good as ground truth
  3. Causal, chunked action injection is critical — respecting temporal causality dramatically improves controllability
  4. AC-WMs enable unique downstream applications — policy evaluation and model-based planning that WAMs simply cannot do
  5. Distillation bridges the gap to real-time — autoregressive + few-step denoising achieves 10.81 FPS with better context consistency

本文支持通过顶部导航栏的语言切换按钮在 English / 中文 之间切换。

概要

DreamDojo 是一个基础动作条件世界模型(AC-WM),从 44,000 小时自我中心视角的人类视频中学习多样化的交互物理——这是迄今为止用于世界模型预训练的最大视频数据集。为解决人类视频中动作标注稀缺的问题,论文引入了连续潜在动作作为统一的代理动作,通过自监督 VAE 提取。在少量机器人数据上后训练后,DreamDojo 展示了:

  • 对未见物体、技能和环境的强 OOD 泛化能力
  • 通过自回归蒸馏管线实现 10.81 FPS 实时推理
  • 下游应用:策略评估(与真实世界 Pearson r=0.995)、基于模型的规划(成功率提升 2 倍)和实时遥操作

这是一个 AC-WM(动作输入型)——与 DreamZero(WAM)形成对比。参见两种世界模型的故事了解 WAM vs. AC-WM 的讨论。

论文信息

  • 标题: DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
  • 作者: Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye 等
  • 机构: NVIDIA, HKUST, UC Berkeley, UW, Stanford, KAIST, UofT, UCSD, UT Austin
  • 日期: 2026-02-06
  • arXiv: 2602.06949
  • 项目主页: dreamdojo-world.github.io

1. 动机

现有机器人世界模型在有限的机器人数据上训练,局限于分布内场景。核心瓶颈:

  • 机器人数据稀缺且昂贵——硬件差异、遥操作成本高、大多是专家示范
  • 真实世界的多样性几乎无限——物体、场景、技能远超任何机器人数据集
  • 纯专家数据缺乏随机性——模型无法学会响应反事实动作

核心洞察:人类视频捕获了与机器人交互相同的底层物理,尽管存在具身平台差异。而且人类视频可以大规模获取。

2. DreamDojo-HV 数据集

论文构建了迄今最大的自我中心人类视频数据集用于世界模型预训练:

| 数据集 | 类型 | 时长 | 轨迹数 | 技能数 | 场景数 |
|---|---|---|---|---|---|
| DROID | 机器人 | 350h | 76k | 86 | 564 |
| AgiBot-World | 机器人 | 2.9kh | 1,000k | 87 | 106 |
| In-lab | 人类 | 55h | 13.9k | 35 | 1 |
| EgoDex | 人类 | 829h | 30k | 194 | 5 |
| DreamDojo-HV | 人类 | 43,827h | 1,135k | 6,015 | 1,135k |
| 总数据混合 | 人类 | 44,711h | 1,179k | >6,015 | >1,135k |

与此前最大的机器人数据集相比:时长长 15 倍、技能多 96 倍、场景多 2,000 倍

3. 方法

3.1 潜在动作作为代理动作

核心技术挑战:人类视频没有细粒度的动作标注。三种方案比较:

| 方法 | 优点 | 缺点 |
|---|---|---|
| 无动作预训练 | 简单 | 忽略因果性 → 可控性差 |
| 手部姿态提取 (HaMeR/MANO) | 手部精确 | 无法捕获手臂/移动;遮挡下失败 |
| 潜在动作(本文) | 自监督、跨具身平台、捕获所有运动 | 代理而非真实标签 |

潜在动作模型是一个 7 亿参数的时空 Transformer VAE

  • 编码器:输入连续两帧 $f_t, f_{t+1}$,提取紧凑的潜在向量 $\hat{a}_t$(维度=32),表示帧间动作
  • 解码器:从 $\hat{a}_t$ 和 $f_t$ 重建 $f_{t+1}$
  • 信息瓶颈:迫使模型解纠缠出最关键的运动信息
\[\mathcal{L}_{\theta,\varphi}(f_{t+1}) = \mathbb{E}_{q_\varphi(\hat{a}|f_{t:t+1})} \log p_\theta(f_{t+1}|\hat{a}, f_t) - \beta D_{KL}(q_\varphi(\hat{a}|f_{t:t+1}) \| p(\hat{a}))\]

关键发现:学到的潜在动作可以跨具身平台迁移——具有相似潜在动作的帧展现相同的运动,无论是人还是机器人执行。

3.2 世界模型架构

基于 Cosmos-Predict2.5(DiT 块的潜空间视频扩散模型):

  • 动作注入:动作按 chunk 注入,匹配视频 tokenizer 的时间压缩比(每个潜在帧对应 4 帧)
  • 相对动作:将绝对动作转换为相对动作以提高泛化性
  • 因果 chunk 注入:未来动作不影响当前预测——尊重因果关系

3.3 训练目标

标准 flow matching 损失 + 时间一致性损失

\[\mathcal{L}_{\text{temporal}}(\theta) = \mathbb{E}\left[\sum_{i=1}^{K-1} \|(z_{i+1} - z_i) - (v_{i+1} - v_i)\|^2\right]\]

监督帧间的转换而不仅仅是单帧——直接鼓励学习物体动力学和动作跟随。

3.4 三阶段训练

  1. 预训练:在人类视频上以潜在动作为条件
  2. 后训练:在目标机器人数据上——重置动作条件化层,学习新动作空间
  3. 蒸馏:转换为自回归、少步模型以实现实时推理

3.5 蒸馏管线

基于 Self Forcing(Huang et al., 2025):

  1. 预热:学生预测回归教师的 ODE 解(teacher forcing)
  2. 蒸馏:学生从自身之前的输出生成,通过教师和学生分布间的 KL 散度监督——最小化训练-测试不匹配

关键创新:训练时学生生成 $N' > N$ 帧(比教师视野更长),模拟更长的展开以减少累积误差。

结果:35 步去噪 → 4 步,双向注意力 → 因果注意力,实现 10.81 FPS 实时推理。

4. 核心结果

数据规模扩大,一切变好

增加人类数据持续改善 OOD 性能:

| 预训练数据 | In-lab PSNR | 反事实 PSNR |
|---|---|---|
| 无预训练 | 20.576 | 20.472 |
| 仅 In-lab | 20.913 | 20.755 |
| In-lab + EgoDex | 20.972 | 20.797 |
| In-lab + EgoDex + DreamDojo-HV | 21.016 | 20.852 |

潜在动作接近真实动作

| 条件化方法 | In-lab PSNR | EgoDex PSNR |
|---|---|---|
| 无预训练 | 20.576 | 19.952 |
| 无动作预训练 | 20.797 | 19.924 |
| 潜在动作 | 20.913 | 20.344 |
| 真实动作(理想情况) | 20.960 | 20.474 |

潜在动作弥补了与真实标签之间的大部分差距——而且可扩展性无限好。

人类偏好:模型越大越好

| 比较 | 物理正确性 | 动作跟随 |
|---|---|---|
| DreamDojo-2B > Cosmos-Predict2.5 | 62.5% | 63.5% |
| DreamDojo-14B > Cosmos-Predict2.5 | 73.5% | 72.6% |
| DreamDojo-14B > DreamDojo-2B | 72.5% | 65.5% |

蒸馏:实时且退化极小

| 模型 | FPS | 预测长度 | 上下文长度 |
|---|---|---|---|
| 教师 | 2.72 | 12 帧 | 1 帧 |
| 学生(蒸馏后) | 10.81 | 4 帧 | 12 帧 |

学生快 4 倍且上下文感知更好(12 帧滑动窗口 vs. 单帧条件化)。

下游应用

策略评估:6 个策略检查点上,DreamDojo 预测的成功率与真实世界成功率的 Pearson 相关系数 r=0.995。近乎完美的排序。

基于模型的规划:从策略集合中采样 N 个动作提议,全部用 DreamDojo 模拟,用价值模型选最优。结果:相比均匀采样,成功率提升约 2 倍。

实时遥操作:在单张 RTX 5090 上使用 PICO VR 控制器实时遥操作虚拟 G1 机器人。

架构消融

| 修改 | GR-1 Val PSNR | 反事实 PSNR |
|---|---|---|
| 基线 | 16.199 | 19.448 |
| + 相对动作 | 16.522 | 19.482 |
| + Chunk 注入 | 17.626 | 20.783 |
| + 时间一致性损失 | 17.630 | 20.980 |

Chunk 注入是最大的单项改进——尊重因果关系非常重要。

5. 优势

  • 海量规模:44k 小时人类视频预训练——在所有机器人世界模型中遥遥领先
  • 优雅的潜在动作设计:自监督、跨具身平台、接近真实动作效果
  • 持续的规模化收益:更多数据、更大模型 → 所有基准上更好的 OOD 泛化
  • 实用下游应用:策略评估(r=0.995)、2 倍规划改进、实时遥操作
  • 蒸馏管线:10.81 FPS 且上下文一致性更优

6. 局限性

  • 罕见动作:难以模拟快速/不常见动作(拍打、快速挥手)
  • 乐观模拟器:DreamDojo 中的绝对成功率通常高于真实世界——无法精确生成细微的失败
  • 仅单视角:不支持多视角模拟(对 SOTA 策略很重要)
  • 后训练遗忘:预训练知识在微调过程中的保留尚未深入研究
  • 像素空间生成:计算量大于潜空间世界模型(V-JEPA2、Dreamer)

7. DreamDojo vs. DreamZero:同一枚硬币的两面

两篇论文均来自 NVIDIA,相隔数周发布,代表了两种世界模型范式:

| | DreamZero (WAM) | DreamDojo (AC-WM) |
|---|---|---|
| 输入 | 图像 + 文本指令 | 图像 + 未来动作 |
| 输出 | 视频 + 动作 | 视频 |
| 预训练数据 | 机器人遥操作(500h) | 人类视频(44kh) |
| 数据扩展 | 仅限成功示范 | 包括失败和自由玩耍的所有数据 |
| 跨具身平台 | 纯视频演示(无动作标注) | 潜在动作作为统一代理 |
| 规划 | 通过文本提示变化做 best-of-N | 基于梯度/动作优化 |
| 模型内 RL | 不可能 | 可能(反事实模拟) |
| 策略评估 | 不可能 | 是(r=0.995 相关性) |
| 实时速度 | 7 Hz(38x 加速) | 10.81 FPS(蒸馏后) |

这两篇论文共同有力地支持了两种世界模型的故事中预测的「灵活条件化」世界模型。

8. 核心要点

  1. 人类视频是机器人世界模型的金矿——尽管存在具身平台差距,物理知识可以迁移,且数据多样性远超机器人数据
  2. 潜在动作解决了标注稀缺问题——自监督、跨具身平台、接近真实标签的效果
  3. 因果 chunk 动作注入至关重要——尊重时间因果性大幅提升可控性
  4. AC-WMs 实现了独特的下游应用——策略评估和基于模型的规划是 WAMs 无法做到的
  5. 蒸馏弥合了实时推理的鸿沟——自回归 + 少步去噪实现 10.81 FPS 且上下文一致性更好
