[Paper Notes] DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
Published:
TL;DR
DreamDojo is a foundation action-conditioned world model (AC-WM) that learns diverse interaction physics from 44k hours of egocentric human videos — the largest video dataset to date for world model pretraining. To overcome the scarcity of action labels in human videos, it introduces continuous latent actions as unified proxy actions extracted via a self-supervised VAE. After post-training on small-scale robot data, DreamDojo demonstrates:
- Strong OOD generalization to unseen objects, skills, and environments
- Real-time inference at 10.81 FPS via an autoregressive distillation pipeline
- Downstream applications: policy evaluation (Pearson r=0.995 with real-world), model-based planning (2x success rate improvement), and live teleoperation
This is an AC-WM (actions-in) — the counterpart to WAMs like DreamZero. See also A Tale of Two World Models for the WAM vs. AC-WM debate.
Paper Info
- Title: DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
- Authors: Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, et al.
- Affiliation: NVIDIA, HKUST, UC Berkeley, UW, Stanford, KAIST, UofT, UCSD, UT Austin
- Date: 2026-02-06
- arXiv: 2602.06949
- Project page: dreamdojo-world.github.io
1. Motivation
Existing robot world models are trained on limited robot data and confined to in-distribution settings. The key bottleneck:
- Robot data is scarce and expensive — hardware variability, teleoperation cost, mostly expert demonstrations
- Real-world diversity is nearly infinite — objects, scenes, skills far exceed any robot dataset
- Expert-only data lacks stochasticity — models don’t learn to respond to counterfactual actions
The insight: human videos capture the same underlying physics as robot interactions, despite the embodiment gap. And human videos are available at massive scale.
2. DreamDojo-HV Dataset
The paper curates the largest egocentric human video dataset for world model pretraining:
| Dataset | Type | Hours | Trajectories | Skills | Scenes |
|---|---|---|---|---|---|
| DROID | Robot | 350 | 76k | 86 | 564 |
| AgiBot-World | Robot | 2.9k | 1,000k | 87 | 106 |
| In-lab | Human | 55 | 13.9k | 35 | 1 |
| EgoDex | Human | 829 | 30k | 194 | 5 |
| DreamDojo-HV | Human | 43,827 | 1,135k | 6,015 | 1,135k |
| Total mixture | Human | 44,711 | 1,179k | >6,015 | >1,135k |
Compared to the largest prior robot datasets: 15x longer duration, 96x more skills, 2,000x more scenes.
DreamDojo-HV covers home, retail, transport, food, repair, and many other daily scenarios, collected via crowdsourcing with text annotations for each episode.
3. Approach
3.1 Latent Actions as Proxy Actions
The central technical challenge: human videos don’t have fine-grained action labels. Three options considered:
| Method | Pros | Cons |
|---|---|---|
| Action-free pretraining | Simple | Ignores causality → poor controllability |
| Hand pose extraction (HaMeR/MANO) | Precise for hands | Can’t capture arm/locomotion; fails under occlusion |
| Latent actions (proposed) | Self-supervised, cross-embodiment, captures all motions | Proxy, not ground truth |
The latent action model is a 700M spatiotemporal Transformer VAE:
- Encoder: takes two consecutive frames $f_t, f_{t+1}$, extracts a compact latent vector $\hat{a}_t$ (dim=32) representing the action between frames
- Decoder: reconstructs $f_{t+1}$ from $\hat{a}_t$ and $f_t$
- Information bottleneck: forces the model to disentangle the most critical motion information
Key finding: the learned latent actions transfer across embodiments — frames with similar latent actions show the same motion regardless of whether performed by a human or robot (see Fig. 3 in the paper).
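The VAE above can be sketched in a few lines. This is a toy illustration only: an MLP stands in for the paper's 700M spatiotemporal Transformer, and every name and layer size here is an assumption except the latent dimension (`action_dim=32`, stated above).

```python
import torch
import torch.nn as nn

class LatentActionVAE(nn.Module):
    """Minimal sketch of the latent action model: the encoder compresses
    the transition (f_t, f_{t+1}) into a 32-dim latent action a_t, and
    the decoder reconstructs f_{t+1} from (f_t, a_t)."""

    def __init__(self, frame_dim=256, action_dim=32):
        super().__init__()
        # Encoder sees both frames; the narrow bottleneck forces a_t to
        # capture only the motion between them.
        self.encoder = nn.Sequential(
            nn.Linear(2 * frame_dim, 512), nn.GELU(),
            nn.Linear(512, 2 * action_dim),  # outputs (mu, log_var)
        )
        # Decoder predicts f_{t+1} from f_t plus the latent action.
        self.decoder = nn.Sequential(
            nn.Linear(frame_dim + action_dim, 512), nn.GELU(),
            nn.Linear(512, frame_dim),
        )

    def forward(self, f_t, f_next):
        mu, log_var = self.encoder(torch.cat([f_t, f_next], -1)).chunk(2, -1)
        a_t = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterize
        recon = self.decoder(torch.cat([f_t, a_t], -1))
        # Standard VAE objective: reconstruction + (lightly weighted) KL.
        recon_loss = (recon - f_next).pow(2).mean()
        kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).mean()
        return a_t, recon_loss + 1e-3 * kl

model = LatentActionVAE()
f_t, f_next = torch.randn(4, 256), torch.randn(4, 256)
a_t, loss = model(f_t, f_next)
print(a_t.shape)  # torch.Size([4, 32])
```

The bottleneck is the whole trick: because only 32 numbers pass from encoder to decoder, the latent has no capacity for appearance and must encode motion.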
3.2 World Model Architecture
Built on Cosmos-Predict2.5 (latent video diffusion model with DiT blocks):
- Action injection: actions are chunked to match the temporal compression ratio of the video tokenizer (4 frames per latent). Each chunk of 4 consecutive actions conditions the corresponding latent frame.
- Relative actions: transform absolute actions to relative for better generalization.
- Causal chunked injection: future actions don’t condition current predictions — respects causality.
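The chunking and relative-action preprocessing can be sketched as follows, assuming the 4x temporal compression stated above; the shapes, a 7-DoF action space, and the helper name are illustrative, not the paper's implementation.

```python
import numpy as np

def chunk_actions(actions, ratio=4):
    """Group actions so that chunk i conditions latent frame i only;
    future chunks never influence earlier predictions (causal injection)."""
    steps, dim = actions.shape
    assert steps % ratio == 0, "pad or trim to a multiple of the ratio"
    return actions.reshape(steps // ratio, ratio * dim)

# 12 env steps of a hypothetical 7-DoF absolute action trajectory
abs_actions = np.cumsum(np.random.randn(12, 7), axis=0)

# relative actions: per-step deltas (the first step becomes zero motion)
rel = np.diff(abs_actions, axis=0, prepend=abs_actions[:1])

# one 28-dim conditioning vector per latent frame (4 actions x 7 DoF)
chunks = chunk_actions(rel)
print(chunks.shape)  # (3, 28)
```

Deltas generalize better than absolute poses because the same motion looks identical regardless of where in the workspace it happens.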
3.3 Training Objective
Standard flow matching loss + a temporal consistency loss:
\[\mathcal{L}_{\text{temporal}}(\theta) = \mathbb{E}\left[\sum_{i=1}^{K-1} \|(z_{i+1} - z_i) - (v_{i+1} - v_i)\|^2\right]\]
This supervises the transitions between frames rather than individual frames in isolation, directly encouraging the model to learn object dynamics and action following. The paper reports that it accelerates action-controllability learning and improves object completeness.
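A minimal sketch of the temporal term, assuming $z$ holds the model's predicted latent frames and $v$ the corresponding targets over a $K$-frame window:

```python
import torch

def temporal_consistency_loss(z, v):
    """Penalize the mismatch between predicted frame-to-frame transitions
    (z_{i+1} - z_i) and target transitions (v_{i+1} - v_i).
    z, v: (batch, K, latent_dim)."""
    dz = z[:, 1:] - z[:, :-1]  # predicted transitions, K-1 of them
    dv = v[:, 1:] - v[:, :-1]  # target transitions
    return (dz - dv).pow(2).sum(dim=(1, 2)).mean()

z = torch.randn(2, 8, 16)  # (batch, K frames, latent dim)
v = torch.randn(2, 8, 16)
loss = temporal_consistency_loss(z, v)
```

One property worth noting: the loss is invariant to a constant offset added to every predicted frame, so it purely supervises motion, complementing the per-frame flow matching loss.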
3.4 Three-Phase Training
- Pretraining on human videos (In-lab + EgoDex + DreamDojo-HV) with latent action conditioning
- Post-training on target robot data — reset action conditioning layer, learn new action space
- Distillation — convert to autoregressive, few-step model for real-time inference
3.5 Distillation Pipeline
Based on Self Forcing (Huang et al., 2025):
- Warmup: regress student predictions to teacher’s ODE solutions (teacher forcing)
- Distillation: student generates from its own previous outputs, supervised by KL divergence between teacher and student distributions — minimizes train-test mismatch
Key innovation: the student generates $N' > N$ frames (longer than the teacher horizon) during training to simulate long rollouts and reduce compounding error.
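A toy sketch of the self-forcing rollout, with a fixed linear map standing in for the distilled few-step student; the only points it demonstrates are that each step conditions on the student's own previous output, and that the rollout length $N'$ can exceed the teacher horizon $N$:

```python
import torch

def student_rollout(student, context, n_prime):
    """Self-forcing: the student autoregressively generates n_prime latent
    frames from its OWN previous outputs (not ground truth), so the
    distillation loss sees the same compounding error as test time."""
    frames = [context]
    for _ in range(n_prime):
        frames.append(student(frames[-1]))  # condition on own last output
    return torch.stack(frames[1:], dim=1)   # (batch, n_prime, latent_dim)

# toy stand-in for the student's one-chunk prediction
W = torch.randn(16, 16) * 0.1
student = lambda x: x @ W

ctx = torch.randn(2, 16)
# teacher horizon N = 12; roll the student out for N' = 16 > N frames
traj = student_rollout(student, ctx, n_prime=16)
print(traj.shape)  # torch.Size([2, 16, 16])
```

Teacher forcing never exposes the model to its own drift; rolling out past the teacher horizon during distillation is what makes long open-loop generation stable.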
Result: 35 denoising steps → 4 steps, bidirectional → causal attention, enabling 10.81 FPS real-time inference.
4. Key Results
Scaling data improves everything
Adding more human data consistently improves OOD performance:
| Pretraining Data | In-lab PSNR | Counterfactual PSNR |
|---|---|---|
| No pretraining | 20.576 | 20.472 |
| In-lab only | 20.913 | 20.755 |
| In-lab + EgoDex | 20.972 | 20.797 |
| In-lab + EgoDex + DreamDojo-HV | 21.016 | 20.852 |
Latent actions match ground-truth actions
| Conditioning Method | In-lab PSNR | EgoDex PSNR |
|---|---|---|
| No pretraining | 20.576 | 19.952 |
| Action-free pretraining | 20.797 | 19.924 |
| Latent action | 20.913 | 20.344 |
| Ground-truth action (ideal) | 20.960 | 20.474 |
Latent actions close most of the gap to ground-truth labels while being far more scalable, since they require no action annotations at all.
Human preference: scaling model helps
| Comparison | Physics Correctness | Action Following |
|---|---|---|
| DreamDojo-2B > Cosmos-Predict2.5 | 62.5% | 63.5% |
| DreamDojo-14B > Cosmos-Predict2.5 | 73.5% | 72.6% |
| DreamDojo-14B > DreamDojo-2B | 72.5% | 65.5% |
Distillation: real-time with minimal degradation
| Model | FPS | Predict Len | Context Len |
|---|---|---|---|
| Teacher | 2.72 | 12 frames | 1 frame |
| Student (distilled) | 10.81 | 4 frames | 12 frames |
The student is 4x faster and has better context awareness (12-frame sliding window vs. 1-frame conditioning).
Downstream applications
Policy evaluation: Pearson correlation r=0.995 between DreamDojo-predicted success rates and real-world success rates across 6 policy checkpoints. Near-perfect ranking.
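The evaluation protocol reduces to correlating per-checkpoint success rates estimated in the world model against rates measured on the real robot. A toy illustration with made-up numbers (not the paper's):

```python
import numpy as np

# Hypothetical success rates for 6 policy checkpoints: estimated by
# rolling policies out inside the world model vs. measured on hardware.
wm_success   = np.array([0.15, 0.30, 0.45, 0.55, 0.70, 0.80])
real_success = np.array([0.10, 0.25, 0.40, 0.50, 0.60, 0.75])

# Pearson correlation between the two estimates
r = np.corrcoef(wm_success, real_success)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A high r means the world model ranks checkpoints the same way the real world does, which is what matters for checkpoint selection even if absolute rates are optimistic (see Limitations).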
Model-based planning: Sample N action proposals from a policy ensemble, simulate all with DreamDojo, select best via a value model. Result: ~2x improvement in success rate over uniform sampling.
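The planning loop above can be sketched as best-of-N selection. Both `world_model` and `value_model` here are stand-ins invented for illustration; nothing about them reflects the paper's actual models.

```python
import numpy as np

def plan(world_model, value_model, obs, proposals):
    """Simulate every action proposal in the world model and execute the
    one whose imagined outcome the value model scores highest."""
    scores = [value_model(world_model(obs, a)) for a in proposals]
    return proposals[int(np.argmax(scores))]

# toy stand-ins: the world model adds the action's effect to the state,
# and the value model prefers ending close to a goal state
goal = np.ones(4)
world_model = lambda obs, a: obs + a
value_model = lambda final: -np.linalg.norm(final - goal)

obs = np.zeros(4)
proposals = [np.random.randn(4) for _ in range(8)] + [goal.copy()]
best = plan(world_model, value_model, obs, proposals)
print(best)  # the proposal that reaches the goal exactly: [1. 1. 1. 1.]
```

This is exactly the kind of counterfactual query a WAM cannot answer: scoring many candidate action sequences requires a model conditioned on actions, not one that emits them.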
Live teleoperation: Real-time teleoperation of a virtual G1 robot using PICO VR controller on a single RTX 5090.
Architecture ablations
| Modification | GR-1 Val PSNR | Counterfactual PSNR |
|---|---|---|
| Baseline | 16.199 | 19.448 |
| + Relative actions | 16.522 | 19.482 |
| + Chunked injection | 17.626 | 20.783 |
| + Temporal consistency loss | 17.630 | 20.980 |
Chunked injection is the biggest single improvement — respecting causality matters a lot.
5. Strengths
- Massive scale: 44k hours of human video pretraining — by far the largest for any robot world model
- Elegant latent action design: self-supervised, cross-embodiment, nearly matches ground-truth actions
- Consistent scaling: more data, bigger model → better OOD generalization on all benchmarks
- Practical applications demonstrated: policy evaluation with r=0.995, 2x planning improvement, live teleoperation
- Distillation pipeline: 10.81 FPS with improved context consistency
6. Limitations
- Uncommon actions: struggles with fast/unusual motions (slapping, fast waving)
- Optimistic simulator: absolute success rates in DreamDojo are often higher than real-world — doesn’t accurately generate nuanced failures
- Single-view only: no multi-view simulation support (important for SOTA policies)
- Post-training forgetting: retaining pretrained knowledge during fine-tuning not deeply studied
- Pixel-space generation: computationally heavier than latent-space world models (V-JEPA2, Dreamer)
7. DreamDojo vs. DreamZero: Two Sides of the Same Coin
Both from NVIDIA, released weeks apart, representing the two world model paradigms:
| | DreamZero (WAM) | DreamDojo (AC-WM) |
|---|---|---|
| Input | Image + text instruction | Image + future actions |
| Output | Video + actions | Video |
| Pretraining data | Robot teleoperation (500 hrs) | Human videos (44k hrs) |
| Data scaling | Limited to success demos | All data including failures & play |
| Cross-embodiment | Video-only demos (no action labels) | Latent actions as unified proxy |
| Planning | Best-of-N via text prompt variation | Gradient-based / action optimization |
| RL in model | Not possible | Possible (counterfactual simulation) |
| Policy evaluation | Not possible | Yes (r=0.995 correlation) |
| Real-time speed | 7 Hz (38x speedup) | 10.81 FPS (distilled) |
These two papers together make a compelling case for the “flexibly conditioned” world model predicted in the Tale of Two World Models discussion.
8. Takeaways
- Human videos are a goldmine for robot world models — the physics transfers despite the embodiment gap, and the data is orders of magnitude more diverse than robot data
- Latent actions solve the label scarcity problem — self-supervised, cross-embodiment, nearly as good as ground truth
- Causal, chunked action injection is critical — respecting temporal causality dramatically improves controllability
- AC-WMs enable unique downstream applications — policy evaluation and model-based planning that WAMs simply cannot do
- Distillation bridges the gap to real-time — autoregressive + few-step denoising achieves 10.81 FPS with better context consistency
References
- [Paper] arXiv:2602.06949
- [Project] dreamdojo-world.github.io
