[Paper Notes] DreamZero: World Action Models are Zero-shot Policies
TL;DR
DreamZero is a 14B World Action Model (WAM) built on a pre-trained video diffusion backbone (Wan2.1-I2V-14B). It jointly predicts future video frames and robot actions from image observations and language instructions. Key results:
- 2x+ improvement in zero-shot generalization to unseen tasks/environments over SOTA VLAs (GR00T N1.6, pi0.5)
- Learns effectively from diverse, non-repetitive teleoperation data (~500 hrs on AgiBot G1)
- Cross-embodiment transfer: 42%+ relative improvement on unseen tasks from just 10-20 min of video-only demos from other robots or humans
- Few-shot embodiment adaptation: transfers to a new robot with only 30 minutes of play data
- Real-time control at 7Hz via a 38x inference speedup through system, implementation, and model-level optimizations
Paper Info
- Title: World Action Models are Zero-shot Policies
- Authors: Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, et al.
- Affiliation: NVIDIA
- Date: 2026-02-19
- arXiv: 2602.15922
- Project page: dreamzero0.github.io
1. Motivation
Vision-Language-Action models (VLAs) inherit semantic priors from VLMs — they know what to do — but lack spatiotemporal priors for how to execute novel physical motions. A VLA can do “move coke can to Taylor Swift” (if it knows the move skill) but fails at “untie the shoelace” if that skill was never in the training data. VLAs are also pre-trained on static image-text data, which limits their understanding of physical dynamics.
The core hypothesis: video diffusion models encode rich spatiotemporal priors about how the physical world evolves. By building a robot policy on top of a video model — jointly predicting video and actions — we can achieve much better generalization to novel tasks and environments.
2. What is a World Action Model (WAM)?
A WAM has the input-output structure:
[Current Image + Language Instruction] → [Future Video + Robot Actions]
Unlike VLAs that directly map observations to actions, WAMs first “imagine” what successful execution looks like (video), then extract actions aligned with that imagined future. The key insight is that joint video-action prediction decomposes into:
\[\pi(o_{l:l+H}, a_{l:l+H} \mid o_{0:l}, c, q_l) = \underbrace{\pi(o_{l:l+H} \mid o_{0:l}, c, q_l)}_{\text{video prediction}} \cdot \underbrace{\pi(a_{l:l+H} \mid o_{0:l+H}, q_l)}_{\text{inverse dynamics}}\]

But instead of using two separate models, DreamZero trains a single end-to-end model with a joint denoising objective.
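The factorization can be read as a two-stage rollout: imagine future frames, then extract the actions consistent with that imagined future. A minimal sketch, with hypothetical stand-in functions (DreamZero itself fuses both stages into one jointly-denoised network):

```python
# Toy stand-ins (hypothetical) to make the factorization runnable
def predict_video(past_obs, instruction, state, horizon):
    return [past_obs[-1]] * horizon      # "imagine" by repeating the last frame

def inverse_dynamics(all_obs, state):
    return [0.0] * (len(all_obs) - 1)    # one action per frame transition

def wam_factorized_rollout(past_obs, instruction, state, horizon):
    """Two-stage reading of the factorization above."""
    # Stage 1: video prediction  pi(o_{l:l+H} | o_{0:l}, c, q_l)
    future = predict_video(past_obs, instruction, state, horizon)
    # Stage 2: inverse dynamics  pi(a_{l:l+H} | o_{0:l+H}, q_l)
    actions = inverse_dynamics(past_obs + future, state)
    return future, actions
```

The point of the joint model is that stage 2 never sees imagined frames in isolation; both are denoised together under one objective.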
3. Architecture
DreamZero is built on Wan2.1-I2V-14B, a 14B image-to-video diffusion transformer:
- Inputs: visual context (VAE-encoded), language instruction (text encoder), proprioceptive state (state encoder)
- Backbone: autoregressive DiT with flow matching
- Outputs: future video frames (VAE decoder) + continuous actions (action decoder)
- Training objective: flow matching with teacher-forcing chunk-wise video denoising
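The joint flow-matching objective can be sketched as follows. This is an assumed form of the loss (linear interpolation path, velocity target, one shared timestep across modalities), not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_targets(clean, t, noise):
    """Linear interpolation path used in flow matching:
    x_t = (1 - t) * noise + t * clean, velocity target = clean - noise."""
    x_t = (1.0 - t) * noise + t * clean
    v_target = clean - noise
    return x_t, v_target

def joint_denoising_loss(model, video, actions):
    """One training step of joint video+action denoising (sketch).
    A single shared timestep couples the two modalities, as in
    standard (non-Flash) training."""
    t = rng.uniform()
    noisy_vid, v_vid = flow_matching_targets(video, t, rng.standard_normal(video.shape))
    noisy_act, v_act = flow_matching_targets(actions, t, rng.standard_normal(actions.shape))
    pred_vid, pred_act = model(noisy_vid, noisy_act, t)
    return np.mean((pred_vid - v_vid) ** 2) + np.mean((pred_act - v_act) ** 2)
```

DreamZero-Flash (Section 4) modifies exactly this coupling by sampling separate timesteps for video and actions.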
Why autoregressive (not bidirectional)?
| Property | Autoregressive | Bidirectional |
|---|---|---|
| Inference speed | Fast (KV caching) | Slower (fixed-length) |
| Frame rate | Preserves native FPS | Requires subsampling → misalignment |
| Motion smoothness | Smoother (temporal backprop) | Similar task progress but jerkier |
| Error accumulation | Mitigated via ground-truth observation replacement in KV cache | N/A |
Closed-loop inference
A critical advantage of WAMs in closed-loop control: after executing each action chunk, ground-truth observations replace generated frames in the KV cache. This eliminates the compounding error problem inherent to autoregressive video generation.
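The control loop above can be sketched as follows, with hypothetical `env`/`model` interfaces; the essential step is that the real observation, not the generated frame, is what enters the context for the next chunk:

```python
from collections import deque

def closed_loop_control(model, env, instruction, steps, ctx=4):
    """Sketch of WAM closed-loop inference (assumed interfaces):
    after executing each chunk, the ground-truth observation replaces
    the generated frame in the cached context, so autoregressive
    video errors cannot compound across chunks."""
    obs_cache = deque([env.observe()], maxlen=ctx)
    for _ in range(steps):
        # Jointly imagine future frames and the actions for this chunk
        predicted_frames, actions = model(list(obs_cache), instruction)
        env.execute(actions)
        # Key step: discard the imagined frame, cache the real one
        obs_cache.append(env.observe())
    return obs_cache
```

In the actual system this replacement happens inside the KV cache of the autoregressive DiT rather than a plain observation buffer, but the principle is the same.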
4. Real-time Execution (38x Speedup)
Naive inference: ~5.7s per action chunk on a single GPU. Target: <200ms for smooth 7Hz control.
| Optimization Level | Technique | Cumulative Speedup (GB200) |
|---|---|---|
| System | CFG parallelism (2 GPUs) | 1.8x |
| System | DiT caching (reuse attention) | 5.4x |
| Implementation | Torch compile + CUDA graphs | 10.9x |
| Implementation | Kernel & scheduler optimizations | 14.8x |
| Implementation | NVFP4 quantization | 16.6x |
| Model | DreamZero-Flash (1-step denoising) | 38x |
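As a sanity check on the table, the cumulative speedups map back to per-chunk latency like this (baseline from the text; the final 38x lands at ~150 ms, inside the <200 ms budget):

```python
def chunk_latency_ms(baseline_s=5.7, speedup=1.0):
    """Per-chunk latency after applying a cumulative speedup factor
    from the table above."""
    return baseline_s * 1000.0 / speedup

# Full 38x stack: ~5.7 s -> ~150 ms per action chunk (~7 Hz control).
```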
DreamZero-Flash
Key insight: at inference with very few denoising steps, the video tokens remain noisy while we need clean actions. Standard training couples video and action noise levels → train-test mismatch.
Solution: decouple noise schedules. Bias video timesteps toward high-noise states via Beta(7, 1), while keeping action timesteps uniform. This trains the model to predict clean actions from noisy visual context, directly matching the single-step inference regime.
Result: 1-step inference at 150ms with only ~9% performance drop vs. 4-step baseline.
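The decoupled schedule is simple to state in code. A sketch, assuming t = 1 is the high-noise end of the schedule (the paper's exact timestep convention may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_decoupled_timesteps(batch_size):
    """DreamZero-Flash-style decoupled noise schedules (sketch):
    video timesteps biased toward high noise via Beta(7, 1),
    action timesteps kept uniform."""
    t_video = rng.beta(7.0, 1.0, size=batch_size)    # mass near 1 (high noise)
    t_action = rng.uniform(0.0, 1.0, size=batch_size)
    return t_video, t_action
```

Because the video branch is mostly trained at high noise while the action branch sees all noise levels, the model learns to emit clean actions from noisy visual context, which is exactly the single-step inference regime.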
5. Key Experimental Results
Data and setup
- AgiBot G1: ~500 hrs diverse teleoperation across 22 environments, ~7.2K episodes, ~42 subtasks/episode
- DROID-Franka: public heterogeneous dataset
- Baselines: GR00T N1.6, pi0.5 (both from-scratch and pretrained variants)
Q1: Learning from diverse data (seen tasks, unseen environments)
| Model | Initialization | AgiBot Avg Task Progress |
|---|---|---|
| GR00T N1.6 | From scratch | 0.6% |
| GR00T N1.6 | Pretrained | 8.4% |
| pi0.5 | From scratch | 0% |
| pi0.5 | Pretrained | 27.4% |
| DreamZero | From scratch | 62.2% |
VLAs trained from scratch achieve near-zero task progress on diverse data. DreamZero achieves 2x+ the task progress of the best pretrained VLA.
Q2: Zero-shot generalization to unseen tasks
On 10 tasks entirely absent from training (untying shoelaces, ironing, painting, shaking hands, etc.):
- DreamZero: 39.5% average task progress
- Best pretrained VLA: 16.3%
- From-scratch VLAs: <1%
Qualitatively: VLAs overfit to dominant training behaviors (e.g., always try pick-and-place). DreamZero performs visual planning and executes novel motions.
Q3: Post-training
After fine-tuning on task-specific data (shirt folding, fruit packing, table bussing):
- DreamZero: 90.5% average task progress
- Best pretrained VLA: 53.3%
- Environment generalization is retained after post-training.
Q4: Cross-embodiment transfer
Using only 10-20 min of video-only data (no actions) from another robot (YAM) or humans:
| Method | Unseen Task Progress |
|---|---|
| DreamZero (baseline) | 38.3% |
| + Human-to-robot transfer | 54.3% (+42% relative) |
| + Robot-to-robot transfer | 55.4% (+45% relative) |
Q5: Few-shot embodiment adaptation
Transfer to a new robot (YAM) with only 30 minutes of play data: retains strong language following and zero-shot generalization to novel objects.
6. Ablations
| Factor | Finding |
|---|---|
| Data diversity | Diverse data » repetitive data (50% vs 33%) — robust IDM needs diverse state-action correspondences |
| Model scale | 14B » 5B (50% vs 21%) — smaller models hallucinate visually → erroneous actions. VLAs at any size (5B-32B) still fail on diverse data (0%) |
| Architecture | AR ≈ BD in task progress, but AR produces smoother motions and is 3-4x faster |
7. Strengths
- Strong generalization story: 2x+ over SOTA VLAs on unseen tasks/environments, even without cross-embodiment pretraining
- Data-efficient cross-embodiment: 10-20 min of video-only data yields 42%+ relative improvement
- Practical real-time system: 38x speedup to reach 7Hz closed-loop control
- Interesting failure mode: most failures come from video prediction errors, not action extraction — improving the video backbone directly improves the policy
- Clean scaling signal: larger video models → better video quality → better actions (unlike VLAs where scaling doesn’t help with diverse data)
8. Limitations
- Behavior cloning paradigm: WAMs are still fundamentally instruction → action, without the ability to do RL or counterfactual reasoning (as noted in the Tale of Two World Models discussion)
- High-precision tasks: sub-centimeter tasks (key insertion, fine assembly) remain challenging
- Long-horizon tasks: limited by action chunk horizon and context window
- Single-embodiment pretraining: multi-embodiment joint pretraining not yet explored
- No scaling laws: the relationship between model size, data size, and compute for WAMs is still unknown
9. Takeaways
- Video models are powerful priors for robot policies — they encode spatiotemporal knowledge that VLMs fundamentally lack
- Data diversity > data repetition for WAMs — the video prediction is already learned from pretraining; the key is learning a robust inverse dynamics model
- Joint end-to-end training matters — single model with shared objective > separate video + IDM models
- Cross-embodiment transfer through video is surprisingly effective — video is embodiment-agnostic, making WAMs natural candidates for multi-robot learning
- The inference cost is solvable — with aggressive optimization, 14B video diffusion can run at 7Hz for real-time control
References
- [Paper] arXiv:2602.15922
- [Project] dreamzero0.github.io
- [Code] github.com/dreamzero0/dreamzero
