[Blog Notes] A Tale of Two World Models: WAMs vs. Action-Conditioned World Models
TL;DR
Two distinct “world model” recipes are emerging in robotics: World Action Models (WAMs) that take [Image + Text] → [Video + Actions], and Action-Conditioned World Models (AC-WMs) that take [Image + Future Actions] → [Video]. WAMs better preserve pre-trained video model capabilities and enable easy cross-embodiment training; AC-WMs unlock broader data utilization (including failures and play), RL inside the world model, and fine-grained planning. The long-term winner is likely a flexibly conditioned model that can condition on either text or actions, getting the best of both worlds.
This is a summary of Anirudha Majumdar’s blog post (March 2026).
Context
As of early 2026, world model-based approaches are starting to surpass vision-language-action models (VLAs) on benchmarks like DROID on RoboArena [1]. The top performer is DreamZero [2], a WAM fine-tuned from a video model. The central design question: should the world model condition its generation on a sequence of future robot actions?
Two Recipes
World Action Models (WAMs)
Input-output: [Current Image + Text Instruction] → [Video + Actions]
The model generates a video of successful task execution from a language instruction, then decodes robot actions from the video. Two decoding strategies:
- Sequential: generate video first, then use a vision pipeline to extract actions (e.g., Large Video Planner [3])
- Joint: train an inverse dynamics model alongside the video generator (e.g., mimic-video [4], DreamZero [2], VideoPolicy [5], UVA [6])
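The joint recipe can be sketched in a few lines. Everything below is a toy stand-in (the generator and inverse dynamics model are placeholders, not any paper's actual architecture), meant only to show the data flow: frames in, frame pairs to an inverse dynamics model (IDM), actions out.

```python
import numpy as np

# Hypothetical sketch of the WAM "joint" decoding recipe: a video generator
# produces future frames from (image, text), and an inverse dynamics model
# recovers the action between each pair of consecutive frames.
# All components here are illustrative stand-ins.

rng = np.random.default_rng(0)

def generate_video(image, instruction, horizon=8):
    """Stand-in video generator: returns `horizon` future frames."""
    # A real WAM would run a fine-tuned video diffusion model here.
    return [image + 0.01 * (t + 1) for t in range(horizon)]

def inverse_dynamics(frame_t, frame_t1):
    """Stand-in IDM: maps a frame pair to a toy 7-DoF action."""
    delta = float((frame_t1 - frame_t).mean())
    return np.full(7, delta)

def wam_policy(image, instruction):
    """Generate a video plan, then decode one action per frame transition."""
    frames = [image] + generate_video(image, instruction)
    return [inverse_dynamics(a, b) for a, b in zip(frames, frames[1:])]

image = rng.standard_normal((64, 64))
actions = wam_policy(image, "pick up the red block")
print(len(actions), actions[0].shape)  # → 8 (7,)
```

Note that the IDM only ever sees frame pairs, which is why (as argued below) it can be fine-tuned per embodiment with limited data while the video generator stays shared.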
Action-Conditioned World Models (AC-WMs)
Input-output: [Current Image + Future Actions] → [Video]
The model simulates the consequences of given actions. Examples: Dreamer [7], Veo-Robotics [8], Ctrl-World [9], DreamDojo [10], PlayWorld [11]. Dynamics can also be modeled in latent space (e.g., V-JEPA2 [13]).
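The AC-WM interface is the inverse wiring: actions are inputs, frames are outputs. A minimal sketch of that interface, with a toy one-step dynamics function standing in for the learned model (names and shapes are illustrative assumptions):

```python
import numpy as np

# Hypothetical AC-WM interface: given the current frame and a sequence of
# future actions, roll out predicted frames one step at a time.

def ac_wm_step(frame, action):
    """Stand-in one-step dynamics: predicts the next frame."""
    # A real model would be a learned video (or latent-space) predictor.
    return frame + action.mean()

def rollout(frame, actions):
    """Autoregressively simulate the consequences of an action sequence."""
    frames = []
    for a in actions:
        frame = ac_wm_step(frame, a)
        frames.append(frame)
    return frames

frame0 = np.zeros((64, 64))
plan = [np.full(7, 0.1) for _ in range(5)]
video = rollout(frame0, plan)
print(len(video))  # → 5, one predicted frame per action
```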
Arguments for WAMs
| Advantage | Explanation |
|---|---|
| Preserving pre-trained abilities | WAMs share the same input modalities (image + text) as pre-trained video models → minimal distribution shift during fine-tuning. AC-WMs must learn to condition on action representations, which can destroy pre-trained capabilities. |
| Easier learning problem | WAMs only need to model what successful task execution looks like, rather than predicting consequences of arbitrary actions. |
| Good action proposals | WAMs enable best-of-N planning [14]: generate diverse candidate plans by varying text prompts, score with a reward model, pick the best. AC-WMs need a separate policy to propose actions. |
| Cross-embodiment training | WAM video generation is embodiment-agnostic; only the action decoder needs embodiment-specific fine-tuning with limited data [4]. AC-WMs face the open challenge of conditioning on heterogeneous action spaces. |
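The best-of-N planning advantage above can be made concrete. The sketch below follows the spirit of [14] but uses toy stand-ins for the generator and reward model (varying the sampling seed plays the role of varying prompts or diffusion noise):

```python
import numpy as np

# Sketch of best-of-N planning with a WAM: sample N candidate plans,
# score each with a reward model, execute the best one's actions.
# Generator and reward model are illustrative stand-ins.

def sample_plan(image, instruction, seed):
    """Stand-in: one candidate (video, actions) pair per seed."""
    r = np.random.default_rng(seed)
    actions = r.standard_normal((8, 7))
    video = None  # a real WAM would also return the generated frames
    return video, actions

def reward_model(video, actions):
    """Stand-in scorer: prefers small, smooth action sequences."""
    return -float(np.square(actions).sum())

def best_of_n(image, instruction, n=16):
    candidates = [sample_plan(image, instruction, s) for s in range(n)]
    scores = [reward_model(v, a) for v, a in candidates]
    return candidates[int(np.argmax(scores))][1]

best_actions = best_of_n(np.zeros((64, 64)), "stack the cups")
print(best_actions.shape)  # → (8, 7)
```

The key point is that no separate policy is needed to propose actions: the WAM itself is the proposal distribution, and the reward model only needs to rank finished plans.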
Arguments for AC-WMs
| Advantage | Explanation |
|---|---|
| Data, data, data | AC-WMs can train on all robot data — successes, failures, and autonomous play [11]. WAMs require hindsight relabeling of “tasks”, which is hard to scale. |
| Beyond behavior cloning | WAMs are still instruction-in → actions-out (behavior cloning with a video objective). AC-WMs enable RL inside the world model (e.g., World-Gymnast [15], PlayWorld [11]) and planning via counterfactual simulation (e.g., DreamDojo [10]). |
| Fine-grained planning | AC-WMs support gradient-based action-sequence optimization at inference time. WAMs would require gradient-based optimization in text/prompt or diffusion-noise space — feasibility unclear. |
| Policy evaluation | AC-WMs allow closed-loop policy rollouts inside the model (e.g., Veo-Robotics [8], WorldGym [17]). WAMs cannot do this since they lack action conditioning. |
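The fine-grained planning row above deserves a worked example. With a differentiable world model, the action sequence can be optimized directly by gradient descent on a task cost. The sketch uses deliberately trivial dynamics (s' = s + a) and a quadratic goal cost so the gradient is analytic; a real AC-WM would supply these via backprop through the learned model.

```python
import numpy as np

# Sketch of gradient-based action-sequence optimization inside an AC-WM.
# Dynamics and cost are toy stand-ins chosen for an analytic gradient.

def rollout_final(s0, actions):
    """Toy differentiable dynamics: s_{t+1} = s_t + a_t."""
    return s0 + actions.sum(axis=0)

def plan_by_gradient(s0, goal, horizon=5, steps=100, lr=0.05):
    """Minimize ||final_state - goal||^2 over the action sequence."""
    actions = np.zeros((horizon, s0.size))
    for _ in range(steps):
        final = rollout_final(s0, actions)
        grad = 2.0 * (final - goal)  # d cost / d a_t, identical for every t
        actions -= lr * grad         # broadcasts over the horizon
    return actions

s0, goal = np.zeros(2), np.array([1.0, -2.0])
plan = plan_by_gradient(s0, goal)
final = rollout_final(s0, plan)
print(np.round(final, 3))  # converges to the goal state
```

This is exactly the operation that is awkward for WAMs: their only "knobs" at inference time are the text prompt and the diffusion noise, neither of which offers such a direct gradient path to the actions.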
A Non-Argument
“AC-WMs uniquely enable inference-time scaling via planning” — this is not a valid differentiator. WAMs can also do best-of-N planning with a reward model [14].
Long-Term Prediction
Applying Sutton’s bitter lesson (which approach scales better with compute and data?):
- AC-WMs have a clearer data scaling story: all policy rollouts + autonomous play data can be thrown in
- In the long term, frontier labs may pre-train video models with actions as an input modality once the commercial case for robotics is clear
- This would eliminate WAMs’ main advantage (preserving pre-training abilities), while AC-WMs’ advantages (RL, fine-grained planning) would keep improving with scale
The Best of Both Worlds: Flexible Conditioning
The likely convergence point is a flexibly conditioned world model that can accept either text or actions as conditioning (e.g., by masking the unused input during training). Such a unified model would enable:
- Easy action proposal generation (WAM mode)
- Training on vast failure and play data (AC-WM mode)
- Fine-grained planning at inference time
- RL inside the world model
- Policy evaluation
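One simple way to realize the masking idea above is classifier-free-guidance-style conditioning dropout: during training, randomly replace the unused modality with a null token so the model learns text-only, action-only, and joint modes. The sketch below is an illustrative assumption about how such an input layer could look, not any paper's actual design.

```python
import numpy as np

# Sketch of flexible conditioning: a single conditioning vector built from
# a text embedding, an action embedding, or both, with the unused input
# replaced by a null token. Shapes and names are illustrative.

rng = np.random.default_rng(0)
D = 16
null_text = np.zeros(D)    # would be learned null tokens in a real model
null_action = np.zeros(D)

def conditioned_input(text_emb=None, action_emb=None):
    """Concatenate (text, action) conditioning, masking absent modalities."""
    t = text_emb if text_emb is not None else null_text
    a = action_emb if action_emb is not None else null_action
    return np.concatenate([t, a])

def training_batch_conditioning(text_emb, action_emb):
    """Randomly drop one modality so the model learns all three modes:
    text-only (WAM mode), action-only (AC-WM mode), or both."""
    mode = rng.choice(["text", "action", "both"])
    if mode == "text":
        return conditioned_input(text_emb=text_emb)
    if mode == "action":
        return conditioned_input(action_emb=action_emb)
    return conditioned_input(text_emb, action_emb)

x = training_batch_conditioning(rng.standard_normal(D), rng.standard_normal(D))
print(x.shape)  # → (32,)
```

At inference time the same network then serves both recipes: pass text and mask actions to propose plans, or pass actions and mask text to simulate their consequences.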
Key Takeaways
- WAMs are surprisingly strong — they preserve pre-trained video model abilities and are the current SOTA (DreamZero on DROID)
- AC-WMs have the better long-term scaling story — they can leverage all robot data, not just success demonstrations
- The dichotomy is likely false — flexibly conditioned models that handle both text and action inputs will probably win out
- Inference-time planning is not exclusive to AC-WMs — WAMs can do best-of-N planning with reward models
References
- [1] Atreya et al., 2025, “RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies”
- [2] Ye et al., 2026, “DreamZero: World Action Models are Zero-shot Policies”
- [3] Chen et al., 2025, “Large Video Planner Enables Generalizable Robot Control”
- [4] Pai et al., 2025, “mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs”
- [5] Liang et al., 2025, “Video Generators are Robot Policies”
- [6] Li et al., 2025, “Unified Video Action Model”
- [7] Hafner et al., 2025, “Training Agents Inside of Scalable World Models”
- [8] Gemini Robotics Team et al., 2025, “Evaluating Gemini Robotics Policies in a Veo World Simulator”
- [9] Guo et al., 2025, “Ctrl-World: A Controllable Generative World Model for Robot Manipulation”
- [10] Gao et al., 2026, “DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos”
- [11] Yin et al., 2026, “PlayWorld: Learning Robot World Models from Autonomous Play”
- [12] Ha and Schmidhuber, 2018, “World Models”
- [13] Assran et al., 2025, “V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning”
- [14] Kim et al., 2026, “Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning”
- [15] Sharma et al., 2026, “World-Gymnast: Training Robots with Reinforcement Learning in a World Model”
- [16] Wagenmaker et al., 2025, “Steering Your Diffusion Policy with Latent Space Reinforcement Learning”
- [17] Quevedo et al., 2025, “WorldGym: World Model as An Environment for Policy Evaluation”
