[Paper Notes] Do World Action Models Generalize Better than VLAs? A Robustness Study
TL;DR
- WAMs are strongest on visual perturbations. LingBot-VA reaches 74.2% on RoboTwin 2.0-Plus, and Cosmos-Policy reaches 82.2% on LIBERO-Plus.
- But WAMs do not universally dominate. On LIBERO-Plus, pi0.5 reaches 85.7%, beating the best WAM overall.
- The likely advantage of WAMs is that video pretraining gives them stronger spatiotemporal priors, so downstream policy training spends less effort learning dynamics from scratch.
- The biggest practical downside is speed: the evaluated WAMs are about 4.8x to 83.0x slower than pi0.5 per action chunk.
Paper Info
- Title: Do World Action Models Generalize Better than VLAs? A Robustness Study
- Authors: Zhanguang Zhang, Zhiyuan Li, Behnam Rahmati, Rui Heng Yang, Yintao Ma, Amir Rasouli, Sajjad Pakdamansavoji, Yangzheng Wu, Lingfeng Zhang, Tongtong Cao, Feng Wen, Xingyue Quan, Yingxue Zhang
- Affiliation: Huawei Technologies; University of Toronto
- Date: 2026-03-23
- Venue: arXiv preprint
- arXiv: 2603.22078
1. Question and Setup
The paper asks a simple but important question: do world action models (WAMs) actually generalize more robustly than vision-language-action models (VLAs)?
To test this, the authors evaluate open-source or publicly released policies on two perturbation-heavy benchmarks:
- LIBERO-Plus: single-arm Franka manipulation
- RoboTwin 2.0-Plus: a new dual-arm benchmark built on RoboTwin 2.0
Both benchmarks perturb the policy along seven axes:
- Camera
- Robot initial state
- Language
- Light
- Background
- Noise
- Layout
The model pool spans three families:
- VLAs: pi0, pi0.5, OpenVLA-OFT, X-VLA, UniVLA, RIPT-VLA
- Hybrid approaches: MOTUS, VLA-JEPA
- WAMs: GE-Act, Cosmos-Policy, LingBot-VA
One caveat matters: DreamZero is discussed in the taxonomy, but excluded from quantitative evaluation because of dataset mismatch and very high training/inference cost.
2. What distinguishes WAMs from VLAs?
The paper frames the difference cleanly:
- VLA: predict the next action directly from the current history: p_theta(a_t | h_t)
- WAM: jointly predict the future state and action, p_phi(h_{t+1}, a_t | h_t), or first predict the future state and then generate the action: p_phi(h_{t+1} | h_t) * g_psi(a_t | h_t, h_{t+1})
So the main distinction is not just “language model backbone” versus “video model backbone”. It is also:
- direct action prediction versus
- future-state-aware action generation
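The two decoding strategies can be sketched as control flow. Everything below is a toy stand-in (the function names and the trivial integer "models" are my own assumptions, not the paper's implementations); only the structure mirrors the factorizations above.

```python
# Toy illustration of the two action-decoding factorizations.
# The "models" are trivial stand-in functions, not real policies.

def vla_step(policy, h_t):
    # VLA: a_t ~ p_theta(a_t | h_t) -- action comes straight from history.
    return policy(h_t)

def wam_step(dynamics, action_head, h_t):
    # WAM (two-stage): h_{t+1} ~ p_phi(h_{t+1} | h_t),
    # then a_t ~ g_psi(a_t | h_t, h_{t+1}).
    h_next = dynamics(h_t)               # predict the future state first
    return action_head(h_t, h_next)      # condition the action on both states

# Trivial stand-ins: the state is an int, "dynamics" increments it,
# and each "action" is a simple function of the states it conditions on.
policy = lambda h: h * 2
dynamics = lambda h: h + 1
action_head = lambda h, h_next: h + h_next

print(vla_step(policy, 3))                  # direct prediction -> 6
print(wam_step(dynamics, action_head, 3))   # future-state-aware -> 7
```

The point of the sketch is only that a WAM inserts an explicit future-state prediction between observation and action, which is also where its extra inference cost comes from.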
Because WAM backbones are pretrained for video generation, the authors argue they inherit stronger spatiotemporal priors from web-scale video pretraining.
3. Main Results
Overall outcome
| Benchmark | Best WAM | Success (%) | Best overall model | Success (%) | Main read |
|---|---|---|---|---|---|
| RoboTwin 2.0-Plus | LingBot-VA | 74.2 | LingBot-VA | 74.2 | WAMs lead clearly on this dual-arm robustness benchmark |
| LIBERO-Plus | Cosmos-Policy | 82.2 | pi0.5 | 85.7 | A strong VLA beats the best WAM overall |
This is why I would read the paper’s title question as “often yes, but not always.”
Where WAMs look strongest
- On RoboTwin 2.0-Plus, LingBot-VA is especially strong under light (89.0), background (91.3), noise (80.9), and layout (87.9) perturbations.
- On LIBERO-Plus, Cosmos-Policy and GE-Act also show strong robustness under light, noise, and layout perturbations.
- Hybrid methods such as MOTUS and VLA-JEPA also improve robustness, which suggests that even partial integration of video/dynamics learning helps.
Where WAMs still struggle
- Camera viewpoint changes remain hard. On RoboTwin 2.0-Plus, LingBot-VA drops to 28.9 under camera perturbation.
- Robot initial-state perturbations are another weak spot: LingBot-VA scores 36.2 on RoboTwin 2.0-Plus, and Cosmos-Policy trails pi0.5 under robot perturbations on LIBERO-Plus (63.3 vs 77.5).
So the current evidence says: WAMs are especially good at visual robustness, but not automatically better under geometry or embodiment shifts.
4. Why WAMs help
The paper’s explanation is intuitive:
- video backbones are pretrained on temporally rich internet-scale videos
- that pretraining teaches fine-grained visual dynamics
- downstream policy training can therefore focus more on action generation, instead of learning dynamics from scratch
Table 2 is one of the most interesting parts of the paper. It highlights how different the training pipelines are:
- Cosmos-Policy uses only 185 task trajectories for task-specific finetuning
- pi0.5 relies on a much broader recipe: robot data, web captioning/VQA/grounding data, high-level planning data, and post-training stages
My read: the paper supports a training-efficiency story more strongly than a pure architecture-only story. WAMs seem to buy robustness more cheaply, while strong VLAs can catch up or even surpass them when backed by a richer data pipeline.
5. Runtime and Practical Limits
The paper is also very clear about the main downside: WAM inference is slow.
| Model | Inference time per action chunk |
|---|---|
| pi0.5 | 63 ms |
| X-VLA | 195 ms |
| GE-Act | 300 ms |
| Cosmos-Policy | 390 ms |
| LingBot-VA (real-world setting) | 480 ms |
| MOTUS | 1175 ms |
| LingBot-VA (RoboTwin setting) | 5230 ms |
The authors attribute much of this gap to the state denoising process inside WAMs. Even the faster WAMs in this study are at least 4.8x slower than pi0.5.
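The quoted 4.8x–83.0x range follows directly from the table; a quick arithmetic check (using the per-chunk times above, with pi0.5 as the baseline):

```python
# Slowdown factors relative to pi0.5, from the per-chunk times (ms) above.
times_ms = {
    "pi0.5": 63,
    "X-VLA": 195,
    "GE-Act": 300,
    "Cosmos-Policy": 390,
    "LingBot-VA (real-world)": 480,
    "MOTUS": 1175,
    "LingBot-VA (RoboTwin)": 5230,
}

baseline = times_ms["pi0.5"]
for name, t in times_ms.items():
    print(f"{name}: {t / baseline:.1f}x pi0.5")
```

GE-Act at 300 ms works out to about 4.8x pi0.5 and LingBot-VA's RoboTwin setting at 5230 ms to about 83.0x, matching the range quoted in the TL;DR.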
Other limitations worth keeping in mind:
- RoboTwin 2.0-Plus is an in-house benchmark, so external replication will take time
- DreamZero is not included in the quantitative benchmark comparison
- the compared methods do not use matched data pipelines, so this is not a perfectly controlled apples-to-apples study
6. Takeaways
- This paper supports “WAMs are a strong robustness prior” more than “WAMs always generalize better than VLAs.”
- WAMs look especially attractive for visual perturbations such as light, noise, and cluttered layouts.
- Strong VLAs can still win overall when they are trained with sufficiently rich and diverse data, as pi0.5 shows on LIBERO-Plus.
- Hybrid methods matter because they show that importing temporal/video structure into VLA pipelines already helps a lot.
- Inference efficiency is the biggest deployment bottleneck for WAMs today.
