[Paper Notes] Fast-WAM: Do World Action Models Need Test-time Future Imagination?
TL;DR
Fast-WAM asks whether World Action Models need to generate future videos at test time. Its answer is practical: keep WAM-style video supervision during training, then run as a direct action policy at inference. The model uses a pretrained video DiT to form latent world representations from the current observation and language, and predicts action chunks without the expensive imagine-then-execute stage.
The main empirical message is that training-time video co-training matters more than test-time future imagination. Fast-WAM reaches 91.8% average success on RoboTwin 2.0 and 97.6% on LIBERO without embodied pretraining, while running at 190 ms latency on a single RTX 5090D V2 GPU. Variants that still generate future videos are close in success but much slower; removing video co-training causes a larger accuracy drop.
Paper Info
The paper is “Fast-WAM: Do World Action Models Need Test-time Future Imagination?” by Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao from IIIS, Tsinghua University and Galaxea AI. It is available as arXiv:2603.16666, with a project page at yuantianyuan01.github.io/FastWAM and code at yuantianyuan01/FastWAM. The authors also release checkpoints and preprocessed LIBERO/RoboTwin datasets on Hugging Face.
Core Argument
Many WAMs follow an intuitive pipeline: observe the scene and language command, imagine future visual states, then predict actions conditioned on that imagined rollout. The intuition is attractive, but video diffusion is slow, and a robot policy ultimately needs low-latency closed-loop control.
Fast-WAM separates two roles that are often bundled together. Future video prediction can serve as a training objective that shapes the representation, or as a test-time procedure that generates a rollout before action prediction. The paper's controlled comparison shows that the first role carries most of the benefit on its benchmarks. In other words, video modeling is most useful as supervision for learning latent world structure, while inference can stay direct:
current observation + language
-> latent world representation
-> action chunk
This makes Fast-WAM look like a bridge between VLA policies and WAMs. At deployment time, it behaves like a direct action policy. During training, it still receives world-model supervision from future video latents.
Method
Fast-WAM is built on Wan2.2-5B, reusing the video Diffusion Transformer, pretrained T5 text encoder, and video VAE. On top of this video backbone, the authors add a 1B action expert DiT, giving the full model roughly 6B parameters. The model groups tokens into clean current-frame latents, noisy future-video latents used for training, and action tokens used for action chunk generation.
The most important implementation detail is the structured attention mask. During training, video tokens learn to predict future latent frames, and action tokens learn to predict actions from the clean current observation. Action tokens cannot attend to future video tokens, which prevents future-frame leakage and keeps the ablation fair: the action branch can benefit from a video-shaped backbone, but it cannot directly see ground-truth future frames. At inference, the future-video branch is removed; the current observation goes through the video backbone once, and the action expert denoises the action chunk.
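The masking rule can be sketched as a boolean block mask. This is a minimal illustration, not the repository's code: the function name `build_attention_mask`, the token ordering, and the token counts are hypothetical. Only the rule that action tokens do not attend to future-video tokens comes from the description above; the other blocks (current-frame tokens attend only to themselves, video tokens attend to current-frame and video tokens) are assumptions chosen so that dropping the video branch leaves the rest of the computation unchanged.

```python
import numpy as np

def build_attention_mask(n_obs: int, n_vid: int, n_act: int) -> np.ndarray:
    """Return mask[q, k] == True where query token q may attend to key token k.

    Assumed token order: [current-frame | future-video | action].
    """
    n = n_obs + n_vid + n_act
    obs = slice(0, n_obs)
    vid = slice(n_obs, n_obs + n_vid)
    act = slice(n_obs + n_vid, n)

    mask = np.zeros((n, n), dtype=bool)
    mask[obs, obs] = True   # assumption: clean current-frame tokens see only themselves
    mask[vid, obs] = True   # assumption: future-video tokens see the current frame
    mask[vid, vid] = True   # assumption: ... and each other
    mask[act, obs] = True   # action tokens see the current frame
    mask[act, act] = True   # ... and each other, but never future-video tokens
    return mask

# Hypothetical token counts, for illustration only.
N_OBS, N_VID, N_ACT = 4, 6, 3
train_mask = build_attention_mask(N_OBS, N_VID, N_ACT)
infer_mask = build_attention_mask(N_OBS, 0, N_ACT)

# Deleting the video rows and columns from the training mask gives the inference mask.
vid_idx = np.arange(N_OBS, N_OBS + N_VID)
pruned = np.delete(np.delete(train_mask, vid_idx, axis=0), vid_idx, axis=1)
```

Under these assumptions, the action rows of the mask are the same with or without the video block, which is what would let the future-video branch be removed at inference without changing what the action tokens see.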
The training objective is a joint flow-matching loss over action chunks and future video latents:
L_act = L_FM(action chunk)
L_vid = L_FM(future video latents)
L = L_act + lambda * L_vid
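As a rough sketch of how such a joint objective could be computed, the NumPy snippet below uses a rectified-flow style flow-matching loss: sample noise, interpolate linearly toward the clean target, and regress the velocity. The interpolation convention (noise at t = 0, data at t = 1), the tensor shapes, the action dimension, the value of lambda, and the names `flow_matching_loss` and `joint_loss` are assumptions for illustration; these notes do not specify them.

```python
import numpy as np

def flow_matching_loss(predict_velocity, x1, rng):
    """Mean squared velocity error for one batch of clean targets x1."""
    x0 = rng.standard_normal(x1.shape)                 # noise sample
    t_shape = (x1.shape[0],) + (1,) * (x1.ndim - 1)    # one time value per batch item
    t = rng.uniform(size=t_shape)
    xt = (1.0 - t) * x0 + t * x1                       # assumed linear interpolation path
    target = x1 - x0                                   # velocity of that path
    return float(np.mean((predict_velocity(xt, t) - target) ** 2))

def joint_loss(loss_act, loss_vid, lam):
    """L = L_act + lambda * L_vid."""
    return loss_act + lam * loss_vid

rng = np.random.default_rng(0)
actions = rng.standard_normal((2, 32, 14))     # (batch, horizon 32, hypothetical action dim)
video = rng.standard_normal((2, 9, 8, 4, 4))   # (batch, 9 latent frames, hypothetical C, H, W)

# Stand-in "model" that always predicts zero velocity.
zero_model = lambda xt, t: np.zeros_like(xt)

l_act = flow_matching_loss(zero_model, actions, rng)
l_vid = flow_matching_loss(zero_model, video, rng)
total = joint_loss(l_act, l_vid, lam=0.5)      # the lambda value is a placeholder
```

In a real training loop the two losses would come from the action expert and the video branch of the same forward pass; here they are computed separately only to keep the example short.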
Reported implementation details include a hidden dimension of 1024 for the action expert, action horizon 32, future video horizon 9 frames after 4x temporal downsampling, multi-camera image concatenation before VAE encoding, 10 action denoising steps at inference, AdamW with learning rate 1e-4, mixed precision training, and gradient clipping. The official repository mirrors this structure through src/fastwam/, model and data configs for LIBERO/RoboTwin, evaluation managers under experiments/, scripts/train.py, DeepSpeed launch scripts, checkpoint evaluation instructions, ActionDiT preprocessing, and T5 embedding precomputation.
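To make the inference-time cost concrete, here is a minimal sketch of denoising one action chunk with a fixed number of Euler steps. It assumes a flow-matching velocity model integrated from noise at t = 0 to data at t = 1. The sampler type, the name `sample_action_chunk`, the action dimension of 14, and the toy velocity function are all assumptions; only the step count (10) and the action horizon (32) are reported above.

```python
import numpy as np

def sample_action_chunk(velocity_fn, horizon=32, action_dim=14, steps=10, seed=0):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, action_dim))   # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)               # one model call per step
    return x

# Toy velocity field that points straight at a known target chunk.
# In the real model, this role would be played by the action expert,
# conditioned on backbone features computed once from the current observation.
target_chunk = np.full((32, 14), 0.25)
toy_velocity = lambda x, t: (target_chunk - x) / (1.0 - t)

chunk = sample_action_chunk(toy_velocity)
```

The point of the sketch is the cost structure: the loop calls only the action model, ten times, and never generates video latents.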
Controlled Variants
The paper tests the core claim by comparing variants that change whether video is used during training and whether future generation is used during inference:
| Variant | Training Video Co-Training | Test-Time Future Generation | Meaning |
|---|---|---|---|
| Fast-WAM | Yes | No | Main method |
| Fast-WAM-Joint | Yes | Yes | Joint video/action denoising |
| Fast-WAM-IDM | Yes | Yes | Generate future video, then predict action |
| Fast-WAM w/o video co-train | No | No | Tests whether video co-training matters |
This design isolates the real question: does performance come from learning with video, or from explicitly imagining video during deployment?
Simulation Results
Fast-WAM is evaluated on RoboTwin 2.0 and LIBERO. On RoboTwin, it is within 0.4 points of pretrained LingBot-VA and more than 11 points above the baselines trained without embodied pretraining:
| Method | Embodied Pretraining | Average Success |
|---|---|---|
| π0 | Yes | 62.2 |
| π0.5 | Yes | 79.8 |
| Motus | Yes | 87.8 |
| Motus from Wan2.2 | No | 77.3 |
| LingBot-VA | Yes | 92.2 |
| LingBot-VA from Wan2.2 | No | 80.6 |
| Fast-WAM | No | 91.8 |
On LIBERO, Fast-WAM remains competitive with strong WAM/VLA baselines even though it does not rely on embodied pretraining:
| Method | Embodied Pretraining | Average Success |
|---|---|---|
| OpenVLA | Yes | 76.5 |
| π0 | Yes | 94.1 |
| π0.5 | Yes | 96.9 |
| LingBot-VA | Yes | 98.5 |
| Motus | Yes | 97.7 |
| Fast-WAM | No | 97.6 |
The ablations are the heart of the paper. Fast-WAM stays within roughly 1.2 points of the variants that perform explicit future imagination on both benchmarks, while removing video co-training costs 8.0 points on RoboTwin and 4.1 points on LIBERO:
| Variant | RoboTwin Average Success | LIBERO Average Success |
|---|---|---|
| Fast-WAM | 91.8 | 97.6 |
| Fast-WAM-Joint | 90.6 | 98.5 |
| Fast-WAM-IDM | 91.3 | 98.0 |
| Fast-WAM w/o video co-train | 83.8 | 93.5 |
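The gaps in the ablation table can be checked with a few lines of arithmetic. The snippet below only restates the table values and subtracts the Fast-WAM row; the variable names are illustrative.

```python
# (RoboTwin average success, LIBERO average success), copied from the table above.
results = {
    "Fast-WAM": (91.8, 97.6),
    "Fast-WAM-Joint": (90.6, 98.5),
    "Fast-WAM-IDM": (91.3, 98.0),
    "Fast-WAM w/o video co-train": (83.8, 93.5),
}

base_robotwin, base_libero = results["Fast-WAM"]

# Difference of each variant relative to Fast-WAM, in percentage points.
deltas = {
    name: (round(robotwin - base_robotwin, 1), round(libero - base_libero, 1))
    for name, (robotwin, libero) in results.items()
    if name != "Fast-WAM"
}

for name, (d_robotwin, d_libero) in deltas.items():
    print(f"{name}: RoboTwin {d_robotwin:+.1f}, LIBERO {d_libero:+.1f}")
```

The variants with test-time future generation move by about a point in either direction, while dropping video co-training is the only change that lowers both benchmarks by several points.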
Real-World Towel Folding
The real-world experiment uses a Galaxea R1 Lite platform on a long-horizon towel-folding task. The authors collect 60 hours of teleoperated demonstrations and train for 30k steps. The evaluation reports both success rate and completion time, which matters because a policy that eventually succeeds after repeated corrections may still be weak for deployment.
The real-world pattern matches the simulation story. Pretrained π0.5 remains the strongest baseline, but Fast-WAM variants with video co-training substantially outperform π0.5 trained without embodied pretraining, and Fast-WAM without video co-training drops to 10% success. The latency comparison is also decisive: Fast-WAM runs at 190 ms, compared with about 580 ms for Fast-WAM-Joint and 810 ms for Fast-WAM-IDM. Within this model family, direct Fast-WAM gives the best deployment tradeoff: strong success with far lower inference latency.
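The latency figures translate into relative slowdowns and an upper bound on how often the policy can be queried. The snippet below is a back-of-the-envelope calculation on the reported numbers (the 580 ms figure is approximate); the amortized per-action value additionally assumes that all 32 actions of a chunk are executed before the next inference call, which these notes do not state.

```python
# Reported inference latency in milliseconds.
latency_ms = {"Fast-WAM": 190, "Fast-WAM-Joint": 580, "Fast-WAM-IDM": 810}

base = latency_ms["Fast-WAM"]
slowdown = {name: ms / base for name, ms in latency_ms.items()}
max_calls_per_second = 1000 / base

ACTION_HORIZON = 32
# Assumption: every action in the chunk is executed before the next inference call.
amortized_ms_per_action = base / ACTION_HORIZON

for name, ratio in slowdown.items():
    print(f"{name}: {latency_ms[name]} ms, {ratio:.2f}x Fast-WAM")
print(f"Fast-WAM: at most {max_calls_per_second:.2f} inference calls per second")
print(f"Fast-WAM: {amortized_ms_per_action:.2f} ms per action if the full chunk is executed")
```

On these numbers the imagination-based variants are roughly three to four times slower per call, and even direct Fast-WAM can replan only about five times per second.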
Strengths and Limitations
The strongest part of the paper is the clean experimental question. Instead of making model scale the main story, it asks where the benefit of WAMs actually comes from and answers with controlled variants. The released code, checkpoints, and preprocessing/evaluation instructions also make the result easier to inspect and reuse.
There are still important boundaries. The conclusion is tested at a specific scale, with Wan2.2-5B plus a 1B action expert; larger video backbones, larger embodied datasets, or different action decoders may shift the tradeoff. Fast-WAM also still uses diffusion-style action denoising, so 190 ms is fast relative to other WAMs but still a significant delay for high-frequency, contact-rich control. The real-world evaluation focuses on towel folding on one robot platform, leaving open how well the conclusion transfers across rigid, articulated, deformable, and more contact-heavy manipulation. Finally, skipping future video generation removes an interpretable visual rollout that could help debugging, planning, or human inspection.
Takeaway
Fast-WAM’s takeaway is concise: for WAMs, future video prediction may be more valuable as a training objective than as a test-time procedure. This gives robot learning a useful design path: use video world modeling to learn better representations, keep action inference direct, and reserve explicit future imagination for cases where its interpretability or planning value justifies the latency.
