[Paper Notes] EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards
TL;DR
EVA (Executable Video Alignment) identifies and addresses the executability gap in video world models for robotics — where visually coherent generated rollouts can still produce infeasible robot actions. The key idea: train an inverse dynamics model (IDM) on real robot data, then repurpose it as a reward model to post-train the video generator via GRPO. The reward penalizes non-smooth motions (high acceleration/jerk) and out-of-bound actions, directly aligning generated videos with physical executability. On the RoboTwin benchmark and a real bimanual robot, EVA improves kinematic plausibility by +20.9% and boosts real-world task success from 52% to 64% (seen) and 42% to 60% (OOD).
Paper Info
- Title: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards
- Authors: Ruixiang Wang, Qingming Liu, Yueci Deng, Guiliang Liu, Zhen Liu, Kui Jia
- Affiliation: The Chinese University of Hong Kong, Shenzhen; DexForce Technology Co., Ltd.
- Date: 2026-03-24
- Venue: arXiv preprint
- arXiv: 2603.17808
- Project Page: eva-project-page.github.io
1. Problem and Motivation
Video generative models are increasingly used as world models for robotic manipulation in a decoupled paradigm:
- A video world model generates a future visual rollout conditioned on the current observation and language instruction
- An inverse dynamics model (IDM) converts the generated frames into executable robot actions
The problem: current video world models are optimized for visual realism but lack executability constraints. Even visually coherent rollouts can contain:
- Morphological deformations — arm stretching or melting
- Joint ambiguity — unclear articulation states
- Temporal discontinuities — abrupt jumps between frames
When decoded by an IDM, these artifacts translate into unstable control signals: high-frequency jitter, abrupt joint jumps, or out-of-bounds commands. The authors call this the executability gap.
While this gap can be mitigated at inference time (e.g., rejection sampling), such approaches are inefficient given the high cost of video generation. EVA instead addresses this at training time.
2. Method: Executable Video Alignment
2.1 Inverse Dynamics Model (IDM)
The IDM predicts robot actions from short temporal windows of visual observations:
\(\mathcal{L}_{\text{IDM}} = \mathbb{E}\left[\sum_t \left\|f_\phi(I_{t-k:t+k}) - a_t^{\text{gt}}\right\|_2^2\right]\)
Architecture: convolutional backbone → spatial softmax → MLP. The spatial softmax produces keypoint-like 2D coordinates per channel, which proves more stable than global pooling when decoding actions from generated (potentially artifact-laden) rollouts.
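The spatial-softmax head can be sketched as follows. This is a minimal illustration of the conv → spatial softmax → MLP pipeline, not the paper's exact configuration: layer sizes, the window length, and the 14-DoF action dimension are assumptions for a bimanual arm.

```python
# Sketch of an IDM with a spatial-softmax head (illustrative sizes only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSoftmax(nn.Module):
    """Per-channel softmax over H*W, returning expected (x, y) keypoints."""
    def forward(self, feat):                      # feat: (B, C, H, W)
        B, C, H, W = feat.shape
        probs = F.softmax(feat.flatten(2), dim=-1).view(B, C, H, W)
        xs = torch.linspace(-1, 1, W, device=feat.device)
        ys = torch.linspace(-1, 1, H, device=feat.device)
        ex = (probs.sum(dim=2) * xs).sum(dim=-1)  # (B, C) expected x
        ey = (probs.sum(dim=3) * ys).sum(dim=-1)  # (B, C) expected y
        return torch.stack([ex, ey], dim=-1).flatten(1)  # (B, 2C) keypoints

class IDM(nn.Module):
    def __init__(self, n_frames=5, action_dim=14, channels=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3 * n_frames, channels, 5, stride=2), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2), nn.ReLU(),
        )
        self.softmax = SpatialSoftmax()
        self.head = nn.Sequential(
            nn.Linear(2 * channels, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, frames):                    # (B, 3*n_frames, H, W)
        return self.head(self.softmax(self.backbone(frames)))

idm = IDM()
frames = torch.randn(2, 15, 64, 64)               # window of 5 stacked RGB frames
actions = idm(frames)                             # (2, 14) predicted joint commands
loss = F.mse_loss(actions, torch.zeros_like(actions))  # L_IDM vs. ground truth
```

Because the keypoint coordinates are expectations over spatial probability maps, local artifacts in a generated frame shift the keypoints smoothly rather than saturating a global pooling layer, which is the stability argument made above.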
2.2 IDM-based Executability Reward
Given a generated video \(V\), the frozen IDM predicts joint commands \(A = \{a_t\}_{t=1}^{T}\). The reward evaluates the action sequence along two axes:
Smoothness penalties — Huber penalties on acceleration and jerk (finite differences of IDM-decoded actions):
\(P_\alpha = \mathbb{E}_t[\text{Huber}(\alpha_t; \delta_\alpha)], \quad P_j = \mathbb{E}_t[\text{Huber}(j_t; \delta_j)]\)
Embodiment limit penalties — penalize violations of velocity and acceleration bounds:
\(P_{\text{vel}} = \mathbb{E}_t\left[\left\|\max(|v_t| - v_{\max}, 0)\right\|_2^2\right], \quad P_{\text{acc}} = \mathbb{E}_t\left[\left\|\max(|\alpha_t| - a_{\max}, 0)\right\|_2^2\right]\)
The total penalty is combined into a bounded reward:
\(R(V) = \left(\frac{1 + P(A)}{P_0}\right)^{-\gamma}\)
where \(P_0\) is a normalization constant estimated from rollouts of the pretrained model, and \(\gamma\) controls the decay rate.
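Putting the pieces together, the reward can be computed from a decoded action sequence by finite differences. This is a minimal sketch: the limits `v_max`/`a_max`, the Huber delta, `gamma`, `p0`, and the uniform weighting of the penalty terms are all illustrative assumptions, not the paper's values.

```python
# Sketch of the IDM-based executability reward (all constants are assumptions).
import numpy as np

def huber(x, delta):
    a = np.abs(x)
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))

def executability_reward(actions, dt=0.1, v_max=1.0, a_max=2.0,
                         delta=0.5, gamma=2.0, p0=1.0):
    """actions: (T, D) joint commands decoded by the frozen IDM."""
    vel = np.diff(actions, axis=0) / dt           # (T-1, D) velocity
    acc = np.diff(vel, axis=0) / dt               # (T-2, D) acceleration
    jerk = np.diff(acc, axis=0) / dt              # (T-3, D) jerk
    # Smoothness penalties: Huber on acceleration and jerk.
    p_smooth = huber(acc, delta).mean() + huber(jerk, delta).mean()
    # Embodiment limit penalties: squared violation of velocity/accel bounds.
    p_vel = (np.maximum(np.abs(vel) - v_max, 0.0) ** 2).mean()
    p_acc = (np.maximum(np.abs(acc) - a_max, 0.0) ** 2).mean()
    penalty = p_smooth + p_vel + p_acc
    # Bounded reward R(V) = ((1 + P(A)) / P_0)^(-gamma).
    return ((1.0 + penalty) / p0) ** (-gamma)

smooth = np.linspace(0, 1, 32)[:, None] * np.ones((1, 14))   # gentle joint ramp
jerky = smooth + np.random.default_rng(0).normal(0, 0.5, smooth.shape)
# The smooth trajectory scores strictly higher than the noisy one.
```

A trajectory with zero acceleration and jerk collects no penalty and gets the maximal reward, while high-frequency jitter (the typical IDM output on artifact-laden frames) is penalized through all four terms at once.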
Key insight: the reward remains informative even when generated videos contain severe visual artifacts, because such artifacts typically translate into unstable or out-of-bound actions — providing a strong penalty signal.
2.3 RL Post-Training via GRPO
EVA uses Flow-GRPO (Group Relative Policy Optimization adapted for flow-matching models) to fine-tune the video generator:
- Sample \(G=8\) rollouts per prompt using an SDE sampler derived from the flow model
- Score each rollout with the IDM-based reward
- Compute group-relative advantages: \(\hat{A}_i = (R_i - \mu_R) / (\sigma_R + \epsilon)\)
- Optimize using clipped policy gradient with KL regularization against the reference model
The IDM is kept frozen during GRPO fine-tuning — it serves purely as a reward model.
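The group-relative advantage step above is straightforward to sketch. The rewards here are dummy values; in EVA they would come from the frozen IDM reward scored on each of the \(G\) rollouts.

```python
# Sketch of GRPO's group-relative advantage normalization (dummy rewards).
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """A_i = (R_i - mean(R)) / (std(R) + eps), computed within one group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

rewards = [0.9, 0.4, 0.7, 0.2, 0.8, 0.5, 0.6, 0.3]   # G = 8 rollouts, one prompt
adv = group_relative_advantages(rewards)
# Advantages sum to ~0; the highest-reward rollout gets the largest advantage.
```

Normalizing within the group removes the need for a learned value function: each rollout is scored only relative to its siblings from the same prompt, which is what makes the scalar IDM reward sufficient for policy-gradient training.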
3. Experiments and Main Results
Base Model
- Wan2.1-14B DiT backbone with diffusion forcing
- Initialized from the Large Video Planner (LVP) checkpoint
- SFT on embodiment-specific data → then GRPO with IDM reward
- LoRA rank 32, 8× A800 GPUs, batch size 32
Visual Rollout Quality (Human Evaluation, 210 prompts)
| Method | Kinematic | Interaction | Instruction | Perfect |
|---|---|---|---|---|
| Vidar (Wan2.2) | 67.6% | 66.7% | 87.6% | 62.9% |
| EVA (w/o RL) | 70.5% | 83.3% | 90.5% | 68.1% |
| EVA (with RL) | 91.4% | 86.2% | 89.5% | 83.8% |
EVA improves kinematic plausibility by +20.9% and perfect execution by +15.7% over the SFT-only baseline.
Simulation (RoboTwin 2.0, 21 bimanual tasks)
| Method | Average Success |
|---|---|
| ACT | 29.0% |
| Diffusion Policy | 29.5% |
| RDT | 37.1% |
| π₀ | 45.7% |
| EVA (w/o RL) | 46.2% |
| EVA (with RL) | 52.6% |
EVA outperforms strong VLA baselines (e.g., π₀) while using a single multi-task policy across all 21 tasks, whereas the baselines are trained per task.
Real-World Deployment (Agilex CobotMagic bimanual platform)
| Method | Seen (5 tasks) | OOD (5 tasks) |
|---|---|---|
| ACT | 42.0% | N/A |
| π₀ | 51.0% | 11.0% |
| Vidar | 44.0% | 34.0% |
| GE-Act | 43.0% | 3.0% |
| EVA (w/o RL) | 52.0% | 42.0% |
| EVA (with RL) | 64.0% | 60.0% |
Particularly strong OOD generalization: 60% success on novel tasks, far exceeding π₀ (11%) and GE-Act (3%).
IDM Validation
The IDM achieves 89.52% success rate when decoding ground-truth video demonstrations on RoboTwin, confirming it is a reliable reward model for the alignment phase.
4. Strengths
- Elegant problem formulation: the executability gap is a real and underexplored issue — using the IDM as both action decoder and reward model is a clean, dual-purpose design
- Dense reward from action space: the reward remains informative even for badly artifact-laden videos, since visual artifacts reliably produce kinematic violations
- Strong OOD generalization: the video world model paradigm + alignment produces the best generalization to novel tasks among all tested methods
- Practical simplicity: only requires an IDM trained on real robot data + domain knowledge about joint limits — no learned value function or human feedback needed
5. Limitations
- No contact dynamics modeling: the reward focuses on kinematic smoothness but does not model forces, friction, or torques — critical for precision contact-rich tasks
- Computational cost: diffusion-based video generation is expensive, limiting applicability to high-frequency reactive control
- Open-loop execution: the receding-horizon approach helps, but true closed-loop control with video world models remains challenging
- IDM as bottleneck: the IDM succeeds only 89.52% of the time even on ground-truth video, so roughly 10% of execution failures may stem from the IDM itself rather than the video generator
6. Takeaways
- Executability as an alignment target: shifting from visual/semantic fidelity to physical feasibility is a promising direction for robotic world models. The action space provides a natural, dense signal for alignment.
- IDM dual use: training an IDM for action decoding and then repurposing it as a reward model is efficient — no additional reward model training needed.
- Video world models generalize better: the decoupled paradigm (video planner + IDM) consistently shows stronger OOD performance than end-to-end VLA policies, likely due to leveraging internet-scale video priors.
- RL post-training works for video generation: the success of Flow-GRPO for aligning video world models mirrors the RLHF paradigm in LLMs, suggesting this could become standard practice for robotic video generation.
