[Paper Notes] Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance
Published:
TL;DR
Fast-dVLA is the first method to accelerate discrete diffusion VLA (dVLA) models to real-time 30 Hz control. The key insight: despite using bidirectional attention, dVLAs implicitly follow a left-to-right block-wise decoding pattern. Fast-dVLA exploits this by combining block-wise causal attention (enabling KV cache reuse) with diffusion forcing (enabling inter-block parallelism), achieving 2.8×–4.1× speedup while maintaining SOTA-level success rates across CALVIN, LIBERO, and SimplerEnv benchmarks, and demonstrating real-time performance on a real bimanual robot.
Paper Info
- Title: Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance
- Authors: Wenxuan Song*, Jiayi Chen*, Shuai Chen*, Jingbo Wang, Pengxiang Ding, Han Zhao, Yikai Qin, Xinhu Zheng, Donglin Wang, Yan Wang†, Haoang Li†
- Affiliation: HKUST (Guangzhou), ShanghaiTech University, Tsinghua University, Westlake University, Zhejiang University
- Date: 2026-03-31 (v1), 2026-04-07 (v3)
- Venue: arXiv preprint
- arXiv: 2603.25661
- Project Page: Fast-dVLA
1. Problem and Motivation
Discrete diffusion VLAs (dVLAs) like Dream-VLA and DD-VLA are promising alternatives to flow-matching VLAs — they offer better multimodal alignment and preserve pretrained VLM knowledge. However, they are far too slow for real-time robotic control (~30 Hz):
| Paradigm | Forwards per Sequence | Forward Speed | Inference Speed |
|---|---|---|---|
| Autoregressive VLA | High | Fast | Slow |
| Discrete Diffusion VLA (dVLA) | Low | Slow (no KV cache) | Slow |
| Block Diffusion VLA | Moderate | Fast | Moderate |
| Fast-dVLA (ours) | Low | Fast | Fast |
The root cause: dVLAs use bidirectional attention, which means KV representations change every denoising iteration — preventing KV cache reuse and making each forward pass expensive.
Key observation: Despite bidirectional attention, dVLAs exhibit an implicit block-wise autoregressive decoding tendency — earlier action blocks are decoded before later ones. This is because (1) the backbone is initialized from an AR VLM, retaining AR characteristics, and (2) actions have inherent temporal dependencies.
2. Method
Fast-dVLA has three core components:
2.1 Block-Wise Causal Attention for KV Cache Reuse
Replace bidirectional attention with block-wise causal attention: each block can only attend to prefix tokens and tokens within the current block. Once a block is fully decoded, its KV states are frozen and can be cached for all subsequent blocks — just like standard AR decoding but at the block level.
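The block-wise causal mask described above can be sketched as follows. This is an illustrative construction, not the paper's actual implementation; the prefix length, block count, and block size are placeholder parameters.

```python
import numpy as np

def block_causal_mask(num_prefix: int, num_blocks: int, block_size: int) -> np.ndarray:
    """Build a block-wise causal attention mask (True = attention allowed).

    Prefix tokens (vision/language) are visible to everything; action
    tokens in block i may attend to the prefix, to all earlier blocks,
    and bidirectionally within block i. Once block i is fully decoded,
    its rows never change, so its KV states can be frozen and cached.
    """
    n = num_prefix + num_blocks * block_size
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :num_prefix] = True  # everyone attends to the prefix
    for i in range(num_blocks):
        start = num_prefix + i * block_size
        end = start + block_size
        # block i attends to all action tokens up to and including itself
        mask[start:end, num_prefix:end] = True
    return mask

# Toy example: 2 prefix tokens, 3 action blocks of size 2 (8 tokens total).
m = block_causal_mask(num_prefix=2, num_blocks=3, block_size=2)
```

Within a block the mask is bidirectional (rows 2–3 see columns 2–3), but no block sees a later one, which is exactly what makes per-block KV freezing safe.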
2.2 Diffusion Forcing for Inter-Block Parallelism
Instead of waiting for block i to finish before starting block i+1 (as in standard block diffusion), Fast-dVLA assigns monotonically increasing noise levels to blocks:
\(t_1 < t_2 < \cdots < t_N\)
Earlier blocks have lower noise (closer to clean) while later blocks are more heavily masked. The model factorizes the denoising as:
\(p_\theta(Y^0 \mid Y^{t_{1:N}}) = \prod_{i=1}^{N} p_\theta\left(Y^0_{B_i} \mid Y^{t_1}_{B_1}, \ldots, Y^{t_i}_{B_i}\right)\)
This allows concurrent denoising across blocks — earlier blocks finish first, and their cached KV states accelerate later blocks.
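A minimal sketch of such a staggered noise schedule, assuming a linear per-block denoising trajectory and a fixed start offset between consecutive blocks (both hypothetical choices for illustration):

```python
def staggered_noise_levels(num_blocks: int, num_steps: int, offset: int):
    """Per-step noise levels in [0, 1] for each block (0 = clean, 1 = masked).

    Block i begins denoising `offset` global steps after block i-1, so at
    every step the levels satisfy t_1 <= t_2 <= ... <= t_N: earlier blocks
    are always cleaner, matching the diffusion-forcing ordering.
    """
    total = num_steps + offset * (num_blocks - 1)
    schedule = []
    for step in range(total + 1):
        levels = []
        for i in range(num_blocks):
            # progress of block i at this step (0 before it starts, capped at 1)
            progress = min(max(step - i * offset, 0), num_steps) / num_steps
            levels.append(1.0 - progress)
        schedule.append(levels)
    return schedule

sched = staggered_noise_levels(num_blocks=3, num_steps=4, offset=2)
```

Because earlier blocks reach zero noise first, their frozen KV states are available while later blocks are still denoising, which is the source of the inter-block parallelism.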
2.3 Asymmetric Distillation for Efficient Training
Rather than training from scratch, Fast-dVLA distills from a finetuned bidirectional dVLA teacher:
\(\mathcal{L}_{\mathrm{AD}} = \mathbb{E}\left[\sum_{i=1}^{N} D_{\mathrm{KL}}\left(p_\theta(\cdot \mid \text{causal view}) \,\|\, p_{\phi^-}(\cdot \mid \text{global view})\right)\right]\)
The teacher sees all blocks (bidirectional), while the student only sees causally preceding blocks. This asymmetry forces the student to approximate the teacher's richer global context using only restricted causal attention. Distillation converges in only 1/10 of the steps needed for training from scratch.
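The distillation objective reduces to a token-level KL between the two views' output distributions. A minimal numpy sketch, assuming pre-computed logits (the asymmetry lives in how the logits were produced — causal mask for the student, bidirectional for the frozen teacher — not in the loss itself):

```python
import numpy as np

def softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    z = logits - logits.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_distill_loss(student_logits: np.ndarray,
                            teacher_logits: np.ndarray,
                            eps: float = 1e-9) -> float:
    """Mean token-level D_KL(student || teacher), following the paper's
    ordering. student_logits: causal view; teacher_logits: global view.
    Both are (num_tokens, vocab_size) arrays; shapes are hypothetical.
    """
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    kl = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)
    return float(kl.mean())
```

When the student matches the teacher exactly the loss is zero; gradients would flow only through the student branch, since the teacher is frozen.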
2.4 Pipelined Parallel Decoding
At inference, blocks operate in a dynamic pipeline with two states:
- Semi-activated: a block enters the pipeline once the preceding block's completion ratio exceeds τ_add; only high-confidence tokens are decoded
- Fully-activated: a block transitions once its predecessor's completion ratio exceeds τ_act; at least 1/n of the remaining tokens are decoded per step
This mechanism balances speed and reliability while preserving temporal causality.
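The two-threshold gating rule can be sketched as a small state function. This is an illustrative reading of the schedule, using the ablation's reported values τ_add = 2/7 and τ_act = 4/7 as defaults:

```python
def block_states(completion, tau_add=2/7, tau_act=4/7):
    """Map per-block completion ratios (fraction of tokens decoded) to
    pipeline states. Block 0 has no predecessor and is always fully
    activated; each later block is gated on its predecessor's progress:
    inactive -> semi-activated (> tau_add) -> fully-activated (> tau_act).
    """
    states = []
    for i in range(len(completion)):
        if i == 0 or completion[i - 1] > tau_act:
            states.append("full")
        elif completion[i - 1] > tau_add:
            states.append("semi")
        else:
            states.append("inactive")
    return states
```

For example, with completion ratios `[0.6, 0.3, 0.0]`, block 1's predecessor is 60% done (above 4/7), so block 1 runs fully activated, while block 2's predecessor is only 30% done (above 2/7 but below 4/7), so block 2 decodes only high-confidence tokens.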
3. Experiments and Main Results
Paradigm Comparison (LIBERO)
| Method | Avg SR | Speed (tokens/s) | Speedup |
|---|---|---|---|
| Dream-VLA | 0.856 | 98.8 | 1.0× |
| + Fast-dLLM | 0.828 | 183.2 | 1.9× |
| + Block Diffusion | 0.858 | 181.7 | 1.8× |
| + Fast-dVLA | 0.870 | 313.1 | 3.2× |
| DD-VLA | 0.963 | 152.1 | 1.5× |
| + Fast-dLLM | 0.935 | 312.5 | 3.2× |
| + Block Diffusion | 0.967 | 322.1 | 3.3× |
| + Fast-dVLA | 0.966 | 402.7 | 4.1× |
Fast-dVLA achieves the best speed–performance trade-off, even slightly improving success rates in some cases.
CALVIN Long-Horizon (UD-VLA)
Fast-dVLA achieves 2.8× speedup on UD-VLA (625-token sequences) while maintaining competitive performance (Avg Len 4.54 vs 4.64 baseline), ranking among the top methods on CALVIN ABCD→D.
SimplerEnv (Dream-VLA)
Fast-dVLA achieves 59.3% average success rate at 366.4 tokens/s, outperforming:
- Flow-matching methods (π0: 27.1%, GR00T-N1: 36.5%)
- AR methods (π0-FAST: 32.1%)
- Other dVLA acceleration methods
Real-World Results
On a bimanual AgileX platform with 3 tasks:
- Achieves consistent 30 Hz execution frequency — the first dVLA to reach real-time control
- Nearly doubles efficiency over Dream-VLA on conveyor belt picking
- Maintains competitive success rates on semantic manipulation tasks
Training Efficiency
Asymmetric distillation converges in ~2,000 steps — only 1/5 of the original finetuning budget, and 1/10 of training from scratch.
4. Ablation Highlights
- Block size: Using multiples of action dimensionality (e.g., 7 for 7-DoF actions) preserves intrinsic temporal dependencies and yields the best speed/performance trade-off
- Confidence threshold (τ_conf): 0.5 balances 2.8× acceleration with only 2% performance drop
- Thresholds τ_add and τ_act: Set to 2/7 and 4/7 respectively for the pipelined decoding schedule
5. Strengths and Limitations
Strengths:
- First real-time dVLA — reaches 30 Hz on physical robots
- Clean, principled design: block-wise attention + diffusion forcing is a natural fit for the observed decoding pattern
- Highly efficient training via asymmetric distillation (1/10 steps)
- Generalizes across multiple dVLA architectures (Dream-VLA, DD-VLA, UD-VLA)
- Maintains or improves performance while dramatically accelerating inference
Limitations:
- Requires a pretrained bidirectional dVLA as teacher — not a standalone training method
- Block size must align with action dimensionality — less flexible for variable-length outputs
- The implicit AR tendency is observed empirically but not theoretically guaranteed for all architectures
- Real-world experiments limited to a single bimanual platform
6. Takeaways
- dVLAs have a hidden AR structure — even with bidirectional attention, the decoding order is block-wise left-to-right. This is a useful inductive bias, not a bug.
- Block-wise attention + diffusion forcing is a powerful combination: KV cache reuse from AR-style attention + inter-block parallelism from diffusion forcing = real-time discrete diffusion.
- Asymmetric distillation is remarkably efficient — converting a bidirectional dVLA to a fast block-wise one costs only ~2k training steps.
- dVLAs can now compete with flow-matching VLAs on speed while retaining their advantages in multimodal alignment and unified generation — a significant step toward practical deployment.
