[Paper Notes] ACG: Action Coherence Guidance for Flow-based VLA Models
TL;DR
Action Coherence Guidance (ACG) is a training-free, test-time guidance technique for flow-matching-based VLA models (GR00T-N1, pi0, SmolVLA, etc.). The problem: diffusion/flow policies memorize noise in human demos (jerks, pauses, jitter), producing temporally incoherent action sequences that cause failures in fine-grained manipulation. ACG’s solution: construct an incoherent denoising vector by replacing self-attention maps with identity matrices (forcing each action token to attend only to itself), then guide away from this incoherent direction during sampling. Results: +6.7 pp average success rate across RoboCasa, DexMimicGen, and real-world SO-101 tasks, with especially large gains on fine manipulation (+23.1 pp button pressing, +28.8 pp real-world pick-and-place). No retraining required.
Paper Info
- Title: ACG: Action Coherence Guidance for Flow-based VLA models
- Authors: Minho Park, Kinam Kim, Junha Hyung, Hyojin Jang, Hoiyeong Jin, Jooyeol Yun, Hojoon Lee, Jaegul Choo
- Affiliation: KAIST AI
- arXiv: 2510.22201
1. Problem: Action Incoherence in Flow-based VLAs
Flow matching policies trained via imitation learning have high generative capacity — but this capacity also memorizes imperfections in human demos:
- Jerks during teleoperation
- Pauses and hesitations
- Jitter from hand tremor
This degrades action coherence: the smoothness and consistency of successive actions within an action chunk. During deployment, incoherent actions cause:
- Instability at critical moments — fumbling near objects, pushing them away
- Trajectory drift — small noise accumulates, deviating from the desired path
This is especially catastrophic for fine-grained manipulation (button pressing, insertion, precise grasping).
2. Method: Action Coherence Guidance
2.1 CFG Recap (and why it doesn’t work well for VLAs)
Standard CFG for flow policies:
\[v_\theta^{\text{CFG}(\lambda)} = (1+\lambda) \, v_\theta(\mathbf{A}_t^\tau, \mathbf{o}_t, \ell_t, \tau) - \lambda \, v_\theta(\mathbf{A}_t^\tau, \mathbf{o}_t, \varnothing, \tau)\]Problem: in VLAs, removing the language condition $\ell_t$ changes the action distribution dramatically — the “unconditional” direction is not meaningful and causes unstable behavior.
2.2 ACG: Guide Away from Incoherence
Instead of guiding toward a condition, ACG guides away from incoherence:
\[v_\theta^{\text{ACG}(\lambda)} = (1+\lambda) \, v_\theta(\mathbf{A}_t^\tau, \mathbf{o}_t, \ell_t, \tau) - \lambda \, v_\theta^{\text{IC}}(\mathbf{A}_t^\tau, \mathbf{o}_t, \ell_t, \tau)\]where $v_\theta^{\text{IC}}$ is the incoherent denoising vector — same model, same inputs, but with modified self-attention.
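The ACG update rule above is a one-line vector combination once both passes are done. A minimal sketch, assuming the conditional velocity `v_cond` and the incoherent-pass velocity `v_ic` have already been computed (names are illustrative, not from the paper's codebase):

```python
import numpy as np

def acg_velocity(v_cond: np.ndarray, v_ic: np.ndarray, lam: float = 3.0) -> np.ndarray:
    """v_ACG = (1 + lam) * v_cond - lam * v_ic."""
    return (1.0 + lam) * v_cond - lam * v_ic

# Toy shapes: 16 action tokens x 7 DoF.
v_cond = np.ones((16, 7))
v_ic = np.zeros((16, 7))

# With lam = 0 the rule reduces to the vanilla conditional velocity.
assert np.allclose(acg_velocity(v_cond, v_ic, lam=0.0), v_cond)
```

Note the same template covers CFG: swap `v_ic` for the velocity computed with the language condition dropped.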
2.3 Constructing the Incoherent Vector
The key insight: in transformer-based flow policies, self-attention is what creates temporal coherence between action tokens. Each token (representing an action at a specific timestep) attends to all other tokens:
\[\text{Attn}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) V\]To generate an incoherent action sequence, replace the attention map with an identity matrix:
\[\text{Attn}^{\text{IC}}(Q, K, V) = I \cdot V = V\]This forces each action token to attend only to itself — no temporal communication — producing a temporally disconnected action chunk. Then ACG steers the generation away from this incoherent direction.
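The attention swap is tiny in code. A minimal single-head sketch in numpy (no batching or multi-head, which the real model would have):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn(Q, K, V):
    # Standard scaled dot-product attention: each token mixes all tokens.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def attn_ic(Q, K, V):
    # Identity attention map: softmax(...) replaced by I, so I @ V = V.
    # Each action token "attends" only to itself.
    return V

T, d = 8, 4  # 8 action tokens, head dim 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, T, d))
out = attn(Q, K, V)        # temporally mixed
out_ic = attn_ic(Q, K, V)  # no cross-token communication
```

Because the queries and keys never get used in the incoherent pass, an implementation can even skip computing them in the swapped layers.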
Implementation details:
- Replace self-attention in layers 4-6 (out of 8 total) with identity attention
- Share the first half of layers between base and incoherent passes → ~1.5x compute overhead
- Guidance scale $\lambda = 3.0$ (default)
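Putting the pieces together, the inference loop is a standard Euler flow sampler with two forward passes per step. A toy sketch, where `toy_policy` is a stand-in callable (not GR00T-N1) and the `identity_attn` flag marks where the real model would swap self-attention in layers 4-6 for the identity:

```python
import numpy as np

def toy_policy(a_tau, tau, identity_attn=False):
    # Stand-in velocity field. In the real model, identity_attn=True would
    # replace the self-attention maps in layers 4-6 with the identity; here
    # we just perturb the output so the two passes differ.
    v = -a_tau
    if identity_attn:
        v = v + 0.1 * np.sign(a_tau)  # crude "incoherent" deviation
    return v

def sample_acg(steps=10, lam=3.0, horizon=16, dof=7):
    # Start from Gaussian noise and integrate the guided velocity field.
    a = np.random.default_rng(0).normal(size=(horizon, dof))
    dt = 1.0 / steps
    for i in range(steps):
        tau = i * dt
        v_cond = toy_policy(a, tau)                      # base pass
        v_ic = toy_policy(a, tau, identity_attn=True)    # incoherent pass
        a = a + dt * ((1 + lam) * v_cond - lam * v_ic)   # ACG update
    return a

chunk = sample_acg()
```

In the real model the two passes share the first half of the layers, so the second pass only re-runs layers 4 onward; that is where the ~1.5x (rather than 2x) overhead comes from.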
3. Key Results
Main comparison (GR00T-N1 backbone)
| Method | RoboCasa | DexMG | Real: Strawberries | Real: Tic-Tac-Toe | Average |
|---|---|---|---|---|---|
| Vanilla GR00T-N1 | 32.6% | 40.6% | 43.6% | 38.3% | 38.8% |
| Ensemble (n=2) | 34.0% | 40.3% | 56.7% | 45.0% | 44.0% |
| Feature Smoothing | 34.4% | 42.4% | 57.8% | 45.0% | 44.9% |
| CFG | 35.0% | 41.5% | 50.0% | 43.3% | 42.5% |
| WNG | 35.0% | 42.0% | 65.6% | 48.3% | 47.7% |
| ACG (Ours) | 39.3% | 44.0% | 74.4% | 56.7% | 53.6% |
ACG outperforms all baselines by a clear margin, especially on real-world tasks.
Fine-grained tasks see the largest gains
| Skill | Vanilla | ACG | Improvement |
|---|---|---|---|
| Button pressing | — | — | +23.1 pp |
| Insertion | — | — | +11.8 pp |
| Real-world pick-and-place | — | — | +28.8 pp |
Action coherence metrics
| Method | ATV (rad/s, ↓) | Jerk_RMS (×10³ rad/s³, ↓) |
|---|---|---|
| Vanilla GR00T-N1 | 1.314 | 1.353 |
| Ensemble (n=5) | 0.984 | 1.172 |
| CFG | 1.332 | 1.317 |
| Incoherent ($v_\theta^{\text{IC}}$) | 4.509 | 1.993 |
| ACG | 1.130 | 1.156 |
The incoherent variant is indeed much worse than the baseline (validating the design), and ACG achieves the best smoothness while maintaining accuracy, unlike the ensemble, which smooths but loses precision.
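The two metrics are standard finite-difference statistics. A sketch under the assumption that ATV is the mean absolute first difference per control step and Jerk_RMS is the RMS of the third finite difference; the paper may normalize differently:

```python
import numpy as np

def atv(actions: np.ndarray, dt: float = 1.0) -> float:
    """Action total variation: mean |a_{t+1} - a_t| / dt over the chunk."""
    return float(np.abs(np.diff(actions, axis=0)).mean() / dt)

def jerk_rms(actions: np.ndarray, dt: float = 1.0) -> float:
    """RMS of the discrete jerk (third finite difference / dt^3)."""
    jerk = np.diff(actions, n=3, axis=0) / dt**3
    return float(np.sqrt((jerk**2).mean()))

# Sanity check: a constant-velocity trajectory has (numerically) zero jerk.
t = np.linspace(0, 1, 20)[:, None]
linear = np.repeat(t, 7, axis=1)  # 20 steps x 7 DoF
```

Lower is better for both, matching the ↓ arrows in the table.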
Ablation highlights
- Guidance scale: performance improves up to ~3.0, then degrades (divergence from pretrained distribution)
- Number of incoherent layers: 2-6 layers all help; robust to this choice
- Layer position: middle/later layers work best; early layers can hurt
- Complementary with Self-GAD: ACG improves intra-chunk coherence, Self-GAD improves inter-chunk coherence; combining both yields further gains
4. Connection to CFGRL
This paper has an interesting conceptual link to CFGRL. Both use guidance at test time to improve flow policy outputs — but for different purposes:
| | CFGRL | ACG |
|---|---|---|
| Guides toward | Optimality (higher advantage) | Temporal coherence |
| Guides away from | Unconditional (no goal) | Incoherent (no self-attention) |
| Requires | Optimality labels | Nothing (architectural perturbation) |
| Improves | Return / success rate | Action smoothness → success rate |
Both demonstrate that test-time guidance is a powerful, underexplored tool for robot policy improvement.
5. Strengths
- Training-free: zero additional training, works on any flow-based VLA
- Principled design: identity attention → incoherence is clean and well-motivated
- Large real-world gains: +28.8 pp on real SO-101 tasks — not just sim improvements
- Fine-grained manipulation: biggest gains exactly where they matter most
- Complementary: works alongside other guidance methods (Self-GAD)
6. Limitations
- ~1.5x compute overhead: requires an extra (partial) forward pass for the incoherent vector
- Hyperparameter sensitivity: guidance scale too high → performance degrades
- Shallow network assumption: replacing layers 4-6 of 8 works; unclear if the same recipe scales to much deeper networks
- Only tested on GR00T-N1: applicability to other VLA architectures (pi0, SmolVLA) not empirically verified
7. Takeaways
- Action coherence is a real bottleneck in flow-based VLA policies — demo noise gets memorized and causes deployment failures, especially for fine manipulation
- Self-attention = temporal coherence: replacing attention maps with identity is a clean way to construct a “negative example” for guidance
- Test-time guidance is broadly useful for robot policies — not just for image/video generation. Both ACG (coherence) and CFGRL (optimality) show this
- Intra-chunk coherence matters more than inter-chunk: ACG outperforms Self-GAD, and combining both helps further
- Practical recipe: for anyone deploying GR00T-N1 or similar flow-based VLAs, ACG is essentially free performance — no retraining, just modify inference
References
- [Paper] arXiv:2510.22201
