[Paper Notes] Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better


This post supports English / 中文 switching via the site language toggle in the top navigation.

TL;DR

This paper from Physical Intelligence identifies a critical problem in training continuous-action VLAs: naively adding a flow-matching action expert to a pre-trained VLM backbone degrades both training speed and knowledge transfer, because gradients from the randomly initialized action expert corrupt the backbone’s pre-trained representations. The proposed fix — knowledge insulation — is elegant: train the VLM backbone with discrete (FAST-tokenized) actions via next-token prediction, while simultaneously training a smaller action expert with flow matching on continuous actions, but stop the gradient from the action expert back into the backbone. This yields a model that trains as fast as π₀-FAST, runs fast at inference (via the small action expert), follows language instructions better, and generalizes more effectively — all by preserving the VLM’s pre-trained knowledge.

Paper Info

  • Title: Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better
  • Authors: Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z. Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, Sergey Levine
  • Affiliation: Physical Intelligence
  • Date: 2025 (preprint, under review)
  • Project page: pi.website/research/knowledge_insulation

1. Motivation

Vision-language-action (VLA) models promise to bring web-scale VLM knowledge to robot control. But there’s a tension:

| Approach | Inference Speed | Action Quality | VLM Knowledge Retention |
|---|---|---|---|
| Autoregressive VLAs (e.g., RT-2, π₀-FAST) | Slow (~750 ms/chunk) | Discrete, lossy | Good (next-token prediction) |
| Continuous-action VLAs (e.g., π₀) | Fast (10 Hz) | High-fidelity, smooth | Poor (gradient interference) |

The core dilemma: adding a continuous action expert (diffusion/flow matching head) to a VLM introduces randomly initialized parameters whose gradients damage the pre-trained backbone, hurting language following and generalization. Simply freezing the backbone doesn’t work either — VLM representations alone are insufficient for robotics without fine-tuning.

2. Method: Knowledge Insulation

The recipe has three key ingredients:

2.1 Joint Discrete/Continuous Action Prediction

Train the model to predict both discrete and continuous actions simultaneously:

\[\mathcal{L}_{\text{CO-VLA}}(\theta) = \mathbb{E}\left[-\sum_{j} M_j^{\ell} \log p_\theta(\hat{\ell}_{j+1}|x_{1:j}) + \alpha M^{\text{act}} \| \omega - a_{1:H} - f_\theta^a(a_{1:H}^{\tau,\omega}) \|^2 \right]\]
  • The VLM backbone is trained on discrete action tokens (FAST tokenization) via standard next-token prediction — this provides a clean learning signal
  • A separate action expert (300M parameter transformer) is trained with flow matching on continuous action chunks
  • At inference time, only the smaller action expert is used → fast continuous control
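To make the combined objective concrete, here is a minimal numpy sketch of a loss with both terms: a masked next-token cross-entropy on discrete (FAST) tokens for the backbone, plus the flow-matching regression term for the action expert. The function name, shapes, and default `alpha` are illustrative assumptions, not the paper's code.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def co_vla_loss(logits, targets, token_mask, v_pred, action, noise, act_mask, alpha=0.1):
    """Sketch of the joint discrete/continuous objective (shapes illustrative).

    logits:     (T, V) backbone next-token logits
    targets:    (T,)   target token ids (shifted by one position)
    token_mask: (T,)   1 where the cross-entropy loss applies
    v_pred:     (H, D) expert's predicted velocity for the action chunk
    action:     (H, D) ground-truth continuous action chunk a_{1:H}
    noise:      (H, D) Gaussian noise omega used to corrupt the chunk
    act_mask:   scalar or (H, 1), 1 where the flow-matching loss applies
    """
    # masked next-token cross-entropy on discrete FAST / language tokens
    probs = softmax(logits)
    ce = -np.log(probs[np.arange(len(targets)), targets] + 1e-12)
    ce_loss = (token_mask * ce).sum() / max(token_mask.sum(), 1)
    # flow-matching regression: the expert predicts the velocity
    # (noise - action), matching the omega - a_{1:H} - f term above
    fm_loss = (act_mask * ((noise - action - v_pred) ** 2)).mean()
    return ce_loss + alpha * fm_loss
```

If the expert predicts the velocity exactly (`v_pred = noise - action`), the second term vanishes and only the backbone's cross-entropy remains.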

2.2 Stop-Gradient (Knowledge Insulation)

The critical innovation: stop the gradient flow from the action expert back to the VLM backbone. The action expert can read backbone features (via cross-attention), but its gradients don’t write back into them:

\[P_{ab} = \text{softmax}\left(Q_a(X_a) \cdot \text{sg}(K_b(X_b))^T + A\right)\]

where sg is the stop-gradient operator. Value embeddings from the backbone are similarly detached. This means:

  • The backbone learns only from the clean autoregressive (discrete action + language) loss
  • The action expert learns to use backbone features without corrupting them
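The effect of the stop-gradient can be shown in a tiny scalar sketch with hand-written gradients: the backbone produces a feature, the expert reads it, and with `sg(.)` applied the expert's loss contributes nothing to the backbone weight. All names and the scalar setup are mine, for illustration only.

```python
def expert_grads(W_b, W_a, x, target, stop_grad=True):
    """Scalar sketch of knowledge insulation.

    Backbone feature: h = W_b * x (read by the expert).
    Expert output:    y = W_a * h, with loss 0.5 * (y - target)^2.
    With stop_grad=True, the path from the expert loss back to the
    backbone weight W_b is severed, mimicking the sg(.) operator.
    """
    h = W_b * x                                   # backbone feature
    y = W_a * h                                   # expert prediction
    err = y - target                              # d(loss)/dy
    g_Wa = err * h                                # expert trains normally
    g_Wb = 0.0 if stop_grad else err * W_a * x    # sg(.) zeroes this path
    return g_Wa, g_Wb
```

The expert's own gradient is identical in both cases; only the backward path into the backbone is cut, which is why the backbone keeps learning solely from the clean autoregressive loss.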

2.3 VLM Data Co-training

Co-train the model on general VLM tasks (image captioning, VQA, object localization) alongside robot data. This further preserves pre-trained knowledge and improves language following and generalization to novel objects.
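One simple way to realize such co-training is per-step batch mixing between the two data sources. The sketch below and its 70/30 ratio are illustrative assumptions; the paper does not specify this sampler.

```python
import random

def make_cotraining_sampler(robot_batches, vlm_batches, robot_frac=0.7, seed=0):
    """Sketch of co-training batch mixing: each step draws a robot-data
    batch with probability robot_frac, otherwise a general VLM batch
    (captioning / VQA / localization). The ratio is illustrative."""
    rng = random.Random(seed)

    def next_batch():
        pool = robot_batches if rng.random() < robot_frac else vlm_batches
        return rng.choice(pool)

    return next_batch
```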

3. Architecture Details

  • VLM Backbone: PaliGemma (2B language model, 3B total), initialized from pre-trained weights
  • Action Expert: 300M parameter transformer with separate Q/K/V projections
  • Action Representation: FAST tokenization for discrete actions (training signal for backbone), flow matching for continuous actions (used at inference)
  • State Representation: Text state or continuous state both work well; special token state is worse
  • Embeddings interact via self-attention with a carefully designed mask — information flows unidirectionally from the VLM backbone to the action expert, never the other way
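The unidirectional information flow above can be sketched as a blockwise attention mask: VLM prefix tokens attend causally among themselves and never see action-expert tokens, while expert tokens read the whole prefix. Letting expert tokens attend bidirectionally within their own chunk is my assumption here (the π₀-style convention), not a detail the notes spell out.

```python
import numpy as np

def insulation_mask(n_vlm, n_act):
    """Sketch of the blockwise attention mask (True = attention allowed).

    Rows are queries, columns are keys. VLM tokens occupy indices
    [0, n_vlm); action-expert tokens occupy [n_vlm, n_vlm + n_act).
    """
    n = n_vlm + n_act
    allow = np.zeros((n, n), dtype=bool)
    # causal self-attention within the VLM prefix
    allow[:n_vlm, :n_vlm] = np.tril(np.ones((n_vlm, n_vlm), dtype=bool))
    # expert tokens read the full VLM prefix (via the sg(.) features)...
    allow[n_vlm:, :n_vlm] = True
    # ...and see each other within the action chunk (assumed bidirectional)
    allow[n_vlm:, n_vlm:] = True
    # note: allow[:n_vlm, n_vlm:] stays False — the VLM never sees the expert
    return allow
```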

4. Experiments and Main Results

Real-World Tasks

Evaluated on complex, long-horizon manipulation tasks across multiple robot embodiments:

| Task | Robot Type | Key Finding |
|---|---|---|
| Items in drawer | Static single-arm | Ours significantly outperforms all baselines (p<0.001 vs. most) |
| Table bussing | Static single-arm | Ours best performance + fast inference; π₀-FAST 2× slower |
| T-shirt folding | Static bimanual | Ours matches or exceeds π₀-FAST (p=0.765) |
| Mobile manipulation (4 tasks) | Mobile bimanual | Ours w/ VLM data clearly best |

Key Quantitative Results

  • vs π₀: The proposed method significantly outperforms π₀ on language following and task performance. π₀ struggles because its action expert gradients degrade the backbone
  • vs π₀-FAST: Comparable task performance, but 2× faster wall-clock time (π₀-FAST requires slow autoregressive decoding at ~750ms per chunk)
  • vs joint-training (no stop-grad): Stop-gradient consistently improves language following; without it, the backbone gets corrupted similarly to π₀
  • Training speed: Converges as fast as π₀-FAST, while π₀ requires 7.5× more training steps for similar performance

Simulation Benchmarks

| Method | LIBERO-90 | LIBERO-Spatial |
|---|---|---|
| π₀ | 85.2 | 96.8 |
| π₀-FAST | 60.2 | 96.8 |
| OpenVLA-OFT | 94.5 | 97.6 |
| Ours (generalist) | 96.0 | 98.0 |

State-of-the-art on LIBERO-90 and LIBERO-Spatial.

DROID Benchmark

The proposed method scores 0.55 ± 0.09, compared with 0.49 ± 0.09 for π₀ and 0.45 ± 0.09 for π₀-FAST.

Language Following

Stopping the gradient flow from the action expert is an effective way to improve language following. Co-training on VLM data further enhances this. The model pays more attention to language inputs rather than just overfitting to visual patterns.

Generalization to Novel Objects

Co-training on VLM data is particularly important for OOD generalization — the model transfers semantic knowledge from captioning/VQA tasks to robotic manipulation of unseen objects.

5. Ablation Highlights

  • Freezing backbone: 0% performance — VLM representations alone aren’t enough for robotics
  • HybridVLA (allows AR tokens to attend to flow-matching inputs): Significantly worse than the proposed masking strategy
  • Naive tokenization vs FAST: FAST provides a better representation learning signal, though naive tokenization still works
  • Without VLM data co-training: Slightly worse task completion, and significantly worse language following (most pronounced for the joint-training variant)

6. Strengths

  • Clean, principled solution to a well-identified problem (gradient interference from action experts)
  • Comprehensive experimental evaluation across diverse real-world tasks and embodiments
  • Achieves the best of both worlds: fast training (like FAST), fast inference (like π₀), strong language following and generalization
  • The three ingredients (joint training, stop-gradient, VLM co-training) are independently ablated

7. Limitations

  • Training with both discrete and continuous outputs increases computational cost by ~20% (offset by faster convergence)
  • Language following, while improved, is still not perfect — the model still occasionally ignores language instructions, an artifact of the training data distribution
  • Evaluation limited to the π₀/PaliGemma architecture family

8. Takeaways

  1. Gradient interference is real and severe: Randomly initialized action experts can badly damage pre-trained VLM representations. This is a fundamental issue for any VLA that adds continuous action heads to pre-trained backbones.
  2. Stop-gradient is a simple but powerful fix: By insulating the backbone from action expert gradients, you preserve pre-trained knowledge while still allowing the action expert to leverage backbone features.
  3. Discrete tokens as a representation learning signal: Even if you want continuous actions at inference, training the backbone with discrete action tokens provides a cleaner, more compatible learning signal.
  4. VLM co-training matters: Mixing in general VLM tasks during VLA training is not just regularization — it actively helps language following and semantic generalization.
  5. Architecture design matters as much as training recipe: The attention mask design (unidirectional flow from VLM to action expert) is critical. Bidirectional attention between action representations hurts performance.

本文支持 English / 中文 切换,请使用顶部导航栏的语言切换按钮。

概要

本文来自 Physical Intelligence,揭示了训练连续动作 VLA 时的一个关键问题:将 flow matching 动作专家模块朴素地添加到预训练 VLM 骨干网络上,会同时损害训练速度和知识迁移——因为随机初始化的动作专家产生的梯度会破坏骨干网络的预训练表征。提出的解决方案——知识隔离(Knowledge Insulation)——非常简洁:使用离散化动作(FAST token)通过 next-token prediction 训练 VLM 骨干网络,同时用 flow matching 训练一个较小的动作专家来生成连续动作,但阻断动作专家到骨干网络的梯度回传。这样得到的模型训练速度与 π₀-FAST 相当,推理速度快(使用小型动作专家),语言指令跟随能力更好,泛化能力更强——核心就是保护了 VLM 的预训练知识。

论文信息

  • 标题: Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better
  • 作者: Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z. Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, Sergey Levine
  • 机构: Physical Intelligence
  • 时间: 2025(预印本,审稿中)
  • 项目主页: pi.website/research/knowledge_insulation

1. 研究动机

视觉-语言-动作(VLA)模型致力于将网络规模 VLM 知识引入机器人控制,但存在核心矛盾:

| 方案 | 推理速度 | 动作质量 | VLM 知识保留 |
|---|---|---|---|
| 自回归 VLA(如 RT-2, π₀-FAST) | 慢(~750 ms/chunk) | 离散化,有损 | 好(next-token prediction) |
| 连续动作 VLA(如 π₀) | 快(10 Hz) | 高精度,平滑 | 差(梯度干扰) |

核心困境:给 VLM 添加连续动作专家(扩散/flow matching 头)会引入随机初始化的参数,其梯度会损害预训练骨干网络,导致语言跟随能力和泛化能力下降。而直接冻结骨干网络也不行——VLM 的表征本身不足以支撑机器人控制任务。

2. 方法:知识隔离

方法包含三个核心要素:

2.1 离散/连续动作联合预测

同时训练模型预测离散动作和连续动作:

\[\mathcal{L}_{\text{CO-VLA}}(\theta) = \mathbb{E}\left[-\sum_{j} M_j^{\ell} \log p_\theta(\hat{\ell}_{j+1}|x_{1:j}) + \alpha M^{\text{act}} \| \omega - a_{1:H} - f_\theta^a(a_{1:H}^{\tau,\omega}) \|^2 \right]\]
  • VLM 骨干网络通过标准 next-token prediction 在离散动作 token(FAST 编码)上训练——提供干净的学习信号
  • 独立的动作专家(300M 参数的 transformer)通过 flow matching 在连续动作块上训练
  • 推理时只使用较小的动作专家 → 快速连续控制

2.2 梯度阻断(知识隔离)

关键创新:阻断从动作专家到 VLM 骨干网络的梯度流。动作专家可以通过交叉注意力读取骨干网络的特征,但其梯度不会回写到骨干网络:

\[P_{ab} = \text{softmax}\left(Q_a(X_a) \cdot \text{sg}(K_b(X_b))^T + A\right)\]

其中 sg 是 stop-gradient 算子。来自骨干网络的 value 嵌入同样被 detach。这意味着:

  • 骨干网络只从干净的自回归损失(离散动作 + 语言)中学习
  • 动作专家学习利用骨干网络特征,但不会破坏它们

2.3 VLM 数据联合训练

在机器人数据的同时,联合训练通用 VLM 任务(图像描述、VQA、目标定位)。这进一步保留了预训练知识,提升了语言跟随能力和对新物体的泛化能力。

3. 架构细节

  • VLM 骨干网络: PaliGemma(2B 语言模型,3B 总参数),从预训练权重初始化
  • 动作专家: 300M 参数的 transformer,具有独立的 Q/K/V 投影
  • 动作表征: FAST 编码用于离散动作(为骨干网络提供训练信号),flow matching 用于连续动作(推理时使用)
  • 状态表征: 文本状态和连续状态均表现良好;特殊 token 状态较差
  • 嵌入通过自注意力交互,使用精心设计的掩码——信息单向从 VLM 流向动作专家

4. 实验与主要结果

真实世界任务

在多种机器人形态的复杂长时间操作任务上进行评估:

| 任务 | 机器人类型 | 关键发现 |
|---|---|---|
| 物品放入抽屉 | 静态单臂 | 本方法显著优于所有基线(vs 多数 p<0.001) |
| 桌面清理 | 静态单臂 | 本方法性能最佳 + 推理快;π₀-FAST 慢 2 倍 |
| T恤折叠 | 静态双臂 | 本方法与 π₀-FAST 持平或更优(p=0.765) |
| 移动操作(4个任务) | 移动双臂 | 本方法 + VLM 数据明显最优 |

关键定量结果

  • vs π₀: 本方法在语言跟随和任务表现上显著优于 π₀。π₀ 表现差是因为动作专家梯度降解了骨干网络
  • vs π₀-FAST: 任务表现相当,但实际用时快 2 倍(π₀-FAST 需要约 750ms/chunk 的慢速自回归解码)
  • vs 联合训练(无 stop-grad): 梯度阻断持续改善语言跟随能力;若不阻断,骨干网络会像 π₀ 一样被破坏
  • 训练速度: 收敛速度与 π₀-FAST 相当,而 π₀ 需要 7.5 倍的训练步数才能达到类似性能

仿真基准

| 方法 | LIBERO-90 | LIBERO-Spatial |
|---|---|---|
| π₀ | 85.2 | 96.8 |
| π₀-FAST | 60.2 | 96.8 |
| OpenVLA-OFT | 94.5 | 97.6 |
| 本方法(通用模型) | 96.0 | 98.0 |

在 LIBERO-90 和 LIBERO-Spatial 上达到 SOTA。

DROID 基准

本方法得分 0.55 ± 0.09,而 π₀ 为 0.49 ± 0.09,π₀-FAST 为 0.45 ± 0.09。

语言跟随

阻断动作专家的梯度流是改善语言跟随能力的有效方式。联合训练 VLM 数据进一步增强了这一能力。模型更关注语言输入,而非仅仅过拟合视觉模式。

新物体泛化

VLM 数据联合训练对分布外泛化尤为重要——模型能将描述/VQA 任务中的语义知识迁移到对未见物体的机器人操作中。

5. 消融实验亮点

  • 冻结骨干网络: 0% 性能——VLM 表征本身不足以支撑机器人控制
  • HybridVLA(允许 AR token 注意到 flow-matching 输入): 显著差于本文的掩码策略
  • 朴素 tokenization vs FAST: FAST 提供更好的表征学习信号,但朴素 tokenization 也能工作
  • 无 VLM 数据联合训练: 任务完成率略有下降,联合训练的语言跟随能力显著下降

6. 优势

  • 针对明确识别的问题(动作专家梯度干扰)提出了简洁、有原则的解决方案
  • 全面的实验评估,涵盖多种真实世界任务和机器人形态
  • 同时实现了多个优势:训练快(如 FAST)、推理快(如 π₀)、语言跟随和泛化能力强
  • 三个要素(联合训练、梯度阻断、VLM 联合训练)均有独立消融验证

7. 局限性

  • 同时训练离散和连续输出增加约 20% 的计算开销(被更快的收敛所抵消)
  • 语言跟随虽有改善但仍不完美——训练数据分布仍会导致偶尔忽略语言指令
  • 评估局限于 π₀/PaliGemma 架构家族

8. 核心启示

  1. 梯度干扰是真实且严重的:随机初始化的动作专家会严重破坏预训练 VLM 表征。这对任何向预训练骨干网络添加连续动作头的 VLA 都是根本性问题。
  2. Stop-gradient 简单但强大:通过将骨干网络与动作专家梯度隔离,可以保留预训练知识的同时让动作专家利用骨干网络特征。
  3. 离散 token 作为表征学习信号:即使推理时需要连续动作,用离散动作 token 训练骨干网络也能提供更干净、更兼容的学习信号。
  4. VLM 联合训练很重要:在 VLA 训练中混合通用 VLM 任务不仅是正则化——它积极帮助语言跟随和语义泛化。
  5. 架构设计与训练方案同样重要:注意力掩码设计(VLM 到动作专家的单向信息流)至关重要,双向注意力会损害性能。