[Paper Notes] Goal-VLA: Image-Generative VLMs as Object-Centric World Models Empowering Zero-shot Robot Manipulation
TL;DR
Goal-VLA proposes a zero-shot robotic manipulation framework that uses image-generative VLMs as object-centric world models. Instead of training end-to-end VLAs on expensive paired action data, Goal-VLA generates a goal image depicting the desired task outcome, extracts a precise 3D object pose from it via feature matching and point cloud registration, and then uses a training-free low-level policy to execute the manipulation. A novel Reflection-through-Synthesis mechanism iteratively validates and refines the generated goal image before execution. The result: 59.9% average success in RLBench simulation (vs. 26% for the best baseline MOKA) and 60% in real-world tasks — all zero-shot, with no task-specific fine-tuning.
Paper Info
- Title: Goal-VLA: Image-Generative VLMs as Object-Centric World Models Empowering Zero-shot Robot Manipulation
- Authors: Haonan Chen, Jingxiang Guo, Bangjun Wang, Tianrui Zhang, Xuchuan Huang, Boren Zheng, Yiwen Hou, Chenrui Tie, Jiajun Deng, Lin Shao†
- Affiliations: National University of Singapore, HKU, Peking University, Tsinghua University
- arXiv: 2506.23919
- Project page: nus-lins-lab.github.io/goalvlaweb
1. Problem and Motivation
Generalization in robotic manipulation is hard. Two dominant VLA paradigms both have critical weaknesses:
| Paradigm | Intermediate Repr. | Training-Free? | Precise 3D Grounding? |
|---|---|---|---|
| End-to-end VLAs (OpenVLA, π₀) | N/A | ✗ | ✗ |
| Hierarchical VLAs (MOKA, VoxPoser, SUSIE) | Keypoints / Value maps / Subgoal images | Partially | ✗ |
| Goal-VLA (this paper) | Object state (goal image → 3D pose) | ✓ | ✓ |
The key insight: object state representation is the golden interface that naturally separates high-level semantic reasoning from low-level spatial control. End-to-end models need massive action data. Hierarchical models either use sparse representations (keypoints — not enough spatial detail) or dense ones (subgoal images — but then need trained low-level policies to interpret them). VLMs excel at semantic reasoning but struggle with precise spatial reasoning.
Goal-VLA resolves this by letting the VLM do what it’s good at (semantic goal generation) and offloading spatial grounding to a dedicated geometric module.
2. Method
The framework has three stages:
2.1 Goal State Reasoning (High-Level)
Given an RGB-D observation \(O = (I, D)\) and language instruction \(L\):
- Prompt Enhancement: Feed \(I\) and \(L\) into a text-output VLM (Gemini 2.5 Pro) to produce a richer, more descriptive prompt \(L_e\).
- Goal Image Generation: An image-generative VLM (Gemini 2.5 Flash-image) generates a candidate goal image \(I'_{\text{cand}}\).
- Reflection-through-Synthesis Loop (the key novelty):
- Synthesize: Segment the target object from the candidate goal using Grounded SAM, overlay it onto the original scene with partial transparency
- Reflect: A Reflector VLM evaluates whether the synthesized image is semantically correct and physically feasible
- Refine: If rejected, the Reflector generates corrective feedback for the next generation attempt
- Repeat until validated or max iterations reached
This is clever — the synthesis step grounds the reflection by showing the goal object in the context of the original scene, making errors much easier to spot (e.g., the VLM moving the pan along with the tomato).
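The loop above can be sketched as plain control flow. This is a minimal sketch, not the paper's actual implementation: `generate_goal`, `segment`, `overlay`, and `reflect` are hypothetical stand-ins for the image-generative VLM, Grounded SAM, the compositing step, and the Reflector VLM.

```python
def reflection_through_synthesis(scene, prompt, generate_goal, segment,
                                 overlay, reflect, max_iters=3):
    """Generate a goal image, then validate it in the context of the scene.

    All callables are hypothetical stand-ins for the paper's components.
    """
    candidate, feedback = None, None
    for _ in range(max_iters):
        candidate = generate_goal(prompt, feedback)    # image-generative VLM
        obj = segment(candidate)                       # e.g. Grounded SAM mask
        synthesized = overlay(scene, obj, alpha=0.5)   # semi-transparent composite
        ok, feedback = reflect(synthesized, prompt)    # Reflector VLM verdict + critique
        if ok:
            return candidate
    return candidate  # fall back to the last attempt if never validated
```

The key design point is that `reflect` sees the candidate object overlaid on the *original* scene, so spurious changes (like the pan moving with the tomato) stand out visually.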
2.2 Spatial Grounding
Once a valid goal image \(I'\) is obtained, convert the semantic goal into a precise 3D transformation:
Semantic Matching: Use Geo-Aware features to find pixel correspondences between the initial image \(I\) and goal image \(I'\):
\((x', y') = \arg\max_{(p,q)} \frac{f_{(x,y)} \cdot f'_{(p,q)}}{\|f_{(x,y)}\| \, \|f'_{(p,q)}\|}\)
This is necessary because the generated goal image is semantically correct but may not preserve instance-level appearance — so traditional optical flow fails.
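A brute-force version of this argmax is a few lines of numpy. This is an illustrative sketch: `match_pixel` is a hypothetical helper, and any dense descriptor map stands in for the Geo-Aware features assumed precomputed here.

```python
import numpy as np

def match_pixel(f_src, feats_goal):
    """Return the goal pixel whose feature is most cosine-similar to f_src.

    f_src: (C,) feature at (x, y) in the initial image.
    feats_goal: (H, W, C) dense feature map of the goal image.
    """
    H, W, C = feats_goal.shape
    flat = feats_goal.reshape(-1, C)
    # Cosine similarity of f_src against every goal-image feature.
    sims = flat @ f_src / (np.linalg.norm(flat, axis=1) * np.linalg.norm(f_src) + 1e-8)
    idx = int(np.argmax(sims))
    return idx // W, idx % W  # (row, col) of the best match
```

In practice this runs per object pixel (or batched as one matrix product), yielding the 2D correspondences that the registration step lifts to 3D.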
Point Cloud Registration: Lift 2D to 3D using depth, aligning the two depth scales via least-squares regression on background pixels, i.e., outside the union of the object masks \(M\) and \(M'\):
\(D[(M \cup M')^c] = s_1 \cdot D'[(M \cup M')^c] + b\)
Then solve for a similarity transformation between the initial and goal object point clouds \(P\) and \(P'\) using the Umeyama algorithm:
\(s_2 \cdot P' = RP + t\)
where \(R \in SO(3)\), \(t \in \mathbb{R}^3\), and \(s_2\) accounts for residual scale differences between the two reconstructions.
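Both fits have closed forms. Below is a sketch under the paper's setup (not its actual code): a linear fit for the depth-scale alignment, and the classical Umeyama closed form for the similarity transform, written in the standard convention \(Y \approx c\,RX + t\) (the paper places the scale on the goal side, which is the same transform rearranged).

```python
import numpy as np

def align_depth(d_bg, d_bg_goal):
    """Least-squares fit D = s1 * D' + b over background depth pixels."""
    s1, b = np.polyfit(d_bg_goal, d_bg, 1)
    return s1, b

def umeyama(X, Y):
    """Closed-form similarity transform (Umeyama, 1991): Y ≈ c * R @ X + t.

    X, Y: (N, 3) corresponding object points in the initial and goal frames.
    """
    mu_x, mu_y = X.mean(0), Y.mean(0)
    Xc, Yc = X - mu_x, Y - mu_y
    Sigma = Yc.T @ Xc / len(X)                  # cross-covariance
    U, D, Vt = np.linalg.svd(Sigma)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                            # guard against reflections
    R = U @ S @ Vt
    var_x = (Xc ** 2).sum() / len(X)
    c = np.trace(np.diag(D) @ S) / var_x
    t = mu_y - c * R @ mu_x
    return c, R, t
```

On noiseless correspondences this recovers the transform exactly; with real matches it gives the least-squares optimum, which is why a robust correspondence step upstream matters.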
2.3 Low-Level Policy
- Contact Module: Sample-based method to find feasible contact poses on the object’s point cloud (surface normals → collision filtering → geometric scoring)
- Goal Pose: Apply the computed \((R, t)\) transformation to the contact pose
- Motion Planning: Standard sample-based planner to execute the trajectory
The entire low-level policy is training-free — no action data needed.
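The "Goal Pose" step is a single rigid-body composition. A minimal sketch, assuming homogeneous 4x4 poses in the world frame (`goal_contact_pose` is a hypothetical helper, not the paper's API):

```python
import numpy as np

def goal_contact_pose(T_contact, R, t):
    """Move a sampled contact pose by the object's goal transform.

    T_contact: 4x4 homogeneous gripper contact pose in the world frame.
    R (3x3), t (3,): object transform from the spatial grounding stage.
    Rigid-body assumption: the grasp moves together with the object.
    """
    T_obj = np.eye(4)
    T_obj[:3, :3] = R
    T_obj[:3, 3] = t
    return T_obj @ T_contact  # goal pose handed to the motion planner
```

The returned pose is what the sample-based planner targets, so no learned policy is needed anywhere in the chain.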
3. Experiments and Main Results
Simulation (RLBench, 8 tasks, 100 trials each)
| Method | Paradigm | Avg Success |
|---|---|---|
| OpenVLA | End-to-end | 0.2% |
| π₀ | End-to-end | 0.0% |
| SUSIE | Hierarchical | 0.0% |
| VoxPoser | Hierarchical | 5.8% |
| MolmoAct | End-to-end | 11.3% |
| MOKA | Hierarchical | 26.0% |
| Goal-VLA | Hierarchical | 59.9% |
End-to-end models completely fail in zero-shot settings — they’re brittle without in-domain action data. Goal-VLA more than doubles the best baseline.
Real World (4 tasks, 10 trials each)
| Method | Tomato | Sweeping | Duck | Bottle | Avg |
|---|---|---|---|---|---|
| OpenVLA | 0/10 | 0/10 | 0/10 | 0/10 | 0% |
| MOKA | 5/10 | 1/10 | 3/10 | 0/10 | 22.5% |
| MolmoAct | 5/10 | 0/10 | 6/10 | 0/10 | 27.5% |
| Goal-VLA | 9/10 | 4/10 | 7/10 | 4/10 | 60% |
Ablation Study
| Configuration | Avg Success |
|---|---|
| Base (no enhancement, no reflection) | 40.0% |
| + Reflector only | 51.2% |
| + Input Enhancement only | 67.5% |
| + Both (max 1 reflection) | 83.8% |
| + Both (max 3 reflections) | 88.8% |
Input Enhancement contributes the most (+27.5pp), Reflection adds +11.2pp, and they’re complementary. More reflection iterations help further.
4. Strengths
- Truly zero-shot: No task-specific fine-tuning, no paired action data — the whole pipeline uses off-the-shelf foundation models
- Object-centric abstraction: By focusing on object state rather than agent-centric representations, the framework is inherently cross-embodiment
- Reflection-through-Synthesis is well-designed: The synthesis overlay trick makes VLM self-evaluation much more reliable by providing in-context visual comparison
- Strong empirical results: 2.3× the best baseline in simulation, 2.2× in real world
- Minimal assumptions: Only requires a single RGB-D view and language instruction — no pre-scanned maps, object meshes, or task-specific priors
5. Limitations
- Depth estimation bottleneck: Most real-world failures trace back to inaccurate depth in the spatial grounding module, especially for precision-demanding tasks (duck weighing, bottle stand-up)
- High-level reasoning failures: Table sweeping requires sophisticated semantic understanding that the VLM sometimes gets wrong
- Rigid-body assumption: The framework assumes rigid object transformations — deformable objects or articulated manipulations would need extensions
- Latency: Multiple VLM calls (prompt enhancement → image generation → reflection loop) means this is not a real-time system
- Limited task complexity: All evaluated tasks are single-step pick-and-place style — long-horizon multi-step manipulation is not addressed
- Dependency on commercial VLMs: Uses Gemini 2.5 Pro/Flash, which may not always be available or affordable
6. Takeaways
- Object state as the interface between high-level and low-level is a powerful design choice — it cleanly separates what VLMs are good at (semantics) from what geometric methods are good at (spatial precision)
- Image-generative VLMs as world models is a natural and underexplored direction — instead of training robot-specific world models, just use the VLM’s ability to imagine future states
- Reflection-through-Synthesis is a generally useful technique: synthesize the proposed change in context before evaluating it, rather than evaluating the generated output in isolation
- The massive gap between end-to-end VLAs (0-0.2%) and Goal-VLA (59.9%) in zero-shot settings is striking — it suggests that current end-to-end VLAs have essentially no zero-shot capability outside their training distribution
