[Paper Notes] TiPToP: A Modular Open-Vocabulary Planning System for Robotic Manipulation

Published: March 14, 2026

TL;DR

TiPToP is a modular robotic manipulation system that turns a stereo RGB observation plus a natural-language instruction into a complete manipulation plan, without any robot training data.

Its core claim is simple but important: for long-horizon tabletop manipulation, a pipeline of foundation-model perception, explicit task-and-motion planning, and precise execution can compete with, and often outperform, a large end-to-end VLA policy fine-tuned on 350 hours of embodiment-specific demonstrations.

What makes the paper stand out is not only the benchmark result but also the systems argument: modularity is still a strong design choice in robotics, because it gives:

better semantic grounding on distractor-heavy tasks
stronger multi-step reasoning through symbolic planning
clearer failure diagnosis at the component level
easier cross-embodiment deployment

Paper Info

Title: TiPToP: A Modular Open-Vocabulary Planning System for Robotic Manipulation
Authors: William Shen, Nishanth Kumar, Sahit Chintalapudi, Jie Wang, Christopher Watson, Edward Hu, Jing Cao, Dinesh Jayaraman, Leslie Pack Kaelbling, Tomas Lozano-Perez
Affiliations: MIT CSAIL, University of Pennsylvania
arXiv: 2603.09971
Project page: tiptop-robot.github.io
Paper type: robotic manipulation / modular systems / task and motion planning

1. Problem Setting and Motivation

The paper studies a practical manipulation setting:

input: a stereo RGB image pair and a language instruction
output: robot joint trajectories and gripper commands that complete the task

The target is not just simple pick-and-place. The tasks include:

distractor-heavy object selection
semantic grounding such as “matching plate” or “largest toy”
multi-step manipulation with obstacle removal or packing

The authors position TiPToP against a strong baseline: pi0.5-DROID, a vision-language-action model fine-tuned on 350 hours of embodiment-specific demonstrations.

The motivation is clear:

end-to-end VLAs are appealing, but expensive in data and hard to debug
classical TAMP is structured and interpretable, but has usually been too brittle or too tightly engineered
recent foundation models make it possible to revisit modular planning with much stronger perception

2. System Overview

TiPToP is split into three modules:

Perception
Planning
Execution

The full policy is planner-based. It observes the scene once at the beginning and then executes an open-loop plan.

2.1 Perception module

The perception module builds an object-centric scene representation from the initial stereo observation and the language instruction.

Its two branches run in parallel:

a 3D vision branch for depth and grasp generation
a semantic branch for object detection, segmentation, and goal grounding

The key ingredients are:

FoundationStereo for dense stereo depth
M2T2 for 6-DoF grasp proposals
Gemini Robotics-ER 1.5 for open-vocabulary object detection and symbolic goal grounding
SAM-2 for segmentation

The output is a set of per-object meshes, candidate grasps, and symbolic goal predicates.
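
To make that output concrete, here is a minimal sketch of what such an object-centric scene representation could look like. All class and field names (`Grasp`, `SceneObject`, `Scene`, the quaternion convention, the grasp score) are my own assumptions for illustration, not the data structures used in the TiPToP code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Grasp:
    """One 6-DoF grasp candidate (hypothetical layout)."""
    position: Vec3
    orientation: Quat  # unit quaternion (x, y, z, w); convention assumed here
    score: float       # higher means more promising


@dataclass(frozen=True)
class GoalPredicate:
    """A symbolic goal atom such as On(mug, plate)."""
    name: str
    args: Tuple[str, ...]


@dataclass
class SceneObject:
    label: str
    mesh_vertices: List[Vec3]
    grasps: List[Grasp] = field(default_factory=list)


@dataclass
class Scene:
    objects: Dict[str, SceneObject]
    goal: List[GoalPredicate]

    def ungrounded_args(self) -> List[str]:
        """Goal arguments that do not name any detected object."""
        return sorted({a for g in self.goal for a in g.args if a not in self.objects})

    def best_grasp(self, label: str) -> Grasp:
        """Highest-scoring grasp candidate for one object."""
        return max(self.objects[label].grasps, key=lambda g: g.score)


identity = (0.0, 0.0, 0.0, 1.0)
scene = Scene(
    objects={
        "mug": SceneObject(
            "mug",
            [(0.0, 0.0, 0.0), (0.0, 0.0, 0.1)],
            [Grasp((0.0, 0.0, 0.1), identity, 0.4), Grasp((0.0, 0.0, 0.05), identity, 0.9)],
        ),
        "plate": SceneObject("plate", [(0.3, 0.0, 0.0)]),
    },
    goal=[GoalPredicate("On", ("mug", "plate"))],
)
print(scene.ungrounded_args(), scene.best_grasp("mug").score)
```

The `ungrounded_args` check reflects one practical benefit of this interface: a goal that mentions an object the detector never found can be caught before planning starts.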

2.2 Planning module

Planning is handled by cuTAMP, a GPU-parallelized task-and-motion planner.

Given the symbolic goal, TiPToP:

enumerates candidate plan skeletons
samples grasps, placements, IK solutions, and trajectories
optimizes these parameters under collision and feasibility constraints
invokes cuRobo for collision-free motion generation
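
The skeleton-then-parameters idea can be sketched on a toy problem. The code below is a deliberately simplified 1D stand-in that I wrote for illustration: it uses plain rejection sampling on the CPU, whereas cuTAMP optimizes batches of candidates on the GPU, and none of the function names correspond to the cuTAMP or cuRobo APIs. It tries the short skeleton first (pick target, place on plate) and falls back to a longer one (move the blocker away first) when no collision-free placement exists.

```python
import random


def overlaps(x1, w1, x2, w2):
    """True if two 1D intervals (center, width) intersect."""
    return abs(x1 - x2) < (w1 + w2) / 2


def sample_placement(rng, region, width, obstacles, n_samples=256):
    """Sample centers inside `region`; return the first collision-free one, else None."""
    lo, hi = region
    for _ in range(n_samples):
        x = rng.uniform(lo + width / 2, hi - width / 2)
        if not any(overlaps(x, width, ox, ow) for ox, ow in obstacles):
            return x
    return None


def plan(target_w, plate, table, blocker, seed=0):
    """Return a list of symbolic actions with sampled parameters, or None."""
    rng = random.Random(seed)

    # Skeleton 1: pick(target), place(target, plate)
    x = sample_placement(rng, plate, target_w, [blocker])
    if x is not None:
        return [("pick", "target"), ("place", "target", x)]

    # Skeleton 2: move the blocker off the plate, then place the target
    _, blocker_w = blocker
    plate_center = (plate[0] + plate[1]) / 2
    plate_w = plate[1] - plate[0]
    new_bx = sample_placement(rng, table, blocker_w, [(plate_center, plate_w)])
    if new_bx is None:
        return None
    x = sample_placement(rng, plate, target_w, [(new_bx, blocker_w)])
    if x is None:
        return None
    return [
        ("pick", "blocker"), ("place", "blocker", new_bx),
        ("pick", "target"), ("place", "target", x),
    ]


# The blocker sits on the plate, so the two-action skeleton is infeasible.
blocked = plan(0.1, plate=(0.0, 0.2), table=(-0.5, 0.5), blocker=(0.1, 0.15))
# The blocker is far away, so the short skeleton succeeds.
free = plan(0.1, plate=(0.0, 0.2), table=(-0.5, 0.5), blocker=(0.4, 0.1))
print(len(blocked), len(free))
```

Even in this toy form, the structure shows why obstacle-removal tasks come naturally to a planner: the longer skeleton is simply the next candidate when the short one has no feasible parameters.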

This is the paper’s central engineering point: instead of hoping a policy implicitly discovers long-horizon structure, TiPToP builds that structure explicitly.

2.3 Execution module

The robot executes the planned trajectory with a joint impedance controller.

This makes accurate tracking a first-class requirement. Because TiPToP does not replan during execution, errors from slipping, missed grasps, or object motion can directly cause task failure.
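
For readers unfamiliar with the term, a joint impedance controller is usually written as a spring-damper law around the reference trajectory. The sketch below is the textbook form, not the controller from the paper: the gains `kp`, `kd` and the optional feedforward term `tau_ff` are placeholders I introduced, and the notes do not cover the values the authors use.

```python
from typing import List, Optional, Sequence


def joint_impedance_torque(
    q: Sequence[float],
    dq: Sequence[float],
    q_des: Sequence[float],
    dq_des: Sequence[float],
    kp: Sequence[float],
    kd: Sequence[float],
    tau_ff: Optional[Sequence[float]] = None,
) -> List[float]:
    """Per-joint torque: kp * position error + kd * velocity error + feedforward."""
    n = len(q)
    if tau_ff is None:
        tau_ff = [0.0] * n
    return [
        kp[i] * (q_des[i] - q[i]) + kd[i] * (dq_des[i] - dq[i]) + tau_ff[i]
        for i in range(n)
    ]


# Joint 0 lags the reference by 0.5 rad; joint 1 is on target but still moving.
tau = joint_impedance_torque(
    q=[0.0, 1.0], dq=[0.0, 0.5],
    q_des=[0.5, 1.0], dq_des=[0.0, 0.0],
    kp=[50.0, 50.0], kd=[2.0, 2.0],
)
print(tau)
```

Stiffer gains track the plan more tightly, which matters here because nothing downstream will correct a tracking error.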

3. Technical Details That Matter

Several implementation choices are more important than they may look at first glance.

3.1 From pixels to symbolic goals

Instead of using a VLM only for captions or labels, TiPToP asks the VLM to produce symbolic goals such as On(a, b).

That matters because the planner can then reason over:

which objects are relevant
what relations need to hold
which multi-step action sequence could satisfy them

This is what enables tasks like sorting by color or placing items on matching containers.
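
As a small illustration of why a symbolic goal is convenient, the sketch below parses a goal string into predicates and checks it against a set of facts. The string format and the helper names `parse_goal` and `goal_satisfied` are assumptions for this example; the actual prompt and output schema used with the VLM are not covered in these notes.

```python
import re
from typing import List, Set, Tuple

Predicate = Tuple[str, Tuple[str, ...]]


def parse_goal(text: str) -> List[Predicate]:
    """Parse 'On(a, b), On(c, d)' into [('On', ('a', 'b')), ('On', ('c', 'd'))]."""
    preds = []
    for match in re.finditer(r"(\w+)\(([^()]*)\)", text):
        name = match.group(1)
        args = tuple(a.strip() for a in match.group(2).split(",") if a.strip())
        preds.append((name, args))
    return preds


def goal_satisfied(goal: List[Predicate], state: Set[Predicate]) -> bool:
    """A conjunctive goal holds when every predicate is a fact in the state."""
    return all(p in state for p in goal)


goal = parse_goal("On(red_block, red_plate), On(blue_block, blue_plate)")
state = {("On", ("red_block", "red_plate"))}
print(goal_satisfied(goal, state))  # only one of the two relations holds so far

state.add(("On", ("blue_block", "blue_plate")))
print(goal_satisfied(goal, state))
```

Once the instruction is in this form, "which objects are relevant" is just the set of predicate arguments, and the planner's job is to find actions that make every predicate true.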

3.2 Convex-hull object completion

Each segmented object is converted into a watertight mesh by projecting observed points downward and taking the convex hull.

This is a pragmatic choice:

it is cheap
it provides conservative geometry for collision checking
but it also causes errors on concave shapes like bananas

The paper’s failure analysis shows this approximation is one of the main weak points.
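
To see why a convex hull is conservative yet wrong for concave shapes, here is a 2D footprint analogue in pure Python. It is a simplification I chose for illustration: the system described above works with 3D meshes, while this sketch only projects observed points down to an assumed table height and compares the hull of the footprint with the true footprint of an L-shaped object. The helper names and the example geometry are hypothetical.

```python
def project_down(points, table_z=0.0):
    """Add a copy of every observed 3D point dropped onto the table plane."""
    return list(points) + [(x, y, table_z) for x, y, _ in points]


def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull_2d(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula for a simple polygon given in order."""
    n = len(vertices)
    s = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(s) / 2.0


# Outline of an L-shaped object (a crude stand-in for a concave shape), seen at height 0.04.
l_shape = [(0, 0), (3, 0), (3, 1), (1, 1), (1, 3), (0, 3)]
observed = [(x, y, 0.04) for x, y in l_shape]

completed = project_down(observed)
footprint = convex_hull_2d([(x, y) for x, y, _ in completed])

true_area = polygon_area(l_shape)
hull_area = polygon_area(footprint)
print(true_area, hull_area)
```

In this example the hull covers 7 area units against a true footprint of 5, so the planner would treat the empty notch of the L as solid. That is safe for collision avoidance but can rule out valid grasps or placements near the concavity.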

3.3 Open-loop planning as a tradeoff

TiPToP uses a single initial observation and then executes open-loop.

This gives:

fast task completion
strong geometric consistency with the planner

But it also removes:

recovery after failed grasps
correction after object slip
adaptation to unexpected scene changes

The paper is honest that this tradeoff is currently one of the biggest limitations.
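
A back-of-the-envelope model shows how quickly the lack of recovery compounds on multi-step tasks. This is my own illustrative calculation, not an analysis from the paper: it assumes every pick succeeds independently with the same probability, and that a closed-loop system could detect a failed grasp and retry it.

```python
def open_loop_success(p_grasp: float, n_picks: int) -> float:
    """Every pick must succeed on its first and only attempt."""
    return p_grasp ** n_picks


def retry_success(p_grasp: float, n_picks: int, max_attempts: int) -> float:
    """Each pick may be retried up to `max_attempts` times before giving up."""
    p_pick = 1.0 - (1.0 - p_grasp) ** max_attempts
    return p_pick ** n_picks


for n in (1, 3, 5):
    print(n, round(open_loop_success(0.9, n), 3), round(retry_success(0.9, n, 2), 3))
```

With a 90% per-grasp success rate, a three-pick task succeeds about 73% of the time open-loop, but about 97% of the time if each grasp gets one retry. The real numbers depend on correlated failures and on whether a failed grasp disturbs the scene, so this should be read as intuition only.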

4. Experimental Results

The evaluation spans 28 tasks and 165 trials across:

simulation
the authors’ DROID setup
an external evaluation team’s DROID setup

The tasks are grouped into:

simple
distractor
semantic
multi-step

4.1 Main comparison against pi0.5-DROID

The headline result is:

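TiPToP: 98/165 successes (59.4% success rate); the 74.6% figure appears to be average task progress rather than success rate
pi0.5-DROID: 55/165 successes (33.3% success rate); likewise, 52.4% appears to be average task progress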
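
Recomputing the rates from the raw success counts is a quick sanity check on the headline numbers. The snippet uses only the counts quoted above.

```python
results = {"TiPToP": (98, 165), "pi0.5-DROID": (55, 165)}


def success_rate(successes: int, trials: int) -> float:
    """Success rate in percent."""
    return 100.0 * successes / trials


rates = {name: round(success_rate(s, n), 1) for name, (s, n) in results.items()}
gap = round(success_rate(*results["TiPToP"]) - success_rate(*results["pi0.5-DROID"]), 1)
print(rates, gap)
```

The counts give 59.4% for TiPToP and 33.3% for pi0.5-DROID, a gap of about 26 percentage points in raw success rate.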

The most interesting pattern is not that TiPToP wins everywhere. It does not.

Instead:

on simple tasks, the two systems are fairly close
on distractor tasks, TiPToP is much stronger
on semantic tasks, TiPToP is much stronger
on multi-step tasks, TiPToP is again much stronger

This matches the architecture:

VLM grounding helps when language and semantic selection matter
TAMP helps when multi-step structure and collision constraints matter
end-to-end reactive control still helps on fragile grasps and execution recovery

4.2 Time-to-success

TiPToP is also often faster than the VLA baseline on successful trials.

Examples from Table II (TiPToP vs. pi0.5-DROID):

can -> mug (sim): 18.6s vs 41.0s
crackers -> tray (simple): 14.9s vs 32.2s
crackers -> tray (medium): 14.9s vs 45.2s

The reason is straightforward: TiPToP executes a planned trajectory directly, while the reactive VLA may spend extra time probing, retrying, or idling.
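
The three examples can be turned into speedup ratios. I am assuming the first time in each pair is TiPToP and the second is the VLA baseline, which matches the claim that TiPToP is the faster system.

```python
# (TiPToP seconds, pi0.5-DROID seconds); ordering assumed as described above
times = {
    "can -> mug (sim)": (18.6, 41.0),
    "crackers -> tray (simple)": (14.9, 32.2),
    "crackers -> tray (medium)": (14.9, 45.2),
}

speedups = {task: round(vla / tiptop, 1) for task, (tiptop, vla) in times.items()}
for task, ratio in speedups.items():
    print(f"{task}: {ratio}x faster")
```

On these three tasks the planned trajectory finishes roughly 2x to 3x faster than the reactive policy.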

5. Failure Analysis

This section is one of the paper’s strongest parts.

The authors manually analyzed 173 additional real-world trials and traced failures to specific modules. The dominant categories are:

grasping failures: 31 / 55 failures
scene completion errors: 13 / 55
VLM errors: 6 / 55
cuTAMP failures: 5 / 55

The big takeaway is that grasping and execution robustness dominate the remaining error budget.
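
Converting the counts into shares makes the error budget easier to read. The snippet uses the four categories listed above; treating the 55 failures as a subset of the 173 analyzed trials is my reading of the numbers.

```python
failures = {"grasping": 31, "scene completion": 13, "VLM": 6, "cuTAMP": 5}
analyzed_trials = 173

total_failures = sum(failures.values())
shares = {name: round(100.0 * count / total_failures, 1) for name, count in failures.items()}
failure_rate = round(100.0 * total_failures / analyzed_trials, 1)

print(total_failures, shares, failure_rate)
```

Grasping alone accounts for about 56% of failures, and grasping plus scene completion for 80%, while the VLM and the planner together contribute only about 20%.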

In other words, the planning stack is already fairly strong. The larger problems are:

bad grasp proposals
slip during transport
mesh approximation errors from partial observation
lack of visual feedback during execution

This is exactly the kind of conclusion that modular systems make easier to reach.

6. Why the Paper Is Interesting

I think the paper makes three useful arguments.

6.1 Modular systems are still competitive

There is a strong current narrative that large end-to-end policies will absorb everything. TiPToP pushes back with a concrete counterexample: if task structure matters, explicit planning can still be very competitive.

6.2 Better debugging is a real research advantage

Because the system is decomposed, the authors can identify whether failures come from:

perception
mesh completion
grasp generation
planning
execution

That is much more actionable than simply reporting a task-level failure rate.

6.3 Cross-embodiment deployment matters

The authors also show deployment on UR5e and WidowX AI. This is important because many robotics systems look strong only inside one tightly controlled stack. TiPToP argues for a reusable interface between perception, planning, and embodiment-specific execution.

7. Limitations

The paper is strong, but the limitations are substantial and worth keeping in view.

Open-loop execution is the biggest weakness. Many failures could likely be recovered with re-perception and re-planning.
Single-view perception limits object visibility and mesh quality.
Convex hull geometry is too crude for concave objects and can distort collision reasoning.
The system still depends on a fairly heavyweight collection of external foundation models.
Some extensions, especially to richer manipulation skills, will require more abstract action models and more robust low-level controllers.

8. Takeaways

My main takeaway is that TiPToP is not just a manipulation system, but a strong argument for bringing planning back into the modern foundation-model robotics stack.

The paper shows that a system can be:

open-vocabulary
data-efficient
interpretable
fairly portable across robots

without giving up strong performance on long-horizon tabletop manipulation.

If I had to summarize the paper in one sentence, it would be:

Foundation models are now good enough that explicit planning becomes attractive again, because perception can finally provide the semantic and geometric abstractions that planners need.


Lixin Xu
