[Paper Notes] TiPToP: A Modular Open-Vocabulary Planning System for Robotic Manipulation
TL;DR
TiPToP is a modular robotic manipulation system that turns a stereo RGB observation plus a natural-language instruction into a complete manipulation plan, without any robot training data.
Its core claim is simple but important: for long-horizon tabletop manipulation, a pipeline built from foundation-model perception + explicit task-and-motion planning + precise execution can compete with, and often outperform, a large end-to-end VLA policy that was fine-tuned on hundreds of hours of embodiment-specific demonstrations.
What makes the paper stand out is not only the benchmark result. It is the systems argument that modularity is still a strong design choice in robotics, because it gives:
- better semantic grounding on distractor-heavy tasks
- stronger multi-step reasoning through symbolic planning
- clearer failure diagnosis at the component level
- easier cross-embodiment deployment
Paper Info
- Title: TiPToP: A Modular Open-Vocabulary Planning System for Robotic Manipulation
- Authors: William Shen, Nishanth Kumar, Sahit Chintalapudi, Jie Wang, Christopher Watson, Edward Hu, Jing Cao, Dinesh Jayaraman, Leslie Pack Kaelbling, Tomas Lozano-Perez
- Affiliations: MIT CSAIL, University of Pennsylvania
- arXiv: 2603.09971
- Project page: tiptop-robot.github.io
- Paper type: robotic manipulation / modular systems / task and motion planning
1. Problem Setting and Motivation
The paper studies a practical manipulation setting:
- input: a stereo RGB image pair and a language instruction
- output: robot joint trajectories and gripper commands that complete the task
The target is not just simple pick-and-place. The tasks include:
- distractor-heavy object selection
- semantic grounding such as “matching plate” or “largest toy”
- multi-step manipulation with obstacle removal or packing
The authors position TiPToP against a strong baseline: pi0.5-DROID, a vision-language-action model fine-tuned on 350 hours of embodiment-specific demonstrations.
The motivation is clear:
- end-to-end VLAs are appealing, but expensive in data and hard to debug
- classical TAMP is structured and interpretable, but has usually been too brittle or too tightly engineered
- recent foundation models make it possible to revisit modular planning with much stronger perception
2. System Overview
TiPToP is split into three modules:
- Perception
- Planning
- Execution
The full policy is planner-based. It observes the scene once at the beginning and then executes an open-loop plan.
2.1 Perception module
The perception module builds an object-centric scene representation from the initial stereo observation and the language instruction.
Its two branches run in parallel:
- a 3D vision branch for depth and grasp generation
- a semantic branch for object detection, segmentation, and goal grounding
The key ingredients are:
- FoundationStereo for dense stereo depth
- M2T2 for 6-DoF grasp proposals
- Gemini Robotics-ER 1.5 for open-vocabulary object detection and symbolic goal grounding
- SAM-2 for segmentation
The output is a set of per-object meshes, candidate grasps, and symbolic goal predicates.
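Concretely, a minimal sketch of what that output could look like as a data structure (the field names here are illustrative, not the paper's actual API):

```python
from dataclasses import dataclass, field

import numpy as np

@dataclass
class SceneObject:
    name: str                     # open-vocabulary label from the VLM detector
    mesh_vertices: np.ndarray     # (V, 3) vertices of the watertight mesh
    mesh_faces: np.ndarray        # (F, 3) triangle indices
    grasps: list = field(default_factory=list)  # candidate 6-DoF grasp poses (4x4)

@dataclass
class SceneRepresentation:
    objects: dict                 # name -> SceneObject
    goal_predicates: list         # e.g. [("On", "mug", "tray")]

scene = SceneRepresentation(
    objects={"mug": SceneObject("mug", np.zeros((8, 3)), np.zeros((12, 3), dtype=int))},
    goal_predicates=[("On", "mug", "tray")],
)
```

The important design point is that everything downstream (the planner, the motion generator) consumes only this structured representation, never raw pixels.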
2.2 Planning module
Planning is handled by cuTAMP, a GPU-parallelized task-and-motion planner.
Given the symbolic goal, TiPToP:
- enumerates candidate plan skeletons
- samples grasps, placements, IK solutions, and trajectories
- optimizes these parameters under collision and feasibility constraints
- invokes cuRobo for collision-free motion generation
This is the paper’s central engineering point: instead of hoping a policy implicitly discovers long-horizon structure, TiPToP builds that structure explicitly.
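The skeleton-then-samples pattern above can be sketched in a toy, self-contained form. The 1-D "world", the feasibility check, and the cost below are stand-ins for illustration only, not the cuTAMP API:

```python
import random

random.seed(0)

OBSTACLE = (0.4, 0.6)  # a 1-D "no placement" interval standing in for collisions

def sample_parameters():
    # Stand-in for sampling grasps, placements, and IK solutions.
    return random.uniform(0.0, 1.0)

def feasible(x):
    # Stand-in for collision and kinematic feasibility checks.
    return not (OBSTACLE[0] <= x <= OBSTACLE[1])

def cost(x):
    # Stand-in for trajectory cost; prefer placements near 0.
    return x

def plan(skeletons, n_samples=64):
    """Try each candidate plan skeleton in order; for each, sample continuous
    parameters and return the lowest-cost feasible binding, if any exists."""
    for skeleton in skeletons:
        # In cuTAMP these samples are evaluated in parallel on the GPU.
        samples = [sample_parameters() for _ in range(n_samples)]
        ok = [s for s in samples if feasible(s)]
        if ok:
            return skeleton, min(ok, key=cost)
    return None  # no skeleton admitted a feasible binding

result = plan([["pick", "place"]])
```

The key property is the separation of concerns: discrete search over skeletons, continuous sampling and optimization over their parameters.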
2.3 Execution module
The robot executes the planned trajectory with a joint impedance controller.
This makes accurate tracking a first-class requirement. Because TiPToP does not replan during execution, errors from slipping, missed grasps, or object motion can directly cause task failure.
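For reference, a minimal joint-space impedance law of the kind such a controller would implement is tau = Kp (q_des - q) + Kd (qd_des - qd). The sketch below uses made-up scalar gains; a real controller would use per-joint gain matrices and add gravity and Coriolis compensation:

```python
import numpy as np

def impedance_torque(q, qd, q_des, qd_des, kp=50.0, kd=5.0):
    """Joint-space impedance law: spring toward the desired position,
    damper toward the desired velocity. Gains here are illustrative."""
    q, qd, q_des, qd_des = map(np.asarray, (q, qd, q_des, qd_des))
    return kp * (q_des - q) + kd * (qd_des - qd)

tau = impedance_torque([0.0, 0.1], [0.0, 0.0], [0.1, 0.1], [0.0, 0.0])
```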
3. Technical Details That Matter
Several implementation choices are more important than they may look at first glance.
3.1 From pixels to symbolic goals
Instead of using a VLM only for captions or labels, TiPToP asks the VLM to produce symbolic goals such as On(a, b).
That matters because the planner can then reason over:
- which objects are relevant
- what relations need to hold
- which multi-step action sequence could satisfy them
This is what enables tasks like sorting by color or placing items on matching containers.
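To make the interface concrete, here is a hypothetical sketch of parsing VLM text output into predicate tuples the planner can consume. The exact prompt and response format the paper uses is not shown here; this only illustrates the predicate representation:

```python
import re

def parse_goal(text):
    """Parse 'Pred(arg1, arg2)' expressions into (pred, arg1, arg2) tuples."""
    pattern = r"(\w+)\(\s*([\w\-]+)\s*,\s*([\w\-]+)\s*\)"
    return [tuple(m) for m in re.findall(pattern, text)]

goals = parse_goal("On(red_mug, matching_plate) and On(banana, tray)")
# goals == [("On", "red_mug", "matching_plate"), ("On", "banana", "tray")]
```

Once goals live in this symbolic form, the planner can check which predicates already hold and search only over the relations that still need to change.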
3.2 Convex-hull object completion
Each segmented object is converted into a watertight mesh by projecting observed points downward and taking the convex hull.
This is a pragmatic choice:
- it is cheap
- it provides conservative geometry for collision checking
- but it also causes errors on concave shapes like bananas
The paper’s failure analysis shows this approximation is one of the main weak points.
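The completion idea can be sketched in simplified form: project the partial point cloud down to the table plane and take a conservative hull volume. The toy below computes only the 2-D footprint hull (Andrew's monotone chain) and extrudes it from the table to the max observed height; this is an illustrative simplification, not the paper's exact 3-D procedure:

```python
import numpy as np

def _cross(o, a, b):
    # 2-D cross product of (a - o) and (b - o).
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def hull_2d(points):
    """Convex hull of 2-D points via Andrew's monotone chain."""
    pts = sorted(map(tuple, points))
    if len(pts) <= 2:
        return pts
    def half(seq):
        out = []
        for p in seq:
            while len(out) >= 2 and _cross(out[-2], out[-1], p) <= 0:
                out.pop()
            out.append(p)
        return out[:-1]
    return half(pts) + half(reversed(pts))

def complete_object(points_3d, table_z=0.0):
    """Approximate an object as a prism: footprint hull extruded from the
    table plane up to the tallest observed point."""
    footprint = hull_2d(points_3d[:, :2])
    return footprint, table_z, float(points_3d[:, 2].max())

cloud = np.array([[0, 0, 0.05], [0.1, 0, 0.06], [0.1, 0.1, 0.04], [0, 0.1, 0.05]])
footprint, z0, z1 = complete_object(cloud)
```

Note how the approximation is conservative by construction: the true object is contained in the volume, which is safe for collision checking but over-fills any concavity, which is exactly the banana failure mode the paper reports.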
3.3 Open-loop planning as a tradeoff
TiPToP uses a single initial observation and then executes open-loop.
This gives:
- fast task completion
- strong geometric consistency with the planner
But it also removes:
- recovery after failed grasps
- correction after object slip
- adaptation to unexpected scene changes
The paper is honest that this tradeoff is currently one of the biggest limitations.
4. Experimental Results
The evaluation spans 28 tasks and 165 trials across:
- simulation
- the authors’ DROID setup
- an external evaluation team’s DROID setup
The tasks are grouped into:
- simple
- distractor
- semantic
- multi-step
4.1 Main comparison against pi0.5-DROID
The headline result is:
- TiPToP: 98/165 successes, 74.6%
- pi0.5-DROID: 55/165 successes, 52.4%
The most interesting pattern is not that TiPToP wins everywhere. It does not.
Instead:
- on simple tasks, the two systems are fairly close
- on distractor tasks, TiPToP is much stronger
- on semantic tasks, TiPToP is much stronger
- on multi-step tasks, TiPToP is again much stronger
This matches the architecture:
- VLM grounding helps when language and semantic selection matter
- TAMP helps when multi-step structure and collision constraints matter
- end-to-end reactive control still helps on fragile grasps and execution recovery
4.2 Time-to-success
TiPToP is also often faster than the VLA baseline on successful trials.
Examples from Table II:
- can -> mug (sim): 18.6s vs 41.0s
- crackers -> tray (simple): 14.9s vs 32.2s
- crackers -> tray (medium): 14.9s vs 45.2s
The reason is straightforward: TiPToP executes a planned trajectory directly, while the reactive VLA may spend extra time probing, retrying, or idling.
5. Failure Analysis
This section is one of the paper’s strongest parts.
The authors manually analyzed 173 additional real-world trials and traced failures to specific modules. The dominant categories are:
- grasping failures: 31 / 55
- scene completion errors: 13 / 55
- VLM errors: 6 / 55
- cuTAMP failures: 5 / 55
The big takeaway is that grasping and execution robustness dominate the remaining error budget.
In other words, the planning stack is already fairly strong. The larger problems are:
- bad grasp proposals
- slip during transport
- mesh approximation errors from partial observation
- lack of visual feedback during execution
This is exactly the kind of conclusion that modular systems make easier to reach.
6. Why the Paper Is Interesting
I think the paper makes three useful arguments.
6.1 Modular systems are still competitive
There is a strong current narrative that large end-to-end policies will absorb everything. TiPToP pushes back with a concrete counterexample: if task structure matters, explicit planning can still be very competitive.
6.2 Better debugging is a real research advantage
Because the system is decomposed, the authors can identify whether failures come from:
- perception
- mesh completion
- grasp generation
- planning
- execution
That is much more actionable than simply reporting a task-level failure rate.
6.3 Cross-embodiment deployment matters
The authors also show deployment on UR5e and WidowX AI. This is important because many robotics systems look strong only inside one tightly controlled stack. TiPToP argues for a reusable interface between perception, planning, and embodiment-specific execution.
7. Limitations
The paper is strong, but the limitations are substantial and worth keeping in view.
- Open-loop execution is the biggest weakness. Many failures could likely be recovered with re-perception and re-planning.
- Single-view perception limits object visibility and mesh quality.
- Convex hull geometry is too crude for concave objects and can distort collision reasoning.
- The system still depends on a fairly heavyweight collection of external foundation models.
- Some extensions, especially to richer manipulation skills, will require more abstract action models and more robust low-level controllers.
8. Takeaways
My main takeaway is that TiPToP is not just a manipulation system, but a strong argument for bringing planning back into the modern foundation-model robotics stack.
The paper shows that a system can be:
- open-vocabulary
- data-efficient
- interpretable
- fairly portable across robots
without giving up strong performance on long-horizon tabletop manipulation.
If I had to summarize the paper in one sentence, it would be:
foundation models are now good enough that explicit planning becomes attractive again, because perception can finally provide the semantic and geometric abstractions that planners need.
