[Paper Notes] UltraDexGrasp: Learning Universal Dexterous Grasping for Bimanual Robots with Synthetic Data
Published:
TL;DR
UltraDexGrasp tackles a missing piece in dexterous manipulation: universal grasping for bimanual robots across multiple grasp strategies. Instead of focusing only on one hand or one grasp type, the paper builds a synthetic-data pipeline that supports:
- two-finger pinch
- three-finger tripod
- whole-hand grasp
- bimanual grasp
Using this pipeline, the authors build UltraDexGrasp-20M, a 20-million-frame dataset over 1,000 objects, and train a point-cloud policy that achieves 84.0% average success in simulation and 81.2% average success in real-world zero-shot sim-to-real grasping.
Paper Info
- Title: UltraDexGrasp: Learning Universal Dexterous Grasping for Bimanual Robots with Synthetic Data
- Authors: Sizhe Yang, Yiman Xie, Zhixuan Liang, Yang Tian, Jia Zeng, Dahua Lin, Jiangmiao Pang
- Affiliations: Shanghai AI Laboratory, CUHK, Zhejiang University, HKU, Peking University
- Project page: yangsizhe.github.io/ultradexgrasp
- arXiv: 2603.05312
- Code: UltraDexGrasp GitHub
1. Motivation
The paper starts from a clear observation: human grasping is naturally strategy-dependent.
- small objects are often handled with pinch or tripod grasps
- medium objects can be grasped with one full hand
- large or heavy objects often require both hands
Current robotic dexterous grasping work usually does not cover this full space. Most prior work is limited to:
- parallel grippers
- single dexterous hands
- one grasp style at a time
For bimanual dexterous robots, the main bottleneck is data. The paper argues that generating high-quality universal grasp data is hard because we need:
- physically plausible contact
- good geometric conformity
- arm-level kinematic feasibility
- dual-arm coordination
- multiple grasp strategies for different object regimes
That is the niche UltraDexGrasp is trying to fill.
2. Core Idea
The key contribution is not just a policy. It is a data generation framework that combines:
- optimization-based grasp synthesis
- planning-based demonstration generation
This lets the system first find feasible grasp poses, then convert them into closed-loop, coordinated arm-hand trajectories that can be executed and filtered in simulation.
The output is a large-scale multi-strategy dataset, which is then used to train a universal grasp policy.
3. Data Generation Pipeline
3.1 Optimization-based grasp synthesis
For a given object and robot, the system first synthesizes candidate bimanual grasps by optimizing:
- hand pose
- finger joint positions
- contact forces
under constraints such as:
- forward kinematics
- joint limits
- friction cone feasibility
- hand-object collision avoidance
- hand-hand collision avoidance
The formulation is shared across grasp strategies; the main difference between pinch, tripod, whole-hand, and bimanual grasp is which hand contact points are activated.
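None of these constraints is exotic; friction-cone feasibility, for instance, reduces to a per-contact check that the tangential force stays within the cone. A minimal sketch, assuming a standard Coulomb friction model (the friction coefficient below is an assumed value, not one from the paper):

```python
import numpy as np

def in_friction_cone(force, normal, mu=0.5):
    """Check a contact force against the Coulomb friction cone.

    force  : 3-vector contact force at a contact point
    normal : unit surface normal at the contact
    mu     : friction coefficient (assumed value, not from the paper)
    """
    f_n = np.dot(force, normal)           # normal component
    if f_n <= 0:
        return False                      # pulling contacts cannot be sustained
    f_t = force - f_n * normal            # tangential component
    return np.linalg.norm(f_t) <= mu * f_n

n = np.array([0.0, 0.0, 1.0])
print(in_friction_cone(np.array([0.1, 0.0, 1.0]), n))  # mostly normal: True
print(in_friction_cone(np.array([1.0, 0.0, 0.1]), n))  # mostly sliding: False
```

In the full optimization these checks become differentiable penalties rather than hard boolean tests, but the geometry is the same.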
For each object, the method generates 500 candidate grasps, then filters them using:
- physical plausibility checks
- inverse-kinematics reachability
- collision checking
Finally, it selects a preferred grasp based on the shortest SE(3) distance from the current end-effector pose, which makes execution easier and more natural.
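The selection step can be sketched directly. The paper does not specify how translation and rotation are weighted inside the SE(3) distance, so the `w_rot` weight and the quaternion parameterization below are assumptions:

```python
import numpy as np

def se3_distance(pos_a, quat_a, pos_b, quat_b, w_rot=0.1):
    """Weighted SE(3) distance: translation norm plus geodesic rotation angle.

    Quaternions are unit-normalized (w, x, y, z). w_rot trades off rotation
    (radians) against translation (meters); the value is an assumption.
    """
    d_trans = np.linalg.norm(pos_a - pos_b)
    dot = np.clip(abs(np.dot(quat_a, quat_b)), -1.0, 1.0)
    d_rot = 2.0 * np.arccos(dot)          # angle between the two orientations
    return d_trans + w_rot * d_rot

def select_grasp(candidates, ee_pos, ee_quat):
    """Pick the filtered candidate closest to the current end-effector pose."""
    return min(candidates,
               key=lambda g: se3_distance(g["pos"], g["quat"], ee_pos, ee_quat))
```

The `candidates` dicts with `"pos"`/`"quat"` keys are a hypothetical representation for the filtered grasp set; the point is only that the final choice is a nearest-pose argmin, which keeps the approach motion short.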
3.2 Planning-based demonstration generation
Once the preferred grasp is selected, the whole grasping process is split into four stages:
- pregrasp
- grasp
- squeeze
- lift
Bimanual motion planning then produces collision-free coordinated trajectories. In simulation, both arms execute the planned motions under PD control, and a trajectory is kept only if the object is stably lifted.
The success condition is fairly concrete:
- the object must rise at least 0.17 m
- it must stay elevated for at least 1 second
This is a useful design choice because it turns grasp data generation into a physically validated process, not only a kinematic one.
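The lift check is easy to express as a filter over the simulated object trajectory. A minimal sketch, assuming per-step object heights and a fixed simulator timestep (both hypothetical interface choices, not from the paper):

```python
def lifted_successfully(heights, dt, rise=0.17, hold_time=1.0):
    """Check a simulated lift: the object must rise at least `rise` meters
    above its initial height and stay there for `hold_time` seconds.

    heights : per-step object heights from the simulator (meters)
    dt      : simulation timestep (seconds)
    """
    target = heights[0] + rise
    hold_steps = int(round(hold_time / dt))
    run = 0                               # consecutive steps above the target
    for h in heights:
        run = run + 1 if h >= target else 0
        if run >= hold_steps:
            return True
    return False
```

Requiring a *consecutive* run above the threshold (rather than total time) rejects grasps where the object bounces out of the hand and briefly re-enters the success region.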
3.3 Dataset scale
Using this process, the authors build UltraDexGrasp-20M:
- 20 million frames
- 1,000 objects
- multiple grasp strategies
The paper also notes that the simulated rendering includes the robot's own point cloud, which helps reduce the sim-to-real gap because the robot's geometry appears in the observations at deployment time as well.
4. Policy Design
The grasp policy is intentionally simple.
- input: scene point cloud
- encoder: PointNet++-style point encoder
- aggregator: decoder-only transformer with unidirectional attention
- output: arm and hand control commands
The point cloud is first downsampled to 2,048 points, then encoded with two set-abstraction layers.
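Downsampling to a fixed point budget is typically done with farthest point sampling in PointNet++-style encoders; the paper does not spell out the sampling method, so FPS here is an assumption. A minimal NumPy sketch:

```python
import numpy as np

def farthest_point_sample(points, n_samples):
    """Greedy farthest point sampling, as used in PointNet++-style encoders.

    points    : (N, 3) array of scene points
    n_samples : number of points to keep (e.g. 2,048 in the paper's policy)
    """
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=int)
    dist = np.full(n, np.inf)             # distance to nearest selected point
    idx = 0                               # start from an arbitrary point
    for i in range(n_samples):
        selected[i] = idx
        d = np.linalg.norm(points - points[idx], axis=1)
        dist = np.minimum(dist, d)
        idx = int(np.argmax(dist))        # farthest point from the current set
    return points[selected]
```

FPS keeps coverage of the scene roughly uniform, which matters when thin structures like fingers and object edges must survive the downsampling.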
Two design choices seem especially important in the paper:
- the unidirectional attention mechanism for aggregating scene features
- bounded Gaussian distribution prediction for actions
The authors emphasize that the policy is meant to be simple and clean, so the paper’s gains should be interpreted mainly as evidence that the data pipeline is strong enough to support universal grasping.
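"Bounded Gaussian distribution prediction" plausibly means squashing a Gaussian sample into the action limits, e.g. with tanh. The sketch below is one common realization under that assumption, not the paper's exact parameterization:

```python
import numpy as np

def sample_bounded_action(mean, log_std, low, high, rng):
    """Sample an action from a tanh-squashed Gaussian, rescaled into the
    joint-limit box [low, high].

    This is one standard way to realize bounded distribution prediction
    (cf. SAC-style policies); the paper's exact head may differ.
    """
    z = mean + np.exp(log_std) * rng.standard_normal(mean.shape)
    u = np.tanh(z)                        # squashed into (-1, 1)
    return low + 0.5 * (u + 1.0) * (high - low)
```

The appeal of a bounded head is that no sampled command can ever violate joint limits, so the simulator (and the real robot) never has to clip actions.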
5. Main Results
5.1 Simulation benchmark
The simulation benchmark evaluates grasping on 600 objects, split by size:
- small
- medium
- large
Results against DP3 and DexGraspNet are strong:
- DP3: 46.7% average success
- DexGraspNet: 58.8% average success
- UltraDexGrasp policy: 84.0% average success
More specifically:
- seen small: 78.8%
- seen medium: 84.3%
- seen large: 90.4%
- unseen small: 76.9%
- unseen medium: 85.8%
- unseen large: 87.5%
The average performance on unseen objects is about 83.4%, which is the main generalization result.
An important comparison is that DexGraspNet cannot handle large objects in this setup because it only synthesizes unimanual grasps. That highlights why multi-strategy bimanual data matters.
5.2 Data scaling and policy quality
The paper notes that the raw data-generation pipeline itself succeeds at grasping only 68.5% of the time, but once the policy is trained on more than 1M frames, its success rate significantly exceeds the generator's.
That is a nice result: the learned policy is not just imitating noisy demonstrations; it is actually distilling and improving on the large synthetic dataset.
5.3 Ablation study
The ablations show the policy architecture is not arbitrary:
- without bounded distribution prediction: 73.5%
- without unidirectional attention: 68.2%
- full model: 84.0%
So both design choices contribute materially, and the attention design appears particularly important.
5.4 Real-world results
The real-world setup uses:
- two UR5e robots
- two 12-DoF XHands
- two Azure Kinect DK cameras
The policy is tested on 25 real objects across small, medium, and large categories.
Reported real-world results:
- DP3: 46.7%
- DexGraspNet: 62.3%
- UltraDexGrasp policy: 81.2%
By object size:
- small: 72.0%
- medium: 82.2%
- large: 89.3%
These are strong numbers for direct zero-shot sim-to-real deployment, especially since the policy is trained only on synthetic data.
6. Why This Paper Matters
I think the most valuable aspect of the paper is its problem scope.
A lot of dexterous grasping work asks:
- can one hand grasp many objects?
This paper instead asks:
- can a robot choose among multiple dexterous grasp strategies, including bimanual ones, using one training framework?
That is much closer to the real-world version of grasping.
The second reason the paper matters is that it shows a realistic pipeline for scaling data:
- synthesis for diverse contact-rich grasps
- planning for executable trajectories
- simulation filtering for physical validity
- policy learning for generalization and speed
7. Strengths
- Strong focus on a genuinely underexplored setting: universal bimanual dexterous grasping.
- The dataset-generation pipeline is concrete, scalable, and physically grounded.
- Supports multiple grasp strategies instead of a single grasp mode.
- Strong simulation and real-world results with synthetic-data-only training.
- Clear evidence that the learned policy outperforms both the raw generator and strong baselines.
8. Limitations and Open Questions
- The paper focuses on grasping, not the subsequent manipulation after grasp acquisition.
- The evaluation is still mostly object lifting; more task-oriented or functional grasp benchmarks would be useful.
- The policy is trained on point clouds with known robot geometry, so robustness to worse sensing conditions remains unclear.
- The architecture is relatively task-specific; it is not obvious yet whether the same setup scales to more general dexterous manipulation beyond grasping.
- The object set is broad, but it would be interesting to see more articulated, deformable, or cluttered scenarios.
9. Takeaways
My main takeaway is that UltraDexGrasp makes a strong case that multi-strategy dexterous grasping can be learned from synthetic data alone, provided the data are generated with enough physical and kinematic care.
The recipe is fairly compelling:
- synthesize physically plausible grasps
- plan closed-loop coordinated bimanual trajectories
- validate in simulation
- train a simple policy on a very large dataset
That combination gets surprisingly far. For bimanual dexterous robots, this feels like a practical foundation for scaling toward more general manipulation.
