[Paper Notes] D-REX: Differentiable Real-to-Sim-to-Real Engine for Learning Dexterous Grasping
TL;DR
D-REX is a real-to-sim-to-real framework for dexterous grasping that tries to solve a specific bottleneck in sim-to-real transfer: the simulator often has the wrong object physics, especially object mass. The paper combines:
- Gaussian-Splat-based object reconstruction for visually realistic digital twins
- a differentiable physics engine for mass identification from robot interaction videos
- human-video-to-robot-demo transfer for supervision
- a force-aware grasping policy conditioned on the identified mass
The key idea is simple but useful: if the simulator can infer the object’s mass from real robot-object interactions, then the learned grasp policy can apply more appropriate grasp force and transfer more reliably to the real world.
Paper Info
- Title: D-REX: Differentiable Real-to-Sim-to-Real Engine for Learning Dexterous Grasping
- Authors: Haozhe Lou, Mingtong Zhang, Haoran Geng, Hanyang Zhou, Sicheng He, Zhiyuan Gao, Siheng Zhao, Jiageng Mao, Pieter Abbeel, Jitendra Malik, Daniel Seita, Yue Wang
- Affiliations: University of Southern California, University of California Berkeley
- Venue: ICLR 2026 conference paper
- arXiv: 2603.01151
- Project page: drex.github.io
1. Motivation
The paper starts from a practical issue in robot learning:
- simulation is cheap and scalable for policy learning
- but sim-to-real performance depends heavily on whether the simulator matches real-world object dynamics
- geometry alone is not enough; mass mismatch often causes grasp failure because the required grasp force changes with object weight
The authors focus on building a pipeline that can recover a physically plausible digital twin from real observations, then use that twin to train better dexterous grasping policies.
2. What D-REX Does
D-REX has four main stages:
- Real-to-Sim reconstruction: reconstruct scene/object geometry from RGB videos using Gaussian Splat representations, then derive collision meshes for simulation.
- Mass identification: execute consistent robot actions in the real world and in simulation, then optimize object mass so the simulated trajectory matches the real one.
- Human-to-robot demo transfer: recover human hand/object motion from RGB videos and retarget them into robot joint trajectories.
- Policy learning: train a dexterous grasping policy conditioned on object geometry and identified mass.
This makes the pipeline a genuine “real-to-sim-to-real” loop instead of only using reconstruction for visualization.
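The four stages above can be sketched as a simple orchestration loop. Every function name here is a placeholder standing in for a full subsystem, not code or an API from the paper:

```python
# Hypothetical orchestration of the four D-REX stages; each function is a
# stub for a full subsystem, with dummy return values for illustration.

def reconstruct(rgb_video):
    """Stage 1: Gaussian-Splat reconstruction -> geometry + collision mesh."""
    return {"mesh": "collision_mesh", "splats": "gaussians"}

def identify_mass(twin, real_interaction):
    """Stage 2: differentiable sim fits mass to the real push trajectory."""
    return 0.75  # kg, dummy value

def transfer_demos(human_video):
    """Stage 3: hand/object pose estimation + retargeting -> robot trajectories."""
    return ["robot_joint_trajectory"]

def train_policy(twin, mass, demos):
    """Stage 4: mass-conditioned, force-aware grasp policy."""
    return lambda obs: {"joints": demos[0], "force": 2.0 * mass}

twin = reconstruct("scene.mp4")
mass = identify_mass(twin, "push_log")
demos = transfer_demos("human.mp4")
policy = train_policy(twin, mass, demos)
```

The point of the structure is that stage 2's output (mass) feeds directly into stage 4, which is what distinguishes this loop from reconstruction-only pipelines.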
3. Core Technical Idea
3.1 Differentiable mass identification
The central optimization target is to infer object mass m by minimizing the discrepancy between simulated and real trajectories:
min_{m > 0} L_traj(m) = sum_t || s_t^sim(m) - s_t^real ||_2^2
where the object state s = [p, q]^T stacks the 3D position p and the orientation quaternion q.
The simulator uses a differentiable rigid-body/contact model and computes gradients of the trajectory loss with respect to mass. The paper implements the state update with a semi-implicit Euler integrator so gradients can backpropagate through the dynamics rollout.
In effect, the robot pushes an object, the system compares simulated motion with real motion, and the mass is adjusted until the simulated motion becomes consistent with reality.
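A minimal sketch of this optimization, reduced to a 1D point mass pushed by a constant force (illustrative numbers, PyTorch autograd standing in for the paper's differentiable engine, which handles full rigid-body contact):

```python
import torch

def rollout(mass, force, steps=100, dt=0.01):
    """Semi-implicit Euler rollout of a 1D point mass pushed by a constant
    force: velocity is updated first, then position uses the NEW velocity."""
    pos = torch.zeros(())
    vel = torch.zeros(())
    traj = []
    for _ in range(steps):
        acc = force / mass          # F = m a
        vel = vel + dt * acc        # semi-implicit: velocity first
        pos = pos + dt * vel        # position uses updated velocity
        traj.append(pos)
    return torch.stack(traj)

# "Real" trajectory, generated here with a hidden true mass of 0.75 kg.
push_force = torch.tensor(2.0)
real_traj = rollout(torch.tensor(0.75), push_force).detach()

# Minimize L_traj(m) = sum_t ||s_t^sim(m) - s_t^real||^2 over the mass.
mass = torch.tensor(2.0, requires_grad=True)   # deliberately wrong initial guess
opt = torch.optim.Adam([mass], lr=0.05)
for _ in range(400):
    opt.zero_grad()
    sim_traj = rollout(mass.clamp(min=1e-3), push_force)
    loss = ((sim_traj - real_traj) ** 2).sum()
    loss.backward()                 # gradient flows through the whole rollout
    opt.step()
# mass should now be close to the true value of 0.75
```

Because every operation in the rollout is differentiable, the trajectory loss backpropagates through all integration steps to the mass, exactly mirroring the structure of the paper's objective.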
3.2 Human demonstration transfer
Human RGB videos are processed to estimate:
- articulated hand pose
- object pose through time
These are then retargeted to a robotic hand using Dex-Retargeting, producing robot-executable trajectories that serve as demonstrations for learning grasp poses.
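Dex-Retargeting solves a much richer version of this problem, but the underlying idea can be illustrated as fitting robot joint angles so that a fingertip reaches a human keypoint. A toy sketch with an assumed 2-link planar finger (link lengths and target are made up):

```python
import torch

def fk_fingertip(joints, link_lengths=(0.05, 0.04)):
    """Forward kinematics of a toy 2-link planar finger: angles -> tip (x, y)."""
    a, b = joints[0], joints[0] + joints[1]
    l1, l2 = link_lengths
    return torch.stack([l1 * torch.cos(a) + l2 * torch.cos(b),
                        l1 * torch.sin(a) + l2 * torch.sin(b)])

# Human fingertip keypoint (e.g. from hand pose estimation), here a fixed point.
target = torch.tensor([0.06, 0.05])

# Optimize joint angles so the robot fingertip matches the human keypoint.
joints = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([joints], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    loss = ((fk_fingertip(joints) - target) ** 2).sum()
    loss.backward()
    opt.step()
```

Real retargeting adds joint limits, multiple fingertips, temporal smoothness, and a morphology mismatch between human and robot hands, but the per-frame objective has this fit-the-keypoints shape.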
3.3 Force-aware policy learning
The learned policy predicts not only hand joint positions but also force-related outputs, and it explicitly conditions on the identified object mass.
This is the main learning claim of the paper: grasping should be mass-aware, because a pose that works for a light object may fail for a heavy object if the applied force is not adjusted.
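A minimal sketch of what "conditioning on mass" can look like architecturally. This is an assumed MLP design with illustrative dimensions and head names, not the paper's network:

```python
import torch
import torch.nn as nn

class MassConditionedGraspPolicy(nn.Module):
    """Hypothetical sketch: concatenate an object feature vector with the
    identified mass, then predict joint targets and a scalar grasp force."""
    def __init__(self, obj_dim=64, num_joints=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obj_dim + 1, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.joint_head = nn.Linear(128, num_joints)   # target joint positions
        self.force_head = nn.Linear(128, 1)            # scalar grasp force

    def forward(self, obj_feat, mass):
        # obj_feat: (B, obj_dim) geometry features; mass: (B, 1) identified mass
        h = self.net(torch.cat([obj_feat, mass], dim=-1))
        return self.joint_head(h), torch.relu(self.force_head(h))  # force >= 0

policy = MassConditionedGraspPolicy()
joints, force = policy(torch.randn(4, 64), torch.full((4, 1), 0.75))
```

The design choice worth noting is that mass enters as a direct input rather than being absorbed into domain randomization, so the same policy can modulate its force output per object.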
4. Why This Is Interesting
Many sim-to-real manipulation papers focus on:
- geometry reconstruction
- domain randomization
- direct imitation from human videos
D-REX argues that physical parameter identification, especially mass, should be part of the loop. That framing is useful because it turns “better reconstruction” into something task-relevant: the policy can exploit the recovered mass to learn force-aware grasps instead of only shape-aware grasps.
5. Experiments and Main Results
5.1 Mass identification
The paper evaluates mass identification by pushing objects with the same actions in the real world and in simulation, then optimizing mass from trajectory mismatch.
Reported findings:
- across diverse objects, percentage error ranges roughly from 4.8% to 12.0%
- for objects with the same geometry but different internal densities, the estimated mass error is under 13 g
- simulated trajectories with the optimized mass match real trajectories much better than trajectories simulated with an incorrect, lighter default mass
This is the strongest systems contribution in the paper: the differentiable engine appears accurate enough to recover task-relevant mass information from interaction data.
5.2 Force-aware grasping
The grasping experiments show that:
- policies trained for one mass perform well mainly on that mass
- mismatched mass leads to failures from too much or too little force
- policies using identified mass achieve performance comparable to policies using ground-truth mass
The cross-evaluation table is especially intuitive: train on one density, evaluate on another, and success drops sharply when mass no longer matches.
5.3 End-to-end tabletop grasping
The method is compared against:
- DexGraspNet 2.0
- Human2Sim2Robot
Across eight tabletop objects with different shapes and masses, D-REX reports consistently higher success rates and lower variance. The paper emphasizes that baseline performance degrades as objects get heavier, whereas the proposed force-aware policy remains more stable.
5.4 Runtime
The system is not lightweight, but it is practical as an offline pipeline:
- object reconstruction uses about 300-340 RGB images
- offline reconstruction takes about 30-35 minutes per object
- mass identification takes about 1.43-1.68 s per iteration
- convergence is typically around 200 epochs, or roughly 5-20 minutes
This is acceptable if the goal is offline digital-twin construction followed by policy training.
6. Strengths
- Clear systems story: reconstruction, identification, demo transfer, and policy learning are tightly connected.
- Good task framing: identifying mass is a concrete and manipulation-relevant target.
- Differentiable physics is used for a meaningful downstream benefit rather than as a standalone novelty.
- The force-aware policy claim is experimentally interpretable: wrong mass leads to the wrong force.
- The pipeline uses only RGB observations plus robot interaction data, which is appealing from a deployment standpoint.
7. Limitations and Open Questions
- The identified physical parameter is mainly mass; other contact properties such as friction and compliance are still major contributors to sim-to-real error.
- The paper relies on several upstream components, including pose estimation, reconstruction, and retargeting, so the full pipeline may be brittle in harder settings.
- Runtime is clearly offline; this is not yet close to online adaptation.
- It is not obvious how well the approach scales to more contact-rich tasks such as in-hand reorientation or dynamic non-prehensile manipulation.
- The grasping policy is conditioned on a single inferred mass value, which may be too coarse for objects with complicated internal mass distributions.
8. Takeaways
My main takeaway is that the paper makes a strong case for task-driven system identification in dexterous manipulation. Instead of treating digital twins as purely visual assets, D-REX uses differentiable simulation to recover the part of physics that most directly matters for grasp success.
For robotics research, this suggests a useful recipe:
- reconstruct the object
- identify the missing physical parameters from interaction
- train the policy with those parameters explicitly in the loop
That is a more convincing route to sim-to-real transfer than relying on geometry reconstruction alone.
