[Paper Notes] Human Universal Grasping
Published:
TL;DR
Human Universal Grasping (HUG) is built around a direct scaling argument: human egocentric grasp data can become dexterous robot grasp supervision if every grasp is mapped into a canonical human hand space and then retargeted at deployment. The paper collects smart-glasses recordings of people grasping everyday objects, fits each terminal grasp to MANO, trains an RGB-D and point-conditioned flow-matching model, and retargets the predicted human grasp to robot hands.
The headline assets are 1M-HUGs, a 27.8-hour dataset with 1M egocentric grasp frames, 6,707 object instances, and 41 buildings; HUG, the flow-matching grasp model; and HUG-Bench, a 90-object benchmark with metric-scale meshes for paired simulation and real-world evaluation.
On the 30-object HUG-Bench test set, HUG reaches 73.0% success in MuJoCo simulation, 66.7% real-world tabletop success, and 62.0% in-the-wild success. In tabletop trials, it beats Dex1B by 23 percentage points (66.7% vs. 43.7%) and CAP by 34 percentage points (66.7% vs. 32.7%).
Paper Info
The paper is “Human Universal Grasping” by Kevin Yuanbo Wu, Tianxing Zhou, Isaac Tu, Billy Yan, Irmak Guzey, David Fouhey, Dandan Shan, and Lerrel Pinto.
It appears on arXiv as arXiv:2606.17054, dated June 15, 2026. The project page is grasping.io, with code, data, benchmark, checkpoints, and an interactive demo.
Problem and Motivation
Dexterous grasping is still bottlenecked by data. Simulation can generate many grasps, but sim-to-real transfer for multi-fingered hands remains brittle; teleoperation collects real robot grasps, but it is slow and embodiment-specific. HUG shifts the source of supervision to ordinary human behavior. People already grasp thousands of objects in natural settings, and smart glasses now provide calibrated egocentric RGB-D, camera motion, and hand tracking for that behavior.
The core pipeline is compact: collect in-the-wild human grasps, fit them into a shared MANO hand space, learn a conditional distribution over human grasps from RGB-D observations and target points, then retarget the generated grasp to a robot hand. The paper’s central claim is that this route gives robot dexterity a scalable data source while keeping the learned representation independent of any single robot embodiment.
1M-HUGs Dataset
1M-HUGs is collected with Aria Gen 2 smart glasses. In each recording, the wearer first looks around a target object for 15-30 seconds while the hands stay out of view, then reaches in with the right hand and grasps the object. This protocol turns one physical grasp into many training pairs: the final grasp pose is propagated backward through camera poses and paired with earlier object-only RGB-D frames from different viewpoints.
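The backward propagation amounts to re-expressing one fixed wrist pose in an earlier camera frame. The following is a minimal sketch, assuming 4x4 homogeneous camera-to-world poses and an object that stays still between the earlier frame and the grasp. The function and argument names are hypothetical, not taken from the paper's code.

```python
import numpy as np


def propagate_grasp(T_world_cam_grasp, T_world_cam_early, T_camgrasp_wrist):
    """Express the terminal wrist pose in an earlier camera frame.

    All inputs are 4x4 homogeneous transforms. The two camera poses are
    assumed to be camera-to-world; the wrist pose is given in the frame of
    the camera at the moment of the grasp.
    """
    # Wrist pose in the world frame at the moment of the grasp.
    T_world_wrist = T_world_cam_grasp @ T_camgrasp_wrist
    # Re-express that fixed world pose in the earlier camera's frame.
    return np.linalg.inv(T_world_cam_early) @ T_world_wrist


# Toy example: the earlier camera sits 1 m to the right of the grasp camera.
T_grasp_cam = np.eye(4)
T_early_cam = np.eye(4)
T_early_cam[0, 3] = 1.0
T_wrist = np.eye(4)
T_wrist[2, 3] = 0.5  # wrist 0.5 m in front of the grasp camera

T_early_wrist = propagate_grasp(T_grasp_cam, T_early_cam, T_wrist)
```

In this toy case the wrist lands 1 m to the left of the earlier camera and 0.5 m in front of it, which is the label that would be paired with the earlier object-only frame.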
Each curated training entry contains a 224 x 224 RGB or grayscale frame, intrinsics, metric depth, an object mask, and the terminal MANO hand pose plus wrist transform in the camera frame. The pipeline uses a vision-language model for object identification, SAM3 for mask propagation, heuristics for grasp-frame selection, and human review in a web annotation interface. After filtering, the dataset has 1M RGB frames and 1M grayscale stereo-left frames, roughly 2M training entries, from 6,707 recordings, 41 buildings, and about 1.5K unique objects.
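To make the entry layout concrete, here is a hedged sketch of one training record as a Python dataclass. The field list follows the description above, but the field names, the depth resolution, the intrinsics layout, and the axis-angle pose encoding are all assumptions for illustration, not the released data format.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class GraspEntry:
    """One curated training entry (field names are hypothetical)."""

    image: np.ndarray        # (224, 224, 3) RGB or (224, 224) grayscale frame
    intrinsics: np.ndarray   # (3, 3) pinhole matrix; layout is an assumption
    depth_m: np.ndarray      # (224, 224) metric depth in meters; resolution assumed
    object_mask: np.ndarray  # (224, 224) boolean mask of the target object
    mano_pose: np.ndarray    # (15, 3) finger joint rotations; axis-angle is assumed
    T_cam_wrist: np.ndarray  # (4, 4) wrist transform in the camera frame

    def validate(self) -> None:
        """Raise ValueError if any array has an unexpected shape."""
        expected = {
            "intrinsics": (3, 3),
            "depth_m": (224, 224),
            "object_mask": (224, 224),
            "mano_pose": (15, 3),
            "T_cam_wrist": (4, 4),
        }
        if self.image.shape[:2] != (224, 224):
            raise ValueError(f"image has shape {self.image.shape}")
        for name, shape in expected.items():
            actual = getattr(self, name).shape
            if actual != shape:
                raise ValueError(f"{name} has shape {actual}, expected {shape}")


# A placeholder entry with an RGB frame and an identity wrist transform.
entry = GraspEntry(
    image=np.zeros((224, 224, 3), dtype=np.uint8),
    intrinsics=np.eye(3),
    depth_m=np.ones((224, 224)),
    object_mask=np.zeros((224, 224), dtype=bool),
    mano_pose=np.zeros((15, 3)),
    T_cam_wrist=np.eye(4),
)
entry.validate()
```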
MANO as the Common Hand Space
Aria provides sparse tracking of 21 hand landmarks, so HUG optimizes a full MANO hand for each frame. MANO separates shape \(\beta\), which controls hand size and proportions, from pose \(\theta\), which controls joint articulation. HUG fixes shape to one canonical hand during training, so different collectors do not introduce different hand scales into the grasp target.
This is the representation choice that makes the whole system portable. The model learns wrist placement and articulated finger pose in a stable human hand coordinate system. The same canonical MANO hand can be exported to MuJoCo for simulation, and the predicted hand pose can later be retargeted to robot hands with different kinematics.
HUG Model
The model takes a single RGB-D observation and a 2D click \((u, v)\) on the target object. Depth and camera intrinsics lift the click to a 3D query point \(p_q\), giving the model a compact way to specify which object or object part should be grasped.
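Lifting the click is standard pinhole back-projection. The sketch below assumes a 3x3 intrinsics matrix with focal lengths on the diagonal and the principal point in the last column, plus a depth map in meters aligned with the image. The function name `lift_click` is hypothetical.

```python
import numpy as np


def lift_click(u, v, depth_m, K):
    """Back-project pixel (u, v) to a 3D point in the camera frame.

    depth_m is an (H, W) metric depth map aligned with the image, and K is
    an assumed 3x3 pinhole intrinsics matrix.
    """
    # Depth maps are indexed [row, column], i.e. [v, u].
    z = float(depth_m[int(round(v)), int(round(u))])
    if not np.isfinite(z) or z <= 0:
        raise ValueError("no valid depth at the clicked pixel")
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])


# Toy example: a flat surface 2 m away, clicked 50 pixels right of center.
K = np.array([[100.0, 0.0, 112.0],
              [0.0, 100.0, 112.0],
              [0.0, 0.0, 1.0]])
depth = np.full((224, 224), 2.0)
p_q = lift_click(162, 112, depth, K)
```

With these toy intrinsics the query point is 1 m to the right of the optical axis at 2 m depth. A real pipeline would also need a fallback for clicks that land on missing depth, which this sketch simply rejects.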
The output is a 99-dimensional grasp state:
\[x = [t, R_{6d}, \theta_{6d}] \in \mathbb{R}^{99}\]
Here \(t \in \mathbb{R}^3\) is wrist translation in the camera frame, \(R_{6d} \in \mathbb{R}^6\) is wrist rotation in the continuous 6D representation, and \(\theta_{6d} \in \mathbb{R}^{15 \times 6}\) represents the 15 MANO finger joints.
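The dimensions add up as 3 + 6 + 15 × 6 = 99. As a rough sketch of how such a vector could be decoded, the code below splits it in the order given above and maps each 6D block to a rotation matrix with Gram-Schmidt orthogonalization. Treating the two 3-vectors as the first two columns of the rotation matrix is an assumption about the convention, and the function names are hypothetical.

```python
import numpy as np


def rot6d_to_matrix(r6):
    """Map a 6D rotation vector to a 3x3 rotation matrix via Gram-Schmidt."""
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)
    # Remove the component of a2 along b1, then normalize.
    a2 = a2 - np.dot(b1, a2) * b1
    b2 = a2 / np.linalg.norm(a2)
    b3 = np.cross(b1, b2)
    # Assumed convention: b1, b2, b3 are the columns of the matrix.
    return np.stack([b1, b2, b3], axis=-1)


def unpack_grasp(x):
    """Split a 99-D grasp state into translation, wrist and finger rotations."""
    x = np.asarray(x, dtype=float)
    if x.shape != (99,):
        raise ValueError(f"expected shape (99,), got {x.shape}")
    t = x[:3]
    R_wrist = rot6d_to_matrix(x[3:9])
    R_fingers = np.stack([rot6d_to_matrix(r) for r in x[9:].reshape(15, 6)])
    return t, R_wrist, R_fingers


# Toy state: a translation plus identity rotations everywhere.
identity_6d = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
x = np.concatenate([[0.1, 0.2, 0.3], np.tile(identity_6d, 16)])
t, R_wrist, R_fingers = unpack_grasp(x)
```

The 6D form is convenient for a generative model because any pair of non-parallel vectors decodes to a valid rotation, so the network output does not need to satisfy an orthogonality constraint.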
The perception stack combines RGB semantics with local 3D geometry. A frozen DINOv2-Base with register tokens encodes the image, while a trainable PointNeXt U-Net encodes 4096 points cropped within 0.3 m of the query point. The two streams meet through point painting: point-cloud centroids are projected into the image, DINO features are sampled at those locations, and the concatenated RGB/3D features are refined by a transformer. The grasp generator is a flow-matching transformer that tokenizes translation, wrist rotation, and finger pose, conditions on the fused scene tokens, and integrates the learned velocity field with 50 Euler steps at inference.
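Two of the concrete numbers in this paragraph, the 0.3 m crop with 4096 points and the 50 Euler steps, can be illustrated with a short sketch. Several choices here are assumptions rather than details from the paper: the crop uses uniform random sampling, time runs from noise at t = 1 to a clean grasp at t = 0, and the velocity field is a toy stand-in for the trained transformer.

```python
import numpy as np


def crop_points(points, query, radius=0.3, n_points=4096, rng=None):
    """Keep points within `radius` meters of `query`, resampled to n_points."""
    rng = np.random.default_rng() if rng is None else rng
    near = points[np.linalg.norm(points - query, axis=1) <= radius]
    if len(near) == 0:
        raise ValueError("no points near the query")
    # Sample with replacement only when the crop holds too few points.
    idx = rng.choice(len(near), size=n_points, replace=len(near) < n_points)
    return near[idx]


def euler_sample(velocity_fn, x_noise, n_steps=50):
    """Integrate dx/dt = velocity_fn(x, t) from t = 1 (noise) to t = 0."""
    x = np.array(x_noise, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps, 0, -1):
        t = k * dt
        x = x - dt * velocity_fn(x, t)  # one explicit Euler step backward in t
    return x


rng = np.random.default_rng(0)

# Crop a synthetic cloud around a query point at the origin.
cloud = rng.uniform(-1.0, 1.0, size=(20000, 3))
crop = crop_points(cloud, np.zeros(3), rng=rng)

# Toy velocity field for the straight path x_t = (1 - t) * x_clean + t * noise.
x_clean = np.linspace(-1.0, 1.0, 99)
grasp = euler_sample(lambda x, t: (x - x_clean) / t, rng.normal(size=99))
```

In the toy case the integrator recovers `x_clean` because the velocity field is exact. With a learned field, the 50 steps trade inference cost against integration error.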
Training Objective
The learning objective combines velocity prediction in normalized grasp space with geometric hand supervision. The paper reconstructs the predicted clean grasp, runs it through MANO, and applies an L1 loss to 3D hand landmarks in the camera frame:
\[L = \lambda_v L_v + \lambda_{3D}(1 - t)L_{3D}\]
The weights are \(\lambda_v = 1\) and \(\lambda_{3D} = 20\). The \((1 - t)\) factor emphasizes near-clean denoising steps, where the reconstructed hand is physically meaningful. This 3D loss is crucial: removing it drops HUG-Bench test success from 73.0% to 32.7% and raises fingertip contact error from 14.6 mm to 35.7 mm. The full model trains for 100K steps with AdamW, batch size 128, and two RTX 5090 GPUs, taking about 10 hours including MuJoCo validation.
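The weighting can be written out in a few lines. This is a sketch only: the L1 landmark term and the two weights come from the description above, while the squared-error form of the velocity term, the mean reduction, and the function name are assumptions.

```python
import numpy as np


def combined_loss(v_pred, v_target, lm_pred, lm_gt, t, lam_v=1.0, lam_3d=20.0):
    """Velocity loss plus a time-weighted L1 loss on 3D hand landmarks."""
    # Velocity term in normalized grasp space (squared error is assumed).
    l_v = np.mean((v_pred - v_target) ** 2)
    # L1 term on camera-frame landmarks of the reconstructed clean grasp.
    l_3d = np.mean(np.abs(lm_pred - lm_gt))
    # The (1 - t) factor scales the landmark term with the time step.
    return lam_v * l_v + lam_3d * (1.0 - t) * l_3d


# Toy inputs: unit velocity error and 1 cm error on every landmark coordinate.
v_pred, v_target = np.zeros(99), np.ones(99)
lm_pred, lm_gt = np.zeros((21, 3)), np.full((21, 3), 0.01)

loss_mid = combined_loss(v_pred, v_target, lm_pred, lm_gt, t=0.5)
loss_end = combined_loss(v_pred, v_target, lm_pred, lm_gt, t=1.0)
```

At t = 0.5 the landmark term contributes 20 × 0.5 × 0.01 = 0.1 on top of the velocity loss of 1.0, and at t = 1 it vanishes entirely.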
HUG-Bench
HUG-Bench contains 90 unseen everyday objects, arranged by five geometric categories (cylindrical, spheroidal, prismatic, appendaged, and amorphous) and three size bins. Each category-size cell has four validation objects and two test objects, giving 30 test objects for real-world evaluation. The set is deliberately awkward: small and large items, handles, articulated structures, and objects that require structure-aware contact, including glue stick, pepper shaker, wine bottle, strawberry, football, storage bin, picnic basket, rubber duck, grapes, headphones, and easel.
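The split arithmetic follows directly from the grid of categories and size bins:

\[5 \times 3 = 15 \text{ cells}, \qquad 15 \times 4 = 60 \text{ validation objects}, \qquad 15 \times 2 = 30 \text{ test objects}, \qquad 60 + 30 = 90\]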
The benchmark also contributes the simulation assets needed to evaluate dexterous grasps. The authors build metric meshes from short Aria recordings by extending Multi-view SAM3D with Aria intrinsics, extrinsics, and stereo depth. They then manually align the meshes, make them watertight, and compute convex decompositions for MuJoCo. The released scan-to-asset pipeline is called aria2mesh.
Simulation Results
In MuJoCo, the canonical MANO hand executes an open-loop pre-grasp, grasp, and lift rollout. A grasp succeeds if the object is lifted away from the surface.
The main simulation results are:
| Method | Val SR | Test SR | Test fingertip contact error |
|---|---|---|---|
| RGB + PC full HUG | 71.5% | 73.0% | 14.6 mm |
| without point cloud crop | 61.2% | 58.0% | 25.7 mm |
| without point painting | 61.8% | 58.3% | 23.3 mm |
| without 3D loss | 39.2% | 32.7% | 35.7 mm |
| PC only | 64.2% | 70.7% | 22.1 mm |
| RGB only | 26.8% | 29.7% | 108.6 mm |
| Human grasp oracle | 90.3% | 94.0% | 7.4 mm |
The ablations point to a useful division of labor. Point-cloud geometry carries the main spatial signal, RGB adds semantic grounding and improves fingertip placement, and the full RGB+PC model gives the best contact accuracy. The human grasp oracle exposes the remaining gap from tracking noise, asset imperfections, and open-loop execution. Scaling is equally important: from 25K to 1M RGB frames, test success rises from 33% to 73%, while fingertip contact error drops from 54.2 mm to 14.6 mm. The curve has not saturated at 1M, so the system still looks data-bound.
Real-World Results
The real-world evaluation uses the 30 HUG-Bench test objects, with 10 trials per object and 300 trials per method.
In tabletop experiments, HUG and Dex1B are deployed on a 6-DoF Ability hand mounted on a 7-DoF xArm, using a third-person ZED stereo camera. CAP uses its published parallel-jaw configuration with an iPhone wrist camera.
| Method | Overall tabletop success | Objects with at least one success |
|---|---|---|
| Dex1B | 43.7% | 27/30 |
| CAP | 32.7% | 20/30 |
| HUG | 66.7% | 28/30 |
HUG is especially strong on large prismatic and structured objects: 10/10 on the storage bin where both baselines get 0/10, 9/10 on the picnic basket, 9/10 on the spray bottle, and 8/10 on the easel.
The in-the-wild evaluation changes camera, arm, hand, scene, and viewpoint at once. HUG is deployed on a YOR mobile manipulator with an AgileX NERO arm, a 20-DoF WUJI hand, and Aria Gen 2 for vision, with no onsite model tuning or WUJI-specific retargeting adjustment. It reaches 62.0% in-the-wild success, only 4.7 percentage points below tabletop, and succeeds at least once on 29/30 objects.
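As a quick sanity check on the reported rates, the snippet below converts them to approximate success counts out of 300 trials and computes the gaps to HUG's tabletop result. The counts are implied by rounding and are not quoted from the paper.

```python
TRIALS = 30 * 10  # 30 test objects x 10 trials each

# Success rates in percent, as reported above.
rates = {
    "Dex1B": 43.7,
    "CAP": 32.7,
    "HUG": 66.7,
    "HUG (in the wild)": 62.0,
}

# Approximate number of successful trials implied by each rate.
counts = {name: round(rate / 100 * TRIALS) for name, rate in rates.items()}

# Gap to HUG's tabletop result, in percentage points.
gaps = {
    name: round(rates["HUG"] - rate, 1)
    for name, rate in rates.items()
    if name != "HUG"
}

for name, rate in rates.items():
    print(f"{name}: {rate}% is about {counts[name]}/{TRIALS} trials")
```

The gaps come out to 23.0 and 34.0 percentage points over Dex1B and CAP, and 4.7 points between tabletop and in-the-wild HUG.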
Failure Modes
The failure breakdown is practical rather than mysterious. Most failures occur during the transition from pre-grasp to grasp, where the hand hits the object while closing or collides with the table. The remaining modes fall before or after that transition: the hand misses or overreaches before pre-grasp, the object slips during lift, or the object drops after it has been raised. This matches the system design. HUG predicts a static grasp and executes it open-loop after retargeting, so motion planning could reduce object and table collisions, and force-aware closing could reduce post-grasp slips.
Strengths and Limitations
The strongest part of HUG is the data flywheel. A single human grasp becomes many object-only RGB-D training pairs by propagating the final grasp backward through the camera poses into earlier frames, making smart-glasses capture far more efficient than one-pair-per-grasp collection. MANO gives the system a canonical human hand space for learning, simulation, and retargeting, while HUG-Bench makes the evaluation concrete with real objects, metric simulation assets, and real robot trials. The real-world transfer is the most compelling empirical signal: the same human-trained model works zero-shot across a ZED+xArm+Ability tabletop setup and an Aria+YOR+WUJI household setup.
The limitations are also clear. HUG is trained on right-hand grasps only, so left-handed, bimanual, and morphology-specific grasp styles are outside the current coverage. The fixed MANO shape gives consistency but can mismatch a target robot hand or a human grasp that depends on hand size. Deployment is open-loop, with no visual or force feedback during contact and lift. The model predicts one grasp per trial, leaving multi-sample selection as an obvious next step. The evaluation remains indoor-focused; outdoor, industrial, tool-use, transparent, reflective, and heavily deformable objects are still open directions.
Takeaway
For my taxonomy, I would label this paper Human Grasp Data / Dexterous Grasp Generation / RGB-D Flow Matching / Cross-Embodiment Retargeting.
The reusable message is concise: use wearable capture to turn everyday human grasping into large-scale supervision; use MANO to canonicalize the human hand target; use RGB-D, a clicked 3D query point, and flow matching to generate a grasp; then use retargeting to move from human hand space to robot embodiments. HUG makes a persuasive case that human egocentric data can be a scalable route to dexterous robot grasping when the capture stack provides calibrated depth, camera motion, and hand tracking.
