[Paper Notes] UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos
TL;DR
UniDex is a complete foundation suite for universal dexterous hand control, built from egocentric human videos rather than expensive robot teleoperation data. It introduces three tightly coupled components: (1) UniDex-Dataset — 50K+ trajectories across 8 different robot hands (6–24 DoFs) retargeted from human video data; (2) FAAS (Function–Actuator–Aligned Space) — a unified action representation that maps functionally similar joints across different hands to shared coordinates, enabling cross-hand transfer; and (3) UniDex-VLA — a 3D vision-language-action policy pretrained on this dataset. On five challenging real-world tool-use tasks, UniDex-VLA achieves 81% average task progress (vs. 38% for π₀), demonstrates zero-shot cross-hand skill transfer, and shows that human video data can partially substitute for robot demonstrations at a ~2:1 exchange rate. Accepted at CVPR 2026.
Paper Info
- Title: UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos
- Authors: Gu Zhang†, Qicheng Xu, Haozhe Zhang, Jianhan Ma, Long He, Yiming Bao*, and 13 additional contributors
- Affiliations: Tsinghua University, Shanghai Qizhi Institute, Sun Yat-sen University, UNC Chapel Hill
- Venue: CVPR 2026
- arXiv: 2603.22264
- Project page: unidex-ai.github.io
- Code: github.com/unidex-ai/UniDex
1. Problem and Motivation
Dexterous manipulation with multi-fingered hands is hard — and building foundation models for it is even harder than for grippers, for three reasons:
- Data scarcity: Dexterous hand teleoperation is expensive and doesn’t scale. Most existing robot foundation datasets are gripper-centric.
- Embodiment heterogeneity: Dexterous hands vary wildly in DoFs (6–24), morphology, kinematics, and appearance. A policy trained on one hand doesn’t transfer to another.
- High dimensionality: Controlling 20+ joints simultaneously demands expressive action spaces and effective learning algorithms.
The key insight: dexterous robot hands are designed to mimic human hands, and humans naturally generate abundant manipulation data in daily life. Egocentric human videos are cheaper, more diverse, and easier to scale than robot teleoperation. The challenge is bridging the kinematic and visual gaps between human and robot hands.
2. Method
2.1 UniDex-Dataset: From Human Videos to Robot Trajectories
Data Sources: Four egocentric human-manipulation datasets — H2O, HOI4D, HOT3D, and TACO — providing diverse daily manipulation activities.
Human-to-Robot Transformation Pipeline:
- Visual alignment: Compute pointclouds from RGB-D, mask out the human hands (using WiLoR + SAM2), replace them with retargeted robot hand meshes, and reproject to a single-view pointcloud
- Kinematic retargeting (human-in-the-loop; a minimal IK sketch follows this list):
  - Extract \(m\) fingertip targets from the human hand pose: \(X^\star = [x_1^\star, \ldots, x_m^\star] \in \mathbb{R}^{3 \times m}\)
  - Introduce a 6-DoF dummy base offset \(T_{\text{offset}}\) for global alignment
  - Solve fingertip IK for \(q\), where the robot fingertip positions are \(x_i(q; T_{\text{offset}}) = \text{Trans}(T_{\text{world}}^{\text{dummy}} \cdot T_{\text{offset}} \cdot T_i(q)) \in \mathbb{R}^3\)
  - Automatic stage: an IK solver minimizes the fingertip tracking error subject to joint limits and damping
  - Interactive stage: a human adjusts \(T_{\text{offset}}\) via GUI sliders until contacts look plausible
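A minimal sketch of the automatic IK stage, assuming a placeholder forward-kinematics callable `fingertip_positions(q, T_offset)` (not from the paper's code) that returns the robot's \(3 \times m\) fingertip positions for joint vector \(q\):

```python
import numpy as np
from scipy.optimize import least_squares

def retarget_frame(x_star, q_init, T_offset, fingertip_positions,
                   q_lower, q_upper, damping=1e-2):
    """Damped fingertip IK for one video frame.

    x_star: (3, m) human fingertip targets; q_init: previous frame's joints;
    fingertip_positions(q, T_offset): placeholder robot FK returning (3, m).
    """
    def residuals(q):
        # Fingertip tracking error, flattened into a residual vector
        track = (fingertip_positions(q, T_offset) - x_star).ravel()
        # Damping keeps the solution near the previous frame for smooth motion
        return np.concatenate([track, damping * (q - q_init)])

    # Joint limits are enforced as box bounds on q
    sol = least_squares(residuals, q_init, bounds=(q_lower, q_upper))
    return sol.x  # retargeted joint command for this frame
```

The interactive stage then adjusts \(T_{\text{offset}}\) via the GUI and re-runs this solve until the rendered contacts look plausible.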
Result: 9M paired image–pointcloud–action frames, 50K+ trajectories, 8 robot hand platforms (Inspire, Leap, Shadow, Allegro, Ability, Oymotion, Xhand, Wuji), covering 6–24 active DoFs.
| Dataset | Trajectories | Hands | Language | Varied Scenes | Pointcloud |
|---|---|---|---|---|---|
| UniDex-Dataset | 52K | 8 | ✓ | ✓ | ✓ |
| ActionNet | 30K | 2 | ✓ | ✗ | low-quality |
| RoboMind | 19K | 1 | ✓ | ✗ | ✗ |
| RealDex | 2K | 2 | ✓ | ✗ | ✓ |
2.2 FAAS: Function–Actuator–Aligned Space
The core idea: despite different DoFs and kinematics, all dexterous hands implement the same small set of functional primitives — thumb–index pinch, finger curling around handles, lateral ab-/adduction for stabilization.
FAAS maps each actuator to a shared index based on its functional role, not its URDF position. This creates a function-centric control interface shared across embodiments:
- 82-dimensional vector: 18 dims for wrist pose (9d pose × 2 hands = absolute + relative), 64 dims for joint commands (32 slots per hand)
- 21 base actuator slots shared across all hands; remaining slots for hand-specific DoFs
- Functionally similar joints (e.g., thumb flexion on Allegro vs. Inspire) get the same FAAS index
This is elegant because it’s purely a mapping — no learned alignment, no post-processing. Just group actuators by what they do.
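To make the idea concrete, here is a hypothetical sketch of such a mapping; the joint names, functional roles, and slot indices below are illustrative placeholders, not the paper's actual FAAS layout:

```python
import numpy as np

# Shared functional slots (indices are illustrative, not the paper's layout)
FAAS_SLOT = {"thumb_flex": 0, "thumb_abduct": 1, "index_flex": 2,
             "middle_flex": 3, "ring_flex": 4, "pinky_flex": 5}

# Per-hand assignment of actuators to functional roles (also illustrative)
INSPIRE_ROLES = {"thumb_bend": "thumb_flex", "thumb_rot": "thumb_abduct",
                 "index": "index_flex", "middle": "middle_flex",
                 "ring": "ring_flex", "little": "pinky_flex"}

def to_faas(joint_names, q, roles, num_slots=32):
    """Scatter one hand's joint command q into the shared FAAS slots."""
    faas = np.zeros(num_slots)
    for name, value in zip(joint_names, q):
        # Slots are grouped by function, not by URDF joint order
        faas[FAAS_SLOT[roles[name]]] = value
    return faas
```

Under this scheme a 6-DoF Inspire command and a 16-DoF Allegro command land in the same slots for functionally equivalent joints, so a single policy reads and writes one shared space.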
2.3 UniDex-VLA: 3D Vision-Language-Action Policy
Architecture follows π₀ with key modifications for 3D dexterous control:
- 3D pointcloud encoder: Replace SigLIP (2D) with Uni3D — a vanilla ViT pretrained to align pointcloud features with image–text features
- Backbone: Gemma (from PaliGemma), fusing pointcloud features with text and proprioception
- Action head: Flow matching with forward-Euler integration at inference (see the sketch at the end of this subsection)
- Observation: \(o_t = [P_t, \ell_t, q_t]\) — colored pointcloud, language instruction, proprioceptive state (all in FAAS)
- Action: \(H\)-step action chunk \(A_t = [a_t, \ldots, a_{t+H-1}]\) in FAAS
Training: Pretrain on UniDex-Dataset, then finetune with 50 task demonstrations per task.
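A minimal sketch of the flow-matching action head at inference time, assuming a velocity network `velocity_net(a, t, obs_emb)` with this hypothetical interface; in UniDex-VLA the head sits on top of the Gemma backbone's fused observation features:

```python
import torch

@torch.no_grad()
def sample_action_chunk(velocity_net, obs_emb, horizon, action_dim=82, steps=10):
    """Integrate the learned flow from Gaussian noise to a FAAS action chunk."""
    a = torch.randn(1, horizon, action_dim)       # noisy chunk A_t at flow time 0
    dt = 1.0 / steps
    for k in range(steps):
        t = torch.full((1,), k * dt)
        a = a + dt * velocity_net(a, t, obs_emb)  # forward-Euler step along the flow
    return a                                      # H-step chunk of 82-d FAAS actions
```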
2.4 UniDex-Cap: Human-Robot Data Co-training
A practical portable capture setup:
- Apple Vision Pro (hand/head pose estimation) + Intel RealSense L515 (RGB-D) + 3D-printed mount
- Time-synchronized, calibrated to shared coordinate frame
- Captured human data → transformation pipeline → robot trajectories for co-training
Key finding: ~2 human demos can substitute for 1 robot demo, and human demos are ~5.2× faster to collect, which translates into a significant cost reduction when scaling dexterous data collection.
3. Experiments and Main Results
Hardware
- 7-DoF Franka Panda arm + three dexterous end-effectors: Inspire (6 active, 12 full DoFs), Wuji (20 active DoFs), Oymotion (6 active, 11 full DoFs)
- Intel RealSense L515 for egocentric RGB-D
- Only 50 demonstrations per task for fine-tuning
Five Real-World Tool-Use Tasks
- Make Coffee (Inspire): Grasp kettle → lift to dripper → pour water
- Sweep Objects (Inspire): Grasp sweeper → sweep objects into dustpan
- Water Flowers (Wuji): Grasp spray bottle → press trigger with thumb
- Cut Bags (Wuji): Insert fingers into scissors → cut bags
- Use Mouse (Wuji): Place fingers on mouse → drag file → click
Main Results (Average Task Progress, 20 trials/task)
| Model | Make Coffee | Sweep | Water Flowers | Cut Bags | Use Mouse | Average |
|---|---|---|---|---|---|---|
| DP | 32.5 | 37.5 | 50.0 | 27.5 | 20.0 | 29.0 ± 19.9% |
| DP3 | 35.0 | 50.0 | 40.0 | 12.5 | 20.0 | 35.0 ± 17.1% |
| π₀ | 60.0 | 55.0 | 85.0 | 15.0 | 60.0 | 38.0 ± 7.4% |
| UniDex-VLA (No Pretrain) | 60.0 | 82.5 | 50.0 | 32.5 | 30.0 | 32.5 ± 18.5% |
| UniDex-VLA | 87.5 | 82.5 | 85.0 | 90.0 | 60.0 | 81.0 ± 12.1% |
UniDex-VLA achieves 81% average task progress — more than doubling π₀ (38%) and all other baselines.
Generalization Results
Spatial generalization: With DemoGen augmentation, UniDex-VLA approaches near-perfect success across out-of-distribution object placements.
Object generalization: Replacing the original black kettle with a smaller purple kettle of different shape → UniDex-VLA achieves 80% (vs. 15% for π₀), showing robust tool understanding.
Cross-hand transfer (zero-shot):
| Hand | π₀ | UniDex-VLA (No Pretrain) | UniDex-VLA |
|---|---|---|---|
| Wuji | 0% | 0% | 40% |
| Oymotion | 10% | 5% | 60% |
A policy trained only on Inspire Hand transfers zero-shot to Wuji and Oymotion — this is enabled by FAAS. Baselines completely fail.
Human-Robot Co-training
The co-training heatmap (Fig. 13) reveals:
- With 0 robot demos, adding human demos alone doesn’t work (all zeros)
- With even 10 robot demos + human demos, performance scales steadily
- The “high-performance” boundary has slope ≈ 2, meaning ~2 human demos ≈ 1 robot demo
- Human demos are ~5.2× faster to collect → substantial cost savings
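Putting the two numbers together (a rough back-of-the-envelope estimate of my own, assuming collection time dominates cost and the 2:1 exchange rate holds): one robot-equivalent demo collected via human capture takes about \(2 \times t_{\text{robot}} / 5.2 \approx 0.38\, t_{\text{robot}}\), i.e. roughly a 2.6× reduction in collection time for the same training value.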
4. Strengths
- Complete suite, not just a model: Dataset + action space + policy + capture system — each component is independently useful
- FAAS is a clean abstraction: No learned alignment, no post-processing — just a principled functional mapping that enables cross-hand transfer out of the box
- Human video as scalable data source: The retargeting pipeline with human-in-the-loop quality control is practical and produces usable training data
- Strong empirical results: 81% on genuinely difficult tool-use tasks with only 50 demos, plus zero-shot hand transfer that baselines completely fail at
- 3D pointcloud input: The right choice for dexterous manipulation — tool-use requires reasoning about 3D geometry and contact affordances that 2D images can’t provide
- Open-source: Dataset, code, and models all publicly available
5. Limitations
- No action-free pretraining: The framework doesn’t yet leverage the vast amounts of unlabeled egocentric video data (without action annotations) — incorporating these could further scale pretraining
- Human-in-the-loop retargeting: While practical, the interactive calibration step still requires human effort per dataset/hand combination — fully automatic retargeting would improve scalability
- Limited to tool-use tasks: All five real-world tasks involve tool use — in-hand manipulation (e.g., reorienting objects within the hand) is not evaluated
- 50 demos per task: While much less than end-to-end approaches, this is still not zero-shot — true zero-shot dexterous manipulation from pretraining alone remains open
- Single-arm setup: All experiments use a single Franka arm — bimanual dexterous manipulation is not addressed
- FAAS assumes functional similarity: The mapping assumes all hands share the same functional primitives — highly exotic hand designs might not fit cleanly
6. Takeaways
- Egocentric human videos are a viable foundation for dexterous manipulation — the kinematic and visual gaps are real but bridgeable with careful retargeting and visual alignment
- Function-centric action spaces (FAAS) are a compelling alternative to learned latent spaces for cross-embodiment transfer — simpler, more interpretable, and immediately effective
- The 2:1 human-to-robot exchange rate is an actionable finding: labs can supplement expensive robot demos with cheaper human captures to reduce data costs
- 3D perception matters for dexterous manipulation — replacing 2D encoders with 3D pointcloud encoders is not just a nice-to-have but essential for tasks requiring precise contact reasoning
- Pretraining on diverse hands enables generalization — the performance gap between UniDex-VLA and UniDex-VLA (No Pretrain) is large, especially on the hardest tasks (Cut Bags: 32.5% → 90.0%)
