[Paper Notes] EgoInfinity: A Web-Scale 4D Hand-Object Interaction Data Engine
Published:
TL;DR
EgoInfinity is best read as a data engine: a system for turning arbitrary in-the-wild RGB manipulation videos into robot-usable 4D hand-object interaction data. The output includes metric hand trajectories, object point clouds and meshes, 6-DoF object poses, contact-relevant interaction states, and robot-specific retargeted trajectories.
The central idea is to make internet video actionable. EgoInfinity starts from web-scale Action100M-style clips, filters for manipulation, reconstructs hands and objects in a shared metric 3D frame, repairs hand-object drift with interaction-aware rules, then compiles the recovered hand motion into executable robot trajectories through a learned root-frame estimator plus IK. The most useful takeaway is the representation boundary:
\[ H_t=\{M_t^h,K_t^h,{}^c p_t^h,P_t^o,M^o,{}^c p_t^o\} \]
This state is agent-agnostic: it describes the human hand, object geometry, and object pose in metric 4D space before choosing a target robot embodiment. That is the reason the same recovered interaction can be retargeted to Unitree G1, Robonaut2, dual Franka arms, and a LEAP hand policy setting.
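To make the boundary concrete, here is a minimal sketch of how one frame of \(H_t\) could be held in code. The class name, field names, and the reading of each symbol (hand mesh, hand keypoints, hand pose, object point cloud, object mesh, object pose) are my own assumptions for illustration, not the layout of the released dataset.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]
Pose = Tuple[Mat3, Vec3]  # (rotation, translation), expressed in the camera frame


@dataclass
class HOIState:
    """One frame of the agent-agnostic state H_t (hypothetical field names)."""
    hand_mesh: List[Vec3]       # M_t^h: hand mesh vertices in metres
    hand_keypoints: List[Vec3]  # K_t^h: hand keypoints in metres
    hand_pose: Pose             # ^c p_t^h: hand pose in the camera frame
    object_points: List[Vec3]   # P_t^o: object point cloud at time t
    object_mesh: List[Vec3]     # M^o: object mesh vertices (no time index)
    object_pose: Pose           # ^c p_t^o: 6-DoF object pose in the camera frame


IDENTITY: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))

# Toy frame with a single vertex per geometry field.
frame = HOIState(
    hand_mesh=[(0.0, 0.0, 0.5)],
    hand_keypoints=[(0.0, 0.0, 0.5)],
    hand_pose=(IDENTITY, (0.0, 0.0, 0.5)),
    object_points=[(0.1, 0.0, 0.5)],
    object_mesh=[(0.0, 0.0, 0.0)],
    object_pose=(IDENTITY, (0.1, 0.0, 0.5)),
)
```

Note that nothing in the record mentions a robot: joint angles, root frames, and gripper commands only appear after retargeting.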
Paper and Resources
The paper is “EgoInfinity: A Web-Scale 4D Hand-Object Interaction Data Engine for Any-View Robot Retargeting and Video-to-Action Robot Learning” by Gaotian Wang, Kejia Ren, Andrew Morgan, Yiting Chen, Howard H. Qian, Podshara Chanrungmaneekul, and Kaiyu Hang. It is available as arXiv:2606.17385. The project page is the EgoInfinity Hugging Face Space, and the public dataset entry is Rice-RobotPI-Lab/egoinfinity.
The arXiv paper reports the data engine as corpus-agnostic, with the current scale bounded by Action100M: 14.6 years of footage, 147M action segments, and a table-level comparison listing 127K hours. The released browser/dataset is a curated, inspectable subset: the paper reports 106 processed manipulation videos, 277 objects, 12,107 frames, and median 5.8 s clip duration; the Hugging Face dataset page currently exposes 1,150 rows and about 5.49 GB.
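As a quick sanity check, the two footage figures describe the same quantity in different units. Assuming a 365.25-day year:

\[ 14.6\ \text{years}\times 365.25\ \tfrac{\text{days}}{\text{year}}\times 24\ \tfrac{\text{hours}}{\text{day}}\approx 1.28\times 10^{5}\ \text{hours} \]

That is in line with the roughly 127K hours listed in the comparison table once rounding of the 14.6-year figure is taken into account.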
The Problem: Video Is Huge, Robot Data Is Actionable
Egocentric and internet videos contain enormous manipulation diversity, but most of that signal is trapped in pixels. A robot needs metric geometry, object state, contact timing, feasible motion, and embodiment-specific commands. Existing robot datasets provide actions but are expensive and hardware-bound; existing human-video datasets scale better but usually stop at observation, narration, pose, or weak action labels.
EgoInfinity tries to close that gap by turning RGB videos into an intermediate form that is closer to robot action but still independent of any one robot. This is the right design pressure. If the engine produced only 2D tracks, downstream policies would still need to infer geometry. If it produced only robot actions, the output would be tied to one embodiment. The paper instead builds a metric 4D HOI representation and adds retargeting as a compilation step.
Data Engine Pipeline
The pipeline has two passes. The first pass is cheap: scan videos, identify hand-present segments, filter by hand-motion statistics and camera-motion cues, and retain clips likely to contain useful manipulation. The second pass runs the full reconstruction stack only on the active segments. The current paper focuses on approximately static-view videos, which is a practical assumption for many tutorial and how-to clips.
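The first pass can be pictured as a cheap segment filter. The sketch below assumes per-frame wrist detections in pixels and a per-frame camera-motion score are already available; the function names and thresholds are hypothetical, and the paper's actual statistics may differ.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def hand_segments(wrists: List[Optional[Point]], min_len: int) -> List[Tuple[int, int]]:
    """Return [start, end) index ranges where a hand is detected in every frame."""
    segments, start = [], None
    for i, w in enumerate(wrists + [None]):  # the sentinel closes a trailing run
        if w is not None and start is None:
            start = i
        elif w is None and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    return segments


def keep_segment(wrists: List[Optional[Point]], cam_motion: List[float],
                 seg: Tuple[int, int],
                 min_hand_px: float = 20.0, max_cam_px: float = 2.0) -> bool:
    """Keep a segment if the hand travels enough while the camera stays roughly still."""
    s, e = seg
    travel = sum(
        ((wrists[i][0] - wrists[i - 1][0]) ** 2
         + (wrists[i][1] - wrists[i - 1][1]) ** 2) ** 0.5
        for i in range(s + 1, e)
    )
    mean_cam = sum(cam_motion[s:e]) / (e - s)
    return travel >= min_hand_px and mean_cam <= max_cam_px


# Toy clip: a moving hand (frames 1-4), then a stationary hand (frames 6-7).
wrists = [None, (0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (30.0, 0.0),
          None, (5.0, 5.0), (5.0, 5.0)]
cam_motion = [0.5] * len(wrists)

segments = hand_segments(wrists, min_len=2)
active = [seg for seg in segments if keep_segment(wrists, cam_motion, seg)]
```

Only the `active` segments would be handed to the expensive reconstruction stack in the second pass.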
The full stack has four conceptual blocks:
- Metric geometry and hand tracking. MoGe-2 estimates focal length and global metric scale; Flow3R predicts dense depth; GeoCalib estimates gravity. WiLoR reconstructs MANO hand parameters and hand meshes, with infilling and smoothing to handle missing frames.
- Object discovery and reconstruction. Text annotations from Action100M are used to extract object prompts. SAM-3 detects target regions, SAM-2 tracks masks through time, SAM-3D reconstructs object meshes, and FoundationPose++ estimates 6-DoF object pose trajectories.
- Interaction-aware refinement. MEMFOF optical flow, hand keypoints, object masks, and point clouds are used to classify object state. The main text compresses this to \(s_t\in\{\mathrm{static},\mathrm{grasped},\mathrm{moving}\}\), while the appendix uses finer labels such as left-hand grasp, right-hand grasp, and bimanual grasp.
- Coordinate cleanup and exo-to-ego reframing. The system can erode masks, filter depth boundaries, remove outliers, and synthesize an egocentric view by rigidly re-rendering the recovered 3D scene from a hand-centered virtual camera.
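The exo-to-ego step in the last bullet is, at its core, a rigid change of frame followed by a projection. A minimal sketch, assuming a pinhole camera and a virtual camera pose given in the source camera frame (all names and intrinsics here are illustrative):

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


def to_virtual_camera(p: Vec3, R_cv: Mat3, t_cv: Vec3) -> Vec3:
    """Map a point from the source camera frame into the virtual camera frame.

    (R_cv, t_cv) is the virtual camera's pose in the source camera frame, so the
    inverse rigid transform R^T (p - t) is applied.
    """
    d = (p[0] - t_cv[0], p[1] - t_cv[1], p[2] - t_cv[2])
    # Multiplying by R^T is a dot product with each column of R.
    return tuple(sum(R_cv[r][c] * d[r] for r in range(3)) for c in range(3))


def project(p: Vec3, f: float, cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection; assumes the point is in front of the camera (z > 0)."""
    return (f * p[0] / p[2] + cx, f * p[1] / p[2] + cy)


IDENTITY: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))

# Virtual camera placed near the hand, looking along the original optical axis.
virtual_t = (0.1, 0.0, 0.2)
scene_point = (0.1, 0.0, 0.7)  # a recovered 3D point in the source camera frame

p_virtual = to_virtual_camera(scene_point, IDENTITY, virtual_t)
pixel = project(p_virtual, f=500.0, cx=320.0, cy=240.0)
```

Because the scene is already in one metric frame, the same transform applies to hand meshes, object meshes, and point clouds alike.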
The important implementation point is cross-module metric calibration. EgoInfinity turns a collection of foundation-model outputs into one shared metric camera-world frame: hand predictions, object predictions, dense depth, camera calibration, gravity, masks, and object priors are all aligned before contact reasoning. Without that alignment, hand-object contact reasoning would be dominated by scale mismatch and pose drift between modules.
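One common way to reconcile two depth sources is a robust scalar scale fit. The sketch below uses a median of per-pixel ratios; this is an assumption about how such an alignment could be done, not the paper's stated procedure, and the sample values are made up.

```python
from statistics import median
from typing import List


def fit_scale(reference_m: List[float], other: List[float]) -> float:
    """Robust scalar s such that s * other ~= reference_m (median of ratios)."""
    ratios = [r / o for r, o in zip(reference_m, other) if r > 0 and o > 0]
    if not ratios:
        raise ValueError("no valid depth pairs to align")
    return median(ratios)


reference_depth = [1.0, 2.0, 1.5, 3.0, 0.0]  # metres; 0.0 marks an invalid pixel
dense_depth = [0.5, 1.0, 0.75, 9.0, 0.4]     # another module's units, one outlier

scale = fit_scale(reference_depth, dense_depth)
aligned_depth = [scale * d for d in dense_depth]
```

The median keeps a single bad pixel from skewing the scale, which matters when depth is least reliable exactly at hand-object boundaries.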
Interaction-Aware Refinement
The refinement stage is the paper’s strongest systems idea. Pure visual object tracking is brittle under occlusion, especially when the hand covers the manipulated object. EgoInfinity uses interaction state to decide which geometric signal should drive object pose.
For each frame, it first forms an object proposal:
\[ \tilde p_t^o=(R_{\mathrm{cano}},\operatorname{center}(S_t^o\odot D_t)) \]
Here \(R_{\mathrm{cano}}\) comes from the canonical SAM-3D orientation, \(S_t^o\) is the object mask, and \(D_t\) is depth. Then the state determines the trusted pose source:
- If the object is static, lock it to a robust point-cloud centroid.
- If it is grasped, bind it rigidly to the hand frame with a canonical palm-aligned transform.
- If it is moving but not confidently grasped, keep the visual proposal and smooth it.
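The proposal and the three rules above can be sketched as follows. Several details are assumptions: `center` is taken to be the centroid of back-projected masked depth pixels, the smoothing for the moving case is a simple exponential filter, and every function and parameter name is hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]
Pose = Tuple[Mat3, Vec3]


def masked_center(mask: Sequence[Sequence[int]], depth: Sequence[Sequence[float]],
                  f: float, cx: float, cy: float) -> Vec3:
    """Centroid of masked depth pixels back-projected with a pinhole model."""
    pts: List[Vec3] = []
    for v, (mask_row, depth_row) in enumerate(zip(mask, depth)):
        for u, (m, z) in enumerate(zip(mask_row, depth_row)):
            if m and z > 0:
                pts.append(((u - cx) * z / f, (v - cy) * z / f, z))
    if not pts:
        raise ValueError("mask selects no valid depth pixels")
    return tuple(sum(p[i] for p in pts) / len(pts) for i in range(3))


def matvec(R: Mat3, v: Vec3) -> Vec3:
    return tuple(sum(R[r][c] * v[c] for c in range(3)) for r in range(3))


def matmul(A: Mat3, B: Mat3) -> Mat3:
    return tuple(tuple(sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3))
                 for r in range(3))


def trusted_pose(state: str, proposal: Pose, static_center: Vec3,
                 hand_pose: Pose, hand_to_object: Pose,
                 previous: Pose, alpha: float = 0.5) -> Pose:
    """Pick the pose source according to the interaction state."""
    R_cano, t_prop = proposal
    if state == "static":
        # Lock translation to a robust centroid computed over the static span.
        return (R_cano, static_center)
    if state == "grasped":
        # Rigidly bind the object to the hand: T_object = T_hand * T_hand_to_object.
        R_h, t_h = hand_pose
        R_ho, t_ho = hand_to_object
        moved = matvec(R_h, t_ho)
        return (matmul(R_h, R_ho), tuple(moved[i] + t_h[i] for i in range(3)))
    if state == "moving":
        # Keep the visual proposal, blended with the previous translation.
        t_prev = previous[1]
        blended = tuple(alpha * t_prop[i] + (1 - alpha) * t_prev[i] for i in range(3))
        return (R_cano, blended)
    raise ValueError(f"unknown state: {state}")


I3: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
center = masked_center([[1, 1], [1, 1]], [[1.0, 1.0], [1.0, 1.0]],
                       f=1.0, cx=0.5, cy=0.5)
proposal = (I3, center)
grasped = trusted_pose("grasped", proposal, center,
                       hand_pose=(I3, (0.2, 0.0, 0.6)),
                       hand_to_object=(I3, (0.0, 0.0, 0.05)),
                       previous=proposal)
```

In the grasped case the visual proposal is ignored entirely, which is the point: when the hand occludes the object, the hand pose is the more trustworthy signal.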
In the appendix, the state classifier is more concrete. A global-static gate catches fixtures whose mask centroid barely moves; a Schmitt trigger stabilizes per-frame motion detection; grasp signals combine 2D mask overlap, fingertip-to-cloud distance, and wrist-to-cloud distance; temporal smoothing fills short gaps and removes short false runs. This is a nice example of using simple geometry and temporal logic to make foundation-model outputs more physically usable.
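The temporal logic described here is simple enough to sketch directly. Below, a Schmitt trigger (two-threshold hysteresis) turns a noisy per-frame motion score into a stable flag, and a run-length pass fills short gaps and drops short false runs. Thresholds, window lengths, and function names are illustrative assumptions.

```python
from typing import List


def schmitt(values: List[float], low: float, high: float) -> List[bool]:
    """Hysteresis threshold: switch on above `high`, off only below `low`."""
    state, out = False, []
    for v in values:
        if not state and v > high:
            state = True
        elif state and v < low:
            state = False
        out.append(state)
    return out


def flip_short_runs(flags: List[bool], value: bool, max_len: int,
                    interior_only: bool) -> List[bool]:
    """Flip runs of `value` that are at most `max_len` frames long."""
    out, i, n = list(flags), 0, len(flags)
    while i < n:
        j = i
        while j < n and flags[j] == flags[i]:
            j += 1
        bounded = i > 0 and j < n  # run has neighbours on both sides
        if flags[i] == value and j - i <= max_len and (bounded or not interior_only):
            out[i:j] = [not value] * (j - i)
        i = j
    return out


def smooth(flags: List[bool], max_gap: int = 1, min_run: int = 2) -> List[bool]:
    """Fill gaps of up to `max_gap` frames, then drop runs shorter than `min_run`."""
    filled = flip_short_runs(flags, False, max_gap, interior_only=True)
    return flip_short_runs(filled, True, min_run - 1, interior_only=False)


# Per-frame object speed (arbitrary units) with jitter around the threshold.
speed = [0.1, 0.6, 0.4, 0.35, 0.2, 0.1, 0.7, 0.1, 0.1, 0.1]
raw = schmitt(speed, low=0.3, high=0.5)
moving = smooth(raw, max_gap=1, min_run=2)
```

A single threshold at 0.5 would flicker off at frame 2. The hysteresis keeps the object "moving" through frames 1 to 3, and the run-length pass then discards the isolated spike at frame 6.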
Functional Retargeting
The retargeting section is another key piece. Internet videos often show only hands, partial arms, or arbitrary viewpoints, so exact human body-pose imitation is a fragile target. EgoInfinity uses functional retargeting: preserve task-relevant hand motion and choose a feasible robot root frame, without requiring full human kinematic recovery.
Given recovered hand trajectories and optional gravity, the retargeter estimates a robot-specific kinematic root frame:
\[ {}^c p_r=({}^c R_r,{}^c t_r)\in SE(3) \]
The estimator is an SE(3)-equivariant Vector Neuron network trained entirely in MuJoCo simulation. It predicts plausible root frames from hand trajectories, using flow matching to represent ambiguity: many torso/root poses can explain the same observed hand path. At inference time, the system samples multiple root-frame hypotheses, clusters them, interpolates smooth root trajectories across windows, then scores candidates by IK convergence, residual tracking error, manipulability, joint-limit margin, and smoothness.
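The final selection step can be pictured as ranking root-frame hypotheses on the five criteria listed above. The paper does not say how the criteria are combined, so the weighted sum, the weights, and the field names below are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    name: str
    ik_converged: float    # fraction of frames where IK converged, in [0, 1]
    tracking_error: float  # mean residual wrist error in metres (lower is better)
    manipulability: float  # normalized manipulability measure (higher is better)
    limit_margin: float    # normalized distance to joint limits (higher is better)
    roughness: float       # normalized trajectory roughness (lower is better)


# Hypothetical weights; a real system would tune these per robot.
WEIGHTS: Dict[str, float] = {
    "ik": 1.0, "err": 5.0, "manip": 0.2, "margin": 0.2, "rough": 0.2,
}


def score(c: Candidate, w: Dict[str, float] = WEIGHTS) -> float:
    """Higher is better: reward feasibility, penalize error and roughness."""
    return (w["ik"] * c.ik_converged
            - w["err"] * c.tracking_error
            + w["manip"] * c.manipulability
            + w["margin"] * c.limit_margin
            - w["rough"] * c.roughness)


def best(candidates: List[Candidate]) -> Candidate:
    return max(candidates, key=score)


candidates = [
    Candidate("root_a", ik_converged=0.95, tracking_error=0.03,
              manipulability=0.5, limit_margin=0.4, roughness=0.1),
    Candidate("root_b", ik_converged=0.70, tracking_error=0.02,
              manipulability=0.9, limit_margin=0.6, roughness=0.1),
]
chosen = best(candidates)
```

Here `root_a` is chosen even though `root_b` tracks slightly better, because a root frame that leaves many frames without an IK solution is of little use downstream.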
This design makes the representation boundary clean:
RGB video
-> metric 4D hand-object state
-> robot-specific root-frame estimate
-> IK and smoothing
-> executable joint trajectory
Finger motion is handled separately for dexterous hands through robot-specific mapping from MANO keypoints. Arm joints follow wrist-level IK targets; finger joints use hand geometry.
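As one illustration of a keypoint-based finger mapping, the sketch below measures the bend at a finger joint from three keypoints and rescales it linearly into a robot joint range. The mapping, the ranges, and the names are hypothetical; the paper's robot-specific mappings may work differently.

```python
import math
from typing import Sequence, Tuple


def bend_angle(a: Sequence[float], b: Sequence[float], c: Sequence[float]) -> float:
    """Bend at joint b in radians: 0 when a, b, c are collinear (finger straight)."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - b[i] for i in range(3)]
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("degenerate bone of zero length")
    cos_angle = sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
    return math.acos(max(-1.0, min(1.0, cos_angle)))  # clamp against rounding


def to_robot_joint(human_angle: float, human_range: Tuple[float, float],
                   robot_range: Tuple[float, float]) -> float:
    """Linearly rescale a human joint angle into the robot joint's range, clamped."""
    h_lo, h_hi = human_range
    r_lo, r_hi = robot_range
    t = max(0.0, min(1.0, (human_angle - h_lo) / (h_hi - h_lo)))
    return r_lo + t * (r_hi - r_lo)


# Three keypoints along one finger (metres), bent 90 degrees at the middle joint.
mcp, pip, dip = (0.0, 0.0, 0.0), (0.04, 0.0, 0.0), (0.04, 0.03, 0.0)
human = bend_angle(mcp, pip, dip)
robot = to_robot_joint(human, human_range=(0.0, 1.8), robot_range=(0.0, 1.6))
```

The clamp keeps the command inside the robot's limits even when the tracked hand bends further than the assumed human range.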
Experiments
The curated Action100M subset contains 106 clips, 277 objects, and 12,107 frames. The paper reports that 88% of clips and 47% of objects involve manipulation, with a mix of static, moving, left-hand, right-hand, and bimanual grasp states. Object categories include containers, tools, food, hardware, appliances, textiles, electronics, decor, paper, and hygiene objects; top verbs include place, add, show, season, arrange, hold, pick, pour, present, remove, insert, and slice.
For cross-embodiment motion retargeting, the paper evaluates Unitree G1, Robonaut2, and dual Franka FR3:
| Robot | IK rate | Position error (cm) | Orientation error (deg) |
|---|---|---|---|
| Unitree G1 | 0.821 | 2.86 | 6.73 |
| Robonaut2 | 0.774 | 6.67 | 8.25 |
| Dual Franka FR3 | 0.706 | 10.27 | 12.17 |
The trend makes sense. Unitree G1 has stronger whole-body reachability for many human-like hand motions; dual Franka is constrained by a tabletop-style bimanual setup and larger morphology mismatch. The paper also shows real-robot executions for cutting, pouring, and wiping, plus a LEAP dexterous-hand grasping policy trained with EgoInfinity-extracted hand motions as priors.
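For readers who want to reproduce this kind of table on their own retargeted trajectories, one standard way to compute the two error columns is Euclidean distance for position and the geodesic angle between rotations for orientation. The paper's exact metric definitions are not restated here, so treat this as an assumed convention.

```python
import math
from typing import Sequence


def position_error_cm(target: Sequence[float], achieved: Sequence[float]) -> float:
    """Euclidean distance between two 3D points given in metres, returned in cm."""
    return 100.0 * math.dist(target, achieved)


def orientation_error_deg(R_a: Sequence[Sequence[float]],
                          R_b: Sequence[Sequence[float]]) -> float:
    """Geodesic angle between two rotation matrices, in degrees."""
    # trace(R_a^T R_b) equals the sum of elementwise products.
    trace = sum(R_a[r][c] * R_b[r][c] for r in range(3) for c in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp against rounding
    return math.degrees(math.acos(cos_angle))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
theta = math.radians(10.0)
rot_z_10 = [[math.cos(theta), -math.sin(theta), 0.0],
            [math.sin(theta), math.cos(theta), 0.0],
            [0.0, 0.0, 1.0]]

pos_err = position_error_cm((0.30, 0.0, 0.50), (0.30, 0.03, 0.54))
rot_err = orientation_error_deg(identity, rot_z_10)
```

In this toy case the wrist target is missed by about 5 cm and 10 degrees, roughly the regime of the Robonaut2 row.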
Why This Is Useful
For robot learning, EgoInfinity is useful because it shifts the bottleneck from collecting robot demonstrations to compiling human videos into a structured action prior. The output is still imperfect, but it is far richer than action labels or 2D tracks. It contains metric hand motion, object geometry, 6-DoF object motion, language labels inherited from video annotations, and interaction-state information.
For VLA or imitation-learning pipelines, I would read EgoInfinity as a source of pretraining or prior data, not as a final expert demonstration dataset. It can tell the model how hands approach objects, when grasp transitions happen, how objects move relative to hands, and how task-relevant motion unfolds. A robot still needs embodiment-specific control, tactile/force feedback, and real-world correction.
The paper also makes a broader methodological point: scaling robot data may require engines alongside datasets. A static release gets stale when perception models improve. A modular engine can swap in better hand pose, segmentation, depth, object reconstruction, SLAM, or retargeting modules while preserving the output contract.
Limitations
EgoInfinity currently assumes approximately static-camera videos. This helps avoid full online SLAM, but excludes a large fraction of body-mounted, handheld, and strongly moving-camera clips. The interaction-aware refinement improves physical plausibility, yet it does not guarantee contact-level accuracy: exact fingertip placement, no-slip behavior, force consistency, and tactile state remain outside the representation. The retargeter is robot-specific and may require retraining or calibration for new embodiments.
The engine is also limited by its perception components. If SAM-3 chooses the wrong object, SAM-3D returns a poor mesh, depth is unstable, or the hand tracker fails under occlusion, the downstream 4D state can still become unreliable. The good news is that the modular design makes these failures inspectable and, in principle, replaceable.
Takeaways
EgoInfinity is a useful blueprint for turning internet manipulation videos into robot-learning substrate. The core recipe is: recover metric hand-object state, use interaction state to repair visual tracking, keep the representation agent-agnostic, and compile it into robot actions only after choosing an embodiment. That separation is what makes the system interesting. It turns “watching humans manipulate objects” into a reusable intermediate representation for retargeting, policy priors, and future VLA-scale robot learning.
