[Paper Notes] VITRA: Scalable VLA Pretraining from Real-Life Human Videos
TL;DR
VITRA explores a simple but powerful idea: use ordinary egocentric human activity videos as scalable pretraining data for dexterous Vision-Language-Action models. The authors treat the human hand as a dexterous end-effector, reconstruct 3D hand motion with MANO-based labels, segment long videos into atomic hand actions, caption each segment with language, and pretrain a PaliGemma-based VLA with a diffusion action head. Robot data is then used mainly for adaptation: after human-video pretraining, the model is fine-tuned on a much smaller set of real robot trajectories, with the robot hand mapped into the human/MANO action space.
Paper Info
The paper is “Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos” by Qixiu Li, Yu Deng, Yaobo Liang, Lin Luo, Lei Zhou, Chengtang Yao, Lingqi Zeng, Zhiyuan Feng, Huizhi Liang, Sicheng Xu, Yizhong Zhang, Xi Chen, Hao Chen, Lily Sun, Dong Chen, Jiaolong Yang, and Baining Guo from Tsinghua University and Microsoft Research Asia. It is listed on the project page as ICRA 2026. The paper, project, code, data, and models are linked from microsoft.github.io/VITRA, with the arXiv PDF at 2510.21571.
The Core Problem
Modern VLA models need action data, but robot action data is expensive, slow to collect, and often narrow in objects, scenes, and skills. This is even more severe for dexterous hands, where large-scale robot datasets are scarce. In contrast, the web and existing egocentric video datasets contain many real human manipulation behaviors, but they are unsegmented, uncalibrated, noisy, and missing action labels.
VITRA asks whether these raw human videos can be converted into the same kind of supervision used by robotic VLA models: image, language instruction, state, and future action chunks. The authors' answer is yes, via a pipeline that aligns human video data to robot VLA data at two levels: task granularity and action labels.
Turning Raw Human Videos into VLA Data
The data construction pipeline has three stages.
First, VITRA estimates 3D motion. It detects whether the camera is static or moving, estimates intrinsics, undistorts fisheye or wide-angle videos into a pinhole model, reconstructs per-frame 3D hand motion with HaWoR, and represents hand pose with the MANO parametric hand model. Camera poses are estimated with a modified MegaSAM pipeline using MoGe-2 depth priors, allowing the system to transform camera-space hand motion into world-space trajectories.
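The final step of this stage, moving camera-space hand motion into world space, amounts to applying each frame's camera pose to the reconstructed wrist position. A minimal sketch of that transform, assuming the pose is available as a camera-to-world rotation matrix and translation vector; the function names and the pose convention are illustrative assumptions, not the VITRA pipeline's API:

```python
from typing import List, Sequence

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def camera_to_world(p_cam: Vec3, R_c2w: Mat3, t_c2w: Vec3) -> List[float]:
    """Apply p_world = R_c2w @ p_cam + t_c2w to a single 3D point.

    Assumes (hypothetically) that the pose is stored as camera-to-world.
    """
    return [
        sum(R_c2w[i][j] * p_cam[j] for j in range(3)) + t_c2w[i]
        for i in range(3)
    ]


def wrist_track_to_world(
    wrist_cam: Sequence[Vec3],
    rotations: Sequence[Mat3],
    translations: Sequence[Vec3],
) -> List[List[float]]:
    """Transform a per-frame wrist track using one camera pose per frame."""
    return [
        camera_to_world(p, R, t)
        for p, R, t in zip(wrist_cam, rotations, translations)
    ]
```

With a moving head-mounted camera, a hand that is stationary in the world still appears to move in camera space, which is presumably why the later stages work with world-space trajectories.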
Second, it segments long videos into atomic actions. Instead of using fixed time windows or existing dataset annotations, VITRA cuts at local minima of wrist speed in world space. This is a nicely pragmatic idea: human hand motions often slow down at action boundaries, and because the speed is computed from reconstructed 3D wrist trajectories, the segmentation is tied to action dynamics rather than only image appearance.
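The cutting rule can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: the finite-difference speed estimate, the strict/non-strict minimum test, and the `min_len` guard against very short segments are all assumptions made here, and any smoothing the real pipeline may apply is omitted.

```python
import math
from typing import List, Sequence, Tuple


def wrist_speeds(positions: Sequence[Sequence[float]], fps: float) -> List[float]:
    """Finite-difference speed between consecutive world-space wrist positions."""
    return [math.dist(a, b) * fps for a, b in zip(positions, positions[1:])]


def local_minima(values: Sequence[float]) -> List[int]:
    """Indices whose value is below the left neighbour and not above the right one."""
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i] < values[i - 1] and values[i] <= values[i + 1]
    ]


def segment_by_speed_minima(
    positions: Sequence[Sequence[float]], fps: float, min_len: int = 2
) -> List[Tuple[int, int]]:
    """Split frames into [start, end) segments, cutting at wrist-speed minima.

    min_len (hypothetical parameter) skips cuts that would create a segment
    shorter than min_len frames.
    """
    speeds = wrist_speeds(positions, fps)
    cuts = [0]
    for i in local_minima(speeds):
        if i - cuts[-1] >= min_len:
            cuts.append(i)
    n = len(positions)
    if len(cuts) > 1 and n - cuts[-1] < min_len:
        cuts.pop()  # merge a too-short tail into the previous segment
    return list(zip(cuts, cuts[1:] + [n]))


if __name__ == "__main__":
    # Toy 1D track: the wrist nearly stops between frames 3 and 4.
    xs = [0.0, 1.0, 3.0, 4.0, 4.1, 5.0, 7.0, 8.0]
    track = [[x, 0.0, 0.0] for x in xs]
    print(segment_by_speed_minima(track, fps=1.0))  # [(0, 3), (3, 8)]
```

Because the speed is computed from 3D positions rather than pixels, camera motion alone does not produce cuts, provided the world-space reconstruction is accurate.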
Third, it labels each action with language. For every segment, the pipeline samples frames, overlays projected 3D hand trajectories, and asks GPT-4.1 to describe the specified hand action in imperative form. The trajectory overlay is important because it gives the captioning model temporal and geometric hints about what the hand is actually doing.
The resulting dataset contains about 1M atomic VLA episodes and 26M frames, sourced from Ego4D, EPIC-KITCHENS, EgoExo4D, and Something-Something-V2. Importantly, the original human annotations from those datasets are not used, because they do not match the desired robotic action granularity.
Model and Action Space
The VLA model uses PaliGemma-2 3B as the vision-language backbone and a Diffusion Transformer action expert for action prediction. The VLM receives the image, language instruction, a camera FoV token, and a learnable cognition token. The cognition token becomes the conditioning feature for the diffusion action head.
The model predicts a chunk of future hand actions. In the paper, the hand action is:
\[a_t = [\Delta t_l, \Delta r_l, \theta_h^l, \Delta t_r, \Delta r_r, \theta_h^r] \in \mathbb{R}^{102}\]
Here, each hand has 51 dimensions: 3D relative wrist translation, 3D relative wrist rotation, and 45 MANO finger pose angles (15 joints times 3 Euler angles). The released code pads this into a unified 192-D action space, where the left hand occupies 0:51, the right hand occupies 51:102, and 102:192 is currently padding. The state is padded to 212-D. In the code, the active state mostly uses the same left/right hand kinematic slots, while the final 20 dimensions are reserved for MANO beta shape parameters but are not currently used.
This padding is best understood as a fixed interface with masks. The diffusion head always sees the same action dimensionality, but the action mask says which hand and which dimensions are valid. That makes single-hand, dual-hand, human, and robot data easier to route through one model.
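To make the fixed-interface idea concrete, here is a small sketch of padding a one- or two-hand action into the 192-D layout with a validity mask. The slot boundaries (0:51, 51:102) and sizes come from the notes above; the function names and the masked-error helper are hypothetical, and how the released code actually applies the mask inside the diffusion loss is not shown here.

```python
from typing import List, Optional, Sequence, Tuple

HAND_DIM = 51      # 3 wrist translation + 3 wrist rotation + 45 finger angles
ACTION_DIM = 192   # unified padded action size
LEFT = slice(0, 51)
RIGHT = slice(51, 102)


def pad_action(
    left: Optional[Sequence[float]], right: Optional[Sequence[float]]
) -> Tuple[List[float], List[bool]]:
    """Place available hands into fixed slots; mask marks the valid dimensions."""
    action = [0.0] * ACTION_DIM
    mask = [False] * ACTION_DIM
    for hand, slot in ((left, LEFT), (right, RIGHT)):
        if hand is None:
            continue
        if len(hand) != HAND_DIM:
            raise ValueError(f"expected {HAND_DIM} dims per hand, got {len(hand)}")
        action[slot] = list(hand)
        mask[slot] = [True] * HAND_DIM
    return action, mask


def masked_mse(
    pred: Sequence[float], target: Sequence[float], mask: Sequence[bool]
) -> float:
    """Mean squared error over valid dimensions only (illustrative loss)."""
    errs = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(errs) / len(errs) if errs else 0.0
```

A right-hand-only clip would call `pad_action(None, right_hand)`, leaving the left slot and the 102:192 padding masked out so they contribute nothing to the error.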
Robot Fine-Tuning and XHand Mapping
My reading of the training flow is that VITRA first learns broad manipulation priors from human data, then uses robot data for deployment adaptation. Robot data is not an afterthought, but it is not the main source of scale either; it comes in after pretraining to adapt the policy to the real robot's embodiment and execution.
For the real robot experiments, the paper uses a Realman arm with a 12-DoF XHand and a RealSense head camera. The robot data is aligned to the human-hand action space: camera-space end-effector pose gives the 6D wrist action, while robot hand joints are mapped to the closest MANO/human-hand joint dimensions.
The released code makes this concrete. XHand raw state/action is 36-D: 18 dimensions per hand, consisting of 6D wrist pose plus 12 hand joints. The function transfer_xhand_to_human inserts these XHand dimensions into selected channels of the human/MANO-style action space. During inference, transfer_human_to_xhand extracts the same mapped channels back into XHand commands. So the mapping is not an IK solver or full pose retargeting method; it is a hard-coded sparse index/sign correspondence between selected XHand joints and selected MANO action dimensions.
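The shape of such a sparse index/sign correspondence can be sketched for a single hand as follows. The dimensions (18-D XHand, 51-D human hand) follow the notes above, but the index table `JOINT_MAP` and the `*_sketch` function names are invented for illustration; the real table and the real `transfer_xhand_to_human` / `transfer_human_to_xhand` functions live in the released repo and may differ in layout and detail.

```python
from typing import Dict, List, Sequence, Tuple

HUMAN_HAND_DIM = 51   # 6D wrist + 45 MANO finger angles
XHAND_HAND_DIM = 18   # 6D wrist + 12 hand joints
WRIST_DIM = 6

# HYPOTHETICAL table: XHand joint index -> (MANO finger channel, sign).
# These indices and signs are placeholders, not the values used in the repo.
JOINT_MAP: Dict[int, Tuple[int, float]] = {
    j: (3 * j + 2, -1.0 if j % 2 else 1.0) for j in range(12)
}


def xhand_to_human_sketch(xhand: Sequence[float]) -> List[float]:
    """Scatter 18-D XHand values into a 51-D human-hand vector (rest stay 0)."""
    human = [0.0] * HUMAN_HAND_DIM
    human[:WRIST_DIM] = xhand[:WRIST_DIM]
    for joint, (channel, sign) in JOINT_MAP.items():
        human[WRIST_DIM + channel] = sign * xhand[WRIST_DIM + joint]
    return human


def human_to_xhand_sketch(human: Sequence[float]) -> List[float]:
    """Gather the mapped channels back into an 18-D XHand command."""
    xhand = [0.0] * XHAND_HAND_DIM
    xhand[:WRIST_DIM] = human[:WRIST_DIM]
    for joint, (channel, sign) in JOINT_MAP.items():
        # A sign of +1 or -1 is its own inverse, so the same table works both ways.
        xhand[WRIST_DIM + joint] = sign * human[WRIST_DIM + channel]
    return xhand
```

Under this kind of mapping, only 12 of the 45 finger channels carry robot supervision, and any value the model predicts in the other channels is simply ignored at inference time.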
This also explains the relation between human pretraining and robot inference. Human pretraining supervises MANO-style finger action. Robot fine-tuning teaches the model which subset of those human-hand channels matter for XHand, and inference reads only those channels back out for the robot hand.
Results
On unseen human hand action prediction, VITRA outperforms baselines trained on lab data or on the original human annotations, as well as a concurrent hand-VLA baseline. The ablations are intuitive: trajectory-aware augmentation helps, causal action denoising helps, wrist-speed segmentation beats fixed-interval segmentation, and trajectory overlays improve GPT captioning.
The real-robot results are the most important part. After fine-tuning on 1.2K teleoperated robot trajectories, VITRA reaches 71.0% average success on seen tasks and 64.6% on unseen object/background/category settings, substantially higher than the compared baselines including VPP, π0, no VLA pretraining, latent-action pretraining, and OXE pretraining. The paper also shows a positive scaling trend: larger and more diverse human-video pretraining improves both human hand prediction and downstream robot success.
Codebase Notes
The released repo is useful for understanding the data representation and model interface, but it does not appear to include the complete production-scale raw-video-to-VITRA-1M processing pipeline. It includes dataset documentation, undistortion scripts, metadata formats, loaders, a hand reconstruction wrapper that calls MoGe/HaWoR/MANO, model code, and the robot XHand alignment functions.
For the MANO part, the released metadata stores hand_pose as (T, 15, 3, 3) local MANO joint rotations, plus wrist orientation, translation, MANO beta, and 21 hand joints. In the training code, the default action type is angle-based, so the finger action is MANO joint pose, not raw robot joint pose. XHand enters later through the sparse mapping described above.
Takeaway
VITRA is compelling because it treats human video not as a vague source of visual representation, but as explicit action supervision. The key move is to force alignment: segment human videos into robot-like atomic tasks, reconstruct MANO-based 3D hand actions, caption them in robot-instruction style, and train a VLA action head in a unified action space. The approach is imperfect because monocular hand and camera reconstruction is noisy, and the released repo does not expose every dataset-construction component. But the direction feels important: scalable human activity video can become a serious pretraining source for dexterous robot manipulation, with small robot datasets used to bridge embodiment differences.
