[Paper Notes] Cross-Hand Latent Representation for Vision-Language-Action Models
TL;DR
XL-VLA tackles a real bottleneck in dexterous robot learning: every new robot hand comes with a different joint space, so standard VLA models do not scale cleanly across embodiments.
The paper’s solution is to learn a shared latent action space across multiple dexterous hands, then train a single VLA policy to predict latent action chunks instead of raw joint commands.
That simple abstraction works surprisingly well:
- cross-hand mean success improves from 0.32 with pi0 to 0.72 with XL-VLA
- on G1 cross-robot evaluation, mean success improves from 0.525 to 0.825
- the model also shows zero-shot transfer to unseen hand-task combinations
My short reading is that the paper’s main contribution is not just a better dexterous policy. It is a useful systems argument that for cross-embodiment dexterous manipulation, the action representation is the real bottleneck.
Paper Info
- Title: Cross-Hand Latent Representation for Vision-Language-Action Models
- Authors: Guangqi Jiang, Yutong Liang, Jianglong Ye, Jia-Yang Huang, Changwei Jing, Rocky Duan, Pieter Abbeel, Xiaolong Wang, Xueyan Zou
- Affiliations: UC San Diego, Amazon FAR, UC Berkeley
- Project page: xl-vla.github.io
- Paper type: dexterous manipulation / cross-embodiment learning / VLA
1. Problem Setting and Motivation
The paper starts from a clean observation: language has a fairly stable vocabulary, but robot actions do not.
This is especially painful for dexterous hands:
- different hands have different numbers of fingers
- different actuation structures
- different joint parameterizations
- new hardware keeps appearing quickly
That makes standard VLA training expensive and fragmented. Even if two hands should perform the same task, their raw action spaces are incompatible.
The paper asks two practical questions:
- how can we define a unified action representation across a family of dexterous hands?
- how can we integrate a new hand without retraining a full policy from scratch?
2. Core Idea
The central idea is to replace raw joint-space prediction with a shared latent action space.
For each hand h, the system learns:
- a hand-specific encoder E_h
- a hand-specific decoder D_h
These map between:
- the hand’s own joint-space action chunk
- a common latent action vector
The VLA policy itself is then hand-agnostic:
- it takes vision, language, and previous latent action tokens
- predicts the next latent action chunk
- the hand-specific decoder turns that latent back into joint commands
This gives a very clean separation:
- the VLA backbone does not need to understand every hand’s kinematics directly
- embodiment-specific details are pushed into encoders and decoders
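A minimal sketch of this interface, under assumptions: linear maps stand in for the paper's lightweight MLPs, and the latent dimension here is a placeholder (the paper's exact size is not quoted in these notes). Only the encode/decode contract matters.

```python
import numpy as np

CHUNK_LEN = 64    # action-chunk length, from the paper
LATENT_DIM = 32   # placeholder latent size (assumption)

class HandInterface:
    """Hand-specific encoder/decoder around the shared latent space.

    The VLA backbone only ever sees and predicts latent vectors;
    mapping back to this hand's joint commands is the interface's job.
    """
    def __init__(self, num_joints, rng):
        self.num_joints = num_joints
        self.W_enc = rng.normal(0, 0.01, (num_joints * CHUNK_LEN, LATENT_DIM))
        self.W_dec = rng.normal(0, 0.01, (LATENT_DIM, num_joints * CHUNK_LEN))

    def encode(self, chunk):
        # (CHUNK_LEN, num_joints) joint chunk -> shared latent vector
        return chunk.reshape(-1) @ self.W_enc

    def decode(self, z):
        # shared latent vector -> this hand's joint chunk
        return (z @ self.W_dec).reshape(CHUNK_LEN, self.num_joints)
```

The point of the sketch is the asymmetry: per-hand weights live only in `W_enc`/`W_dec`, while everything upstream of `z` is embodiment-agnostic.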
3. Method Breakdown
3.1 Action chunks instead of single-step actions
The model operates on action chunks rather than single actions.
Each chunk is:
- 64 joint-position commands
- sampled at 20 Hz
- corresponding to 3.2 seconds of motion
At time t, the policy receives:
- image observations
- language instruction
- short history of joint states
- previously executed action chunk
and predicts the next chunk.
This chunked setup is a reasonable fit for dexterous bimanual manipulation because many tasks require coordinated temporal structure, not just immediate motor responses.
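The chunked loop can be sketched as follows. `policy`, `decode`, `get_obs`, and `execute` are hypothetical callables standing in for the real system components; the sketch only shows the data flow described above.

```python
CHUNK_LEN = 64    # commands per chunk (paper)
CONTROL_HZ = 20   # command rate (paper)

def chunk_duration_s():
    # 64 commands at 20 Hz -> 3.2 seconds of motion per chunk
    return CHUNK_LEN / CONTROL_HZ

def rollout(policy, decode, get_obs, execute, num_chunks):
    """Chunked control loop: predict a latent chunk, decode it for this
    hand, execute it, and feed the latent back as history."""
    prev_latent = None
    for _ in range(num_chunks):
        obs = get_obs()                  # images, instruction, joint history
        z = policy(obs, prev_latent)     # next latent action chunk
        joint_chunk = decode(z)          # hand-specific decoder -> joints
        execute(joint_chunk)             # stream CHUNK_LEN commands at 20 Hz
        prev_latent = z
```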
3.2 Shared latent space via a multi-headed VAE-style autoencoder
The latent space is pretrained independently of the VLA backbone.
For each hand, the encoder produces a Gaussian posterior:
- mean mu^(h)
- variance sigma^(h)
and the decoder reconstructs that latent into the hand’s own joint configuration.
The paper uses lightweight MLPs for these hand-specific encoders and decoders, but all of them share the same latent distribution.
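The Gaussian-posterior part is standard VAE machinery; a sketch, assuming a log-variance parameterization (the paper may parameterize sigma differently):

```python
import numpy as np

def encode_posterior(x, W_mu, W_logvar):
    """Hand-specific encoder head: joint chunk -> Gaussian posterior
    parameters (mu^(h), log sigma^2(h)) over the shared latent."""
    return x @ W_mu, x @ W_logvar

def sample_latent(mu, logvar, rng):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    return mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)
```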
3.3 Three losses define the latent space
The latent space is shaped by three constraints:
- Reconstruction loss L1
- Retargeting loss L2
- Latent KL regularization L3
Reconstruction loss
This ensures each hand can autoencode its own joint configurations accurately.
Retargeting loss
This is the most interesting part. The paper uses differentiable forward kinematics to align fingertip geometry across hands.
Instead of matching raw joint angles across embodiments, it matches:
- pinch distances
- pinch directions
between corresponding fingers.
That is a good design choice because cross-hand equivalence is geometric and functional, not joint-wise.
Latent KL loss
The KL term encourages the shared latent space to follow a standard Gaussian prior, making it smoother and easier to interpolate.
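The three terms can be sketched as below. The fingertip positions would come from the paper's differentiable forward kinematics (not reproduced here), the exact pairing of fingers across hands is an assumption, and the paper may weight or normalize these terms differently.

```python
import numpy as np

def reconstruction_loss(chunk, chunk_hat):
    # L1: each hand must autoencode its own joint chunks
    return np.mean((chunk - chunk_hat) ** 2)

def retargeting_loss(tips_a, tips_b, pairs):
    # L2: match pinch distances and directions between corresponding
    # fingers; tips_* are (num_fingers, 3) fingertip positions from FK
    loss = 0.0
    for i, j in pairs:
        d_a = tips_a[i] - tips_a[j]
        d_b = tips_b[i] - tips_b[j]
        dist_term = (np.linalg.norm(d_a) - np.linalg.norm(d_b)) ** 2
        cos = d_a @ d_b / (np.linalg.norm(d_a) * np.linalg.norm(d_b) + 1e-8)
        loss += dist_term + (1.0 - cos)   # distance + direction mismatch
    return loss / len(pairs)

def kl_loss(mu, logvar):
    # L3: KL( N(mu, sigma^2) || N(0, I) ), pulling the shared latent
    # toward a standard Gaussian prior
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
```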
3.4 Training without paired cross-hand demonstrations
One of the strongest details in the paper is that the latent autoencoder is trained without paired cross-hand trajectories.
Instead, the method:
- samples random joint configurations within each hand’s limits
- encodes them into latent codes
- decodes those latents through all hand-specific decoders
- uses reconstruction and geometric retargeting losses to align the space
So the cross-hand alignment is effectively self-supervised through kinematics, not supervised through matched demonstrations.
That makes the approach much more scalable than collecting paired multi-hand teleoperation data just for latent alignment.
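The self-supervised alignment step above can be sketched as follows, under assumptions: `hands` maps hand names to objects with `sample_config`/`encode`/`decode`, `fk` stands in for differentiable forward kinematics, and the cross-hand term is simplified to fingertip-position matching (the paper matches pinch distances and directions).

```python
import numpy as np

def alignment_step(hands, fk, rng):
    """One alignment step: no paired cross-hand trajectories needed."""
    src = rng.choice(list(hands))
    q = hands[src].sample_config(rng)   # random config within joint limits
    z = hands[src].encode(q)            # into the shared latent
    losses = {}
    for name, hand in hands.items():
        q_hat = hand.decode(z)          # decode through every hand's decoder
        if name == src:
            # reconstruction on the source hand's own joints
            losses[name] = float(np.mean((q - q_hat) ** 2))
        else:
            # geometric retargeting: compare fingertip geometry, not joints
            losses[name] = float(np.mean((fk(src, q) - fk(name, q_hat)) ** 2))
    return losses
```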
4. Experimental Setup
The authors collect a real-world teleoperation dataset with:
- 4 dexterous hands: Ability, Inspire, X-Hand1, Paxini DexH13
- 10 tasks
- 50 demonstrations per task per hand
- 2000 demonstrations total
- about 2M state-action pairs
The tasks include:
- prepare fruits
- stack cans
- sort cans
- hand over bottle
- reorganize lemons
- pour sauce
- rearrange boxes
- push sugar
- pour sugar
- push cans
The hardware setup includes:
- a bimanual xArm7 platform
- a Unitree G1 humanoid
5. Main Results
5.1 Cross-hand VLA training
The core comparison is against pi0, trained under the same multi-hand multi-task setting.
Mean success across all tasks and hands:
- pi0: 0.32
- XL-VLA: 0.72
Per-hand means:
- Ability: 0.37 -> 0.73
- Inspire: 0.27 -> 0.68
- Paxini: 0.35 -> 0.78
- XHand: 0.29 -> 0.70
Those are large gains, especially because dexterous hands differ much more than ordinary grippers.
Task-wise, XL-VLA improves the mean success rate on every listed task category in Table 2, with especially large gains on:
- Hand over Bottle
- Sort Cans
- Re-arrange Boxes
- Pour Sugar
The broader point is clear: once the action representation is aligned, a single VLA backbone becomes much more reusable across embodiments.
5.2 Cross-robot scaling to G1
The paper also tests whether the latent action space helps when mixing data from:
- a tabletop xArm
- a humanoid G1
On four tasks, the reported G1 mean success improves from:
- pi0: 0.525
- XL-VLA: 0.825
This is a 57% relative improvement according to the paper.
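That figure is easy to check from the two means:

```python
rel = (0.825 - 0.525) / 0.525   # relative improvement over the pi0 baseline
print(round(rel * 100))          # -> 57
```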
That result matters because it suggests the latent space is not just smoothing over minor hand differences. It is helping across broader robot-system variation as well.
5.3 Zero-shot unseen-task transfer
The paper also evaluates zero-shot unseen-task generalization.
For each hand, some tasks are held out during training. The trained policy is then tested directly on those unseen hand-task combinations using the corresponding decoder.
The comparison baseline is pi0 + RT:
- train a policy on XHand
- retarget predicted trajectories to the other hands using kinematic retargeting
The reported result is qualitatively strong:
- XL-VLA consistently outperforms the retargeting baseline
- it never underperforms that baseline on any hand or task
- gains are especially clear on fine-grained dexterous tasks
This is exactly where a latent action space should help: it transfers functional control patterns, not only fingertip geometry after the fact.
6. Ablation Results
6.1 Latent replay vs LAD
The paper compares its latent space with LAD, a supervised latent retargeting method.
Replay mean success:
- LAD on Ability+Inspire: 0.60
- XL-VLA on Ability+Inspire: 0.82
- LAD on Paxini+XHand: 0.61
- XL-VLA on Paxini+XHand: 0.81
This is a strong result because XL-VLA’s latent alignment is unsupervised, yet it still outperforms the supervised alternative.
6.2 Loss design matters
The loss ablations show a clean pattern:
- removing L1 destroys reconstruction
- removing the distance part of L2 hurts cross-hand distance preservation
- removing the direction part of L2 hurts cross-hand directional consistency
- removing L2 entirely causes the largest cross-embodiment degradation
That supports the paper’s modeling decision: a shared latent action space only works if it is explicitly shaped by cross-hand geometry.
6.3 Latent size should not be too large
The architecture and latent-dimension ablations suggest that very large latent spaces are actually counterproductive.
My read is that this makes sense: once the latent becomes too expressive, it can start storing embodiment-specific shortcuts instead of discovering a compact shared action manifold.
7. Why This Paper Is Interesting
I think the paper contributes three useful ideas.
7.1 It identifies the right bottleneck
A lot of cross-embodiment work focuses on better policies, better visual backbones, or better retargeting pipelines. This paper argues that for dexterous VLA, the real bottleneck is often the action representation itself.
7.2 It keeps the VLA architecture mostly standard
XL-VLA does not require a radically new VLA design. It plugs a latent action interface into an existing backbone (pi0), which makes the proposal easier to adopt.
7.3 It is grounded in real hardware
The paper emphasizes real-world dexterous hands rather than only simulation. That matters because cross-embodiment dexterous transfer is easy to overclaim in simulation and much harder to validate on real hardware.
8. Limitations
The paper is strong, but a few limits are worth keeping in mind.
- The experiments are still within a relatively small family of dexterous hands rather than a truly open-ended hardware zoo.
- The latent space is based on hand-specific encoders and decoders, so adding a brand-new hand still requires building that interface.
- The zero-shot transfer results are compelling, but most are presented as plots rather than full numeric tables.
- The approach aligns fingertip geometry well, but richer contact dynamics and object-dependent force patterns may still remain embodiment-specific.
9. Takeaways
My main takeaway is:
XL-VLA shows that cross-embodiment dexterous VLA becomes much more practical once the policy predicts in a shared latent action space instead of raw joints.
More generally, the paper suggests a useful design principle:
- use a large VLA backbone for perception and instruction following
- hide embodiment-specific motor details behind a compact latent interface
- make that interface geometric and self-supervised rather than purely kinematic or manually engineered
If this line of work continues, I expect the most interesting next step will be extending this idea from cross-hand transfer to broader whole-body cross-embodiment manipulation.
