DexFormer

Cross-Embodied Dexterous Manipulation
via History-Conditioned Transformer

Ke Zhang¹*, Lixin Xu¹*, Chengyi Song¹, Junzhe Xu¹, Xiaoyi Lin², Zeyu Jiang¹, Renjing Xu¹†

¹The Hong Kong University of Science and Technology (Guangzhou), ²Wuhan University

*Equal contribution, †Corresponding author

300 randomized training embodiments
Zero-shot to LEAP / Allegro / RAPID
Real-world LEAP hand on Franka
A history-conditioned policy transfers dexterous grasping across diverse hand embodiments.

Dexterous manipulation policies are tied to embodiment-specific dynamics: every new hand typically requires retuning or retraining. DexFormer trains a single embodiment-agnostic transformer policy that conditions on recent observation history to implicitly infer the hand's morphology and produce embodiment-appropriate actions, without embodiment identifiers or per-hand output heads.

History-Conditioned

Temporal tokens let the transformer infer hand dynamics online, avoiding explicit morphology codes.

Shared Action Space

A canonical finger embedding aligns actuators across hands, with missing joints zero-padded.

Scaling Across Morphologies

Trained on 300 randomized embodiments, evaluated zero-shot on the standard LEAP, Allegro, and RAPID hands and 32 novel variants of each.

Method

Canonical shared action space
Embodiment masks

Canonical 20-D action space: anatomically equivalent joints map to fixed indices; lower-DoF hands zero-pad unused slots.

DexFormer unifies heterogeneous hands with a shared finger action space: joints with the same anatomical role (MCP abduction/flexion, PIP/DIP flexion, thumb CMC/IP) occupy fixed canonical indices. Each embodiment writes its native joint commands into those slots and zero-pads missing joints; masks ensure only valid dimensions are applied. This alignment lets one policy head emit a single 20-D finger command vector that works across LEAP, Allegro, RAPID, and their randomized variants.
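As a rough illustration of the canonical action space, the sketch below assumes a hypothetical layout of 5 fingers × 4 joint slots = 20 dimensions. The slot ordering, joint names, and the helpers `canonical_index`, `build_mask`, `to_canonical`, and `from_canonical` are illustrative assumptions, not taken from the DexFormer code. Each hand writes its native commands into fixed slots, missing slots stay zero, and the mask marks which slots that hand actually actuates.

```python
from typing import List, Sequence, Tuple

# Hypothetical canonical layout: 5 fingers x 4 joint slots = 20 dims.
# DexFormer's real slot ordering is not given on this page.
FINGERS = ("thumb", "index", "middle", "ring", "pinky")
SLOTS = ("abd", "flex1", "flex2", "flex3")  # e.g. abduction + three flexion joints
CANONICAL_DIM = len(FINGERS) * len(SLOTS)  # 20

Joint = Tuple[str, str]  # (finger, slot)


def canonical_index(joint: Joint) -> int:
    """Fixed index of an anatomical joint in the canonical action vector."""
    finger, slot = joint
    return FINGERS.index(finger) * len(SLOTS) + SLOTS.index(slot)


def build_mask(native_joints: Sequence[Joint]) -> List[float]:
    """1.0 for slots this hand actuates, 0.0 for zero-padded slots."""
    mask = [0.0] * CANONICAL_DIM
    for joint in native_joints:
        mask[canonical_index(joint)] = 1.0
    return mask


def to_canonical(native_cmd: Sequence[float], native_joints: Sequence[Joint]) -> List[float]:
    """Write native joint commands into their canonical slots; unused slots stay 0."""
    if len(native_cmd) != len(native_joints):
        raise ValueError("one command per native joint expected")
    vec = [0.0] * CANONICAL_DIM
    for value, joint in zip(native_cmd, native_joints):
        vec[canonical_index(joint)] = value
    return vec


def from_canonical(action: Sequence[float], native_joints: Sequence[Joint]) -> List[float]:
    """Read back only this hand's slots (the masked-in dims), in native order."""
    return [action[canonical_index(joint)] for joint in native_joints]


# Example: a hypothetical four-finger, 16-joint hand with no pinky.
four_finger = [(f, s) for f in FINGERS[:4] for s in SLOTS]
mask = build_mask(four_finger)
print(int(sum(mask)), mask[16:])  # 16 [0.0, 0.0, 0.0, 0.0]
```

Because every hand reads its commands out of the same 20-D vector, one policy head can serve hands with different joint counts; only the per-hand joint list (and hence the mask) changes.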

History-conditioned transformer
History-conditioned transformer with causal masking.

The policy consumes a fixed horizon of observation history and passes it through a causal transformer encoder. A final MLP layer outputs the shared 20-D action, which embodiment masks and smoothing then translate into joint targets.
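The history-and-masking step can be sketched without any deep-learning framework. Everything here is an assumption for illustration: the `ObservationHistory` helper, the choice to pad a fresh episode by repeating its first observation, and the exponential moving average `ema_smooth`, which stands in for the smoothing step the page does not specify. The causal mask uses the usual rule that token i may attend only to tokens j ≤ i.

```python
from collections import deque
from typing import Deque, List, Sequence


class ObservationHistory:
    """Fixed-horizon FIFO of observation vectors (hypothetical helper).

    On reset the buffer is filled with copies of the first observation so the
    encoder always sees exactly `horizon` tokens; this padding is an assumption.
    """

    def __init__(self, horizon: int) -> None:
        self.horizon = horizon
        self._buf: Deque[List[float]] = deque(maxlen=horizon)

    def reset(self, first_obs: Sequence[float]) -> None:
        self._buf.clear()
        for _ in range(self.horizon):
            self._buf.append(list(first_obs))

    def push(self, obs: Sequence[float]) -> List[List[float]]:
        """Append the newest observation (the oldest drops out); return the window, oldest first."""
        self._buf.append(list(obs))
        return list(self._buf)


def causal_mask(length: int) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j, i.e. j <= i."""
    return [[j <= i for j in range(length)] for i in range(length)]


def ema_smooth(prev: Sequence[float], target: Sequence[float], alpha: float) -> List[float]:
    """Hypothetical exponential smoothing of joint targets; alpha = 1.0 disables smoothing."""
    return [alpha * t + (1.0 - alpha) * p for p, t in zip(prev, target)]


hist = ObservationHistory(horizon=3)
hist.reset([0.0])
window = hist.push([1.0])
print(window)  # [[0.0], [0.0], [1.0]]
```

Note the polarity: here True means "allowed". PyTorch's `torch.nn.Transformer.generate_square_subsequent_mask` instead returns an additive float mask with `-inf` above the diagonal, which encodes the same causal structure.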

Training Setup

Training embodiments
Training set: 300 randomized embodiments.
Testing embodiments
Evaluation set: 96 randomized embodiments for zero-shot testing.
Distributed training
Distributed rollouts and synchronized updates across embodiment groups.
  • 300 procedurally randomized embodiments built from LEAP, Allegro, and RAPID canonical hands.
  • Training in Isaac Lab with automatic domain randomization (ADR) and 4096 parallel environments.
  • Embodiment groups are assigned per GPU; gradients are synchronized via Distributed Data Parallel (DDP) all-reduce.
  • The reward combines action-smoothness, approach, grasp-contact, contact-aware pose-tracking, and stability terms.
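Automatic domain randomization is commonly implemented by widening a parameter's sampling range when the policy performs well at its boundary and narrowing it when the policy struggles there. The sketch below follows that generic scheme for a single scalar parameter (for example, a hypothetical joint-damping scale). The `ADRRange` class, thresholds, step size, and buffer length are all assumptions, not DexFormer's Isaac Lab settings.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ADRRange:
    """One randomized scalar with an adaptive sampling range [lo, hi] (hypothetical)."""
    lo: float
    hi: float
    nominal: float       # ranges never shrink past the nominal value
    step: float          # how far a boundary moves per update
    limit_lo: float      # most extreme values ADR may reach
    limit_hi: float
    buffer_size: int = 20
    t_low: float = 0.3   # mean success at or below this narrows the boundary
    t_high: float = 0.7  # mean success at or above this widens the boundary
    _buf: Dict[str, List[float]] = field(
        default_factory=lambda: {"lo": [], "hi": []}, init=False, repr=False
    )

    def sample(self, rng: random.Random, boundary_prob: float = 0.5) -> Tuple[float, Optional[str]]:
        """Draw a value; sometimes pin it to a boundary so that boundary gets evaluated."""
        if rng.random() < boundary_prob:
            side = rng.choice(("lo", "hi"))
            return (self.lo if side == "lo" else self.hi), side
        return rng.uniform(self.lo, self.hi), None

    def report(self, side: Optional[str], success: float) -> None:
        """Record an episode outcome in [0, 1] for a pinned sample; maybe move that boundary."""
        if side is None:
            return  # interior samples do not drive updates
        buf = self._buf[side]
        buf.append(success)
        if len(buf) < self.buffer_size:
            return
        mean = sum(buf) / len(buf)
        buf.clear()
        if mean >= self.t_high:  # doing well: expand outward
            if side == "lo":
                self.lo = max(self.limit_lo, self.lo - self.step)
            else:
                self.hi = min(self.limit_hi, self.hi + self.step)
        elif mean <= self.t_low:  # struggling: contract toward nominal
            if side == "lo":
                self.lo = min(self.nominal, self.lo + self.step)
            else:
                self.hi = max(self.nominal, self.hi - self.step)


damping = ADRRange(lo=1.0, hi=1.0, nominal=1.0, step=0.05,
                   limit_lo=0.5, limit_hi=2.0, buffer_size=4)
for _ in range(4):
    damping.report("hi", 1.0)  # four successes at the upper boundary
print(round(damping.hi, 2))  # 1.05
```

Starting every range at the nominal value means the environment only becomes harder once the policy has earned it at the current boundary.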

Results

Plots: success rate, automatic domain randomization (ADR) rate, and mean reward terms.
LEAP canonical & variants evaluation
Allegro canonical & variants evaluation
RAPID canonical & variants evaluation
Hand      Setting        LSTM    GRU     Ours
LEAP      Canonical      66.81   58.91   83.25
LEAP      32 variants    66.72   57.38   86.84
Allegro   Canonical      65.38   25.81   74.19
Allegro   32 variants    62.44   24.97   71.94
RAPID     Canonical      46.72   45.06   71.69
RAPID     32 variants    44.22   53.59   77.09
Average   Combined       58.72   44.29   77.50
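The Average row can be reproduced from the six rows above: it equals the unweighted mean of the canonical and 32-variant results for LEAP, Allegro, and RAPID, rounded half-up to two decimals. This averaging rule is inferred from the numbers rather than stated on the page; the snippet checks it with exact decimal arithmetic to avoid floating-point rounding surprises.

```python
from decimal import Decimal, ROUND_HALF_UP

# Rows copied from the results table: (hand, setting, LSTM, GRU, Ours).
ROWS = [
    ("LEAP", "Canonical", "66.81", "58.91", "83.25"),
    ("LEAP", "32 variants", "66.72", "57.38", "86.84"),
    ("Allegro", "Canonical", "65.38", "25.81", "74.19"),
    ("Allegro", "32 variants", "62.44", "24.97", "71.94"),
    ("RAPID", "Canonical", "46.72", "45.06", "71.69"),
    ("RAPID", "32 variants", "44.22", "53.59", "77.09"),
]


def column_average(col: int) -> Decimal:
    """Unweighted mean of one method's column, rounded half-up to 2 decimals."""
    values = [Decimal(row[col]) for row in ROWS]
    mean = sum(values) / len(values)
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


for name, col in (("LSTM", 2), ("GRU", 3), ("Ours", 4)):
    print(name, column_average(col))
# LSTM 58.72
# GRU 44.29
# Ours 77.50
```

On this average, the history-conditioned transformer leads the LSTM baseline by 18.78 points and the GRU baseline by 33.21 points.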

Real-World Evaluation

  • Point clouds from two Intel RealSense D435 cameras are fused and aligned with ICP to match the simulated point cloud.
  • The Franka joint-space controller runs at 1000 Hz; the policy is queried at 10 Hz.
  • Videos show policy rollouts at original speed.
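With the controller at 1000 Hz and the policy at 10 Hz, each policy output has to cover 100 controller ticks. One common way to bridge the two rates, assumed here rather than taken from the page, is to ramp linearly from the previous joint target to the new one. `interpolate_targets`, `run`, and the `policy`/`send` callbacks are hypothetical stand-ins, not the libfranka API.

```python
from typing import Callable, List, Sequence

CONTROL_HZ = 1000  # joint-space controller rate (from the page)
POLICY_HZ = 10     # policy query rate (from the page)
TICKS_PER_STEP = CONTROL_HZ // POLICY_HZ  # 100


def interpolate_targets(prev: Sequence[float], new: Sequence[float],
                        ticks: int = TICKS_PER_STEP) -> List[List[float]]:
    """Linearly ramp from prev to new over `ticks` controller cycles (assumed scheme)."""
    return [[p + (n - p) * (k / ticks) for p, n in zip(prev, new)]
            for k in range(1, ticks + 1)]


def run(policy: Callable[[int], List[float]], send: Callable[[List[float]], None],
        start: Sequence[float], policy_steps: int) -> None:
    """Query the policy once per step and stream interpolated setpoints to `send`."""
    current = list(start)
    for step in range(policy_steps):
        target = policy(step)
        for setpoint in interpolate_targets(current, target):
            send(setpoint)  # on hardware, one call per 1 ms control cycle
        current = target


sent: List[List[float]] = []
run(policy=lambda step: [float(step + 1)], send=sent.append, start=[0.0], policy_steps=2)
print(len(sent), sent[99], sent[-1])  # 200 [1.0] [2.0]
```

Interpolating rather than holding a step change keeps the 1000 Hz setpoints continuous, which avoids commanding a jump every 100 ms.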
Real-world object set for evaluation
Physical object set used in real-world evaluation: polyhedra, RealSense camera box, mug, plush toys, orange, and block.
Orange fruit.
Red mug.
Yellow block.
Dodecahedron (12-faced polyhedron).
Icosahedron (20-faced polyhedron).
Large hamster toy.
Package box placed on the left.
Package box placed in the middle.
Package box placed on the right.

Real-World Adaptation

LEAP hand variants with reduced DoFs
Real-world LEAP hand variants with progressively reduced degrees of freedom: (left) 1 DoF removed, (center) 2 DoFs removed, (right) 3 DoFs removed.
LEAP variant with 1 DoF removed.
LEAP variant with 2 DoFs removed.
LEAP variant with 3 DoFs removed.

Resources & BibTeX

Preprint PDF: dexformer.pdf

@article{dexformer2026,
  title={DexFormer: Cross-Embodied Dexterous Manipulation via History-Conditioned Transformer},
  author={Ke Zhang and Lixin Xu and Chengyi Song and Junzhe Xu and Xiaoyi Lin and Zeyu Jiang and Renjing Xu},
  journal={preprint},
  year={2026}
}

Questions? Drop a note in the repo or contact the paper authors.