[Paper Notes] HandX: Scaling Bimanual Motion and Interaction Generation
TL;DR
This paper introduces HandX, a unified foundation for bimanual hand motion generation that spans data, annotation, and evaluation. The authors consolidate existing datasets and collect new motion-capture data emphasizing contact-rich two-hand interactions, then propose a decoupled annotation pipeline that extracts kinematic features and uses LLM reasoning to produce fine-grained text descriptions. They benchmark both diffusion and autoregressive models across multiple scales and observe clear scaling trends: jointly increasing model capacity and data size consistently improves text-motion alignment and hand-contact quality.
Paper Info
The paper is “HandX: Scaling Bimanual Motion and Interaction Generation,” by Zimu Zhang, Yucheng Zhang, Xiyan Xu, Ziyin Wang, Sirui Xu, Kai Zhou, Bing Zhou, Chuan Guo, Jian Wang, Yu-Xiong Wang, and Liang-Yan Gui from University of Illinois Urbana-Champaign, Specs Inc., and Snap Inc. It appears at CVPR 2026. The project page is at handx-project.github.io and code is available at github.com/handx-project/HandX.
1. Problem and Motivation
Human motion synthesis has made impressive progress, but hand motion — especially bimanual interaction — remains underexplored. Most whole-body motion models treat hands as rigid end-effectors and miss the fine-grained cues that matter: finger articulation, contact timing, and inter-hand coordination. Meanwhile, hand-centric datasets tend to focus on object interaction, use coarse annotations, or lack high-fidelity bimanual sequences altogether.
The bottleneck is threefold. First, existing datasets either lack hand detail (Motion-X, InterAct) or are limited to object-centric settings with categorical labels (ARCTIC, H2O). Second, mismatched skeletons, frame rates, and annotation protocols across sources make it hard to unify data. Third, standard evaluation metrics like FID and R-Precision do not capture hand-specific qualities like contact fidelity or bimanual coordination.
HandX is designed to address all three. It provides a large-scale, contact-rich bimanual motion dataset with fine-grained multi-level text annotations, and introduces hand-focused metrics to evaluate generation quality.
2. Dataset
HandX is built in two steps.
Aggregating existing data. The authors consolidate multiple open-source datasets with bimanual motion (HOT3D, ARCTIC, GigaHands, H2O, HoloAssist), converting them to a unified skeletal representation and coordinate system. An intensity-aware filter removes static or near-static segments that would cause generative models to freeze.
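The paper does not spell out the filter's exact criterion, but the idea is simple enough to sketch: compute per-frame joint speed and keep only runs where the hands are actually moving. The thresholds and window length below are illustrative guesses, not the authors' values.

```python
import numpy as np

def filter_static_segments(joints, fps=30.0, speed_thresh=0.01, min_len=30):
    """Drop near-static spans from a motion clip.

    joints: (F, J, 3) array of joint positions in meters.
    A frame is 'active' if mean joint speed exceeds speed_thresh (m/s);
    contiguous active runs shorter than min_len frames are discarded.
    """
    vel = np.diff(joints, axis=0) * fps              # (F-1, J, 3) per-frame velocities
    speed = np.linalg.norm(vel, axis=-1).mean(-1)    # (F-1,) mean joint speed
    active = np.concatenate([[False], speed > speed_thresh])

    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    if start is not None and len(active) - start >= min_len:
        segments.append((start, len(active)))
    return segments
```

Segments below the intensity threshold are simply dropped, which is what prevents a generative model trained on the data from collapsing into frozen poses.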
Capturing new data. Using a 36-camera OptiTrack optical motion-capture system, they record dexterous two-hand interactions with 25 reflective markers per hand, capturing fine-grained articulation of the wrist, palm, fingers, and fingertips. The hand skeleton is reconstructed by estimating joint centers and enforcing anatomical constraints on bone lengths, with per-frame refinement for kinematic consistency.
The final dataset comprises 54.2 hours of high-quality bimanual motion, 5.9 million frames, and 490K text descriptions. Compared to prior datasets, HandX stands out for its contact richness, motion intensity, and fine-grained language annotations organized in a triplet structure (left hand, right hand, inter-hand relation).
3. Annotation Pipeline
Manually annotating this much bimanual motion is prohibitively expensive. The authors propose a two-stage automatic pipeline.
Stage 1: Kinematic Feature Extraction. They compute a set of kinematic descriptors at each frame — finger flexion, finger-palm distances, inter-hand spatial relationships — then segment the temporal evolution into discrete events (e.g., touch, slide, release). These events are organized into a structured JSON format that LLMs can readily parse.
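As a minimal sketch of Stage 1, the snippet below discretizes one kinematic feature (minimum inter-hand fingertip distance) into touch/release events and serializes them as JSON. The threshold and the event vocabulary are assumptions for illustration; the paper's pipeline extracts a richer feature set.

```python
import json
import numpy as np

def extract_contact_events(left_tips, right_tips, fps=30.0, touch_thresh=0.02):
    """Segment inter-hand fingertip distance into touch/release events.

    left_tips, right_tips: (F, 5, 3) fingertip positions per hand (meters).
    Returns a JSON string with one entry per contact-state change, mirroring
    the paper's idea of turning continuous kinematics into discrete events.
    """
    # Pairwise distances between every left/right fingertip, per frame.
    d = np.linalg.norm(left_tips[:, :, None] - right_tips[:, None, :], axis=-1)
    min_d = d.reshape(len(d), -1).min(axis=1)
    in_contact = min_d < touch_thresh

    events = []
    for f in range(1, len(in_contact)):
        if in_contact[f] != in_contact[f - 1]:
            events.append({
                "time_s": round(f / fps, 3),
                "event": "touch" if in_contact[f] else "release",
                "min_distance_m": round(float(min_d[f]), 4),
            })
    return json.dumps(events, indent=2)
```

The structured JSON output is what makes the second stage tractable: the LLM reasons over symbolic events rather than raw coordinates.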
Stage 2: LLM-based Description Generation. Given the JSON-formatted kinematic features, a carefully designed prompt guides the LLM to generate descriptions following three principles: (a) explicitly describe left hand, right hand, and their inter-hand relationships; (b) report critical motion events like contact, separation, and hyperextension; (c) incorporate temporal context to preserve the progression of events. The LLM generates five levels of detail, from concise summaries to comprehensive descriptions covering subtle changes and speed variations.
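A prompt embodying the three principles might be assembled as follows. The wording and the level mapping are illustrative; the paper's actual prompt is not reproduced here.

```python
def build_annotation_prompt(events_json, level=3):
    """Assemble an LLM annotation prompt from event JSON, following the three
    principles (per-hand + inter-hand description, critical events, temporal
    order). Wording and detail levels are hypothetical, not the paper's prompt.
    """
    detail = {
        1: "one concise sentence",
        3: "a short paragraph",
        5: "a comprehensive description covering subtle changes and speed variations",
    }[level]
    return (
        "You are annotating a bimanual hand motion sequence.\n"
        "Kinematic events (JSON):\n"
        f"{events_json}\n\n"
        f"Write {detail} that:\n"
        "1. explicitly describes the left hand, the right hand, and their "
        "inter-hand relationship;\n"
        "2. reports critical motion events such as contact, separation, and "
        "hyperextension;\n"
        "3. preserves the temporal order in which the events occur.\n"
    )
```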
This decoupled approach ensures that annotations are grounded in actual motion dynamics rather than hallucinated from visual appearance, and the multi-level structure supports both fine- and coarse-grained generation tasks.
4. Generation Models
The paper benchmarks two representative paradigms.
4.1 Diffusion Model
The motion representation concatenates 3D joint coordinates with a compact rotation scalar per joint (exploiting the limited rotational degrees of freedom of hand joints). An MLP-based encoder projects each frame into a \(D\)-dimensional embedding. The three text prompts (\(T_L\), \(T_R\), \(T_I\) for left, right, and interaction) are encoded separately by T5, each with a learnable CLS token to prevent left-right confusion. The text embeddings are cross-attended with the motion embeddings and fused through residual connections:
\[\tilde{z} = z'_t + \sum_{k \in \{L,R,I\}} \text{CrossAttention}(z'_t, \mathfrak{T}_k)\]
An MLP decoder maps the fused representation back to motion: \(\tilde{x} = G(\tilde{z}) \in \mathbb{R}^{F \times 2J \times 4}\).
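The fusion equation can be sketched numerically as single-head scaled dot-product attention over the three text streams with a residual sum. The real model uses learned projections and multi-head attention; this is a minimal version showing the structure of the sum.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # q: (F, D) motion queries; k, v: (T, D) text keys/values.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def fuse_motion_with_texts(z, texts):
    """z_tilde = z' + sum over k in {L, R, I} of CrossAttn(z', T_k).

    z: (F, D) motion embeddings; texts: dict mapping 'L', 'R', 'I' to
    (T_k, D) text embeddings. Single-head, no learned weights -- a sketch
    of the residual fusion, not the trained layer.
    """
    out = z.copy()
    for emb in texts.values():
        out = out + cross_attention(z, emb, emb)
    return out
```

Note that each text stream attends independently and the results are summed into the residual path, which is what lets the model keep the left-hand, right-hand, and interaction prompts separable.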
A key design insight is that at inference time, the diffusion model supports diverse generation tasks through partial denoising — blending known constraints with the current sample at each denoising step. This enables motion in-betweening, keyframe control, wrist trajectory conditioning, hand-reaction synthesis, and long-horizon generation, all from a single model.
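A generic sketch of this constrained-sampling loop is below: after each denoising step, the known entries (keyframes, trajectories, one hand's motion) are overwritten with the ground truth noised to the current step, so only the free entries are actually generated. The noise schedule and the dummy `denoise_step` signature are assumptions; the paper's exact blending schedule may differ.

```python
import numpy as np

def partial_denoise(denoise_step, x_T, known, mask, alphas_bar, rng):
    """Constrained diffusion sampling (inpainting-style) sketch.

    denoise_step(x, t) -> sample at step t-1 (the model call, assumed given).
    x_T: initial noise; known: ground-truth motion for the constrained
    entries; mask: 1 where the motion is known, 0 where it is generated;
    alphas_bar: cumulative noise schedule, decreasing in t.
    """
    x = x_T
    for t in reversed(range(len(alphas_bar))):
        x = denoise_step(x, t)
        if t > 0:
            # Re-noise the known motion to the current step before blending.
            noise = rng.standard_normal(known.shape)
            known_t = (np.sqrt(alphas_bar[t - 1]) * known
                       + np.sqrt(1.0 - alphas_bar[t - 1]) * noise)
            x = mask * known_t + (1 - mask) * x
        else:
            x = mask * known + (1 - mask) * x
    return x
```

Swapping the mask is all it takes to switch tasks: a temporal-prefix mask gives in-betweening, a wrist-joint mask gives trajectory conditioning, and a one-hand mask gives hand-reaction synthesis.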
4.2 Autoregressive Model
The AR model uses Finite Scalar Quantization (FSQ) for tokenization, which offers better codebook utilization and scaling behavior than VQ-VAE, and adopts a local motion representation (wrist-relative positions and velocities) that further improves how evenly the codes are used. A text-prefix autoregressive model then predicts the next motion token conditioned on preceding tokens and the T5-encoded text prefix:
\[\mathcal{L} = -\sum_{k=1}^{n} \log p(\hat{y}^k \mid y^{<k}, \mathfrak{T})\]
The tokenizer uses 1D convolutional blocks with a temporal downsampling factor of 2, and the autoregressive model is explored with varying Transformer layers (8, 12, 16) and codebook sizes (512 to 4,096).
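FSQ itself is simple enough to sketch: each latent dimension is bounded with tanh and rounded to one of a fixed number of levels, so the implicit codebook (the product of the per-dimension level counts) can never collapse. The level choices below are illustrative; the straight-through gradient used during training is omitted.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite Scalar Quantization of a latent vector.

    z: (..., D) with D == len(levels). Each dim is squashed to a bounded
    range with tanh, rounded to one of levels[d] uniform values, and
    rescaled to [-1, 1]. No learned codebook exists, so utilization
    cannot collapse -- the property the paper exploits over VQ-VAE.
    """
    levels = np.asarray(levels)
    half = (levels - 1) / 2.0
    bounded = np.tanh(z) * half           # each dim in [-half_d, half_d]
    quantized = np.round(bounded) / half  # snap to grid, rescale to [-1, 1]
    return quantized

def fsq_codebook_size(levels):
    """Implicit codebook size: the product of per-dimension level counts."""
    return int(np.prod(levels))
```

For example, three dimensions with 8 levels each give the 512-entry codebook at the small end of the paper's sweep.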
5. Metrics
Beyond standard FID, Diversity, R-Precision, and MM Distance, the paper introduces contact-focused metrics: contact precision (\(C_\text{prec}\)), recall (\(C_\text{rec}\)), and F1 (\(C_\text{F1}\)). These evaluate whether the generated sequence reproduces contact events at the corresponding frames in the ground truth, with a 2 cm contact threshold.
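A sketch of how these contact metrics can be computed: derive a per-frame contact flag (any inter-hand joint pair within the 2 cm threshold) for both the generated and ground-truth sequences, then score the flags as a binary classification problem. The exact frame-matching protocol here is an assumption.

```python
import numpy as np

def contact_labels(left, right, thresh=0.02):
    """Per-frame contact flag: True if any left/right joint pair is
    closer than thresh (2 cm). left, right: (F, J, 3) in meters."""
    d = np.linalg.norm(left[:, :, None] - right[:, None, :], axis=-1)
    return d.reshape(len(d), -1).min(axis=1) < thresh

def contact_prf(pred_contact, gt_contact):
    """C_prec, C_rec, C_F1 over aligned frame-level contact flags."""
    tp = np.sum(pred_contact & gt_contact)
    prec = tp / max(pred_contact.sum(), 1)
    rec = tp / max(gt_contact.sum(), 1)
    f1 = 2 * prec * rec / max(prec + rec, 1e-8)
    return prec, rec, f1
```

Precision penalizes spurious contacts the text never asked for; recall penalizes missed ones, which is the typical failure of models that keep the hands safely apart.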
This is a meaningful addition. Standard metrics can look good even when contact timing and inter-hand coordination are poor, which is exactly the failure mode you care about in bimanual generation.
6. Experiments and Main Results
Scaling Trends
Both model families show clear positive scaling trends. For diffusion models, scaling either model depth or training data consistently improves R-Precision and contact-related scores. The 12-layer model achieves the best overall performance, but further scaling to a 16-layer ultra-large variant (6.7× more parameters) causes performance to drop across all metrics — a clear saturation point.
For autoregressive models, increasing codebook size alone does not reliably help. Performance only improves when codebook size and model capacity are scaled jointly, suggesting finer discrete representations need sufficient autoregressive capacity to be useful.
Under a fixed 5% data budget, the authors observe an approximately log-linear relationship between Top-3 R-Precision and FLOPs, with a correlation coefficient of 0.96:
\[R_\text{prec} = 0.4391 \times \log_{10}(\text{FLOPs}) - 3.8707\]
Qualitative Results
The qualitative comparisons are telling. Models trained on the full dataset generate more expressive motion with better text alignment than those trained on 5% or 20% subsets. Larger models produce motion better aligned with text and exhibit improved bimanual contact. The generated sequences successfully capture complex contact events specified in the text prompt — finger-to-finger touches, temporal coordination, and hand-hand spatial relationships.
The framework also demonstrates versatile generation: text-to-motion, motion in-betweening, trajectory control, keyframe guidance, hand-reaction synthesis, and long-horizon generation, all from the same model via the partial denoising mechanism.
7. Code and Implementation
The codebase is well-structured, with separate modules for diffusion, autoregressive, evaluation, and IsaacGym-based simulation. Key implementation details:
- Diffusion: Hydra-based config, 4/8/12/16-layer Transformer decoder variants
- Autoregressive: VQ tokenizer training → code extraction → text-prefix AR training; model sizes from 4.6M to 3B parameters; codebook sizes 512–65,536
- Evaluation: Unified pipeline computing FID, R-Precision, MM Dist, Diversity, and contact metrics (\(C_\text{prec}\), \(C_\text{rec}\), \(C_\text{F1}\))
- Simulation: IsaacGym-based physics replay for MANO hand meshes, supporting single-sequence and grid visualizations
8. Strengths and Limitations
Strengths. The paper’s main value lies in its holistic approach: it does not just build a model but constructs a complete ecosystem — dataset, annotation pipeline, benchmarks, metrics, and scaling analysis. The triplet annotation structure (left, right, interaction) is a clean design choice that helps models avoid left-right confusion. The contact-focused metrics fill a genuine evaluation gap. And the scaling analysis, while not earth-shattering, provides concrete evidence of when scaling helps and when it saturates.
Limitations. The paper is primarily about hand-only motion without body context. The practical applicability depends on how well these hand motions integrate with whole-body models downstream. The LLM-based annotation pipeline, while scalable, inherits whatever biases the LLM brings to motion description — the paper does not analyze annotation quality or failure modes in depth. Finally, the scaling analysis is limited to the HandX dataset; it would be interesting to see whether the trends hold when transferring to other domains or combining with body-level data.
9. Takeaways
HandX makes a strong case that hand motion generation is a field that was held back more by data and evaluation infrastructure than by model architecture. The core contributions — a clean, contact-rich bimanual dataset, a principled annotation pipeline, and hand-focused evaluation metrics — are the kind of foundation that enables others to build on top of. The scaling analysis adds useful guidance: moderate scaling works, but matching model capacity to data size matters more than blindly increasing either one.
For anyone working on embodied AI, telepresence, or human animation, the practical message is that fine-grained bimanual motion is now within reach of generative models, but it requires purpose-built data and evaluation to get there.
