[Paper Notes] DexterCap: Affordable and Automated Capture of Complex Hand-Object Interactions
TL;DR
DexterCap is a low-cost optical motion-capture system for fine-grained hand-object interaction. The paper’s main contribution is a capture-and-reconstruction pipeline for data collection; policy learning is downstream. It shows that dense, identifiable visual markers combined with automated geometric reconstruction can capture subtle in-hand manipulation data with higher fidelity and less manual cleanup than typical low-cost capture setups.
The key mechanism is to attach many character-coded marker patches directly to relatively rigid hand regions and objects. CornerNet, EdgeNet, and BlockNet recover marker corners, valid block edges, and two-character IDs from multi-view grayscale video; triangulated 3D markers then drive MANO hand fitting and object pose reconstruction. The resulting DexterHand dataset contains long in-hand manipulation sequences over primitive objects and a Rubik’s Cube, making the work most useful as dataset infrastructure for dexterous robot learning.
Paper Info
The paper is “DexterCap: Affordable and Automated Capture of Complex Hand-Object Interactions” by Yutong Liang, Shiyi Xu, Yulong Zhang, Bowen Zhan, He Zhang, and Libin Liu, accepted to Eurographics 2026 / Computer Graphics Forum. The arXiv entry uses the title “DexterCap: An Affordable and Automated System for Capturing Dexterous Hand-Object Manipulation” and is available as arXiv:2601.05844. The project page is pku-mocca.github.io/Dextercap-Page, with code at PKU-MoCCA/dextercap and dataset access through Hugging Face.
Core Problem
Fine hand-object capture sits in an awkward middle ground. Commercial optical mocap is accurate but expensive and often needs cleanup when markers disappear or swap. Data gloves reduce visual occlusion but still face drift and finger-accuracy issues. Markerless RGB/RGB-D methods are cheaper, yet severe self-occlusion, close contact, motion blur, and small in-hand rotations remain difficult.
DexterCap opts for an explicit sensing design. The prototype uses 13 Hikvision MV-CS050-10GM industrial GigE PoE cameras around a 2 x 1 x 2 m capture cage, recording grayscale video at 2048 x 2448 and 20 FPS. The reported hardware cost is under 6,000 USD. The system still needs one-time calibration, subject-specific MANO shape estimation, and a small amount of per-session labeling: 2-3 frames from each camera's video, with about 180 labeled frames being enough for the image models to generalize well within the same environment.
Method
DexterCap’s marker design is the center of the system. Each checkerboard-like white square contains a two-character ID drawn from uppercase letters and digits after removing visually confusing characters, giving 324 unique tags; an underscore marks orientation. Instead of putting markers on a glove that can stretch, wrinkle, or slide, the system attaches marker patches to finger knuckles, finger segments, dorsum, and palm. Each hand uses 19 patches and provides more than 500 detectable corners. Object surfaces are marked as well, so rigid objects can be solved by pose alignment and articulated objects can expose internal state.
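The tag count is easy to sanity-check: 324 is 18 squared, which is what you get from ordered two-character IDs over an 18-symbol alphabet with repetition allowed. The sketch below is only an illustration of that arithmetic. The set of dropped characters (`CONFUSING`) is my own hypothetical choice, since these notes do not list the paper's actual alphabet.

```python
from itertools import product
import string

# Hypothetical set of visually confusing characters to drop (assumption);
# the notes only state that 324 unique tags remain after filtering.
CONFUSING = set("0OQD1IL2Z5S8B6G9UV")  # 18 characters

ALPHABET = [c for c in string.ascii_uppercase + string.digits
            if c not in CONFUSING]


def all_tags(alphabet):
    """Every ordered two-character ID, with repetition allowed."""
    return ["".join(pair) for pair in product(alphabet, repeat=2)]


TAGS = all_tags(ALPHABET)
print(len(ALPHABET), len(TAGS))
```

With 36 uppercase letters and digits and 18 of them removed, the remaining 18 symbols give 18 x 18 = 324 tags, matching the number reported above.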
The visual parser is a three-stage cascade. CornerNet predicts checkerboard-corner heatmaps on grayscale patches. EdgeNet decides whether two candidate corners form a valid block edge, which makes assembly much cheaper than exhaustive quadrilateral validation. BlockNet reads the two characters and orientation, after which a voting step uses the known marker-patch layout to correct local mistakes. The edge-first design is both a speed trick and a reliability trick: it reduces wrong correspondences before geometric reconstruction begins.
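To make the edge-first idea concrete, here is a minimal sketch of assembling candidate blocks from already-validated edges. It assumes the edge classifier's output is given as a plain list of corner-index pairs; the function name `blocks_from_edges` and the chord-free 4-cycle rule are my illustration, not the paper's actual assembly code.

```python
from itertools import combinations


def blocks_from_edges(edges):
    """Return chord-free 4-cycles (candidate marker blocks) in an edge set.

    `edges` holds (i, j) corner-index pairs that an edge classifier such as
    EdgeNet has accepted; here they are simply passed in (assumption).
    """
    adj = {}
    for i, j in edges:
        adj.setdefault(i, set()).add(j)
        adj.setdefault(j, set()).add(i)

    blocks = set()
    # In a quadrilateral a-b-c-d, the opposite corners (a, c) are not
    # adjacent and share the other two corners (b, d) as neighbours.
    for a, c in combinations(sorted(adj), 2):
        if c in adj[a]:
            continue  # adjacent corners cannot be opposite corners
        common = sorted(adj[a] & adj[c])
        for b, d in combinations(common, 2):
            if d in adj[b]:
                continue  # skip cycles that contain a diagonal edge
            blocks.add(frozenset((a, b, c, d)))
    return blocks


# Six corners laid out as a 2 x 3 grid, with seven accepted edges.
grid_edges = [(0, 1), (1, 2), (3, 4), (4, 5), (0, 3), (1, 4), (2, 5)]
print(blocks_from_edges(grid_edges))
```

On this toy grid the edge graph yields two blocks, whereas exhaustive validation would have to consider every 4-subset of the six corners (15 candidates). The gap grows quickly with the number of detected corners, which is the point of validating edges first.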
Once 2D marker IDs are recovered across cameras, DexterCap triangulates 3D marker positions, keeps markers observed by at least three cameras, and uses RANSAC to reject outlier observations. It then removes inconsistent 3D clusters within a marker patch, filters abnormal temporal z-scores, and linearly interpolates short gaps when neighboring frames contain observations.
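The triangulation step can be sketched as follows. This is a simplified illustration under stated assumptions: cameras are given as 3 x 4 projection matrices, linear DLT is used for the solve, and exhaustive enumeration of camera pairs stands in for random sampling in RANSAC because the camera count is small. The function names and the 3-pixel threshold are hypothetical, not taken from the paper.

```python
from itertools import combinations

import numpy as np


def triangulate_dlt(projections, pixels):
    """Linear (DLT) triangulation of one 3D point from two or more views."""
    rows = []
    for P, (u, v) in zip(projections, pixels):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]                      # null-space direction, homogeneous
    return X[:3] / X[3]


def reprojection_errors(projections, pixels, X):
    """Pixel distance between each observation and the reprojected point."""
    Xh = np.append(X, 1.0)
    errs = []
    for P, uv in zip(projections, pixels):
        x = P @ Xh
        errs.append(np.linalg.norm(x[:2] / x[2] - np.asarray(uv)))
    return np.array(errs)


def robust_triangulate(projections, pixels, min_views=3, thresh_px=3.0):
    """Hypothesize from every camera pair and keep the largest inlier set.

    Returns (point, inlier_mask), or None when fewer than `min_views`
    cameras agree. The threshold value is an assumption.
    """
    n = len(projections)
    if n < min_views:
        return None
    best = None
    for i, j in combinations(range(n), 2):
        X = triangulate_dlt([projections[i], projections[j]],
                            [pixels[i], pixels[j]])
        inliers = reprojection_errors(projections, pixels, X) < thresh_px
        if best is None or inliers.sum() > best.sum():
            best = inliers
    if best.sum() < min_views:
        return None
    idx = np.flatnonzero(best)
    X = triangulate_dlt([projections[k] for k in idx],
                        [pixels[k] for k in idx])
    return X, best
```

The final solve uses only the inlier views, so a single mis-identified marker in one camera does not drag the 3D estimate. A production pipeline would likely follow this with a nonlinear refinement of reprojection error.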
Hand reconstruction uses MANO with shape parameters \(\beta \in \mathbb{R}^{10}\) and pose parameters \(\theta \in \mathbb{R}^{45}\). The authors define anatomically informed local joint coordinates and reduce the controllable hand pose to:
\[\phi \in \mathbb{R}^{27}\]
which maps differentiably into the full MANO pose space. For each subject, a coarse scan from a Structure Sensor mounted on an iPhone produces a mesh of about 6k vertices for shape fitting through Chamfer distance, optionally helped by finger-length measurements. Calibration ties physical markers to constrained MANO submeshes and fixes marker-to-surface barycentric coordinates; per-frame optimization then solves global translation, global orientation, and hand pose with pose-limit regularization. When a marker and downstream markers on the same kinematic chain are occluded, the corresponding joint DoFs are held at their previous-frame values for temporal stability.
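Fixed barycentric attachments mean each marker's predicted position is a weighted sum of three vertices of the posed mesh. Below is a small sketch of that mapping plus a visibility-masked squared-error term. The residual is my assumption about what a marker data term could look like; the paper's actual objective, any surface offset for marker thickness, and the MANO forward pass are not reproduced here.

```python
import numpy as np


def marker_positions(vertices, faces, face_ids, bary):
    """Predicted marker positions from fixed barycentric attachments.

    vertices: (V, 3) posed mesh vertices
    faces:    (F, 3) vertex indices per triangle
    face_ids: (M,)   triangle each marker is attached to
    bary:     (M, 3) barycentric weights, each row summing to 1
    """
    tri = vertices[faces[face_ids]]            # (M, 3, 3) triangle corners
    return np.einsum("mk,mkd->md", bary, tri)  # weighted sum per marker


def marker_residual(vertices, faces, face_ids, bary, observed, visible):
    """Mean squared distance over visible markers only (assumed data term)."""
    pred = marker_positions(vertices, faces, face_ids, bary)
    diff = (pred - observed)[visible]
    return float((diff ** 2).sum(axis=1).mean())
```

Because the weights and triangle indices are frozen after calibration, the only thing that changes per frame is `vertices`, so the residual stays differentiable with respect to whatever pose parameters produce the mesh.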
Object reconstruction is simple for rigid objects and more interesting for the Rubik’s Cube. Rigid pose is estimated with the Kabsch algorithm by aligning observed object markers to canonical marker positions on the object mesh. For the 2 x 2 x 2 Rubik’s Cube, DexterCap uses 384 markers on external facelets, detects the rotating face through coplanarity analysis, decomposes the cube into two 1 x 2 x 2 blocks, registers them separately, and snaps accumulated rotation to discrete quarter turns. This example matters because it turns the system from rigid object tracking into structured articulated-state capture.
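For reference, the rigid alignment and the quarter-turn snapping can each be written in a few lines. This is a generic textbook Kabsch solve plus a rounding helper, shown as a sketch. The function names are mine, and the paper's handling of marker weighting, outliers, and rotation-axis detection is not covered.

```python
import numpy as np


def kabsch(canonical, observed):
    """Least-squares rigid transform (R, t) with observed ~= canonical @ R.T + t.

    Both inputs are (N, 3) arrays of corresponding marker positions.
    """
    mu_c = canonical.mean(axis=0)
    mu_o = observed.mean(axis=0)
    H = (canonical - mu_c).T @ (observed - mu_o)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_o - R @ mu_c
    return R, t


def snap_quarter_turn(angle_rad):
    """Round an accumulated face rotation to the nearest multiple of 90 degrees."""
    quarter = np.pi / 2
    return np.round(angle_rad / quarter) * quarter
```

For the cube, one plausible use is to run `kabsch` separately on the markers of each 1 x 2 x 2 block, measure the relative rotation about the shared axis, and pass the accumulated angle through `snap_quarter_turn` once the turn completes.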
Dataset and Results
DexterCap is used to build DexterHand, an open-source dataset for in-hand manipulation over seven basic object shapes plus a Rubik’s Cube, including cuboids, cylinder, disk/plate, ring, and triangular prism variants. The selected sequences in Table 1 total 4936.65 seconds, about 82 minutes, with most sequences lasting around 7-12 minutes. The reported average hand-object penetration is 0.38 +/- 0.31 cm, which the authors interpret as consistent with real hand deformation under grasping forces.
Marker extraction is quantitatively strong: CornerNet reaches 94.7% precision, 81.6% recall, and 87.7% F1 at the image level; EdgeNet reaches 99.02% accuracy, 98.9% precision, 99.1% recall, and 99.0% F1; BlockNet reaches 98.39% orientation accuracy, 97.95% left-character accuracy, and 97.36% right-character accuracy. The edge-first assembly reduces the search from 5550 quads per frame to 83 blocks via 707 edges. Reconstruction errors are also small for this capture setting: triangulated detected markers have 1.42 px reprojection error, MANO marker reconstruction error is 0.77 +/- 0.28 mm during calibration and 2.06 +/- 1.09 mm during dynamic manipulation, and object marker fitting error is 1.512 mm.
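As a quick consistency check on the numbers above, the reported F1 scores can be recomputed from the reported precision and recall, and the size of the search reduction can be expressed as a ratio. Nothing here comes from the paper's code; it is just arithmetic on the quoted figures.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (percent in, percent out)."""
    return 2 * precision * recall / (precision + recall)


corner_f1 = round(f1(94.7, 81.6), 1)   # CornerNet, image level
edge_f1 = round(f1(98.9, 99.1), 1)     # EdgeNet
quads_per_block = round(5550 / 83, 1)  # candidate quads per recovered block

print(corner_f1, edge_f1, quads_per_block)
```

The recomputed values agree with the reported 87.7% and 99.0% F1, and the ratio works out to roughly 67 candidate quads per recovered block under exhaustive search.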
For motion quality, the paper compares DexterHand with GRAB, ARCTIC, HUMOTO, HaMeR, and GigaHands. The table is worth keeping because it compresses the main empirical claim: DexterHand leads on MSNR, jerk, and diversity against both mocap/data-glove and vision-only datasets for fine in-hand manipulation, although its coherence score (0.68) trails ARCTIC and HaMeR (0.81 each).
| Dataset | MSNR ↑ | Jerk ↓ | Diversity ↑ | Coherence ↑ |
|---|---|---|---|---|
| DexterHand / Ours | 9.31 | 0.76 | 0.97 | 0.68 |
| GRAB (Vicon) | 7.29 | 3.68 | 0.91 | 0.70 |
| ARCTIC (Vicon) | 7.82 | 0.91 | 0.90 | 0.81 |
| HUMOTO (Data Glove) | 7.51 | 1.90 | 0.93 | 0.63 |
| HaMeR (Vision) | -0.05 | 23.76 | 0.90 | 0.81 |
| GigaHands (Vision) | 3.50 | 2.62 | 0.91 | 0.73 |
Why It Matters
For robot learning, DexterCap is valuable because it records the things policies and tracking controllers often need but ordinary video rarely provides cleanly: fine hand articulation, object pose trajectories, contact-rich motion, long-horizon interaction, and articulated object state. DexterHand can therefore serve as a manipulation-prior source for later work such as ConTrack, which evaluates on DexterHand clips for continuous single-hand in-hand rotation.
The strongest idea is the full systems argument. DexterCap combines affordable camera hardware, dense explicit markers, learned visual parsing, geometric reconstruction, and MANO fitting into one capture loop. The Rubik’s Cube example expands that argument from 6-DoF rigid pose to structured object state, which is exactly the sort of signal future dexterous manipulation datasets need.
Limitations
DexterCap remains a vision-based marker system, so severe occlusion is still a real failure mode. The paper specifically mentions the ring-object case where fingers can be fully occluded, producing artifacts such as finger-object penetration. The dataset is useful but still limited in subject count, object diversity, and task range; the authors list future directions including more subjects, deformable and articulated objects, bimanual interaction, tool use, grasp labels, functional intent, contact regions, and force annotations.
The current implementation is also offline and computationally heavy: marker recognition takes roughly 5 seconds per frame, and hand-object reconstruction takes 5-12 seconds per frame. Marker patches reduce ambiguity but change the appearance of hands and objects, so the captured videos differ from natural RGB observations.
Takeaway
DexterCap’s core message is that high-quality dexterous hand data may need explicit capture design alongside stronger markerless vision. Dense character-coded patches, direct attachment to rigid hand regions, learned corner/edge/block recognition, anatomically constrained MANO fitting, and structured articulated-object reconstruction together make a low-cost system capable of collecting data that RGB-only methods still struggle to recover.
My short label for the paper is:
Dexterous Hand-Object Motion Capture / Marker-Based Reconstruction / Dataset Infrastructure
