[Paper Notes] DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo
TL;DR
DexJoCo is best read as an integrated benchmark and workflow for task-oriented dexterous manipulation. Its contribution is the full stack around policy learning: MuJoCo environments, a Franka Panda + Allegro Hand robot setup, low-cost human teleoperation, replayable demonstrations, LeRobot/Zarr data conversion, policy training and evaluation, and robustness tests under visual and dynamics randomization.
The paper builds 11 functional tasks and collects 1.1K human demonstration trajectories. The tasks cover tool use, button pressing, folding, grasping with tongs, watering, peg insertion, articulated-object interaction, bimanual sequencing, and language-conditioned reasoning. My main takeaway is that DexJoCo turns dexterous manipulation from a set of isolated simulator puzzles into a reproducible robot-learning pipeline, while also showing how fragile current imitation-learning and VLA policies remain once fine contact, hand coordination, temporal memory, and visual generalization matter.
Paper and Resources
The paper is “DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo” by Hanwen Wang, Weizhi Zhao, Xiangyu Wang, Siyuan Huang, He Lin, Boyuan Zheng, Rongtao Xu, Gang Wang, Yao Mu, He Wang, Lue Fan, Hongsheng Li, Zhaoxiang Zhang, and Tieniu Tan. It is available as arXiv:2605.16257, with a project page at dexjoco.github.io, code at brave-eai/dexjoco, and LeRobot-format data on Hugging Face at DexJoCo/DexJoCo-Datasets-LeRobot.
Core Argument
Many dexterous manipulation benchmarks emphasize hand-only control, in-hand manipulation, or simple pick-and-place. DexJoCo argues for a more task-oriented setting where the full arm-hand system must produce a functional outcome: a nail is hammered, a mouse is clicked, glasses are folded, a plant is watered, a peg is inserted, an iPad is unlocked, or a camera shutter is pressed. This design makes success depend on sequence, pose, articulated state, and contact:
$$G = \{g_{\text{seq}},\ g_{\text{pose}},\ g_{\text{joint}},\ g_{\text{contact}}\}$$
That framing matters because dexterity is judged by what the hand accomplishes; a motion that merely looks plausible does not count as success. The benchmark includes single-arm and bimanual tasks built around a MuJoCo model of a Rethink mount, Franka Panda arm, and Allegro Hand. Observations include third-person and wrist RGB/RGB-D views, object poses, robot states, end-effector pose, and hand joint angles. Demonstration actions are recorded as absolute target end-effector poses plus absolute target hand joint angles.
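One way to read the goal set G above is as a conjunction: a task counts as solved only when every component predicate holds at once. The following is a minimal sketch of that reading. The class, state keys, and thresholds are all hypothetical and chosen for illustration; they are not taken from the DexJoCo code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical illustration: state keys and tolerances are assumptions,
# not DexJoCo's actual success checks.
State = Dict[str, float]
Predicate = Callable[[State], bool]


@dataclass
class TaskGoal:
    """A task goal expressed as named predicates that must all hold."""
    predicates: Dict[str, Predicate] = field(default_factory=dict)

    def failed(self, state: State) -> List[str]:
        """Return the names of goal components that are not satisfied."""
        return [name for name, check in self.predicates.items() if not check(state)]

    def success(self, state: State) -> bool:
        """Success requires every component to pass."""
        return not self.failed(state)


# Example: a button-press style task with assumed thresholds.
press_goal = TaskGoal({
    "g_pose": lambda s: s["fingertip_to_button_m"] < 0.01,
    "g_joint": lambda s: s["button_travel_m"] > 0.003,
    "g_contact": lambda s: s["contact_force_n"] > 0.5,
})

# The finger hovers over the button but never presses it.
near_miss = {"fingertip_to_button_m": 0.004, "button_travel_m": 0.0, "contact_force_n": 0.0}
print(press_goal.success(near_miss))  # False
print(press_goal.failed(near_miss))   # ['g_joint', 'g_contact']
```

Reporting which components failed, instead of a single boolean, makes it easier to separate "reached the right place" from "actually produced the functional outcome".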
The task suite is broad enough to stress different failure surfaces without turning the post into a catalog: Hammer Nail and Water Plant test tool use, Click Mouse and Photograph test precise button interaction, Pinch Tongs tests finger coordination, Fold Glasses and Microwave Cook test articulated objects, Assembly and Hanoi test alignment and sequencing, Pick Bucket tests long-horizon object handling, and Unlock iPad adds language-conditioned reasoning to bimanual control.
Teleoperation and Data Pipeline
A strong part of DexJoCo is that the benchmark comes with a practical data path. The authors use a roughly $2,300 USD teleoperation setup built from Rokoko Smartgloves, HTC Vive Trackers, HTC Base Stations, and a 3D-printed connector. The glove captures hand motion while avoiding camera-occlusion issues common in vision-only hand tracking; the wrist tracker drives the Franka end-effector. Human hand motion is retargeted to the Allegro Hand with GeoRT, using a self-supervised objective over fingertip direction, workspace coverage, sensitivity, pinch behavior, and self-collision:
$$L = L_{\text{dir}} + \lambda_1 L_{\text{cover}} + \lambda_2 L_{\text{flat}} + \lambda_3 L_{\text{pinch}} + \lambda_4 L_{\text{col}}$$
The released repository mirrors this end-to-end story. dexjoco/ contains the MuJoCo environments and task wrappers, teleoperation/ documents the Vive/Rokoko/GeoRT hardware workflow, scripts/record_demos_zarr.py and scripts/replay_demos_zarr.py support recording and replay, dexjoco-data-converter/ converts demonstrations into LeRobot datasets and Diffusion Policy-style Zarr buffers, openpi/ supports π0.5 training and evaluation, and docs/custom_policy_integration.md describes the observation/action contract for custom policies.
One useful engineering detail is the OpenPI-style server-client evaluation pattern. The policy server emits action chunks, while the DexJoCo evaluation client buffers and executes those actions in simulation, requesting the next plan before the buffer runs dry. That is closer to deployed policy execution than a purely synchronous one-step inference loop.
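The buffering idea can be sketched in a few lines. This is a simplified, synchronous illustration and not the repository's client: `request_chunk` is a hypothetical stand-in for the call to the policy server, and `refill_at` is an assumed threshold. A real client would likely issue the request asynchronously so that simulation keeps stepping while the server computes.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Action = Sequence[float]


class ChunkedActionBuffer:
    """Hand out one action per step, refilling from a chunk-producing policy.

    Hypothetical sketch: `request_chunk` stands in for a policy-server call,
    and `refill_at` is an assumed threshold, not a DexJoCo parameter.
    """

    def __init__(self, request_chunk: Callable[[], List[Action]], refill_at: int = 2):
        self.request_chunk = request_chunk
        self.refill_at = refill_at
        self.buffer: Deque[Action] = deque()
        self.requests = 0  # number of chunks requested so far

    def next_action(self) -> Action:
        # Request the next chunk while a few actions are still queued,
        # so execution never stalls on an empty buffer.
        if len(self.buffer) <= self.refill_at:
            self.buffer.extend(self.request_chunk())
            self.requests += 1
        return self.buffer.popleft()


# Dummy policy: each chunk holds 5 one-dimensional actions numbered in order.
counter = {"n": 0}


def fake_policy_chunk(size: int = 5) -> List[Action]:
    start = counter["n"]
    counter["n"] += size
    return [[float(start + i)] for i in range(size)]


client = ChunkedActionBuffer(fake_policy_chunk, refill_at=2)
executed = [client.next_action()[0] for _ in range(10)]
print(executed)         # actions 0.0 through 9.0, in order
print(client.requests)  # 3
```

The sketch appends each new chunk behind the remaining queued actions. Other designs overwrite the stale tail with the fresh plan; which one the DexJoCo client uses is not stated in these notes.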
| Setup | Policy Action | Environment Action |
|---|---|---|
| Single-arm | 22D [xyz, rotvec, hand16] | 23D [xyz, quat, hand16] |
| Bimanual | 44D [r_xyz, r_rotvec, r_hand16, l_xyz, l_rotvec, l_hand16] | 46D quaternion layout |
The state logs include privileged environment information for replay, but policy training should use only robot proprioception: the first 23 dimensions for single-arm tasks and the first 46 dimensions for bimanual tasks.
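The dimension counts in the table follow from the rotation encoding: a rotation vector has 3 components and a quaternion has 4, so 3 + 3 + 16 = 22 becomes 3 + 4 + 16 = 23 per arm. Below is a minimal sketch of that conversion and of the proprioception slice. The function names are hypothetical, the (w, x, y, z) quaternion ordering is an assumption (it is MuJoCo's convention, but the notes do not state what DexJoCo uses), and the bimanual helper assumes each arm's 23D block is laid out like the single-arm case.

```python
import math
from typing import List, Sequence


def rotvec_to_quat_wxyz(rotvec: Sequence[float]) -> List[float]:
    """Convert an axis-angle rotation vector to a unit quaternion.

    Assumption: scalar-first (w, x, y, z) ordering.
    """
    rx, ry, rz = rotvec
    angle = math.sqrt(rx * rx + ry * ry + rz * rz)
    if angle < 1e-8:
        return [1.0, 0.0, 0.0, 0.0]  # near-zero rotation: identity quaternion
    scale = math.sin(angle / 2.0) / angle
    return [math.cos(angle / 2.0), scale * rx, scale * ry, scale * rz]


def policy_to_env_action(action: Sequence[float]) -> List[float]:
    """Map a 22D [xyz, rotvec, hand16] action to 23D [xyz, quat, hand16]."""
    if len(action) != 22:
        raise ValueError(f"expected 22 dimensions, got {len(action)}")
    xyz, rotvec, hand = list(action[:3]), action[3:6], list(action[6:])
    return xyz + rotvec_to_quat_wxyz(rotvec) + hand


def bimanual_policy_to_env_action(action: Sequence[float]) -> List[float]:
    """Map 44D (right arm then left arm) to 46D, one arm at a time (assumed layout)."""
    if len(action) != 44:
        raise ValueError(f"expected 44 dimensions, got {len(action)}")
    return policy_to_env_action(action[:22]) + policy_to_env_action(action[22:])


def proprio_from_state(state: Sequence[float], bimanual: bool = False) -> List[float]:
    """Keep only the leading proprioceptive dimensions of a logged state."""
    return list(state[: 46 if bimanual else 23])


# A 90-degree rotation about z, with zeros for position and hand joints.
policy_action = [0.0, 0.0, 0.0, 0.0, 0.0, math.pi / 2] + [0.0] * 16
env_action = policy_to_env_action(policy_action)
print(len(env_action))   # 23
print(env_action[3:7])   # approximately [0.7071, 0.0, 0.0, 0.7071]
```

Slicing the state before it reaches the policy keeps privileged object information out of training inputs by construction, instead of relying on the model to ignore it.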
Robustness and Evaluation
DexJoCo evaluates ACT, Diffusion Policy Transformer (DP-T), Diffusion Policy CNN (DP-C), π0.5, and GR00T N1.5. ACT and Diffusion Policy are trained from scratch with vision and proprioception; π0.5 and GR00T N1.5 are LoRA fine-tuned and condition on language. Because default VLA action heads do not directly match bimanual dexterous action dimensions, the authors adapt the heads, including partial reinitialization for extra dimensions.
The robustness design is compact but revealing. rand-obj randomizes object placement and table height. rand-full adds third-person camera pose, lighting direction/color, and tabletop texture randomization. The replay system lets users apply visual randomization by replaying the same trajectories under different rendering settings, and the code also exposes --randomize-dynamics for parameters such as joint friction, stiffness, and object mass.
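As a rough picture of per-episode dynamics randomization, the sketch below scales nominal parameters by factors drawn from fixed ranges with a seeded generator. Everything here is an assumption made for illustration: the parameter names, nominal values, and ranges are hypothetical, and the notes do not describe how the repository implements `--randomize-dynamics` internally.

```python
import random
from typing import Dict, Tuple

# Hypothetical multiplicative scale ranges; not DexJoCo's actual settings.
SCALE_RANGES: Dict[str, Tuple[float, float]] = {
    "joint_friction": (0.8, 1.2),
    "joint_stiffness": (0.8, 1.2),
    "object_mass": (0.5, 1.5),
}

# Hypothetical nominal values, chosen only to make the example concrete.
NOMINAL: Dict[str, float] = {
    "joint_friction": 0.1,
    "joint_stiffness": 20.0,
    "object_mass": 0.3,
}


def sample_dynamics(nominal: Dict[str, float], seed: int) -> Dict[str, float]:
    """Scale each nominal parameter by a factor drawn uniformly from its range."""
    rng = random.Random(seed)  # seeded so an episode can be reproduced exactly
    return {
        name: value * rng.uniform(*SCALE_RANGES[name])
        for name, value in nominal.items()
    }


episode_a = sample_dynamics(NOMINAL, seed=0)
episode_b = sample_dynamics(NOMINAL, seed=0)
print(episode_a == episode_b)  # True: the same seed reproduces the same physics
```

Seeding per episode matters for a benchmark: it lets different policies be evaluated under identical perturbed conditions.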
| Model | rand-obj Avg. Success | rand-full Avg. Success |
|---|---|---|
| DP-T | 50.4% | 20.0% |
| DP-C | 47.6% | 28.4% |
| ACT | 35.5% | 22.7% |
| π0.5 | 52.5% | 34.1% |
| GR00T N1.5 | 40.2% | 30.5% |
The table gives the main result at a glance: π0.5 has the strongest average success in both settings, but every method loses between roughly 10 and 30 points of average success under the fuller visual randomization. The smaller Diffusion Policy variants remain competitive in several settings, which suggests that dexterous manipulation still depends heavily on action representation, temporal memory, and contact-level control, and not on scaling vision-language pretraining alone.
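The size of the degradation is easier to compare per model. The short script below recomputes the absolute and relative drops directly from the table values; nothing beyond those ten numbers is assumed.

```python
# Average success rates (%) copied from the table: (rand-obj, rand-full).
results = {
    "DP-T": (50.4, 20.0),
    "DP-C": (47.6, 28.4),
    "ACT": (35.5, 22.7),
    "π0.5": (52.5, 34.1),
    "GR00T N1.5": (40.2, 30.5),
}

drops = {}
for model, (rand_obj, rand_full) in results.items():
    absolute = rand_obj - rand_full        # percentage points lost
    relative = 100.0 * absolute / rand_obj  # share of rand-obj success lost
    drops[model] = (absolute, relative)
    print(f"{model:<11} -{absolute:.1f} pts  ({relative:.0f}% relative)")
```

DP-T loses the most (30.4 points, about 60% of its rand-obj success) and GR00T N1.5 the least (9.7 points, about 24%), while π0.5 drops 18.4 points and still ends with the highest rand-full average.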
Limitations
The failure modes are the most useful part of the benchmark. Policies can look semantically correct while failing physically: they pick up a camera but miss the shutter, reach a button but fail to press it, align near a peg but miss insertion, or start a bimanual sequence and lose timing. Pinch Tongs exposes weak memory over repeated open-close cycles, Assembly and Hanoi expose imprecise alignment, and several bimanual tasks show how quickly action dimensionality and asymmetric hand roles become bottlenecks.
The benchmark also inherits limits from simulation and sensing. Vision-only policies lack force and tactile cues for contact-rich manipulation. Current VLA models are still mostly pretrained on gripper-heavy robot data, so high-DoF hand action heads require adaptation and can remain brittle. Domain randomization improves coverage, but sim-to-real transfer will still need stronger physical, visual, and sensing fidelity. The iPad password setting also hints that language grounding can collapse into action bias when instructions require arithmetic or paraphrased reasoning.
Takeaway
DexJoCo is most valuable as infrastructure: it packages functional dexterous tasks, accessible teleoperation, replayable demonstrations, data conversion, policy integration, and robustness evaluation into one benchmark pipeline. For research, it is a good place to test whether a method can actually complete contact-rich functional interactions beyond plausible arm-hand motion. For practice, the code release matters because the benchmark can be extended, replayed, converted, and evaluated with modern imitation-learning and VLA tooling.
