[Paper Notes] ConTrack: Constrained Hand Motion Tracking with Adaptive Trade-off Control
Core Idea
ConTrack is a constrained reinforcement-learning controller for tracking dexterous hand-object demonstrations. The paper, “ConTrack: Constrained Hand Motion Tracking with Adaptive Trade-off Control” by Yutong Liang, Quanquan Peng, Ri-Zhao Qiu, and Xiaolong Wang from UC San Diego, is available as arXiv:2606.03177, with a project page at lyt0112.com/projects/ConTrack.
The core argument is that human-to-robot tracking should be framed as a trade-off problem rather than as pure imitation. A human hand trajectory can contain unreachable robot postures, unstable contacts, or object motions that only work for a different embodiment. ConTrack therefore separates task from style: object tracking is treated as the primary requirement, while hand motion and contact fidelity are optimized as style once the object trajectory is under control.
Each reference clip is modeled as a finite-horizon MDP. The state includes robot joints, object poses, and velocities; the reference provides retargeted robot joint targets, object pose targets, link-level contact events, and object-local contact points. The policy predicts a residual around the reference,
\[q_t^{tar} = q_t^{ref} + a_t\]
so the controller stays near the demonstration while retaining enough freedom to resolve physical conflicts. The reward decomposes into task tracking, style fidelity, and motion regularization:
\[r(s_t, a_t) = r_g(s_t, a_t) + r_s(s_t, a_t) + r_p(s_t, a_t)\]
where \(r_g\) measures object pose tracking, \(r_s\) measures hand kinematics and contact fidelity, and \(r_p\) penalizes high-frequency or unstable motion. The constrained objective is the most compact statement of the paper:
\[\max_\pi J_s(\pi) \quad \text{s.t.} \quad J_g(\pi) \ge \alpha J_g^\star\]
In words, ConTrack maximizes style fidelity while maintaining a target fraction of the best observed task-tracking return. The target ratio \(\alpha\) becomes the knob that defines how much object-tracking safety the policy must maintain before it spends capacity on motion style.
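To make the residual action and the constraint concrete, here is a minimal sketch of my own, not code from the paper. The function names are hypothetical, and clipping the target to joint limits is an assumption the post does not confirm.

```python
from typing import List, Sequence


def residual_target(q_ref: Sequence[float], action: Sequence[float],
                    q_min: Sequence[float], q_max: Sequence[float]) -> List[float]:
    """Compute q_tar = q_ref + a per joint.

    Clipping to [q_min, q_max] is an illustrative assumption, not a
    detail taken from the paper.
    """
    return [min(max(r + a, lo), hi)
            for r, a, lo, hi in zip(q_ref, action, q_min, q_max)]


def constraint_satisfied(j_task: float, j_task_best: float, alpha: float) -> bool:
    """Check the task constraint J_g >= alpha * J_g_star."""
    return j_task >= alpha * j_task_best


# A small residual nudges the first joint; the second is clipped at its limit.
q_tar = residual_target([0.1, 0.5], [0.05, 0.9], [-1.0, -1.0], [1.0, 1.0])
print(q_tar)
print(constraint_satisfied(j_task=0.8, j_task_best=1.0, alpha=0.9))
```

With \(\alpha = 0.9\), a task return of 0.8 against a best observed return of 1.0 violates the constraint, which is the situation where the method should push harder on object tracking.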
Adaptive Task-Style Mixing
The main mechanism is an online controller for the task-style frontier. ConTrack uses a Lagrangian-style relaxation,
\[\mathcal{L}(\pi, \lambda) = J_s(\pi) - \lambda \left(\alpha - \frac{J_g(\pi)}{J_g^\star}\right)\]
where \(\lambda\) controls how strongly the policy should emphasize object tracking. PPO updates the policy with a mixed advantage,
\[A_{mix} = w_{task} A_g + (1 - w_{task}) A_s + A_p\]
with:
\[w_{task} = \sigma(\lambda)\]
and the dual variable is updated online:
\[\lambda \leftarrow \lambda + \eta \left(\alpha - \frac{\hat{J}_g}{J_g^\star}\right)\]
When normalized task return drops below \(\alpha\), \(\lambda\) rises and the policy shifts pressure toward object tracking. When task return is comfortably above the target, \(\lambda\) falls and the policy can recover more hand-pose and contact style. This is the paper’s cleanest departure from fixed reward mixtures: the controller adapts the trade-off during training instead of requiring a single static reward balance for every sequence.
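The update rule and the mixed advantage fit in a few lines. The sketch below is my own reading of the equations, not the authors' implementation: the class and function names are hypothetical, the running maximum stands in for \(J_g^\star\), and I assume task returns are positive so the normalized ratio stays in (0, 1].

```python
import math
from typing import List, Optional, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class DualController:
    """Online dual update for the task/style trade-off (illustrative names)."""

    def __init__(self, alpha: float, eta: float, lam: float = 0.0) -> None:
        self.alpha = alpha    # target fraction of the best task return
        self.eta = eta        # dual step size
        self.lam = lam        # dual variable lambda
        self.j_task_best: Optional[float] = None  # running max, stands in for J_g^*

    def update(self, j_task_hat: float) -> float:
        """Apply lambda <- lambda + eta * (alpha - J_g_hat / J_g^*)."""
        if self.j_task_best is None or j_task_hat > self.j_task_best:
            self.j_task_best = j_task_hat
        # Assumption: task returns are positive, so the ratio lies in (0, 1].
        ratio = j_task_hat / max(self.j_task_best, 1e-8)
        self.lam += self.eta * (self.alpha - ratio)
        return self.lam

    @property
    def w_task(self) -> float:
        """Task weight w_task = sigmoid(lambda)."""
        return sigmoid(self.lam)


def mix_advantages(a_task: Sequence[float], a_style: Sequence[float],
                   a_reg: Sequence[float], w_task: float) -> List[float]:
    """A_mix = w_task * A_g + (1 - w_task) * A_s + A_p, element-wise."""
    return [w_task * g + (1.0 - w_task) * s + p
            for g, s, p in zip(a_task, a_style, a_reg)]


ctrl = DualController(alpha=0.9, eta=1.0)
ctrl.update(10.0)   # ratio 1.0 is above alpha, so lambda falls
ctrl.update(5.0)    # ratio 0.5 is below alpha, so lambda rises
print(ctrl.lam, ctrl.w_task)
```

Note that \(\lambda\) is left unclipped here, since \(\sigma(\lambda)\) already maps any real value to a weight in (0, 1); whether the paper bounds \(\lambda\) is something I have not verified.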
Contact Priors and Adaptive Resets
Contact priors are what keep the style term from degenerating into joint-angle imitation. The reference contains binary contact events between hand links and objects, plus contact points in the object frame. The style reward encourages the policy to match the reference contact timing and keep contact points near the annotated object-local targets. This matters because low object pose error alone can hide very different finger-object interactions.
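As a rough illustration of the two contact quantities, the sketch below scores contact timing with an F1 over binary per-frame events and measures contact-point error as a mean Euclidean distance in the object frame. This is my own simplification: the function names are hypothetical, and the paper's exact reward and metric definitions may differ.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def contact_f1(pred: Sequence[bool], ref: Sequence[bool]) -> float:
    """F1 between predicted and reference binary contact events."""
    tp = sum(1 for p, r in zip(pred, ref) if p and r)
    fp = sum(1 for p, r in zip(pred, ref) if p and not r)
    fn = sum(1 for p, r in zip(pred, ref) if not p and r)
    denom = 2 * tp + fp + fn
    # Convention (assumption): no contacts anywhere counts as a perfect match.
    return 1.0 if denom == 0 else 2 * tp / denom


def contact_point_error(pred_pts: Sequence[Point], ref_pts: Sequence[Point],
                        ref_contact: Sequence[bool]) -> float:
    """Mean distance to the object-local targets, over reference-contact frames."""
    dists = [math.dist(p, r)
             for p, r, c in zip(pred_pts, ref_pts, ref_contact) if c]
    return sum(dists) / len(dists) if dists else 0.0


print(contact_f1([True, True, False, False], [True, False, True, False]))
print(contact_point_error([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                          [(0.0, 0.0, 0.03), (0.0, 0.0, 0.0)],
                          [True, False]))
```

Both quantities can be poor even when object pose error is low, which is the point of tracking them separately.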
The reset mechanism addresses the long-horizon side of the same problem. If every rollout begins at frame one, early failures dominate training and later contact phases receive little signal. ConTrack maintains a reset library indexed by reference frame, storing policy-reachable states instead of raw reference states. That detail is important: directly resetting to a human hand-object reference can create impossible contacts in simulation, while states visited by the current policy are physically consistent under the learned controller.
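One way to picture the reset library is a bounded buffer of simulator states per reference frame, filled only from states the policy actually visited. The sketch below is a guess at the data structure, not the paper's code: `ResetLibrary`, the per-frame capacity, and the fall-back to `None` for empty frames are all my assumptions.

```python
import random
from collections import defaultdict, deque
from typing import Any, Deque, Dict, List, Optional


class ResetLibrary:
    """Policy-visited simulator states, indexed by reference frame (hypothetical)."""

    def __init__(self, capacity_per_frame: int = 8) -> None:
        # Each frame keeps only its most recent states (capacity is an assumption).
        self._states: Dict[int, Deque[Any]] = defaultdict(
            lambda: deque(maxlen=capacity_per_frame))

    def add(self, frame: int, sim_state: Any) -> None:
        """Store a state that the current policy reached at this frame."""
        self._states[frame].append(sim_state)

    def sample(self, frame: int, rng: random.Random) -> Optional[Any]:
        """Return a stored state for this frame, or None if none exists yet."""
        bucket = self._states.get(frame)
        if not bucket:
            return None  # caller can fall back to the first reference frame
        return rng.choice(list(bucket))

    def frames(self) -> List[int]:
        """Frames that currently have at least one stored state."""
        return sorted(k for k, v in self._states.items() if v)


lib = ResetLibrary(capacity_per_frame=2)
for state in ("s0", "s1", "s2"):
    lib.add(5, state)          # "s0" is evicted once capacity is exceeded
print(lib.frames(), lib.sample(5, random.Random(0)))
```

Keeping only recent states per frame means the library tracks the current policy instead of accumulating states from earlier, weaker policies.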
For each frame \(k\), ConTrack estimates a survival ratio,
\[u_k = \frac{\bar{\ell}_k}{T-k}\]
then samples reset frames with
\[p(k) \propto \exp(-u_k / \tau)\]
This favors segments where the policy fails quickly. As tracking improves, the sampled starts move with the current failure boundary, producing a curriculum that remains tied to reachable contact states.
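The sampling rule is a softmax over negative survival ratios. In the sketch below I assume \(\bar{\ell}_k\) is the mean number of frames survived by rollouts started at frame \(k\), and I use 0-indexed frames \(k = 0, \dots, T-1\) so the denominator never reaches zero. Both are my assumptions, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def survival_ratios(mean_survived: Sequence[float], horizon: int) -> List[float]:
    """u_k = l_bar_k / (T - k) for k = 0 .. len(mean_survived) - 1."""
    return [l / (horizon - k) for k, l in enumerate(mean_survived)]


def reset_distribution(u: Sequence[float], tau: float) -> List[float]:
    """p(k) proportional to exp(-u_k / tau), computed as a stable softmax."""
    logits = [-x / tau for x in u]
    m = max(logits)                       # subtract the max to avoid overflow
    weights = [math.exp(z - m) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_reset_frame(probs: Sequence[float], rng: random.Random) -> int:
    """Draw one start frame according to the reset distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Frame 1 has the lowest survival ratio (2 of 9 remaining frames),
# so it receives the largest sampling probability.
u = survival_ratios([10.0, 2.0, 5.0], horizon=10)
p = reset_distribution(u, tau=0.5)
print(p, sample_reset_frame(p, random.Random(0)))
```

A smaller \(\tau\) concentrates resets on the hardest frames, while a larger \(\tau\) moves the distribution toward uniform.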
Experiments and Main Results
ConTrack is evaluated on GRAB for bimanual rigid-object interaction, ARCTIC for articulated and multi-object bimanual interaction, and DexterHand for single-hand in-hand rotation. Each clip is trained as an independent tracking task for 5000 PPO updates under a fixed simulator-step budget. Evaluation starts from the first reference frame, so progress measures whether the controller can survive the full sequence instead of only selected resets.
| Method | Progress ↑ | Obj pos ↓ | Obj rot ↓ | Finger err ↓ | Contact F1 ↑ | Contact pt ↓ |
|---|---|---|---|---|---|---|
| ConTrack | 0.899 | 0.026 m | 0.272 rad | 0.163 rad | 0.784 | 0.018 m |
| ManipTrans | 0.743 | 0.012 m | 0.207 rad | 0.277 rad | 0.620 | 0.030 m |
| DexMachina | 0.246 | 0.038 m | 0.348 rad | 0.147 rad | 0.708 | 0.024 m |
| SPIDER | 0.444 | 0.201 m | 1.104 rad | 0.157 rad | 0.191 | 0.036 m |
The table captures the main trade-off. ManipTrans has lower object pose error on frames it survives, while ConTrack reaches much higher progress and stronger contact fidelity. DexMachina keeps finger motion close to the reference but has limited progress under the same budget. SPIDER lacks a learned feedback policy and struggles once contact dynamics dominate.
The ablations line up with the method story: adaptive task-style mixing improves progress and contact fidelity over fixed mixing; the reset library improves progress over start-only and uniform mid-clip resets; contact prior rewards improve contact F1 and contact-point accuracy. The paper also reports a real-world feasibility study on a tabletop bimanual platform with two xArm7 arms and two xHands, where policy-predicted joint references are streamed from simulation to a real-time controller over TCP.
Relation to ManipTrans
ManipTrans is a useful comparison because it also transfers human bimanual manipulation to dexterous robot hands through residual learning. Its pipeline pretrains a generalist trajectory imitator for hand motion, then fine-tunes residual corrections under interaction constraints. ConTrack changes the emphasis: it turns object tracking into the explicit constraint, treats hand motion and contacts as style, and uses an online dual controller to adapt the balance during RL. It also makes mid-trajectory resets depend on policy-reachable states, which is especially relevant when direct reference resets produce inconsistent contact configurations.
My short taxonomy is: ManipTrans is a two-stage transfer pipeline centered on imitation plus residual correction; ConTrack is a constrained RL tracking controller centered on adaptive task-style allocation.
Strengths and Limitations
ConTrack’s strength is conceptual clarity. It names the central conflict in dexterous tracking: object success and motion fidelity often compete. The dual controller turns that conflict into an explicit optimization mechanism, and the reset library keeps training aligned with the simulator states the policy can actually reach. The metrics are also well chosen: progress, object errors, joint errors, contact F1, and contact point error make it harder for a method to look good by optimizing only one side of the task-style trade-off.
The limitations are also clear. The constrained formulation uses a running maximum of task return for normalization, so it behaves as a practical controller instead of a strict guarantee. Contact priors depend on the availability and quality of contact annotations. The hardest DexterHand Ring clip remains difficult under the fixed 5000-update budget; the appendix reports that longer training can reach 94% success with 100,000 PPO updates, which suggests feasibility while exposing compute sensitivity. The real-world study demonstrates executable joint-command streaming, but broader deployment would still need stronger perception and tighter sim-to-real alignment.
Takeaway
The practical recipe is compact: use human motion as a reference, give object tracking priority through a constrained objective, preserve hand motion and contact as style when the task allows it, train from policy-reachable mid-trajectory states, and report progress together with contact fidelity. For my own mental taxonomy, I would label this paper:
Constrained RL / Reference Tracking / Residual Tracking Controller / Human-Demonstration-Guided Dexterous Manipulation
