[Paper Notes] TopoRetarget: Interaction-Preserving Retargeting for Dexterous Manipulation
Published:
TL;DR
TopoRetarget shifts dexterous retargeting from Human Pose -> Robot Pose to Human Hand-Object Interaction -> Robot Hand-Object Interaction. The central object in the method is a shared interaction mesh that contains both hand keypoints and object surface points. On top of that mesh, the method optimizes topology-aware Laplacian coordinates so that the robot hand preserves the local hand-object interaction pattern in the human demonstration.
The important part of the paper is Sections 3.2-3.4. The method first builds a reasonable robot-hand warm start from relative bone directions, then constructs a shared hand-object graph, then refines the robot hand by matching local topology in place of absolute keypoint positions. The RL controller comes after this retargeting stage: it tracks the generated reference with residual joint-position actions and handles dynamics, timing, and sim-to-real robustness.
Paper: “TopoRetarget: Interaction-Preserving Retargeting for Dexterous Manipulation”. arXiv: 2606.16272. Project page: toporetarget2026.github.io/TopoRetarget.
Core Framing
TopoRetarget moves the goal away from direct keypoint fitting. Human and robot hands differ in bone lengths, joint layout, palm shape, finger arrangement, and feasible contact surfaces. A retargeted pose can match fingertips in Euclidean space while losing the actual manipulation structure, especially when useful contact occurs on phalanges, finger sides, or palm regions.
The paper therefore treats manipulation retargeting as preservation of local hand-object interaction. The robot should keep the same local relationship between hand regions and object regions as the human demonstration. In this view, object-relative geometry becomes the quantity to imitate. The optimization target shifts from “where is this human keypoint in global space” to “where is this hand point relative to its neighboring hand and object points”.
3.2 Relative Bone-Direction Initialization
The first stage provides a robot-hand initialization. Since human and robot hands have different geometry, TopoRetarget uses the local bending pattern of the fingers as the initialization signal and avoids direct copying of absolute hand keypoint positions.
For each non-terminal keypoint \(k\), define the bone direction \(d_k\) as the unit vector from the current keypoint to its child keypoint. \(d_k^s\) denotes the source human bone direction, and \(d_k^r(q)\) denotes the robot bone direction under joint configuration \(q\).
The key design is to compare relative changes between adjacent bones on the same finger. For adjacent bone pairs \((k,l)\in A_B\), the bone-direction mismatch is:
\[ E_{bone}(q) = \sum_{(k,l)\in A_B} \left\| (d_k^r(q)-d_l^r(q)) - (d_k^s-d_l^s) \right\|_2^2 . \]
This loss captures local articulation. It describes how a finger bends from one bone to the next and avoids forcing a single bone to point in an absolute direction. This distinction matters for cross-embodiment transfer because link lengths and palm frames are mismatched across hands.
The initialization solves:
\[ \tilde q_t^r = \arg\min_q \; \lambda_{warm}E_{bone}(q) + \lambda_{smooth} \left\| q-\tilde q_{t-1}^r \right\|_2^2 . \]
The first term makes the robot reproduce the local hand shape. The second term keeps the initialization temporally continuous. The result \(\tilde q_t^r\) serves as a warm start for the final retargeting output. Its purpose is to put the robot hand near a plausible local articulation state before the interaction-aware refinement begins.
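To make the bone-direction term concrete, here is a minimal NumPy sketch of \(E_{bone}\) restricted to a single finger chain. It is my own illustration, not the paper's code: the function names, the `build_chain` helper, and the link lengths are all hypothetical. It only evaluates the energy on given keypoints and does not run the warm-start optimization over \(q\).

```python
import numpy as np

def chain_bone_dirs(points):
    """Unit bone directions along one finger chain; points has shape (n, 3)."""
    bones = points[1:] - points[:-1]  # keypoint -> child keypoint vectors
    return bones / np.linalg.norm(bones, axis=1, keepdims=True)

def bone_energy(robot_chain, human_chain):
    """E_bone for one finger: compare differences of adjacent bone directions."""
    d_r = chain_bone_dirs(robot_chain)
    d_s = chain_bone_dirs(human_chain)
    rel_r = d_r[:-1] - d_r[1:]  # d_k^r - d_l^r for adjacent bones
    rel_s = d_s[:-1] - d_s[1:]  # d_k^s - d_l^s for adjacent bones
    return float(np.sum((rel_r - rel_s) ** 2))

def build_chain(directions, lengths):
    """Hypothetical helper: chain bones of given directions and lengths from the origin."""
    dirs = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    steps = dirs * np.asarray(lengths)[:, None]
    return np.vstack([np.zeros(3), np.cumsum(steps, axis=0)])

# A bent finger and a straight finger, described by bone directions only.
bend = np.array([[1.0, 0.0, 0.0], [1.0, -0.5, 0.0], [1.0, -1.5, 0.0]])
straight = np.array([[1.0, 0.0, 0.0]] * 3)

human = build_chain(bend, [0.045, 0.025, 0.020])            # made-up human link lengths
robot_same_bend = build_chain(bend, [0.060, 0.040, 0.030])  # longer robot links
robot_straight = build_chain(straight, [0.060, 0.040, 0.030])

e_same = bone_energy(robot_same_bend, human)      # same bending, different link lengths
e_straight = bone_energy(robot_straight, human)   # bending pattern lost
```

In this toy case `e_same` is zero up to floating-point error even though the robot links are longer, while `e_straight` is clearly positive. That is the property the initialization relies on: the term reacts to how the finger bends and ignores link length.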
3.3 Interaction Mesh Construction
Matching hand shape alone is insufficient for manipulation. The core signal is the relationship between the hand and the object.
At frame \(t\), TopoRetarget forms a source vertex set and a robot vertex set:
\[ V_t^s=[P_t^h;O_t], \qquad V_t^r(q)=[P_t^r(q);O_t]. \]
\(P_t^h\) is the human hand keypoint set, \(P_t^r(q)\) is the robot hand keypoint set, and \(O_t\) is a set of \(N_o\) points sampled from the object surface. The graph has \(N_v=21+N_o\) vertices: the first 21 are hand keypoints, and the remaining \(N_o\) are object surface points.
The paper runs Delaunay tetrahedralization on the source vertices \(V_t^s\) to obtain the interaction edge set \(\mathcal I_t\). This gives a source graph:
\[ G_t^s=(V_t^s,\mathcal I_t). \]
Then it reuses the same connectivity for the robot graph:
\[ G_t^r(q)=(V_t^r(q),\mathcal I_t). \]
This shared connectivity is the main structural move. Human and robot graphs now have the same local neighborhood structure, so the optimizer can compare corresponding hand-object interactions directly. The method avoids manually specifying which fingertip, phalanx, or palm point should contact which object region. The interaction mesh encodes those local neighborhoods from the source demonstration.
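A rough sketch of the mesh-construction step, assuming SciPy's `scipy.spatial.Delaunay` as the tetrahedralization routine (the paper does not say which implementation it uses). The random point clouds stand in for real keypoints and object samples, and `interaction_edges` is a name I made up.

```python
import numpy as np
from scipy.spatial import Delaunay

def interaction_edges(source_vertices):
    """Unique undirected edges of the Delaunay tetrahedralization of (N, 3) points."""
    simplices = Delaunay(source_vertices).simplices  # (n_tets, 4) vertex indices in 3D
    edges = set()
    for tet in simplices:
        for a in range(4):
            for b in range(a + 1, 4):
                i, j = sorted((int(tet[a]), int(tet[b])))
                edges.add((i, j))
    return sorted(edges)

rng = np.random.default_rng(0)
n_hand, n_obj = 21, 30
human_hand = rng.normal(size=(n_hand, 3))                        # stand-in for P_t^h
robot_hand = human_hand + 0.05 * rng.normal(size=(n_hand, 3))    # stand-in for P_t^r(q)
obj = rng.normal(size=(n_obj, 3)) + np.array([0.3, 0.0, 0.0])    # stand-in for O_t

V_src = np.vstack([human_hand, obj])  # V_t^s = [P_t^h; O_t]
V_rob = np.vstack([robot_hand, obj])  # V_t^r(q) = [P_t^r(q); O_t]

# Connectivity is computed once on the source vertices and reused for the robot graph.
edges = interaction_edges(V_src)
hand_object_edges = [(i, j) for i, j in edges if i < n_hand <= j]
```

The edges in `hand_object_edges` are the ones that tie hand keypoints to object samples. Because `V_rob` uses the same vertex ordering, the same index pairs describe the robot graph without any manual contact assignment.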
3.4 Topology-Aware Laplacian Refinement
Once the shared graph exists, TopoRetarget compares local geometry through weighted Laplacian coordinates.
The edge weights \(w_{ij,t}\) are computed from spatial distances in the source graph: close neighbors receive high weight, and distant neighbors receive low weight. These weights are computed once on the source graph and then reused for the robot graph.
For vertex \(v_i\), the weighted Laplacian coordinate is:
\[ \Delta_t(V)_i = \sum_{j\in\mathcal N_t(i)} w_{ij,t}(v_i-v_j). \]
When the weights are normalized so that \(\sum_j w_{ij,t}=1\), the same expression becomes:
\[ \Delta_t(V)_i = v_i-\sum_{j\in\mathcal N_t(i)} w_{ij,t}v_j. \]
This is “the current point minus the weighted center of its neighbors.” It describes local structure in place of absolute position. It is naturally insensitive to global translation and fits cross-embodiment comparison better than coordinate matching.
The interaction-mesh energy is:
\[ E_{IM}(q) = \frac1{N_v} \sum_{i=1}^{N_v} \left\| \Delta_t(V_t^r(q))_i - \Delta_t(V_t^s)_i \right\|_2^2 . \]
This objective asks the robot graph to match the human graph’s Laplacian coordinates. Put differently, it preserves how a hand point sits relative to surrounding hand points and object surface points. The retained quantity is local hand-object topology: which regions are near each other, in what local direction, and under what local neighborhood geometry.
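The Laplacian coordinates and \(E_{IM}\) fit in a few lines of NumPy. The sketch below assumes row-normalized inverse-distance weights. The notes only say that closer neighbors get higher weight, so the exact weighting function is my assumption, as are the function names and the small hand-written graph.

```python
import numpy as np

def source_weights(V_src, edges, eps=1e-8):
    """Row-normalized inverse-distance weights from the source graph (assumed form)."""
    n = len(V_src)
    W = np.zeros((n, n))
    for i, j in edges:
        w = 1.0 / (np.linalg.norm(V_src[i] - V_src[j]) + eps)
        W[i, j] = W[j, i] = w
    row_sums = np.maximum(W.sum(axis=1, keepdims=True), eps)  # guard isolated vertices
    return W / row_sums

def laplacian_coords(V, W):
    """Each vertex minus the weighted center of its neighbors (normalized weights)."""
    return V - W @ V

def interaction_mesh_energy(V_robot, V_src, W):
    """Mean squared difference between robot and source Laplacian coordinates."""
    diff = laplacian_coords(V_robot, W) - laplacian_coords(V_src, W)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

rng = np.random.default_rng(1)
V_src = rng.normal(size=(6, 3))
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5), (0, 3)]  # toy connectivity
W = source_weights(V_src, edges)  # computed once on the source graph

# A rigidly translated copy keeps every local relation, so the energy stays at zero.
e_translated = interaction_mesh_energy(V_src + np.array([0.5, -0.2, 0.1]), V_src, W)

# Moving a single vertex changes its relation to its neighbors, so the energy rises.
V_moved = V_src.copy()
V_moved[0] += np.array([0.1, 0.0, 0.0])
e_moved = interaction_mesh_energy(V_moved, V_src, W)
```

Here `e_translated` is zero up to floating-point error and `e_moved` is positive, which matches the translation-insensitivity described above. In the actual method the robot vertices depend on \(q\), so this energy would be evaluated through forward kinematics inside the optimizer.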
Final Optimization
The final retargeting problem combines the interaction objective with the hand-shape prior and feasibility terms:
\[ (q_t^{r,\ast},s_t^\ast) = \arg\min_{q,s} \lambda_{IM}E_{IM}(q) + \lambda_{bone}E_{bone}(q) + E_{reg}(q;q_{t-1}^{r,\ast}) + \frac{w_s}{2} \sum_{i\in Q_t}s_i^2 . \]
\(E_{IM}\) is the core term. It keeps local hand-object interaction consistent with the human demonstration. \(E_{bone}\) preserves the local articulation prior from initialization. \(E_{reg}\) provides temporal smoothness and floating-base regularization. The slack variables \(s_i\) belong to the penetration constraints and are penalized so that the optimizer can tolerate small controlled violations while rejecting severe penetration.
The paper also adds signed-distance constraints \(\phi_i(q)\) for robot-hand and object geometry. The design combines a soft tolerance with a hard bound: the optimization can absorb minor geometric noise, while large interpenetration is blocked. This is important because retargeting is often driven by noisy hand-object capture, where tiny contact errors are common and strict zero-penetration constraints can make the problem brittle.
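One way to picture the soft-tolerance-plus-hard-bound design, as a sketch under assumptions. I assume the constraint has the form \(\phi_i(q) + s_i \ge 0\) with \(0 \le s_i \le s_{max}\); the notes do not give the exact form, and `s_max`, `w_s`, and the distance values below are made-up numbers. For a fixed pose, the cheapest feasible slack is just the penetration depth.

```python
import numpy as np

def minimal_slack(phi):
    """Smallest s_i >= 0 that satisfies the assumed constraint phi_i + s_i >= 0."""
    return np.maximum(0.0, -np.asarray(phi, dtype=float))

def slack_penalty(s, w_s):
    """Quadratic slack cost (w_s / 2) * sum_i s_i^2 from the final objective."""
    return 0.5 * w_s * float(np.sum(s ** 2))

def within_hard_bound(s, s_max):
    """True if every slack stays under the assumed hard penetration bound."""
    return bool(np.all(s <= s_max))

s_max = 0.003  # hypothetical hard bound, in meters
w_s = 1.0e4    # hypothetical slack weight

phi_noisy = np.array([0.010, -0.002, -0.0005])  # signed distances; negative = penetration
phi_deep = np.array([0.010, -0.002, -0.0100])

s_noisy = minimal_slack(phi_noisy)
s_deep = minimal_slack(phi_deep)

penalty_noisy = slack_penalty(s_noisy, w_s)
ok_noisy = within_hard_bound(s_noisy, s_max)  # small violations are tolerated at a cost
ok_deep = within_hard_bound(s_deep, s_max)    # a 1 cm penetration exceeds the bound
```

Under these assumptions the millimeter-scale violations in `phi_noisy` stay feasible and only add a small cost, while the deep penetration in `phi_deep` cannot be covered by any allowed slack, so the optimizer would have to move the hand.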
Where RL Fits
The RL part should be read as a tracking layer on top of the retargeted reference. TopoRetarget first produces \(q_t^{r,\ast}\) and object-aligned references; then a PPO controller learns to track them.
The policy uses residual joint-position control:
\[ q^{target}_t = q^{ref}_t + a_t . \]
The observation contains robot proprioception, object state, and current/lookahead reference information. The reward combines object tracking, hand-link tracking, joint tracking, and smoothness. Domain randomization is used for physical robustness.
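The residual control law is simple enough to sketch directly. The action scaling and joint-limit clipping below are my additions for illustration (they are common in residual policies but are not stated in these notes), and the joint values are made up.

```python
import numpy as np

def residual_target(q_ref, action, action_scale, q_low, q_high):
    """Joint-position target: retargeted reference plus a scaled policy residual."""
    q_target = np.asarray(q_ref) + action_scale * np.asarray(action)
    return np.clip(q_target, q_low, q_high)  # assumed safety clamp to joint limits

q_ref = np.array([0.20, 0.80, 1.50])  # reference joints from retargeting (made up)
q_low = np.zeros(3)
q_high = np.full(3, 1.57)

action = np.array([0.5, -1.0, 1.0])   # policy output, assumed to lie in [-1, 1]
q_target = residual_target(q_ref, action, 0.1, q_low, q_high)

# With a zero action the controller simply replays the retargeted reference.
q_zero = residual_target(q_ref, np.zeros(3), 0.1, q_low, q_high)
```

With these numbers the third joint would reach 1.60 and is clamped to 1.57. A zero action returns the reference unchanged, which is why a good retargeted trajectory gives the policy a useful starting behavior before any learning.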
This RL section matters because it clarifies the division of labor. Retargeting encodes the contact topology and generates a meaningful reference. RL handles dynamical tracking, residual correction, and robustness. The policy executes and corrects a topology-preserving reference, which reduces the burden of discovering the manipulation sequence from scratch.
Limitations
The most important limitation is source quality. TopoRetarget can preserve local hand-object relations that exist in the captured motion. If the source trajectory contains a virtual contact, where a finger is meant to interact with the object but fails to touch or approach the surface, the interaction mesh has no correct relation to preserve. This suggests that upstream contact completion or motion cleanup may be necessary for noisy egocentric or monocular data.
Another limitation is that Laplacian topology is a geometric proxy. It preserves local neighborhoods, relative directions, and object-relative structure, but it does not directly optimize contact force, friction cone feasibility, or force closure. The downstream RL controller can absorb part of this gap, yet the retargeted reference itself remains kinematic.
The current setting mainly targets single-hand rigid-object manipulation. Extending the same idea to bimanual manipulation, articulated objects, deformable objects, and full arm-hand-body retargeting would require richer graph construction and stronger physical constraints.
Takeaway
TopoRetarget’s main contribution is the interaction mesh plus topology-aware Laplacian objective. Relative bone-direction initialization gives a plausible hand shape. The shared hand-object graph gives both human and robot the same local topology. Laplacian refinement makes the robot preserve local interaction in place of absolute pose. Penetration constraints keep the result physically usable. RL then tracks the generated reference.
The key message for robot learning is: reference quality becomes policy quality. If the retargeted trajectory loses contact topology, the controller inherits a damaged objective. If the retargeted trajectory preserves local hand-object interaction, RL can spend its capacity on execution robustness instead of repairing the demonstration.
