[Paper Notes] DexEMG: Towards Dexterous Teleoperation System via EMG2Pose Generalization
Published:
TL;DR
DexEMG is a lightweight teleoperation system that uses a commodity sEMG wristband to control a 22-DOF dexterous robotic hand. Instead of bulky exoskeletons or line-of-sight-constrained cameras, the operator wears a simple armband that captures forearm muscle signals. A neural network (EMG2Pose) maps those signals to continuous hand joint angles in real time. The system generalizes to unseen objects and cluttered environments, and can handle multi-stage tasks like desktop packaging and table wiping.
Paper Info
- Title: DexEMG: Towards Dexterous Teleoperation System via EMG2Pose Generalization
- Authors: Qianyou Zhao, Wenqiao Li, Chiyu Wang, Kaifeng Zhang
- Affiliations: Sharpa, Shanghai Jiao Tong University
- arXiv: 2603.05861
- Paper type: teleoperation / dexterous manipulation / sEMG-based control
1. Problem and Motivation
High-fidelity teleoperation of dexterous hands is a prerequisite for deploying robots in unstructured domestic environments. The two dominant paradigms each have clear drawbacks:
- Exoskeletons (e.g., CyberGrasp, Dexmo): precise but bulky, expensive, and cause operator fatigue.
- Vision-based capture (e.g., Vicon, Leap Motion): either extremely expensive with strict environment requirements, or susceptible to self-occlusion when fingers are hidden by the palm or grasped objects.
sEMG is attractive because it reads neuromuscular signals directly from the forearm, is wearable and cheap, and is immune to visual occlusion. The key challenge is going from discrete gesture classification to continuous, high-dimensional pose estimation accurate enough for dexterous control.
2. Method
2.1 Data Collection via Kinematic Retargeting
The operator wears two devices simultaneously during data collection:
- An 8-channel gForce sEMG armband on the forearm.
- A Manus MoCap glove providing 35 skeletal keypoints as ground truth.
The captured human hand poses are retargeted to the 22-DOF Sharpa Wave hand via keypoint-based optimization:
\[
q^* = \arg\min_q \sum_i \left\| p_i^h - p_i^r(q) \right\|_2^2
\]
A collision classifier checks the optimized joint angles and clamps them to a safe manifold if self-collision is detected. The result is a paired dataset of sEMG streams and collision-free robot joint angles.
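The optimization above can be sketched with an off-the-shelf least-squares solver. This is a toy illustration, not the paper's implementation: the real system retargets 35 human keypoints to the 22-DOF Sharpa Wave hand, while here I assume a hypothetical 2-DOF planar finger with unit-length links so the forward kinematics fits in a few lines.

```python
import numpy as np
from scipy.optimize import least_squares

def fk(q):
    """Toy forward kinematics: 2 joint angles -> 2 stacked 2D keypoints."""
    j1 = np.array([np.cos(q[0]), np.sin(q[0])])                     # first link tip
    j2 = j1 + np.array([np.cos(q[0] + q[1]), np.sin(q[0] + q[1])])  # second link tip
    return np.concatenate([j1, j2])

def retarget(p_human, q0):
    """Minimize sum_i ||p_i^h - p_i^r(q)||^2 over joint angles q."""
    return least_squares(lambda q: fk(q) - p_human, q0).x

# Generate "human" keypoints from a known pose so recovery can be checked.
q_true = np.array([0.4, 0.6])
p_h = fk(q_true)
q_star = retarget(p_h, q0=np.zeros(2))
print(np.allclose(fk(q_star), p_h, atol=1e-5))
```

In the full pipeline the residual would run over all matched keypoints of the robot hand's kinematic chain, and the collision classifier would post-process `q_star` before it enters the dataset.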
2.2 EMG2Pose Architecture
The model follows an encoder-decoder design:
- Encoder: raw sEMG input of shape $(B, 8, T)$ goes through two Conv1d blocks and two Time-Depth Separable (TDS) stages (2D conv + layer norm + feedforward + layer norm).
- Decoder: an LSTM + MLP that predicts joint velocities $\dot{\theta}$ rather than absolute angles. Poses are reconstructed iteratively: $\theta_t = \theta_{t-1} + \dot{\theta}_t$, starting from a rest pose $\theta_0$.
The velocity-based approach decouples muscle activation intensity from static postures, reducing sensitivity to sensor displacement and signal drift during sustained grasping.
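A minimal PyTorch sketch of this encoder-decoder, under stated assumptions: the paper specifies the topology (two Conv1d blocks, two TDS stages, an LSTM + MLP velocity head) but not the widths or kernel sizes used here, which are guesses, and the TDS stage is simplified to a depthwise temporal convolution with the norm/feedforward structure described above.

```python
import torch
import torch.nn as nn

class TDSBlock(nn.Module):
    """Simplified time-depth separable stage: depthwise temporal conv,
    layer norm, position-wise feedforward, layer norm (with residuals)."""
    def __init__(self, dim, kernel=5):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.ln1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, x):                                 # x: (B, dim, T)
        x = self.ln1((x + self.conv(x)).transpose(1, 2))  # -> (B, T, dim)
        x = self.ln2(x + self.ffn(x))
        return x.transpose(1, 2)                          # -> (B, dim, T)

class EMG2Pose(nn.Module):
    def __init__(self, n_ch=8, dim=64, n_joints=22):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_ch, dim, 5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, 5, padding=2), nn.ReLU(),
            TDSBlock(dim), TDSBlock(dim),
        )
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, n_joints)   # predicts joint *velocities*

    def forward(self, emg, theta0):            # emg: (B, 8, T); theta0: (B, 22)
        feat = self.encoder(emg).transpose(1, 2)   # (B, T, dim)
        vel = self.head(self.lstm(feat)[0])        # (B, T, 22) velocities
        # Iterative reconstruction: theta_t = theta_{t-1} + vel_t,
        # starting from the rest pose theta0.
        return theta0.unsqueeze(1) + torch.cumsum(vel, dim=1)

model = EMG2Pose()
poses = model(torch.randn(2, 8, 100), torch.zeros(2, 22))
print(poses.shape)  # torch.Size([2, 100, 22])
```

Note how the velocity head makes drift robustness concrete: a constant bias in the sEMG features shifts the predicted velocities slightly, but a static posture still corresponds to near-zero velocity output rather than a wrong absolute angle.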
2.3 Deployment Pipeline
At deployment time the MoCap glove is removed. The operator wears only the sEMG armband and an HTC Vive Tracker for wrist tracking. The system runs inference on a sliding window of sEMG inputs, outputs action chunks of predicted joint angles, and executes the initial frames of each chunk for smooth, continuous control.
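The control loop above can be sketched as follows. The window, chunk, and execution-horizon sizes here (200 samples, 20 steps, 5 executed frames) are hypothetical, as is the `predict_chunk` stand-in for EMG2Pose inference; the paper does not report these numbers.

```python
from collections import deque
import numpy as np

WINDOW, CHUNK, EXECUTE = 200, 20, 5   # assumed sizes, not from the paper

def predict_chunk(window):
    """Stand-in for EMG2Pose inference: (8, WINDOW) sEMG -> (CHUNK, 22) angles."""
    return np.zeros((CHUNK, 22)) + window.mean()

buffer = deque(maxlen=WINDOW)          # sliding window over the sEMG stream
executed = []
for t in range(1000):                  # simulated stream of sEMG samples
    buffer.append(np.random.randn(8))  # one 8-channel sample
    if len(buffer) == WINDOW and t % EXECUTE == 0:
        window = np.stack(buffer, axis=1)     # (8, WINDOW)
        chunk = predict_chunk(window)         # (CHUNK, 22) joint angles
        executed.extend(chunk[:EXECUTE])      # run only the initial frames
print(len(executed))  # → 800
```

Executing only the first few frames of each chunk and then re-predicting is what keeps the motion smooth: consecutive chunks overlap, so small prediction errors are corrected before they accumulate.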
3. Experiments and Results
Pose Estimation Accuracy
- Grasp tasks: MAE of 0.09 rad.
- In-hand rotation tasks: MAE of 0.15 rad (more complex joint coupling and rapid transitions).
Generalization (Grasping)
Tested across 5 object categories (tiny, cylinder, sphere, irregular, deformable) with 20 trials each:
| Scenario | Overall Success Rate (SR) | Overall Drop Rate (DR) |
|---|---|---|
| Trained Objects | 76.0% | 14.5% |
| Unseen Objects | 66.0% | 18.2% |
| Novel Scenarios (cluttered) | 56.0% | 28.6% |
The performance drop on unseen objects is moderate, suggesting the model learns generalizable motor patterns rather than overfitting to specific geometries. In novel (cluttered) scenarios the degradation is attributed mainly to arm-level planning difficulty rather than EMG model failure.
Long-Horizon Tasks
| Task | One-shot SR | With-retry SR |
|---|---|---|
| Desktop Packaging | 60% | 80% |
| Table Wiping | 40% | 70% |
Wiping is harder because it requires sustained contact force; minor EMG drift causes the cloth to slip. With retry the system recovers well, indicating no irrecoverable failure states.
4. Strengths
- Lightweight and cheap: a commodity sEMG armband replaces expensive exoskeletons or multi-camera setups.
- Occlusion-immune: unlike vision-based systems, sEMG works even when fingers are hidden.
- Velocity-based decoding: mitigates signal drift during sustained grasping.
- Generalization: reasonable performance on unseen objects and cluttered environments without per-object recalibration.
- Practical retry behavior: the system does not enter irrecoverable states after a failure, making it suitable for scalable data collection.
5. Limitations
- Irregular and deformable objects remain challenging (25-50% SR in novel scenarios).
- sEMG currently lacks the signal-to-noise ratio for extreme tactile precision tasks.
- The system only controls the hand; arm-level motion still relies on separate wrist tracking (HTC Vive Tracker).
- A latency-stability trade-off exists: longer input windows smooth noise but reduce responsiveness.
- Still requires an initial calibration session (data collection with MoCap glove) though not per-object recalibration.
6. Takeaways
- sEMG is a viable and underexplored modality for dexterous teleoperation. The key insight is predicting joint velocities rather than absolute angles, which makes the system more robust to sensor drift.
- The approach offers a compelling cost-portability-performance balance for scalable data collection in unstructured environments. If you need to teleoperate dexterous hands for hours to collect demonstration data, wearing only a wristband is much more practical than an exoskeleton or being tethered to cameras.
- The main bottleneck is not the EMG decoding itself but the precision ceiling of sEMG signals for fine manipulation. Combining sEMG with complementary sensing (e.g., tactile feedback or sparse vision) could push performance further.
