[Paper Notes] Towards Human-Like Manipulation through RL-Augmented Teleoperation and Mixture-of-Dexterous-Experts VLA

Published: March 11, 2026

TL;DR

This paper pushes VLA-style robot control toward human-like bimanual dexterous manipulation. The core idea is to split the problem into two parts:

IMCopilot: RL-trained in-hand manipulation skills that both assist humans during teleoperation and act as callable low-level skills during inference
MoDE-VLA: a VLA architecture that injects force and tactile feedback into a pretrained backbone through a dedicated sparse-expert pathway

The result is a system that can handle harder contact-rich tasks such as gear assembly, charger plugging, tube rearranging, and apple peeling, with a clear gain over the baseline pi_0 backbone.

Paper Info

Title: Towards Human-Like Manipulation through RL-Augmented Teleoperation and Mixture-of-Dexterous-Experts VLA
Authors: Tutian Tang, Xingyu Ji, Wanli Xing, Ce Hao, Wenqiang Xu, Lin Shao, Cewu Lu, Qiaojun Yu, Jiangmiao Pang, Kaifeng Zhang
arXiv: 2603.08122
Topic: dexterous manipulation, multimodal VLA, force/tactile fusion, shared-autonomy teleoperation

1. Motivation

Most successful VLAs still operate in a relatively easy regime:

low-DoF grippers
visually guided actions
limited contact reasoning

This paper focuses on the harder setting of 63-DoF bimanual dexterous manipulation, where the robot must coordinate:

visual perception
language-conditioned task execution
arm torques and contact forces
fingertip tactile feedback
in-hand object rotation and grasp stabilization

The authors argue that current VLA pipelines break down here for three reasons:

Data collection is too hard: direct teleoperation of high-DoF bimanual systems is cognitively demanding.
One policy struggles to cover all skills: gross motion, insertion, peeling, and in-hand rotation are qualitatively different.
Force/tactile inputs are not easy to fuse: naive concatenation can hurt a pretrained VLA rather than help it.

2. Core Idea

The framework combines two complementary components:

2.1 IMCopilot

IMCopilot is a set of RL-trained atomic in-hand manipulation skills, especially:

stable grasp maintenance
in-hand object rotation around a target axis

It serves two roles:

during teleoperation: the operator controls gross motion and triggers IMCopilot through foot pedals for difficult in-hand phases
during inference: the high-level VLA emits a trigger signal, and IMCopilot takes over hand-level control when needed

This is a practical shared-autonomy design. Instead of forcing the human or the VLA to solve every dexterous subproblem directly, the system delegates the hardest contact-rich finger coordination to a specialized low-level controller.

2.2 MoDE-VLA

MoDE-VLA stands for Mixture-of-Dexterous-Experts VLA. It extends a pretrained VLA backbone with a modality-specific branch for:

force: arm joint torque readings
tactile: 6-axis wrench (force and torque) readings from ten fingertip sensors

The design has three important pieces:

Dedicated pathway for force and tactile tokens instead of naive state concatenation
Sparse MoE routing so different experts can specialize to different contact regimes
Residual injection so multimodal corrections refine the pretrained action prediction rather than overwrite it

This is the architectural part I found most convincing. The paper is not just saying “more sensors help”; it explains why these sensors should be routed differently because arm-level torques and fingertip contact patterns carry different physical meanings.
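
To make the routing-plus-residual idea concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: the linear router, linear experts, toy dimensions, and the choice to add the correction directly to the action are all assumptions, and the names `top1_residual`, `router_w`, and `experts` are hypothetical. Only the expert count (8) and top-1 routing come from the numbers quoted in Section 3.3.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_residual(base_action, token, router_w, experts):
    """Route one force/tactile token to its top-1 expert and add the
    gated expert output to the backbone action as a residual."""
    gates = softmax(matvec(router_w, token))
    expert_id = max(range(len(gates)), key=gates.__getitem__)
    delta = matvec(experts[expert_id], token)
    corrected = [a + gates[expert_id] * d for a, d in zip(base_action, delta)]
    return corrected, expert_id


def random_matrix(rng, rows, cols, scale):
    """Build a rows x cols matrix of Gaussian entries."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumptions); only NUM_EXPERTS = 8 comes from the notes.
TOKEN_DIM, ACTION_DIM, NUM_EXPERTS = 6, 4, 8
rng = random.Random(0)
router_w = random_matrix(rng, NUM_EXPERTS, TOKEN_DIM, 1.0)
experts = [random_matrix(rng, ACTION_DIM, TOKEN_DIM, 0.01) for _ in range(NUM_EXPERTS)]

base_action = [0.1, -0.2, 0.3, 0.0]      # stand-in for the backbone's prediction
token = [0.5, 0.0, -1.0, 0.2, 0.0, 0.1]  # stand-in for one 6-axis fingertip reading
corrected, expert_id = top1_residual(base_action, token, router_w, experts)
print(expert_id, corrected)
```

In this sketch, all-zero expert weights return the backbone action unchanged. That is the property that makes a residual path comparatively safe to attach to a pretrained model: the new branch can only refine the prediction, and it starts from "no change".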

3. Method Details

3.1 Platform and sensing

The robot platform includes:

dual 7-DoF arms
dual 22-DoF dexterous hands
fingertip tactile sensors on all ten fingers
stereo head cameras and wrist cameras

The data collection setup uses:

upper-body exoskeleton
exoskeleton gloves
VR headset
force/tactile visualization in VR
vibrotactile fingertip feedback

The operator therefore both sees and feels contact during collection, which makes this setup considerably richer than vision-only teleoperation.

3.2 RL training for IMCopilot

IMCopilot skills are trained in simulation with PPO and teacher-student distillation.

The policy observes:

short proprioceptive history
fingertip contact forces
target rotation axis

The reward encourages:

rotation around the desired axis
low unwanted linear motion
low torque and joint work
stable motion

The high-level takeaway is simple: the paper isolates in-hand dexterity as a reusable skill module instead of expecting the VLA to learn everything end-to-end from limited demonstrations.
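
The reward terms listed above can be sketched as a single per-step function. This is only an illustration under my own assumptions: the weights, the clipping value `spin_cap`, and the function name `rotation_reward` are hypothetical, and the off-axis penalty is just one possible reading of "stable motion". The paper's actual reward may be shaped differently.

```python
import math


def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def rotation_reward(ang_vel, lin_vel, torques, joint_vel, target_axis,
                    w_rot=1.0, w_lin=0.3, w_torque=0.01, w_work=0.01,
                    w_off=0.1, spin_cap=0.5):
    """Toy per-step reward for rotating an object about `target_axis`.

    All weights are illustrative assumptions, not values from the paper.
    """
    norm = math.sqrt(dot(target_axis, target_axis))
    axis = [x / norm for x in target_axis]
    spin = dot(ang_vel, axis)  # signed angular rate about the target axis
    off_axis = [w - spin * a for w, a in zip(ang_vel, axis)]
    terms = {
        # Reward rotation about the desired axis, clipped to avoid over-spinning.
        "rotation": w_rot * min(spin, spin_cap),
        # Penalize unwanted linear motion of the object.
        "linear": -w_lin * math.sqrt(dot(lin_vel, lin_vel)),
        # Penalize large joint torques.
        "torque": -w_torque * sum(t * t for t in torques),
        # Penalize mechanical work (|torque * joint velocity|).
        "work": -w_work * sum(abs(t * q) for t, q in zip(torques, joint_vel)),
        # Penalize rotation about any other axis (assumed "stability" term).
        "off_axis": -w_off * math.sqrt(dot(off_axis, off_axis)),
    }
    return sum(terms.values()), terms


torques, joint_vel, z_axis = [0.1] * 4, [0.2] * 4, [0.0, 0.0, 1.0]
clean, clean_terms = rotation_reward([0.0, 0.0, 0.4], [0.0, 0.0, 0.0],
                                     torques, joint_vel, z_axis)
drifting, _ = rotation_reward([0.0, 0.0, 0.4], [0.05, 0.0, 0.0],
                              torques, joint_vel, z_axis)
print(clean, drifting, clean_terms)
```

The same spin about the target axis scores lower once the object drifts or wobbles, which is the behavior the reward list describes.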

3.3 MoDE-VLA action generation

The base VLA is built on a pretrained pi_0-style flow-matching backbone. MoDE adds force and tactile tokens, lets them interact with the backbone through self-attention, routes them through sparse experts, and then generates residual corrections.

The paper uses:

E = 8 experts
top-k routing with k = 1
action horizon H = 50
N = 10 Euler denoising steps at inference
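
As a rough illustration of what "10 Euler denoising steps" over a 50-step action chunk means, here is a pure-Python sketch of forward Euler integration of a velocity field. The convention of integrating from tau = 0 (noise) to tau = 1 (actions) follows the published pi_0 description; that this paper keeps it is an assumption on my part. `toy_velocity` is a hypothetical stand-in for the learned network, and `ACTION_DIM = 3` is a toy value.

```python
import random


def euler_sample(velocity_fn, noise, num_steps=10):
    """Integrate d(chunk)/d(tau) = v(chunk, tau) from tau = 0 to tau = 1
    with forward Euler, starting from a noise chunk of shape [H][D]."""
    delta = 1.0 / num_steps
    chunk = [row[:] for row in noise]
    for step in range(num_steps):
        tau = step * delta
        velocity = velocity_fn(chunk, tau)
        chunk = [
            [a + delta * v for a, v in zip(a_row, v_row)]
            for a_row, v_row in zip(chunk, velocity)
        ]
    return chunk


HORIZON, ACTION_DIM = 50, 3  # H = 50 from the notes; ACTION_DIM is a toy value
rng = random.Random(0)
noise = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)] for _ in range(HORIZON)]
target = [[0.01 * t, 0.5, -0.25] for t in range(HORIZON)]


def toy_velocity(chunk, tau):
    """Stand-in for the learned field: points straight at `target`."""
    return [
        [(g - a) / (1.0 - tau) for a, g in zip(a_row, g_row)]
        for a_row, g_row in zip(chunk, target)
    ]


actions = euler_sample(toy_velocity, noise, num_steps=10)
max_error = max(
    abs(a - g)
    for a_row, g_row in zip(actions, target)
    for a, g in zip(a_row, g_row)
)
print(len(actions), max_error)
```

The toy field is a straight line from noise to target, so Euler integration recovers the target up to floating-point error. A learned field is not straight in general, which is why the number of steps becomes a speed and accuracy trade-off.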

The action vector contains:

arm actions
hand actions
other actions including waist motion
an IMCopilot trigger scalar

When the trigger is active, hand actions are delegated to IMCopilot.
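
The hand-off can be sketched as a small dispatch function. Everything about the layout here is an assumption for illustration: the ordering of the segments, joint-space sizes of 14 (arms) and 44 (hands) taken from the platform description, an "other" block of 5 values, and a 0.5 threshold on the trigger. `ActionLayout`, `dispatch`, and `placeholder_skill` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class ActionLayout:
    """Hypothetical layout of one action vector (all sizes are assumptions)."""
    arm_dim: int = 14    # 2 arms x 7 DoF
    hand_dim: int = 44   # 2 hands x 22 DoF
    other_dim: int = 5   # waist and other joints; count assumed

    @property
    def size(self) -> int:
        return self.arm_dim + self.hand_dim + self.other_dim + 1  # +1 trigger


def dispatch(
    action: Sequence[float],
    layout: ActionLayout,
    imcopilot: Callable[[], List[float]],
    threshold: float = 0.5,
) -> dict:
    """Split a VLA action and hand finger control to IMCopilot when triggered."""
    if len(action) != layout.size:
        raise ValueError(f"expected {layout.size} values, got {len(action)}")
    arm_end = layout.arm_dim
    hand_end = arm_end + layout.hand_dim
    other_end = hand_end + layout.other_dim
    triggered = action[other_end] > threshold
    # When triggered, the VLA's own hand prediction is discarded.
    hand = imcopilot() if triggered else list(action[arm_end:hand_end])
    return {
        "arm": list(action[:arm_end]),
        "hand": hand,
        "other": list(action[hand_end:other_end]),
        "imcopilot_active": triggered,
    }


layout = ActionLayout()


def placeholder_skill() -> List[float]:
    """Stand-in for an RL skill; returns fixed hand targets."""
    return [0.1] * layout.hand_dim


vla_action = [0.0] * (layout.size - 1) + [0.9]  # trigger above threshold
command = dispatch(vla_action, layout, placeholder_skill)
print(command["imcopilot_active"], command["hand"][:3])
```

Note that the arm and "other" segments still come from the VLA while the skill is active, so gross motion and in-hand control run side by side rather than one blocking the other.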

4. Experiments and Main Results

The paper evaluates four contact-rich tasks:

Apple Peeling
Tube Rearranging
Gear Assembling
Charger Plugging

All methods are evaluated over 20 trials per task.

4.1 Teleoperation benefits from force/tactile feedback

The paper reports that force/tactile feedback improves demonstration quality and collection efficiency. One example given is Gear Assembling:

without feedback: 100 trials in 75 minutes, 85 successful demonstrations
with feedback: 100 trials in 65 minutes, 93 successful demonstrations

That is a practical result: multimodal sensing helps before learning even starts.
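
A quick back-of-the-envelope calculation on the Gear Assembling numbers shows the effect on collection throughput. The figures are the ones quoted above; the helper name `collection_stats` is mine.

```python
def collection_stats(trials, minutes, successes):
    """Summarize one teleoperation session."""
    return {
        "success_rate": successes / trials,
        "demos_per_minute": successes / minutes,
        "minutes_per_demo": minutes / successes,
    }


without_feedback = collection_stats(trials=100, minutes=75, successes=85)
with_feedback = collection_stats(trials=100, minutes=65, successes=93)
speedup = with_feedback["demos_per_minute"] / without_feedback["demos_per_minute"]

for name, stats in [("without", without_feedback), ("with", with_feedback)]:
    print(name, {key: round(value, 3) for key, value in stats.items()})
print("relative yield:", round(speedup, 2))
```

That works out to roughly 1.13 versus 1.43 successful demonstrations per minute, or about 26% more usable data per unit of operator time.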

4.2 IMCopilot strongly improves in-hand rotation

For in-hand rotation, plain teleoperation succeeds far less often than teleoperation with IMCopilot (success rate, plain -> with IMCopilot):

Ping-pong ball: 10% -> 83%
Tennis ball: 67% -> 93%
Apple: 27% -> 90%
Overall: 34% -> 89%

This is one of the clearest findings in the paper. RL is not just a benchmark here; the RL-trained skills directly remove a bottleneck in data acquisition.

4.3 MoDE-VLA vs. baseline

Compared with the pretrained backbone pi_0, the proposed method improves average success rate from 15% to 34% across the four tasks.

Task-level results (baseline -> proposed method):

Apple Peeling: the baseline fails the task; the proposed method reaches a 30% success rate (SR) and a 73% peel completion ratio (PCR)
Tube Rearranging: 8% -> 30%
Gear Assembling: 40% -> 60%
Charger Plugging: 5% -> 15%

The absolute numbers are still modest, especially for the hardest tasks, but the direction is consistent: contact-aware sensing plus skill hierarchy helps.

4.4 Ablations

The ablations show that each component matters (the full system averages 34% SR):

without force: average SR drops to 23%
without tactile: average SR drops to 26%
without IMCopilot: apple peeling PCR drops from 73% to 25%

Interpretation:

force matters most for insertion and contact onset
tactile helps with slip-sensitive hand interactions
IMCopilot is crucial for the peel-and-rotate loop

5. Why This Paper Is Interesting

I think the strongest aspect of the paper is its systems framing.

A lot of VLA work assumes that scaling a single end-to-end policy is enough. This paper takes a different position:

use teleoperation, but augment it with autonomy
use a pretrained VLA, but refine it with modality-aware residual experts
use end-to-end action generation, but keep a specialized low-level controller for in-hand dexterity

That feels much closer to how capable robotic systems will likely be built in practice.

6. Limitations

A few limitations are worth keeping in mind:

the final success rates are still not high enough for robust deployment
evaluation covers only four tasks
the system depends on specialized hardware: dexterous hands, tactile sensors, exoskeletons, and VR teleoperation
IMCopilot currently focuses on a small set of atomic in-hand skills rather than a broad manipulation library

So this is better viewed as a strong research prototype than a general-purpose deployment recipe.

7. My Takeaways

Shared autonomy is a good data strategy for dexterous manipulation. Humans do not need to control every fine contact event manually.
Force and tactile signals should not be fused naively into a pretrained VLA. The modality-specific residual path is a sensible design.
Hierarchical skill invocation is probably necessary for long-horizon dexterous tasks such as peeling, tool use, and regrasping.
The paper is especially relevant if you care about the next step beyond simple gripper-based VLA benchmarks.

Lixin Xu
