[Paper Notes] T-Rex: Tactile-Reactive Dexterous Manipulation
TL;DR
T-Rex argues that tactile feedback should be treated as a high-frequency control signal for dexterous manipulation. In this framing, touch is part of the control loop: vision and language provide slow semantic planning, while touch provides local, fast correction when contact changes inside an action chunk.
The paper combines a tactile-synchronized bimanual robot dataset, a variable-rate Mixture-of-Transformer-Experts (MoT) policy, and a three-stage training recipe. Across 12 real-world contact-rich tasks, T-Rex reaches 65% average success, compared with 35% for EgoScale, the strongest baseline. A useful warning in the results is that π0.5 + tactile drops to 6%, below the 17% of π0.5 without tactile, which suggests that tactile signals help only when the architecture and training are aligned with them.
Paper Info
The paper is “T-Rex: Tactile-Reactive Dexterous Manipulation” by Dantong Niu, Zhuoyang Liu, Zekai Wang, Boning Shao, Zhao-Heng Yin, Anirudh Pai, Yuvan Sharma, Stefano Saravalle, Ruijie Zheng, Jing Wang, Ryan Punamiya, Mengda Xu, Yuqi Xie, Yunfan Jiang, Letian Fu, Konstantinos Kallidromitis, Matteo Gioia, Junyi Zhang, Jiaxin Ge, Haiwen Feng, Fabio Galasso, Wei Zhan, David M. Chan, Yutong Bai, Roei Herzig, Jiahui Lei, Fei-Fei Li, Ken Goldberg, Jitendra Malik, Pieter Abbeel, Yuke Zhu, Danfei Xu, Jim Fan, and Trevor Darrell.
It is available as arXiv:2606.17055. The project page is tactile-rex.github.io, the code is released at ZhuoyangLiu2005/T-Rex, and the dataset is on Hugging Face as zekaiwang/trex_dataset.
Core Argument
Many VLA policies can interpret instructions and visual context, but contact-rich dexterity often fails at a shorter time scale. Turning a page, extracting a card, squeezing toothpaste, handling an egg, or opening a lock requires quick reactions to force, slip, deformation, and contact geometry. Those signals are local and high-frequency, and the useful correction may need to happen before a slower visual policy replans.
T-Rex addresses this frequency mismatch directly. It keeps a slow visual-language-action pathway for task progress and adds a fast tactile pathway for contact-level refinement. The result is a policy whose control loop is shaped by the sensing modality: vision carries broad context; touch adjusts the action when the physical interaction changes.
Dataset and Training Recipe
The T-Rex Dataset is collected on a fixed-base Dexmate Vega-1 robot with two Sharpa Wave dexterous hands. The setup uses a head camera, two wrist cameras, five fingertip tactile sensors per hand, tactile force vectors, tactile deformation maps, Manus gloves, and VIVE trackers. The full dataset described in the paper contains 100 hours of teleoperation, 7700+ trajectories, 22 motor primitives, 200+ daily objects, and synchronized RGB, tactile, robot-state, action, and language streams. The public release currently contains about 50 hours and 5400+ trajectories in LeRobot v3.0 format.
The data is designed for more than task cloning. By covering elementary motor primitives and object interactions, it gives the model reusable contact-rich building blocks during robot mid-training. The training recipe has three stages: first, the latent and action experts inherit broad visuomotor priors from EgoScale-style pretraining on 22,889 hours of egocentric human video; second, robot mid-training on the T-Rex Dataset aligns those priors with bimanual actions and synchronized tactile feedback; third, task-specific post-training adapts the model with about 100 demonstrations per downstream task. The recipe suggests that tactile reactivity can be learned efficiently in a dedicated robot stage, after large-scale visual pretraining.
Variable-Rate MoT
The model uses a Mixture-of-Transformer-Experts policy with three experts:
| Expert | Role | Rate |
|---|---|---|
| Latent Expert | Future visual latent prediction | Low-rate |
| Action Expert | Low-frequency action denoising | About 5 Hz |
| Tactile Expert | High-frequency tactile refinement | About 20 Hz |
The action expert first produces an intermediate action chunk through flow matching, then the tactile expert refines it using fresh tactile observations. In the implementation, the action chunk length is 16 and denoising uses 10 Euler steps with a split at τ_split = 0.4: the action expert runs 6 slow steps and the tactile expert runs the remaining 4 fast refinement steps. Tactile updates are triggered at offsets {0, 4, 8, 12} inside the chunk, that is, every fourth action, so the model can react to new contact without rerunning the full vision-language stack.
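The numbers above fit together as a simple schedule, sketched below. This is an illustration, not the released implementation: the function names, the placeholder `slow_step` / `fast_step` / `read_tactile` callables, the time convention (flow time running from 1.0 at noise to 0.0 at the clean action), and the choice to restart each refinement from the same intermediate chunk are all assumptions made for the example.

```python
CHUNK_LEN = 16      # actions per chunk
NUM_STEPS = 10      # total Euler denoising steps
TAU_SPLIT = 0.4     # flow time where the tactile expert takes over
STRIDE = 4          # spacing of tactile updates, read off {0, 4, 8, 12}


def split_steps(num_steps=NUM_STEPS, tau_split=TAU_SPLIT):
    """Return (slow_steps, fast_steps).

    Assumption: flow time runs from 1.0 (noise) to 0.0 (clean action),
    so the slow action expert covers the interval [tau_split, 1.0].
    """
    slow = round((1.0 - tau_split) * num_steps)
    return slow, num_steps - slow


def tactile_offsets(chunk_len=CHUNK_LEN, stride=STRIDE):
    """Chunk positions at which a fresh tactile reading triggers refinement."""
    return list(range(0, chunk_len, stride))


def plan_chunk(slow_step, fast_step, read_tactile, noise):
    """Toy control flow for one chunk; all callables are hypothetical."""
    slow, fast = split_steps()
    dt = 1.0 / NUM_STEPS

    # Slow pathway: run once per chunk (this is where vision/language would be used).
    x = noise
    for _ in range(slow):
        x = slow_step(x, dt)

    # Fast pathway: rerun only the last few steps whenever new tactile data arrives.
    refined = []
    for offset in tactile_offsets():
        tactile = read_tactile(offset)
        y = x  # assumption: each refinement restarts from the intermediate chunk
        for _ in range(fast):
            y = fast_step(y, tactile, dt)
        refined.append((offset, y))
    return refined
```

With the stated settings, `split_steps()` returns `(6, 4)` and `tactile_offsets()` returns `[0, 4, 8, 12]`, matching the 6 slow steps, 4 fast steps, and four tactile updates per chunk described above.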
The tactile encoder is tailored to each signal type. For fingertip force, a per-finger VQ-VAE compresses recent 6D force/torque history into temporal tokens while preserving the current force vector for instantaneous contact. For deformation, a convolutional encoder processes tactile maps. The final tactile tokens combine temporal force, current force, and spatial deformation features, allowing the policy to distinguish events such as force spikes, gradual slip, and local surface deformation.
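To make the force branch concrete, here is a minimal NumPy sketch of the vector-quantization idea for a single finger. It is a toy under stated assumptions: the mean-pooling "encoder", the number of temporal tokens, and the function names are placeholders invented for illustration, whereas the paper uses a learned VQ-VAE. The convolutional deformation branch is omitted.

```python
import numpy as np


def quantize(latents, codebook):
    """Nearest-neighbour codebook lookup, the core step of vector quantization.

    latents: (N, D) array, codebook: (K, D) array. Returns (N,) code indices.
    """
    dists = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)


def force_tokens(history, codebook, num_tokens=4):
    """Build tokens for one finger from a (T, 6) force/torque history.

    Placeholder encoder (assumption): split the window into segments and
    mean-pool each one. A real VQ-VAE would use a learned encoder here.
    """
    segments = np.array_split(history, num_tokens, axis=0)
    latents = np.stack([seg.mean(axis=0) for seg in segments])
    ids = quantize(latents, codebook)

    temporal = codebook[ids]   # (num_tokens, 6) quantized history tokens
    current = history[-1:]     # (1, 6) latest reading, kept unquantized
    return ids, np.concatenate([temporal, current], axis=0)
```

Keeping the latest reading unquantized next to the quantized history mirrors the point made above: the history tokens summarize how the force has evolved, while the current vector preserves the instantaneous contact state.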
Empirical Evidence
The benchmark contains 12 real-world tactile-reactive tasks: Flip Page, Transfer Egg, Wipe Plate, Apply Toothpaste, Split Cup, Sort Mahjong, Open Lock, Refill Tablet, Acid-Base Neutralization, Extract Card, Deal Poker, and Screw Lightbulb. Each task is evaluated with 16 rollouts under randomized object poses, and multi-stage tasks use progress-based scoring.
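As a rough illustration of how progress-based scoring over 16 rollouts can produce a per-task percentage, the sketch below gives each rollout partial credit equal to the fraction of stages it completed. The paper's exact rubric is not reproduced in these notes, so the equal-weight stages and the function names are assumptions.

```python
def rollout_score(stages_completed, total_stages):
    """Partial credit for one rollout (assumption: stages are equally weighted)."""
    return stages_completed / total_stages


def task_success(stages_per_rollout, total_stages):
    """Average progress over all rollouts of one task."""
    scores = [rollout_score(s, total_stages) for s in stages_per_rollout]
    return sum(scores) / len(scores)


# Hypothetical 4-stage task with 16 rollouts:
# 8 finish every stage, 4 stop halfway, 4 fail at the first stage.
example_rollouts = [4] * 8 + [2] * 4 + [0] * 4
example_success = task_success(example_rollouts, total_stages=4)  # 0.625
```

For a single-stage task the same function reduces to the plain success rate, since each rollout scores either 0 or 1.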
Average success across the 12 tasks:
| Method | Average Success |
|---|---|
| ViTacFormer | 3% |
| RDP | 6% |
| Tactile-VLA | 15% |
| EgoScale | 35% |
| π0.5 | 17% |
| π0.5 + tactile | 6% |
| T-Rex | 65% |
The main result supports the paper’s central claim: tactile feedback is most useful when the policy can react with a separate fast pathway. The tactile ablation tells the same story:
| Configuration | Average |
|---|---|
| Full T-Rex | 65% |
| w/o Tactile | 42% |
| MLP Force + Deform | 58% |
| Deform only | 54% |
| MLP Force + VQ-VAE Force | 59% |
| w/o Async | 60% |
Removing tactile drops success from 65% to 42%. Force and deformation both help, the temporal VQ-VAE improves force modeling, and asynchronous refinement adds a smaller but still meaningful gain (60% without it, 65% with it). The training ablation further supports the recipe:
| Recipe | Average |
|---|---|
| No pretraining, no mid-training | 18% |
| Pretraining only | 34% |
| Mid-training only | 45% |
| Full recipe | 65% |
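The two ablation tables are easier to compare as percentage-point drops from the full system. The short script below only redoes that arithmetic on the numbers reported above; the variable and function names are illustrative.

```python
FULL = 65  # average success of full T-Rex, in percent

tactile_ablation = {
    "w/o Tactile": 42,
    "MLP Force + Deform": 58,
    "Deform only": 54,
    "MLP Force + VQ-VAE Force": 59,
    "w/o Async": 60,
}

recipe_ablation = {
    "No pretraining, no mid-training": 18,
    "Pretraining only": 34,
    "Mid-training only": 45,
}


def drops(full, variants):
    """Percentage-point drop of each variant relative to the full system."""
    return {name: full - score for name, score in variants.items()}


tactile_drops = drops(FULL, tactile_ablation)
recipe_drops = drops(FULL, recipe_ablation)

# Gains of each training stage over the no-pretraining, no-mid-training baseline.
baseline = recipe_ablation["No pretraining, no mid-training"]
pretrain_gain = recipe_ablation["Pretraining only"] - baseline   # 16 points
midtrain_gain = recipe_ablation["Mid-training only"] - baseline  # 27 points
full_gain = FULL - baseline                                      # 47 points
```

Removing tactile input costs 23 points, the largest drop in the tactile table, while removing asynchronous refinement costs 5. On the training side, the full recipe gains 47 points over the baseline, slightly more than the 43 points obtained by adding the two single-stage gains, although a gap that small may be within evaluation noise.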
Each stage helps on its own (34% with pretraining only and 45% with mid-training only, against 18% with neither), but the full system reaches 65% only by combining broad human-video priors with tactile-grounded robot mid-training. The code release reflects this split: the main branch includes post-training, inference, dataset quickstart, tactile VQ-VAE tools, and robot-side code, while pretraining and mid-training scripts are provided in the full-pipeline branch with released checkpoints.
Limitations and Takeaway
The paper notes that long-horizon tasks with tight contact tolerances remain difficult to teleoperate and learn from demonstrations alone. Reinforcement learning or online interaction-based refinement may be needed for those cases. It also highlights hardware constraints: tactile sensor distortion, calibration drift, cross-device variation, and the lack of dense palm sensing make tactile foundation policies harder to scale. A practical adoption issue is that T-Rex depends on a rich dexterous platform with fingertip tactile sensing, so broader use will depend on tactile hardware becoming more common and standardized.
The clear takeaway is that tactile feedback changes the control problem. For dexterous manipulation, the policy needs slow vision-language planning plus fast tactile-reactive refinement. T-Rex is valuable because it turns that principle into a dataset, architecture, training recipe, and real-world benchmark result.
