[Paper Notes] RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation
TL;DR
RGB-S makes tactile sensing look like something a standard visual encoder can already use: an image-space saliency map. Instead of feeding touch as a robot-centric vector and asking the policy to discover where each taxel belongs in the camera view, the method uses forward kinematics and camera calibration to project tactile sensor locations into the RGB image plane. Contact force is then rendered as a Gaussian saliency channel and concatenated with RGB:
\[X_t = \mathrm{Concat}(I_t, S_t) \in \mathbb{R}^{H \times W \times 4}\]
Here \(I_t\) is the RGB image and \(S_t\) is the tactile saliency map. The fourth channel is added to a pretrained ResNet-18 with zero initialization, so the encoder begins as an ordinary RGB encoder and gradually learns how projected touch should affect the visual representation. The central result appears under occlusion: in real-world dexterous manipulation, RGB-S reaches 51.7% average success, compared with 25.0% for the strongest implicit visuo-tactile baseline, a gain of 26.7 percentage points.
Paper Info
The paper is “RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation” by Shengcheng Luo, Kefei Wu, Xiaoying Zhou, Wanlin Li, Ziyuan Jiao, and Chenxi Xiao.
It appears on arXiv as arXiv:2606.08765, with v2 dated June 11, 2026. The project page is touch-as-saliency.github.io.
Core Argument
Dexterous manipulation needs both broad scene context and direct evidence of contact. Vision provides the first, and pretrained image encoders make it cheap to reuse. Touch provides the second, especially when the hand or object hides task-relevant pixels. The hard part is alignment: RGB images live in a dense 2D coordinate system, while tactile readings are sparse, low-dimensional, and tied to a robot hand. If a policy only receives tactile features through concatenation, FiLM, or attention, it must learn the taxel-to-image correspondence from demonstrations.
RGB-S inserts a geometric prior before learning begins. Given robot proprioception, tactile sensor offsets, camera intrinsics, and camera extrinsics, each tactile node can be projected into the image. A contact becomes a visual cue near the image location where interaction is happening. Under occlusion, this matters because the tactile map can still mark where the robot is touching even when the corresponding RGB region is masked or unreliable.
Force-Aware Kinematic Projection
The method starts with tactile readings:
\[f_t = \{f_{i,t}\}_{i=1}^{M}\]
where each \(f_{i,t}\) is a scalar force magnitude or contact intensity from tactile sensor node \(i\).
For each tactile node, the 3D world position is computed with forward kinematics:
\[P_{i,t} = \mathrm{FK}(s_t, L_i)\]
Here, \(s_t\) is robot proprioception, and \(L_i\) is the fixed local offset of the tactile sensor relative to its attached robot link.
Then, for camera view (c), the node is projected into the image plane:
\[[u^c_{i,t}, v^c_{i,t}, 1]^\top \sim K_c(R_cP_{i,t}+t_c)\]
where \(K_c\) is the camera intrinsic matrix, and \(R_c, t_c\) are the extrinsics. Nodes that project outside the image bounds are discarded.
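The projection step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the world-frame node positions \(P_{i,t}\) have already been produced by a robot-specific forward-kinematics routine, and the function name `project_nodes`, the example intrinsics, and the extra "in front of the camera" check are all assumptions added here.

```python
import numpy as np

def project_nodes(P_world, K, R, t, width, height):
    """Project 3D tactile node positions (N, 3) into pixel coordinates.

    Returns (uv, valid): uv is (N, 2) pixel coordinates, valid is a boolean
    mask for nodes in front of the camera and inside the image bounds.
    """
    P_cam = P_world @ R.T + t              # world frame -> camera frame
    z = P_cam[:, 2]
    in_front = z > 1e-6                    # assumption: drop nodes behind the camera
    safe_z = np.where(in_front, z, 1.0)    # avoid dividing by zero or negative depth
    pix = P_cam @ K.T                      # homogeneous pixel coordinates
    uv = pix[:, :2] / safe_z[:, None]      # perspective divide
    in_bounds = (
        (uv[:, 0] >= 0) & (uv[:, 0] < width)
        & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    )
    return uv, in_front & in_bounds

# Hypothetical camera: 640x480 image, identity extrinsics.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)
P_world = np.array([
    [0.0, 0.0, 1.0],     # on the optical axis
    [0.1, -0.05, 0.5],   # inside the image
    [2.0, 0.0, 1.0],     # projects outside the image
    [0.0, 0.0, -1.0],    # behind the camera
])
uv, valid = project_nodes(P_world, K, R, t, width=640, height=480)
```

Only the rows of `uv` where `valid` is true would be passed on to the saliency renderer.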
The projected sparse contacts are rendered as a dense saliency map:
\[S^c_t(u,v) = \max_{i \in V^c_t} \tilde{f}_{i,t} \exp \left( -\frac{(u-u^c_{i,t})^2 + (v-v^c_{i,t})^2}{2\sigma^2} \right)\]
where \(V^c_t\) is the set of nodes that remain inside the image bounds of view \(c\). The force is normalized with:
\[\tilde{f}_{i,t} = \tanh(\gamma f_{i,t}/F^i_{limit})\]
The Gaussian absorbs uncertainty from calibration, kinematics, and contact localization. Force modulation preserves more information than a binary contact map, and max aggregation keeps the saliency map bounded when multiple contacts overlap.
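The rendering and normalization formulas above can be written out as a short NumPy sketch. The values of `sigma` and `gamma`, the image size, and the function name `render_saliency` are illustrative assumptions, not settings from the paper; forces are assumed to be non-negative.

```python
import numpy as np

def render_saliency(uv, forces, force_limits, height, width, sigma=8.0, gamma=2.0):
    """Render projected contacts (N, 2) as an (H, W) saliency map in [0, 1)."""
    saliency = np.zeros((height, width))
    if len(uv) == 0:
        return saliency
    # tanh squashes the normalized force into [0, 1) for non-negative forces.
    f_norm = np.tanh(gamma * np.asarray(forces) / np.asarray(force_limits))
    vs, us = np.mgrid[0:height, 0:width]        # rows index v, columns index u
    for (u_i, v_i), f_i in zip(uv, f_norm):
        d2 = (us - u_i) ** 2 + (vs - v_i) ** 2
        blob = f_i * np.exp(-d2 / (2.0 * sigma ** 2))
        saliency = np.maximum(saliency, blob)   # max aggregation, not a sum
    return saliency

# Two nearby hypothetical contacts with different forces.
uv = np.array([[20.0, 10.0], [22.0, 10.0]])
S = render_saliency(uv, forces=[1.0, 0.5], force_limits=[2.0, 2.0],
                    height=32, width=48)

# Stack with an RGB image to form the 4-channel RGB-S input.
rgb = np.zeros((32, 48, 3))
X = np.concatenate([rgb, S[..., None]], axis=-1)
```

Because overlapping blobs are combined with a maximum, the peak value never exceeds the largest normalized force, which is what keeps the channel bounded.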
Network Architecture
After building the RGB-S input, the policy can use a standard visual stack. The ResNet-18 first convolution is expanded from three input channels to four:
\[z^c_t = W_{rgb} * I^c_t + W_s * S^c_t\]
The RGB weights \(W_{rgb}\) come from pretrained ResNet-18, while the tactile weights \(W_s\) start at zero. This makes initialization conservative: the model initially behaves like the original RGB encoder, then fine-tuning learns how the saliency channel should modulate perception.

The feature map is compressed with SpatialSoftmax into 32 keypoints, giving a 64-dimensional feature per camera view; multi-view features are concatenated with proprioception and passed to downstream policies. The same RGB-S representation is evaluated with Behavior Cloning MLP, ACT, and Diffusion Policy, so the contribution centers on image-aligned tactile representation across policy families.
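A small NumPy sketch can make two of these claims concrete: a zero-initialized fourth input channel leaves the first-layer output unchanged, and 32 SpatialSoftmax keypoints yield a 64-dimensional vector. Everything here is a simplified stand-in: random weights replace the pretrained ResNet-18 filters, the convolution ignores stride and padding, the 32-channel feature map is assumed, and the \([-1, 1]\) coordinate convention is a common choice rather than something the post specifies.

```python
import numpy as np

def expand_first_conv(w_rgb):
    """Append a zero-initialized saliency channel to (out, 3, k, k) weights."""
    out_c, _, kh, kw = w_rgb.shape
    w_s = np.zeros((out_c, 1, kh, kw), dtype=w_rgb.dtype)
    return np.concatenate([w_rgb, w_s], axis=1)       # (out, 4, k, k)

def conv2d_valid(x, w):
    """Naive 'valid' cross-correlation; x is (C, H, W), w is (O, C, k, k)."""
    o, _, kh, kw = w.shape
    _, h, wd = x.shape
    out = np.zeros((o, h - kh + 1, wd - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + kh, j:j + kw]
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def spatial_softmax(feat):
    """Map (C, H, W) features to C expected (x, y) keypoints, flattened to 2C."""
    c, h, w = feat.shape
    flat = feat.reshape(c, -1)
    flat = flat - flat.max(axis=1, keepdims=True)     # numerical stability
    p = np.exp(flat)
    p = (p / p.sum(axis=1, keepdims=True)).reshape(c, h, w)
    xs = np.linspace(-1.0, 1.0, w)
    ys = np.linspace(-1.0, 1.0, h)
    ex = (p.sum(axis=1) * xs).sum(axis=1)             # expected x per channel
    ey = (p.sum(axis=2) * ys).sum(axis=1)             # expected y per channel
    return np.stack([ex, ey], axis=1).reshape(-1)

rng = np.random.default_rng(0)
w_rgb = rng.standard_normal((8, 3, 7, 7))             # stand-in for pretrained filters
w_rgbs = expand_first_conv(w_rgb)
rgb = rng.random((3, 16, 16))
sal = rng.random((1, 16, 16))

z_rgb = conv2d_valid(rgb, w_rgb)
z_rgbs = conv2d_valid(np.concatenate([rgb, sal], axis=0), w_rgbs)
same_at_init = np.allclose(z_rgb, z_rgbs)             # saliency has no effect yet

keypoints = spatial_softmax(rng.standard_normal((32, 5, 5)))
```

Once fine-tuning moves the saliency weights away from zero, the two outputs diverge, which is the point at which touch starts to modulate the visual features.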
Experiments
The simulation suite covers pick-and-place, cube-push, and rotate-cross. Policies are trained with normal vision and evaluated under both normal and occluded views, where black masks cover task-relevant image regions only at test time. Tactile readings come from an ETac-based simulator, and RGB-S maps are rendered for each camera view.
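The test-time occlusion protocol can be pictured with a tiny sketch. It assumes a rectangular black mask applied to the RGB channels only, with the saliency channel left intact because it is rendered from kinematics and touch rather than from pixels; the mask shape, the box coordinates, and the function name `occlude_rgb` are assumptions for illustration.

```python
import numpy as np

def occlude_rgb(x_rgbs, box):
    """Black out an RGB rectangle of an (H, W, 4) RGB-S input, keep saliency."""
    top, left, bottom, right = box
    out = x_rgbs.copy()
    out[top:bottom, left:right, :3] = 0.0   # mask RGB, leave channel 3 untouched
    return out

x = np.random.default_rng(0).random((48, 64, 4))
x_occ = occlude_rgb(x, box=(10, 20, 30, 50))
```

A vision-only policy sees nothing inside the box, while an RGB-S policy still receives any contact blobs that fall there.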
Across policy families and tasks, RGB-S is usually best or second-best. For Diffusion Policy, the success rates averaged over normal and occluded views are:
| Fusion | Pick-and-Place Avg (%) | Cube-Push Avg (%) | Rotate-Cross Avg (%) |
|---|---|---|---|
| Vision-only | 39.7 | 60.9 | 52.0 |
| Concat | 43.8 | 57.5 | 59.0 |
| FiLM | 42.6 | 66.7 | 53.0 |
| CLiP-style | 43.4 | 64.2 | 56.0 |
| Cross-Attn | 38.4 | 59.2 | 61.0 |
| RGB-S | 59.1 | 68.3 | 69.0 |
The improvement is most visible when RGB is degraded. For Diffusion Policy on pick-and-place, RGB-S reaches 39.7% success under occlusion while vision-only reaches 7.4%. On rotate-cross, RGB-S reaches 50.0% under occlusion, ahead of vision-only at 26.0%. The pattern is important: extra tactile input alone is not enough, since implicit fusion methods improve some settings and hurt others. RGB-S makes touch spatial before policy learning, so the visual encoder receives contact as a location-aware cue.
The real-world platform uses an xArm6 with a LEAP Hand, four fingertip TwinTac sensors, twelve FSR sensors, and 44 projected tactile nodes. Visual input comes from two calibrated RealSense D435 cameras, with EasyHEC used for camera extrinsics. Observations include proprioception, two RGB views, and tactile readings; actions are 22-dimensional targets for the arm and hand. Demonstrations are collected through VR teleoperation with Meta Quest 3 and Manus Quantum Metaglove. On pick-and-place, open-drawer, and flip-box, Diffusion Policy gives the following real-world results:
| Method | Normal Avg (%) | Occluded Avg (%) |
|---|---|---|
| Vision-only | 56.7 | 10.0 |
| Concat | 55.0 | 13.3 |
| Cross-Attn | 30.0 | 25.0 |
| RGB-S | 66.7 | 51.7 |
The per-task occluded results (successes out of 20 trials) show the same picture:
| Method | Pick & Place | Open Drawer | Flip Box |
|---|---|---|---|
| Vision-only | 0/20 | 4/20 | 2/20 |
| Concat | 1/20 | 6/20 | 1/20 |
| Cross-Attn | 0/20 | 4/20 | 11/20 |
| RGB-S | 7/20 | 10/20 | 14/20 |
RGB-S improves occluded real-world average success by 26.7 percentage points over Cross-Attn, the strongest implicit baseline in this table. This is the central result: explicit image-space grounding of touch helps most when vision loses the task-relevant region.
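As a quick arithmetic check, the occluded averages and the headline gain can be recomputed from the per-task counts in the table above. The dictionary layout and variable names are just for illustration.

```python
# Successes out of 20 trials on pick-and-place, open-drawer, flip-box (occluded).
occluded_successes = {
    "Vision-only": [0, 4, 2],
    "Concat": [1, 6, 1],
    "Cross-Attn": [0, 4, 11],
    "RGB-S": [7, 10, 14],
}
TRIALS_PER_TASK = 20

avg = {
    method: 100.0 * sum(counts) / (TRIALS_PER_TASK * len(counts))
    for method, counts in occluded_successes.items()
}
gain = avg["RGB-S"] - avg["Cross-Attn"]

for method, value in avg.items():
    print(f"{method}: {value:.1f}%")
print(f"RGB-S gain over Cross-Attn: {gain:.1f} percentage points")
```

RGB-S succeeds in 31 of 60 occluded trials against 15 of 60 for Cross-Attn, which reproduces the 51.7% and 25.0% averages and the 26.7-point gap.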
Ablations
The ablations reinforce the same message. Under Diffusion Policy on pick-and-place, rendering contact as a force-aware saliency map beats both binary contact and RGB overlay:
| Variant | Normal (%) | Occluded (%) | Average (%) |
|---|---|---|---|
| Vision-only | 71.9 | 7.4 | 39.7 |
| RGB Overlay | 65.3 | 33.1 | 49.2 |
| Binary RGB-S | 65.3 | 27.3 | 46.3 |
| Force-aware RGB-S | 78.5 | 39.7 | 59.1 |
Binary RGB-S already shows the value of contact location, while force-aware RGB-S adds interaction strength. Spatial alignment matters most under occlusion: with a 25 px random tactile-map offset, simulated occluded success drops from 39.7% to 32.2%; at 100 px, it falls to 9.9%. Normal vision is more tolerant because RGB still carries usable object information. Architecture also matters, with early zero-initialized fusion performing best:
| Architecture | Normal (%) | Occluded (%) |
|---|---|---|
| Late fusion | 73.6 | 35.5 |
| Intermediate fusion | 73.6 | 22.3 |
| Early RGB-S | 78.5 | 39.7 |
Injecting saliency at the first visual layer lets tactile information flow through the full visual hierarchy while keeping initialization stable.
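The spatial-alignment ablation described earlier in this section (random tactile-map offsets of 25 px and 100 px) can be mimicked by shifting the projected node centers before rendering. How the paper samples the offset is not stated in these notes, so the scheme below, one random direction per frame with a fixed pixel magnitude, is purely an assumption, as is the function name `offset_projections`.

```python
import numpy as np

def offset_projections(uv, magnitude_px, rng):
    """Shift all projected nodes (N, 2) by one random offset of fixed length."""
    theta = rng.uniform(0.0, 2.0 * np.pi)                   # random direction
    shift = magnitude_px * np.array([np.cos(theta), np.sin(theta)])
    return uv + shift

rng = np.random.default_rng(0)
uv = np.array([[100.0, 80.0], [140.0, 90.0]])
uv_shifted = offset_projections(uv, magnitude_px=25.0, rng=rng)
displacement = np.linalg.norm(uv_shifted - uv, axis=1)      # 25 px for every node
```

Rendering saliency from `uv_shifted` instead of `uv` yields a map that is internally consistent but misregistered against the RGB image, which is the failure mode the ablation probes.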
Efficiency
RGB-S is computationally lightweight in the real-time diffusion policy pipeline:
| Model | Pre-denoising latency | Overall time |
|---|---|---|
| Vision-only | 10.10 ms | 74.36 ms |
| Cross-Attn | 15.13 ms | 79.69 ms |
| Point Cloud | 95.12 ms | 171.84 ms |
| RGB-S | 21.06 ms | 85.30 ms |
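The saliency generation itself takes only 6.14 ms on average. RGB-S runs about twice as fast as the explicit 3D point-cloud branch (85.30 ms vs. 171.84 ms overall) and adds roughly 11 ms to the vision-only policy (85.30 ms vs. 74.36 ms), so it stays close to the speed profile of a standard 2D visual policy.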
Strengths
The strength of RGB-S is its restraint. It uses known geometry to make tactile-image correspondence explicit, then lets an ordinary visual encoder process the result. This is interpretable, cheap, and compatible with existing robot policy stacks. The experiments are also well scoped: the paper tests multiple policy classes in simulation, deploys on real dexterous hardware, and checks rendering, alignment, architecture, and efficiency.
Limitations
RGB-S depends on calibration and kinematic accuracy. Camera extrinsics, joint backlash, link deformation, sensor placement, and contact-induced compliance can all shift the projected tactile map.
The representation is 2D. Contacts on the far side of an object can project onto similar image regions as front-side contacts, creating depth ambiguity. The paper reports that proprioception and multi-view observations help, though the ambiguity remains.
The method assumes tactile sensor locations are known. Soft skins, uncalibrated tactile arrays, or sensors that deform substantially may require learnable offsets or online calibration.
The experiments focus on manipulation tasks where image-space contact anchors are useful. Tasks requiring fine force control, slip dynamics, or detailed tactile texture may need richer tactile representations.
Takeaways
The takeaway is simple: tactile fusion becomes much easier when touch arrives with a spatial prior. RGB-S projects taxels into the camera image, renders force-aware Gaussian saliency, concatenates RGB and saliency as a 4-channel input, and zero-initializes the new channel so pretrained visual features remain intact. Occlusion is the right stress test because it exposes whether the policy can use touch when pixels disappear.
Image-Aligned Tactile Fusion / Dexterous Imitation Learning / Occlusion-Robust Manipulation
The broader reusable idea is that tactile learning does not always need a separate tactile foundation model. A good geometric adapter can make sparse touch compatible with existing visual representations, especially when contact is the clue that survives occlusion.
