Tactile sensing is essential for contact-rich manipulation tasks. It provides direct feedback on object geometry, surface properties, and interaction forces, enhancing perception and enabling fine-grained control. An inherent limitation of tactile sensors is that readings are available only when an object is touched. This precludes their use during planning and the initial execution phase of a task. Predicting tactile readings from visual observations can bridge this gap. A common approach is to learn a direct mapping from camera images to the output of vision-based tactile sensors. However, the resulting model will depend strongly on the specific setup and on how well the camera can capture the area where an object is touched. In this work, we introduce FlowTouch, a novel model for view-invariant visuo-tactile prediction. Our key idea is to use an object's local 3D mesh to encode rich information for predicting tactile patterns while abstracting away scene-dependent details. FlowTouch integrates scene reconstruction and Flow Matching-based models for image generation. Our results show that FlowTouch is able to bridge the sim-to-real gap and generalize to new sensor instances. We further show that the resulting tactile images can be used for downstream grasp stability prediction.
The FlowTouch architecture consists of two main components: an image-to-point-cloud (PCN) sampling pipeline (left) and a generative model (right). First, we align the object mesh with the robot coordinate frame to compute the contact pose. We then sample a local 3D point cloud around the contact point to extract geometric context. This geometric prior, together with the sensor's static background image, conditions a Conditional Flow Matching model that predicts the static tactile response while naturally abstracting away visual distractors.
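The two stages above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the contact radius, the fixed cloud size, and the linear-interpolant Conditional Flow Matching schedule are all illustrative assumptions; only the input/output structure (local centered point cloud as geometric context, a velocity-regression target for the generator) follows the description.

```python
import math
import random

def local_point_cloud(mesh_points, contact, radius=0.01, n_points=256):
    """Hypothetical geometry-context extraction: keep mesh points within
    `radius` of the contact point and re-center them on it, so the context
    is independent of where the object sits in the scene."""
    nearby = [p for p in mesh_points if math.dist(p, contact) <= radius]
    if not nearby:
        raise ValueError("no mesh points near the contact pose")
    centered = [tuple(pi - ci for pi, ci in zip(p, contact)) for p in nearby]
    # Pad to a fixed size for batching (repeat-sampling; a real pipeline
    # might use farthest-point sampling instead).
    while len(centered) < n_points:
        centered.append(random.choice(centered))
    return centered[:n_points]

def cfm_training_target(x0, x1, t):
    """Standard linear-interpolant Conditional Flow Matching:
    x_t = (1 - t) * x0 + t * x1, with target velocity u_t = x1 - x0.
    A network v_theta(x_t, t, condition) regresses u_t, where the
    condition would be the point cloud plus the background image."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - a for a, b in zip(x0, x1)]
    return x_t, u_t
```

At training time, `x0` would be a noise image and `x1` the ground-truth tactile image; at inference, the learned velocity field is integrated from noise to a predicted tactile image.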
A major benefit of working in a geometric space is the reduced sim-to-real gap. We integrate Taxim into MuJoCo to generate a diverse library of synthetic geometries. To test zero-shot capabilities, we deploy FlowTouch on a physical FR3 robot equipped with DIGIT sensors and a D435i camera to collect tactile samples of common household objects.
Left: Primitive geometries for simulation. Right: Taxim GelSight rendered images with contact variations.
Physical data collection setup with an FR3 robot and generated mesh alignments.
FlowTouch performs robustly across different datasets, including GelSight (OFR-G) and DIGIT (YCB-D). Using domain conditioning and the Sparsh Perceptual Loss, our combined training method bridges the sim-to-real gap, predicting contact patterns even for completely unseen geometries and sensor instances (SELF-D).
Qualitative tactile predictions across various ablation models on the validation datasets.
We evaluate whether the generated tactile images preserve sufficient information for downstream robotic manipulation tasks. Specifically, we adopt the grasp stability estimation task, in which a binary classifier predicts grasp success from a tactile image acquired at a candidate contact location.
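To make the evaluation contract concrete, here is a toy stand-in for such a classifier. The paper's classifier is a learned model; this sketch only shows the interface (tactile image in, probability of grasp success out), and the hand-crafted features, weights, and the 0.1 contact threshold are invented for illustration.

```python
import math

def tactile_features(image, background):
    """Two illustrative statistics from the difference between a tactile
    image and the sensor's static background: mean contact intensity and
    the fraction of pixels in contact."""
    diffs = [abs(p - b)
             for row_i, row_b in zip(image, background)
             for p, b in zip(row_i, row_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    spread = sum(1 for d in diffs if d > 0.1) / n  # fraction in contact
    return [mean, spread]

def predict_grasp_success(image, background, weights=(8.0, 4.0), bias=-2.0):
    """Logistic model over the features above; weights are made up.
    Returns the probability that the grasp is stable."""
    f = tactile_features(image, background)
    z = sum(w * x for w, x in zip(weights, f)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

The point of the experiment is that `image` can be either a real tactile reading or a FlowTouch-generated one; if accuracy is preserved with generated images, they retain the physically relevant contact information.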
When evaluated on Ground Truth (GT) tactile images, the baseline classifier achieves 85.83% accuracy. Even without seeing any samples from the grasp dataset during training (zero-shot, Variant D), our model still achieves 81.35% accuracy, indicating that our approach retains important physical properties even across sensor types and domain gaps.
When the generative model is trained on the full task dataset (Variant A), it slightly exceeds the baseline at 86.06%. Variant B is pre-trained on simulated data and fine-tuned on the full task dataset, coming very close to the baseline at 85.17%. Variant C demonstrates the data efficiency of our approach: after mixed sim-and-real pre-training, fine-tuning on just 10% of the task dataset still achieves 83.74% accuracy. This shows that pre-training equips the model with strong priors that can be adapted to a new domain with little data.
| Variant | Description | Accuracy (%) |
|---|---|---|
| GT | baseline (ground truth) | 85.83 |
| A | full task data | 86.06 |
| B | sim pretraining + full task data | 85.17 |
| C | mixed sim and real data pretraining + 10% task data | 83.74 |
| D | zero-shot / no task data | 81.35 |
Meshes used for the ObjectFolder Benchmark grasp stability test.
Tactile predictions from the grasp stability task on the dataset ablations. SELF-D has not been seen during training.
@misc{flowtouch,
title={{FlowTouch}: View-Invariant Visuo-Tactile Prediction},
author={Seongjin Bien and Carlo Kneissl and Tobias J{\"u}lg and Frank Fundel and Thomas Ressler-Antal and Florian Walter and Bj{\"o}rn Ommer and Gitta Kutyniok and Wolfram Burgard},
year={2026},
url={https://arxiv.org/abs/2603.08255}
}