Reactive In-Air Clothing Manipulation with Confidence-Aware Dense Correspondence and Visuotactile Affordance

CoRL 2025

Overview: We develop a visuotactile system capable of folding and hanging clothing in-air using dense visual representations and tactilely-supervised visual affordance networks.

Abstract

Manipulating clothing is challenging due to its complex, variable configurations and frequent self-occlusion. While prior systems often rely on flattening garments, humans routinely identify keypoints in highly crumpled and suspended states. We present a novel, task-agnostic, visuotactile framework that operates directly on crumpled clothing, including in-air configurations that have not been addressed before.

Our approach combines global visual perception with local tactile feedback to enable robust, reactive manipulation. We train dense visual descriptors on a custom simulated dataset using a distributional loss that captures cloth symmetries and generates correspondence confidence estimates. These estimates guide a reactive state machine that dynamically selects between folding strategies based on perceptual uncertainty. In parallel, we train a visuotactile grasp affordance network using high-resolution tactile feedback to supervise grasp success. The same tactile classifier is used during execution for real-time grasp validation. Together, these components enable a reactive, task-agnostic framework for in-air garment manipulation, including folding and hanging tasks. Moreover, our dense descriptors serve as a versatile intermediate representation for other planning modalities, such as extracting grasp targets from human video demonstrations, paving the way for more generalizable and scalable garment manipulation.

Dataset Generation in Simulation

We use Blender to simulate a wide variety of shirt geometries and deformations. We incorporate hems, stitches, and sewing seams into our simulations to mimic realistic garments, enhancing visual realism and providing key features for correspondence. We use consistent vertex indexing across shirts, enabling descriptors to align with a canonical template.
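
As a rough sketch of this step, the snippet below shows one way per-frame vertex positions could be exported from a Blender cloth simulation through its Python API, so that vertex index i refers to the same material point on every shirt as on the canonical template. The object name, frame range, and output path are illustrative assumptions, not our actual pipeline.

```python
import json
import bpy

# Assumed setup: the simulated shirt mesh is named "shirt"; output path is illustrative.
obj = bpy.data.objects["shirt"]

frames = {}
for frame in range(1, 101):                          # animation frames to export
    bpy.context.scene.frame_set(frame)
    depsgraph = bpy.context.evaluated_depsgraph_get()
    eval_obj = obj.evaluated_get(depsgraph)          # mesh with the cloth simulation applied
    mesh = eval_obj.to_mesh()
    # Vertex i is the same material point on every shirt, so the index itself
    # serves as the ground-truth correspondence label against the canonical template.
    frames[frame] = [tuple(v.co) for v in mesh.vertices]
    eval_obj.to_mesh_clear()

with open("/tmp/shirt_vertices.json", "w") as f:
    json.dump(frames, f)
```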

Shirt simulation example

Our animated simulation pipeline enables generation of a wide range of shirt types and configurations. Shown below are a few examples of different simulated shirts in both in-air hanging and on-table configurations.

Shirt simulation example

Dense Correspondence with Distributional Loss

To train the dense correspondence network, a pixel on the deformed shirt is queried. Because the vertex indexing across the different simulated shirts is consistent with the canonical shirt mapping, the corresponding pixel match in simulation can be found automatically. The network outputs a dense descriptor for the shirt in which every pixel is mapped to a d-dimensional embedding, capturing the location of that pixel with respect to the canonical shirt in descriptor space.
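
To make the descriptor lookup concrete, here is a minimal sketch, assuming PyTorch and the tensor shapes noted in the docstring, of how a query descriptor from the canonical shirt can be matched to a pixel in the predicted descriptor image by nearest neighbor in descriptor space.

```python
import torch

def find_correspondence(query_descriptor, descriptor_image):
    """Return the (row, col) pixel whose descriptor is closest to the query.

    query_descriptor: (d,) embedding of a pixel on the canonical shirt.
    descriptor_image: (d, H, W) dense descriptors predicted for the observed shirt.
    """
    d, h, w = descriptor_image.shape
    flat = descriptor_image.reshape(d, -1)                       # (d, H*W)
    dists = torch.norm(flat - query_descriptor[:, None], dim=0)  # L2 distance in descriptor space
    idx = torch.argmin(dists)
    return divmod(idx.item(), w)                                 # (row, col) of best match
```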

Shirt simulation example

Using a distributional loss to train our dense correspondence network provides several advantages: it allows the network to represent multiple valid matches for a pixel, captures symmetries in the cloth, and provides a built-in confidence metric needed to manipulate garments in highly occluded states.
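
Below is a minimal sketch of one way such a distributional loss can be written, assuming PyTorch and a spatial softmax over descriptor similarities. The function name, temperature, and target-heatmap format are assumptions for illustration; the entropy term shows how a confidence estimate falls out of the predicted match distribution.

```python
import torch
import torch.nn.functional as F

def distributional_loss(descriptor_image, query_descriptor, target_heatmap, temperature=0.1):
    """Cross-entropy between the predicted match distribution and a (possibly multi-modal) target.

    descriptor_image: (d, H, W) predicted descriptors for the deformed shirt.
    query_descriptor: (d,) descriptor of the queried canonical pixel.
    target_heatmap:   (H, W) ground-truth distribution; it may place mass on several
                      pixels to encode cloth symmetries (e.g. left vs. right sleeve).
    """
    sims = torch.einsum("dhw,d->hw", descriptor_image, query_descriptor) / temperature
    log_p = F.log_softmax(sims.reshape(-1), dim=0)         # spatial softmax over all pixels
    loss = -(target_heatmap.reshape(-1) * log_p).sum()     # cross-entropy against the target

    # The entropy of the predicted distribution doubles as a confidence estimate:
    # low entropy means a confident, peaked match; high entropy means ambiguity or occlusion.
    entropy = -(log_p.exp() * log_p).sum()
    return loss, entropy
```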

Visuotactile Grasp Affordance Network

We train a grasp affordance network on the simulated dataset to find good grasp regions on a hanging shirt. A depth image of the garment is input to the UNet architecture implemented in [1], and a heatmap of graspable regions is output.
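
As an illustrative sketch (not the exact interface from [1]), selecting a grasp from the predicted heatmap might look like the following, assuming a PyTorch network that maps a 1xHxW depth image to a 1xHxW affordance map.

```python
import torch

def select_grasp(affordance_net, depth_image):
    """Pick the pixel with the highest predicted grasp affordance.

    affordance_net: network mapping a (1, 1, H, W) depth image to a (1, 1, H, W) heatmap.
    depth_image:    (H, W) float tensor from the depth camera.
    """
    with torch.no_grad():
        heatmap = affordance_net(depth_image[None, None])[0, 0]  # (H, W) graspability scores
    idx = torch.argmax(heatmap)
    row, col = divmod(idx.item(), heatmap.shape[1])
    return (row, col), heatmap[row, col].item()
```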

By training on the simulated shirt dataset, we can explicitly check desired grasping criteria, such as (1) side-grasp reachability, (2) gripper collision, and (3) whether at most two layers of fabric are being grasped.
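
A hypothetical sketch of how these three checks could be scored on sampled cloth points in simulation is shown below; the point-cloud geometry and thresholds are illustrative assumptions, not the exact simulation queries used in our pipeline.

```python
import numpy as np

def grasp_label(grasp_point, approach_dir, closing_dir, cloth_points,
                corridor_radius=0.03, approach_length=0.15,
                pinch_radius=0.01, jaw_opening=0.05, cloth_thickness=0.003):
    """Heuristic graspability label (1 = good grasp) for a simulated side grasp.

    grasp_point:  (3,) candidate grasp location on the cloth.
    approach_dir: (3,) unit vector pointing from the gripper toward the grasp point.
    closing_dir:  (3,) unit vector along which the fingers close.
    cloth_points: (N, 3) points sampled from the simulated cloth surface.
    """
    rel = cloth_points - grasp_point

    # (1) Side-grasp reachability: the corridor swept by the gripper while
    # approaching must be free of cloth, apart from the region being grasped.
    back = rel @ (-approach_dir)                       # distance back along the approach path
    radial = np.linalg.norm(rel - np.outer(back, -approach_dir), axis=1)
    blocked = (back > 0.02) & (back < approach_length) & (radial < corridor_radius)
    if blocked.any():
        return 0

    # (2) Gripper collision: cloth sitting between the open jaws but outside the
    # pinch region would be struck or crushed when the fingers close.
    s = rel @ closing_dir                              # coordinate along the closing axis
    lateral = np.linalg.norm(rel - np.outer(s, closing_dir), axis=1)
    in_jaws = (np.abs(s) < jaw_opening / 2) & (lateral < corridor_radius)
    in_pinch = in_jaws & (lateral < pinch_radius)
    if np.count_nonzero(in_jaws & ~in_pinch) > 20:     # arbitrary tolerance for stray points
        return 0

    # (3) Layer count: cluster pinched points along the closing axis; each gap
    # wider than the cloth thickness separates two layers of fabric.
    pinched = np.sort(s[in_pinch])
    if pinched.size == 0:
        return 0
    layers = 1 + np.count_nonzero(np.diff(pinched) > 2 * cloth_thickness)
    return 1 if layers <= 2 else 0
```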

Shirt simulation example

However, because the wide range of material properties and dynamics of real fabrics cannot be fully captured in simulation, we fine-tune on the robot using a tactile classifier to supervise grasp success. The visualized grasp affordance shows predictions from the model trained in simulation, along with the grasps selected from that affordance to improve the efficiency of grasp selection.
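
For illustration, a single on-robot fine-tuning step supervised by the tactile classifier might look like the sketch below; the network handle, tactile classifier interface, and hyperparameters are assumptions rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def finetune_step(affordance_net, optimizer, tactile_success, depth_image,
                  grasp_pixel, tactile_image):
    """One on-robot fine-tuning step supervised by the tactile grasp classifier.

    affordance_net:  simulation-trained network mapping a (1, 1, H, W) depth image to a (1, 1, H, W) map.
    tactile_success: classifier mapping a tactile image to 1 (secure grasp) or 0 (failed grasp).
    depth_image:     (H, W) depth observation used when the grasp was selected.
    grasp_pixel:     (row, col) pixel the robot actually grasped.
    tactile_image:   tactile reading captured after the gripper closed.
    """
    label = torch.tensor(float(tactile_success(tactile_image)))   # tactile supervision signal
    logits = affordance_net(depth_image[None, None])[0, 0]        # (H, W) affordance logits
    pred = logits[grasp_pixel]                                    # logit at the executed grasp
    loss = F.binary_cross_entropy_with_logits(pred, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```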

References

  1. Visuotactile Affordances for Cloth Manipulation with Local Control. arXiv:1234.5678.