Manipulating clothing is challenging due to complex configurations, variable material dynamics, and frequent self-occlusion. Prior systems often flatten garments or assume visibility of key features. We present a dual-arm visuotactile framework that combines confidence-aware dense visual correspondence and tactile-supervised grasp affordance to operate directly on crumpled and suspended garments.
The correspondence model is trained on a custom, high-fidelity simulated dataset using a distributional loss that captures cloth symmetries and generates correspondence confidence estimates. These estimates guide a reactive state machine that adapts folding strategies based on perceptual uncertainty. In parallel, a visuotactile grasp affordance network, self-supervised using high-resolution tactile feedback, determines which regions are physically graspable. The same tactile classifier is used during execution for real-time grasp validation. By deferring action in low-confidence states, the system handles highly occluded table-top and in-air configurations. We demonstrate our task-agnostic grasp selection module in folding and hanging tasks. Moreover, our dense descriptors provide a reusable intermediate representation for other planning modalities, such as extracting grasp targets from human video demonstrations, paving the way for more generalizable and scalable garment manipulation.
We use Blender to simulate a wide variety of shirt geometries and deformations. We incorporate hems, stitches, and sewing seams into our simulations to mimic realistic garments, ensuring visual realism and providing key features for correspondence. We use consistent vertex indexing across shirts, enabling descriptors to align with a canonical template.
Our animated simulation pipeline enables generation of a wide range of shirt types and configurations. Shown below are a few examples of different simulated shirts in both suspended in-air and on-table configurations.
Example shirts simulated in Blender.
To train the dense correspondence network, a pixel on the deformed shirt is queried. Because vertex indexing is consistent across the simulated shirts and the canonical shirt mapping, the corresponding pixel match can be found directly in simulation. The network outputs a dense descriptor for the shirt in which every pixel is mapped to a d-dimensional embedding, capturing the location of the pixel with respect to the canonical shirt in descriptor space.
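To make the supervision concrete, here is a minimal sketch of how a ground-truth match can be looked up in simulation. The array names (`deformed_vertex_ids`, `canonical_vertex_px`) are hypothetical; they stand in for a per-pixel vertex-ID render of the deformed shirt and a table of canonical-image pixel locations per vertex.

```python
import numpy as np

def ground_truth_match(query_px, deformed_vertex_ids, canonical_vertex_px):
    """Look up the canonical-image pixel matching a queried pixel on the
    deformed shirt, using the consistent vertex indexing described above.

    deformed_vertex_ids: (H, W) int array; each pixel stores the index of
        the mesh vertex rendered there (-1 for background).
    canonical_vertex_px: (V, 2) array giving each vertex's (u, v) pixel
        location in the canonical-shirt render.
    """
    u, v = query_px
    vid = deformed_vertex_ids[v, u]
    if vid < 0:
        return None  # background pixel: no ground-truth match
    return tuple(canonical_vertex_px[vid])
```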
Using a Distributional Loss to train our dense correspondence network provides several advantages: it allows the network to represent multiple valid matches for a pixel, captures symmetries in the cloth, and provides a built-in confidence metric needed to manipulate garments in highly occluded states.
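One way such a distributional loss can be written is sketched below; the temperature value and tensor shapes are assumptions, not the paper's exact formulation. The query descriptor is compared against every pixel of the other view, a softmax turns the similarities into a predicted distribution, and the cross-entropy is taken against a target heatmap that may place mass on several symmetric matches at once.

```python
import torch
import torch.nn.functional as F

def distributional_loss(desc_a, desc_b, query_px, target_heatmap, temperature=0.1):
    """Cross-entropy between the predicted match distribution over image B
    and a target distribution that can cover multiple valid matches
    (e.g., both sleeves of a symmetric shirt).

    desc_a, desc_b: (D, H, W) dense descriptor maps for the two views.
    query_px:       (u, v) pixel queried in image A.
    target_heatmap: (H, W) tensor summing to 1 over image B.
    """
    D, H, W = desc_b.shape
    u, v = query_px
    d_q = desc_a[:, v, u]                                  # (D,) query descriptor
    diff = desc_b.reshape(D, -1) - d_q[:, None]            # (D, H*W)
    logits = -(diff ** 2).sum(dim=0) / temperature         # similarity per pixel
    log_pred = F.log_softmax(logits, dim=0)
    return -(target_heatmap.reshape(-1) * log_pred).sum()  # cross-entropy
```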
With the resulting network, we can query points on deformed shirts and obtain correspondence heatmaps with respect to the canonical shirt. We can also use our descriptors in the inverse direction, querying points on the canonical shirt to find correspondences on the deformed shirt; this is the direction we query during execution on the robot.
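At execution time, the same machinery yields a confidence for free: the peak probability of the heatmap is low when the match is ambiguous or occluded. A hedged sketch, reusing the descriptor-map conventions from the loss above:

```python
import torch.nn.functional as F

def query_correspondence(canonical_desc, deformed_desc, canonical_px, temperature=0.1):
    """Query a canonical-shirt pixel and return the heatmap over the deformed
    image, the best-match pixel, and a confidence (the heatmap's peak)."""
    D, H, W = deformed_desc.shape
    u, v = canonical_px
    d_q = canonical_desc[:, v, u]
    diff = deformed_desc.reshape(D, -1) - d_q[:, None]
    probs = F.softmax(-(diff ** 2).sum(dim=0) / temperature, dim=0)
    conf, idx = probs.max(dim=0)
    best_px = (int(idx) % W, int(idx) // W)
    return probs.reshape(H, W), best_px, float(conf)
```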
We train a grasp affordance network to find good grasp regions on a hanging shirt using the simulated dataset. A depth image of the garment is input into the UNet architecture implemented in [1], and a heatmap representing the graspable regions is output.
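As a rough interface reference only (this is a minimal stand-in, not the UNet from [1]), the affordance network maps a depth image to a per-pixel heatmap in [0, 1]:

```python
import torch
import torch.nn as nn

class TinyAffordanceNet(nn.Module):
    """Minimal encoder-decoder stand-in mapping a 1-channel depth image to a
    per-pixel grasp-affordance heatmap in [0, 1]."""
    def __init__(self, ch=32):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 1, 4, stride=2, padding=1),
        )

    def forward(self, depth):  # depth: (B, 1, H, W) -> affordance: (B, 1, H, W)
        return torch.sigmoid(self.dec(self.enc(depth)))
```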
By training on the simulated shirt dataset, we can explicitly check desired grasping criteria, such as (1) side-grasp reachability, (2) gripper collision, and (3) whether at most two layers of fabric are being grasped.
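One way such checks could look in simulation is sketched below: `trimesh` ray casting approximates the layer count, the surface normal approximates side reachability, and full gripper-collision checking is left out. The geometry conventions (z-up, pinch direction, thresholds) are assumptions, not the paper's exact criteria.

```python
import numpy as np
import trimesh

def layer_count(mesh, point, pinch_dir, span=0.03):
    """Count fabric layers a pinch at `point` would capture by casting a ray
    through the cloth; each surface crossing within the pinch span counts as
    one layer (assuming a zero-thickness cloth mesh)."""
    origin = point - pinch_dir * span
    hits, _, _ = mesh.ray.intersects_location(
        ray_origins=[origin], ray_directions=[pinch_dir])
    dists = np.linalg.norm(hits - origin, axis=1)
    return int(np.sum(dists <= 2 * span))

def positive_grasp_label(mesh, point, normal, pinch_dir):
    side_reachable = abs(normal[2]) < 0.5    # (1) normal roughly horizontal
    # (2) gripper-collision checking would use the full gripper model; omitted.
    few_layers = layer_count(mesh, point, pinch_dir) <= 2   # (3)
    return side_reachable and few_layers
```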
Visuotactile grasp affordance training in simulation.
However, because varied fabrics exhibit a wide range of material properties and dynamics that cannot be fully captured in simulation, we fine-tune on the robot, using a tactile classifier to supervise grasp success. The visualized grasp affordance shows predictions from the simulation-trained model, along with grasps selected based on that affordance; selecting grasps this way improves the efficiency of grasp selection during fine-tuning.
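A hedged sketch of one fine-tuning step under this scheme, where the tactile classifier's verdict on an executed grasp becomes the label for the affordance map at the grasped pixel (the loss and interface are illustrative, not the exact training recipe):

```python
import torch
import torch.nn.functional as F

def finetune_step(affordance_net, optimizer, depth, grasp_px, tactile_success):
    """One self-supervised update: label the grasped pixel with the tactile
    classifier's success/failure verdict and backpropagate.

    depth:           (1, 1, H, W) depth image of the hanging garment.
    grasp_px:        (u, v) pixel of the executed grasp.
    tactile_success: bool from the tactile classifier.
    """
    pred = affordance_net(depth)                       # heatmap in [0, 1]
    u, v = grasp_px
    target = torch.tensor(float(tactile_success))
    loss = F.binary_cross_entropy(pred[0, 0, v, u], target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```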
We repeat this fine-tuning process for a range of shirt types to capture varied material properties and dynamics.
By combining confidence-aware dense visual correspondence and a visuotactile grasp affordance network, our system can defer action in low confidence states, allowing us to handle highly occluded table-top and in-air configurations.
In the single-grasp example below, a point on the end of the sleeve is queried on the canonical shirt. The system starts with a relatively low affordance for the sleeve, since the surface normal is not aligned with the gripper. As the shirt rotates, the sleeve moves to a better position for grasping, and the network becomes more confident that the point is indeed a sleeve. Once a point with a high correspondence match also has a good affordance, the robot attempts to grasp the desired point.
Example of robot deferring grasp action until the queried point (end of the sleeve) has a high enough correspondence match and visuotactile grasp affordance.
We evaluate grasping performance across four garment categories (sleeve, bottom, shoulder, and collar) using two different correspondence networks: one trained solely on suspended shirts and another trained on a combined table and suspended dataset. For each category, we perform 10 grasp attempts per network, recording outcomes as success, failure, or below confidence threshold. Failures are further categorized as correspondence errors or affordance errors.
This system enables reactive in-air folding by dynamically selecting grasp points based on real-time confidence estimates and recovering from failures using tactile reactivity. A confidence-based state machine guides the folding task by picking the next highest-confidence action.
The system starts by picking the shirt up from the table, looking for correspondence regions above a confidence threshold. At each grasp attempt, the robot can query from three canonical regions (shoulder, sleeve, bottom) using the distributional dense correspondence network to generate confidence-weighted heatmaps. A grasp is executed only if both the correspondence confidence and grasp affordance exceed predefined thresholds. Otherwise, the robot rotates the garment and reevaluates, ensuring robust grasp point selection across the four folding strategies.
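In outline, the grasp-selection step of the state machine could look like the following sketch. The `robot` interface, thresholds, and scoring are hypothetical stand-ins, and `query_region` is assumed to wrap the per-point correspondence query over a canonical region, returning a heatmap, best pixel, and confidence.

```python
CANONICAL_REGIONS = ["shoulder", "sleeve", "bottom"]
CORR_THRESH, AFF_THRESH = 0.8, 0.5  # illustrative thresholds

def select_grasp(query_region, affordance_map):
    """Return the best (region, pixel) whose correspondence confidence AND
    affordance both clear their thresholds, or None to signal 'rotate the
    garment and re-evaluate'."""
    best = None
    for region in CANONICAL_REGIONS:
        _, (u, v), conf = query_region(region)
        aff = float(affordance_map[v, u])
        if conf >= CORR_THRESH and aff >= AFF_THRESH:
            score = conf * aff
            if best is None or score > best[0]:
                best = (score, region, (u, v))
    return best  # None -> defer action

def grasp_step(robot, query_region, affordance_map, max_rotations=8):
    for _ in range(max_rotations):
        choice = select_grasp(query_region, affordance_map)
        if choice is None:
            robot.rotate_garment()           # defer: confidence too low
            continue
        _, region, px = choice
        robot.grasp(px)
        if robot.tactile_grasp_success():    # tactile classifier validates
            return region
        robot.release()                      # empty grasp caught: retry
    return None
```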
A confidence-based state machine enables reactive in-air folding by dynamically selecting grasp points based on real-time confidence estimates and recovering from failures using tactile reactivity.
Once the shirt is grasped by two keypoints, the robot tensions the shirt by detecting shear via the average marker displacement on the tactile sensors. Below we show examples of the shirt tensioning step on two different types of shirts.
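A sketch of how such shear-based tensioning can be detected; the marker representation, threshold, and step size are assumed values rather than the system's calibrated ones:

```python
import numpy as np

def shear_magnitude(markers_ref, markers_now):
    """Average in-plane displacement (pixels) of tactile-sensor markers
    relative to their rest positions; it rises as the cloth is tensioned."""
    return float(np.linalg.norm(markers_now - markers_ref, axis=1).mean())

def tension_until_shear(step_arms_apart, read_markers, markers_ref, thresh=1.5):
    """Pull the grippers apart in small increments until the average marker
    displacement crosses the threshold."""
    while shear_magnitude(markers_ref, read_markers()) < thresh:
        step_arms_apart(0.005)  # 5 mm per step (assumed)
```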
Our confidence-aware state machine successfully folded the shirts in 6 out of 10 trials. Of the 30 total grasps attempted over the course of the 10 folding trials, 6 were empty grasps caught by the tactile classifier, immediately triggering recovery behaviors and further demonstrating the usefulness of tactile feedback in the folding pipeline.
Of the 4 unsuccessful folding trials, irrecoverable failure modes included correspondence failures (1 trial), grabbing too much fabric (1 trial), and grabbing diagonally across the shirt (2 trials). (See Appendix 7.2 for visualizations of these failure modes.)
Furthermore, without affordance fine-tuning, the folding success rate dropped to 3 out of 10 trials, with an increase in cases of grabbing too much fabric caused by poor affordance rather than poor correspondence.
Example of a successful folding trial, showing the system step through the confidence-based state machine using both the correspondence and affordance networks.
Compilation of other successful folding demos, showing a variety of shirt geometries (including long-sleeve shirts), material thicknesses, and dynamics. Different folding strategies, grasp re-attempts, and the tactile classifier prompting the robot to try again are also demonstrated.
The task-agnostic grasp selection module can be applied to hanging tasks as well. To perform the hanging task, collar or shoulder points are queried from the table and in the air. After securing both grasps, the robot moves open-loop to a peg.
Our hanging system was successful in 7 out of 10 trials, where success is evaluated based on whether the correct regions were grasped and whether the cloth stays on the peg.
@misc{sunil2025reactiveinairclothingmanipulation,
  title={Reactive In-Air Clothing Manipulation with Confidence-Aware Dense Correspondence and Visuotactile Affordance},
  author={Neha Sunil and Megha Tippur and Arnau Saumell and Edward Adelson and Alberto Rodriguez},
  year={2025},
  eprint={2509.03889},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2509.03889},
}