Bimanual Manipulation of Rigid and Deformable Objects

Project Information

BIRD

Grant agreement ID: 884807

DOI

10.3030/884807

Project closed

EC signature date 7 April 2020

Start date 1 September 2020

End date 31 August 2025

Funded under

EXCELLENT SCIENCE - European Research Council (ERC)

Total cost

€ 2 424 186,25

EU contribution

€ 2 424 186,00

Coordinated by

KUNGLIGA TEKNISKA HOEGSKOLAN
Sweden

Periodic Reporting for period 4 - BIRD (Bimanual Manipulation of Rigid and Deformable Objects)

Reporting period: 2025-03-01 to 2025-08-31

All day long, our fingers touch, grasp and move objects in various media such as air, water and oil. We do this almost effortlessly - it feels as though we spend no time planning or reflecting on what our hands and fingers do, or on how the continuous integration of sensory modalities such as vision, touch, proprioception and hearing helps us outperform any other biological system in the variety of interaction tasks we can execute. Largely overlooked, and perhaps most fascinating, is the ease with which we perform these interactions, which leads to the belief that they are also easy to accomplish in artificial systems such as robots. Humans acquire physical interaction skills from birth and continue to advance them throughout their lifetime. It is the interplay between perception, planning and control, together with training and some innate knowledge, that drives this.
A vision for the future is systems that perform complex tasks safely and robustly in interaction with humans and the environment. To assist humans both at home and in industrial environments, robots need to manipulate objects -- pick them up, place them in a particular position, move them, and even perform more complex tasks such as cutting food, packing bags or dressing humans. These objects can have different shapes and physical properties as well as different degrees of deformability. Although recent deep reinforcement learning algorithms demonstrate action learning directly from raw sensor data, they usually require large training datasets that are hard or impossible to collect in robotics applications. The questions of how to model, or represent, the object under consideration, the robot itself, and the environment in which the robot operates are therefore fundamental for robotic manipulation planning and control.
The main scientific objective of this project was to create new informative and compact representations of deformable objects that incorporate both analytical and learning-based approaches and encode geometric, topological, and physical information about the robot, the object, and the environment. We have done this in the context of challenging multimodal, bimanual object interaction tasks.

We have developed a framework for visual action planning of complex manipulation tasks with high-dimensional state spaces. Planning is performed in a low-dimensional latent state space that embeds images. We define and implement a Latent Space Roadmap, a graph-based structure that globally captures the latent system dynamics. We show the effectiveness of the method on a simulated box stacking task as well as a T-shirt folding task performed with a real robot. Furthermore, we have addressed an additional representation learning problem for the manipulation of deformable objects. In particular, we consider graph-based representations of deformable objects, which arise naturally from their point-cloud representation. Through manipulation, we learn to coarsen this graph into a simpler representation that still captures the necessary dynamics of the object. Our model consists of a Cluster Assignment Model, which takes the initial graph and coarsens it, a Coarsened Dynamics Model, which approximates the dynamics of the coarsened graph, and a Forward Prediction Model, which predicts the next state.
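The roadmap idea described above can be illustrated with a minimal sketch. It is not the project's implementation: the latent codes are assumed to come from some already trained image encoder, and the function names, the radius-based clustering rule and the breadth-first planner are hypothetical choices made only for illustration.

```python
from collections import deque
import math


def build_roadmap(latents, transitions, radius):
    """Group latent codes into nodes and connect nodes by observed transitions.

    latents: list of latent vectors (tuples of floats); assumed encoder outputs.
    transitions: list of (i, j) pairs, meaning an action led from latents[i]
        to latents[j].
    radius: a code joins the first node whose representative lies within this
        distance (a simple heuristic assumed for this sketch).
    """
    centers = []   # representative latent code of each node
    node_of = []   # node index assigned to each latent code
    for z in latents:
        for k, c in enumerate(centers):
            if math.dist(z, c) <= radius:
                node_of.append(k)
                break
        else:
            centers.append(z)
            node_of.append(len(centers) - 1)

    # Directed edges between nodes, one per distinct observed transition.
    edges = {k: set() for k in range(len(centers))}
    for i, j in transitions:
        a, b = node_of[i], node_of[j]
        if a != b:
            edges[a].add(b)
    return centers, node_of, edges


def plan(edges, start, goal):
    """Breadth-first search for a shortest node path from start to goal."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == goal:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in sorted(edges[u]):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None  # goal not reachable in the roadmap


# Toy data: five 2-D latent codes and four observed transitions.
latents = [(0.0, 0.0), (0.05, 0.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
transitions = [(0, 2), (1, 2), (2, 3), (3, 4)]
centers, node_of, edges = build_roadmap(latents, transitions, radius=0.2)
path = plan(edges, node_of[0], node_of[4])
```

In this toy case the first two codes merge into one node, so the plan from the first code to the last visits four nodes. A full system would additionally decode each node back into an image and predict the action along each edge, which this sketch leaves out.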
We also addressed learning representations of data that are equivariant with respect to a symmetry group. Equivariant representations preserve the geometry of the data space in the latent space, leading to an isomorphism between data space and latent space. We achieve this by constructing a latent space that respects transformations given by known group actions in the data space. In this manner, we obtain representations that are disentangled with respect to pose (group action) and class (orbit). Furthermore, we looked into meta-learning, a field of study concerned with learning-to-learn. We demonstrated work in which models are trained in a multi-task setting, with the objective of learning novel tasks using only a small dataset (denoted the support set in few-shot learning). The limited size of the support set naturally induces variance in the adapted parameters, leading to inefficient identification of model parameters and consequently to a sub-optimal model for the specific task. In this work, we propose to reduce this variance through an inverse variance weighting scheme, where the variance is the model uncertainty induced by each point in the support set. The model uncertainty, in turn, is found through the Laplace approximation. The scientific work in the final period made a strong push towards theoretical development, informed by progress across the whole AI and robotics field. The biggest proof of excellence was the invited keynote at NeurIPS 2024, the most prestigious conference in the area of AI.
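The inverse variance weighting mentioned above can be sketched for the simplest case of a single scalar parameter. The sketch assumes that each support point yields its own parameter estimate together with a variance; in the described work that variance comes from a Laplace approximation, whereas here it is simply passed in as a number. The function name is hypothetical.

```python
def inverse_variance_combine(estimates, variances):
    """Combine per-point estimates, weighting each by 1 / variance.

    estimates: one scalar parameter estimate per support point.
    variances: the (assumed given) uncertainty of each estimate; must be > 0.
    Returns the weighted estimate and the variance of that estimate.
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    combined = sum(w * e for w, e in zip(weights, estimates)) / total
    combined_variance = 1.0 / total
    return combined, combined_variance


# Two support points: the second is three times more uncertain,
# so the result is pulled towards the first estimate.
theta, var = inverse_variance_combine([1.0, 3.0], [1.0, 3.0])
```

With equal variances the scheme reduces to a plain average. Uncertain points are down-weighted rather than discarded, so the combined variance is never larger than the smallest individual variance.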

Despite significant advances in data gathering, software and hardware development, replicating the effectiveness and flexibility of human hands remains a challenge. The questions of how to model, or represent, the object under consideration, the robot itself, and the environment in which the robot operates, were therefore fundamental for the conducted work.
We started by addressing the interpretability of network-based models by introducing kinodynamic images. We proposed a methodology that creates images from kinematic and dynamic data of contact-rich manipulation tasks. By using images as the state representation, we enabled the application of interpretability modules that were previously limited to vision-based tasks. This was the first work that applied this type of visualisation in the context of tactile data. Related to representation learning, we addressed the evaluation of the quality of learned representations without relying on a downstream task. We developed the GeomCA algorithm, which evaluates representation spaces based on their geometric and topological properties. GeomCA can be applied to representations of any dimension, independently of the model that generated them. We demonstrated its applicability by analyzing representations obtained from a variety of scenarios, such as contrastive learning models, generative models and supervised learning models. This led to work on multimodal data. Learning representations of multimodal data that are both informative and robust to missing modalities at test time remains a challenging problem due to the inherent heterogeneity of data obtained from different channels. To this end, we developed a Geometric Multimodal Contrastive representation learning method consisting of two main components: i) a two-level architecture consisting of modality-specific base encoders, which map an arbitrary number of modalities to an intermediate representation of fixed dimensionality, and a shared projection head, which maps the intermediate representations to a latent representation space; ii) a multimodal contrastive loss function that encourages the geometric alignment of the learned representations.
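How a contrastive loss encourages alignment between modalities can be shown with a small sketch. This is a generic InfoNCE-style loss over two modalities, not the exact loss of the method described above; the function names, the cosine similarity and the temperature value are assumptions made for illustration, and the latent codes are assumed to be the outputs of the shared projection head.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(za, zb, temperature=0.1):
    """One-directional InfoNCE-style loss between two modalities.

    za[i] and zb[i] are latent codes of the same sample seen through
    modality A and modality B. Each za[i] should be more similar to its own
    zb[i] than to the codes of the other samples in the batch.
    """
    n = len(za)
    loss = 0.0
    for i in range(n):
        logits = [cosine(za[i], zb[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        loss += log_norm - logits[i]
    return loss / n


# Aligned modalities give a low loss; swapping the pairing gives a high one.
za = [(1.0, 0.0), (0.0, 1.0)]
aligned = contrastive_loss(za, [(1.0, 0.0), (0.0, 1.0)])
misaligned = contrastive_loss(za, [(0.0, 1.0), (1.0, 0.0)])
```

Minimising such a loss pulls the codes of the same sample together across modalities and pushes codes of different samples apart. A representation aligned in this way can still be used when one modality is missing at test time.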
Regarding planning for interaction tasks, we developed a first framework for visual action planning of complex manipulation tasks with high-dimensional state spaces, focusing on manipulation of deformable objects.

Simulation of tasks involving complex objects
