Periodic Reporting for period 4 - BIRD (Bimanual Manipulation of Rigid and Deformable Objects)
Reporting period: 2025-03-01 to 2025-08-31
A vision for the future is of systems that perform complex tasks safely and robustly in interaction with humans and the environment. To assist humans both at home and in industrial environments, robots need to manipulate objects: pick them up, place them in a particular position, move them, and even perform more complex tasks such as cutting food, packing bags, and dressing humans. These objects can have different shapes and physical properties as well as different degrees of deformability. Although recent deep reinforcement learning algorithms can learn actions directly from raw sensor data, they usually require large training datasets that are hard or impossible to collect in robotics applications. The questions of how to model, or represent, the object under consideration, the robot itself, and the environment in which the robot operates are therefore fundamental for robotic manipulation planning and control.
The main scientific objective of this project was to create new informative and compact representations of deformable objects that incorporate both analytical and learning-based approaches and encode geometric, topological, and physical information about the robot, the object, and the environment. We have done this in the context of challenging multimodal, bimanual object interaction tasks.
We also addressed learning representations of data that are equivariant with respect to a symmetry group. Equivariant representations preserve the geometry of the data space in the latent space, leading to an isomorphism between data space and latent space. We achieve this by constructing a latent space that respects the transformations given by known group actions in the data space. In this manner, we obtain representations that are disentangled with respect to pose (group action) and class (orbit).
Furthermore, we looked into meta-learning, the field of study concerned with learning to learn. We demonstrated work in which models are trained in a multi-task setting, with the objective of learning novel tasks using only a small dataset (denoted the support set in few-shot learning). The limited size of the support set naturally induces variance in the adapted parameters, leading to inefficient identification of model parameters and, consequently, to a sub-optimal model for the specific task. In this work, we propose to reduce the variance through an inverse variance weighting scheme, where the variance is the model uncertainty induced by each point in the support set. The model uncertainty, in turn, is found through the Laplace approximation.
The scientific work in the final period made a strong push towards theoretical development, informed by developments across the whole AI and robotics field. The biggest proof of excellence was the invited keynote at NeurIPS 2024, the most prestigious conference in the area of AI.
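The inverse variance weighting idea mentioned for the meta-learning work can be illustrated with a minimal sketch. It assumes scalar per-point parameter estimates and already-computed variances; in the project the variances come from a Laplace approximation, which is not reproduced here. The function name and the example values are hypothetical and are not taken from the project code.

```python
def inverse_variance_weighted(estimates, variances):
    """Combine scalar estimates, weighting each by the inverse of its variance.

    Hypothetical helper for illustration only: `estimates` stands in for
    parameter values adapted from individual support-set points, and
    `variances` for the model uncertainty associated with each point.
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("estimates and variances must be non-empty and equal in length")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be strictly positive")

    # Precision is the inverse of the variance; uncertain points get small weight.
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)

    combined = sum(p * e for p, e in zip(precisions, estimates)) / total_precision
    # Variance of the combined estimate under the usual independence assumption.
    combined_variance = 1.0 / total_precision
    return combined, combined_variance


# Two hypothetical support points: the second is three times more uncertain,
# so the combined estimate is pulled towards the first.
combined, combined_variance = inverse_variance_weighted([1.0, 3.0], [1.0, 3.0])
```

The combined variance is never larger than the smallest individual variance, which is the sense in which such a weighting reduces the variance of the adapted parameters.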
We started by addressing the interpretability of network-based models by introducing kinodynamic images. We proposed a methodology that creates images from kinematic and dynamic data of contact-rich manipulation tasks. By using images as the state representation, we enabled the application of interpretability modules that were previously limited to vision-based tasks. This was the first work that applied this type of visualisation in the context of tactile data.
Related to representation learning, we addressed the evaluation of the quality of learned representations without relying on a downstream task. We developed the GeomCA algorithm, which evaluates representation spaces based on their geometric and topological properties. GeomCA can be applied to representations of any dimension, independently of the model that generated them. We demonstrated its applicability by analyzing representations obtained from a variety of scenarios, such as contrastive learning models, generative models and supervised learning models.
This led to work on multimodal data. Learning representations of multimodal data that are both informative and robust to missing modalities at test time remains a challenging problem due to the inherent heterogeneity of data obtained from different channels. To this end, we developed a Geometric Multimodal Contrastive representation learning method consisting of two main components: i) a two-level architecture consisting of modality-specific base encoders, which allow an arbitrary number of modalities to be processed into an intermediate representation of fixed dimensionality, and a shared projection head, which maps the intermediate representations to a latent representation space; ii) a multimodal contrastive loss function that encourages the geometric alignment of the learned representations.
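As a rough illustration of how a contrastive loss can encourage geometric alignment between representations, the sketch below implements a generic InfoNCE-style loss in plain Python. It is an assumption made for illustration and not necessarily the exact loss used in the project: the function names, the temperature value, and the pairing of a modality-specific representation with a joint representation of the same sample are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.1):
    """Generic InfoNCE-style loss (illustrative, not the project's exact loss).

    anchors[i] and positives[i] are assumed to be two representations of the
    same sample (for example, one modality and the joint observation); every
    positives[j] with j != i acts as a negative for anchors[i].
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine_similarity(anchors[i], positives[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits))
        # Negative log-probability of picking the matching representation.
        total += log_denominator - logits[i]
    return total / n


# Hypothetical two-sample batch: aligned pairs should give a much lower loss
# than deliberately mismatched pairs.
modality_repr = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_loss(modality_repr, [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = info_nce_loss(modality_repr, [[0.0, 1.0], [1.0, 0.0]])
```

Minimising such a loss pulls representations of the same sample together and pushes representations of different samples apart, which is one way a representation obtained from a single modality can remain usable when the other modalities are missing at test time.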
Regarding planning for interaction tasks, we developed a first framework for visual action planning of complex manipulation tasks with high-dimensional state spaces, focusing on manipulation of deformable objects.