Bimanual Manipulation of Rigid and Deformable Objects

Periodic Reporting for period 1 - BIRD (Bimanual Manipulation of Rigid and Deformable Objects)

Reporting period: 2020-09-01 to 2022-02-28

All day long, our fingers touch, grasp and move objects in various media such as air, water and oil. We do this almost effortlessly: it feels as if we spend no time planning or reflecting on what our hands and fingers do, or on how the continuous integration of sensory modalities such as vision, touch, proprioception and hearing helps us outperform any other biological system in the variety of interaction tasks we can execute. Largely overlooked, and perhaps most fascinating, is the ease with which we perform these interactions, which leads to the belief that they are equally easy to accomplish in artificial systems such as robots. However, there are still no robots that can easily hand-wash dishes, button a shirt or peel a potato. Our claim is that this is fundamentally a problem of appropriate representation or parameterization.
When interacting with objects, the robot needs to consider geometric, topological, and physical properties of objects. This can be done either explicitly, by modeling and representing these properties, or implicitly, by learning them from data. The main scientific objective of this project is to create new informative and compact representations of deformable objects that incorporate both analytical and learning-based approaches and encode geometric, topological, and physical information about the robot, the object, and the environment. We will do this in the context of challenging multimodal, bimanual object interaction tasks. The focus will be on physical interaction with deformable objects using multimodal feedback. To meet these objectives, we will use theoretical and computational methods together with rigorous experimental evaluation to model skilled sensorimotor behavior in bimanual robot systems.
WP1 - Theoretical foundations
This work package addresses the fundamental work on compact low-level representations and on the development of learning methods that exploit them. It is organised into three tasks.

T1.1 Representations: definition, modelling, efficiency
An important objective of this project is to design compact low-dimensional representations of objects and actions that incorporate the geometric, topological, and physical properties relevant to specific tasks, so as to enable efficient control and planning strategies for those tasks. Learning state representations enables robotic planning directly from raw observations such as images. Most methods learn state representations using losses based on reconstructing the raw observations from a lower-dimensional latent space. Similarity between observations in image space is often assumed to be a proxy for similarity between the underlying states of the system. However, observations commonly contain task-irrelevant factors of variation, such as varying lighting and different camera viewpoints, which are nonetheless important for reconstruction. In our initial work, we defined relevant evaluation metrics and performed a thorough study of different loss functions for state representation learning. We showed that models exploiting task priors, such as Siamese networks with a simple contrastive loss, outperform reconstruction-based representations in visual task planning.
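The contrast between reconstruction-based and pair-based objectives can be illustrated with a minimal sketch of a pairwise contrastive loss. The report does not state the exact form of the loss, so the classical margin-based formulation is assumed here; the function names, the `margin` parameter and the plain-list latent codes are hypothetical and chosen only for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two latent codes given as sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(z1, z2, same_state, margin=1.0):
    """Margin-based pairwise contrastive loss (assumed form, not from the report).

    z1, z2     -- latent codes produced by the two branches of a Siamese encoder
    same_state -- task prior: True if both observations show the same
                  underlying state (e.g. under different lighting)
    margin     -- distance beyond which dissimilar pairs are no longer penalised
    """
    d = euclidean(z1, z2)
    if same_state:
        # Pull codes of the same underlying state together.
        return 0.5 * d ** 2
    # Push codes of different states apart until they are `margin` away.
    return 0.5 * max(0.0, margin - d) ** 2
```

Unlike a reconstruction loss, this objective never asks the latent code to explain lighting or viewpoint, only to separate states that the task prior marks as different.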

T1.2 Identifying unknowns in multimodal data
Our initial work in this direction focused on learning representations of multimodal data and on their evaluation. Such representations need to be both informative and robust to missing modalities at test time, which remains a challenging problem due to the inherent heterogeneity of data obtained from different channels. To address it, we developed a novel Geometric Multimodal Contrastive (GMC) representation learning method comprising two main components: i) a two-level architecture consisting of modality-specific base encoders, which map an arbitrary number of modalities to an intermediate representation of fixed dimensionality, and a shared projection head, which maps the intermediate representations to a latent representation space; ii) a multimodal contrastive loss function that encourages the geometric alignment of the learned representations.
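The idea of geometric alignment can be sketched with a simplified InfoNCE-style loss that pulls each modality-specific latent code towards the code of the complete observation of the same sample and away from the codes of other samples. This is a stand-in under stated assumptions and not the published GMC loss: the function names, the temperature `tau`, and the use of cosine similarity are all hypothetical choices, and the encoders and projection head are assumed to have already produced the latent codes.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def alignment_loss(z_mod, z_joint, tau=0.1):
    """Simplified InfoNCE-style alignment loss (illustrative assumption).

    z_mod[i]   -- latent code of sample i computed from a single modality
    z_joint[i] -- latent code of sample i computed from all modalities
    The positive pair is (z_mod[i], z_joint[i]); the joint codes of the
    other samples in the batch act as negatives.
    """
    n = len(z_mod)
    total = 0.0
    for i in range(n):
        logits = [cosine(z_mod[i], z_joint[k]) / tau for k in range(n)]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n
```

A low loss means that a code computed from one modality alone lies close to the code of the full observation, which is the property that makes the representation usable when modalities are missing at test time.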

T1.3 Skill transfer and adaptation - theoretical foundations
The problem of transferring skills and tasks has been receiving significant attention in the research community, both in computer vision and in reinforcement learning. Despite the advances in this area, current approaches suffer from a lack of fundamental understanding of the system's behavior and, as a result, cannot provide theoretical guarantees. While constructing efficient and well-structured representations for objects and leveraging them for sample-efficient learning and planning are the key milestones in this project, our long-term scientific goal is to develop theoretical underpinnings for adapting and transferring manipulation skills between closely related tasks and environments with varying properties. Prior to the project, we made progress in this direction in the context of reinforcement learning. Reinforcement learning methods are capable of solving complex problems, but the resulting policies might perform poorly in environments that are even slightly different. We considered the problem of transferring knowledge within a family of similar Markov decision processes. We assume that Q-functions are generated by some low-dimensional latent variable. Given such a Q-function, we can find a master policy that can adapt given different values of this latent variable. Our method learns both the generative mapping and an approximate posterior of the latent variables, enabling identification of policies for new tasks by searching only in the latent space, rather than in the space of all policies.
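The benefit of searching in the latent space can be shown with a toy sketch. Everything here is an assumption made for illustration: the decoder `decode_q` is a hand-written quadratic rather than a learned generative mapping, the task has a single state, and the latent variable is fitted by a grid search instead of an approximate posterior.

```python
def decode_q(z, actions):
    """Hypothetical generative mapping from a scalar latent z to Q-values.

    In this toy family, each task rewards actions close to its own z.
    """
    return {a: -(a - z) ** 2 for a in actions}


def fit_latent(observations, actions, candidates):
    """Pick the latent value whose decoded Q-values best match observed returns.

    observations -- list of (action, observed_return) pairs from the new task
    candidates   -- latent values to try (a one-dimensional grid here)
    """
    def error(z):
        q = decode_q(z, actions)
        return sum((q[a] - g) ** 2 for a, g in observations)

    return min(candidates, key=error)


def greedy_action(z, actions):
    """Master policy: act greedily with respect to the decoded Q-values."""
    q = decode_q(z, actions)
    return max(actions, key=lambda a: q[a])
```

With two observed returns from a new task, the search runs over a handful of latent candidates rather than over all policies, and the greedy policy for the fitted latent value recovers the best action without that action having been tried.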

WP2 - Perception, learning and control of bimanual tasks
T2.1 Perceiving humans, scenes and objects
Operating in open-set conditions requires that a system be capable of extending its knowledge and efficiently learning new classes without forgetting previously learned representations.

T2.2 Efficient learning from few examples

WP3 - Benchmarks and validation

In the original plan, we structured the practical work around an important and challenging robotic task: cloth manipulation. We envisioned three levels of difficulty: i) Spreading a tablecloth, ii) Folding a towel, and iii) Partial dressing. In already published work and the uploaded publications, we have successfully addressed all three. We have learned about the important challenges related to these tasks, and we continue to use them to demonstrate our theoretical developments on three different robot platforms: YuMi, Baxter and Franka Emika arms.
The main focus for the first period was to develop methods for successfully encoding complex robotic manipulation tasks and to work on theoretical methods for their evaluation. We developed both a data-driven visual-action planning framework for folding tasks and a Geometric Component Analysis (GeomCA) algorithm that evaluates representation spaces based on their geometric and topological properties. GeomCA can be applied to representations of any dimension, independently of the model that generated them. We demonstrated its applicability by analyzing representations obtained in a variety of scenarios, such as contrastive learning models, generative models and supervised learning models.
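One way to evaluate a representation space by its geometry, independently of the model that produced it, is to compare two sets of representations through the connected components of a neighbourhood graph. The sketch below is written in that spirit but is not the published GeomCA algorithm: the function names, the distance threshold `eps` and the per-component balance score are hypothetical and chosen for illustration.

```python
import math
from itertools import combinations


def components(points, eps):
    """Connected components of the graph linking points at distance <= eps."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def balance_scores(set_r, set_e, eps):
    """Score each component by how evenly it mixes the two sets.

    1.0 means equal numbers of points from both sets, 0.0 means the
    component contains points from one set only. Returns sorted scores.
    """
    points = list(set_r) + list(set_e)
    n_r = len(set_r)
    scores = []
    for comp in components(points, eps):
        r = sum(1 for i in comp if i < n_r)
        e = len(comp) - r
        scores.append(1.0 - abs(r - e) / len(comp))
    return sorted(scores)
```

Because the procedure only needs pairwise distances, it applies to representations of any dimension; components scoring near zero point to regions covered by only one of the two sets.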
Simulation of tasks involving complex objects