Periodic Reporting for period 2 - UNION (Unsupervised Perception)
Reporting period: 2023-12-01 to 2025-05-31
An important limitation of current AIs is that they learn their capabilities from the Internet, which is vast but not all-encompassing. In particular, an AI is unlikely to find on the Internet everything it needs to know to cater to the specific needs of every possible user. Furthermore, while in principle a user could teach an AI new skills and capabilities as needed, this is highly unrealistic with current technology: AIs learn inefficiently, from quantities of data well beyond what a single user can provide.
The primary goal of the UNION project is to build the technology needed for AIs to learn new concepts and tasks effortlessly, as required to help their users in personal, creative and work-related tasks. This means developing methods that can automatically extract concepts by observing small quantities of new data provided by the user, with minimal or no supervisory effort on their part.
In part, this requires providing AIs with a better understanding of the physical properties of the world, so that new concepts can be acquired more efficiently and precisely. In fact, current AIs have a poor understanding of the physical world, as they do not “live” in it, but only experience it indirectly by reading and watching content from the Internet. A consequence of their imprecise understanding of reality is that AIs often make basic errors of judgment. A well-known example of this shortcoming is the fact that images and videos generated by AIs sometimes make no physical sense, for example because images of hands contain too many fingers. More generally, such AIs struggle with problems that require a precise understanding of physical reality, which is needed, for example, to measure, relate and count things in images. One of the main hypotheses of UNION is that more performant and efficient AIs can be obtained by making the underlying models aware of the 3D nature of the world.
In UNION we have further developed methods that integrate language and image understanding, starting from industry-standard vision-language models. For example, we have investigated how models that generate images from textual instructions work, and how they can be modified to also incorporate non-textual user instructions, such as specifying the layout of the desired composition. We have also found new, interesting properties of standard vision-language models, showing that they can learn surprising concepts, such as the meaning of circling things in images, automatically and from very little relevant data; but also that this sensitivity can cause these models to learn undesirable biases in the way they interpret images.
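To make the circling phenomenon concrete, the sketch below uses an off-the-shelf CLIP vision-language model and compares how it scores an image before and after a red circle is drawn on it. The model name, image file, captions and circle coordinates are illustrative placeholders, not the exact experimental setup used in the project.

```python
# Minimal sketch of a "visual prompt": drawing a red circle on an image can
# steer which region a standard vision-language model attends to.
import torch
from PIL import Image, ImageDraw
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score(image: Image.Image, texts: list[str]) -> torch.Tensor:
    """Return CLIP image-text similarity scores, one per caption."""
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        return model(**inputs).logits_per_image.squeeze(0)

image = Image.open("scene.jpg").convert("RGB")  # placeholder image

# Circle a region of interest (coordinates are placeholders).
marked = image.copy()
ImageDraw.Draw(marked).ellipse([100, 100, 200, 200], outline=(255, 0, 0), width=4)

captions = ["a photo of a dog", "a photo of a cat"]
print("plain :", score(image, captions))
print("marked:", score(marked, captions))  # scores shift towards the circled content
```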
In UNION we have further extended the “foundation” models at the basis of image and video understanding in modern AIs with the ability to interpret data in 3D. A representative outcome of this work is the release of the new EPIC Fields dataset, in collaboration with the University of Bristol: a new benchmark for the study of egocentric 3D vision. In a parallel investigation, we have demonstrated that, by combining existing image and video foundation models with a 3D interpretation of the data, we can significantly improve their performance, which strongly supports one of the key hypotheses of the project. We have also built AIs that can imagine not just images, but also the 3D shape of objects, and used this capability to reconstruct full 3D objects from a single image. We have developed techniques that can learn such 3D-aware models hundreds of times faster than prior alternatives, significantly increasing the impact and reach of the technology. Finally, we have worked on methods that can learn not only the 3D shape of objects but also the way they move, and used these new methods to develop AIs that can understand different animal types, with the ultimate goal of extending this technology to any type of dynamic object.
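One simple way to picture what “a 3D interpretation of the data” means, in its most basic form, is to back-project per-pixel features from a 2D foundation model into a 3D point cloud using a depth map and pinhole camera intrinsics. The sketch below does exactly that; all arrays and camera parameters are synthetic placeholders, and the project's actual pipelines are considerably more sophisticated.

```python
# Minimal sketch: lift per-pixel 2D features into a 3D feature point cloud.
import numpy as np

H, W, C = 120, 160, 64                  # image size and feature channels
feats = np.random.randn(H, W, C)        # stand-in for 2D foundation-model features
depth = np.full((H, W), 2.0)            # stand-in metric depth map (metres)
fx = fy = 100.0                         # assumed focal lengths (pixels)
cx, cy = W / 2.0, H / 2.0               # assumed principal point

# Unproject every pixel (u, v) with depth z via the pinhole camera model:
#   X = (u - cx) * z / fx,  Y = (v - cy) * z / fy,  Z = z
v, u = np.mgrid[0:H, 0:W]
z = depth
points = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=-1)

# Each 3D point carries the feature of the pixel it came from; fusing such
# clouds across views yields a 3D feature field that downstream tasks
# (detection, segmentation, language grounding) can query directly in 3D.
cloud_xyz = points.reshape(-1, 3)       # (H*W, 3)
cloud_feat = feats.reshape(-1, C)       # (H*W, C)
print(cloud_xyz.shape, cloud_feat.shape)
```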
Partly in recognition of the research carried out in UNION, the PI has received a major award, the PAMI Thomas S. Huang Memorial Prize; he is the fourth recipient of this honour, on a short list that includes Fei-Fei Li, the creator of the ImageNet dataset.
First, our work on 3D foundations for computer vision has demonstrated that, by reasoning about images and videos in 3D, machine learning algorithms can provide better, more accurate and less noisy answers than is possible when considering the data in 2D alone. In this manner, we have obtained state-of-the-art results in detecting and segmenting objects by type and identity, as well as in grounding natural language expressions in complex 3D scenes.
Our work on learning deformable objects has vastly increased the diversity of objects that AIs can understand in 3D, from a few categories, including humans, to hundreds of animal species. This was achieved by developing general-purpose modelling techniques that can, in principle, be applied beyond animals to other types of objects too.
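As a rough illustration of how shape and motion can be coupled in such deformable models, the sketch below implements textbook linear blend skinning, where each vertex of a template mesh moves with a weighted mix of bone transforms. This is a generic formulation offered only for intuition, not the specific method developed in the project.

```python
# Generic linear blend skinning: a standard way to couple an articulated
# skeleton's motion with a deformable surface.
import numpy as np

def skin(vertices, weights, transforms):
    """Deform template vertices with linear blend skinning.

    vertices:   (V, 3) rest-pose vertex positions
    weights:    (V, B) per-vertex bone weights, each row summing to 1
    transforms: (B, 4, 4) rigid transform of each bone
    returns:    (V, 3) posed vertex positions
    """
    V = vertices.shape[0]
    homo = np.concatenate([vertices, np.ones((V, 1))], axis=1)  # (V, 4)
    per_bone = np.einsum("bij,vj->vbi", transforms, homo)       # (V, B, 4)
    posed = np.einsum("vb,vbi->vi", weights, per_bone)          # (V, 4)
    return posed[:, :3]

# Toy example: two bones and three vertices; the second bone is lifted by 1.
verts = np.array([[0.0, 0, 0], [1.0, 0, 0], [2.0, 0, 0]])
w = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
T = np.stack([np.eye(4), np.eye(4)])
T[1, 1, 3] = 1.0                       # translate bone 2 upwards by 1
print(skin(verts, w, T))               # middle vertex rises halfway, last one fully
```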
In addition to better accuracy and generality, our work in 3D understanding has also led to significantly more efficient algorithms and models, which will be a key factor in deploying this and future technology to end users.
In the remainder of the project, the goal of UNION is to rebuild foundation vision-language models with the capabilities we have been exploring so far: integrating a 3D understanding of the world, thereby grounding the high-level semantics of language models in physical reality, and integrating the ability to acquire and reason about new objects on the fly, extracting them automatically from images and videos so that the new concepts can be used by the underlying reasoning engine to solve new tasks.