Unsupervised Perception

Project Information

UNION

Grant agreement ID: 101001212

DOI

10.3030/101001212

EC signature date: 7 December 2021

Start date: 1 January 2022

End date: 31 May 2027

Funded under

EXCELLENT SCIENCE - European Research Council (ERC)

Total cost

€ 2 311 847,00

EU contribution

€ 2 311 847,00

Coordinated by

THE CHANCELLOR, MASTERS AND SCHOLARS OF THE UNIVERSITY OF OXFORD
United Kingdom

Periodic Reporting for period 2 - UNION (Unsupervised Perception)

Reporting period: 2023-12-01 to 2025-05-31

The aim of the UNION project is to make Artificial Intelligence (AI) more useful and productive for experts and non-experts alike. The general public is now familiar with AIs like ChatGPT and other tools that can create images, videos and music starting from written or spoken instructions. This technology is powerful and transformative, and yet it still comes with significant limitations.

One such limitation is that current AIs learn their capabilities from the Internet, which is vast but not all-encompassing. Specifically, an AI is unlikely to find on the Internet everything it needs to know in order to cater to the specific needs of every possible user. Furthermore, while in principle a user could teach an AI new skills and capabilities as needed, this is highly unrealistic with current technology, as AIs learn inefficiently and require vast quantities of data, well beyond what a single user can provide.

The primary goal of the UNION project is to build the technology needed for AIs to learn new concepts and tasks effortlessly, as needed to help their users in personal, creative and work-related tasks. This means developing methods that can automatically extract concepts from the observation of small quantities of new data provided by the user, with minimal or no supervisory effort on their part.

In part, this requires providing AIs with a better understanding of the physical properties of the world, so that new concepts can be acquired more efficiently and precisely. Current AIs have a poor understanding of the physical world, as they do not “live” in it, but only experience it indirectly by reading and watching content from the Internet. A consequence of their imprecise understanding of reality is that AIs often make basic errors of judgment. A well-known example of this shortcoming is that images and videos generated by AIs sometimes make no physical sense, for example because images of hands contain too many fingers. More generally, such AIs struggle with problems that require a precise understanding of physical reality, which is needed, for example, to measure, relate and count things in images. One of the main hypotheses of UNION is that more performant and efficient AIs can be obtained by making the underlying models aware of the 3D nature of the world.

The aim of UNION is to build AIs that can learn new concepts required to solve new tasks with little to no human intervention. Consider this example task: given pictures of products on the shelves of supermarkets, discover when new products are introduced, and use this information to automatically build a catalogue of old and new products. This is, perhaps surprisingly, a very difficult task for an AI because it requires the AI to automatically discover new concepts, in this case new supermarket products. Each new product name or category is a new concept that needs to be identified and recognised, and exemplifies a fundamental capability that an AI must possess in order to automatically expand its knowledge. In UNION we have significantly advanced the state of the art in this type of problem, and also defined a better way of measuring progress, which has since been picked up by the scientific community.

In UNION we have further developed methods that integrate language and image understanding, starting from industry-standard vision-language models. For example, we have investigated how models that generate images based on textual instructions work, and how these can be modified in order to also incorporate non-textual user instructions, such as specifying the layout of the desired composition. We have also found new, interesting properties of standard vision-language models, showing that they can learn surprising concepts, such as the meaning of circling things in images, automatically and from very little relevant data. We have also shown that this same sensitivity can cause these models to learn undesirable biases in the way they interpret images.

In UNION we have further extended the “foundation” models that are at the basis of image and video understanding in modern AIs with the ability to interpret data in 3D. A representative outcome of this work is the release of the new EPIC Field dataset in collaboration with the University of Bristol, a new benchmark for the study of egocentric 3D vision. In a parallel investigation, we have demonstrated that, by combining existing image and video foundation models with a 3D interpretation of the data, we can significantly improve their performance, which strongly supports one of the key hypotheses of the project. We have also built AIs that can imagine not just images, but also the 3D shape of objects, and used this capability to reconstruct full 3D objects from a single image. We have also developed techniques that can learn such 3D-aware models hundreds of times faster than prior alternatives, significantly increasing the impact and reach of the technology. Finally, we have worked on methods that can learn not only the 3D shape of objects, but also the way they move, and used these new methods to develop AIs that can understand different animal types, with the ultimate goal of extending this technology to any type of dynamic object.

Partly in recognition of the research carried out in UNION, the PI has received a major award, the PAMI Thomas S. Huang Memorial Prize; he is the fourth to receive this honour, in a short list that includes Fei-Fei Li, the inventor of the ImageNet dataset.

UNION has advanced the state of the art in several important ways.

First, our work on 3D foundations for computer vision has demonstrated that, by reasoning about images and videos in 3D, machine learning algorithms can provide better, more accurate and less noisy answers than what is possible by only considering this data in 2D. In this manner, we have obtained state-of-the-art results in detecting and segmenting objects by type and identity, as well as in grounding natural language expressions in complex 3D scenes.

Our work on learning deformable objects has vastly increased the diversity of objects that AIs can understand in 3D, from a few, including humans, to hundreds of animal species. This was obtained by developing general-purpose modelling techniques that can, in principle, be applied beyond animals to other types of objects too.

In addition to better accuracy and generality, our work in 3D understanding has also led to significantly more efficient algorithms and models, which will be a key factor in deploying this and future technology to end users.

In the remainder of the project, the goal of UNION is to rebuild foundation vision-language models with the capabilities that we have been exploring so far. The first is a 3D understanding of the world, and the consequent grounding of the high-level semantics of language models in physical reality. The second is the ability to acquire and reason about new objects on the fly, extracting them automatically from images and videos, so that the new concepts can be used by the underlying reasoning engine to solve new tasks.
