The aim of UNION is to build AIs that can learn the new concepts required to solve new tasks with little to no human intervention. Consider this example task: given pictures of products on supermarket shelves, discover when new products are introduced, and use this information to automatically build a catalogue of old and new products. This is, perhaps surprisingly, a very difficult task for an AI because it requires the AI to automatically discover new concepts, in this case new supermarket products. Each new product name or category is a new concept that needs to be identified and recognised, and exemplifies a fundamental capability that an AI must possess in order to automatically expand its knowledge. In UNION we have significantly advanced the state of the art in this type of problem, and also defined a better way of measuring progress, which has since been picked up by the scientific community.
In UNION we have further developed methods that integrate language and image understanding, starting from industry-standard vision-language models. For example, we have investigated how models that generate images based on textual instructions work, and how these can be modified to also incorporate non-textual user instructions, such as specifying the layout of the desired composition. We have also found new, interesting properties of standard vision-language models, showing that they can learn surprising concepts, such as the meaning of circling things in images, automatically and from very little relevant data; but also that this sensitivity can cause these models to learn undesirable biases in the way they interpret images.
In UNION we have further extended the “foundation” models that are at the basis of image and video understanding in modern AIs with the ability to interpret data in 3D. A representative outcome of this work is the release of the new EPIC Fields dataset in collaboration with the University of Bristol, a new benchmark for the study of egocentric 3D vision. In a parallel investigation, we have demonstrated that, by combining existing image and video foundation models with a 3D interpretation of the data, we can significantly improve their performance, which strongly supports one of the key hypotheses of the project. We have also built AIs that can imagine not just images, but also the 3D shape of objects, and used this capability to reconstruct full 3D objects from a single image. We have also developed techniques that can learn such 3D-aware models hundreds of times faster than prior alternatives, significantly increasing the impact and reach of the technology. Finally, we have worked on methods that can learn not only the 3D shape of objects, but also the way they move, and used these new methods to develop AIs that can understand different animal types, with the ultimate goal of extending this technology to any type of dynamic object.
Partly in recognition of the research carried out in UNION, the PI has received a major award, the PAMI Thomas S. Huang Memorial Prize; he is the fourth to receive this honour, in a short list that includes Fei-Fei Li, the inventor of the ImageNet dataset.