
Integrated and Detailed Image Understanding

Periodic Reporting for period 4 - IDIU (Integrated and Detailed Image Understanding)

Reporting period: 2020-02-01 to 2021-07-31

Perception, and in particular vision, is a fundamental component of artificially intelligent systems. While vision comes naturally to humans and is thus taken for granted, it constitutes a very challenging computational problem: translating the light measurements taken by our eyes into abstract concepts and ideas that express the content of images in a useful, actionable manner. While we are not conscious of it, this problem is sufficiently difficult that it requires more than half of our brain to solve.

The aim of the _Integrated and Detailed Image Understanding_ (IDIU) project is to remove some of these barriers and deliver a much more powerful computer vision technology, closer to the capabilities of human vision, but with the added benefits that machines bring over biology: the ability to operate on millions of images and videos in seconds, without fatigue or distraction, and the ability to embody any amount of expert knowledge in the process, knowledge that would normally have to be held by human experts. Where current computer vision systems can, say, tell that there is a person in an image, our goal is to be able to say _everything_ the image reveals about that person, such as posture, action, type of clothes, relationship to other objects, etc. Such a detailed understanding is often paramount in applications, where small differences can completely change the meaning conveyed by an image.

The challenge is to scale learning algorithms so that they can learn this much about images. The standard approach of manually supervising a machine to recognise individual concepts does not scale when the concept space is so large.
Thus, we need to develop machine learning systems that do not require this explicit and expensive supervision, but can instead learn by themselves, by watching unlabelled images or by researching concepts automatically on the Internet.
We also need to improve integration: whereas current systems are limited to solving one specific function, such as recognising people or reading text, learning automatically requires developing an overall understanding of the visual world, which is only possible if the same machine can process all the relevant information.

In the past five years, the IDIU project has significantly advanced its three aims of detailed, unsupervised and integrated image and video analysis. Thanks to this project and many other contributions by the wider machine learning community, computer vision is many times more powerful and versatile than it was. IDIU has contributed to this progress by introducing significant innovations in unsupervised visual geometry and in unsupervised and weakly-supervised semantic analysis, and by creating new research datasets and open-source software for the community to build on. The significant impact of this project is measured not just in terms of publications in top international venues, but also by prestigious research awards and, most importantly, by the substantial follow-up work it has inspired in academia and industry (including pioneering a new subfield now called "internal learning").

Finally, IDIU has trained top talent who have since become faculty members or scientists at leading academic and industrial research institutions such as Cambridge, Bristol, Edinburgh, Google DeepMind, and Meta (Facebook) Research.
The academic output has been substantial, with 59 publications in major peer-reviewed venues in machine learning and computer vision, dissemination in numerous international workshops and summer schools, the organisation of scientific meetings and benchmark challenges, and the release of several new datasets and of open-source code implementing the new algorithms. Judging by impact metrics such as citation counts in the several thousands, this output has been highly influential in the research community. One of our papers received the best paper award at the Conference on Computer Vision and Pattern Recognition, the field's largest international meeting; this award is given yearly to a single paper out of roughly 5000. As part of the project, several postdoctoral researchers and PhD students have been trained and have since obtained prestigious research positions in academia and industry, including becoming professors and joining research labs such as Google Research, DeepMind, Facebook AI Research and others.

The project has also achieved significant technical progress on all of its key challenges.

First, we have developed new methods to understand images in detail with little or no supervision. For example, we have built the first system that can learn about the parts of objects (e.g. that a human’s body is composed of several limbs connected together) in a completely unsupervised manner, by looking for this information on the Web, automatically using Google or a similar search engine. We have also invented a new approach to unsupervised learning that allows a computer to learn about the structure of visual objects without a single manual annotation beyond the images themselves. The learning principle we discovered, which we call factor learning, is powerful and general, and has been demonstrated in numerous applications and examples throughout the project.
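
To make the idea of learning object structure without annotations concrete, below is a minimal sketch of one common formulation of this kind: an equivariance constraint under random image warps, requiring that landmarks detected in a warped image coincide with the warped landmarks of the original image. This is illustrative only, not the project's actual factor-learning implementation; the network `LandmarkNet`, the warp family and all hyperparameters are assumptions.

```python
# Illustrative sketch (not the project's actual code): discovering K object
# landmarks from unlabelled images via an equivariance constraint.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LandmarkNet(nn.Module):
    """Tiny CNN predicting K landmark locations via soft-argmax on heatmaps."""
    def __init__(self, k=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, k, 3, padding=1),
        )

    def forward(self, x):
        h = self.features(x)                                 # B x K x H x W
        b, k, height, width = h.shape
        p = F.softmax(h.view(b, k, -1), dim=-1).view(b, k, height, width)
        xs = torch.linspace(-1, 1, width, device=x.device)
        ys = torch.linspace(-1, 1, height, device=x.device)
        px = (p.sum(dim=2) * xs).sum(dim=-1)                 # expected x, B x K
        py = (p.sum(dim=3) * ys).sum(dim=-1)                 # expected y, B x K
        return torch.stack([px, py], dim=-1)                 # B x K x 2

def equivariance_loss(net, images, theta):
    """theta: B x 2 x 3 affine matrices; F.affine_grid maps a point p in the
    warped image to the point theta @ [p; 1] in the original image."""
    grid = F.affine_grid(theta, images.shape, align_corners=False)
    warped = F.grid_sample(images, grid, align_corners=False)
    pts = net(images)                                        # landmarks in x
    pts_w = net(warped)                                      # landmarks in g(x)
    # Push the warped image's landmarks through the warp; they should land on
    # the original image's landmarks. (In practice extra terms, e.g. a
    # diversity or reconstruction loss, are needed to avoid trivial solutions.)
    mapped = torch.einsum('bij,bkj->bki', theta[:, :, :2], pts_w)
    mapped = mapped + theta[:, :, 2].unsqueeze(1)
    return F.mse_loss(mapped, pts)

# Usage sketch: small random rotations as the warp family.
images = torch.rand(8, 3, 64, 64)
angle = (torch.rand(8) - 0.5) * 0.5
cos, sin = torch.cos(angle), torch.sin(angle)
theta = torch.zeros(8, 2, 3)
theta[:, 0, 0], theta[:, 0, 1] = cos, -sin
theta[:, 1, 0], theta[:, 1, 1] = sin, cos
loss = equivariance_loss(LandmarkNet(), images, theta)
loss.backward()
```

The key property is that no labels appear anywhere: the supervisory signal comes entirely from the consistency of the network's own predictions under known transformations of the input.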

Second, we have made strides in integrated image understanding, demonstrating that it is possible, through certain technical innovations, to build single deep neural network models that can understand very diverse image types, from handwritten digits to images of sports or animals, at a very small incremental cost compared to learning about one such domain at a time. The resulting models can be one to two orders of magnitude smaller than models learned individually, and, because acquired visual capabilities are shared between domains, the overall performance actually improves.
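
One common way such multi-domain sharing can be realised is with a single backbone shared across all domains plus tiny per-domain adapter layers, so that each additional domain contributes only a small fraction of the parameters. The sketch below is a toy version of this design under stated assumptions (module names, sizes and the number of domains are all illustrative), not the project's exact architecture.

```python
# Illustrative sketch: a shared convolutional backbone with cheap per-domain
# residual adapters, so each extra domain costs only a few parameters.
import torch
import torch.nn as nn

class AdaptedBlock(nn.Module):
    """A shared 3x3 convolution with a small per-domain 1x1 residual adapter."""
    def __init__(self, channels, num_domains):
        super().__init__()
        self.shared = nn.Conv2d(channels, channels, 3, padding=1)
        self.adapters = nn.ModuleList(
            nn.Conv2d(channels, channels, 1) for _ in range(num_domains)
        )
        self.relu = nn.ReLU()

    def forward(self, x, domain):
        h = self.shared(x)
        return self.relu(h + self.adapters[domain](h))

class MultiDomainNet(nn.Module):
    def __init__(self, num_domains, num_classes_per_domain):
        super().__init__()
        self.stem = nn.Conv2d(3, 32, 3, padding=1)
        self.blocks = nn.ModuleList(
            AdaptedBlock(32, num_domains) for _ in range(3)
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.heads = nn.ModuleList(
            nn.Linear(32, c) for c in num_classes_per_domain
        )

    def forward(self, x, domain):
        h = torch.relu(self.stem(x))
        for block in self.blocks:
            h = block(h, domain)
        return self.heads[domain](self.pool(h).flatten(1))

# Usage: domain 0 might be handwritten digits, domain 1 animal photos.
net = MultiDomainNet(num_domains=2, num_classes_per_domain=[10, 100])
logits = net(torch.rand(4, 3, 32, 32), domain=0)
```

Only the 1x1 adapters and the classifier heads are domain-specific here; the large shared convolutions are reused by every domain, which is what keeps the incremental cost per domain small.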

Third, we have developed new methods, including theory, to better understand the outcome of complex learning processes such as deep learning. Current models are, much like the human brain, black boxes that are induced automatically from empirical experience, so how they work in practice remains largely unknown; our new techniques make it possible to visualise what happens inside a deep network, in order to better understand how it works and what its limitations are.
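
To give a flavour of this kind of analysis, the following sketch uses generic activation maximisation, one widely used visualisation technique in this family; the pretrained model, class index and regularisation weight are illustrative assumptions, not the project's specific method. An input image is optimised by gradient ascent until it strongly activates a chosen output unit, revealing the patterns the network associates with that unit.

```python
# Illustrative sketch of activation maximisation (assumed details, not the
# project's specific method): optimise an image so that a chosen class unit
# of a pretrained network fires strongly.
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)        # only the input image is optimised

target_class = 207                 # illustrative ImageNet class index
x = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.05)

for step in range(200):
    opt.zero_grad()
    score = model(x)[0, target_class]
    # Maximise the class score; a small L2 penalty acts as a crude
    # natural-image prior, keeping the result from becoming pure noise.
    loss = -score + 1e-4 * x.pow(2).sum()
    loss.backward()
    opt.step()

# x now contains patterns the network associates with the target class.
```

Stronger image priors (e.g. blurring or jitter between steps) typically produce more interpretable images, but even this bare-bones version exposes what a given unit has learned to respond to.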

Significantly, some of this technology is already being tested for real-world deployment in various research and business contexts, including in orthogonal research areas such as bibliography, materials science, and zoology. Furthermore, we have initiated a partnership with the Continental corporation on autonomous and assisted driving, where our methods have proven extremely useful for learning from large quantities of car-collected data while requiring little manual intervention.
We have advanced the state of the art in numerous ways. In some cases, such as unsupervised and weakly-supervised learning of object landmarks and parts, we were the first to achieve such results at all. For integrated understanding, we even created a whole new public benchmark and accompanying challenges to motivate other research groups to compete with us and improve on these new problems. Furthermore, whenever possible our methods are assessed against the previous state of the art on third-party, publicly available benchmarks; in almost all cases, our methods came out on top at the time of publication.
[Figure: cover image showing some of the research in the project]