Periodic Reporting for period 4 - IDIU (Integrated and Detailed Image Understanding)
Reporting period: 2020-02-01 to 2021-07-31
The aim of the _Integrated and Detailed Image Understanding_ (IDIU) project is to remove some of the barriers that limit current computer vision systems and deliver a much more powerful technology, closer to the capabilities of human vision, but with the added benefits that machines bring over biology: the ability to operate on millions of images and videos in seconds, without fatigue or distraction, and the ability to embody any amount of expert knowledge, which would normally reside with human experts. Where current computer vision systems can, say, tell that there is a person in an image, our goal is to be able to say _everything_ the image reveals about that person, such as posture, action, type of clothing, and relationship to other objects. Such a detailed understanding is often paramount in applications, where small differences can completely change the meaning conveyed by an image.
The challenge is to scale learning algorithms to acquire this breadth of knowledge about images. The standard approach of manually supervising a machine to recognise individual concepts does not scale when the concept space is so large.
Thus, we need to develop machine learning systems that do not require this explicit and expensive supervision, but can instead learn by themselves, by watching unlabelled images or by researching concepts automatically on the Internet.
We also need to improve integration: whereas current systems are limited to solving one specific function, such as recognising people or reading text, learning automatically requires developing an overall understanding of the visual world, which is only possible if the same machine can process all the information involved.
Over the past five years, the IDIU project has significantly advanced its three aims of detailed, unsupervised and integrated image and video analysis. Thanks to this project and many other contributions by the wider machine learning community, computer vision is many times more powerful and versatile than it was. IDIU has contributed to this progress by introducing significant innovations in unsupervised visual geometry and in unsupervised and weakly-supervised semantic analysis, and by creating new research datasets and open-source software for the community to build on. The impact of the project is measured not just by publications in top international venues, but also by prestigious research awards and, most importantly, by the substantial follow-up work it has inspired in academia and industry (including pioneering a new subfield nowadays called "internal learning").
Finally, IDIU has trained top talent, many of whom have since become faculty members or scientists at leading academic and industrial research institutions such as Cambridge, Bristol, Edinburgh, Google DeepMind, and Meta (Facebook) Research.
The project has also achieved significant technical progress in all the key challenges.
First, we have developed new methods to understand images in detail with little or no supervision. For example, we have built the first system that can learn about the parts of objects (e.g. that a human body is composed of several limbs connected together) in a completely unsupervised manner, by automatically searching for this information on the Web via Google or a similar search engine. We have also invented a new approach to unsupervised learning that allows a computer to learn about the structure of visual objects without a single manual annotation, using only the images themselves. The learning principle we discovered, which we call factor learning, is powerful and general, and we have demonstrated it in numerous applications and examples within the project.
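To give a flavour of how annotation-free structure learning can work, the sketch below trains a toy network so that the landmarks it discovers move consistently when the image is warped by a known transformation. Everything here (the tiny architecture, the flip-based loss, all names) is an illustrative assumption in the spirit of such methods, not the project's actual factor-learning implementation.

```python
# Minimal sketch of unsupervised landmark discovery via an
# equivariance objective (illustrative only; not the project's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LandmarkNet(nn.Module):
    """Tiny CNN that predicts K landmark coordinates for an image."""
    def __init__(self, k: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, k, 3, padding=1),
        )

    def forward(self, x):
        h = self.features(x)                     # B x K x H x W heatmaps
        b, k, hh, ww = h.shape
        p = F.softmax(h.view(b, k, -1), dim=-1).view(b, k, hh, ww)
        ys = torch.linspace(-1, 1, hh, device=x.device)
        xs = torch.linspace(-1, 1, ww, device=x.device)
        # Soft-argmax: expected (x, y) coordinate under each heatmap.
        y = (p.sum(dim=3) * ys).sum(dim=2)       # B x K
        x_ = (p.sum(dim=2) * xs).sum(dim=2)      # B x K
        return torch.stack([x_, y], dim=-1)      # B x K x 2

def equivariance_loss(net, images):
    """If the image is warped by a known transform g, the predicted
    landmarks should move by the same g: phi(g(I)) = g(phi(I))."""
    # A known synthetic warp: here, a horizontal flip (x -> -x).
    flipped = torch.flip(images, dims=[3])
    pts = net(images)
    pts_flipped = net(flipped)
    target = pts.clone()
    target[..., 0] = -target[..., 0]             # apply g to landmarks
    return F.mse_loss(pts_flipped, target)

if __name__ == "__main__":
    net = LandmarkNet(k=5)
    imgs = torch.rand(4, 3, 64, 64)              # stand-in for real data
    loss = equivariance_loss(net, imgs)
    loss.backward()                              # images alone supervise
    print(float(loss))
```

The key point the sketch illustrates is that the supervisory signal (the known warp) is generated from the images themselves, so no manual annotation enters the loop.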
Second, we have made strides in integrated image understanding, demonstrating that it is possible, through certain technical innovations, to build single deep neural network models that can understand very diverse image types, from handwritten digits to images of sports or animals, at a very small incremental cost compared to learning about one such domain at a time. The resulting models can be one or two orders of magnitude smaller than models learned individually, and the overall performance, thanks to the sharing of acquired visual capabilities between domains, actually improves.
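The sketch below illustrates this parameter-sharing idea: one large shared trunk serves all domains, while each domain adds only a tiny residual adapter and a classifier head. The architecture, names and domain list are hypothetical placeholders, not the project's exact model.

```python
# Minimal sketch of a multi-domain network with small per-domain
# adapters over a shared trunk (assumed PyTorch; illustrative only).
import torch
import torch.nn as nn

class MultiDomainNet(nn.Module):
    def __init__(self, domains, channels=64):
        super().__init__()
        # Large shared trunk, used for every domain.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Per-domain parts are deliberately small: a 1x1 residual
        # "adapter" after the first conv, and a linear classifier head.
        self.adapters = nn.ModuleDict(
            {d: nn.Conv2d(channels, channels, 1) for d in domains})
        self.heads = nn.ModuleDict(
            {d: nn.Linear(channels, n) for d, n in domains.items()})

    def forward(self, x, domain):
        x = self.trunk[1](self.trunk[0](x))      # shared conv + ReLU
        x = x + self.adapters[domain](x)         # domain-specific tweak
        for layer in self.trunk[2:]:             # rest of shared trunk
            x = layer(x)
        return self.heads[domain](x)

domains = {"digits": 10, "animals": 37, "sports": 20}
net = MultiDomainNet(domains)
shared = sum(p.numel() for p in net.trunk.parameters())
per_domain = sum(p.numel() for p in net.adapters["digits"].parameters()) \
           + sum(p.numel() for p in net.heads["digits"].parameters())
print(shared, per_domain)   # each new domain costs a small fraction
logits = net(torch.rand(2, 3, 32, 32), "digits")
```

Because each extra domain only contributes the adapter and head parameters, the marginal cost per domain stays small, which is what makes a single model covering many domains far more compact than one model per domain.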
Third, we have developed new methods, including theory, to better understand the outcome of complex learning processes such as deep learning. Current models are, similarly to the human brain, black boxes induced automatically from empirical experience, so how they work internally remains largely unknown; our new techniques allow us to visualise what happens inside a deep network, to better understand how it works and what its limitations are.
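One well-known visualization technique of this kind is activation maximization: synthesizing, by gradient ascent, an input that maximally excites a chosen unit, thereby showing what the network "looks for". The sketch below uses a toy untrained model and placeholder settings as assumptions; with a real trained network, the optimized image reveals learned features.

```python
# Minimal sketch of activation maximization for network visualization
# (assumed PyTorch; the tiny model stands in for a trained network).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)            # only the input image is optimized

target_class = 3
img = torch.zeros(1, 3, 64, 64, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.1)

for step in range(200):
    opt.zero_grad()
    score = model(img)[0, target_class]
    # Maximize the class score; the L2 penalty is a simple regularizer
    # that keeps the synthesized image from blowing up.
    loss = -score + 1e-3 * img.pow(2).sum()
    loss.backward()
    opt.step()

# `img` now approximates the input pattern this network associates
# with class 3, i.e. a picture of what the model has (not) learned.
```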
Significantly, some of this technology has already been tested for real-world deployment in various research and business contexts, including in orthogonal research areas such as bibliography, materials science, and zoology. Furthermore, we have initiated a partnership with the Continental corporation on autonomous and assisted driving, where our methods proved extremely useful for learning from large quantities of car-collected data while requiring little manual intervention.