Integrated and Detailed Image Understanding

Periodic Reporting for period 3 - IDIU (Integrated and Detailed Image Understanding)

Reporting period: 2018-08-01 to 2020-01-31

Perception, and in particular vision, is a fundamental component of artificially intelligent systems. While vision comes naturally to humans and is thus taken for granted, it in fact embodies a very challenging computational problem: translating the light measurements taken by our eyes into abstract concepts and ideas that express the content of images in a useful, actionable manner. Although we are not conscious of it, this problem is difficult enough that it requires more than half of our brain to solve.

In this analogy, cameras are similar to eyes, as they measure light, and the goal of computer vision is to transform such measurements into a meaningful description of the content of the images, such as a list of the objects they contain. While computer vision has progressed tremendously in the past few years, especially due to breakthrough technologies such as deep neural networks and GPU-accelerated hardware, it still pales in comparison with human vision in flexibility and depth of understanding.

The goal of the _Integrated and Detailed Image Understanding_ (IDIU) project is to break down some of these barriers and deliver a much more powerful computer vision technology, closer to the capabilities of human vision, but with the added benefits that machines bring over biology. These include the ability to operate on millions of images and videos in seconds, without fatigue or distraction, and the ability to embody in the process any amount of expert knowledge, which would otherwise have to be held by human experts.

In IDIU we aim to break two particular barriers. The first one is to understand images in detail. Where current computer vision systems can, say, tell that there is a person in an image, our goal is to be able to say _everything_ the image reveals about that person, such as posture, action, type of clothes, relationship to other objects, etc. In short, we want to be able to see everything a human would in such a picture, and possibly more. Such a detailed understanding is of paramount importance in many applications. In fact, the easy-to-recognize information (such as that there is a person in an image) is often not nearly as interesting as the details (such as who the person is or what they are doing). Consider for example an autonomous driving application: depending on such details, a person may be either an officer directing traffic or a child crossing the road. Clearly, a computer vision system that cannot differentiate between these two situations is not sufficiently powerful to drive a car.

Unfortunately, teaching a computer to see everything is extremely challenging. Computers are typically taught in a pedantic manner, through explicit examples of what everything is (examples would need to show “police officer”, “child”, but also “child running”, “tall man”, etc.). This is simply not scalable enough, so one of our key goals is to develop machine learning systems that do not require this explicit and expensive level of supervision, but can learn by themselves from unlabelled images, or discover the required information automatically by watching movies or by looking it up on the Internet. The goal is to learn about objects and their details, but also about the abstract content of images in general. The latter is exemplified by the understanding of abstract two-dimensional patterns, whose content may “jump out” to a human as carrying significant information (think of a crack in the surface of a wall), but to which computers are at present largely oblivious.

Doing so will require our team to face a second key challenge, namely that of integrated understanding. Current systems are limited to performing one specific function, such as recognizing people or reading text. A sufficiently advanced system that learns to see automatically, however, cannot do so without an overall understanding of the visual world, which is only possible if the same machine can integrate all the required information into a single component. Unfortunately, current technology is simply not designed to do so; developing a solution to this problem is the challenge of integrated machine learning.
Work performed from the beginning of the project to the end of the period covered by the report and main results achieved so far (For the final period please include an overview of the results and their exploitation and dissemination)

Progress in the project has been swift. The academic output alone in the first two and a half years has been very robust with 22 peer-reviewed scientific publications in the top venues in machine learning and computer vision. Results have been disseminated in several international workshops, international summer schools, and by developing new challenges and benchmarks for the research community to use.

Most importantly, we have achieved significant technical progress in all the key challenges of the project.

First, we have developed new methods to understand images in detail with little or no supervision. For example, we have built the first system that can learn about the parts of objects (e.g. that a human body is composed of several limbs connected together) without manual annotation, by automatically looking for this information on the Web using Google or a similar search engine. We have also invented a new approach to unsupervised learning, which allows a computer to learn about the structure of visual objects without a single manual annotation or any source of supervision other than the images themselves. The learning principle we discovered, which we call factor learning, is powerful and general and has been demonstrated in numerous applications and examples in the project.

Second, we have made strides in integrated image understanding. In particular, we have demonstrated that it is possible, by using certain technical innovations, to build single deep neural network models that can understand very diverse image types, from handwritten digits to images of sports or animals, with a very small incremental cost compared to learning about one such domain at a time. This brings these models somewhat closer to the capabilities of the human brain, which is also a very flexible machine capable of understanding all sorts of image data. The resulting models can be one or two orders of magnitude smaller than a collection of individually learned models, and the overall performance actually improves, because acquired visual capabilities are shared between domains. This is also a practical boon for applications such as mobile computers (phones and tablets), which increasingly rely on a collection of machine learning models to implement their most advanced functionalities.
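
The size argument above can be illustrated with a small back-of-the-envelope calculation. The sketch below is not the project's actual architecture: it simply assumes one shared network plus a small domain-specific module per domain, and all parameter counts (`BACKBONE`, `ADAPTER`, `DOMAINS`) are hypothetical values chosen for illustration.

```python
def separate_models_params(backbone_params: int, num_domains: int) -> int:
    """Total parameters when every domain gets its own full model."""
    return backbone_params * num_domains


def shared_model_params(backbone_params: int, adapter_params: int,
                        num_domains: int) -> int:
    """Total parameters for one shared model plus a small module per domain."""
    return backbone_params + adapter_params * num_domains


# Hypothetical sizes, assumed only to illustrate the arithmetic.
BACKBONE = 10_000_000  # parameters in one full network
ADAPTER = 100_000      # extra parameters per domain (assumed 1% of a full network)
DOMAINS = 10           # number of visual domains covered

separate = separate_models_params(BACKBONE, DOMAINS)
shared = shared_model_params(BACKBONE, ADAPTER, DOMAINS)
print(f"separate: {separate:,}  shared: {shared:,}")
```

Under these assumed numbers, each additional domain costs a small fraction of a full model, which is the sense in which the incremental cost is small; the actual savings depend on the real architecture and on how many domains are covered.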

Third, we have developed new methods, including theory, to better understand the outcome of complex learning processes such as deep learning. Current models are, like the human brain, black boxes that are induced automatically from empirical experience, so how they work in practice remains largely unknown. Our new techniques allow us to visualize what happens inside a deep network, in order to better understand how it works and what its limitations are. The latter is particularly important for appreciating the potential drawbacks of these approaches and addressing them.

Significantly, some of this technology is already being tested for real-world deployment in various research and business contexts. For instance, we have tested our detailed, weakly-supervised and unsupervised learning technology in unrelated research areas such as bibliography, material science, and zoology. Furthermore, we have initiated a partnership with Continental corporation on autonomous and assisted driving, where our methods turned out to be extremely useful for learning from large quantities of car-collected data while requiring little manual intervention.

Looking beyond, we have found that unsupervised learning is particularly useful in deploying computer vision to applications that are of interest to individual users and user groups. Where traditionally AI experts were needed to deploy machine learning technology to each new application area at a significant cost, unsupervised and weakly-supervised learning offer a way to apply AI to specific new problems much more quickly and cheaply. In the future, this will make it possible to transform AI into an everyday tool that anyone will be able to apply to their own professional and personal problems.
Progress beyond the state of the art and expected results until the end of the project

So far, we have advanced the state of the art in numerous ways. In some cases, such as unsupervised and weakly-supervised learning of object landmarks and parts, we have been the first to achieve such results at all. For integrated understanding, we even created a whole new public benchmark and challenge to motivate other research groups to compete with us and show that they can do better than us on these new problems.

Furthermore, whenever possible our methods are assessed against the previous state of the art on third-party, publicly available benchmarks. In almost all cases, our methods came out on top at the time of publication. For instance, in 2016 we set a new accuracy record with our unsupervised object detector, and in 2017 we beat all previous approaches in estimating the 3D pose and shape of object categories, doing so without the help of manual supervision.

Our grand objective from now to the end of the project is to tie these threads together. In particular, we want to build the first system that can understand objects and images at many levels (2D patterns, objects and parts, and geometry and physics) in an integrated package, while learning most of these concepts with weak or no supervision.