LEarning from our collective visual memory to Analyze its trends and Predict future events

Final Report Summary - LEAP (LEarning from our collective visual memory to Analyze its trends and Predict future events)

People constantly draw on past visual experiences to anticipate future events and to better understand, navigate, and interact with their environment, for example, when seeing an angry dog or a quickly approaching car. Currently, no artificial system has a comparable level of visual analysis and prediction capability. LEAP is a first step in that direction, leveraging the emerging collective visual memory formed by the unprecedented amount of visual data available in public archives, on the Internet and from surveillance or personal cameras – a complex, evolving net of dynamic scenes, distributed across many different data sources and equipped with plentiful but noisy and incomplete metadata. The goal of this project is to analyze dynamic patterns in this shared visual experience in order (i) to find and quantify their trends and (ii) to learn to predict future events in dynamic scenes.

In this context, the main achievements of the LEAP project can be summarized as follows.

We have introduced novel visual representations that generalize across the vastly different imaging conditions often present in distributed data. Examples include day/night illumination, major changes in viewpoint such as between hand-held and car-mounted cameras, and variation over time such as seasonal changes or buildings being built or demolished. These models have been used to improve the performance of visual localization, an important capability for autonomous driving, and have opened up new applications such as geo-registering historical and non-photographic imagery. We have also developed a computational model of analogical reasoning to recognize unseen relations between objects – a step towards generalization to yet unseen situations.

Next, we have developed several novel methods to learn a visual vocabulary of patterns from data using readily available but weak (incomplete and noisy) metadata. This has led to results exceeding the state of the art in object, place and human activity recognition, in some cases outperforming fully supervised methods. This is an important step towards machines that learn automatically from readily available but noisy metadata, without the need for explicit manual supervision, which is often expensive and hard to obtain.

Towards the goal of finding and quantifying trends, we have developed several new computational models to discover and analyze spatial and temporal trends in image collections. Potential applications of these models include assessing disease progression in radiology or discovering new insights about the long-term evolution of visual style in art, history, architecture or design.

Finally, towards the goal of predictive modelling, we have introduced several novel temporal models of dynamic scenes. The developed models are able to temporally localize the steps of complex human activities, estimate the 3D motion and contact forces of people interacting with objects, model relations between similar human activities, and predict future actions in video seconds before they occur. We have demonstrated the benefits of the developed models on existing benchmarks as well as on newly collected large-scale datasets. Potential applications include autonomous robotics, self-driving cars and automated assistance.