Community Research and Development Information Service - CORDIS


ACTIVIA Report Summary

Project ID: 307574
Funded under: FP7-IDEAS-ERC
Country: France

Mid-Term Report Summary - ACTIVIA (Visual Recognition of Function and Intention)

Computer vision is concerned with the automated interpretation of images and video streams. Today's research is mostly aimed at answering queries such as "Is this a picture of a dog?", "Is the person walking in this video?" (image and video categorisation) or sometimes "Find the dog in this photo" (object detection). While categorisation and detection are useful for many tasks, inferring correct class labels is not the final answer to visual recognition. The categories and locations of objects do not provide direct understanding of their function, i.e., how things work, what they can be used for, or how they can act and react. Neither do action categories provide direct understanding of a subject's intention, i.e., the purpose of his/her activity. Understanding function and intention would be highly desirable for answering currently unsolvable queries such as "Am I in danger?" or "What can happen in this scene?". The goal of ACTIVIA is to address such questions. Towards this goal, we aim to learn new representations and models for dynamic visual scenes. In particular, we consider learning methods that capture relations among objects, scenes, human actions and people in large-scale visual data.

During the first half of the project, the ACTIVIA team introduced new weakly-supervised methods that learn visual models from incomplete, noisy but readily-available supervision. A discriminative clustering framework, formulated as a quadratic program under linear constraints, has been used to jointly learn models for actions and actors from movie scripts. An extension of this model has been introduced to localize actions given only the order of events at training time. To improve visual representations, ACTIVIA developed deep convolutional neural networks (CNNs) that transfer pre-trained visual models to new tasks with limited amounts of training data. A weakly-supervised extension of this model has been developed to learn CNNs from incomplete annotations and localize objects and actions in images, achieving state-of-the-art results on the Pascal VOC object classification task.

For the task of person analysis, we have developed new methods for segmenting people in stereoscopic videos and for tracking people in crowded scenes. Towards action recognition, we have introduced new, highly efficient video features that reuse the motion information already computed during video compression. We have also co-organized a series of THUMOS action recognition challenges, attracting participation from leading research groups around the world.

Finally, work in the project has focused on modeling person-scene and person-object interactions. The co-occurrence relations between human actions and scene types have been investigated, and a model for predicting human actions from images of static scenes has been developed and validated in practice. The relations between people and scene geometry have also been explored in a new approach using human pose as a cue for single-view 3D scene understanding. The proposed method uses automatic human pose estimation to extract functional and geometric constraints on the scene, demonstrating significant improvements in estimates of 3D scene geometry.
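The transfer idea mentioned above, reusing a model pre-trained on a large source dataset so that a new task can be learned from little data, can be illustrated with a minimal sketch. This is not the project's actual architecture: the "pre-trained" network is stood in for by a frozen random ReLU projection, the target task is synthetic, and all names are illustrative. Only a small linear classification head is trained on top of the frozen features, which is the essence of transfer with limited supervision.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained network: a frozen random ReLU projection.
# In the real setting this would be a CNN trained on a large source dataset.
W_frozen = rng.normal(size=(64, 32)) / np.sqrt(64)

def extract_features(x):
    """Frozen 'pre-trained' feature extractor; never updated during transfer."""
    return np.maximum(x @ W_frozen, 0.0)

def train_linear_head(feats, labels, lr=0.5, steps=1000):
    """Train only a logistic-regression head on top of the frozen features."""
    w = np.zeros(feats.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))  # sigmoid probabilities
        grad = p - labels                           # gradient of log-loss w.r.t. logits
        w -= lr * feats.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

# Tiny synthetic target task: only 40 labelled examples.
x = rng.normal(size=(40, 64))
y = (x[:, 0] > 0).astype(float)

feats = extract_features(x)
w, b = train_linear_head(feats, y)
accuracy = ((feats @ w + b > 0).astype(float) == y).mean()
```

Because the heavy feature extractor stays fixed, only 33 parameters are fitted here, which is why a few dozen labelled examples suffice; fine-tuning some of the pre-trained layers is the natural next step when slightly more target data is available.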

The second part of the ACTIVIA project will deepen our ongoing efforts on reliable, weakly-supervised visual representations. Alongside vision, we will address the ambiguity of linguistic representations of human actions and person-object relations. Joint models for vision and language will be developed for particular tasks, enabling recognition of object function through changes in object states.
