Visual Recognition of Function and Intention

Final Report Summary - ACTIVIA (Visual Recognition of Function and Intention)

Computer vision is concerned with the automated interpretation of images and video streams. Today's research is mostly aimed at answering queries such as "Is this a picture of a dog?" or "Is the person in this video walking?" (image and video categorization), or sometimes "Find the dog in this photo" (object detection). While categorization and detection are useful for many tasks, inferring correct class labels is not the final goal of visual recognition. The categories and locations of objects do not provide a direct understanding of their function, i.e. how things work, what they can be used for, or how they can act and react. Nor do action categories provide a direct understanding of a subject's intention, i.e. the purpose of his or her activity. An understanding of function and intention would be highly desirable for answering currently unsolvable queries such as "Am I in danger?" or "What can happen in this scene?". The goal of ACTIVIA is to address such questions. Towards this goal, we aim to learn new representations and models for dynamic visual scenes. In particular, we consider learning methods that capture relations among objects, scenes, human actions and people in large-scale visual data.

During the course of the project, the ACTIVIA team has introduced new weakly-supervised methods that learn visual models using incomplete, noisy, but readily-available supervision. A discriminative clustering framework, formulated as a quadratic program under linear constraints, has been used to jointly learn models for actions and actors from movie scripts. An extension of this model has been introduced to localize actions given sparse annotations at training time and to classify relations between objects and people. Towards the goal of improving visual representations, ACTIVIA developed deep convolutional neural networks that enable the transfer of pre-trained visual models to new tasks with a limited amount of training data. A weakly-supervised extension of this model has been developed to learn CNNs from incomplete annotations in order to localize objects and actions in images, achieving state-of-the-art results on the Pascal VOC object classification task.

For the task of person analysis, we have developed new methods for segmenting people in stereoscopic videos and for tracking people in crowded scenes. Towards action recognition, we have introduced new, highly efficient video features that exploit the motion information already present in compressed video. We have also co-organized a series of THUMOS action recognition challenges, attracting participation from leading research groups around the world.

Finally, the work in the project has focused on modeling person-scene and person-object interactions. The co-occurrence relations between human actions and scene types have been investigated, and a model for predicting human actions from images of static scenes has been developed and validated in practice. The relations between people and scene geometry have also been explored in a new approach using human pose as a cue for single-view 3D scene understanding.
The proposed method uses automatic human pose estimation to extract functional and geometric constraints on the scene, demonstrating significant improvements in estimates of 3D scene geometry. To delve further into the relations between people, objects and actions, we have initiated a research direction on instructional videos, enabling the study of human actions and objects in relation to an underlying goal. Our work provides a basis for future applications, with a focus on automatic visual assistants and robotics tasks learned from human demonstrations.
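To give a flavour of the discriminative clustering mentioned above, the sketch below is a toy version of such a formulation: substituting the closed-form ridge-regression classifier turns the joint problem into a quadratic cost over a relaxed assignment matrix, minimized under linear (simplex) constraints. The data sizes, the ridge parameter `lam`, and the simple projected-gradient solver are all illustrative assumptions, not the project's actual settings or solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n samples with d features from two loose groups (all
# sizes here are invented for the illustration).
n, d, k = 60, 5, 2
X = np.vstack([rng.normal(0.0, 1.0, (n // 2, d)),
               rng.normal(3.0, 1.0, (n // 2, d))])

# Discriminative-clustering cost tr(Y^T A Y): plugging the closed-form
# ridge-regression classifier back in leaves a quadratic in the
# assignment matrix Y alone.
lam = 1.0
A = np.eye(n) - X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)

def project(Y):
    """Heuristic projection onto {Y >= 0, rows sum to 1} -- the
    linear constraints of the relaxed quadratic program."""
    Y = np.clip(Y, 1e-6, None)
    return Y / Y.sum(axis=1, keepdims=True)

Y = project(rng.uniform(size=(n, k)))      # relaxed soft assignments
step = 0.2
for _ in range(200):
    Y = project(Y - step * (2.0 * A @ Y))  # projected gradient step

labels = Y.argmax(axis=1)                  # discrete clusters at the end
```

A real formulation would add further linear constraints, e.g. balancing terms or constraints derived from script-to-video alignment, to rule out trivial solutions; this toy keeps only the simplex constraint.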
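The transfer idea described above can be reduced to a simple recipe: freeze a pre-trained feature extractor and train only a small task-specific head on the scarce target data. In this sketch a fixed random projection stands in for the frozen CNN body (purely a placeholder), and a logistic-regression head is trained by gradient descent; all sizes and the learning rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder for a frozen, pre-trained CNN body: in practice these
# would be mid-level convolutional activations, not a random map.
FIXED_W = rng.normal(size=(64, 16))

def pretrained_features(x):
    return np.tanh(x @ FIXED_W)       # frozen -- never updated below

# Small labelled set for the new target task (limited training data).
X = rng.normal(size=(40, 64))
y = (X[:, 0] > 0).astype(float)       # toy binary labels

# Only the new task-specific head (w, b) is trained.
F = pretrained_features(X)
w, b = np.zeros(16), 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))   # sigmoid head
    g = p - y                                # logistic-loss gradient
    w -= lr * (F.T @ g) / len(y)
    b -= lr * g.mean()

loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```

Because the body stays fixed, only 17 parameters are fit here, which is why such transfer works even with very few target-task labels.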
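To illustrate how a detected person can constrain scene geometry, here is a minimal single-view calculation: under a pinhole camera with a level optical axis, the head and foot image positions of a standing person of roughly known height pin down both the person's depth and the camera height above the ground plane. The focal length, person height and measurements below are invented for the example; the project's actual method uses full pose estimates, not just two points.

```python
F_PX = 800.0     # focal length in pixels (assumed)
PERSON_H = 1.7   # assumed standing-person height in metres

def scene_from_person(y_head, y_foot):
    """Depth of a standing person and camera height above the ground
    plane, from head/foot image y-coordinates (pixels, measured upward
    from the principal point; level optical axis assumed)."""
    depth = F_PX * PERSON_H / (y_head - y_foot)  # from apparent height
    cam_height = -y_foot * depth / F_PX          # foot lies on the ground
    return depth, cam_height

# Synthetic check: a person 5 m away seen by a camera 1.5 m above the
# ground projects to y_foot = -240 px and y_head = +32 px.
depth, cam_height = scene_from_person(32.0, -240.0)
# depth == 5.0, cam_height == 1.5
```

Each additional person adds another such constraint, which is what makes human pose a useful cue for estimating 3D scene geometry from a single view.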