Final Report Summary - ACTIVIA (Visual Recognition of Function and Intention)
During the course of the project, the ACTIVIA team has introduced new weakly-supervised methods that learn visual models from incomplete, noisy, but readily available supervision. A discriminative clustering framework, formulated as a quadratic program under linear constraints, has been used to jointly learn models for actions and actors from movie scripts. An extension of this model has been introduced to localize actions given sparse annotations at training time and to classify relations between objects and people.

Towards the goal of improving visual representations, ACTIVIA has developed deep convolutional neural networks (CNNs) that enable the transfer of pre-trained visual models to new tasks with limited amounts of training data. A weakly-supervised extension of this model has been developed to learn CNNs from incomplete annotations for localizing objects and actions in images, achieving state-of-the-art results on the Pascal VOC object classification task.

For the task of person analysis, we have developed new methods for segmenting people in stereoscopic videos and for tracking people in crowded scenes. Towards action recognition, we have introduced new, highly efficient video features that exploit the motion information already present in compressed video. We have also co-organized a series of THUMOS action recognition challenges, attracting participation from leading research groups around the world.

Finally, work in the project has focused on modeling person-scene and person-object interactions. The co-occurrence relations between human actions and scene types have been investigated, and a model for predicting human actions from images of static scenes has been developed and validated in practice. The relations between people and scene geometry have also been explored in a new approach that uses human pose as a cue for single-view 3D scene understanding.
The proposed method uses automatic human pose estimation to extract functional and geometric constraints on the scene, demonstrating significant improvements in estimates of 3D scene geometry. To delve further into the relations between people, objects, and actions, we have initiated a research direction on instructional videos, enabling the study of human actions and objects in relation to a goal. Our work provides a basis for future applications, with a focus on automatic visual assistants and robotics tasks learned from human demonstrations.
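The way human pose can serve as a metric cue for single-view geometry can be illustrated with simple pinhole-camera arithmetic: if a standing person of assumed real-world height H appears h pixels tall in an image taken with focal length f (in pixels), their depth is roughly Z = f·H/h. The sketch below uses made-up numbers and is only an illustration of this general principle, not the project's actual model.

```python
# Pinhole-camera sketch: a person's assumed real-world height constrains scene depth.
# All numbers (focal length, person height) are illustrative, not from the project.

FOCAL_PX = 1000.0      # assumed focal length in pixels
PERSON_HEIGHT_M = 1.7  # assumed average height of a standing person, in metres

def depth_from_person(pixel_height):
    """Approximate depth (metres) of a standing person from their image height in pixels."""
    return FOCAL_PX * PERSON_HEIGHT_M / pixel_height

# Two detected people, 340 px and 170 px tall in the image:
print(depth_from_person(340.0))  # 5.0 m
print(depth_from_person(170.0))  # 10.0 m
```

Each detected person thus acts as a "measuring stick": several such depth estimates, combined with the assumption that people stand on the ground, constrain the ground plane and overall scale of the scene.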
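The discriminative clustering formulation mentioned earlier (a quadratic program under linear constraints) can likewise be illustrated with a toy, self-contained sketch. The data, the ridge-regression-style cost, and the projected-gradient solver below are our own illustrative choices, not the project's actual code; real weak supervision from scripts would add further linear constraints (e.g. "this action occurs at least once in this scene") handled by a generic QP solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 20 person tracks with 5-dim appearance features, 3 candidate action labels.
n, d, k = 20, 5, 3
X = rng.normal(size=(n, d))

# Quadratic clustering cost: trace(Z^T Q Z) scores how well a soft labelling Z
# can be predicted from the features X by a ridge-regression classifier.
lam = 1.0
Q = (np.eye(n) - X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)) / n

def project_row_simplex(v):
    """Euclidean projection of v onto {z : z >= 0, sum(z) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

# Projected gradient on: minimize trace(Z^T Q Z) subject to each row of Z lying
# on the probability simplex (one action label per track).
Z = np.full((n, k), 1.0 / k)
for _ in range(200):
    Z = Z - 0.5 * (2.0 * Q @ Z)                          # gradient step
    Z = np.apply_along_axis(project_row_simplex, 1, Z)   # enforce constraints

labels = Z.argmax(axis=1)  # hard assignment of each track to an action
```

In this simplified form the constraints only keep each row of Z a valid label distribution; the script-derived supervision constraints are what make the full formulation informative.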