
Analysis and Representation of Complex Activities in Videos

Periodic Reporting for period 4 - ARCA (Analysis and Representation of Complex Activities in Videos)

Reporting period: 2020-12-01 to 2021-05-31

The goal of the project has been to automatically analyze human activities observed in videos, which is the basis for novel applications. It could be used to create short videos that summarize daily activities to support patients suffering from Alzheimer's disease. It could also be used for education, e.g. by providing a trainee in a hospital with a video analysis that shows whether tasks have been executed correctly. The analysis of complex activities in videos, however, is very challenging since activities vary in temporal duration from minutes to hours, involve interactions with several objects that change their appearance and shape, e.g. food during cooking, and are composed of many sub-activities, which can happen at the same time or in various orders.

While the majority of recent works in action recognition have focused on developing better feature encoding techniques for classifying sub-activities in short video clips of a few seconds, this project moved forward and aimed to develop a higher-level representation of complex activities to overcome the limitations of current approaches. This includes the handling of large temporal variations and the ability to recognize and locate complex activities in videos. A second objective of the project has been to learn a representation from videos that is not limited to a specific application, but that can be reused and adapted to a new setting. The third objective has been to synthesize human motion or poses from just a list of human actions or a textual description, to demonstrate that the model can not only interpret data but also generate it.

In this project, we made significant progress beyond the state of the art. We developed methods that are able to detect and segment actions in videos with very high accuracy. Since such methods are usually trained on large video datasets in which the occurring actions are annotated in every frame, we developed new training procedures that require only partially annotated videos. In this way, the cost of annotating training videos has been substantially reduced, which is a key aspect for commercial applications. We furthermore demonstrated that actions can be recognized across datasets and modalities and that human poses can be generated just from text describing an action.
We developed a hierarchical model that represents complex activities at different granularities. At the top level, complex activities like “preparing pancakes” or “preparing a fruit salad” are modelled. These complex activities consist of several sub-activities that need to be executed, such as “take egg”, “crack egg”, or “stir dough”. These sub-activities form the intermediate representation of the hierarchy. At the lowest level, fine-grained activities or motion primitives are modelled; for instance, cracking eggs involves a sequence of human movements. The hierarchical model processes continuous video streams and predicts, for each frame, which sub-activity is being executed as well as the overall complex activity.
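The following minimal sketch illustrates this kind of two-level prediction, i.e. frame-wise sub-activity labels together with a video-level complex-activity label. It is not the project's implementation: the name HierarchicalActivityModel, the GRU-based temporal model, and all dimensions are assumptions made for illustration, and precomputed frame features from a pretrained backbone are assumed as input.

    # Minimal sketch (illustrative, not the project's code) of predicting a
    # per-frame sub-activity and a video-level complex activity from frame features.
    import torch
    import torch.nn as nn

    class HierarchicalActivityModel(nn.Module):
        def __init__(self, feat_dim=2048, hidden=512, n_sub=48, n_complex=10):
            super().__init__()
            # temporal model over the frame features of a continuous video stream
            self.temporal = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
            # intermediate level: one sub-activity label per frame ("take egg", ...)
            self.sub_head = nn.Linear(2 * hidden, n_sub)
            # top level: one complex activity per video ("preparing pancakes", ...)
            self.complex_head = nn.Linear(2 * hidden, n_complex)

        def forward(self, frame_feats):            # frame_feats: (B, T, feat_dim)
            h, _ = self.temporal(frame_feats)      # (B, T, 2*hidden)
            sub_logits = self.sub_head(h)          # per-frame sub-activity scores
            complex_logits = self.complex_head(h.mean(dim=1))  # pooled video-level scores
            return sub_logits, complex_logits

    # e.g. one video with 300 frames of 2048-dimensional features
    sub_logits, complex_logits = HierarchicalActivityModel()(torch.randn(1, 300, 2048))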

In order to learn the parameters of the model, annotated videos are required. The developed model has the advantage that it can be trained in two ways. In the first setting, we assume that the videos have been annotated in the same way as the model is expected to analyze them, i.e. the ongoing sub-activity is annotated for each frame. This setting is known as learning with full supervision. Providing such a frame-wise labeling of videos, however, requires an enormous effort and can be too expensive for practical applications. We therefore developed learning procedures that allow the model to be learned with less supervision, i.e. weak supervision. We investigated different types of weak supervision, including video tags and protocols. While video tags only summarize which actions occur in a video, protocols additionally provide the temporal order of the actions occurring in each video. For protocols with timestamps, we achieved up to 97% of the accuracy of fully supervised learning while reducing the annotation cost by a factor of 6.
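The difference between these supervision levels can be illustrated with a short sketch for a single 300-frame video. The action names, frame numbers, and the simple timestamp loss below are assumptions made for illustration and are much simpler than the weakly supervised learning procedures actually developed in the project.

    # Illustration of the supervision levels discussed above (assumed values).
    full_labels = ["background"] * 80 + ["take egg"] * 40 + ["crack egg"] * 60 + ["stir dough"] * 120
    video_tags  = {"take egg", "crack egg", "stir dough"}                        # tags: unordered set of actions
    protocol    = ["take egg", "crack egg", "stir dough"]                        # protocol: ordered list of actions
    timestamps  = [("take egg", 95), ("crack egg", 140), ("stir dough", 220)]    # one annotated frame per action

    # With timestamps, a simple training signal is a cross-entropy loss that is
    # evaluated only at the few annotated frames instead of at every frame.
    import torch
    import torch.nn.functional as F

    def timestamp_loss(sub_logits, timestamps, label_to_idx):
        # sub_logits: (T, n_sub) per-frame scores for one video
        frames = torch.tensor([t for _, t in timestamps])
        labels = torch.tensor([label_to_idx[a] for a, _ in timestamps])
        return F.cross_entropy(sub_logits[frames], labels)

    label_to_idx = {"take egg": 0, "crack egg": 1, "stir dough": 2}
    loss = timestamp_loss(torch.randn(300, 3), timestamps, label_to_idx)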

While the activities occurring in a video are an important aspect, they do not describe its full content. We therefore moved beyond the commonly studied activity recognition task and introduced a novel, holistic view on video understanding. Instead of recognizing only the activities, holistic video understanding also aims at recognizing the objects that are involved and their attributes, the scenery in which the video was taken, as well as the general context in which the activities are happening. As pioneering work, we released a dataset consisting of about 580,000 videos annotated with about 3,100 different categories organized in a hierarchical taxonomy, which resulted in about 7.5 million annotations. We also demonstrated that training a network on such a richly annotated dataset improves the action recognition accuracy on other datasets.
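A sketch of what such holistic, multi-label annotations could look like and how they translate into a training objective is given below; the record fields, category names, and dimensions are invented for illustration and are not entries of the released dataset.

    # Hypothetical annotation record with several category types per video.
    annotation = {
        "video_id": "vid_000042",
        "actions":    ["chopping", "cooking"],
        "objects":    ["knife", "cutting board", "tomato"],
        "attributes": ["red", "fresh"],
        "scenes":     ["kitchen"],
        "concepts":   ["food preparation"],
    }

    # Multi-label training treats each category as an independent binary target,
    # e.g. with a sigmoid output and binary cross-entropy per category.
    import torch
    import torch.nn as nn

    n_categories = 3100                        # roughly the size of the taxonomy mentioned above
    head = nn.Linear(512, n_categories)        # on top of a pooled video representation
    criterion = nn.BCEWithLogitsLoss()
    video_repr = torch.randn(8, 512)           # a batch of pooled video features
    targets = torch.zeros(8, n_categories)     # multi-hot labels built from records like the one above
    loss = criterion(head(video_repr), targets)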

We also addressed the problem of how a model that is trained on one domain can be adapted to recognize activities in another domain. For instance, we might have a model trained on videos from YouTube, but we want it to recognize activities in videos captured by a camera mounted on a service robot. Since the videos the model has been trained on and the videos the model has to analyze look different, the model needs to be adapted to handle these differences. We thus developed approaches that successfully adapt models to different modalities or domains.
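One standard ingredient of such domain adaptation, shown here purely as an illustration and not as the project's published approach, is to encourage domain-invariant features by training a domain classifier through a gradient-reversal layer.

    # Illustrative sketch: a domain classifier trained through a gradient-reversal
    # layer pushes the feature extractor towards domain-invariant features.
    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            return x.view_as(x)
        @staticmethod
        def backward(ctx, grad):
            return -grad                       # reversed gradient flows back into the feature extractor

    domain_clf = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 2))

    def domain_loss(features, domain_labels):
        # features: pooled video features from source (label 0) and target (label 1) videos
        logits = domain_clf(GradReverse.apply(features))
        return nn.functional.cross_entropy(logits, domain_labels)

    loss = domain_loss(torch.randn(16, 512), torch.randint(0, 2, (16,)))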

The results of the project have been disseminated through 8 journal publications, 36 publications at peer-reviewed conferences, and 7 publications at peer-reviewed workshops. Furthermore, 9 symposia, workshops, and tutorials have been organized. The source code and data of several publications have been publicly released.
The developed models substantially improved the state of the art for temporally localizing activities in videos. If trained with full supervision, the models already achieve an accuracy that is sufficient for many applications. In the case of weak supervision in the form of protocols with timestamps, we were able to close the gap between the accuracy of fully supervised and weakly supervised approaches. This is a milestone since it substantially reduces the cost of annotating data.

We furthermore moved beyond describing videos only by their activities and introduced the concept of holistic video understanding, which aims to recognize all relevant aspects needed to describe the content of a video, including objects, attributes, and scenery. Conversely, we were also able to generate human poses that correspond to a textual description of an activity such as “a tennis player hitting a tennis ball with a racquet”.

Finally, we demonstrated that learned models for recognizing activities can be adapted to other datasets or modalities. In contrast to previous works that assumed that the unlabeled target dataset contains only videos of activities that are in the labeled source dataset, we overcame this so-called “closed world” assumption and introduced the novel concept of open set domain adaptation, which does not impose such restrictions on the data and thus applies to real-world problems.
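At inference time, the open-set idea can be illustrated as follows; this is a strongly simplified sketch and not the published method: target videos whose prediction is not confident enough for any source class are assigned to an “unknown” class instead of being forced into one of the known classes.

    # Simplified illustration of open-set prediction on unlabeled target videos.
    import torch
    import torch.nn.functional as F

    def predict_open_set(logits, threshold=0.5):
        # logits: (N, n_source_classes) classification scores for target videos
        probs = F.softmax(logits, dim=1)
        conf, pred = probs.max(dim=1)
        pred[conf < threshold] = -1            # -1 marks the "unknown" class
        return pred

    preds = predict_open_set(torch.randn(5, 10))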