
Employment of Advanced Deep Learning and Human-Robot Interaction for Virtual Coaching

Periodic Reporting for period 1 - Dr VCoach (Employment of Advanced Deep Learning and Human-Robot Interaction for Virtual Coaching)

Reporting period: 2021-09-01 to 2023-08-31

The rise in global wealth and well-being has increased the share of elderly people in the population, many of whom require daily assistance and monitoring to live safe and productive lives. Technologies that promote active aging are a promising way to enhance elders' independence and well-being. Regular physical activity is critical for maintaining independence, but elders often need a caregiver present to monitor them. Researchers aim to automate this teaching and monitoring process by proposing virtual coaches capable of suggesting exercises, monitoring their execution, and sending the results to a doctor.
While a computer, a screen, and a camera are sufficient to implement a virtual coach, humanoid robots are a particularly promising platform: a humanoid robot can physically demonstrate the exercises, making the movements easier for elderly users to understand.

The output of our project is a robotic coach designed to assist the elderly during their daily physical training. The robot can:
1 - understand verbal commands from the user through a conversational module
2 - define the sequence of exercises and show them to the user through a control module
3 - monitor the user performing exercises using RGB cameras and identify errors through a video action recognition module.
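
To make the architecture concrete, here is a minimal Python sketch of how the three modules could be wired into a single coaching loop. Every name and interface in it is a hypothetical placeholder for illustration, not the project's actual code.

from dataclasses import dataclass

@dataclass
class Exercise:
    name: str
    repetitions: int

def listen_for_command() -> str:
    # Conversational module stub: return the user's verbal request as text.
    return "start my daily session"

def plan_session(command: str) -> list:
    # Control module stub: map the request to a sequence of exercises.
    return [Exercise("arm raise", 10), Exercise("torso rotation", 8)]

def check_execution(exercise: Exercise) -> bool:
    # Action recognition stub: classify the RGB camera stream and compare
    # it with the exercise the robot just demonstrated.
    return True

def coaching_session() -> None:
    command = listen_for_command()
    for exercise in plan_session(command):
        print(f"Robot demonstrates: {exercise.name} x{exercise.repetitions}")
        ok = check_execution(exercise)
        print("Feedback:", "well done" if ok else "let's try that again")

if __name__ == "__main__":
    coaching_session()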

Although several datasets for video action recognition exist, none completely covers the movements and labels needed for this project. Hence, a new dataset was created by collecting RGB videos of people performing soft gymnastic exercises in a lab environment. This core dataset was then augmented in simulation using a synthetic image generator implemented in Unity: simulated avatars were animated with the originally recorded motion data, and randomization was applied to motion, background, lighting, and camera position, resulting in a large augmented dataset.

The system has been implemented on the Nao robot. The research focus of the project includes:
1 - defining a set of gentle gymnastic actions
2 - introducing a new Unity synthetic data generator
3 - creating a novel dataset to train the action recognition and prediction models
4 - integrating all project modules into the final robotic coach implementation
5 - deploying and testing the robotic coach on subjects in a real-life scenario.
During the first year of the project, we analyzed in depth the state of the art of video data augmentation for action recognition, and we published the paper "Survey on videos data augmentation for deep learning models" in the Future Internet journal. The review shows that data augmentation is transitioning from methods based on basic image transformations to more complex generative and simulation-based approaches.

In consultation with a professional personal trainer, we selected 11 gentle gymnastic exercises for the data collection and test phases of our project. The image named “motions.png” depicts each of the 11 exercises.

We collected a small real-life video dataset in our laboratory, which we subsequently expanded by generating a larger synthetic dataset with our Synthetic Video Generator. Twenty subjects were recorded in a controlled environment inside our lab to produce the training/validation set. The recordings of three additional subjects, collected outdoors under variable conditions, were used to create the test set. The image named “Collected.png” displays example frames extracted from the collected dataset. The collected datasets are accessible online.

We developed a video generator capable of producing synthetic videos of a subject performing specific actions. To test its capability, we augmented the previously collected video dataset by generating a new synthetic video dataset. Each video was generated by randomizing the avatar's appearance, the background image, the scene illumination, the animation speed, and the camera position and orientation. The image named “GeneratorExamples.png” displays some examples of generated frames. The synthetic video generator and other utility scripts are accessible online.
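
The generator itself is implemented in Unity, whose code is not reproduced here; the Python sketch below only illustrates the kind of per-video domain randomization just described. The parameter names and value ranges are assumptions chosen for readability, not the tool's real configuration.

import random

def sample_video_config(rng: random.Random) -> dict:
    # One independent draw per generated video.
    return {
        "avatar": rng.choice(["avatar_a", "avatar_b", "avatar_c"]),  # appearance
        "background": rng.choice(["room", "park", "office"]),        # backdrop image
        "light_intensity": rng.uniform(0.4, 1.6),                    # scene illumination
        "animation_speed": rng.uniform(0.8, 1.2),                    # playback rate
        "camera_distance_m": rng.uniform(2.0, 4.0),                  # camera position
        "camera_yaw_deg": rng.uniform(-30.0, 30.0),                  # camera orientation
    }

rng = random.Random(42)  # fixed seed so a dataset can be regenerated exactly
for _ in range(3):
    print(sample_video_config(rng))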

Both the collected and the augmented datasets were tested on state-of-the-art action recognition models (both CNN- and Transformer-based), and we presented the results in the paper “Synthetic data augmentation for video action classification using Unity”, which also contains the implementation details of the synthetic data generator. The paper is currently under review at the journal “Transactions on Pattern Analysis and Machine Intelligence”.
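
The following sketch outlines a typical protocol for such a comparison: the same architecture is trained once on the real videos alone and once on the synthetic-augmented set, and both runs are scored on the same real-world test set. This is a plausible reading of the setup; the two functions and the file paths are placeholders, not the code released with the paper.

def train_classifier(video_paths):
    # Placeholder for a standard video-classification training loop.
    return {"num_training_videos": len(video_paths)}

def top1_accuracy(model, test_paths):
    # Placeholder evaluation: a real implementation would run inference
    # on every test clip and compare predictions with the labels.
    return 0.0

real_train = [f"real/train/video_{i}.mp4" for i in range(100)]  # hypothetical paths
synthetic = [f"synthetic/video_{i}.mp4" for i in range(1000)]
real_test = [f"real/test/video_{i}.mp4" for i in range(30)]

baseline = top1_accuracy(train_classifier(real_train), real_test)
augmented = top1_accuracy(train_classifier(real_train + synthetic), real_test)
print(f"real only: {baseline:.3f}  real + synthetic: {augmented:.3f}")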

We implemented the final system on the Nao robot. For the conversational module, we used a question-answering model based on ChatGPT. To monitor the subjects' actions, we used the TimeSformer model trained on our augmented dataset. We tested the system with 9 subjects and asked them to fill in a survey about their experience of interacting with the robotic coach. We are now finishing a paper presenting the final system, the results obtained during the test sessions, and the analysis of the surveys.
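
For reference, the snippet below shows how a TimeSformer video classifier can be run with the Hugging Face transformers library. The public Kinetics-400 checkpoint and the random frames are stand-ins for our fine-tuned model and a real clip from the robot's RGB camera, which are not reproduced here.

import numpy as np
import torch
from transformers import AutoImageProcessor, TimesformerForVideoClassification

# A clip of 8 RGB frames (channels-first); random data as a stand-in.
video = list(np.random.randn(8, 3, 224, 224))

processor = AutoImageProcessor.from_pretrained("facebook/timesformer-base-finetuned-k400")
model = TimesformerForVideoClassification.from_pretrained("facebook/timesformer-base-finetuned-k400")

inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted = logits.argmax(-1).item()
print("Predicted action:", model.config.id2label[predicted])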

Regarding the dissemination of the project, we created the project webpage (https://drvcoach.unica.it), which gathers the most useful information about the project. We gave several interviews to national and local media; a collection of them can be found at https://drvcoach.unica.it/news-1.html. We presented the project at the "SHARPER European Researchers' Night 2022" held at the Cagliari Botanical Garden, and in a book chapter titled "Sensor Datasets for Human Daily Safety and Well-being", published by Springer in September 2023.
In the last decade, numerous reviews on image data augmentation for deep learning models have been published. More recently, the popularity of video data augmentation has surged, driven by the many applications based on video analysis. From our research, we identified a gap in the existing literature: no review focused solely on video data augmentation. Our 2022 paper "Survey on videos data augmentation for deep learning models" aimed to address this gap.

The development of our synthetic video generator tool stemmed from the absence of readily available, user-friendly solutions that could meet our specific randomization requirements. While some existing solutions address object detection in static images, there are few options for generating synthetic videos tailored to action recognition, and those that exist are often designed for highly specific scenarios. None of them offers the level of flexibility and randomization needed for our augmented action recognition dataset.
The results of our experiments showed a significant improvement in the generalization capabilities of state-of-the-art action recognition models when augmented with data generated by our tool. Furthermore, our findings highlight that a dataset composed exclusively of synthetic data is sufficient to train models to recognize actions in videos collected in real-world settings, that is, under variable locations, lighting, subjects, and camera positions.
Example frames generated by our synthetic video generator
The 11 gentle gymnastic exercises chosen with the help of a professional personal trainer
Example frames extracted from the training/validation set (top row) and the test set (bottom row)