Content archived on 2024-06-18

Vision and Hearing in Action

Final Report Summary - VHIA (Vision and Hearing in Action)

The VHIA project developed methods and associated algorithms and software for audio-visual machine perception and human-robot interaction (HRI). In particular, the project exploited the complementary roles played by audio and vision and proposed a novel methodology for audio-visual fusion. The prevailing paradigm for combining these two sensory modalities has been based on temporal correlation, either using unsupervised machine learning techniques, e.g. canonical correlation analysis (CCA), or deep neural network (DNN) architectures. Both CCA- and DNN-based methods assume some form of temporal correlation between audio signals and visual flows; for example, they have exploited synchronization cues between speech and lip movements.

In contrast, VHIA developed fusion algorithms based on the spatial co-occurrence of visual and audio features, in addition to temporal synchronization. This made it possible to address a much wider class of audio-visual fusion problems; in particular, the proposed methods require neither frontal views of people nor clean speech signals. The fusion task was cast in the framework of latent-variable Gaussian mixture models and Bayesian inference. Methods for the following tasks were developed, implemented and validated: audio-visual feature alignment, multiple-person and multiple-speaker localization and tracking, audio-visual speaker diarization, and audio-visual robot control.

This required the development of several state-of-the-art audio signal processing, computer vision and machine learning techniques: audio source separation for static and moving sources, speech enhancement, speech dereverberation, head- and eye-gaze detection and tracking, human activity recognition, high-dimensional regression, weighted-data clustering, variational inference, switching linear dynamical systems, and deep regression. These methods and algorithms were implemented as software packages, optimized to be both efficient and robust, and embedded into several robotic demonstrators.

It is well established that people communicate using verbal information (speech and language) as well as non-verbal information (facial expressions, visual gaze, head movements, prosody, etc.), and that these multi-modal cues are intimately mixed together, particularly in social contexts. Nevertheless, currently available HRI systems rely on interface technologies borrowed from smartphones, namely touch screens and voice commands. The VHIA project has opened the door to the integration of audio-visual machine perception, interaction and communication into HRI, well beyond the current state of the art in social robotics.
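As an illustration of the temporal-correlation baseline mentioned above, the following minimal Python sketch fits CCA between synchronized audio and visual feature streams. The feature dimensions, variable names and random data are assumptions chosen for illustration only and do not come from the project.

    # Minimal sketch of a CCA-based temporal-correlation baseline.
    # Feature dimensions and data are illustrative assumptions.
    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(0)
    T = 500                                      # synchronized time frames
    audio_feats = rng.standard_normal((T, 20))   # e.g. per-frame spectral features
    visual_feats = rng.standard_normal((T, 40))  # e.g. per-frame lip-region descriptors

    # Find maximally correlated audio/visual projections over time.
    cca = CCA(n_components=2)
    audio_proj, visual_proj = cca.fit_transform(audio_feats, visual_feats)

    # High canonical correlations indicate audio-visual temporal synchrony,
    # e.g. between speech energy and lip movement.
    corrs = [np.corrcoef(audio_proj[:, k], visual_proj[:, k])[0, 1]
             for k in range(audio_proj.shape[1])]
    print(corrs)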
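The spatial co-occurrence idea can likewise be sketched with an off-the-shelf Gaussian mixture: audio localization cues and visual detections, mapped into a common image-plane space, are softly assigned to per-speaker components. This is a simplified, hypothetical illustration, not the project's latent-variable model or software.

    # Simplified sketch: fuse audio and visual observations by spatial
    # co-occurrence, with one Gaussian mixture component per speaker.
    # Coordinates, noise levels and counts are assumptions for illustration.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(1)
    speakers = np.array([[100.0, 120.0], [400.0, 130.0]])  # image-plane positions

    # Visual face detections (accurate) and sound-source localizations (noisier).
    visual_obs = np.vstack([p + rng.normal(0, 5, (30, 2)) for p in speakers])
    audio_obs = np.vstack([p + rng.normal(0, 25, (30, 2)) for p in speakers])
    observations = np.vstack([visual_obs, audio_obs])

    # EM fits one component per hypothesized speaker; the responsibilities give
    # a soft assignment of every audio or visual observation to a speaker.
    gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
    gmm.fit(observations)
    print(gmm.means_)                        # estimated speaker locations
    print(gmm.predict_proba(audio_obs)[:3])  # audio cues assigned to speakers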