
Audio-VIsual Speech Processing for Interaction in Realistic Environments

Final Report Summary - AVISPIRE (Audio-VIsual Speech Processing for Interaction in Realistic Environments)

1.1. Project Objectives

The field of audio-visual speech processing has attracted significant interest over the past 15 years. Relevant research has focused on recruiting visual speech information, extracted from the speaker’s mouth region, as a means to improve the robustness of traditional, unimodal, acoustic-only speech processing. Nevertheless, to date, most work has been limited to ideal-case scenarios, where the visual data are of high quality: typically a steady frontal head pose, captured by high-end cameras in high resolution, under uniform lighting conditions, and with only a single subject present. Clearly, this remains far from the desired setting of unconstrained, multi-party human interaction captured by low-cost sensors; it therefore comes as no surprise that practical audio-visual speech systems have yet to be deployed in real life.

In project AVISPIRE, short for “Audio-VIsual Speech Processing for Interaction in Realistic Environments”, the objective has been to expand the state of the art from idealised “toy” examples to realistic, practical scenarios of human-computer interaction in difficult environments sensed by inexpensive equipment. Starting from the traditional single-speaker scenario with relatively constrained audio-visual variability and high-quality data, the project was designed to extend its work to multi-speaker data sets with realistic pose and illumination variation, while also investigating novel sensing devices that can informatively capture the desired speech data. Successful audio-visual speech processing in such settings requires progress beyond the state of the art in the robust extraction of visual speech information, as well as in its efficient fusion with the acoustic modality, given the varying quality of the extracted streams.

AVISPIRE has spanned 42 months of activity, presenting a natural evolution of the prior research efforts of the beneficiary researcher, Dr. Gerasimos Potamianos, while in the U.S. The work has been conducted jointly with the host organization, the Institute of Informatics and Telecommunications at the National Centre for Scientific Research “Demokritos” in Athens, Greece, and in particular its SKEL (Software and Knowledge Engineering Laboratory) and CIL (Computational Intelligence Laboratory) units. In a nutshell, AVISPIRE has aimed to:

• Integrate the state-of-the-art knowledge of the researcher in the area of audio-visual speech processing with the state-of-the-art capabilities of researchers at the host institute in the areas of machine learning, knowledge representation, visual processing, and fusion.
• Apply audio-visual speech processing technologies to realistic, challenging environments.
• Derive robust visual processing methods based on inexpensive sensors and data fusion, whenever feasible, and audio-visual speech fusion methods robust to degradation.
• Develop methods for exploiting speaker localization from non-audio-visual modalities (e.g. range data) in order to improve the robustness of audio-visual analysis.
• Develop audio-visual speech recognition technology for the Greek language.

1.2. Work Performed and Results Achieved

To meet the AVISPIRE objectives, the following activities have been conducted.

Baseline audio-visual speech recognition system for ideal conditions:
Initial efforts focused on developing the main components of an audio-visual automatic speech recognizer under the ideal visual data case. This work utilized the CUAVE corpus, a widely benchmarked dataset in the literature (in US English), freely available from Clemson University. A visual front-end sub-system has been implemented to extract visual features from video of the speaker’s face, consisting of an AdaBoost-based face and mouth detector, a normalized region-of-interest extractor, and a representation based on appropriate feature selection and transformation techniques applied to the discrete cosine transform of that region. The module has been complemented by an audio-visual decision fusion component, utilizing a multi-stream hidden Markov model.
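As a rough illustration of such a visual front end (a minimal sketch, not the project’s actual implementation), the code below uses OpenCV’s Haar-cascade (AdaBoost-based) face detector, approximates the mouth region as the lower third of the face box, and represents the normalized region of interest by low-order discrete cosine transform coefficients. The ROI size, the number of retained coefficients, and the mouth-localization heuristic are illustrative assumptions.

import cv2
import numpy as np

# AdaBoost-trained (Haar-cascade) frontal face detector shipped with OpenCV.
FACE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_visual_features(frame_bgr, roi_size=64, n_coeffs=30):
    """Return a DCT-based visual speech feature vector for one video frame,
    or None if no face is found. Parameters are assumed, not AVISPIRE's."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Keep the largest detected face.
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    # Crude mouth localization: lower third of the face box.
    mouth = gray[y + 2 * h // 3 : y + h, x : x + w]
    # Normalized region of interest: fixed size, zero mean, unit variance.
    roi = cv2.resize(mouth, (roi_size, roi_size)).astype(np.float32)
    roi = (roi - roi.mean()) / (roi.std() + 1e-6)
    # 2-D DCT; keep an upper-left (low-frequency) block of coefficients as a
    # stand-in for the feature selection / transformation step.
    dct = cv2.dct(roi)
    k = int(np.ceil(np.sqrt(n_coeffs)))
    return dct[:k, :k].flatten()[:n_coeffs]

In a full system, such per-frame features would typically be interpolated to the acoustic frame rate and combined with the acoustic features in the multi-stream hidden Markov model mentioned above.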

Audio-visual speech recognition using the Kinect:
The baseline approach has subsequently been extended to incorporate a novel, popular, and inexpensive sensor, the Microsoft Kinect. This device has provided an additional stream to the traditional audio and planar video modalities, namely visual depth information of the speaker’s face, thus improving robustness to illumination and pose variability. Both the visual front-end and the audio-visual fusion subsystems have been appropriately extended to incorporate this added stream of information. Such work has required collecting an appropriate database, named BAVCD, i.e. the bilingual audio-visual corpus with depth information. In addition to US English, this database also includes recordings in Greek, an under-represented EU language in terms of linguistic resources.
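Decision fusion in a state-synchronous multi-stream hidden Markov model is commonly realized as a weighted combination of per-stream log-likelihoods; the sketch below shows this for the three streams mentioned above. The stream weights are assumed values for illustration, not weights estimated in the project.

import numpy as np

def fuse_stream_log_likelihoods(logp_audio, logp_video, logp_depth,
                                weights=(0.6, 0.25, 0.15)):
    """Combine per-HMM-state log-likelihoods of the audio, planar-video, and
    depth streams with exponent weights (assumed values, summing to 1)."""
    w_a, w_v, w_d = weights
    if not np.isclose(w_a + w_v + w_d, 1.0):
        raise ValueError("stream weights should sum to 1")
    return (w_a * np.asarray(logp_audio)
            + w_v * np.asarray(logp_video)
            + w_d * np.asarray(logp_depth))

# Example: fused score of one HMM state for one time frame.
print(fuse_stream_log_likelihoods(-12.3, -15.8, -17.1))

In practice, such weights are usually tuned on held-out data or adapted to an estimate of each stream’s reliability, for instance by lowering the video and depth weights when face tracking is poor.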

Audio-visual speech recognition in Greek:
Based on the aforementioned BAVCD corpus, collected as part of the AVISPIRE activities, the first ever audio-visual automatic speech recognition (AVASR) system in Greek has been developed, paralleling the approach employed in the development of the English system.

Challenging visual environments:
Work has progressed on the visual front end in three challenging domains: (i) broadcast news, based on the GRIDNEWS database of Greek television programs; (ii) the automobile interior, based on data provided by the University of Texas at Dallas; and (iii) a multi-sensory domain of multiple-human-robot interaction, employing the newly acquired robotic platform at SKEL (the acquisition of this platform has been partially funded by AVISPIRE for the purposes of these experiments).

Additional speech technologies:
Further to core automatic speech recognition, additional speech processing technologies have been considered as part of AVISPIRE, such as emotion recognition from speech.

Fusion with non audio-visual modalities:
Significant work has been carried out on recognizing human figures (e.g. in range data) in order to estimate the number and locations of the speakers present in the scene. This prior information was shown to significantly improve the results of the purely audio-visual analysis.
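One simple way such a prior can help, sketched below under assumed data structures, is to retain only those visual face candidates that lie near a person location estimated from the range sensor, so that spurious detections do not reach the audio-visual recognizer. The function name, coordinate representation, and distance threshold are hypothetical.

from typing import List, Tuple

Point = Tuple[float, float]  # image-plane coordinates (pixels)

def gate_face_candidates(face_centers: List[Point],
                         person_locations: List[Point],
                         max_dist: float = 80.0) -> List[Point]:
    """Keep only face candidates within `max_dist` pixels of a person location
    estimated from the range (depth) sensor. Names and threshold are
    illustrative assumptions, not AVISPIRE's implementation."""
    def dist(a: Point, b: Point) -> float:
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    return [f for f in face_centers
            if any(dist(f, p) <= max_dist for p in person_locations)]

# Example: two face detections, one of which is far from any tracked person.
print(gate_face_candidates([(320.0, 240.0), (40.0, 400.0)],
                           [(330.0, 250.0)]))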

Dissemination activities:
A number of dissemination activities have been carried out as part of the project. In particular, three talks promoting the project have been given, six presentations have been delivered at conferences, eight papers have been published, and five student theses have been initiated or completed.

1.3. Project Impact

AVISPIRE has played an important role in the successful integration of the beneficiary researcher into the research and academic environment of the host institute and country, following a long career outside Europe. Furthermore, it has provided the means for bi-directional knowledge transfer between the beneficiary and the relevant host institute labs, and in particular has yielded significant synergies with the robotics activities at the host, as part of the human-robot realistic-environment scenario investigated under the project. In addition, it has helped enhance or establish collaborations between the host and other academic institutions.

In summary, the AVISPIRE research results represent a radical departure from the traditional paradigm of audio-visual speech processing for a single speaker under ideal visual conditions. As such, the AVISPIRE work has contributed to the long road towards robust speech-based human-computer interaction in a wide range of environments and conditions. This is a key enabler of natural interactivity, as well as of easier and faster information access by the broader public.

1.4. Project Website

The project website is accessible at: http://avispire.iit.demokritos.gr