CORDIS - EU research results

Audio-Visual Speech Processing for Interaction in Realistic Environments


New ways to enhance speech recognition

If computers could read lips just as humans do, what techniques would be needed to capture speech effectively using inexpensive equipment? This is the question addressed by EU researchers working to improve how speech recognition systems distinguish the voices of multiple speakers under real-life conditions.

Industrial Technologies

Research into audiovisual (AV) speech recognition has hitherto focused on extracting visual information from the speaker's mouth to help computers understand fluently spoken speech. Most work, however, has been limited to ideal-case scenarios in which the visual data are of high quality: expensive cameras capture high-resolution images of a single speaker who does not move freely and who keeps a frontal head pose relative to the camera, usually under ambient lighting conditions.

Against this backdrop, the 'Audio-visual speech processing for interaction in realistic environments' (AVISPIRE) project aimed to use inexpensive AV systems in more realistic environments. Starting from the traditional single-speaker scenario with high-quality data, the project extended voice recognition to multi-speaker environments under different lighting conditions. It also investigated the output of new sensing devices that can collect the desired speech data.

Initial efforts focused on developing the basic components of an automatic AV speech recogniser. For the visual front-end sub-system, the team implemented the AdaBoost algorithm to detect the speaker's face, together with a normalised extractor that isolates the region of interest (the mouth) from the image. A multi-stream hidden Markov model was used to fuse the audio and visual data.

Both the visual front-end and the AV fusion sub-systems were extended to include the Microsoft Kinect sensor. This inexpensive device provides motion information while the speaker talks, improving robustness to varying head poses. Work then focused on collecting data to create a bilingual AV corpus enriched with motion information; apart from English, this database includes recordings in Greek.

Finally, project partners explored how human pattern recognition can be used to improve the robustness of AV speech recognition. Prior knowledge of the number of speakers in the scene, as well as their locations, should significantly improve results by allowing the system to adjust to the speakers' mouth contours.

Human-computer interaction based on speech recognition is finding more and more applications, but it still has a long road ahead. By making it possible to understand more than one person speaking under conditions close to real life, AVISPIRE has helped make this interaction more robust.
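To make the pipeline described above concrete, the following minimal Python sketch (an illustration, not the AVISPIRE code) shows the kind of processing involved: AdaBoost-based face detection using OpenCV's Haar cascades, extraction of a normalised mouth region of interest, and exponent-weighted fusion of per-stream log-likelihood scores of the sort used in multi-stream HMM decoding. The crop proportions and the stream weight are illustrative assumptions.

import cv2

# OpenCV ships Haar cascades trained with AdaBoost (the Viola-Jones detector).
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def mouth_roi(frame, roi_size=(64, 32)):
    """Detect the largest face and return a normalised mouth-region crop."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Assume the largest detection is the speaker of interest.
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    # Heuristic assumption: the mouth lies in the lower third of the face box.
    mouth = gray[y + 2 * h // 3 : y + h, x + w // 4 : x + 3 * w // 4]
    # Normalise to a fixed size so downstream visual features are comparable.
    return cv2.resize(mouth, roi_size)

def fuse_stream_scores(log_p_audio, log_p_video, w_audio=0.7):
    """Multi-stream fusion: weighted combination of per-stream log-likelihoods,
    the score form used in multi-stream HMM decoding. The weight 0.7 is an
    illustrative value, not a project result."""
    return w_audio * log_p_audio + (1.0 - w_audio) * log_p_video

In practice, the stream weight would be tuned to the acoustic conditions: the noisier the audio, the more weight shifts towards the visual stream.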

Keywords

Real-life environment, audiovisual, speech recognition, motion recognition, light conditions, human-computer interaction
