
A Model for Predicting Perceived Quality of Audio-visual Speech based on Automatic Assessment of Intermodal Asynchrony

Assessing audiovisual speech quality

When video lags behind speech, the effect can be discouraging for technology users. New research into effectively measuring asynchronous audiovisual signals could help address this phenomenon.

Industrial Technologies

High-tech audiovisual communication is becoming a very common form of exchange, from satellite teleconferencing to live chat on smartphones. Simple as the idea seems, keeping picture and voice synchronised is challenging yet crucial for the success of such complex applications. Users who perceive the communication as out of sync may switch to other means of communication.

Against this backdrop, the EU-funded project PERCQUALAVS aimed to measure synchrony between the visual and acoustic speech elements of such technologies. Building on fields such as computer vision, cognitive science, machine learning and speech processing, it conceived a model to predict the perceived quality of audiovisual speech.

To achieve its aims, the project was divided into four parts. The first extracted key audiovisual features from an input signal for use in automatic asynchrony detection. The second gathered subjective response data through several perceptual experiments. The third analysed the perceptual responses gathered, while the fourth was a machine learning component that predicts human perception of asynchronous input.

The project team successfully developed computer vision-based feature extractors that track the lips in real time and extract valuable data, as well as speech processing toolkits to assist in analysing the data. Another key achievement was software that processes the extracted features to measure synchrony and map the results. This enabled comparison between users' perceptual responses and automatically generated results.

While the project's results were not disseminated adequately because of various technical and time constraints, they have laid the groundwork for more research in the field. This is a step forward for assessing and improving audiovisual technology, which is growing rapidly worldwide.
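The article does not say how the project's software measured synchrony. As a rough illustration of the general idea only, one common approach is to compare a per-frame audio energy envelope with a per-frame lip-opening measure and look for the time shift at which the two match best. The sketch below assumes both signals are already sampled at the video frame rate; the function name `estimate_lag`, its parameters and the toy data are hypothetical and are not taken from PERCQUALAVS.

```python
import math


def estimate_lag(audio_env, lip_open, max_lag):
    """Estimate audio-video offset by normalised cross-correlation.

    audio_env: per-frame audio energy values (assumed input).
    lip_open:  per-frame lip-opening values (assumed input).
    max_lag:   largest shift, in frames, to test in either direction.

    Returns (lag, score). A positive lag means the video trails the
    audio by that many frames.
    """
    def centred(values):
        mean = sum(values) / len(values)
        return [v - mean for v in values]

    a, v = centred(audio_env), centred(lip_open)
    best_lag, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        # Pair audio frame i with video frame i + lag where both exist.
        pairs = [(a[i], v[i + lag])
                 for i in range(len(a)) if 0 <= i + lag < len(v)]
        if len(pairs) < 2:
            continue
        num = sum(x * y for x, y in pairs)
        den = math.sqrt(sum(x * x for x, _ in pairs) *
                        sum(y * y for _, y in pairs))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score


# Toy data (made up): the lip signal is the audio envelope delayed by 3 frames.
audio = [0, 0, 1, 4, 2, 0, 0, 3, 0, 0, 0, 5, 1, 0, 2, 0, 0, 0, 0, 0]
video = [0, 0, 0] + audio[:-3]

lag, score = estimate_lag(audio, video, max_lag=5)
frame_rate = 25  # frames per second, assumed
lag_ms = lag * 1000 / frame_rate
```

A lag estimated this way is only an objective measurement. Predicting how viewers actually judge a given offset would still require perceptual data of the kind the project collected in its experiments.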

Keywords

Audiovisual, speech quality, asynchronous, communication, machine learning, human perception, asynchronous input
