
A Model for Predicting Perceived Quality of Audio-visual Speech based on Automatic Assessment of Intermodal Asynchrony

Final Report Summary - PERCQUALAVS (A Model for Predicting Perceived Quality of Audio-visual Speech based on Automatic Assessment of Intermodal Asynchrony)

The aim of the project was to develop a model for predicting the perceived quality of asynchronous audio-visual speech. The project was conceived as an inherently multidisciplinary endeavour encompassing aspects of computer vision, speech processing, cognitive science, machine learning and statistical computing. It comprised the following main components:

1. A computer vision and speech processing component that would extract useful audio-visual features from an input signal. These features would form the input to which an automatic asynchrony detection measure would be applied.

2. A data gathering component, which would oversee the acquisition of new primary data (mainly in the form of a new audio-visual corpus for use in the project) as well as the gathering of subjective perceptual response data, through the design and execution of several perceptual experiments.

3. An analytics component, which would involve the exploration and analysis of the perceptual responses gathered through experimentation. The results of these analyses would form the basis for an initial model of audio-visual asynchrony perception.

4. A machine learning component, which would use the model resulting from component 3 above as the basis for an automatic system that measures asynchrony in an audio-visual input and predicts how a human observer would perceive the quality of such asynchronous input.

The technical implementation and execution of the project was its most successful aspect. The fellow developed computer vision-based feature extractors that automatically tracked the lips in real time and extracted relevant and useful features from them. In conjunction with acoustic feature extractors built using standard speech processing toolkits, the fellow was able to generate useful digital representations of the audio-visual speech inputs, which formed the basis of the project's analysis.
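The report does not name the specific trackers or toolkits used. A minimal sketch of such a feature extraction pipeline, assuming OpenCV for the visual stream and librosa for the acoustic stream, might look as follows; the mouth-region heuristic and the intensity-variance feature are illustrative placeholders, not the project's actual method.

```python
# Hypothetical sketch: extract a per-frame mouth-movement feature (visual)
# and MFCCs (acoustic) from recorded speech. OpenCV and librosa are
# assumptions; the report does not name the toolkits actually used.
import cv2
import librosa
import numpy as np

def visual_features(video_path):
    """Detect the face frame by frame and return a crude per-frame
    mouth-activity feature (a proxy for lip opening)."""
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    feats = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, 1.3, 5)
        if len(faces) == 0:
            feats.append(np.nan)  # no face found in this frame
            continue
        x, y, w, h = faces[0]
        # The lower third of the face box approximates the mouth region.
        mouth = gray[y + 2 * h // 3 : y + h, x : x + w]
        # Intensity variance as a crude stand-in for mouth openness.
        feats.append(float(mouth.var()))
    cap.release()
    return np.array(feats)

def acoustic_features(wav_path, n_mfcc=13):
    """Return MFCCs, a standard acoustic representation of speech."""
    y, sr = librosa.load(wav_path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
```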

The fellow also developed software that processed these extracted features automatically and applied several techniques for measuring the degree of synchrony or asynchrony between the audio and visual components. The resulting asynchrony score was mapped onto the same perceptual response scales used during the perceptual experiments, so that the accuracy of the automatic asynchrony detection could be assessed by direct comparison with the experimental data.
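The report does not detail which synchrony measures were applied. One common technique is to cross-correlate an acoustic energy envelope with a visual mouth-movement feature, with both streams resampled to a common frame rate, and take the lag at the correlation peak. The sketch below assumes that approach; the mapping onto a five-point quality scale is purely illustrative, with invented breakpoints.

```python
# Hypothetical sketch of one common asynchrony measure: cross-correlate
# the (standardised) audio and visual feature streams and report the lag
# that maximises their correlation.
import numpy as np

def estimated_lag_frames(audio_env, visual_feat):
    """Return the audio-visual offset, in frames, that best aligns
    the two feature streams (both sampled at the same frame rate)."""
    a = (audio_env - audio_env.mean()) / audio_env.std()
    v = (visual_feat - visual_feat.mean()) / visual_feat.std()
    xcorr = np.correlate(a, v, mode="full")
    return int(np.argmax(xcorr)) - (len(v) - 1)

def lag_to_quality(lag_frames, fps=25.0):
    """Map an absolute lag onto an illustrative 1-5 quality scale:
    small offsets are imperceptible, large ones degrade quality.
    The breakpoints are invented for illustration only."""
    lag_ms = abs(lag_frames) * 1000.0 / fps
    for threshold_ms, score in [(40, 5.0), (80, 4.0), (160, 3.0), (320, 2.0)]:
        if lag_ms <= threshold_ms:
            return score
    return 1.0
```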

This direct comparison between the experimentally gathered perceptual responses and the automatically rated asynchrony also allowed the fellow to develop a learned model of audio-visual asynchrony perception using machine learning techniques.
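Again, the report does not name the learning method used. As one plausible illustration, a simple regression from asynchrony-derived features to the subjective ratings could be fitted and cross-validated with scikit-learn; all data values below are invented placeholders.

```python
# Hypothetical sketch: learn a mapping from automatically measured
# asynchrony features to subjective quality ratings. Ridge regression
# is shown purely as an example; the project's actual method is not
# specified in the report.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Placeholder data: one row per stimulus, with asynchrony-derived
# features (estimated lag in ms, correlation peak height), paired with
# the mean opinion score each stimulus received in the experiments.
X = np.array([[40.0, 0.90], [120.0, 0.70], [280.0, 0.40], [20.0, 0.95]])
y = np.array([4.8, 3.9, 2.1, 5.0])

model = Ridge(alpha=1.0)
scores = cross_val_score(model, X, y, cv=2,
                         scoring="neg_mean_absolute_error")
model.fit(X, y)
print("CV mean absolute error:", -scores.mean())
print("Predicted quality for a 200 ms lag:",
      model.predict([[200.0, 0.5]])[0])
```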

The project was considerably less successful in terms of its dissemination and research outputs, particularly publications. The fellow wrote and submitted regularly to conferences, but the results were generally judged by reviewers to be intermediate, and the papers were rejected. Unfortunately, one paper written towards the end of the project was accepted but had to be withdrawn because the notice of acceptance arrived after the project funding had closed.

Although significant progress was made in developing software, tools and processes for each of the components outlined above, the fellow's focus in the initial stages of the project probably lay too heavily on these areas. In retrospect, it would have been more productive to concentrate on more visible results, such as targeted experimental studies, rather than on the narrow task of acquiring perceptual data. This broader outlook would probably have been rewarded with a more favourable acceptance-to-submission ratio for publications.

However, the technical and professional gains made by the fellow in acquiring extensive skills in new and useful areas such as data analysis, machine learning and computer vision should not be overlooked, and may in some small way offset the disappointing performance of the project in other areas.