The main objective of the project is to approach the problem of videophone coding from a combined audio-video point of view, for both analysis and synthesis. The motivating idea is that interpersonal audio-video communication is an information source that can easily be modelled and characterised: in audio by the human speaker's voice, and in video by his/her face.
Two demonstrators will be implemented: i) a hardware platform with an H.324 coder/decoder integrated with a board for speech analysis and articulation estimation, lip extraction and tracking, and audio-assisted frame interpolation to increase the frame rate; ii) a software demonstrator of a hybrid coding scheme compliant with MPEG-4, in which audio/video analysis and synthesis are used to compose the natural background with the speaker's face, represented by a synthetic 3D model.
Through a successful submission to the ACTS 3rd Call, VIDAS activities have been extended to November 1999 in order to maintain commitment to MPEG-4 Version 2 until its conclusion. In the Version 2 phase, VIDAS will propose an improved definition of MPEG-4 profiles and levels, with specific reference to Face and Body Animation, and will actively participate in the related Core Experiments.
VIDAS organised the 1st Int. Workshop on SNHC and 3D Imaging in Rhodes on 5-9 September 1997. A second workshop is planned for September 1999.
From the technical point of view, two goals are pursued: first, extending the standard H.324 coding scheme with speech-assisted frame interpolation; second, implementing a software prototype of a hybrid scheme based on segmenting the scene into a component that can be modelled (the speaker's face) and another that cannot (the background). The region of the speaker's face is encoded through model-based algorithms assisted by speech analysis, while the background is encoded through region-based algorithms. Activities focus on developing and testing suitable algorithms for estimating lip movements from speech, segmenting the speaker's face region from input images, extracting and tracking the speaker's facial parameters, and synthesising them realistically, based either on simple 2D meshes or on complex deformable 3D models.
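The final composition step of such a hybrid scheme can be illustrated with a minimal sketch. It assumes, purely for illustration, that the decoder already has a decoded background, a rendered synthetic face layer and a per-pixel face mask of the same size; the function name `composite` and the greyscale list-of-rows image type are hypothetical and are not taken from the project's software.

```python
from typing import List

Image = List[List[int]]    # greyscale image: rows of pixel values in 0..255
Mask = List[List[float]]   # per-pixel face coverage: 1.0 = face, 0.0 = background


def composite(background: Image, face: Image, mask: Mask) -> Image:
    """Blend a synthetic face layer over a decoded natural background."""
    out: Image = []
    for bg_row, face_row, mask_row in zip(background, face, mask):
        out.append([
            # Weighted average of the two layers, rounded to an integer pixel.
            round(m * f + (1.0 - m) * b)
            for b, f, m in zip(bg_row, face_row, mask_row)
        ])
    return out
```

Fractional mask values along the face contour soften the seam between the model-based and region-based layers; a real system would derive the mask from the face segmentation step.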
A suitable English multi-speaker audio-video database has been acquired to allow the maximum level of system independence compatible with the scientific and technological state of the art. The H.324 software demonstrator with integrated analysis/synthesis algorithms has already been completed, while its integration into the hardware H.324 prototype is nearing completion.
The implementation of the software demonstrator of the SNHC hybrid scheme is also in progress. The 3D parameterised structure used to model and animate the speaker's head has been supplied to the MPEG-4 SNHC verification model, and has been made compliant with the Facial Animation Parameters (FAP) standardised in SNHC. Current work concerns upgrading the model to the Facial Definition Parameters (FDP), which govern the calibration of the shape and texture of the face polygon mesh to any specific face.
Processing of real images and extraction of the facial features used to reproduce motion and facial expression on a synthetic 3D head model. (Images produced at Miralab at the University of Geneva and at the LIG laboratory at EPFL, Lausanne.)
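How an animation parameter might move a calibrated model can be sketched as follows. The sketch assumes that a FAP value is expressed in a face-relative unit, here taken as 1/1024 of the mouth-nose separation measured on the neutral face, so that the same parameter stream scales to differently proportioned heads. The class `FaceModel`, the function `apply_fap` and the feature-point name `chin` are hypothetical, and moving a single point along a fixed direction is a simplification of what a full deformable model does.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class FaceModel:
    # Neutral-face positions of feature points (hypothetical names, model units).
    feature_points: Dict[str, Vec3]
    # Mouth-nose separation measured on the neutral face, in the same units.
    mouth_nose_separation: float


def apply_fap(model: FaceModel, point: str, direction: Vec3, fap_value: float) -> Vec3:
    """Displace one feature point by a FAP value given in face-relative units."""
    # Assumption: one unit is 1/1024 of the neutral mouth-nose separation.
    unit = model.mouth_nose_separation / 1024.0
    displacement = fap_value * unit
    x, y, z = model.feature_points[point]
    dx, dy, dz = direction
    return (x + displacement * dx, y + displacement * dy, z + displacement * dz)
```

For example, with a mouth-nose separation of 32 model units, a value of 320 applied along `(0.0, -1.0, 0.0)` lowers the point by 10 units. Calibrating the model to a specific face changes the neutral positions and the separation, but not the parameter stream.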
Summary of Trial
Through constant interaction with end user associations (NAD: Irish National Association for the Deaf), a set of suitable subjective experiments has been defined to formalise the visual relevance of speech articulation and co-articulation. This activity has led to the definition of a suitable evaluation protocol used to assess the quality of the achieved results.
By taking into account the bimodal nature of the mechanisms of speech production and perception, experiments have been carried out to investigate the sensitivity of a human perceiver to variations in the many system parameters. The outcomes of these experiments form a knowledge base that has been used for speech analysis and image synthesis. The quality assessment will be carried out in co-operation with the end user associations, exploiting the expertise of the end users: test experiments will be run with a group of hearing-impaired observers, who will give their subjective opinions. These opinions will be collected through a suitable questionnaire, also defined by the end users.
- A synchronised audio-video corpus of 10 speakers, composed of single utterances of 700 English words, has been acquired and processed to allow bimodal multi-speaker speech processing.
- A set of tools has been developed for extracting the region of the speaker's mouth from QCIF H.324 images and for generating extrapolated frames in which the mouth movements are synthesised based on parameters extracted from speech analysis.
- A real-time H.324 board, based on a Trimedia component, with extrapolation of synthesised video frames.
- A set of tools for face region segmentation and for 3D facial feature extraction and tracking.
- A 3D model of a synthetic human face, compliant with MPEG-4, driven by Facial Definition Parameters (FDP) and Facial Animation Parameters (FAP).
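The idea of generating additional frames from speech-derived mouth parameters can be sketched as follows. This is an illustration under stated assumptions, not the project's algorithm: it assumes a single scalar mouth-opening parameter, measured from video at two decoded frames and estimated from speech at every intermediate tick, and it works between two decoded frames rather than extrapolating beyond the last one. The function name `audio_assisted_interpolation` is hypothetical.

```python
from typing import List


def audio_assisted_interpolation(v0: float, v1: float, audio: List[float]) -> List[float]:
    """Return mouth-opening values for every tick from one decoded frame to the next.

    v0, v1: mouth opening measured in the two decoded video frames.
    audio:  speech-derived estimates at each tick, with audio[0] aligned to v0
            and audio[-1] aligned to v1 (at least two entries).
    """
    if len(audio) < 2:
        raise ValueError("need audio estimates at both decoded frames")
    n = len(audio) - 1
    start_err = v0 - audio[0]    # video/audio disagreement at the first frame
    end_err = v1 - audio[-1]     # video/audio disagreement at the second frame
    result = []
    for i, a in enumerate(audio):
        alpha = i / n
        # Linearly blend the two end-point corrections, so the trajectory
        # follows the audio shape but matches the video at both decoded frames.
        correction = (1.0 - alpha) * start_err + alpha * end_err
        result.append(a + correction)
    return result
```

Plain linear interpolation between `v0` and `v1` would miss a mouth opening and closing that happens entirely between two decoded frames; letting the audio track shape the in-between values is what recovers that articulation.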
The VIDAS project intends to show that major improvements can be brought to model-based coding schemes. Model-based coding techniques are due to be standardised in MPEG-4 by the end of the project, so products integrating them are expected to be widely deployed by the end of the century. For industrial companies, mastering these techniques and introducing them into their devices is therefore a major issue. The VIDAS software demonstrator is the means of showing the improvements to model-based codecs made possible by joint video-speech analysis. For the industrial partners, it is also the means of acquiring know-how on these advanced techniques, which are very promising in terms of compression efficiency and new functionality (movie dubbing, automatic generation of cartoons, automatic translation, virtual actors on computer, ...).
Example of an MPEG-4 compliant calibration procedure for adapting the model to the geometry and the texture of a real face. (Images were produced at Miralab at the University of Geneva and at the LIG laboratory at EPFL, Lausanne.)
The number of users who could benefit from the project's outcomes is large, ranging from the ordinary consumer to people with severe hearing impairment. The goals of the project are oriented to the general improvement of the subjective visual quality of images in a narrow-band videophone. Everyone will benefit from this improvement, since the images will look more natural.
Moreover, for users with hearing impairments the benefit will be dramatic. For them the videophone will no longer be a "useless" advanced telephone, but will become the privileged means of communication. In some cases, lip-reading rehabilitation could even be delivered through remote teaching via videophone. Between these two extremes, the normal-hearing user and the deaf user, there is a wide variety of other potential consumers, first of all elderly people, who could benefit greatly from the videophone improvements achieved by the project.
Main contributions to the programme objectives:
Essential contribution to the standardisation of facial and body animation parameters (within MPEG-4)
Contribution to the programme
Work will soon result in standardised animated avatars widely used in many web and broadcast applications
- Audio/video synchronisation
- Model-based video coding
- 3D modelling and animation
- Synthetic/Natural Hybrid Coding
Funding Scheme: CSC - Cost-sharing contracts