Audiovisual to Articulatory Speech Inversion

ASPI is concerned with the recovery of vocal tract shape dynamics from an acoustic speech signal supplemented by image analysis of the speaker's face. It is (i) developing inversion methods, with emphasis on audiovisual-to-articulatory inversion and on the investigation of additional constraints and optimization methods to reduce the under-determination of the inversion, and (ii) constructing a multimodal articulatory database based on ultrasound, MRI, and facial motion capture.

ASPI may lead to a much-needed breakthrough in our understanding of speech and in our approach to speech research, given its focus on multimodal data collection, its activities to publicize data collection protocols and the technical specifications of the data collection equipment, and the activities planned to exploit the data.

Audiovisual-to-articulatory inversion consists of recovering the dynamics of the vocal tract shape (from the vocal folds to the lips) from the acoustic speech signal, supplemented by image analysis of the speaker's face. Being able to recover this information automatically would be a major breakthrough in speech research and technology: a vocal tract representation of a speech signal would be both valuable from a theoretical point of view and practically useful in many speech processing applications (language learning, automatic speech processing, speech coding, speech therapy, the film industry, and others).

The design of audiovisual-to-articulatory inversion involves two kinds of interdependent tasks. The first is the development of inversion methods that successfully address the main acknowledged difficulties (non-uniqueness of the inverse solution, lack of phonetic relevance of inverse solutions, impossibility of using standard spectral data). The second is the construction of an articulatory database comprising dynamic images of the vocal tract together with the corresponding speech signal, for several male and female speakers.

For the inversion itself, the main objectives are:
1. Development of inversion methods.
2. Investigation of additional constraints to reduce the under-determination of the inversion.
3. Evaluation of the inversion methods on articulatory data.

For the construction of the articulatory database, the main objectives are:
4. Design and acquisition of articulatory data that enable both the development of articulatory models and the assessment of inversion methods.
5. Design of a low-cost acquisition technology based on ultrasound and facial motion capture.
6. Exploitation of existing databases (mainly previously acquired X-ray images).

The consortium provides an outstanding blend of competences, bringing together groups with a theoretical background in speech production, acoustic-to-articulatory inversion, computer vision and medical imaging.
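As a rough illustration of why inversion is under-determined and how an additional constraint can select one solution, the following minimal Python sketch may help. It is not the project's method. The forward map `forward`, the two articulatory parameters `a` and `b`, and the function `invert_with_continuity` are all hypothetical, chosen only so that many articulatory configurations produce the same acoustic value. The added constraint is temporal continuity: at each frame the sketch picks the configuration that matches the observation and lies closest to the previous frame.

```python
def forward(a, b):
    """Toy many-to-one articulatory-to-acoustic map (hypothetical, not a vocal tract model)."""
    return a + b


def invert_with_continuity(observations, start):
    """Recover an (a, b) trajectory from a sequence of acoustic values.

    For each observation y, infinitely many pairs satisfy forward(a, b) == y.
    The continuity constraint picks the pair closest (in Euclidean distance)
    to the previous frame, which for this linear toy map means splitting
    the residual equally between the two parameters.
    """
    a, b = start
    trajectory = []
    for y in observations:
        residual = y - forward(a, b)
        a += residual / 2.0
        b += residual / 2.0
        trajectory.append((a, b))
    return trajectory


# Non-uniqueness: two different configurations give the same acoustic value.
same_output = abs(forward(0.2, 0.8) - forward(0.5, 0.5)) < 1e-12

# Constrained inversion of a short observation sequence (assumed values).
observations = [1.0, 1.5, 2.0]
trajectory = invert_with_continuity(observations, start=(0.5, 0.5))
```

In this toy setting every recovered frame reproduces its observation exactly, and the trajectory depends on the chosen starting configuration. That is the role the constraints play here: they do not remove the ambiguity of the forward map, they only select one plausible solution among many.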

