Inverse mapping from speech sounds to articulatory gestures is a difficult problem, primarily because of the nonlinear, many-to-one relationship of articulation to acoustics. So far, it has been an ill-posed problem in the mathematical sense. Owing to recent outstanding progress in robotics, it is now possible to address, both theoretically and technologically, a basic question in speech inverse acoustics: Can an articulatory robot learn to produce articulatory gestures from sounds?
To address this question, the following research was carried out.
Aerodynamic, acoustic and laryngograph data have been recorded in order to study the generation of excitation sources, and a voice source model has been assessed by comparison with inverse-filtered natural speech. The dynamics of voice and noise sources have been studied, especially glottis-constriction coordination for fricatives and variations of the voice source in vowel-consonant sequences.
As concerns vocal tract geometric and acoustic data, scanner and video measurements of the vocal tract have been carried out, and software for digitising labial and X-ray films has been developed. Vocal tract bioacoustic measurements have been performed using a new technique and compared with a database of reference transfer functions.
Articulatory-to-acoustic modelling has resulted in acoustic vocal tract simulation software that includes several new features. An articulatory-acoustic codebook has been generated with a first version of the Speech Maps Interactive Plant (SMIP). A first set of data on articulatory timing has been recorded for the study of vocalic and consonantal coarticulation, and a speech timing model has been developed as a first step towards modelling motor encoding programming.
Methods for the recovery of articulatory trajectories of vowel-vowel (VV) gestures have been tested, together with inverse dynamics for selected articulators. Self-organized motor relaxation nets have been used to study trajectory formation. Learning of coarticulation and compensation phenomena has been tested for selected VV gestures with a control model.
A method for the recovery of undershoot vocalic targets from acoustic parameters has been developed using principles of dynamics. To obtain visual input data for audiovisual integration, a set of labial gestures in vowels and consonants has been recorded and processed. Visual perception of labial anticipation has been tested, and four audiovisual integration models have been implemented and assessed.
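The role of an articulatory-acoustic codebook in inversion can be illustrated with a minimal sketch. It assumes a codebook is simply a list of (articulatory vector, acoustic vector) pairs and that lookup is a nearest-neighbour search on the acoustic side; the entries, the two-parameter vectors and the function name nearest_entry are hypothetical placeholders and do not describe the SMIP codebook itself.

```python
import math

# Illustrative placeholder entries only, not measured data:
# each entry pairs an articulatory vector with an acoustic vector.
CODEBOOK = [
    ((0.9, 0.1), (300.0, 2300.0)),
    ((0.1, 0.5), (700.0, 1200.0)),
    ((0.8, 0.9), (300.0, 800.0)),
]


def nearest_entry(acoustic, codebook=CODEBOOK):
    """Return (index, articulatory vector) of the acoustically closest entry."""
    index = min(
        range(len(codebook)),
        key=lambda i: math.dist(codebook[i][1], acoustic),
    )
    return index, codebook[index][0]


if __name__ == "__main__":
    # An observed acoustic vector is mapped back to a stored articulation.
    print(nearest_entry((320.0, 2200.0)))
```

In such a scheme only the entry index would need to be stored or transmitted, since the articulatory vector can be recovered from the shared codebook.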
APPROACH AND METHODS
One can conceive of two complementary approaches to the speech inversion problem. The first uses all the available knowledge in signal processing to identify the characteristics of the sources and filters corresponding to the vocal tract which produced the speech signal. The second is borrowed from control theory, and aims at determining inverse kinematics and/or dynamics for an articulatory robot with excess degrees of freedom. In both approaches, there is a clear need for knowledge of the direct mapping (from articulation to acoustics), in order to find constraints that regularise the solution.
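Why a many-to-one mapping needs regularisation can be shown on a toy case. The sketch below assumes a linear forward map in which one acoustic value is a weighted sum of several articulatory parameters, and it regularises the inversion by penalising departure from a previous articulatory state (a Tikhonov-style term). The map, the weights and the function names are illustrative assumptions, not the models used in the project.

```python
def forward(a, w):
    """Toy linear articulatory-to-acoustic map: several parameters, one value."""
    return sum(wi * ai for wi, ai in zip(w, a))


def invert_regularised(y, w, a_prev, lam=0.0):
    """Minimise (forward(a) - y)**2 + lam * ||a - a_prev||**2 in closed form.

    With lam == 0 this returns the exact solution closest to a_prev;
    with lam > 0 it trades acoustic accuracy against articulatory movement.
    """
    residual = y - forward(a_prev, w)
    scale = residual / (sum(wi * wi for wi in w) + lam)
    return [ai + scale * wi for ai, wi in zip(a_prev, w)]


if __name__ == "__main__":
    w = [1.0, 1.0]
    # Two different starting states give two different articulations
    # that produce the same acoustic value: the inverse is not unique.
    print(invert_regularised(2.0, w, [0.0, 0.0]))
    print(invert_regularised(2.0, w, [2.0, 0.0]))
```

The constraint here is proximity to the previous state; in a realistic setting the constraints would come from the direct articulatory-to-acoustic mapping and from the dynamics of the articulators.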
Following basic schemes in robotics, the speech production model is represented here by a realistic articulatory model, the plant, driven by a controller, i.e. a sequential network capable of synthesising motor sequences from sound prototypes. This ensemble, called the Articulotron, displays fundamental spatio-temporal properties of serial ordering in speech (coarticulation phenomena) and adaptive behaviour to compensate for perturbations.
The robotics approach to speech allows the unification of Action and Perception. If speech communication is conceived of as a trade-off between the cost of production and the benefit of understanding, the constraints will be borrowed from the articulatory level, and the specific low-level processing from auditory and visual perception. Using an Audiovisual Perceptron to incorporate vision will lead to a more comprehensive formulation of the inversion problem: How can articulatory gestures be learned from hearing and seeing speech?
The integrated approach propounded in this project should lead (together with the Articulotron, the Audiovisual Perceptron and other tools for speech processing) to major "spinoffs" in R&D. Speech synthesis will greatly benefit from the learning ability of a robot taking advantage of adaptive biological principles. Low bit-rate transmission of speech can also be developed from this approach, through access to articulatory codebooks. Finally, speech recognition that uses vision to enhance the acoustic signal in noise would also benefit from this low-level inverse mapping.