Content archived on 2024-06-18

Speech Communication with Adaptive Learning

Final Report Summary - SCALE (Speech communication with adaptive learning)

Project objectives

'Speech communication with adaptive learning' (SCALE) was a Marie Curie initial training network (ITN) that addressed several core issues in contemporary speech communication research. It had three core scientific objectives:

1. Bridging the gap between recognition and synthesis: speech recognition research - converting speech to text - was greatly advanced thirty years ago by a statistical machine learning technique called hidden Markov models (HMMs; see the sketch after this list). Speech synthesis is the opposite process, converting text into speech sounds. The objective was to exploit this duality between speech recognition and synthesis in order to cross-fertilise both fields.
2. Bridging the gap between automatic speech recognition (ASR) and human speech recognition (HSR): humans are much better at speech recognition than machines; in noisy situations in particular, we are clearly superior. Humans can also pick up new words relatively easily and make them part of their vocabulary. In SCALE we addressed these issues, for example by researching methods that can handle out-of-vocabulary words in morphologically rich languages such as Polish or German.
3. Bridging the gap between signal processing and learning: again, comparing machines to humans, there are capabilities that machines still need to acquire, such as directional hearing. When talking at a party in a group of people, we can use our two ears to locate the speaker and thus focus on the person we are speaking with. Such capabilities would also be helpful for machines. Present techniques can do some directional hearing with eight or more 'ears' (microphones), but the performance is still not satisfactory.
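
To make the HMM technique named in objective 1 concrete, here is a minimal sketch of the forward algorithm, which computes the likelihood an HMM assigns to a sequence of acoustic frames; the toy three-state left-to-right model and random emission scores are illustrative assumptions, not project code.

```python
import numpy as np

def forward(log_A, log_pi, log_B):
    """Log-likelihood of an observation sequence under an HMM.

    log_A  : (S, S) log transition probabilities, rows index the predecessor state
    log_pi : (S,)   log initial state probabilities
    log_B  : (T, S) per-frame log emission scores
    """
    log_alpha = log_pi + log_B[0]                   # initialise with frame 0
    for t in range(1, log_B.shape[0]):
        # log-sum-exp over predecessor states, then add the emission score
        log_alpha = log_B[t] + np.logaddexp.reduce(
            log_alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(log_alpha)           # total sequence log-likelihood

# Toy left-to-right model: 3 states, 5 frames of made-up emission scores.
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])
B = np.random.default_rng(0).random((5, 3))         # stand-in emission likelihoods
with np.errstate(divide="ignore"):                  # log(0) -> -inf is fine here
    print(forward(np.log(A), np.log(pi), np.log(B)))
```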

Work performed

Bridging the gap between recognition and synthesis: four fellows have been trained on this research theme. The research focussed on making speech synthesis more adaptive using ideas from speech recognition. One approach was to modify the signal pre-processing (the mel-frequency cepstral coefficients, MFCCs; a standard MFCC front end is sketched below). The new method allows reacting to online changes in the acoustic noise, and user studies based on listening experiments have shown the benefit of the proposed approach. Template-based automatic speech recognition and speech synthesis using posterior features have also been investigated, achieving results comparable to the use of natural speech templates, both with text-to-speech (TTS) systems trained on TTS corpora and with systems trained on ASR corpora. Regarding hierarchical trajectory models, a novel theoretical framework improving over conventional manifold-learning-based dimensionality reduction approaches has been developed, and significant performance gains have been shown. The goal of the project on speech synthesis by analysis was to identify a complete model of how humans adjust their speech production to the context in order to maximise communication effectiveness. The results are now part of the international Hurricane challenge (http://listening-talker.org/hurricane/).
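
As a point of reference for the pre-processing mentioned above, here is a minimal sketch of a standard MFCC front end, assuming the librosa library and a synthetic sine tone as a stand-in for real speech; the frame settings (25 ms windows, 10 ms hop) are a common ASR configuration, not necessarily the project's.

```python
import numpy as np
import librosa

sr = 16000
t = np.arange(sr) / sr                                  # one second of audio
y = 0.5 * np.sin(2 * np.pi * 220.0 * t)                 # placeholder "speech" signal

# 13 coefficients per 25 ms frame (n_fft=400) with a 10 ms hop (hop_length=160).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
print(mfcc.shape)                                       # (13, number_of_frames)
```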

Bridging the gap between ASR and HSR: humans are able to recognise words that are new to them, whereas ASR systems have a clear limitation: they can only recognise what is in their lexicon. In the project on open-vocabulary speech recognition, new subword-based approaches have been explored for Polish and German (a toy subword decomposition is sketched below). Subspace models for speech recognition are a new mathematical way to parameterise the acoustic model and to factor out different sources of variability, for example language changes or changes in noise conditions; this work has been successful, yielding a large number of publications. Humans are good at separating speakers by making use of the sparsity of the signal. In the project on sparse component analysis for robust distant speech recognition, spatial sparsity has been utilised for the first time; this work has won an award at the International Conference on Acoustics, Speech and Signal Processing (ICASSP). Finally, associative memories for learning and decoding speech, another human-inspired model, have been investigated.
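
As a toy illustration of the subword idea (not the project's actual method), the sketch below decomposes an unseen German compound into known subword units by greedy longest-match segmentation; the subword inventory is hypothetical.

```python
def segment(word, units):
    """Greedy longest-match segmentation of a word into subword units."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest piece first
            if word[i:j] in units:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])          # back off to a single character
            i += 1
    return pieces

# Hypothetical subword inventory; the full compound itself is out of vocabulary.
inventory = {"donau", "dampf", "schiff", "fahrt"}
print(segment("donaudampfschifffahrt", inventory))
# ['donau', 'dampf', 'schiff', 'fahrt']
```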

Bridging the gap between signal processing and learning: this theme has focussed on beamforming and source separation for distant speech recognition. A necessary prerequisite for this type of research is a suitable data collection, and USFD has collected and annotated such data. One basic research project focussed on multiple speaker localisation, proposing a probabilistic framework for the steered response power; the results have been published at ICASSP (a minimal building block of such localisation is sketched below). Information-theoretic approaches have also been used successfully for speech source separation, extending past work by Kumatani et al. This theme also covered dereverberation techniques using multiple microphones, as well as investigations of adaptive feature extraction and combination.
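
As a minimal building block of steered-response-power localisation, the sketch below estimates the time delay between two microphone signals using GCC-PHAT; the synthetic white-noise source, the 8-sample delay and the noise level are illustrative assumptions, and a full SRP localiser would repeat such correlations over many candidate steering directions.

```python
import numpy as np

def gcc_phat(x, y, n_fft=1024):
    """Estimate, in samples, how much signal x lags behind signal y."""
    X, Y = np.fft.rfft(x, n_fft), np.fft.rfft(y, n_fft)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep only the phase
    cc = np.fft.irfft(cross, n_fft)
    shift = int(np.argmax(np.abs(cc)))
    return shift if shift < n_fft // 2 else shift - n_fft  # wrap negative lags

rng = np.random.default_rng(1)
src = rng.standard_normal(512)                              # white source signal
mic1 = src + 0.05 * rng.standard_normal(512)
mic2 = np.roll(src, 8) + 0.05 * rng.standard_normal(512)    # delayed, noisy copy
print(gcc_phat(mic2, mic1))                                 # -> 8 samples of delay
```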

In addition to the work done within the themes, SCALE made significant contributions to the new Kaldi speech recognition toolkit.

Impact

SCALE has held six network-wide training events on dedicated themes such as distant speech recognition, spoken language processing by mind and machine, and beyond HMMs. In addition, the fellows received complementary training in areas such as 'Intellectual property rights (IPR), patenting and licensing', 'Communication, negotiation and research ethics', 'Project and finance management' and 'Proposal writing and funding opportunities'. This helped the fellows to take up positions inside and outside the network after the end of SCALE.

On the scientific side, SCALE has resulted in 68 papers, published primarily at the Interspeech and ICASSP conferences. Moreover, SCALE fellow Arnab Ghoshal and Dan Povey from Microsoft were key authors of the now widely used Kaldi ASR toolkit.

Detailed information can be found on the project website http://www.scale.uni-saarland.de/ or from the coordinator Dietrich.Klakow@LSV.Uni-Saarland.De.