Speech, being the most natural means of human interaction, can also be viewed as one of the most important modalities in complex human-machine interaction. State-of-the-art speech recognition technology still lacks robustness with respect to environmental conditions and speaking style. In addition, most research effort has been devoted to American English. The Coretex project aims to improve current technology by making it less sensitive to environmental and linguistic factors, and more suitable for European languages. This will be achieved through fast system development, generic core technology, dynamic adaptation techniques, and methods to enrich transcriptions. Evaluation and demonstration frameworks will be established to assess and illustrate the improvements. An associated industrial user panel will have early access to project results and will help select the most pressing research issues and problems common to important application domains.
The objectives of the Coretex project are to:
- develop generic speech recognition technology that works well for a wide range of tasks with essentially no exposure to task-specific data;
- develop methods for rapid porting to new domains and languages with limited, inaccurately transcribed, or untranscribed training data;
- investigate techniques to produce enriched symbolic speech transcriptions with extra information for higher-level (symbolic) processing;
- explore methods that use contemporary and/or topic-related texts to improve language models, and automatic pronunciation generation for vocabulary extension;
- integrate the methods into showcases to validate them in relevant applications;
- establish an evaluation framework, collect test suites, and define objective measures to assess improvements;
- disseminate information about the research carried out in Coretex and maintain contact with interested user groups so as to exploit the results widely.
For fast (low-cost) system development, the trade-off between the amount of training data, acoustic model complexity, and recognition performance will be optimised, and fast acoustic adaptation methods will be applied. Automatic and semi-automatic transcription methods will be realised using recognised and approximate transcriptions in combination with confidence measures.
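The use of confidence measures to select automatically transcribed training data can be sketched as follows. This is a minimal illustration with a hypothetical threshold and toy data; the project's actual selection criteria are not specified here.

```python
# Sketch of confidence-based selection of automatically transcribed
# training segments. Each segment is a list of (word, confidence) pairs;
# the 0.9 threshold is an assumption for illustration only.

def select_reliable_segments(segments, min_confidence=0.9):
    """Keep segments whose average word confidence passes the threshold."""
    reliable = []
    for words in segments:
        avg = sum(conf for _, conf in words) / len(words)
        if avg >= min_confidence:
            reliable.append([word for word, _ in words])
    return reliable

segments = [
    [("the", 0.98), ("weather", 0.95), ("today", 0.93)],  # high confidence
    [("uh", 0.40), ("mumbled", 0.35), ("words", 0.50)],   # low confidence
]
print(select_reliable_segments(segments))  # [['the', 'weather', 'today']]
```

In practice such filtering would operate on recogniser lattices or word posteriors rather than flat scores, but the principle of discarding unreliable hypotheses before training is the same.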
To enrich transcriptions, techniques for case restoration will rely mainly on specific language models, whereas punctuation methods will additionally consider acoustic measurements of pause durations and local prosody. For speaker turn and identity tagging, a number of schemes for segmentation, clustering, and speaker identification will be investigated.
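An enriched transcription carrying speaker turns, restored casing, and punctuation might be serialised as in the following sketch. The element and attribute names here are hypothetical and need not match the project's common XML format for audio meta-data markup.

```python
# Illustrative sketch of an enriched transcription in XML: speaker-turn
# tags with time stamps, plus cased and punctuated text. All tag and
# attribute names are assumptions, not the project's actual format.
import xml.etree.ElementTree as ET

episode = ET.Element("episode", src="news_broadcast.wav")

turn1 = ET.SubElement(episode, "turn", speaker="spk1", start="0.00", end="3.20")
turn1.text = "President Smith arrived in Paris today."

turn2 = ET.SubElement(episode, "turn", speaker="spk2", start="3.20", end="5.10")
turn2.text = "Thank you, back to the studio."

print(ET.tostring(episode, encoding="unicode"))
```

Keeping such meta-data in a common XML format lets downstream tools (browsers, indexers, digital libraries) consume the same transcripts regardless of which recogniser produced them.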
Contemporary and topic-specific data, consisting of textual documents and automatically transcribed audio, will be used for dynamic domain adaptation. This requires identifying new words, adding them to the system vocabulary, and learning the corresponding syntactic constraints. Pronunciations for new words will be derived by grapheme-to-phoneme conversion and other automatic derivation methods.
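A rule-based grapheme-to-phoneme converter, in its simplest form, can be sketched as below. The letter-to-phoneme rules are toy examples using ARPAbet-style symbols; real systems use context-dependent rules or trained models.

```python
# Minimal rule-based grapheme-to-phoneme sketch for deriving a
# pronunciation for a new vocabulary word. The rule table is a toy
# assumption; it is applied greedily, longest grapheme first.

RULES = [  # (grapheme, phoneme), multi-letter graphemes listed first
    ("ch", "CH"), ("sh", "SH"), ("th", "TH"),
    ("a", "AH"), ("e", "EH"), ("i", "IH"), ("o", "OW"), ("u", "UH"),
    ("b", "B"), ("c", "K"), ("d", "D"), ("f", "F"), ("g", "G"),
    ("h", "HH"), ("k", "K"), ("l", "L"), ("m", "M"), ("n", "N"),
    ("p", "P"), ("r", "R"), ("s", "S"), ("t", "T"), ("v", "V"), ("w", "W"),
]

def grapheme_to_phoneme(word):
    """Convert a lowercase word to a phoneme list by greedy rule matching."""
    phones, i = [], 0
    while i < len(word):
        for grapheme, phoneme in RULES:
            if word[i:].startswith(grapheme):
                phones.append(phoneme)
                i += len(grapheme)
                break
        else:
            i += 1  # no rule for this letter: skip it
    return phones

print(grapheme_to_phoneme("chip"))  # ['CH', 'IH', 'P']
```

The generated pronunciation would then be added to the recogniser's lexicon alongside the new vocabulary entry.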
A transcription browser will display parallel recogniser transcripts so as to allow the user to compare the transcription quality as a function of the system configuration and performance. An audio browser using the transcriptions and meta-data markups as well as digital libraries will demonstrate the technological improvements in the context of real applications.
For each research area, objective evaluation criteria and measures will be defined, test suites will be collected, and software tools will be developed. Differences among the four speech recognition platforms available to the consortium will be investigated in order to provide insights into the core technology.
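The standard objective measure for speech recognition quality is the word error rate (WER), computed as the edit distance between reference and hypothesis word sequences divided by the reference length. A minimal sketch:

```python
# Word error rate (WER) via dynamic-programming edit distance:
# WER = (substitutions + deletions + insertions) / reference word count.

def word_error_rate(reference, hypothesis):
    """Compute WER between two whitespace-tokenised transcriptions."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat", "the cat sat on"))  # 1 insertion / 3 words
```

Fixing such a measure, together with shared test suites, makes results comparable across the consortium's recognition platforms.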
Industrial companies and end-users will be sought to form a user panel, which will have early access to project results, and will be invited to attend annual workshops. The project results will be presented at general public events within the scientific community, and in specialized workshops at the European and international levels.
M1.1 (T0+12) Grapheme-to-phoneme conversion
M1.2 (T0+33) Performance as a function of costs
M2.1 (T0+6) Common XML format for audio meta-data markup
M3.1 (T0+18) Test suites and protocols
M4.1 (T0+18) Selection of showcases
M4.2 (T0+30) Showcases
M5.1 (T0+6) First workshop with user panel
M5.2 (T0+20) Second workshop with user panel
M5.3 (T0+36) Final workshop with user panel
Funding Scheme: CSC - Cost-sharing contracts