The project aims to support and coordinate European participation in COCOSDA - the International Coordinating Committee on Speech Databases and Speech Input/Output Systems Assessment. This recent world-level development in language and speech engineering - with representatives from about twenty countries, drawn from Europe, North America, China, Japan and the Pacific Rim area - is concerned with the definition and application of multi-language databases and assessment standards and protocols in the field of Spoken Language Engineering. It is currently based to an important degree on prior European work; there is, however, no organised European presence in its activities. The following objectives are consequently at the core of the EuroCocosda project: give a focus for European work within, and input to, the new world frame provided by Cocosda; contribute to the internationally recognized need for a synergy between the areas of natural language and speech; provide for the joint production of spoken and written language databases, in order to support cooperation among these areas.
Approach and Methodology
Different action lines will be implemented within the project in order to reach these aims. A unifying infrastructure will be established - EuroCocosda - for concerted European contribution to the main Cocosda world group. Within this framework, direct links will be maintained both with the Cocosda Central Committee and with the three Cocosda Working Groups: Recognition, Synthesis and Corpora. An organisational support foundation for the European component of a worldwide telephone speech database - Polyphone - will be provided; the funding for each individual language's 5000-speaker corpus should come from national PTT resources. A coherent framework will be created both for inputs to the Cocosda world speech synthesis database and for the European component of the NL/Speech corpora initiative, NEWS, a multilanguage newspaper text and speech database.
The TED corpus (Translanguage English Database), containing spontaneous speech, read speech and some associated text material, will be directly acquired and formatted within the project. Two groups of recorded speakers - native speakers of English with different dialects and non-native speakers of English as a foreign language - will provide multi-dialectal and multi-accent speech data. Speakers will be recorded at international conferences on speech communication; the associated text material will be organised and structured in a distributable form and made available to the scientific community at an early stage, thus providing an excellent link between natural language and speech technology. The issue of reusability will also be specifically addressed in this data collection effort.
Links between COCOSDA and relevant European and international initiatives (LRE projects SQALE, RELATOR, EAGLES; ELSNET; LDC) will be created or fostered, thus providing an official and regular communication channel for exchanging experiences and data. Present needs of the scientific and user community will be surveyed and future initiatives prepared.
Exploitation and Future Prospects
The following benefits and results are foreseen for the project, with special reference to the access from Europe to developments on the world scene and with a substantially enhanced possibility for European norms and de facto standards to become more widely influential and accepted:
A central position for Europe in the international scene, with major initiatives now capable of coming from EC countries acting in concert.
Better coordination between European laboratories, and a world level dimension for the joint activity of European NL and speech based workers.
An insight into world - in particular US and Japanese - technical expertise, coming from direct collaboration on common projects.
The possibility of harmonising national with world actions on database recording and labelling.
The availability of poly-language databases for developing and testing telephone based speech applications in both recognition and synthesis.
The availability of world relevant database frameworks for developing and testing multilingual speech synthesis systems.
A survey on the existence and availability of newspaper databases in various languages; the potential to create and influence the creation of new ones at the international level.
A domain-specific multi-language and multi-accent corpus framework, with a large number of speakers, speaking the same language (English) in a natural way, under moderate stress, over a relatively long period of time. The corresponding text material will allow the building of lexica and language models.
Although the project is concrete and small-scale, focused primarily on the urgent need for European coordination within an international body that already exists, it will prepare the ground for larger-scale operations, whilst stimulating cooperation at the European and international levels.