Content archived on 2024-05-30

Rapid Cross-Lingual Speaker Adaptation for Statistical Text-to-Speech Systems

Novel voice adaptation methods to facilitate multilingual communication in Europe

With rapid globalisation and the need for communication across multiple languages, attention has turned to the development of supporting tools and applications. An EU initiative contributed to advances in this area that will ultimately help people communicate more effectively.

Digital Economy

The EU-funded CLSASTS (Rapid cross-lingual speaker adaptation for statistical text-to-speech systems) project set out to refine personalised speech-to-speech applications. More specifically, it aimed to extend text-to-speech synthesis through new methods for statistical text-to-speech (STS) systems.

Project work covered the development of state-of-the-art English and Turkish STS systems and extensive testing of their quality and intelligibility. For the Turkish system, 10 hours of studio voice recordings were gathered from 3 professional voice artists, and pronunciation generation, text processing and syntactic analysis algorithms were created for the Turkish language. Test results showed the quality and intelligibility of the Turkish STS system to be equal to those of its English equivalent.

A novel hybrid statistical/unit selection speech synthesis system was developed that takes advantage of the morphological structure of the Turkish language. This system was found to deliver better speech quality than the baseline STS system, with only a minimal increase in memory requirements.

Collection of Turkish data from broadcast news and university students enabled the creation of a database of 70 male and 70 female Turkish speakers. In addition, the CLSASTS team developed eigenvoice-based speaker adaptation algorithms and a novel Bayesian eigenvoice technique. The latter, in combination with a nearest-neighbour approach, demonstrated considerably higher speaker similarity. The nearest-neighbour algorithm performed as well as the single-nearest-neighbour method. By contrast, non-linear dimensionality reduction methods did not improve performance over the baseline system.

Given the large number of languages spoken in Europe, CLSASTS will have important socioeconomic implications through improved communication between EU countries. By contributing to ongoing speech-to-speech translation efforts, it will give Europe a competitive edge. In addition, the technology will encourage new companies and/or commercial production.

Keywords

Voice adaptation, multilingual communication, statistical text-to-speech, speech-to-speech, speaker adaptation
