CORDIS - EU research results

Rapid Cross-Lingual Speaker Adaptation for Statistical Text-to-Speech Systems

Final Report Summary - CLSASTS (Rapid Cross-Lingual Speaker Adaptation for Statistical Text-to-Speech Systems)

The difficulties that arise from multilingualism are a problem throughout the world because of the increasing pace of globalisation. For example, there are 20 official languages in the European Union (EU), and 46 million people in the US speak a language other than English at home. Effective and rapid cross-lingual voice adaptation methods can therefore have a global impact by enabling new tools, such as personalised speech-to-speech applications, that help people communicate more effectively.

The aim of the proposed project is to develop novel rapid cross-lingual speaker adaptation methods for statistical text-to-speech (STS) systems. STS systems can produce synthetic speech of high quality and intelligibility. In fact, in the latest international competitions, some STS systems produced higher quality speech than the unit selection approach, which has been the dominant TTS technology of the last decade. Moreover, an important advantage of the STS approach is that speaker adaptation techniques can adapt the system to a target speaker using only a couple of minutes of that speaker's data, which is not possible with unit selection synthesis.

Rapidly adapting to a user's voice with a few seconds of data and synthesising speech in that voice in another language opens up many opportunities for new applications and products. For example, a computer can tell stories to a child in a parent's voice, or a personalised speech-to-speech translation system can render a diplomat's speech in another language while preserving the diplomat's own voice. Hence, successful completion of the proposed work can have a significant socio-economic impact.

The following work was performed within the context of the project:
1-A state-of-the-art English STS system was developed. The quality and intelligibility of the system were tested.
2-A state-of-the-art Turkish STS system was developed. 10 hours of recordings were collected in a studio from three professional Turkish voice artists. Pronunciation generation, text processing, and syntactic analysis algorithms were developed for the Turkish language. The quality and intelligibility of the system were tested extensively.
3-A novel hybrid statistical/unit selection speech synthesis system was developed that takes advantage of the morphological structure of Turkish. The system was found to have better speech quality than the baseline STS system while only minimally increasing memory requirements.
4-A state-mapping algorithm was implemented for mapping between different models. The algorithm was tested with an intralingual speech database and found to match the performance reported in the literature.
5-Turkish data were collected from broadcast news and university students. A database of 70 male and 70 female Turkish speakers was created.
6-Eigenvoice-based speaker adaptation algorithms were developed. Both PCA-based and maximum-likelihood-based algorithms were implemented as baseline systems. A novel Bayesian eigenvoice technique was developed.
7-A novel cross-lingual adaptation technique based on eigenvoice methods was developed. Perceptual eigenspace adaptation techniques were also implemented and tested.
8-Nonlinear dimensionality reduction with a mixture of factor analyzers was implemented and tested.
9-A novel cross-lingual adaptation technique based on nearest-neighbor methods and parallel corpora was developed.
10-A novel cross-lingual speaker adaptation algorithm based on k nearest neighbors was developed.
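
Item 6 above mentions a PCA-based eigenvoice baseline. The report gives no implementation details, so the following is only a minimal sketch of the general textbook idea, not the project's code. It assumes that each training speaker is summarised by a fixed-length "supervector" of model parameters and that a rough supervector estimate is available for the target speaker; the function names, the toy data, and the dimensions are all hypothetical.

```python
import numpy as np


def train_eigenvoices(supervectors, num_eigenvoices):
    """Estimate a mean voice and an eigenvoice basis with PCA.

    supervectors: array of shape (num_speakers, dim), one row per speaker
                  (hypothetical representation, not taken from the report).
    Returns (mean, eigenvoices) with shapes (dim,) and (num_eigenvoices, dim).
    """
    mean = supervectors.mean(axis=0)
    centered = supervectors - mean
    # The right singular vectors of the centered data are the principal
    # directions; they come sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:num_eigenvoices]


def adapt_speaker(target_supervector, mean, eigenvoices):
    """Project a target speaker onto the eigenvoice space.

    Returns the eigenvoice weights and the adapted supervector, which is
    constrained to lie in the space spanned by the eigenvoices.
    """
    weights = eigenvoices @ (target_supervector - mean)
    adapted = mean + weights @ eigenvoices
    return weights, adapted


# Toy data only: 140 synthetic "speakers" that vary along 3 hidden directions.
rng = np.random.default_rng(0)
hidden_basis = rng.normal(size=(3, 20))
training = rng.normal(size=(140, 3)) @ hidden_basis + 5.0

mean_voice, eigenvoices = train_eigenvoices(training, num_eigenvoices=3)
target = rng.normal(size=3) @ hidden_basis + 5.0
weights, adapted = adapt_speaker(target, mean_voice, eigenvoices)
print("eigenvoice weights:", weights)
print("reconstruction error:", np.linalg.norm(adapted - target))
```

Because the adapted model is described by only a handful of weights, very little target-speaker data is needed to estimate it, which is what makes this family of methods attractive for rapid adaptation. A maximum-likelihood variant would estimate the weights directly from adaptation data instead of projecting a supervector.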

Main Results:
1-The quality and intelligibility of the Turkish STS system were found to be equivalent to those of the English system. Results are reported in papers.
2-A novel hybrid statistical/concatenative STS system was developed for morphologically rich languages and shown to outperform the state-of-the-art STS system. Results are reported in papers.
3-Eigenvoice methods produced annoying artifacts. However, high speaker similarity and quality were obtained using a novel Bayesian eigenvoice technique in combination with a nearest-neighbor approach. The new nearest-neighbor approach gave substantially better speaker similarity than the existing cross-lingual adaptation algorithms in the literature.
4-The k nearest-neighbor algorithm performed as well as the single nearest-neighbor method in the literature.
5-Nonlinear dimensionality reduction techniques did not improve performance over the baseline system.
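
Result 4 compares a k nearest-neighbor method with a single nearest-neighbor method. The report does not describe either algorithm, so the sketch below is a generic illustration under stated assumptions rather than the project's method: each reference speaker is assumed to have a language-independent feature vector and a model vector in the output language, and the target speaker's model is taken as the unweighted average of the k closest reference models. The function name `knn_adapt` and all data are hypothetical.

```python
import numpy as np


def knn_adapt(target_features, ref_features, ref_models, k=1):
    """Pick the k reference speakers closest to the target and average them.

    target_features: (feat_dim,) features of the target speaker.
    ref_features:    (num_refs, feat_dim) features of the reference speakers.
    ref_models:      (num_refs, model_dim) model vectors of the same speakers
                     in the output language (an assumption of this sketch).
    Returns the indices of the selected speakers and the averaged model.
    """
    # Euclidean distance from the target to every reference speaker.
    distances = np.linalg.norm(ref_features - target_features, axis=1)
    nearest = np.argsort(distances)[:k]
    return nearest, ref_models[nearest].mean(axis=0)


# Toy data only: three reference speakers with 2-dimensional vectors.
ref_features = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
ref_models = np.array([[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]])
target_features = np.array([0.2, 0.0])

for k in (1, 2):
    chosen, model = knn_adapt(target_features, ref_features, ref_models, k=k)
    print(f"k={k}: speakers {chosen.tolist()}, adapted model {model.tolist()}")
```

With k=1 the function reduces to the single nearest-neighbor case, so the two settings can be compared by changing one argument.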

Potential Impact and Use:
One of the most important socio-economic reasons for carrying out the project is the large number of languages spoken in Europe, which complicates communication between EU countries. The cross-lingual speaker adaptation work proposed here can become a vital part of the ongoing speech-to-speech translation efforts in Europe. Such systems can make it easier for the different cultures in Europe to interact, which will help the social and economic advancement of Europe. Moreover, this technology can make new companies and commercial products feasible, which will also improve the competitiveness of the EU.