High-Quality voice model for STatistical parametric speech Synthesis

Project Information

HQSTS

Grant agreement ID: 655764

Project website

DOI

10.3030/655764

Project closed

EC signature date 25 March 2015

Start date 1 October 2015

End date 31 December 2017

Funded under

EXCELLENT SCIENCE - Marie Skłodowska-Curie Actions

Total cost

€ 183 454,80

EU contribution

€ 183 454,80


Coordinated by

THE CHANCELLOR MASTERS AND SCHOLARS OF THE UNIVERSITY OF CAMBRIDGE
United Kingdom

Periodic Reporting for period 1 - HQSTS (High-Quality voice model for STatistical parametric speech Synthesis)

Reporting period: 2015-10-01 to 2017-09-30

A speech analysis/synthesis method represents a speech waveform, produced by a person speaking, as a time sequence of parameters, from which the waveform can be resynthesized. Analysis/synthesis methods are cornerstones of many speech technologies (e.g. text-to-speech, telecommunications, voice restoration). For the majority of applications, these methods need two key properties: (i) a high perceived quality of the speech sound, and (ii) a statistical characterization of the parameter sequence, which statistical approaches require. Current analysis/synthesis methods, however, lack perceived quality. This is not a problem in noisy environments, but it prohibits the use of statistical approaches in quiet environments, where the listener is fully aware of all the details of the sound. Recent phase processing tools have made it possible to describe the phase spectrum and noise properties in a way that exposes the drawbacks and limits of current analysis/synthesis methods. These same tools are also a promising means of modeling the phase and noise information, which is paramount for good quality. The primary goal of the HQSTS project is to create a high-quality analysis/synthesis method that will broaden the applications of statistical approaches in speech technologies to quiet environments, where high quality is an absolute necessity.

A new high-quality analysis/synthesis method, called Pulse Model in Log domain (PML), has been developed. From a practical point of view, it prevents the buzziness often present in synthetic voices and thus clearly improves the overall quality. From a theoretical point of view, the new and simple approach offers better control of the sound characteristics and eases the development of further quality improvements.
A full training system for speech synthesis, implementing state-of-the-art Artificial Neural Network (ANN) techniques, has also been built during this research work. This system has been made open-source to provide a solid anchor for researchers and developers who need a working implementation at hand, given the current fast pace of ANN developments. The analysis/synthesis method and the training system for speech synthesis are both available on GitHub.com at: https://github.com/gillesdegottex

The work accomplished can be divided into two parts. The first year was dedicated to developing a new noise representation for speech signals, leading to a new vocoder named Pulse Model in Log domain (PML). The second year mainly focused on using Artificial Neural Networks (ANN) to address a common issue that arises when a vocoder is used in parametric speech synthesis, namely averaging. This involved many more training activities than the first year.

In more detail, during the development of the new vocoder PML, we changed the traditional approach of linearly adding the deterministic component (glottal source) and the noise component (breathiness), which together form the source of the source-filter model of voice production.
Instead of an addition in the linear time domain, we used an addition in the log spectral domain. This new approach has a few advantages over the traditional one:
i) The buzziness effect commonly found in parametric speech synthesis is removed, since the addition in the log domain scrambles the phase of the deterministic component and prevents any unnatural concentration of energy.
ii) The amplitude model is set by the vocal tract model (the spectral envelope) and is preserved throughout the synthesis process, whereas with an addition in the linear domain the amplitude also depends on the deterministic/noise ratio, which complicates voice control more than necessary.
iii) The mathematical definition is simpler than that of the traditional approach, which is a very convenient property for further developments.
The new vocoder is complemented by a deinterference-based spectral envelope (from the STRAIGHT or WORLD vocoders).
The source code of this new PML vocoder is open and available online (https://github.com/gillesdegottex/pulsemodel).
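The difference between the two kinds of addition can be illustrated with a minimal NumPy sketch for a single frame. This is not the PML implementation: the function names, the toy envelope, the weighting used in the linear mix, and the normalisation of the noise spectrum to unit magnitude per bin are all assumptions made for illustration only.

```python
import numpy as np

def linear_mix(envelope, pulse_spec, noise_spec, mask):
    """Traditional scheme (assumed weighting): add sources linearly, then filter."""
    source = (1.0 - mask) * pulse_spec + mask * noise_spec
    return envelope * source

def log_domain_mix(envelope, pulse_spec, noise_spec, mask):
    """Log-domain scheme: log S = log V + log D + mask * log N."""
    log_spec = np.log(envelope) + np.log(pulse_spec) + mask * np.log(noise_spec)
    return np.exp(log_spec)

n_fft = 512
bins = np.arange(n_fft // 2 + 1)

# Hypothetical smooth spectral envelope V (stands in for the vocal tract filter).
envelope = 1.0 / (1.0 + (bins / 50.0) ** 2)

# Deterministic component D: spectrum of a single pulse delayed by 10 samples.
pulse_spec = np.exp(-2j * np.pi * bins * 10 / n_fft)

# Noise component N: spectrum of Gaussian noise, normalised to unit magnitude
# per bin (a simplifying assumption) so that only its phase is random.
rng = np.random.default_rng(0)
noise_spec = np.fft.rfft(rng.standard_normal(n_fft))
noise_spec = noise_spec / np.maximum(np.abs(noise_spec), 1e-12)

# A soft mask: half deterministic, half noise in every bin.
soft_mask = np.full(bins.shape, 0.5)

lin_soft = linear_mix(envelope, pulse_spec, noise_spec, soft_mask)
log_soft = log_domain_mix(envelope, pulse_spec, noise_spec, soft_mask)

# Largest deviation of each output magnitude from the envelope.
print("linear mix:", np.max(np.abs(np.abs(lin_soft) - envelope)))
print("log-domain mix:", np.max(np.abs(np.abs(log_soft) - envelope)))
```

Under these assumptions, the log-domain mix changes only the phase of the pulse, so its magnitude stays equal to the envelope whatever the mask value, whereas the magnitude of the linear mix depends on how the two sources happen to interfere in each bin.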

During the second year, the recruited researcher spent time acquiring important knowledge in ANN and statistical modelling in general. More precisely, the following techniques were studied: fully connected layers with normalisation issues; recurrent networks; recurrent outputs; 1D convolutional layers for input pre-processing; 2D convolutional layers for generating spectrograms and noise masks; Generative Adversarial Networks (GAN) (basic, least-squares and Wasserstein variants). Thanks to recent advances in neural network models, the synthesis of plosives and transients benefits greatly from these techniques.

Most of the results of the HQSTS project are embodied in the new vocoder PML. The perceived quality has been clearly improved compared to state-of-the-art methods, especially by reducing the buzziness effect mentioned above (presented first in [C2], fully detailed in [J1] and then applied in [C3,C4]). Open-source code has been made available to the public at large on GitHub.com. This includes: the new vocoder PML; the full speech synthesis training system; the generation of listening tests and demo pages.

[J1] G. Degottex, P. Lanchantin and M. Gales, "A Log Domain Pulse Model for Parametric Speech Synthesis", IEEE Transactions on Audio, Speech, and Language Processing, 26(1):57-70, 2018.
[C2] G. Degottex, P. Lanchantin and M. Gales, "A Pulse Model in Log-domain for a Uniform Synthesizer", in Proc. 9th Speech Synthesis Workshop (SSW9), Sunnyvale, CA, USA, 2016.
[C3] G. Degottex, P. Lanchantin and M. Gales, "Light Supervised Data Selection, Voice Quality Normalized Training and Log Domain Pulse Synthesis", in Proc. Blizzard Challenge 2017 - EH1, Stockholm, Sweden, 2017.
[C4] M. Wan, G. Degottex and M. Gales, "Integrated speaker-adaptive speech synthesis", in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Okinawa, Japan, 2017.

Compared to the state of the art, the noise model developed for PML clearly removes the buzziness often present in parametric speech synthesis. Besides this very concrete result, it is also important to note the potential of adding the deterministic and random components in the log spectral domain. This approach offers better control of the speech signal, which should lead to further results in applications. The simplicity of PML's synthesis method is also a step beyond the state of the art, since the over-complicated structure of the traditional model was limiting the development of high-quality vocoders.

The fast pace of ANN improvements, and their consequences for speech synthesis, can make it difficult for speech researchers in both industry and academia to implement a state-of-the-art, competitive text-to-speech system.
Even though big companies (e.g. Google, Amazon, Facebook) are present in the speech community through publications, they do not always publish the full details of their systems, for obvious reasons.
In this context, by making our speech synthesis training system available as open source, we hope that researchers and developers will have a solid anchor for developing and deploying state-of-the-art solutions in speech synthesis.

fig-features.png
