Periodic Reporting for period 1 - HQSTS (High-Quality voice model for STatistical parametric speech Synthesis)
Reporting period: 2015-10-01 to 2017-09-30
A new high-quality analysis/synthesis method called Pulse Model in Log domain (PML) has been developed. From a practical point of view, it prevents the buzziness often present in synthetic voices and thus clearly improves the overall quality. From a theoretical point of view, this new and simple approach offers better control of the sound characteristics and eases the development of further quality improvements.
A full training system for speech synthesis, implementing state-of-the-art Artificial Neural Network (ANN) techniques, has also been realized during this research work. This system has been made open source in order to constitute a solid anchor for researchers and developers who need a working implementation at hand amid the current fast-paced developments in ANNs. Both the analysis/synthesis method and the training system for speech synthesis are available on GitHub at: https://github.com/gillesdegottex
In more detail, during the development of the new PML vocoder, we changed the traditional linear addition of the deterministic (glottal source) and noise (breathiness) components that forms the source in the source-filter model of voice production.
Instead of an addition in the linear time domain, we used an addition in the log spectral domain. This new approach has a few advantages over the traditional one:
i) The buzziness effect commonly found in parametric speech synthesis is removed, since the addition in the log domain scrambles the phase of the deterministic component and prevents any unnatural concentration of energy.
ii) The amplitude model is set by the vocal tract model (the spectral envelope) and is preserved throughout the synthesis process, whereas with an addition in the linear domain the amplitude also depends on the deterministic/noise ratio, which complicates voice control more than necessary.
iii) The mathematical definition is simpler than in the traditional approach, which is a very convenient property for further developments.
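Points i) and ii) can be illustrated with a toy numerical sketch (a hypothetical illustration, not the actual pulsemodel code): when deterministic and noise components of equal spectral amplitude are mixed by linear addition, the resulting amplitude depends on their phase relation and mixing ratio, whereas setting the log amplitude from the envelope alone and randomizing the phase preserves the amplitude model exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1024
# Toy spectral envelope (strictly positive); stands in for the vocal tract model
envelope = np.abs(np.fft.rfft(np.hanning(n)))[: n // 2] + 1.0

# Deterministic component: envelope amplitude with a coherent (zero) phase
det = envelope * np.exp(1j * 0.0)
# Noise component: same envelope amplitude with random phase
noise = envelope * np.exp(1j * rng.uniform(-np.pi, np.pi, envelope.shape))

# Linear-domain addition: the mixed amplitude depends on the det/noise ratio
lin_amp = np.abs(0.5 * det + 0.5 * noise)

# Log-domain combination (sketch): log amplitude is set by the envelope alone,
# while the phase is scrambled, so the amplitude model is preserved
log_mix = np.exp(np.log(envelope)) * np.exp(
    1j * rng.uniform(-np.pi, np.pi, envelope.shape))

print(np.allclose(np.abs(log_mix), envelope))  # True: amplitude preserved
print(np.allclose(lin_amp, envelope))          # False: linear mix alters it
```

The sketch only demonstrates the amplitude-preservation argument; the actual PML vocoder operates on pulse waveforms and is fully specified in [J1].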
The new vocoder is complemented by a deinterference-based spectral envelope (as used in the STRAIGHT or WORLD vocoders).
The source code of this new PML vocoder is open and available online (https://github.com/gillesdegottex/pulsemodel).
During the second year, the recruited researcher spent time acquiring important knowledge in ANNs and statistical modelling in general. More precisely, the following techniques were studied: fully connected layers and their normalisation issues; recurrent networks; recurrent outputs; 1D convolutional layers for input pre-processing; 2D convolutional layers for spectrogram and noise-mask generation; and Generative Adversarial Networks (GANs) (basic, least-squares and Wasserstein variants). Thanks to these recent advances in neural network models, the synthesis of plosives and transients benefits greatly from these techniques.
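As one small example of the techniques listed above, a 1D convolutional layer for input pre-processing slides a kernel along the time axis of a feature sequence. The following is a hypothetical numpy sketch (all names and shapes are illustrative, not taken from the project's training code):

```python
import numpy as np

rng = np.random.default_rng(1)

T, F_in, F_out, K = 100, 8, 16, 5      # frames, input dims, output dims, kernel width
x = rng.standard_normal((T, F_in))     # input feature sequence (e.g. linguistic features)
w = rng.standard_normal((K, F_in, F_out)) * 0.1
b = np.zeros(F_out)

def conv1d(x, w, b):
    """'Same'-padded 1D convolution over the time axis."""
    T, _ = x.shape
    K, _, F_out = w.shape
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))   # zero-pad in time only
    y = np.empty((T, F_out))
    for t in range(T):
        # Each output frame mixes K neighbouring input frames
        y[t] = np.einsum('kf,kfo->o', xp[t:t + K], w) + b
    return y

y = conv1d(x, w, b)
print(y.shape)  # (100, 16)
```

In practice such layers would be built with a deep learning framework and trained by backpropagation; this sketch only shows the shape of the computation.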
Most results created by the HQSTS project appear in the new PML vocoder. The perceived quality has been clearly improved compared to state-of-the-art methods, especially by reducing the buzziness effect mentioned above (first presented in [C2], fully detailed in [J1], and then applied in [C3,C4]). Open-source code has been made available to the public at large on GitHub.com. This includes: the new PML vocoder; the full speech synthesis training system; and the generation of listening tests and demo pages.
[J1] G. Degottex, P. Lanchantin and M. Gales, "A Log Domain Pulse Model for Parametric Speech Synthesis", IEEE Transactions on Audio, Speech, and Language Processing, 26(1):57-70, 2018.
[C2] G. Degottex, P. Lanchantin and M. Gales, "A Pulse Model in Log-domain for a Uniform Synthesizer", in Proc. 9th Speech Synthesis Workshop (SSW9), Sunnyvale, CA, USA, 2016.
[C3] G. Degottex, P. Lanchantin and M. Gales, "Light Supervised Data Selection, Voice Quality Normalized Training and Log Domain Pulse Synthesis", in Proc. Blizzard Challenge 2017 - EH1, Stockholm, Sweden, 2017.
[C4] M. Wan, G. Degottex and M. Gales, "Integrated speaker-adaptive speech synthesis", in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Okinawa, Japan, 2017.
The fast pace of ANN improvements, and their consequences for speech synthesis, can make it difficult for speech researchers, in both industry and academia, to implement a state-of-the-art, competitive text-to-speech system.
Even though big companies (e.g. Google, Amazon, Facebook) are present in the speech community through publications, they do not always publish the full details of their systems for obvious reasons.
In this context, by making our speech synthesis training system available as open source, we hope that researchers and developers will have a solid anchor for developing and deploying state-of-the-art solutions in speech synthesis.