Voice is the most powerful medium for delivering content to people, in any language and any culture. Demand for global content that is instantly available worldwide keeps growing, yet a huge amount of this content is only subtitled, not dubbed. As a result, many stories are never told, and people cannot enjoy content in their native language. The cause is the traditional approach to expressive voice recording, which is time-consuming and expensive, while many content creators face budget and time constraints. Text-To-Speech (TTS) solutions are emerging to reduce the complexity of the standard voice production pipeline, but they are not yet developed enough to achieve the levels of emotion and prosody needed to give the same sensation as human voices. To succeed in increasingly competitive international markets, content creators cannot rely on poor, robotic artificial voices.
We have developed the first deep-learning-based technology able to synthesize controllable, highly expressive speech in multiple languages and with multiple voices. This technology will be the core engine of specific solutions and applications for creating and localizing expressive voice content across multiple markets and verticals, with the primary aim to “Voice the Unvoiced”.