Cost-effective, Multilingual, Privacy-driven voice-enabled Services

Periodic Reporting for period 2 - COMPRISE (Cost-effective, Multilingual, Privacy-driven voice-enabled Services)

Reporting period: 2020-06-01 to 2021-11-30

Alongside visual and tactile interfaces, voice interfaces are becoming an increasingly popular means of interacting with smart objects and applications. Some of the underlying technologies, namely Speech-to-Text (STT) and Natural Language Understanding (NLU), must be trained on large amounts of speech and text data stored in the Cloud. To do so, voice technology companies typically collect voice data from users and hire human annotators to transcribe them into text. Application developers then define a list of possible user requests and associated answers for every application. This process must be repeated for every language. This approach (i) raises critical privacy concerns relating to the users’ voice characteristics and the spoken contents, (ii) is not inclusive, in the sense that it fails to address languages and categories of users for which little data is available, and (iii) incurs high costs both for application developers and for voice technology companies, which has led to market domination by tech giants.
COMPRISE has defined a fully private-by-design methodology that reduces the cost and increases the inclusiveness of voice interaction technology. The innovative software tools developed have increased privacy to an unprecedented level, allowed the development of dialogue systems without any training resources in the target language, and reduced the cost of integrating voice features in mobile applications by more than 70%. These tools are now part of the COMPRISE SDK and the COMPRISE Cloud Platform, which are available in open source for voice technology companies and application developers.
With respect to the privacy objective, we released two software tools which protect the voice of the users and their personal information: the COMPRISE Voice Transformer and the COMPRISE Text Transformer. The COMPRISE Voice Transformer aims to prevent biometric identification of the users by converting their voice to another random person’s voice. It offers the same level of privacy protection among 50 speakers as original speech among 20,000 speakers, as validated through state-of-the-art biometric protocols. The COMPRISE Text Transformer aims to identify potentially privacy-threatening words or phrases in a piece of text and to replace them with harmless alternatives that preserve the text’s structure. The main innovation lies in our word and phrase replacement strategy, which offers formal privacy guarantees.
With respect to the inclusiveness objective, the COMPRISE Speech-to-Text Translation tool can translate spoken language in a way that is robust to STT errors and disfluencies (e.g. hesitations, missing words). We also introduced a multilingual NLU system, which detects user intent in any language without any training resources in that language, and an STT personalisation method, which improves STT performance by 27% (relative) for users with regional or foreign accents using only 1 h of untranscribed training data per accent.
With respect to the cost-effectiveness objective, COMPRISE Weakly Supervised STT reduces the amount of human-annotated data needed to train STT systems by more than 40%, while COMPRISE Weakly Supervised NLU benefits from as few as 100 labelled training examples and scales seamlessly down to a zero-shot setting, requiring no training at all. All these innovative software tools combine cutting-edge deep learning, speech and language processing approaches with new approaches developed within COMPRISE.
Existing and new software tools have been integrated into an SDK interoperating with a Cloud Platform, which together provide a full-fledged open-source solution for voice technology companies and application developers. The COMPRISE SDK includes the COMPRISE Client Library, which can be deployed on any Android or iOS device and integrates all required voice functionalities, the COMPRISE App Wizard, which allows quick configuration of these functionalities, and the COMPRISE Personal Server, which runs computationally demanding services outside the device while still preserving privacy. The COMPRISE Cloud Platform provides services for data collection and curation and for system training.
We have also developed six demonstrators to showcase these innovative tools: Cookbook, Notes, Remote Presentation Control, Shoplay, Hospital Concierge, and Doctor’s Assistant. The integration of voice features in the Remote Presentation Control demonstrator took 2 person-months (PMs) with COMPRISE vs. 7 PMs without it, which translates into cost savings above 70%. These demonstrators were evaluated by potential end-users, who appreciated the new user experience offered by voice features and rated the demonstrators positively. This validates the benefits of COMPRISE, especially in the sectors of smart consumer apps, e-commerce, and e-health.
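The cost-saving figure quoted for the Remote Presentation Control demonstrator can be checked with a short calculation. The sketch below assumes that cost is proportional to effort in person-months (PMs), which the report implies but does not state; the function name `relative_saving` is purely illustrative.

```python
def relative_saving(effort_with: float, effort_without: float) -> float:
    """Fraction of effort saved relative to the baseline without the tooling.

    Assumption: cost scales linearly with effort in person-months.
    """
    if effort_without <= 0:
        raise ValueError("baseline effort must be positive")
    return (effort_without - effort_with) / effort_without


# Figures reported for the Remote Presentation Control demonstrator:
# 2 PMs with COMPRISE vs. 7 PMs without it.
saving = relative_saving(effort_with=2, effort_without=7)
print(f"{saving:.1%}")  # prints 71.4%
```

Under that assumption, the saving is 5/7, i.e. about 71%, consistent with the "above 70%" figure.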
All of these advances have been supported by rigorous project management, by a comprehensive summary and analysis of the main aspects of the General Data Protection Regulation (GDPR) that need to be considered for the implementation of the project and the development of COMPRISE, and by efficient dissemination, communication and exploitation activities.
COMPRISE is the first project worldwide to address the issue of privacy in voice technology. Pioneering privacy preservation solutions have been developed based on research advances in speech processing, natural language processing and machine learning. Additional research has allowed us to significantly reduce data annotation and application development costs, which opens a market for European SMEs against tech giants. It has also allowed us to develop multilingual dialogue systems and to narrow the performance gap between easy-to-understand and accented users at little or no extra cost, so as to provide a more inclusive user experience.
The COMPRISE SDK, the COMPRISE Cloud Platform, the COMPRISE Voice and Text Transformers and the COMPRISE Weakly Supervised STT and NLU tools are freely available as open source. Customisation and high-level support are also available at a cost. Thanks to our ambitious exploitation strategy, these COMPRISE outcomes are expected to enable many businesses in the Digital Single Market to quickly develop voice-enabled applications in many languages. They will also positively impact European citizens by offering unprecedented privacy guarantees, facilitating their access to voice-enabled contents and services in other languages, and improving their overall experience. COMPRISE will find application in many sectors beyond those demonstrated, e.g. e-government, e-justice, e-learning, tourism, culture, or media.