Periodic Reporting for period 2 - PAY-ME-ATTENTION (Disrupting the Communication between Humans and Computers - Understanding the Key Message in Simultaneous Conversations Through Voice Biometrics)
Reporting period: 2018-09-01 to 2019-08-31
Our mission is to revolutionize communication between people and computers, allowing people to communicate naturally with machines. We do this by creating the most innovative voice recognition algorithms based on the latest AI technologies. Our overall business objective is for Verbio algorithms to be present in every voice assistant and voice recognition system worldwide, making them more accurate and secure. The system operates at a 1% error rate, segregating user speech from other sources (silence, noise, music, and background) and overcoming existing technical bottlenecks.
Verbio’s Pay-me-attention system aims to be integrated into the natural flow of the conversation, making verification and transcription completely transparent to the user. Integrating Pay-me-attention should significantly reduce interaction times and, consequently, bring down costs and increase the user satisfaction rate.
Verbio’s Pay-me-attention will work with the normal flow of the conversation (8 sentences, under 30 seconds of voice parametrization), enabling transparent training and verification. It will also allow the creation of a unique voiceprint for any communication channel in the omnichannel paradigm (IVR, mobile, web, on-site).
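One possible reading of the enrollment constraint above (8 sentences totalling under 30 seconds of speech) can be sketched as a simple gate; the class and function names here are illustrative, not Verbio's actual API.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    """One enrollment sentence and its duration in seconds."""
    text: str
    duration_s: float

def enrollment_complete(utterances, n_sentences=8, max_total_s=30.0):
    """Return True once enough material has been collected to build a
    voiceprint: at least 8 sentences, totalling under 30 s of speech.
    Thresholds mirror the figures quoted in the report."""
    total = sum(u.duration_s for u in utterances)
    return len(utterances) >= n_sentences and total < max_total_s
```

A caller would keep collecting utterances from the live conversation until `enrollment_complete` returns True, keeping training transparent to the user.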
The core technology includes mechanisms to protect verification processes against replay and repetition attacks, as well as audio concatenation, making it possible to determine who is speaking and what is being said.
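As a minimal illustration of one such mechanism (not Verbio's actual method), a verification service can reject audio that exactly matches a previously seen recording, a crude first line of defence against replay attacks:

```python
import hashlib

class ReplayGuard:
    """Hypothetical sketch: flag verification attempts whose audio is a
    byte-for-byte copy of earlier audio. Real anti-spoofing systems use
    acoustic features rather than exact hashes."""

    def __init__(self):
        self._seen = set()

    def check(self, pcm_bytes: bytes) -> bool:
        """Return True if the audio is new, False if it replays a
        previously observed recording."""
        digest = hashlib.sha256(pcm_bytes).hexdigest()
        if digest in self._seen:
            return False  # exact replay of earlier audio
        self._seen.add(digest)
        return True
```

In practice such a check would complement, not replace, spoofing countermeasures based on channel and liveness cues.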
The Pay-me-attention architecture has been designed, resulting in a block diagram describing the connections between the project modules, as well as the inputs and outputs of each module.
We started WP2 by designing the speech enhancement module, and then worked on the speaker diarization module. For the speech biometrics module, we worked on identifying a voice segment against a database of biometric voiceprints.
For the speech recognition module, a voice-to-text conversion system has been designed and implemented. The main objective of this task was to provide input to the attention detection module. However, it was also important in the design to export all the information generated by the recognizer to the other modules, so that their outputs are fed back and improved. As an output of this work, we have a functional CSR, shown in a demo delivered at M12. Finally, for the speech analytics module, we have been working on the component located at the top of the processing hierarchy of the directed-attention system.
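The module chain described above can be sketched as a simple pipeline in which each stage consumes the state produced by the previous ones, with attention detection at the top of the hierarchy. All stage names and the toy logic are hypothetical stand-ins for Verbio's internal interfaces.

```python
from typing import Callable, List


class Pipeline:
    """Chains processing modules in the order of the block diagram:
    enhancement -> diarization -> biometrics -> recognition -> attention."""

    def __init__(self, stages: List[Callable]):
        self.stages = stages

    def run(self, audio):
        state = {"audio": audio}
        for stage in self.stages:
            state.update(stage(state))  # each stage enriches shared state
        return state


# Stub stages, for illustration only.
def enhancement(state):
    return {"clean_audio": state["audio"].strip()}

def diarization(state):
    return {"segments": [("spk1", state["clean_audio"])]}

def biometrics(state):
    return {"speaker_ids": ["enrolled-user"]}

def recognition(state):
    return {"transcript": state["clean_audio"]}

def attention(state):
    # Toy directed-attention check: is the assistant being addressed?
    return {"attended": "assistant" in state["transcript"]}


pipe = Pipeline([enhancement, diarization, biometrics, recognition, attention])
out = pipe.run("  hey assistant, schedule a meeting  ")
```

The shared-state design also reflects the report's point that recognizer output is exported to the other modules so their results can be fed back and improved.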
During the second reporting period we worked on a prototype for the meeting room use case, although the modules developed were also initially tested and evaluated for the robot use case. To evaluate the whole system, a perceptual evaluation of the Meeting Assistant is being carried out, analysing meeting sessions in Verbio’s facilities, intended as a real working environment. Moreover, we have also tested the prototype with audio from beta testers (Softbank and AISoy) based on the meeting room assistant prototype, to detect usability problems and to evaluate any shortcomings of the proposed prototype.
In view of all the evaluation results (task functionalities, global evaluation, and beta tester evaluations), some modifications have been proposed to improve the prototype’s reliability, usability, and optimization. These modifications are to be developed during the next stage of the project in order to obtain a fully functional system that can be deployed by a client. Agreements with Softbank and Intel, which recently showed interest in PAY-ME-ATTENTION outputs, are the boosters of this next stage.
The PAY-ME-ATTENTION project has been a success from Verbio’s point of view. It has not only enabled the integration of existing technologies, and their development from a slightly different approach than is usual in the company, but has also endowed us with knowledge and technology that we had not previously developed. It has allowed us to work in different environments (meeting room, robot) and to increase the technological capacity of the company.
Moreover, the final prototype has been successful and functional, and will allow Verbio to develop future projects both with current partners and customers and with potential future ones.
All communication and dissemination actions planned for the project lifetime have been carried out correctly. Furthermore, the KPIs defined to measure the effectiveness of the dissemination tools have been met.
As a business model, we have analysed the financial impact of the project as a meeting room assistant. The idea is to launch our software on the market in collaboration with existing Verbio clients such as Softbank and Intel, who are interested in our new technology and want to deploy our meeting room assistant in all of their companies’ offices and meeting rooms. The next goal is to increase our number of clients over the coming years; we estimate reaching 50 or more clients in the first five years.
There are four key reasons why voice is well suited to a range of human-to-machine (H2M) applications:
1. Speech is the natural mode of communication for humans. It is intuitive, and commands are easier to convey verbally.
2. Voice recognition is particularly appealing when the user’s hands or eyes are otherwise occupied. For example, under distracted-driving legislation it may be not only convenient but also a legal requirement to use verbal commands rather than be distracted by typing or other means of input.
3. Voice telephony is an efficient means of bi-directional voice communication with machines that can listen and respond without the need for complex commands.
4. Cost savings: voice integration could challenge the need for a touch screen on many devices, reducing costs for devices that are dormant most of the time. This represents a shift from a volume model to a transactional one: e.g. a million devices that generate 50,000 calls per year are probably better serviced through a voice connection than by putting a €5 screen on each of a million devices.
Voice recognition allows a range of functions to be controlled by means of voice on a number of different device types such as computer operating systems, commercial software for computers, mobile devices (smartphones, tablets), cars, call centres, domotic devices and internet search engines such as Google. By removing the need to use physical input mechanisms, users can easily operate appliances with their hands full or while doing other tasks.