CORDIS - EU research results

Robust End-To-End SPEAKER recognition based on deep learning and attention models

Periodic Reporting for period 1 - ETE SPEAKER (Robust End-To-End SPEAKER recognition based on deep learning and attention models)

Reporting period: 2019-06-01 to 2021-01-31

Automatic speaker recognition is the task in which a machine identifies the person speaking in a given recording. There are several closely related tasks: language recognition, where the system determines which language is being spoken; voice activity detection (VAD), where segments containing actual speech are separated from other unwanted content in the signal (silence, music); speaker diarization (SD), where the system determines speaker turns in a recording; and automatic speech recognition (ASR), where the system processes the speech segment in order to transcribe the message contained in it.

The complexity of these tasks lies in the wide range of nuisance variability contained in the speech signal (recording device, acoustic conditions, etc.), which the system needs to disentangle from the information that is relevant for the target task. These challenges are faced by automatic systems and also by humans. For instance, while humans are relatively good at discriminating speakers known to them, doing so with unknown voices is a real challenge. Automatic systems, in contrast, are able to outperform humans when a large number of unknown speakers is involved, and they can take advantage of the information available in large datasets with thousands of hours of speech.

Speaker recognition and its related tasks have several applications in real-world scenarios, especially now that more and more devices are operated by voice alone. For instance, voice-driven banking applications should grant access only to the authorized person, for which robust text-dependent speaker recognition systems are essential. Moreover, obtaining a robust speaker representation notably improves speaker diarization and all its relevant applications, such as indexing audiovisual resources (internet, company and institutional meetings, court sessions, parliament sessions), support for hearing-impaired people through speaker-colored subtitles on TV, or speaker-specific models for more accurate automatic transcription. It is also a very relevant task for the production of linguistic resources useful for research and development.

The ETE SPEAKER project aims to improve speaker recognition systems to make them robust to different scenarios and specific tasks, with special focus on deep learning-based approaches. These systems are able to learn the information needed to represent and discriminate between speakers directly from data, similar to what humans do during their learning process. Along these lines, we have explored deep learning methods that extract information from the recordings encoding both speaker identity and message content in the context of text-dependent speaker recognition, improving existing techniques and analyzing the behavior of different modules of the system (bottleneck feature extractors, neural embedding x-vector and i-vector extractors, etc.). Furthermore, we have developed speaker diarization systems based on attention models and trained them in an end-to-end way. In this way, the system learns the whole diarization task entirely from data: the separation of speaker turns, VAD, and even the handling of overlapped speech, where more than one speaker is speaking (a limitation of traditional approaches to this task).

The objectives set for the ETE SPEAKER project have been fulfilled. The experiments and results obtained have improved the performance of existing systems, provided a deeper analysis of certain aspects, and overcome some of the limitations of traditional approaches.

In particular, the BUT text-dependent speaker verification system, developed in the context of the Short-duration Speaker Verification (SdSV) challenge and led by the researcher of the project, achieved first place in the challenge. The SdSV evaluation focused on short-duration utterances, which pose an extra challenge for automatic speaker recognition systems, given that each recording contains only a few seconds of speech. Results of the challenge are available online. The system has been analyzed in the related conference publication (Interspeech 2020) and is being extended for a journal paper (currently in preparation for submission).

In addition, the researcher has focused on the development of an end-to-end speaker diarization system, which is able to deal with overlapped speakers, unlike the traditional approaches based on x-vector clustering. This system has been integrated and combined with the rest of the BUT diarization systems based on Bayesian HMMs and x-vectors, providing gains especially in the domain of telephone conversations. The BUT diarization system was among the top-5 performing systems for track 1 of the last DIHARD challenge, and the end-to-end approach has shown its potential and advantages over traditional systems when enough reliable training data is available. The analysis and improvement of this technique are ongoing work, carried out in collaboration between the researcher and the host group of the project.

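The difference from clustering-based diarization can be illustrated with a small sketch. It is not the project's system: it only assumes that some end-to-end model has already produced, for every frame, one activity probability per speaker. The function names, the 0.5 threshold and the 0.1 s frame shift are illustrative assumptions. Because each speaker is decided independently, a frame can carry no speaker (silence, the VAD case), one speaker, or several speakers (overlap), whereas a clustering approach assigns each frame to a single speaker.

```python
from typing import Dict, List, Tuple


def decode_activity(probs: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Threshold per-frame, per-speaker probabilities into 0/1 activity decisions."""
    return [[1 if p >= threshold else 0 for p in frame] for frame in probs]


def activity_to_turns(
    activity: List[List[int]], frame_shift: float = 0.1
) -> Dict[int, List[Tuple[float, float]]]:
    """Convert binary frame activity into (start, end) turns in seconds per speaker."""
    turns: Dict[int, List[Tuple[float, float]]] = {}
    if not activity:
        return turns
    for spk in range(len(activity[0])):
        segments: List[Tuple[float, float]] = []
        start = None  # index of the frame where the current turn began
        for t, frame in enumerate(activity):
            if frame[spk] and start is None:
                start = t
            elif not frame[spk] and start is not None:
                segments.append((round(start * frame_shift, 3), round(t * frame_shift, 3)))
                start = None
        if start is not None:  # turn still open at the end of the recording
            segments.append(
                (round(start * frame_shift, 3), round(len(activity) * frame_shift, 3))
            )
        turns[spk] = segments
    return turns


# Hypothetical model output: 5 frames, 2 speakers.
example_probs = [[0.1, 0.2], [0.9, 0.1], [0.8, 0.7], [0.2, 0.9], [0.1, 0.1]]
example_activity = decode_activity(example_probs)
example_turns = activity_to_turns(example_activity)
```

In this toy input the third frame is active for both speakers, so the two resulting turns overlap in time, and the first and last frames are active for neither, which is how silence is represented in the same output.
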
The work carried out during the ETE SPEAKER project has contributed to the state-of-the-art approaches in the field. The data now available for training deep learning approaches in text-dependent speaker recognition and speaker diarization have allowed the study of different techniques that outperformed traditional automatic systems or overcame certain of their limitations.

These findings have an impact on the field of voice-driven devices and applications, as well as on the wide range of applications related to speaker diarization, such as indexing of media content by speaker, information analysis on broadcast TV, and improvement of automatic transcription by using speaker information for adaptation.

Several companies working on security or voice-controlled devices could exploit the results of this project. Regarding social and gender aspects, this project contributed to the successful international research carried out at the BUT Speech@FIT group, and promoted scientific work in machine learning and speech processing among female researchers and students.

[Figure: Speaker verification task]
[Figure: Speaker diarization task]