Automatic speaker recognition is the task of having a machine identify the person speaking in a given recording. Several tasks are closely related to it: language recognition, where the system determines which language is being spoken; voice activity detection (VAD), where segments containing actual speech are separated from unwanted parts of the signal (silence, music); speaker diarization (SD), where the system determines the speaker turns in a recording; and automatic speech recognition (ASR), where the system processes a speech segment in order to transcribe the message contained in it.
The complexity of these tasks lies in the wide range of nuisance variability contained in the speech signal (recording device, acoustic conditions, etc.), which the system needs to disentangle from the information that is relevant to the target task. Automatic systems and humans face the same challenge. For instance, humans are relatively good at discriminating between speakers they know, but they find it much harder with unknown voices. Automatic systems, in contrast, can outperform humans when a large number of unknown speakers is involved, because they can take advantage of the information available in large datasets with thousands of hours of speech.
Speaker recognition and its related tasks have many applications in real-world scenarios, especially now that more and more devices are operated by voice alone. For instance, voice-driven banking applications should grant access only to the authorized person, for which robust text-dependent speaker recognition systems are essential. Moreover, a robust speaker representation notably improves speaker diarization and its applications, such as indexing audiovisual resources (internet content, company and institutional meetings, court sessions, parliamentary sessions), supporting hearing-impaired people with speaker-colored subtitles on TV, and building speaker-specific models for more accurate automatic transcription. It is also highly relevant to the production of linguistic resources for research and development.
The ETE SPEAKER project aims to improve speaker recognition systems so that they are robust across different scenarios and specific tasks, with a special focus on deep learning-based approaches. These systems learn the information needed to represent and discriminate between speakers directly from data, much as humans do during their own learning process. Along these lines, we have explored deep learning methods for text-dependent speaker recognition that extract information encoding both speaker identity and message content from the recordings, improving existing techniques and analyzing the behavior of different modules of the system (bottleneck feature extractors, x-vector neural embedding extractors, i-vector extractors, etc.). Furthermore, we have developed speaker diarization systems based on attention models and trained them end to end. In this way, the system learns the whole diarization task entirely from data: the separation of speaker turns, VAD, and even overlapped speech, where more than one speaker talks at the same time (a limitation of traditional approaches to this task).
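
To make the end-to-end formulation more concrete, the following Python sketch shows one way the output of such a diarization system could be interpreted. It assumes, as an illustration only and not as a description of the project's actual models, that the network emits one activity probability per speaker for every frame. Under that assumption, VAD, speaker turns and overlapped speech all fall out of the same matrix. The names `frame_types`, `speaker_segments`, `example_probs` and the threshold of 0.5 are hypothetical.

```python
from typing import List, Tuple

# Hypothetical model output: one row per frame, one column per speaker,
# each value being the assumed probability that the speaker is active.
example_probs: List[List[float]] = [
    [0.1, 0.0],  # nobody speaks
    [0.9, 0.2],  # speaker 0 only
    [0.8, 0.7],  # both speakers (overlap)
    [0.2, 0.9],  # speaker 1 only
    [0.1, 0.1],  # nobody speaks
]


def frame_types(probs: List[List[float]], threshold: float = 0.5) -> List[str]:
    """Label each frame by how many speakers exceed the threshold."""
    labels = []
    for frame in probs:
        n_active = sum(p > threshold for p in frame)
        if n_active == 0:
            labels.append("non-speech")  # what a separate VAD would reject
        elif n_active == 1:
            labels.append("single")
        else:
            labels.append("overlap")  # more than one speaker at once
    return labels


def speaker_segments(
    probs: List[List[float]], threshold: float = 0.5
) -> List[Tuple[int, int, int]]:
    """Return (speaker, start_frame, end_frame) turns; end is exclusive."""
    segments = []
    n_speakers = len(probs[0]) if probs else 0
    for spk in range(n_speakers):
        start = None
        for t, frame in enumerate(probs):
            active = frame[spk] > threshold
            if active and start is None:
                start = t  # a turn begins
            elif not active and start is not None:
                segments.append((spk, start, t))  # the turn ends
                start = None
        if start is not None:  # turn still open at the end of the recording
            segments.append((spk, start, len(probs)))
    return segments
```

Because each speaker is thresholded independently, two turns may cover the same frame, which is how this representation accommodates overlapped speech without a dedicated module.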