Periodic Reporting for period 1 - SPEAKER DICE (Robust SPEAKER DIarization systems using Bayesian inferenCE and deep learning methods)
Reporting period: 2017-03-01 to 2019-02-28
Although apparently easy for humans, diarization is a highly challenging task for machines: it involves the complex problem of speaker recognition, it needs to find the (unknown) number of speakers in the utterance, it has to segment the speech into speaker turns (finding the boundaries between speakers), and it needs to deal with overlapped speech (cross-talk).
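The task described above can be pictured as producing speaker-labeled time segments, where overlapped speech means two or more labels are active at the same instant. The following is a minimal illustrative sketch with hypothetical segment data and speaker labels (not the project's actual system or format):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # segment start time, in seconds
    end: float     # segment end time, in seconds
    speaker: str   # speaker label assigned by the diarization system

# Hypothetical diarization output for a short recording: who spoke when,
# including an overlapped region (cross-talk) where two speakers coincide.
hypothesis = [
    Segment(0.0, 3.2, "spk1"),
    Segment(3.2, 7.5, "spk2"),
    Segment(6.8, 7.5, "spk1"),  # cross-talk: overlaps with spk2's turn
]

def speakers_at(t, segments):
    """Return the set of speakers active at time t."""
    return {s.speaker for s in segments if s.start <= t < s.end}

print(speakers_at(1.0, hypothesis))  # only spk1 is active
print(speakers_at(7.0, hypothesis))  # overlap: both spk1 and spk2
```

In real evaluations such as DIHARD, hypotheses like this are compared against a reference annotation to compute the Diarization Error Rate.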
One of the main applications of Speaker Diarization is the indexing of audiovisual resources by speaker. This indexing allows structured search and access to resources depending on the speaker of interest, which can be very useful in a wide range of scenarios. First of all, it would be very valuable for public institutions, allowing the indexing of sessions of parliaments, courts, etc. The indexing can also be helpful for companies, allowing, for example, access to specific parts of meetings or seminars.
Besides, TV, internet and radio broadcasters would benefit from such a system, as they could provide more versatile access to their contents. The indexing of TV broadcasts is of special interest, as it would allow the automatic colouring of subtitles according to the speaker, making the media more accessible to hearing-impaired people.
In addition to these direct applications of speaker diarization, diarization systems are also helpful and relevant for other related tasks. To list a few, diarization can be used for speaker adaptation in Automatic Speech Recognition (ASR). It is also a very important part of the system pipeline for Speaker Recognition (SR) in in-the-wild scenarios in which several speakers are present but only one is of interest. Moreover, it is relevant for the production of linguistic resources, as it allows collecting language utterances while avoiding speaker repetitions.
This project focuses on improving and extending current approaches, and developing new ones, to enhance the performance of Speaker Diarization systems. For that purpose, we set three main objectives: first, to optimize the current Bayesian models, which have a strong mathematical foundation, to achieve better performance. Second, driven by the success of artificial Neural Network (NN) based techniques for the related speaker recognition task, to integrate NN modules into the diarization pipeline. Third, to make the system applicable to the general case, so that it generalizes to any kind of speech and environment.
Two different neural-network-based modules have been integrated into the pipeline: one NN for the extraction of robust and discriminative features (embeddings), and another NN-based module that detects and handles overlapped speech (segments in which two or more speakers talk at the same time). The integration of these modules proved very successful. As a result, the BUT team led by the main researcher of the project achieved the third and first positions in the two tracks of the last DIHARD challenge, organized to foster research on hard diarization conditions. Results can be seen at: https://coml.lscp.ens.fr/dihard/2018/results.php. This research also led to several publications.
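A common pattern in such pipelines is to extract one embedding per speech segment and then cluster the embeddings so that segments from the same speaker share a label. The sketch below uses toy 2-D vectors and a simple greedy cosine-similarity clustering; it is a stand-in for the high-dimensional NN embeddings and the Bayesian/agglomerative clustering used in practice, and the threshold and data are purely illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cluster_embeddings(embeddings, threshold=0.5):
    """Greedy clustering: assign each segment embedding to the most
    similar existing cluster centroid (if above threshold), otherwise
    open a new cluster. A simplified stand-in for the clustering step
    of a diarization pipeline; returns one speaker label per segment."""
    clusters = []  # list of (centroid, member indices)
    labels = []
    for i, e in enumerate(embeddings):
        best, best_sim = None, threshold
        for k, (centroid, _members) in enumerate(clusters):
            sim = cosine(e, centroid)
            if sim >= best_sim:
                best, best_sim = k, sim
        if best is None:
            clusters.append((list(e), [i]))
            labels.append(len(clusters) - 1)
        else:
            centroid, members = clusters[best]
            members.append(i)
            n = len(members)
            # update the centroid as a running mean of its members
            new_centroid = [(c * (n - 1) + x) / n for c, x in zip(centroid, e)]
            clusters[best] = (new_centroid, members)
            labels.append(best)
    return labels

# Toy 2-D "embeddings" for six speech segments; real systems use
# high-dimensional vectors extracted by a neural network.
embs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (1.0, 0.1), (0.0, 0.8)]
print(cluster_embeddings(embs))  # -> [0, 0, 1, 1, 0, 1]: two speakers found
```

Note that the number of speakers is not given in advance: it emerges from the clustering, mirroring the "unknown number of speakers" difficulty described earlier.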
Finally, the diarization modules were successfully integrated into ASR and SR systems. Besides, the project raised the interest of industry: a collaboration has started with Ericsson to optimize the technology for application to real broadcast data.