CORDIS - EU research results
Robust End-To-End SPEAKER recognition based on deep learning and attention models

Improving computer abilities to recognise speakers

Automated speech recognition often runs into problems when multiple people are speaking. By training deep learning models on large amounts of data, researchers have shown how machines can be taught to identify individual speakers.

Automatic speech recognition (ASR) technology enables computers to recognise spoken language and transcribe it into text. As humans increasingly interact with machines using their voices – through mobile applications, search queries and personal assistants such as Google Home – demand for this technology is set to increase. Distinguishing individual speakers, and determining who speaks when in a given recording (a task known as speaker diarisation), are specific tasks within ASR. Potential applications include granting access to an authorised person, or customising devices to provide specific functionality depending on the speaker. For this technology to be consistently effective, however, certain challenges need to be fully addressed. High levels of background noise, or overlap between two or more speakers, often degrade machine performance. A lack of hardware on which to train automatic systems from large amounts of data has also hampered progress.
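The output of speaker diarisation – "who speaks when" – is commonly represented as a list of time-stamped speaker turns. The following sketch uses made-up segment data (not output from the project's systems) to show how such turns can be summarised, and how overlapped speech shows up as intervals claimed by two speakers at once:

```python
# Illustration of speaker diarisation output: "who spoke when".
# Segment times and speaker labels are hypothetical example data.
segments = [
    (0.0, 3.2, "speaker_A"),
    (3.2, 7.5, "speaker_B"),
    (7.0, 9.1, "speaker_A"),  # note: overlaps with the previous turn
]

def total_speaking_time(segments):
    """Sum the duration attributed to each speaker label."""
    totals = {}
    for start, end, speaker in segments:
        totals[speaker] = totals.get(speaker, 0.0) + (end - start)
    return totals

def overlapped_regions(segments):
    """Return (start, end) intervals where adjacent turns by different
    speakers overlap in time."""
    overlaps = []
    ordered = sorted(segments)  # sort turns by start time
    for (s1, e1, sp1), (s2, e2, sp2) in zip(ordered, ordered[1:]):
        if s2 < e1 and sp1 != sp2:
            overlaps.append((s2, min(e1, e2)))
    return overlaps

print(total_speaking_time(segments))  # seconds of speech per speaker
print(overlapped_regions(segments))   # intervals with overlapped speech
```

Overlapped intervals such as the one above are exactly the regions where traditional diarisation systems struggle, since two voices must be attributed simultaneously.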

Accurate speaker recognition

The ETE SPEAKER project, which was undertaken with the support of the Marie Skłodowska-Curie Actions programme and coordinated by Brno University of Technology in Czechia, set out to examine potential new approaches to speaker recognition. “Conditions that are common in real speech applications are still a challenge for automatic systems,” explains Marie Skłodowska-Curie fellow Alicia Lozano-Diez, now assistant professor at the Autonomous University of Madrid in Spain. Lozano-Diez and her team sought to develop robust speaker recognition systems that could perform the task in different scenarios. For this, they used deep learning-based algorithms, capable of discriminating between speakers directly from data. The project began with a thorough review of existing approaches, to see where new methods might be more effective. They then tested these new approaches. “A key means of making progress is technology evaluations that different experts and institutions organise,” says Lozano-Diez. “In these evaluations, experts from around the world develop systems to solve a specific task.” The ETE SPEAKER project team used these opportunities to develop and trial different approaches. They then compared their results with those of other teams, to identify the remaining challenges to tackle.
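A common deep-learning formulation of this task – sketched here as a generic illustration, not as the project's specific architecture – maps a variable-length utterance to a fixed-dimensional speaker embedding (as in x-vector-style systems) and compares two utterances by the cosine similarity of their embeddings. The toy code below stands in a simple average-pooling function for the trained network and uses synthetic "frame features":

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(features):
    """Stand-in for a trained deep embedding extractor: pool a
    variable-length sequence of frame features into one fixed-size
    vector. A real system would run a trained neural network here."""
    return features.mean(axis=0)  # temporal average pooling

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings; a higher
    score means the utterances more likely share a speaker."""
    return float(emb_a @ emb_b /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

# Synthetic frame features: two utterances from one "speaker"
# (features centred at +1) and one from another (centred at -1).
spk1_utt1 = rng.normal(loc=1.0, size=(200, 16))
spk1_utt2 = rng.normal(loc=1.0, size=(150, 16))
spk2_utt1 = rng.normal(loc=-1.0, size=(180, 16))

same = cosine_score(embed(spk1_utt1), embed(spk1_utt2))
diff = cosine_score(embed(spk1_utt1), embed(spk2_utt1))
print(f"same-speaker score: {same:.2f}, different-speaker score: {diff:.2f}")
```

Training the embedding network to discriminate between many speakers is what lets such systems learn speaker characteristics directly from data, rather than from hand-designed features.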

Deep learning approaches

Participation in these evaluations enabled Lozano-Diez to demonstrate how speaker recognition could be improved, and how some of the limitations of traditional approaches could be overcome. The team was able to exploit the potential of deep learning approaches, in part thanks to the data available today. “One system we developed for a particular challenge achieved the best results among all participants,” adds Lozano-Diez. “This evaluation focused on short-duration recordings. These can pose an extra challenge for automatic speaker recognition systems, given that each recording contains only a few seconds of speech.” The project also developed new methods for dealing with overlapped speakers in the task of speaker diarisation. Lozano-Diez plans to continue her research in this field, in pursuit of ever more accurate speaker recognition and diarisation technology. “New approaches are now able to handle the complex issue of overlapped speech by learning directly from data,” she explains. However, this type of data – accurately labelled and gathered from several different scenarios – is scarce, and Lozano-Diez believes that more research is needed to make this technology work properly in challenging conditions. Good examples might be a conversation in a restaurant with a lot of background noise, or remarks at a conference recorded with distant microphones.
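The short-recording difficulty mentioned above can be illustrated with the same kind of averaged embedding: with only a few seconds of speech, the embedding is estimated from fewer frames, so it is noisier and scores spread out. This is a toy numpy model with made-up numbers, not a result from the project:

```python
import numpy as np

rng = np.random.default_rng(1)

def utterance_embedding(n_frames, dim=16):
    """Average-pooled embedding of n_frames noisy frames from a toy
    speaker whose 'true' embedding is the all-ones vector."""
    frames = rng.normal(loc=1.0, size=(n_frames, dim))
    return frames.mean(axis=0)

def spread(n_frames, trials=500):
    """Std of one embedding coordinate across repeated recordings of
    the same length: a proxy for embedding noise."""
    return float(np.std([utterance_embedding(n_frames)[0]
                         for _ in range(trials)]))

short, long_ = spread(20), spread(200)
print(f"embedding noise with 20 frames: {short:.3f}; with 200 frames: {long_:.3f}")
```

The noisier embeddings from short recordings translate directly into less reliable same/different-speaker decisions, which is why short-duration evaluations are a demanding test bed.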
