CORDIS - EU research results

Exchanges for SPEech ReseArch aNd TechnOlogies

Periodic Reporting for period 1 - ESPERANTO (Exchanges for SPEech ReseArch aNd TechnOlogies)

Reporting period: 2021-01-01 to 2023-12-31

ESPERANTO will contribute to the development of the next generation of Artificial Intelligence: AI systems that interact with their users to learn better, as well as explainable systems whose decisions are understandable by humans. In order to make these technologies accessible to the largest number of people, ESPERANTO will address applications with limited resources in the fields of rare languages, robotics and education.
The aim of the ESPERANTO project is to move state-of-the-art speech technologies from a stage where they must be maintained by machine learning experts, for well-defined cases and using large amounts of data, to a stage where they can be used and maintained by domain experts, and to encourage their spread to more languages and contexts.
ESPERANTO also aims at initiating an international standardization of protocols and evaluation processes that will enable the development and industrialization of new technologies while addressing specific societal issues. These issues encompass the use of active and interactive learning for system maintenance and adaptation, the characterization of features that help domain experts understand the internal mechanisms of machine learning or deep learning systems, the development of technologies for under-resourced scenarios, and the standardization of the evaluation process for those complex tasks.

Among major events, the ESPERANTO consortium will co-organize JSALT workshops in collaboration with Johns Hopkins University. JSALT workshops consist of 2 weeks of summer school followed by 6 weeks of teamwork on research topics chosen by the organizers and funded by major American companies. For the second time in 25 years, this key event in the field of Artificial Intelligence will take place outside the American continent and will bring 5 teams of international researchers to Europe.
The ESPERANTO project is targeting 5 main goals to develop speech processing applications and support the community.

- In WP2, the project tackles the specific aspects of limited resources that can affect the development of reliable systems.

7 corpora have been collected and are publicly released or in the process of being released:
* speech corpus for individuals with vocal disabilities
* corpus dedicated to Human Assisted Lifelong Learning Speaker Diarization
* corpus featuring Malay dialect speech, accompanied by associated transcriptions
* collection of 13 hours of conversational speech in Sarawak Malay
* multi-dialectal Arabic Speech Corpus
* database specifically curated for nonnative English speech, with the primary aim of developing pronunciation scoring systems

Additionally, the ESPERANTO consortium has been actively developing and publishing new approaches to deal with under-resourced tasks or languages, for applications as diverse as speech-to-speech translation, speaker diarization, speaker recognition, voice pathology monitoring, automatic voice disorder detection, pronunciation scoring, spoken language understanding, emotion recognition and multi-channel distant speech processing.

- In WP3, ESPERANTO aims at developing automatic systems integrating human-assisted learning.
A human interface for speaker diarization correction has been developed and publicly released.
A special day on data collection and annotation has been organised by LMU, involving LNE, UNIZAR, USFD, BUT, OMILIA, JHU, USM, UNIMAS, Elyadata, Phonexia, CONICET, CENATAV, UY1.

- In WP4, ESPERANTO partners have been dealing with explainability and interpretability for speech applications.
The consortium has mostly considered 4 applications: speaker verification, diarization, emotion recognition and speech-to-speech translation.
A first approach focused on extracting speaker and emotion information from self-supervised speech models via channel-wise correlations.
For speaker diarization, the prediction of speaker segments alone is not enough, and additional paralinguistic information needs to be included. Several partners therefore aimed at converting the existing automatic outputs into interpretable clues that explain the automatic diarization.
Several articles related to this work have been submitted for publication at conferences whose dates fall after the date of this report.

- In WP5, the ongoing work will develop metrics, protocols and scenarios to evaluate different aspects of intelligent systems. More specifically, it will develop and assess protocols to evaluate systems involving a human in the loop, the ability of systems to deal with limited resources when transferring knowledge from a well-studied language to one with limited resources, and the level of explainability of systems.
To enable fair and reproducible benchmarking of human-in-the-loop speaker diarization, a simulation of a human expert has been implemented. This work has been published and publicly released under an open-source licence.
Four challenges have been organized. Databases, scripts and evaluation protocols have been released on this occasion.

- WP6 aims at producing training material to foster a new generation of speech scientists and engineers, as well as supporting and coordinating the production of teaching material, tutorials and documentation for the software frameworks.
4 workshops and 4 two-week summer schools that train young researchers in speech processing (from master's students to postdoctoral researchers) have been organized, gathering more than 80 and 110 researchers, respectively, from academia and industry.
The workshops have been great opportunities for ESPERANTO partners to collaborate with each other, but also with institutions outside the consortium, and to establish promising new collaborations.
Software frameworks, tools and corpora have been supported or created (SpeechBrain, Kaldi, Hyperion, ATCO2, Lhotse, DiaPer, GTensorFSTs, CalibrationTutorial...).
More than 30 videos of lectures and presentations are available online.


The partners have carried out many dissemination and exploitation actions, including participation in European Researchers' Nights, presentations to young audiences (secondary and high school), and many press releases in mainstream or specialized media.
The development of shared tools, corpora and evaluation protocols will catalyze research on related topics. The speech community has extensive experience in this domain, where shared resources have been shown to lead to numerous publications and to progress. The large international consortium gathered in ESPERANTO will provide an opportunity to make progress in different scientific but also societal domains. Evaluation is necessary for the development of systems that embed a human in the loop or offer more explainability. The creation of baselines, metrics, protocols and corpora dedicated to those tasks is a first step towards a next generation of intelligent systems. By taking the lead on those aspects, Europe will gain a strong advantage in this domain, especially considering that the resulting standards will be developed at an international level and will thus facilitate the spread of new technologies worldwide.
Research conducted on low-resource speech processing tasks will benefit many partners and countries, as this work will include the development of speech technologies for dialects and languages worldwide (Arabic dialects, Indonesian languages and dialects, African languages).