European Live Translator

Periodic Reporting for period 1 - ELITR (European Live Translator)

Reporting period: 2019-01-01 to 2020-06-30

The ELITR project focuses on multi-lingual text and speech translation and on meeting summarization. In a multi-lingual environment such as the EU or many international organizations, the language barrier either slows down communication and brings substantial additional costs, or prevents communication altogether when the costs are not easy to justify or cover. Many opportunities for knowledge transfer or business are thus abandoned before any thorough consideration.

ELITR builds upon decades of successful research in natural language processing, specifically in the areas of automatic speech recognition (ASR; speech-to-text) and machine translation (MT), and targets two particular use cases: (1) live conferences and congresses, and (2) remote meetings and teleconferences. While speech recognition systems for many languages are nowadays available in consumer cell phones, the quality and long-term stability of the recognition are far from sufficient for regular business use. A similar hindrance is observed for machine translation systems: some language pairs are handled exceptionally well, and MT can match the performance of humans at the level of individual sentences, but less-resourced languages under-perform, and any context beyond individual sentences is generally disregarded. In ASR, ELITR aims to push the state of the art by introducing end-to-end neural ASR technology and making it operate in streaming mode, i.e. on a continuous flow of speech, delivering output instantly. In MT, ELITR will improve the handling of document-level phenomena and develop multi-lingual systems in which one system caters for several or many language pairs. By combining ASR and MT, and also proposing a direct end-to-end speech-to-text translation system, ELITR will create a system of 'automatic interpretation', i.e. automatic translation of speech into many target languages.
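The cascaded architecture described above can be illustrated with a minimal structural sketch: ASR produces a transcript, which is then fanned out to one MT step per target language. All function names below are hypothetical placeholders for illustration only, not ELITR's actual components; the 'audio' is represented as text to keep the sketch self-contained.

```python
def recognize(audio_chunk: str) -> str:
    """Stand-in for streaming ASR: returns a text transcript.

    A real system would decode audio incrementally and revise
    hypotheses as more speech arrives.
    """
    return audio_chunk.strip()


def translate(text: str, target_lang: str) -> str:
    """Stand-in for MT into one target language.

    A real multilingual system would share one model across
    many language pairs rather than running one model per pair.
    """
    return f"[{target_lang}] {text}"


def subtitle_stream(audio_chunks, target_langs):
    """Cascade: ASR first, then MT into every requested language.

    Yields one dict of subtitles (keyed by language) per input chunk,
    mirroring the 'many target languages at once' setting.
    """
    for chunk in audio_chunks:
        transcript = recognize(chunk)
        yield {lang: translate(transcript, lang) for lang in target_langs}


# Example: one utterance, three target languages at once.
for subtitles in subtitle_stream(["Welcome to the congress."], ["de", "cs", "fr"]):
    print(subtitles["cs"])  # prints "[cs] Welcome to the congress."
```

An end-to-end speech-to-text translation system, by contrast, would replace the `recognize`/`translate` cascade with a single model per target language, avoiding error propagation from the ASR step.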

These technologies will be tested in practice, delivering speech translation at the EUROSAI Congress, the congress of supreme audit institutions of the EU and neighbouring countries, organized by the Supreme Audit Office of the Czech Republic, an affiliated partner of the project. Live speech from the presentations will be recognized and simultaneously translated, targeting 43 languages at once. This system will support participants who are not sufficiently proficient in any of the six languages covered by professional interpretation. The second use case is covered by alfatraining, a full project partner, which develops and operates a remote conferencing platform, with the goal of supporting cross-lingual remote calls.

Lastly, ELITR will chart the area of speech summarization, aiming at the automatic creation of meeting minutes, or 'automatic minuting' for short. This topic has not been sufficiently covered by research so far, and data are rather scarce. While aiming at a specific practical application, ELITR will first have to focus on a precise definition of the task, on preparing the necessary training and evaluation datasets, and on proposing baseline approaches. Given the short duration of the project, we will seek to boost research in this area by organizing a shared task on automatic minuting. If successful, automatic minuting would bring tremendous savings of time when creating minutes or when trying to catch up with a meeting for which no minutes were created.
In the first half of the project (18 months), ELITR has been progressing very well. We integrated all the research system components into a complex pipeline and ran two planned and numerous additional events of automatic subtitling from source speech into many target languages. The whole system is now sufficiently advanced and stable from the technical point of view to operate at live or remotely-run events. Depending on multiple factors of sound acquisition, speech quality and accent, topic and domain, and the requested combination of languages, we can deliver good speech translations, presented in an accessible way either as subtitles or as longer paragraphs. In less favourable settings, some of the system components can underperform, rendering the outputs insufficient for the purpose. The gradual improvement of all the underlying technologies is planned and being worked on.

The task of meeting summarization is exceptionally complex, as anticipated. We are progressing well, collecting and annotating the necessary data and trying out first approaches. We do not expect to reach a fully working system for automatic minuting, but we are confident that we will define the task in a concise and attractive way so that this new field of language processing research can thrive.
The goals for the rest of the project duration are twofold: (1) continue the research on the underlying technologies of ASR and MT, aiming to improve the state of the art in difficult aspects such as robustness to the accent of the speaker, adaptability and robustness to domain mismatch, continuous learning during operation, inclusion of wider context in the decisions, and end-to-end modelling, which promises, inter alia, better handling of ambiguous or unclear inputs; and (2) improve the quality of speech transcription and translation for our particular use case: live subtitling at the EUROSAI Congress, to be held in Prague in spring 2021. The second goal demands thorough domain and speaker adaptation, perpetual testing and evaluation, as well as the inclusion of new approaches discovered in (1) as soon as they are mature enough for real-time deployment.

In the areas of speech recognition and translation, we expect our project to produce a number of techniques that are novel from the research point of view. At the same time, we expect to raise awareness of these technological advances thanks to our test events, primarily the main one, the EUROSAI Congress. Upon success, automatic speech transcription and translation could become a regular complement and extension to professional interpretation services, substantially broadening the range of offered languages at a very low additional cost, or offering automatic interpretation in situations where professionals cannot be afforded.

In the area of speech summarization, we expect to establish the task of automatic minuting, creating a well-defined subfield of natural language processing. While a working practical application cannot be expected in this short time, we will spark a series of shared tasks (scientific competitions) which will gradually bring the precision and coverage of automatic minutes to the required levels of quality.