CORDIS - EU research results

European Live Translator

Periodic Reporting for period 2 - ELITR (European Live Translator)

Reporting period: 2020-07-01 to 2022-03-31

The ELITR project focused on multi-lingual text and speech translation and on meeting summarization. In a multi-lingual environment such as the EU or many international organizations, the language barrier either slows down communication and brings substantial additional costs, or prevents communication altogether when the costs are not easy to justify or cover. Many opportunities for knowledge transfer or business are thus abandoned before they receive any thorough consideration.

Building upon experience in automatic speech recognition (ASR; speech-to-text) and machine translation (MT), ELITR targeted two particular use cases: (1) live conferences and congresses, and (2) remote meetings and teleconferences. While speech recognition systems for many languages are nowadays available in consumer cell phones, the quality and long-term stability of the recognition are far from sufficient for regular business use. A similar hindrance is observed for machine translation systems: some language pairs are handled exceptionally well and MT can match the performance of humans at the level of individual sentences, but quality for less-resourced languages lags behind and any context beyond individual sentences is generally disregarded.

The third area of ELITR goals was 'automatic minuting', i.e. automatic creation of meeting minutes, focusing specifically on work or project meetings. This topic is fairly distinct from the field of text summarization; it has not been sufficiently covered by research so far, and data are rather scarce. Once the necessary output quality is reached, automatic minuting will bring tremendous savings of time when creating minutes or when trying to catch up with a meeting where no minutes were created.

In ASR, ELITR has pushed the state of the art by improving end-to-end neural ASR technology and streamlining it, i.e. making it work on a continuous flow of speech, instantly delivering output. In MT, ELITR has focused on the handling of document-level phenomena and on multi-linguality, i.e. developing systems which handle (many) more than one language pair. By combining ASR and MT, and also proposing a direct end-to-end, speech-to-text translation system, ELITR has made the first steps towards 'automatic interpretation'.

The speech processing and machine translation components, produced and repeatedly improved by the ELITR research partners, were integrated into a complex and customizable system. All planned test and demonstration events, and many additional ones, were run, remotely or in person, with the pipeline adjusted to the needs of the particular event. We translated from the speech of the original speaker or from the output of a simultaneous interpreter who was present on site. The most complex setup (from 5 to 42 languages) was used for the EUROSAI Congress, organized by the Supreme Audit Office of the Czech Republic, an affiliated partner of the project. For details see our publication "Operating a Complex SLT System with Speakers and Human Interpreters" at the 1st Workshop on Automatic Spoken Language Translation in Real-World Settings (ASLTRW). These sessions kept attracting other event organizers and led to further field tests. ELITR partners are still in touch for occasional demonstrations at further events, e.g. META-FORUM 2022. As one of the important improvements, ELITR introduced a modification of the speech recognition model so that words not known to the system can be added at runtime; this is particularly important for specific terminology or people's names.

The speech translation quality varies depending on many particulars, such as the conditions of sound acquisition, speech quality and accent, topic and domain, and the requested combination of languages. In some settings, the outputs can genuinely be followed and important content accessed in the target language, despite some level of output errors. In less favourable situations, several system components can underperform, rendering the outputs insufficient for the purpose.

For the goal of automatic minuting (automatic creation of meeting minutes), ELITR has successfully started a novel research area, provided the community with a concrete dataset and sparked interest by running the AutoMin 2021 shared task. As a proof-of-concept, ELITR has also integrated this system with the alfaview remote conferencing platform, but we had to conclude that the speech recognition and segmentation quality in this setting is still very much a limiting factor. With manually revised speech transcripts, the best automatic minuting models produce promising outputs with acceptable fluency and somewhat lower adequacy. With realistic speech recognition outputs, the resulting minutes can still serve only as an initial preparation. The current best models (both by ELITR and other shared task participants) suffer from an unclear relation between the source and output length (condensing too much), and therefore from an unknown level of content coverage. The released ELITR Minuting Corpus, however, facilitates a broad range of explorations of this topic, starting from the diversity observed across multiple manually created minutes.
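The relation between source and output length mentioned above can be made concrete with two simple measurements. The following is only a minimal sketch and not the evaluation used by ELITR or the AutoMin shared task: the function names, the crude tokenizer and the vocabulary-overlap proxy for content coverage are all assumptions introduced for illustration.

```python
import re


def tokens(text):
    """Lowercase word tokens; a deliberately crude tokenizer (assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def length_ratio(transcript, minutes):
    """Length of the minutes divided by the length of the transcript, in tokens."""
    src = tokens(transcript)
    out = tokens(minutes)
    if not src:
        raise ValueError("transcript is empty")
    return len(out) / len(src)


def vocabulary_coverage(transcript, minutes, stopwords=frozenset()):
    """Share of distinct transcript words (outside stopwords) that reappear in the minutes.

    This is a hypothetical proxy for content coverage, not an established metric.
    """
    src = set(tokens(transcript)) - set(stopwords)
    out = set(tokens(minutes))
    if not src:
        return 0.0
    return len(src & out) / len(src)
```

A very low length ratio combined with a low coverage value would hint at the "condensing too much" behaviour, although word overlap says nothing about whether the retained content is the important part of the meeting.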
As documented in our publications, ELITR has substantially pushed the state of the art in speech recognition and translation, including online (i.e. simultaneous) end-to-end speech recognition and translation. ELITR has also created and released test sets ("ELITR test set" and the ESIC corpus) of speeches, their translation and also human interpretations, including a tool (SLTev) for comprehensive evaluation of the output quality. These tools serve as the basis for rigorous evaluation.

ELITR systems have the potential to considerably extend the set of languages provided at an event. While the output quality is not yet guaranteed to convey the message fully and, consequently, the users cannot rely solely on our systems, the outputs are a reasonable basis to build upon. We thus hope that our automatic speech transcription and translation will become a regular complement and extension to professional interpretation services, broadening the scope of offered languages at a very low additional cost, or offering automatic interpretation in situations where professionals cannot be afforded. The difference from the work of human interpreters has to be kept in mind. ELITR systems try to translate whatever was said rather literally, while human interpreters shorten, explain or adapt the content for the end user. As discussed in the summary video, the goal of ELITR in this area has been to complement the work of interpreters.

In accordance with our plans, ELITR has defined the task of 'automatic minuting'. We see this as a substantial expansion of the research focus in the area of summarization. Automatic minuting cannot be readily achieved with standard text summarization techniques. ELITR has released a novel dataset and run a shared task, in which one of the participating ELITR teams scored among the best.