In ASR, ELITR has pushed the state of the art by improving end-to-end neural ASR technology and adapting it to streaming operation, i.e. making it work on a continuous flow of speech and deliver output immediately. In MT, ELITR has focused on the handling of document-level phenomena and on multilinguality, i.e. developing systems which handle more than one language pair, and often many more. By combining ASR and MT, and also by proposing a direct end-to-end speech-to-text translation system, ELITR has taken the first steps towards 'automatic interpretation'.
The speech processing and machine translation components produced and repeatedly improved by the ELITR research partners were integrated into a complex and customizable system. All the planned test and demonstration events, and many additional ones, were run remotely or in person, with the pipeline adjusted to the needs of the particular event. We translated either from the speech of the original speaker or from the output of a simultaneous interpreter who was present on site. The most complex setup (from 5 to 42 languages) was used for the EUROSAI Congress, organized by the Supreme Audit Office of the Czech Republic, an affiliated partner of the project. For details, see our publication "Operating a Complex SLT System with Speakers and Human Interpreters" at the 1st Workshop on Automatic Spoken Language Translation in Real-World Settings (ASLTRW). These sessions kept attracting other event organizers and led to further field tests. ELITR partners are still in touch for occasional demonstrations at further events, e.g. META-FORUM 2022. As one of the important improvements, ELITR introduced a modification of the speech recognition model that allows words unknown to the system to be added at runtime; this is particularly important for specific terminology and people's names.
The speech translation quality varies with many factors, such as the conditions of sound acquisition, speech quality and accent, topic and domain, and the requested combination of languages. In some settings, the outputs can be followed well and the important content accessed in the target language, despite some level of output errors. In less favourable situations, various system components can underperform, rendering the outputs insufficient for the purpose.
For the goal of automatic minuting (automatic creation of meeting minutes), ELITR has successfully started a novel research area, provided the community with a concrete dataset and sparked interest by running the AutoMin 2021 shared task. As a proof of concept, ELITR has also integrated this system with the alfaview remote conferencing platform, but we had to conclude that the speech recognition and segmentation quality in this setting is still very much a limiting factor. With manually revised speech transcripts, the best automatic minuting models produce promising outputs with acceptable fluency and somewhat lower adequacy. With realistic speech recognition outputs, the resulting minutes can still serve only as an initial draft. The current best models (both by ELITR and by other shared task participants) suffer from an unclear relation between the source and output length (they condense too much), so the level of content they cover is unknown. The released ELITR Minuting Corpus, however, facilitates a broad range of explorations of this topic, starting from the diversity observed across multiple manually created minutes of the same meeting.