CORDIS - EU research results

Stream Learning for Multilingual Knowledge Transfer

Periodic Reporting for period 2 - SELMA (Stream Learning for Multilingual Knowledge Transfer)

Reporting period: 2022-07-01 to 2024-03-31

Very large amounts of multilingual data surround us, and the volume is still growing rapidly. With the emergence of Large Language Models (LLMs), the AI and NLP world changed significantly; OpenAI launched ChatGPT at the end of 2022. SELMA made use of these developments, for instance by using LLMs to generate tag descriptions for selected keyword pairs.

Over the past three years, SELMA addressed this potential from two sides: by significantly advancing multilingual language technologies from a research perspective, and by integrating concrete technological improvements into components, platforms and prototypes, most of which are available as open source to the public and to industry, in particular the media industry. SELMA shaped speech and text technologies for media analysis and production, yielding significantly improved results in, for example, topic labeling, clustering, summarisation, transcription, translation and voiceover.
A focus of the SELMA work was a unified approach to multilingual media monitoring and content production, leveraging and contributing to advances in deep learning, in particular in multilingual language modeling, knowledge transfer and language transfer.

In the second period of the project, significant research results in the field of language technology were achieved and integrated into the platforms (UC0, UC1 and UC2) and into the Use Case Applications / Prototypes (Podcast Creator, Diversity Application, M-PHANTOM, Diarization, DW Speaker, DW Summarizer). All objectives and KPIs were met, and a large part of the developed software and components, as well as the SELMA open-source platform, were released into the public domain. The plain X platform evolved into a product, was rolled out at Deutsche Welle and gained its first clients. The Monitio platform was enriched with a new NLP orchestration pipeline and new multilingual NLP analyses based on state-of-the-art AI methods, making it more scalable and avoiding the language bottleneck of translating content into English.
With the development of the SELMA OSS platform, many “beyond state of the art” NLP achievements are already publicly accessible at https://selma-project.github.io/. In addition, many SELMA developments and outcomes were integrated into the two use case platforms for news media monitoring and media production. Many of the new SELMA developments and models significantly improve the use of large volumes of data through advanced analytics and NLP technology.
Screenshot of SELMA OSS, available at https://selma-project.github.io/