Cross-Lingual Embeddings for Less-Represented Languages in European News Media

Project Information

EMBEDDIA

Grant agreement ID: 825153

Project website

DOI

10.3030/825153

Project closed

EC signature date 24 October 2018

Start date 1 January 2019

End date 31 March 2022

Funded under

INDUSTRIAL LEADERSHIP - Leadership in enabling and industrial technologies - Information and Communication Technologies (ICT)

Total cost

€ 2 998 850,00

EU contribution

€ 2 998 850,00

Coordinated by

INSTITUT JOZEF STEFAN
Slovenia

Periodic Reporting for period 2 - EMBEDDIA (Cross-Lingual Embeddings for Less-Represented Languages in European News Media)

Reporting period: 2020-07-01 to 2022-03-31

While advanced natural language processing (NLP) tools and resources exist for a few dominant languages, many of Europe's smaller language communities—and the news media industry that serves them—lack appropriate tools.

The EMBEDDIA project (Cross-Lingual Embeddings for Less-Represented Languages in European News Media) addressed these challenges via innovations in cross-lingual and multilingual embeddings coupled with deep neural networks to allow existing monolingual resources and tools to be used across languages.

The main scientific goals were to:
- Develop the embeddings technology for a new generation of NLP tools, which are both multilingual (able to deal with or produce content and resources in multiple languages) and cross-lingual (transfer easily across languages);
- Develop tools and resources for less-resourced morphologically rich EU languages;
- Leverage tools developed for well-resourced languages to be used for less-represented languages;
- Advance deep learning methods and improve their explainability.

Based on these scientific advances, the project developed tools to address several major challenges in the news media industry: offensive speech filtering and opinion mining for news comments; topic analysis, news linking, sentiment detection and summarization for news articles; and news generation from structured and unstructured data.

The project reached its objectives and released a large number of novel pretrained embedding models and tools, including tools for comment moderation and keyword extraction that EMBEDDIA media partners now use in production.

We built several large pretrained contextual language models, both monolingual (Slovene, Estonian) and multilingual (Croatian-Slovene-English, Finnish-Estonian-English, Lithuanian-Latvian-English), created several novel evaluation benchmarks, and showed that our models improve performance in monolingual processing and cross-lingual transfer. We also developed a new approach for injecting background knowledge into deep neural networks, adapted techniques for explaining classifier decisions, and developed a novel technique to prevent dieselgate-like attacks on explanation techniques.
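As a rough illustration of what explaining a classifier decision involves, the sketch below scores each word by how much the prediction drops when that word is removed. This is a simple occlusion-style stand-in, not IME, LIME, SHAP or any EMBEDDIA tool; the function `toy_offensiveness_score` and its word list are hypothetical placeholders for a trained model.

```python
# Hypothetical stand-in for a trained classifier: each flagged word
# halves the remaining "probability of being acceptable".
FLAGGED_WORDS = {"idiot", "stupid"}  # made-up list, for illustration only


def toy_offensiveness_score(tokens):
    """Return a score in [0, 1); higher means more likely offensive."""
    flagged_count = sum(token in FLAGGED_WORDS for token in tokens)
    return 1.0 - 0.5 ** flagged_count


def occlusion_importance(text, score_fn):
    """Score each token by the drop in score_fn when that token is removed."""
    tokens = text.lower().split()
    base_score = score_fn(tokens)
    importances = []
    for index, token in enumerate(tokens):
        reduced = tokens[:index] + tokens[index + 1:]
        importances.append((token, base_score - score_fn(reduced)))
    return importances


if __name__ == "__main__":
    for token, importance in occlusion_importance(
        "you are an idiot", toy_offensiveness_score
    ):
        print(f"{token}: {importance:.2f}")
```

In this toy setting only the flagged word receives a non-zero importance. Methods such as LIME and SHAP rely on more careful sampling and weighting of perturbed inputs than this leave-one-out loop.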

Next, we developed methods for named entity recognition and linking and for event detection, and tested them on well-known evaluation benchmarks, with excellent results in the HIPE competition at CLEF 2020, the SlavNER 2021 shared task and the TREC 2021 Incident Streams track. For keyword extraction, we proposed several novel methods, including the multilingual unsupervised system RaKUn, the monolingual supervised system TNT-KID (also implemented in production by Ekspress Meedia), and a new cross-lingual method.

One focus was cross-lingual analysis of user-generated content. We developed a set of new methods for author profiling and for sentiment and opinion detection. In comment filtering, the most relevant applied task for news media, we developed cross-lingual methods that match the accuracy of monolingual classifiers with much less target-language training data, as well as versions that improve performance by incorporating topic knowledge. The Croatian media company 24sata is now using one of our comment moderation systems in production.
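The idea behind cross-lingual transfer for comment filtering can be sketched in a few lines, assuming a shared embedding space in which translations lie close together. Everything below is a toy assumption: the two-dimensional vectors in `SHARED_EMBEDDINGS` are made up rather than taken from any EMBEDDIA model, and the nearest-centroid classifier is chosen for brevity, not because the project used it.

```python
from math import sqrt

# Hypothetical shared embedding space: translation pairs get nearby vectors.
# A real system would obtain these from a pretrained multilingual model.
SHARED_EMBEDDINGS = {
    # English (source language, used for training)
    "idiot": (0.90, 0.10), "stupid": (0.80, 0.20),
    "thanks": (0.10, 0.90), "great": (0.20, 0.80),
    # Slovene (target language, never seen during training)
    "bedak": (0.88, 0.12), "neumen": (0.79, 0.22),
    "hvala": (0.12, 0.88), "odlično": (0.21, 0.80),
}


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def embed(text):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [SHARED_EMBEDDINGS[token] for token in text.lower().split()
               if token in SHARED_EMBEDDINGS]
    if not vectors:
        raise ValueError(f"no known tokens in: {text!r}")
    return mean_vector(vectors)


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def train_centroids(examples):
    """Compute one centroid per label from (text, label) pairs."""
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append(embed(text))
    return {label: mean_vector(vectors) for label, vectors in by_label.items()}


def predict(text, centroids):
    """Return the label whose centroid is most similar to the text."""
    vector = embed(text)
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


# Train on English comments only, then classify Slovene comments.
ENGLISH_TRAINING = [
    ("you idiot", "offensive"), ("stupid idea", "offensive"),
    ("thanks a lot", "ok"), ("great article", "ok"),
]
centroids = train_centroids(ENGLISH_TRAINING)

if __name__ == "__main__":
    print(predict("ti si bedak", centroids))
    print(predict("hvala odlično", centroids))
```

With these toy vectors, the Slovene comment "ti si bedak" is labelled offensive although no Slovene example was used in training. Zero-shot or few-shot transfer of this kind rests on that property of the shared space.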

In news analysis research, we developed methods for interesting news retrieval and topic modelling, and a novel AutoBOT autoML approach for various classification tasks. We competed in several shared tasks, including TREC 2021 background linking (1st place) and SemEval 2022 multilingual news article similarity. We designed a novel document representation learning method based on knowledge graphs and used it for fake news detection. We also developed a cross-lingual news sentiment detection method, and scalable methods for semantic change detection and viewpoint analysis. In addition, we created novel extractive, abstractive and visual summarisation systems.

Finally, we developed multilingual natural language generation (NLG) technology for automated journalism and tested it on EuroStat and Covid-19 datasets in six languages. We developed techniques to dynamically decide the order in which information is presented to the reader in automatically generated news texts, and several methods for headline generation.

We released a number of novel datasets of news articles and comments, together with pretrained models, and we tested selected tools for keyword extraction and comment moderation and integrated them into media partners’ production settings.

The tools are made available through the EMBEDDIA media assistant (EMA) platform that consists of:
- a live online EMBEDDIA demonstrator showcasing keyword extraction, comment filtering and news generation;
- dockerized components for easy installation and use of a selected range of the main tools;
- the EMBEDDIA Tools Explorer giving easy searchable access to all code and dockers;
- the TEXTA toolkit giving interactive user access to data exploration, investigative journalism and classification tools.

We disseminated the project results via the EMBEDDIA webpage (>40,000 unique visitors), Twitter account (@embeddiaproject, >1,800 followers) and Facebook page. The EMBEDDIA media assistant had more than 270 users, and our code repository has more than 90 public items. The results of the project were published in more than 35 journal articles and more than 100 conference papers.

- We trained state-of-the-art contextual embeddings models for EMBEDDIA languages and produced novel evaluation datasets, including the CoSimLex and cross-lingual analogy datasets.
- We adapted popular explanation methods (IME, LIME, SHAP), developed the ExplainViz tool to explain classifications of deep neural networks, and the new AttViz tool for self-attention exploration.
- We developed a novel state-of-the-art supervised keyword extraction system TNT-KID (used in production by Ekspress Meedia), cross-lingual keyword extraction methods and created named entity recognition and linking methods for EMBEDDIA languages.
- We built monolingual comment moderation systems, trained and evaluated on our Croatian and Estonian media partners' datasets, and developed cross-lingual offensive speech detection models with only a very small performance drop compared to monolingual ones. Our comment moderation tools are being used in production by the Croatian media company 24sata.
- We developed methods for background linking of topics with state-of-the-art performance and developed a novel Multilingual Dynamic Topic Model.
- We developed cross-lingual text summarisation systems and tools for producing visual textual summaries.
- We developed a news generation system and demonstrated its utility on EuroStat and COVID-19 data in several languages.
- We developed methods for diachronic news analysis using contextual embeddings.
- We developed cross-lingual methods of sentiment analysis on Twitter and news with performance comparable to monolingual ones.
- The EMA platform provides access to selected tools via dockers, code, demo and the TEXTA toolkit GUI.

The developed technology has had an impact on the research community (e.g. a large number of downloads of our pretrained models) as well as on industry, where EMBEDDIA partners use selected tools in production and other tools are being used by external stakeholders.
