CORDIS - EU research results

High Performance Language Technologies

CORDIS provides links to public deliverables and publications of HORIZON projects.

Links to deliverables and publications from FP7 projects, as well as links to some specific result types such as datasets and software, are dynamically retrieved from OpenAIRE.

Deliverables

Initial release of monolingual and parallel data sets

This deliverable consists of an initial set of textual data acquired from web and non-web sources, in both monolingual and parallel parts, after the cleaning done in WP2.

Software for cleaning data sets

Free and open-source software will be released on GitHub.

First language models trained

Language models will be made available for download; however, they may not be trained on all of the data or on the cleanest data.

Translation models for select language pairs

Models trained using the pipeline, available for download.

Publications

A New Massive Multilingual Dataset for High-Performance Language Technologies

Author(s): de Gibert, Ona; Nail, Graeme; Arefyev, Nikolay; Bañón, Marta; van der Linde, Jelmer; Ji, Shaoxiong; Zaragoza-Bernabeu, Jaume; Aulamo, Mikko; Ramírez-Sánchez, Gema; Kutuzov, Andrey; Pyysalo, Sampo; Oepen, Stephan; Tiedemann, Jörg
Published in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, ISSN 1530-9312
Publisher: ELRA and ICCL
DOI: 10.48550/ARXIV.2403.14009

SpringerPlus

Author(s): Tiedemann J.; Aulamo M.; Bakshandaeva D.; Boggia M.; Grönroos S. A.; Nieminen T.; Raganato A.; Scherrer Y.; Vázquez R.; Virpioja S.
Published in: Springer, 2023, ISSN 2193-1801
Publisher: Springer Science and Business Media Deutschland GmbH
DOI: 10.48550/ARXIV.2212.01936

EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models

Author(s): Shaoxiong Ji; Zihao Li; Indraneil Paul; Jaakko Paavola; Peiqin Lin; Pinzhen Chen; Dayyán O'Brien; Hengyu Luo; Hinrich Schütze; Jörg Tiedemann; Barry Haddow
Published in: CoRR, 2024, ISSN 2331-8422
Publisher: ArXiv
DOI: 10.48550/ARXIV.2409.17892

Can Machine Translation Bridge Multilingual Pretraining and Cross-lingual Transfer Learning?

Author(s): Shaoxiong Ji; Timothee Mickus; Vincent Segonne; Jörg Tiedemann
Published in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, ISSN 1530-9312
Publisher: ELRA and ICCL
DOI: 10.48550/ARXIV.2403.16777

An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)

Author(s): Laurie Burchell; Ona de Gibert; Nikolay Arefyev; Mikko Aulamo; Marta Bañón; Pinzhen Chen; Mariia Fedorova; Liane Guillou; Barry Haddow; Jan Hajič; Jindřich Helcl; Erik Henriksson; Mateusz Klimaszewski; Ville Komulainen; Andrey Kutuzov; Joona Kytöniemi; Veronika Laippala; Petter Mæhlum; Bhavitvya Malik; Farrokh Mehryary; Vladislav Mikhailov; Nikita Moghe; Amanda Myntti; Dayyán O'Brien; Stephan Oepen; Proyag Pal; Jousia Piha; Sampo Pyysalo; Gema Ramírez-Sánchez; David Samuel; Pavel Stepachev; Jörg Tiedemann; Dušan Variš; Tereza Vojtěchová; Jaume Zaragoza-Bernabeu
Published in: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.48550/ARXIV.2503.10267

Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?

Author(s): Pinzhen Chen; Simon Yu; Zhicheng Guo; Barry Haddow
Published in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.48550/ARXIV.2406.12822

Poro 34B and the Blessing of Multilinguality

Author(s): Luukkonen, Risto; Burdge, Jonathan; Zosa, Elaine; Talman, Aarne; Komulainen, Ville; Hatanpää, Väinö; Sarlin, Peter; Pyysalo, Sampo
Published in: CoRR, 2024, ISSN 2331-8422
Publisher: ArXiv
DOI: 10.48550/ARXIV.2404.01856

How Many Languages Make Good Multilingual Instruction Tuning? A Case Study on BLOOM

Author(s): Ji, Shaoxiong; Chen, Pinzhen
Published in: CoRR, 2024, ISSN 2331-8422
Publisher: ArXiv
DOI: 10.48550/ARXIV.2404.04850

GPT or BERT: why not both?

Author(s): Lucas Georges Gabriel Charpentier, David Samuel
Published in: The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning, 2024, ISSN 1530-9312
Publisher: Association for Computational Linguistics

Four Approaches to Low-Resource Multilingual NMT: The Helsinki Submission to the AmericasNLP 2023 Shared Task

Author(s): Ona De Gibert, Raúl Vázquez, Mikko Aulamo, Yves Scherrer, Sami Virpioja, Jörg Tiedemann
Published in: 2023, ISBN 978-1-959429-91-3
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.AMERICASNLP-1.20

CUNI Systems for the WMT22 Czech-Ukrainian Translation Task

Author(s): Popel, Martin; Libovický, Jindřich; Helcl, Jindřich
Published in: 2022, ISBN 978-1-959429-29-6
Publisher: Association for Computational Linguistics
DOI: 10.48550/ARXIV.2212.00486

PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India

Author(s): Ashok Urlana, Pinzhen Chen, Zheng Zhao, Shay Cohen, Manish Shrivastava, Barry Haddow
Published in: 2023, ISBN 979-8-89176-061-5
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.FINDINGS-EMNLP.777

Towards Effective Disambiguation for Machine Translation with Large Language Models

Author(s): Vivek Iyer, Pinzhen Chen, and Alexandra Birch
Published in: 2023, ISBN 979-8-89176-041-7
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.WMT-1.44

DocHPLT: A Massively Multilingual Document-Level Translation Dataset

Author(s): Dayyán O’Brien, Bhavitvya Malik, Ona de Gibert, Pinzhen Chen, Barry Haddow, Jörg Tiedemann
Published in: Proceedings of the Tenth Conference on Machine Translation, 2025
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2025.WMT-1.17

Findings of the 2023 Conference on Machine Translation (WMT23): LLMs Are Here but Not Quite There Yet

Author(s): Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Philipp Koehn, Benjamin Marie, Christof Monz, Makoto Morishita, Kenton Murray, Makoto Nagata, Toshiaki Nakazawa, Martin Popel, Maja Popović, Mariya Shmatova
Published in: Proceedings of the Eighth Conference on Machine Translation, 2023, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.WMT-1.1

FinGPT: Large Generative Models for a Small Language

Author(s): Luukkonen, Risto; Komulainen, Ville; Luoma, Jouni; Eskelinen, Anni; Kanerva, Jenna; Kupari, Hanna-Mari; Ginter, Filip; Laippala, Veronika; Muennighoff, Niklas; Piktus, Aleksandra; Wang, Thomas; Tazi, Nouamane; Scao, Teven Le; Wolf, Thomas; Suominen, Osma; Sairanen, Samuli; Merioksa, Mikko; Heinonen, Jyrki; Vahtola, Aija; Antao, Samuel; Pyysalo, Sampo
Published in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, ISBN 979-8-89176-060-8
Publisher: Association for Computational Linguistics
DOI: 10.48550/arxiv.2311.05640

Abstractive Event Analysis of Armed Conflicts: Introducing the UCDP-AEC Dataset

Author(s): Étienne Simon, Helene Bøsei Olsen, Ramón Carreño, Rahul Mishra, Nikolay Arefyev, Mert Can Yilmaz, Lilja Øvrelid, Erik Velldal
Published in: Proceedings of the 21st Conference on Natural Language Processing (KONVENS 2025): Workshops, 2025
Publisher: HsH Applied Academics

Scaling Low-Resource MT via Synthetic Data Generation with LLMs

Author(s): Ona de Gibert, Joseph Attieh, Teemu Vahtola, Mikko Aulamo, Zihao Li, Raúl Vázquez, Tiancheng Hu, Jörg Tiedemann
Published in: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2025.EMNLP-MAIN.1408

Mean BERTs make erratic language teachers: the effectiveness of latent bootstrapping in low-resource settings

Author(s): David Samuel
Published in: Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, 2023, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.CONLL-BABYLM.19

NorBench – A Benchmark for Norwegian Language Models

Author(s): David Samuel, Andrey Kutuzov, Samia Touileb, Erik Velldal, Lilja Øvrelid, Egil Rønningstad, Elina Sigdel, Anna Palatkina
Published in: Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), 2023
Publisher: University of Tartu Library

Got Compute, but No Data: Lessons From Post-training a Finnish LLM

Author(s): Elaine Zosa, Ville Komulainen, Sampo Pyysalo
Published in: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), 2025
Publisher: University of Tartu Library

Findings of the WMT25 General Machine Translation Shared Task: Time to Stop Evaluating on Easy Test Sets

Author(s): Tom Kocmi, Ekaterina Artemova, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Konstantin Dranch, Anton Dvorkovich, Sergey Dukanov, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Howard Lakougna, Jessica Lundin, Christof Monz, Kenton Murray, Masaaki Nagata, Stefano Perrella, Lorenzo Proietti, Martin Popel, Maja Popović, Parker Riley, Mariya Shmatova, Steinthór Steingrímsson, Lisa Yankovskaya, Vilém Zouhar
Published in: Proceedings of the Tenth Conference on Machine Translation, 2025, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2025.WMT-1.22

Tokenization with Factorized Subword Encoding

Author(s): David Samuel and Lilja Øvrelid
Published in: 2023, ISBN 978-1-959429-62-3
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.FINDINGS-ACL.890

FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering

Author(s): Erik Henriksson, Otto Tarkka, Filip Ginter
Published in: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), 2025
Publisher: University of Tartu Library

Cher at KSAA-CAD 2024: Compressing Words and Definitions into the Same Space for Arabic Reverse Dictionary

Author(s): Pinzhen Chen, Zheng Zhao, Shun Shao
Published in: Proceedings of The Second Arabic Natural Language Processing Conference, 2024, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2024.ARABICNLP-1.75

Towards Interpretable Mental Health Analysis with Large Language Models

Author(s): Yang, Kailai; Ji, Shaoxiong; Zhang, Tianlin; Xie, Qianqian; Kuang, Ziyan; Ananiadou, Sophia
Published in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, ISBN 979-8-89176-060-8
Publisher: Association for Computational Linguistics
DOI: 10.48550/arxiv.2304.03347

Findings of the AmericasNLP 2025 Shared Tasks on Machine Translation, Creation of Educational Material, and Translation Metrics for Indigenous Languages of the Americas

Author(s): Ona De Gibert, Robert Pugh, Ali Marashian, Raul Vazquez, Abteen Ebrahimi, Pavel Denisov, Enora Rice, Edward Gow-Smith, Juan Prieto, Melissa Robles, Rubén Manrique, Oscar Moreno, Angel Lino, Rolando Coto-Solano, Aldo Alvarez, Marvin Agüero-Torales, John E. Ortega, Luis Chiruzzo, Arturo Oncevay, Shruti Rijhwani, Katharina Von Der Wense, Manuel Mager
Published in: Proceedings of the Fifth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP), 2025, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2025.AMERICASNLP-1.16

The OPUS-MT Dashboard – A Toolkit for a Systematic Evaluation of Open Machine Translation Models

Author(s): Jörg Tiedemann and Ona de Gibert
Published in: 2023, ISBN 978-1-959429-70-8
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.ACL-DEMO.30

Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting

Author(s): Bogoychev, Nikolay; Chen, Pinzhen
Published in: 2023, ISBN 979-8-89176-041-7
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.WMT-1.80

Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca

Author(s): Chen, Pinzhen; Ji, Shaoxiong; Bogoychev, Nikolay; Kutuzov, Andrey; Haddow, Barry; Heafield, Kenneth
Published in: EACL, 2023, ISBN 979-8-89176-088-2
Publisher: Association for Computational Linguistics
DOI: 10.48550/arxiv.2309.08958

Fine-Tuning Large Language Models with Sequential Instructions

Author(s): Hanxu Hu, Simon Yu, Pinzhen Chen, Edoardo Ponti
Published in: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2025.NAACL-LONG.288

An Open Dataset and Model for Language Identification

Author(s): Laurie Burchell, Alexandra Birch, Nikolay Bogoychev, Kenneth Heafield
Published in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2023, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.ACL-SHORT.75

Unsupervised Feature Selection for Effective Parallel Corpus Filtering

Author(s): Mikko Aulamo, Ona de Gibert, Sami Virpioja, and Jörg Tiedemann
Published in: Proceedings of the 24th Annual Conference of the European Association for Machine Translation, 2023, ISBN 978-952-03-2947-1
Publisher: European Association for Machine Translation

Exploring Data Augmentation for Code Generation Tasks

Author(s): Pinzhen Chen, Gerasimos Lampouras
Published in: 2023, ISBN 978-1-959429-47-0
Publisher: Association for Computational Linguistics

Scaling Data-Constrained Language Models

Author(s): Muennighoff, Niklas; Rush, Alexander M.; Barak, Boaz; Scao, Teven Le; Piktus, Aleksandra; Tazi, Nouamane; Pyysalo, Sampo; Wolf, Thomas; Raffel, Colin
Published in: 2023, ISSN 2331-8422
Publisher: NeurIPS'23
DOI: 10.48550/arxiv.2305.16264

CUNI Non-Autoregressive System for the WMT 22 Efficient Translation Shared Task

Author(s): Helcl, Jindřich
Published in: 2022, ISBN 978-1-959429-29-6
Publisher: Association for Computational Linguistics
DOI: 10.48550/ARXIV.2212.00477

Code-Switched Language Identification is Harder Than You Think

Author(s): Burchell, Laurie; Birch, Alexandra; Thompson, Robert; Heafield, Kenneth
Published in: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, ISSN 1530-9312
Publisher: Association for Computational Linguistics

CUNI Non-Autoregressive System for the WMT 22 Efficient Translation Shared Task

Author(s): Jindřich Helcl
Published in: Proceedings of the Seventh Conference on Machine Translation (WMT), 2022, ISSN 1530-9312
Publisher: Association for Computational Linguistics

HPLT’s First Release of Data and Models

Author(s): Ramírez-Sánchez, Gema; Chen, Pinzhen; Helcl, Jindřich; Zaragoza-Bernabeu, Jaume; Malik, Bhavitvya; De Gibert Bonet, Ona; Stepachev, Pavel; Variš, Dušan; Haddow, Barry; Arefyev, Nikolay; Tiedemann, Jörg
Published in: 2024, ISSN 1530-9312
Publisher: European Association for Machine Translation (EAMT)

OpusDistillery: A Configurable End-to-End Pipeline for Systematic Multilingual Distillation of Open NMT Models

Author(s): Ona de Gibert, Tommi Nieminen, Yves Scherrer, Jörg Tiedemann
Published in: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), 2025
Publisher: University of Tartu Library

HPLT’s Second Data Release

Author(s): Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Laurie Burchell, Pinzhen Chen, Mariia Fedorova, Ona de Gibert, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Andrey Kutuzov, Veronika Laippala, Bhavitvya Malik, Farrokh Mehryary, Vladi
Published in: Proceedings of Machine Translation Summit XX: Volume 2, 2025
Publisher: European Association for Machine Translation

Mind the Gap: Diverse NMT Models for Resource-Constrained Environments

Author(s): Ona de Gibert, Dayyán O’Brien, Dušan Variš, Jörg Tiedemann
Published in: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), 2025
Publisher: University of Tartu Library

Cheating to Identify Hard Problems for Neural Machine Translation

Author(s): Proyag Pal, Kenneth Heafield
Published in: 2023, ISBN 978-1-959429-47-0
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.FINDINGS-EACL.120

The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics

Author(s): Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, Alexandra Birch
Published in: Proceedings of the Fifth Workshop on Insights from Negative Results in NLP, 2024, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2024.INSIGHTS-1.17

Large Language Model Inference with Lexical Shortlisting

Author(s): Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, Alexandra Birch
Published in: AAAI Workshop on Deployable AI, 2024, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/ARXIV.2311.09709

Not all layers are equally as important: Every Layer Counts BERT

Author(s): Lucas Georges Gabriel Charpentier, David Samuel
Published in: Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, 2023, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.CONLL-BABYLM.20

HPLT High-Performance Language Technology: Building LLMs and TMs in European languages

Author(s): Hajič, Jan
Published in: 2023
Publisher: Oral presentation at Skeikampen, Norway

Iterative Translation Refinement with Large Language Models

Author(s): Chen, Pinzhen; Guo, Zhicheng; Haddow, Barry; Heafield, Kenneth
Published in: 2023, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/ARXIV.2306.03856

EEE-QA: Exploring effective and efficient question-answer representations

Author(s): Zhanghao Hu, Yijun Yang, Junjie Xu, Yifu Qiu, Pinzhen Chen
Published in: 2024, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/ARXIV.2403.02176

Velké jazykové modely: Co znamená velké a co jazykové?

Author(s): Libovický, Jindřich
Published in: 2023
Publisher: Talk at FI MUNI, Brno, Czechia

OpusCleaner and OpusTrainer, open source toolkits for training Machine Translation and Large language models

Author(s): Nikolay Bogoychev, Jelmer van der Linde, Graeme Nail, Barry Haddow, Jaume Zaragoza-Bernabeu, Gema Ramírez-Sánchez, Lukas Weymann, Tudor Nicolae Mateiu, Jindřich Helcl, Mikko Aulamo
Published in: 2023, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/ARXIV.2311.14838
