High Performance Language Technologies

Deliverables

Initial release of monolingual and parallel data sets

This deliverable consists of the initial set of textual data acquired from web and non-web sources, in both monolingual and parallel parts, after the cleaning done in WP2.

HPLT resource catalogue

Final catalogue of the datasets used and the models produced by the project.

Final release of monolingual and parallel data sets

This deliverable consists of the final set of textual data acquired from web and non-web sources, in both monolingual and parallel parts, after the cleaning done in WP2.

Clean and filtered data sets augmented with metadata

Cleaned and filtered data sets, along with their metadata, will be released on OPUS. The deliverable includes a catalogue of the released data sets and a description of the metadata augmentation.

Software for cleaning data sets

Free and open-source software will be released on GitHub.

First language models trained

Language models will be made available for download; however, they may not have been trained on all of the data or on the cleanest data.

Translation models for select language pairs

Models trained using the pipeline, available for download.

HPLT pipelines and tools

Software for processing the data as done within the project.

Dashboard report

Report on the back end and front end of the dashboard. Dashboards and leaderboards are published.

Report on evaluation of trained models

Report on the evaluation of the language models as produced in WP5.

Report on language model evaluation

Report on evaluation of language models as trained in the project.

Publications

MaLA-500: Massive Language Adaptation of Large Language Models

Authors: Peiqin Lin; Shaoxiong Ji; Jörg Tiedemann; André F. T. Martins; Hinrich Schütze
Published in: CoRR, 2024, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/ARXIV.2401.13303

A New Massive Multilingual Dataset for High-Performance Language Technologies

Authors: de Gibert, Ona; Nail, Graeme; Arefyev, Nikolay; Bañón, Marta; van der Linde, Jelmer; Ji, Shaoxiong; Zaragoza-Bernabeu, Jaume; Aulamo, Mikko; Ramírez-Sánchez, Gema; Kutuzov, Andrey; Pyysalo, Sampo; Oepen, Stephan; Tiedemann, Jörg
Published in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, ISSN 1530-9312
Publisher: ELRA and ICCL
DOI: 10.48550/ARXIV.2403.14009

Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource Translation

Authors: Iyer, Vivek; Malik, Bhavitvya; Stepachev, Pavel; Chen, Pinzhen; Haddow, Barry; Birch, Alexandra
Published in: CoRR, 2024, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/arXiv.2408.12780

Deep-change at AXOLOTL-24: Orchestrating WSD and WSI Models for Semantic Change Modeling

Authors: Denis Kokosinskii; Mikhail Kuklin; Nikolay Arefyev
Published in: Proceedings of the 5th Workshop on Computational Approaches to Historical Language Change, 2024, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/ARXIV.2408.05184

Democratizing Neural Machine Translation with OPUS-MT

Authors: Tiedemann J.; Aulamo M.; Bakshandaeva D.; Boggia M.; Grönroos S. A.; Nieminen T.; Raganato A.; Scherrer Y.; Vázquez R.; Virpioja S.
Published in: Springer, 2023, ISSN 2193-1801
Publisher: Springer Science and Business Media Deutschland GmbH
DOI: 10.48550/ARXIV.2212.01936

Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation

Authors: Amanda Myntti, Erik Henriksson, Veronika Laippala, Sampo Pyysalo
Published in: OpenReview, 2025, ISSN 2326-5507
Publisher: OpenReview
DOI: 10.48550/arXiv.2504.01542

GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models

Authors: Luo, Hengyu; Li, Zihao; Attieh, Joseph; Devkota, Sawal; de Gibert, Ona; Huang, Xu; Ji, Shaoxiong; Lin, Peiqin; Mantina, Bhavani Sai Praneeth Varma; Sreenidhi, Ananda; Vázquez, Raúl; Wang, Mengjie; Yusofi, Samea; Yuan, Fei; Tiedemann, Jörg
Published in: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2025, ISSN 0736-587X
Publisher: Association for Computational Linguistics
DOI: 10.48550/ARXIV.2504.04155

Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning

Authors: Wenhao Zhu; Pinzhen Chen; Hanxu Hu; Shujian Huang; Fei Yuan; Jiajun Chen; Alexandra Birch
Published in: CoRR, 2025, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/arXiv.2502.15592

FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

Authors: Joona Kytöniemi; Jousia Piha; Akseli Reunamo; Fedor Vitiugin; Farrokh Mehryary; Sampo Pyysalo
Published in: CoRR, 2025, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/arXiv.2512.13330

CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset

Authors: Jindřich Libovický; Jindřich Helcl; Andrei Manea; Gianluca Vico
Published in: CoRR, 2025, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/arXiv.2507.22752

Pitfalls and Outlooks in Using COMET

Authors: Zouhar, Vilém; Chen, Pinzhen; Lam, Tsz Kin; Moghe, Nikita; Haddow, Barry
Published in: CoRR, 2024, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/arXiv.2408.15366

Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages

Authors: David Samuel; Lilja Øvrelid; Erik Velldal; Andrey Kutuzov
Published in: CoRR, 2025, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/arXiv.2512.08777

EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models

Authors: Shaoxiong Ji; Zihao Li; Indraneil Paul; Jaakko Paavola; Peiqin Lin; Pinzhen Chen; Dayyán O'Brien; Hengyu Luo; Hinrich Schütze; Jörg Tiedemann; Barry Haddow
Published in: CoRR, 2024, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/ARXIV.2409.17892

Can Machine Translation Bridge Multilingual Pretraining and Cross-lingual Transfer Learning?

Authors: Shaoxiong Ji; Timothee Mickus; Vincent Segonne; Jörg Tiedemann
Published in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, ISSN 1530-9312
Publisher: ELRA and ICCL
DOI: 10.48550/ARXIV.2403.16777

Evaluating optimal reference translations

Authors: Vilém Zouhar; Věra Kloudová; Martin Popel; Ondřej Bojar
Published in: Natural Language Processing, 2024, ISSN 2977-0424
Publisher: Cambridge University Press
DOI: 10.48550/ARXIV.2311.16787

Enriching Word Usage Graphs with Cluster Definitions

Authors: Andrey Kutuzov; Mariia Fedorova; Dominik Schlechtweg; Nikolay Arefyev
Published in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, ISSN 1530-9312
Publisher: ELRA and ICCL
DOI: 10.48550/ARXIV.2403.18024

The LSCD Benchmark: a Testbed for Diachronic Word Meaning Tasks

Authors: Dominik Schlechtweg; Shafqat Mumtaz Virk; Nikolay Arefyev
Published in: CoRR, 2024, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/ARXIV.2404.00176

Multilingual Substitution-based Word Sense Induction

Authors: Denis Kokosinskii; Nikolay Arefyev
Published in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, ISSN 1530-9312
Publisher: ELRA and ICCL
DOI: 10.48550/ARXIV.2405.11086

A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives

Authors: Zihao Li; Shaoxiong Ji; Timothee Mickus; Vincent Segonne; Jörg Tiedemann
Published in: CoRR, 2024, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/arXiv.2407.15489

An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)

Authors: Laurie Burchell; Ona de Gibert; Nikolay Arefyev; Mikko Aulamo; Marta Bañón; Pinzhen Chen; Mariia Fedorova; Liane Guillou; Barry Haddow; Jan Hajic; Jindrich Helcl; Erik Henriksson; Mateusz Klimaszewski; Ville Komulainen; Andrey Kutuzov; Joona Kytöniemi; Veronika Laippala; Petter Mæhlum; Bhavitvya Malik; Farrokh Mehryary; Vladislav Mikhailov; Nikita Moghe; Amanda Myntti; Dayyán O'Brien; Stephan Oepen; Proyag Pal; Jousia Piha; Sampo Pyysalo; Gema Ramírez-Sánchez; David Samuel; Pavel Stepachev; Jörg Tiedemann; Dusan Varis; Tereza Vojtechová; Jaume Zaragoza-Bernabeu
Published in: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.48550/ARXIV.2503.10267

Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?

Authors: Pinzhen Chen; Simon Yu; Zhicheng Guo; Barry Haddow
Published in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.48550/ARXIV.2406.12822

MatheMagic: Generating Dynamic Mathematics Benchmarks Robust to Memorization

Authors: Dayyán O'Brien, Barry Haddow, Emily Allaway, Pinzhen Chen
Published in: CoRR, 2025, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/arXiv.2510.05962

Scaling Data-Constrained Language Models

Authors: Niklas Muennighoff; Alexander M. Rush; Boaz Barak; Teven Le Scao; Aleksandra Piktus; Nouamane Tazi; Sampo Pyysalo; Thomas Wolf; Colin Raffel
Published in: Advances in Neural Information Processing Systems 36, 2023, ISSN 1049-5258
Publisher: Neural Information Processing Systems Foundation, Inc. (NeurIPS)
DOI: 10.48550/ARXIV.2305.16264

AXOLOTL’24 Shared Task on Multilingual Explainable Semantic Change Modeling

Authors: Mariia Fedorova; Timothee Mickus; Niko Partanen; Janine Siewert; Elena Spaziani; Andrey Kutuzov
Published in: Proceedings of the 5th Workshop on Computational Approaches to Historical Language Change, 2024, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/ARXIV.2407.04079

Democratizing Neural Machine Translation with OPUS-MT

Authors: Jörg Tiedemann; Mikko Aulamo; Daria Bakshandaeva; Michele Boggia; Stig-Arne Grönroos; Tommi Nieminen; Alessandro Raganato; Yves Scherrer; Raúl Vázquez; Sami Virpioja
Published in: Language Resources and Evaluation, 2023, ISSN 1574-020X
Publisher: Springer
DOI: 10.48550/ARXIV.2212.01936

Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca

Authors: Chen, Pinzhen; Ji, Shaoxiong; Bogoychev, Nikolay; Kutuzov, Andrey; Haddow, Barry; Heafield, Kenneth
Published in: Findings of the Association for Computational Linguistics: EACL 2024, 2024, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/ARXIV.2309.08958

How Many Languages Make Good Multilingual Instruction Tuning? A Case Study on BLOOM

Authors: Ji, Shaoxiong; Chen, Pinzhen
Published in: CoRR, 2024, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/ARXIV.2404.04850

Fine-Tuning Large Language Models to Translate: Will a Touch of Noisy Data in Misaligned Languages Suffice?

Authors: Zhu, Dawei; Chen, Pinzhen; Zhang, Miaoran; Haddow, Barry; Shen, Xiaoyu; Klakow, Dietrich
Published in: CoRR, 2024, ISSN 2331-8422
Publisher: arXiv
DOI: 10.18653/v1/2024.emnlp-main.24

FastSpell: The LangId Magic Spell

Authors: Marta Bañón; Jaume Zaragoza-Bernabeu; Gema Ramírez-Sánchez; Sergio Ortiz-Rojas
Published in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, ISSN 1530-9312
Publisher: ELRA and ICCL
DOI: 10.48550/ARXIV.2404.08345

GPT or BERT: why not both?

Authors: Lucas Georges Gabriel Charpentier, David Samuel
Published in: The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning, 2024, ISSN 1530-9312
Publisher: Association for Computational Linguistics

Four Approaches to Low-Resource Multilingual NMT: The Helsinki Submission to the AmericasNLP 2023 Shared Task

Authors: Ona De Gibert, Raúl Vázquez, Mikko Aulamo, Yves Scherrer, Sami Virpioja, Jörg Tiedemann
Published in: 2023, ISBN 978-1-959429-91-3
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.AMERICASNLP-1.20

CUNI Systems for the WMT22 Czech-Ukrainian Translation Task

Authors: Popel, Martin; Libovický, Jindřich; Helcl, Jindřich
Published in: 2022, ISBN 978-1-959429-29-6
Publisher: Association for Computational Linguistics
DOI: 10.48550/ARXIV.2212.00486

PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India

Authors: Ashok Urlana, Pinzhen Chen, Zheng Zhao, Shay Cohen, Manish Shrivastava, Barry Haddow
Published in: 2023, ISBN 979-8-89176-061-5
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.FINDINGS-EMNLP.777

Towards Effective Disambiguation for Machine Translation with Large Language Models

Authors: Vivek Iyer, Pinzhen Chen, Alexandra Birch
Published in: 2023, ISBN 979-8-89176-041-7
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.WMT-1.44

DocHPLT: A Massively Multilingual Document-Level Translation Dataset

Authors: Dayyán O’Brien, Bhavitvya Malik, Ona de Gibert, Pinzhen Chen, Barry Haddow, Jörg Tiedemann
Published in: Proceedings of the Tenth Conference on Machine Translation, 2025
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2025.WMT-1.17

Findings of the 2023 Conference on Machine Translation (WMT23): LLMs Are Here but Not Quite There Yet

Authors: Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Philipp Koehn, Benjamin Marie, Christof Monz, Makoto Morishita, Kenton Murray, Makoto Nagata, Toshiaki Nakazawa, Martin Popel, Maja Popović, Mariya Shmatova
Published in: Proceedings of the Eighth Conference on Machine Translation, 2023, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.WMT-1.1

FinGPT: Large Generative Models for a Small Language

Authors: Luukkonen, Risto; Komulainen, Ville; Luoma, Jouni; Eskelinen, Anni; Kanerva, Jenna; Kupari, Hanna-Mari; Ginter, Filip; Laippala, Veronika; Muennighoff, Niklas; Piktus, Aleksandra; Wang, Thomas; Tazi, Nouamane; Scao, Teven Le; Wolf, Thomas; Suominen, Osma; Sairanen, Samuli; Merioksa, Mikko; Heinonen, Jyrki; Vahtola, Aija; Antao, Samuel; Pyysalo, Sampo
Published in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, ISBN 979-8-89176-060-8
Publisher: Association for Computational Linguistics
DOI: 10.48550/arxiv.2311.05640

Abstractive Event Analysis of Armed Conflicts: Introducing the UCDP-AEC Dataset

Authors: Étienne Simon, Helene Bøsei Olsen, Ramón Carreño, Rahul Mishra, Nikolay Arefyev, Mert Can Yilmaz, Lilja Øvrelid, Erik Velldal
Published in: Proceedings of the 21st Conference on Natural Language Processing (KONVENS 2025): Workshops, 2025
Publisher: HsH Applied Academics

Scaling Low-Resource MT via Synthetic Data Generation with LLMs

Authors: Ona de Gibert, Joseph Attieh, Teemu Vahtola, Mikko Aulamo, Zihao Li, Raúl Vázquez, Tiancheng Hu, Jörg Tiedemann
Published in: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2025.EMNLP-MAIN.1408

Mean BERTs make erratic language teachers: the effectiveness of latent bootstrapping in low-resource settings

Authors: David Samuel
Published in: Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, 2023, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.CONLL-BABYLM.19

NorBench – A Benchmark for Norwegian Language Models

Authors: David Samuel, Andrey Kutuzov, Samia Touileb, Erik Velldal, Lilja Øvrelid, Egil Rønningstad, Elina Sigdel, Anna Palatkina
Published in: Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), 2023
Publisher: University of Tartu Library

Got Compute, but No Data: Lessons From Post-training a Finnish LLM

Authors: Elaine Zosa, Ville Komulainen, Sampo Pyysalo
Published in: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), 2025
Publisher: University of Tartu Library

Findings of the WMT25 General Machine Translation Shared Task: Time to Stop Evaluating on Easy Test Sets

Authors: Tom Kocmi, Ekaterina Artemova, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Konstantin Dranch, Anton Dvorkovich, Sergey Dukanov, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Howard Lakougna, Jessica Lundin, Christof Monz, Kenton Murray, Masaaki Nagata, Stefano Perrella, Lorenzo Proietti, Martin Popel, Maja Popović, Parker Riley, Mariya Shmatova, Steinthór Steingrímsson, Lisa Yankovskaya, Vilém Zouhar
Published in: Proceedings of the Tenth Conference on Machine Translation, 2025, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2025.WMT-1.22

Tokenization with Factorized Subword Encoding

Authors: David Samuel; Lilja Øvrelid
Published in: 2023, ISBN 978-1-959429-62-3
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.FINDINGS-ACL.890

FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering

Authors: Erik Henriksson, Otto Tarkka, Filip Ginter
Published in: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), 2025
Publisher: University of Tartu Library

Cher at KSAA-CAD 2024: Compressing Words and Definitions into the Same Space for Arabic Reverse Dictionary

Authors: Pinzhen Chen, Zheng Zhao, Shun Shao
Published in: Proceedings of The Second Arabic Natural Language Processing Conference, 2024, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2024.ARABICNLP-1.75

Towards Interpretable Mental Health Analysis with Large Language Models

Authors: Yang, Kailai; Ji, Shaoxiong; Zhang, Tianlin; Xie, Qianqian; Kuang, Ziyan; Ananiadou, Sophia
Published in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, ISBN 979-8-89176-060-8
Publisher: Association for Computational Linguistics
DOI: 10.48550/arxiv.2304.03347

Findings of the AmericasNLP 2025 Shared Tasks on Machine Translation, Creation of Educational Material, and Translation Metrics for Indigenous Languages of the Americas

Authors: Ona De Gibert, Robert Pugh, Ali Marashian, Raul Vazquez, Abteen Ebrahimi, Pavel Denisov, Enora Rice, Edward Gow-Smith, Juan Prieto, Melissa Robles, Rubén Manrique, Oscar Moreno, Angel Lino, Rolando Coto-Solano, Aldo Alvarez, Marvin Agüero-Torales, John E. Ortega, Luis Chiruzzo, Arturo Oncevay, Shruti Rijhwani, Katharina Von Der Wense, Manuel Mager
Published in: Proceedings of the Fifth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP), 2025, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2025.AMERICASNLP-1.16

The OPUS-MT Dashboard – A Toolkit for a Systematic Evaluation of Open Machine Translation Models

Authors: Jörg Tiedemann; Ona de Gibert
Published in: 2023, ISBN 978-1-959429-70-8
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.ACL-DEMO.30

Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting

Authors: Bogoychev, Nikolay; Chen, Pinzhen
Published in: 2023, ISBN 979-8-89176-041-7
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.WMT-1.80

Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca

Authors: Chen, Pinzhen; Ji, Shaoxiong; Bogoychev, Nikolay; Kutuzov, Andrey; Haddow, Barry; Heafield, Kenneth
Published in: EACL, 2023, ISBN 979-8-89176-088-2
Publisher: Association for Computational Linguistics
DOI: 10.48550/arxiv.2309.08958

Fine-Tuning Large Language Models with Sequential Instructions

Authors: Hanxu Hu, Simon Yu, Pinzhen Chen, Edoardo Ponti
Published in: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2025.NAACL-LONG.288

HPLT 3.0: Very Large-Scale Multilingual Resources for LLMs and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models

Authors: Stephan Oepen; Nikolay Arefev; Mikko Aulamo; Marta Bañón; Maja Buljan; Laurie Burchell; Lucas Charpentier; Pinzhen Chen; Mariya Fedorova; Ona de Gibert; Barry Haddow; Jan Hajič; Jindřich Helcl; Andrey Kutuzov; Ver
Published in: Proceedings of the Fifteenth Language Resources and Evaluation Conference, 2026, ISSN 2522-2686
Publisher: International Conference on Language Resources and Evaluation

An Open Dataset and Model for Language Identification

Authors: Laurie Burchell, Alexandra Birch, Nikolay Bogoychev, Kenneth Heafield
Published in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2023, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.ACL-SHORT.75

Unsupervised Feature Selection for Effective Parallel Corpus Filtering

Authors: Mikko Aulamo, Ona de Gibert, Sami Virpioja, Jörg Tiedemann
Published in: Proceedings of the 24th Annual Conference of the European Association for Machine Translation, 2023, ISBN 978-952-03-2947-1
Publisher: European Association for Machine Translation

Exploring Data Augmentation for Code Generation Tasks

Authors: Pinzhen Chen, Gerasimos Lampouras
Published in: 2023, ISBN 978-1-959429-47-0
Publisher: Association for Computational Linguistics

Scaling Data-Constrained Language Models

Authors: Muennighoff, Niklas; Rush, Alexander M.; Barak, Boaz; Scao, Teven Le; Piktus, Aleksandra; Tazi, Nouamane; Pyysalo, Sampo; Wolf, Thomas; Raffel, Colin
Published in: 2023, ISSN 2331-8422
Publisher: NeurIPS'23
DOI: 10.48550/arxiv.2305.16264

CUNI Non-Autoregressive System for the WMT 22 Efficient Translation Shared Task

Authors: Helcl, Jindřich
Published in: 2022, ISBN 978-1-959429-29-6
Publisher: Association for Computational Linguistics
DOI: 10.48550/ARXIV.2212.00477

Code-Switched Language Identification is Harder Than You Think

Authors: Burchell, Laurie; Birch, Alexandra; Thompson, Robert; Heafield, Kenneth
Published in: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, ISSN 1530-9312
Publisher: Association for Computational Linguistics

Poro 34B and the Blessing of Multilinguality

Authors: Luukkonen, Risto; Burdge, Jonathan; Zosa, Elaine; Talman, Aarne; Komulainen, Ville; Hatanpää, Väinö; Sarlin, Peter; Pyysalo, Sampo
Published in: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies, 2025, ISSN 1736-6305
Publisher: University of Tartu Library
DOI: 10.48550/ARXIV.2404.01856

CUNI Non-Autoregressive System for the WMT 22 Efficient Translation Shared Task

Authors: Jindřich Helcl
Published in: Proceedings of the Seventh Conference on Machine Translation (WMT), 2022, ISSN 1530-9312
Publisher: Association for Computational Linguistics

HPLT’s First Release of Data and Models

Authors: Ramírez-Sánchez, Gema; Chen, Pinzhen; Helcl, Jindřich; Zaragoza-Bernabeu, Jaume; Malik, Bhavitvya; De Gibert Bonet, Ona; Stepachev, Pavel; Variš, Dušan; Haddow, Barry; Arefyev, Nikolay; Tiedemann, Jörg
Published in: Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2), 2024, ISSN 1530-9312
Publisher: European Association for Machine Translation (EAMT)

OpusDistillery: A Configurable End-to-End Pipeline for Systematic Multilingual Distillation of Open NMT Models

Authors: Ona de Gibert, Tommi Nieminen, Yves Scherrer, Jörg Tiedemann
Published in: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), 2025
Publisher: University of Tartu Library

HPLT’s Second Data Release

Authors: Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Laurie Burchell, Pinzhen Chen, Mariia Fedorova, Ona de Gibert, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Andrey Kutuzov, Veronika Laippala, Bhavitvya Malik, Farrokh Mehryary, Vladi
Published in: Proceedings of Machine Translation Summit XX: Volume 2, 2025
Publisher: European Association for Machine Translation

Mind the Gap: Diverse NMT Models for Resource-Constrained Environments

Authors: Ona de Gibert, Dayyán O’Brien, Dušan Variš, Jörg Tiedemann
Published in: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), 2025
Publisher: University of Tartu Library

Cheating to Identify Hard Problems for Neural Machine Translation

Authors: Proyag Pal, Kenneth Heafield
Published in: 2023, ISBN 978-1-959429-47-0
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.FINDINGS-EACL.120

The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics

Authors: Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, Alexandra Birch
Published in: Proceedings of the Fifth Workshop on Insights from Negative Results in NLP, 2024, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2024.INSIGHTS-1.17

Large Language Model Inference with Lexical Shortlisting

Authors: Nikolay Bogoychev; Pinzhen Chen; Barry Haddow; Alexandra Birch
Published in: AAAI Workshop on Deployable AI, 2024, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/ARXIV.2311.09709

Not all layers are equally as important: Every Layer Counts BERT

Authors: Lucas Georges Gabriel Charpentier, David Samuel
Published in: Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, 2023, ISSN 1530-9312
Publisher: Association for Computational Linguistics
DOI: 10.18653/V1/2023.CONLL-BABYLM.20

HPLT High-Performance Language Technology: Building LLMs and TMs in European languages

Authors: Hajič, Jan
Published in: 2023
Publisher: Oral presentation at Skeikampen, Norway

Iterative Translation Refinement with Large Language Models

Authors: Chen, Pinzhen; Guo, Zhicheng; Haddow, Barry; Heafield, Kenneth
Published in: 2023, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/ARXIV.2306.03856

EEE-QA: Exploring effective and efficient question-answer representations

Authors: Zhanghao Hu; Yijun Yang; Junjie Xu; Yifu Qiu; Pinzhen Chen
Published in: 2024, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/ARXIV.2403.02176

Velké jazykové modely: Co znamená velké a co jazykové?

Authors: Libovický, Jindřich
Published in: 2023
Publisher: Talk at FI MUNI, Brno, Czechia

OpusCleaner and OpusTrainer, open source toolkits for training Machine Translation and Large language models

Authors: Nikolay Bogoychev; Jelmer van der Linde; Graeme Nail; Barry Haddow; Jaume Zaragoza-Bernabeu; Gema Ramírez-Sánchez; Lukas Weymann; Tudor Nicolae Mateiu; Jindřich Helcl; Mikko Aulamo
Published in: 2023, ISSN 2331-8422
Publisher: arXiv
DOI: 10.48550/ARXIV.2311.14838
