High Performance Language Technologies

Project Information

HPLT

Grant agreement ID: 101070350

DOI

10.3030/101070350

Project closed

EC signature date 13 June 2022

Start date 1 September 2022

End date 31 December 2025

Funded under

Digital, Industry and Space

Total cost

€ 4 058 287,50

EU contribution

€ 3 880 687,50

Coordinated by

UNIVERZITA KARLOVA
Czechia

Periodic Reporting for period 1 - HPLT (High Performance Language Technologies)

Reporting period: 2022-09-01 to 2024-02-29

The EU-funded HPLT project applies high-performance computing (HPC) to scale and advance language technologies. Taking advantage of recent advances in machine learning and of the large storage capacities now available, it creates and processes very large language datasets and produces language and translation models for numerous languages. The resulting models will be tested from various angles to ensure smooth integration, high accuracy, and regulatory compliance concerning privacy, unwanted biases and ethical issues. The models and datasets are expected to be a game changer in the language service market in the EU and beyond. The resulting models will be open, free and available from established language repositories to anyone interested in pursuing research or innovation projects. The project, coordinated by Charles University in Prague (CUNI), gathers partners from 5 universities, 2 HPC centers and a private NLP company from around Europe. So far, we have collected about 1.85 PB of data from the Internet Archive and Common Crawl, then processed, cleaned and released them. We have also built encoder-only language models for 75 languages and a few generative, decoder-only models (for Finnish, Norwegian and a few other languages). In addition, translation models have been trained and released for 18 low-resource languages paired with English. Almost 10 million GPU hours and 10 million CPU hours have been secured and used on various HPC facilities throughout Europe. All data and models have been released through the HPLT website and the HuggingFace platform.

A collection of monolingual and bilingual corpora has been released as HPLT Datasets version 1.0 (https://hplt-project.org/datasets/v1), quickly followed by v1.1 and by v1.2, the current cleaned version. The corpora were compiled from the web crawls provided by the Internet Archive and Common Crawl projects, totalling nearly 1.85 petabytes (PB) of crawls. The collection process and more detailed corpus statistics are available in Deliverable D2.1. The latest release fixed a bug in the deduplication pipeline, added further cleaning rules to the monolingual data (including the UT1 blacklist for removing adult-content sites), and applied anonymization to the bilingual datasets. The datasets are distributed via the HPLT website and via the HuggingFace Hub.
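As an illustration of the kind of cleaning described above, the following is a minimal sketch of domain-blocklist filtering followed by exact-duplicate removal. It is not the HPLT pipeline: the function names, the `(url, text)` document shape, and the use of a SHA-256 hash over whitespace-normalized text are all assumptions made for this example, and a production pipeline may well use near-duplicate detection instead.

```python
import hashlib
from urllib.parse import urlsplit


def is_blocked(url, blocked_domains):
    """Return True if the URL's host, or any parent domain, is blocklisted.

    `blocked_domains` is a hypothetical set of lowercase domain names,
    e.g. loaded from a blocklist such as UT1.
    """
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    # Check "a.b.c", then "b.c", then "c".
    return any(".".join(parts[i:]) in blocked_domains for i in range(len(parts)))


def clean(docs, blocked_domains):
    """Drop blocklisted documents, then exact duplicates (illustrative only).

    `docs` is an iterable of (url, text) pairs. Two texts count as duplicates
    if they are identical after whitespace normalization.
    """
    seen = set()
    kept = []
    for url, text in docs:
        if is_blocked(url, blocked_domains):
            continue
        normalized = " ".join(text.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append((url, text))
    return kept
```

Matching on parent domains means that a blocklist entry such as `example.com` also covers `adult.example.com`; whether a real blocklist should be applied this way depends on how its entries are defined.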

We obtained compute and storage allocations on multiple HPC systems, including the EuroHPC pre-exascale LUMI supercomputer, the petascale system Karolina and the Sigma2 NIRD service platform. In total, we were allocated more than 10 million GPU-hours and 10 million CPU-hours of compute capacity, which were used first for data cleaning and later for training the initial large language models (LLMs) and neural machine translation (NMT) models.

We trained and publicly released the initial versions of the LLMs and NMT models (https://hplt-project.org/models/llm). The released models cover a similar range of languages and language pairs as the HPLT dataset releases. In WP4, we created an initial series of monolingual encoder-only (BERT-like) models covering 75 languages. We have additionally completed work on several decoder-only (GPT-like) models, including the FinGPT family of large generative Finnish models, the NORA.LLM Norwegian models, and the 34B-parameter Poro model trained on Finnish, English, and code. Building on the experience and technology developed in these efforts, we are currently training two families of generative models: one for the Nordic languages, English, and code (7B-, 13B-, and 33B-parameter models), and one for all official EU languages (up to 71B parameters). These models, their training data and their training process are detailed in the publications cited above as well as in Deliverable D4.1.

We have also trained a first batch of machine translation models using a combination of previously available data and new parallel data created by WP2 and WP3. The resulting models cover 14 language pairs and were trained using the OpusPocus pipeline manager. The data-cleaning and pipeline configurations, together with the datasets used for training these models, are available in our MT model repository (https://github.com/hplt-project/HPLT-MT-Models). A more detailed description of the initial training pipeline and models is available in Deliverable D5.1.

The models are still being evaluated on standard test suites, but the uniformly trained BERT models in 75 languages have already been preliminarily evaluated (see the D4.1 deliverable report) and found to be on par with or to outperform the existing alternatives, while covering more languages. These models are unique and easily accessible for use by academic and industrial institutions alike.
To reach parity with the largest models available globally, additional effort will be needed throughout the rest of the project and in follow-up projects under the Digital Europe AI-06 calls. In addition, for the results to be truly open source (currently the models are, but the source data only in limited form), legislative changes are necessary throughout the EU.
