A collection of monolingual and bilingual corpora has been released as HPLT Datasets version 1.0 (https://hplt-project.org/datasets/v1), quickly followed by v1.1 and v1.2, the current cleaned version. The corpora were compiled from web crawls provided by the Internet Archive and Common Crawl projects, totalling nearly 1.85 petabytes (PB) of crawl data. The collection process and more detailed corpus statistics are described in Deliverable D2.1. This latest release fixed a bug in the deduplication pipeline, added cleaning rules for the monolingual data, including the UT1 blacklist for removing adult content sites, and applied anonymization to the bilingual datasets. The datasets are distributed via the HPLT website and the Hugging Face Hub.
We obtained compute and storage allocations on multiple HPC systems, including the EuroHPC pre-exascale supercomputer LUMI, the petascale system Karolina, and the Sigma2 NIRD service platform. In total, we were allocated more than 10 million GPU-hours and 10 million CPU-hours of compute capacity, which were used first for data cleaning and later for training the initial large language models (LLMs) and neural machine translation (NMT) models.
We trained and publicly released the initial versions of the LLMs and NMT models (https://hplt-project.org/models/llm). The released models cover a similar range of languages and language pairs to the HPLT dataset releases. In WP4, we created an initial series of monolingual encoder-only (BERT-like) models covering 75 languages. We have additionally completed work on several decoder-only (GPT-like) models, including the FinGPT family of large generative Finnish models, the NORA.LLM Norwegian models, and the 34B-parameter Poro model trained on Finnish, English, and code. Building on the experience and technology developed in these efforts, we are currently training two families of generative models: one for the Nordic languages, English, and code (7B, 13B, and 33B parameters), and one for all official EU languages (up to 71B parameters). These models, their training data, and the training process are detailed in the publications cited above as well as in Deliverable D4.1.
We have also trained a first batch of machine translation models using a combination of previously available data and new parallel data created by WP2 and WP3. The resulting models cover 14 language pairs and were trained using the OpusPocus pipeline manager. The data cleaning and pipeline configurations, together with the datasets used for training these models, are available in our MT model repository (https://github.com/hplt-project/HPLT-MT-Models). A more detailed description of the initial training pipeline and models is available in Deliverable D5.1.