Periodic Reporting for period 1 - HPLT (High Performance Language Technologies)
Reporting period: 2022-09-01 to 2024-02-29
We were able to secure compute and storage allocations on multiple HPC systems, including the EuroHPC pre-exascale LUMI supercomputer, the petascale Karolina system, and the Sigma2 NIRD service platform. In total, we were allocated more than 10 million GPU-hours and 10 million CPU-hours of compute capacity, which were used first for data cleaning and later for training the initial large language models (LLMs) and neural machine translation (NMT) models.
We trained and publicly released the initial versions of the LLMs and NMT models (https://hplt-project.org/models/llm). The released models cover a similar range of languages and language pairs as the HPLT dataset releases. In work on WP4, we created an initial series of monolingual encoder-only (BERT-like) models covering 75 languages. We have additionally completed work on several decoder-only (GPT-like) models, including the FinGPT family of large generative Finnish models, the NORA.LLM Norwegian models, and the 34B-parameter Poro model trained on Finnish, English, and code. Building on the experience and technology developed in these efforts, we are currently training two families of generative models: one for the Nordic languages, English, and code (7B, 13B, and 33B parameters), and one for all official EU languages (up to 71B parameters). These models, their training data, and the training process are detailed in the publications cited above as well as in Deliverable D4.1.
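For illustration, the sketch below shows how one of the released encoder models could be loaded with the Hugging Face transformers library. It assumes the checkpoints are mirrored on the Hugging Face Hub; the repository identifier and the need for trust_remote_code are assumptions, so consult the model listing linked above for the actual identifiers and usage instructions.

```python
# Minimal sketch: loading a released HPLT encoder model via transformers.
# The repository id below is a hypothetical example, not a confirmed name;
# see https://hplt-project.org/models/llm for the actual checkpoints.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "HPLT/hplt_bert_base_en"  # hypothetical English encoder checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
# trust_remote_code=True may be needed if the checkpoint ships a custom
# model class rather than a stock BERT implementation (an assumption here).
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("HPLT released encoder models for 75 languages.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch, sequence_length, vocab_size)
```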
We have also trained a first batch of machine translation models using a combination of previously available data and new parallel data created by WP2 and WP3. The resulting models cover 14 language pairs and were trained using the OpusPocus pipeline manager. The data-cleaning and pipeline configurations, as well as the datasets used to train these models, are available in our MT model repository (https://github.com/hplt-project/HPLT-MT-Models). A more detailed description of the initial training pipeline and models is available in Deliverable D5.1.
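As an illustration, the following sketch translates a sentence with one of the released MT models, under the assumption that the checkpoint has been converted to the Hugging Face Marian format; the model identifier is hypothetical, and the repository linked above documents the native usage of the released checkpoints.

```python
# Minimal sketch: translating with a released HPLT MT model, assuming a
# Hugging Face Marian-format conversion exists. The model id is a
# hypothetical placeholder; see the HPLT-MT-Models repository for the
# actual released checkpoints.
from transformers import MarianMTModel, MarianTokenizer

model_id = "HPLT/translate-en-fi"  # hypothetical English-to-Finnish model

tokenizer = MarianTokenizer.from_pretrained(model_id)
model = MarianMTModel.from_pretrained(model_id)

batch = tokenizer(["The models cover 14 language pairs."],
                  return_tensors="pt")
generated = model.generate(**batch, max_new_tokens=64)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```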
To reach parity with the largest models available globally, additional effort will be needed throughout the rest of the project and in follow-up projects under the Digital Europe AI-06 calls. In addition, for the results to be truly Open Source (the models already are, but the source data is only available in limited form), legislative changes are necessary throughout the EU.