A prototype system for obtaining and managing training data for multilingual learning

Project information

Data4ML

Grant agreement ID: 101113091

DOI

10.3030/101113091

Closed project

EC signature date 30 March 2023

Start date 1 October 2023

End date 30 September 2025

Funded under

European Research Council (ERC)

Total cost

No data

EU contribution

€ 150 000.00

Coordinated by

TECHNISCHE UNIVERSITAET MUENCHEN
Germany

Periodic Reporting for period 1 - Data4ML (A prototype system for obtaining and managing training data for multilingual learning)

Reporting period: 2023-10-01 to 2025-09-30

At the start of the project, high-quality machine translation systems
could not be created for most of the Earth's 7000 languages because of
a lack of parallel corpora (translated sentences available in two
languages), which were used as training data for state-of-the-art
machine translation systems based on machine learning.

In well-resourced scenarios, such as in machine translation of English
to German, training data consists of many pairs of English sentences
and their German translations. In companies or organizations with
large numbers of translators, such parallel training data is created
routinely, and the main challenge for system builders is to simply
obtain access to it. For small companies, small governmental and
non-governmental organizations, and community language preservation
initiatives, the creation of parallel data has been a haphazard
process. For less-resourced languages in particular, such as the
minority languages of Europe, there is a serious lack of parallel
training data.

Smaller organizations try to overcome this by:

1) Limited use of expensive, high-quality human translation,

2) Discovering and downloading naturally occurring sources of parallel
data on the World Wide Web,

3) Synthetic data generation.

In the project, we have created a prototype system to automate
approach 2. The availability of this data makes approach 3 possible
using open-source neural machine translation software. Overall, our
package has successfully made this technology available to smaller
organizations (such as low-resource language activists and small
companies).
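
To make approach 2 more concrete, the following is a minimal sketch of
parallel sentence mining: given sentences in two languages, keep the
pairs that appear to be translations of each other. It is an
illustration only and is not the method used in the Data4ML prototype.
The toy lexicon, the function names overlap_score and mine_pairs, and
the threshold of 0.5 are all assumptions made for this example; a real
system would typically score candidate pairs with learned multilingual
representations rather than a hand-written word list.

```python
from typing import Dict, List, Tuple

# Hypothetical toy German-to-English lexicon, invented for this example.
TOY_LEXICON: Dict[str, str] = {
    "der": "the", "hund": "dog", "schläft": "sleeps",
    "die": "the", "katze": "cat", "isst": "eats",
}


def overlap_score(src: str, tgt: str, lexicon: Dict[str, str]) -> float:
    """Jaccard overlap between the word-by-word gloss of src and the words of tgt."""
    glossed = {lexicon[w] for w in src.lower().split() if w in lexicon}
    tgt_words = set(tgt.lower().split())
    union = glossed | tgt_words
    return len(glossed & tgt_words) / len(union) if union else 0.0


def mine_pairs(
    src_sents: List[str],
    tgt_sents: List[str],
    lexicon: Dict[str, str],
    threshold: float = 0.5,  # assumed cut-off, chosen only for illustration
) -> List[Tuple[str, str, float]]:
    """For each source sentence, keep its best-scoring target if it clears the threshold."""
    pairs: List[Tuple[str, str, float]] = []
    if not tgt_sents:
        return pairs
    for s in src_sents:
        best = max(tgt_sents, key=lambda t: overlap_score(s, t, lexicon))
        score = overlap_score(s, best, lexicon)
        if score >= threshold:
            pairs.append((s, best, score))
    return pairs


# Unaligned sentences, as they might be found on two versions of a website.
german = ["Der Hund schläft", "Die Katze isst"]
english = ["the cat eats", "stock prices fell", "the dog sleeps"]

pairs = mine_pairs(german, english, TOY_LEXICON)
for src, tgt, score in pairs:
    print(f"{score:.2f}\t{src}\t{tgt}")
```

The mined pairs can then serve as training data, and a translation
system trained on them can in turn be used to generate synthetic data
(approach 3).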

As the software is open-source and solves a problem which many
organizations faced, we expected this software to be of strong
interest. In the project proposal, we planned three workshops for
potential users. We originally built up a list of 57 people
concretely interested in the workshops, and this list currently
contains 71 people.

The first workshop took place on Zoom on February 4th and February 6th
2025. We offered two sessions at different times to maximize the
number of participants worldwide. Further details and the materials
presented are available on the workshop website:
alexfraser.github.io/talks/language_community_workshop.html

The February 4th workshop had 28 participants, and the February 6th
workshop had 21 participants.

In the workshops, we determined that users were more interested in
multilingual large language models than dedicated translation
systems. Large language models are rapidly becoming the
state-of-the-art for low-resource languages. They have three important
advantages over dedicated translation systems:

1) They (somewhat amazingly) don't require parallel data: they can be
trained on different data in two languages and still learn to
translate.

2) They automatically leverage related languages to strengthen the
low-resource language. For example, for Upper Sorbian (a Slavic
language spoken in Germany by about 40,000 people), multilingual large
language models are able to leverage knowledge of Czech, which is a
closely related language.

3) They are useful for many other tasks besides translation.

In light of this change in emphasis, within the Data4ML project we
decided to carry out two new types of workshops - a low-resource
language task (Large Language Models with Limited Resources at WMT25)
and a low-resource language tutorial (Low-Resource, High-Impact:
Building Corpora for Inclusive Language Technologies), which has been
accepted as a tutorial for LREC2026.

The low-resource language task is a new shared task (a task studied in
a friendly competition between different research groups) at the
well-known ACL Conference on Machine Translation. The first edition was
organized by Data4ML at WMT25. See the web page at
https://www2.statmt.org/wmt25/limited-resources-slavic-llm.html

Low-Resource, High-Impact: Building Corpora for Inclusive Language
Technologies has been accepted as a tutorial for LREC2026. This
tutorial addresses our original target audience, different
low-resource language communities, but with the new focus on all types
of corpora including the creation of monolingual corpora for training
large language models. See the web page at:
https://tum-nlp.github.io/low-resource-tutorial/

The three workshop activities we held have played a significant
role in the development of a community of language activists
who are interested in NLP technology, and in particular large
language models, which are appropriate for the languages they
speak.

Overall, we are very happy with the outcome of the ERC Proof of
Concept project, as it set the stage for us to play a key role in
developing technologies for low-resource language communities for the
integration of their languages into state-of-the-art large language
models and intelligent chatbots (such as ChatGPT). This is further
evidenced by the awarding of an ERC Advanced Grant to PI Fraser,
EPICAL: Evaluating and Programming Intelligent Chatbots in Any
Language. This ERC Proof of Concept project played an important role
in the preliminary work which made that application possible.

In addition, we have created a valuable tool for low-resource language
communities who are interested in mining parallel corpora for their
languages from the web; see https://github.com/shuokabe/PaSeMiLL .
Work has continued on this prototype software, and a paper describing
a new version has been submitted to ACL 2026 and is under review. This
version of the software will be released as open source shortly.
