Because the software is open source and solves a problem that many
organizations face, we expected it to attract strong interest. In the
project proposal, we planned three workshops for potential users. We
initially compiled a list of 57 people with a concrete interest in the
workshops, and this list currently contains 71 people.
The first workshop took place on Zoom on February 4th and February 6th,
2025. We offered two sessions at different times to maximize the
number of participants worldwide. Further details and the materials
presented are available on the workshop website:
alexfraser.github.io/talks/language_community_workshop.html
The February 4th session had 28 participants, and the February 6th
session had 21 participants.
In the workshops, we determined that users were more interested in
multilingual large language models than dedicated translation
systems. Large language models are rapidly becoming the
state of the art for low-resource languages. They have three important
advantages over dedicated translation systems:
1) They (somewhat amazingly) don't require parallel data: they can be
trained on different data in the two languages and still learn to
translate.
2) They automatically leverage related languages to strengthen the
low-resource language, e.g. for Upper Sorbian (a Slavic language
spoken in Germany by about 40,000 people), multilingual large language
models are able to leverage knowledge of Czech, which is a closely
related language.
3) They are useful for many other tasks besides translation.
In light of this change in emphasis, within the Data4ML project we
decided to carry out two new types of workshops: a low-resource
language task (Large Language Models with Limited Resources at WMT25)
and a low-resource language tutorial (Low-Resource, High-Impact:
Building Corpora for Inclusive Language Technologies), which has been
accepted as a tutorial for LREC2026.
The low-resource language task is a new shared task (a task studied in
a friendly competition between different research groups) at the
well-known ACL Conference on Machine Translation. The first edition was
organized by Data4ML at WMT25. See the web page at:
https://www2.statmt.org/wmt25/limited-resources-slavic-llm.html
Low-Resource, High-Impact: Building Corpora for Inclusive Language
Technologies has been accepted as a tutorial for LREC2026. This
tutorial addresses our original target audience, different
low-resource language communities, but with a new focus on all types
of corpora, including the creation of monolingual corpora for training
large language models. See the web page at:
https://tum-nlp.github.io/low-resource-tutorial/