Periodic Reporting for period 1 - DIALECT (Natural Language Understanding for non-standard languages and dialects)
Reporting period: 2022-10-01 to 2025-03-31
The DIALECT project aims to fundamentally change how we approach these issues by addressing variation in both input data and label outputs. DIALECT’s objectives are:
i) Develop algorithms for transferring knowledge from resource-rich languages to dialects and low-resource languages.
ii) Design models that integrate information to better handle variation.
iii) Create new datasets that include non-standard languages and dialects, while incorporating human label diversity rather than enforcing a single "correct" answer.
The ultimate goal is to build fairer and more accurate language technology. Particularly since the release of ChatGPT in November 2022, the challenges DIALECT addresses have only grown in urgency.
Our primary aim is to develop innovative technology that addresses dialectal variation in input and embraces human label variation in output. These models better reflect human uncertainty in labeling, capturing the natural way many speakers communicate. We have made numerous substantial contributions towards these goals.
"The “Problem” of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation" encapsulates my central vision for DIALECT’s label bias component. It reviews key literature on learning from multiple valid interpretations from a multidisciplinary perspective, identifies critical gaps throughout the entire NLP pipeline (not just in data gathering), and proposes paths forward. Since its publication, this seminal paper has gained considerable attention, with over 180 citations in less than two years. It has inspired several key technological directions and is influencing industry. For instance, we contributed advanced approaches in active learning that account for multiple plausible human interpretations to improve learning efficiency. Furthermore, in three papers we investigate the challenge of model uncertainty, considering human uncertainty in labeling as well as generative AI models. These studies introduce new insights and discuss their implications, arguing for the necessity of integrating human label variation into model uncertainty for both labeling and generation. Such integration, grounded in how humans actually communicate, is essential for developing more trustworthy language technology.
Furthermore, we have released new datasets with un-aggregated labels, such as in "Different Tastes of Entities: Investigating Human Label Variation in Named Entity Annotations," which support future innovation and benchmarking. Overall, these publications have highlighted key issues with human label variation in current technology, revolutionizing our understanding and setting the stage for an exciting research agenda in the upcoming years.
Our second goal is to delve deeper into NLP for dialects. We have published several foundational papers to advance insights in this area. "A Survey of Corpora for Germanic Low-Resource Languages and Dialects" provides a comprehensive overview of existing datasets for these languages. Building on this, we thoroughly evaluate state-of-the-art approaches in Natural Language Understanding and morphosyntax, establishing crucial groundwork. During the first year, we also developed novel datasets for NLP on dialects, including named entity recognition, a new Bavarian Treebank, and datasets for slot and intent detection in Bavarian. Given the societal impact of NLP, it is imperative to align technological development with user needs. In "What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects," we provide concrete results and innovative survey methods to reveal users' needs for NLP for dialects.
The paradigm shift in NLP to generative AI approaches has transformed the research landscape, prompting a shift away from the project's initial focus on pure transfer learning. We now focus on related issues such as modularity in large multilingual language models, efficient adaptation, and fairness. Key questions now include how to leverage the internals of large pre-trained language models, or pre-processing decisions (like tokenization), to better accommodate low-resource languages and dialects, as current technology is biased towards high-resource languages.
Regarding label bias, there has also been a shift: standard supervised learning has largely been replaced by prompting and few-shot learning. Nevertheless, the question of human disagreement in labeling remains central. It has shifted towards model uncertainty, model evaluation, language generation, and post-hoc training (preference learning from human feedback), and towards how these key directions can take multiple human perspectives into account. An important and exciting related direction is "pluralism" in language modeling. Deeply linked to the original problem of human label variation, it shows that going beyond a single ground truth has become even more pressing. Therefore, we have exciting research ahead of us in this space.
Our more than 50 publications are garnering significant academic recognition, with several exceeding 100 citations in the past two years. Another key outcome is that our work has led to new workshops and collaborations (e.g. the First Workshop on Uncertainty in NLP in 2024, accepted for its second edition in 2025) and shared tasks (the third shared task on Learning from Disagreement will be held at EMNLP 2025 and is co-organized by my lab).
Finally, I was invited to give one of the three plenary talks at ACL 2024, where I had the opportunity to present outcomes of DIALECT in front of over 2,500 on-site participants. This further underscores the growing interest and enthusiasm for these themes, as also exemplified by the outstanding paper award we received at ACL 2024 for "VariErr NLI: Separating Annotation Error from Human Label Variation".
Project website: https://dialect-erc.github.io/