Periodic Reporting for period 1 - DIALECT (Natural Language Understanding for non-standard languages and dialects)
Reporting period: 2022-10-01 to 2025-03-31
The DIALECT project aims to fundamentally change how we approach these issues by addressing variation in both input data and label outputs. DIALECT’s objectives are:
i) Develop algorithms for transferring knowledge from resource-rich languages to dialects and low-resource languages.
ii) Design models that integrate information to better handle variation.
iii) Create new datasets that include non-standard languages and dialects, while incorporating human label diversity rather than enforcing a single "correct" answer.
The ultimate goal is to build fairer and more accurate language technology. Particularly since the release of ChatGPT in November 2022, the challenges DIALECT addresses have only grown in urgency.
Our primary aim is to develop innovative technology that addresses dialectal variation in input and embraces human label variation in output. These models better reflect human uncertainty in labeling, capturing the natural way many speakers communicate. We have made numerous substantial contributions towards these goals.
"The “Problem” of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation" encapsulates my central vision for DIALECT’s label bias component. It reviews key literature on learning from multiple valid interpretations from a multidisciplinary perspective, identifies critical gaps throughout the entire NLP pipeline (not just in data gathering), and proposes paths forward. Since its publication, this seminal paper has gained considerable attention, with over 180 citations in less than two years. It has inspired several key technological directions and is influencing industry. For instance, we contributed advanced approaches in active learning that account for multiple plausible human interpretations to improve learning efficiency. Furthermore, in three papers we investigate the challenge of model uncertainty, considering human uncertainty in labeling as well as generative AI models. These studies introduce new insights and discuss their implications, arguing for the necessity of integrating human label variation into model uncertainty for both labeling and generation. Such integration, grounded in how humans actually communicate, is essential for developing more trustworthy language technology.
Furthermore, we have released new datasets with un-aggregated labels, such as in "Different Tastes of Entities: Investigating Human Label Variation in Named Entity Annotations," which support future innovation and benchmarking. Overall, these publications have highlighted key issues with human label variation in current technology, revolutionizing our understanding and setting the stage for an exciting research agenda in the upcoming years.
Our second goal is to delve deeper into NLP for dialects. We have published several foundational papers to advance insights in this area. "A Survey of Corpora for Germanic Low-Resource Languages and Dialects" provides a comprehensive overview of existing datasets for these languages. Building on this, we thoroughly evaluate state-of-the-art approaches in Natural Language Understanding and morphosyntax, establishing crucial groundwork. During the first year, we also developed novel datasets for NLP on dialects, including named entity recognition, a new Bavarian Treebank, and datasets for slot and intent detection in Bavarian. Given the societal impact of NLP, it is imperative to align technological development with user needs. In "What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects," we provide concrete results and innovative survey methods to reveal users' needs for NLP for dialects.
The paradigm shift in NLP to generative AI approaches has transformed the research landscape, prompting a shift away from the project's initial focus on pure transfer learning. We now focus on related issues such as modularity in large multilingual language models, efficient adaptation, and fairness. Key questions now include how to leverage the internals of large pre-trained language models, or pre-processing decisions (like tokenization), to better accommodate low-resource languages and dialects, as current technology is biased towards high-resource languages.
Regarding label bias, there has also been a shift: standard supervised learning has largely been replaced by prompting and few-shot learning. Nevertheless, the question of human disagreement in labeling remains central. It has shifted towards model uncertainty, model evaluation, language generation, and post-hoc training (preference learning from human feedback), and towards how these key directions can take multiple human perspectives into account. An important and exciting related direction is "pluralism" in language modeling. Deeply linked to the original problem of human label variation, it shows that going beyond a single ground truth has become even more pressing. Therefore, we have exciting research ahead of us in this space.
Our more than 50 publications are garnering significant academic recognition, with several exceeding 100 citations in the past two years. Another key outcome is that our work has led to new workshops and collaborations (e.g. the First Workshop on Uncertainty in NLP in 2024, accepted for its second edition in 2025) and shared tasks (the third shared task on Learning from Disagreement will be held at EMNLP 2025 and is co-organized by my lab).
Finally, I was invited to give one of the three plenary talks at ACL 2024, where I had the opportunity to present outcomes of DIALECT in front of over 2,500 on-site participants. This further underscores the growing interest and enthusiasm for these themes, as also exemplified by the outstanding paper award we received at ACL 2024 for "VariErr NLI: Separating Annotation Error from Human Label Variation".
Project website: https://dialect-erc.github.io/