Natural Language Understanding for non-standard languages and dialects

Project description

Natural language understanding for non-standard languages and dialects

When artificial intelligence algorithms and language models analyse large numbers of big datasets, they are prone to bias simply because linguistic diversity is not adequately represented. This exclusion affects millions of speakers of dialects or less common languages and also cuts them off from emerging technologies. The EU-funded DIALECT project team will create algorithms that accommodate high levels of input variation, so that diverse dialects can be incorporated into language technology. It will also extend ground truth labels (that is, the reference labels used to check real-world accuracy) in interactive learning by including elements of human uncertainty. The result will be less data-intensive and will enable fairer and more accurate language processing.

Objective

Dialects are ubiquitous and for many speakers are part of everyday life. They carry important social and communicative functions. Yet dialects, and non-standard languages in general, are a blind spot in research on Natural Language Understanding (NLU). Despite recent breakthroughs, NLU still fails to take linguistic diversity into account. Because language variation is not modeled, language models are biased and show high error rates on dialect data. This failure excludes millions of speakers today and prevents the development of future technology that can adapt to such users.

To account for linguistic diversity, a paradigm shift is needed, away from data-hungry algorithms that learn passively from large datasets with single ground truth labels, which are known to be biased. To go beyond current learning practices, the key is to tackle variation at both ends: in the input data and in label bias. With DIALECT, I propose such an integrated approach: devising algorithms that aid transfer from rich variability in inputs, together with interactive learning that integrates human uncertainty in labels. This will reduce the need for data and enable better adaptation and generalization.

Advances in salient areas of deep learning research now make it possible to tackle this challenge. DIALECT’s objectives are to devise a) new algorithms and insights to address extremely scarce data setups and biased labels; b) novel representations which integrate auxiliary sources of information, such as complementing text data with speech; and c) new datasets with conversational data in its most natural form.

By integrating dialectal variation into models able to learn from scarce data and biased labels, the foundations will be established for fairer and more accurate NLU to break down language and literary barriers. I am privileged to carry out this integration as I have contributed to research in top venues on both cross-lingual learning and learning from biased labels.

Fields of science

natural sciences > computer and information sciences > artificial intelligence > machine learning > deep learning

Funding scheme

HORIZON-ERC - HORIZON ERC Grants

Host institution

LUDWIG-MAXIMILIANS-UNIVERSITAET MUENCHEN

Net EU contribution

€ 1 997 815,00

Address

GESCHWISTER SCHOLL PLATZ 1
80539 MUNCHEN
Germany

Region

Bayern > Oberbayern > München, Kreisfreie Stadt

Activity type

Higher or Secondary Education Establishments

Total cost

€ 1 997 815,00

Beneficiaries (1)

LUDWIG-MAXIMILIANS-UNIVERSITAET MUENCHEN

Germany

Net EU contribution

€ 1 997 815,00
