Natural Language Understanding for non-standard languages and dialects

Project description

Natural language understanding for non-standard languages and dialects

When the language models and algorithms of artificial intelligence crunch the numbers of huge datasets, they are prone to bias simply because linguistic diversity is inadequately represented. This exclusion affects millions of people who converse in dialects or uncommon languages, shutting them out of emerging technologies. The EU-funded DIALECT project will create algorithms that accommodate high levels of variation in the input, allowing diverse dialects to be integrated into language technology. It will also extend ground truth labels (that is, the reference annotations used to check accuracy against the real world) in interactive learning to include elements of human uncertainty. The result will be less data-hungry and will enable fairer and more accurate language processing.

Objective

Dialects are ubiquitous and, for many speakers, part of everyday life. They carry important social and communicative functions. Yet dialects, and non-standard languages in general, are a blind spot in research on Natural Language Understanding (NLU). Despite recent breakthroughs, NLU still fails to take linguistic diversity into account. This failure to model language variation results in biased language models with high error rates on dialect data. It excludes millions of speakers today and prevents the development of future technology that can adapt to such users.

To account for linguistic diversity, a paradigm shift is needed: away from data-hungry algorithms that learn passively from large datasets and single ground truth labels, which are known to be biased. To move beyond current learning practices, the key is to tackle the problem at both ends: variation in the input data and bias in the labels. With DIALECT, I propose such an integrated approach: devising algorithms that aid transfer from rich variability in inputs, together with interactive learning that integrates human uncertainty in labels. This will reduce the need for data and enable better adaptation and generalization.
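To make the contrast between a single ground truth label and a label that keeps human uncertainty concrete, here is a minimal sketch. It is not DIALECT's method: the class names, annotator votes, and model probabilities are all hypothetical, and soft-label cross-entropy is only one common way to train against annotator disagreement.

```python
import math
from collections import Counter


def soft_label(votes, classes):
    """Turn annotator votes into a probability distribution over classes."""
    counts = Counter(votes)
    return [counts[c] / len(votes) for c in classes]


def hard_label(votes, classes):
    """Collapse votes into a one-hot majority label (single ground truth)."""
    majority = Counter(votes).most_common(1)[0][0]
    return [1.0 if c == majority else 0.0 for c in classes]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and model probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted) if t > 0)


# Hypothetical example: five annotators label one dialect utterance.
classes = ["positive", "neutral", "negative"]
votes = ["positive", "positive", "neutral", "positive", "neutral"]

soft = soft_label(votes, classes)   # keeps the 3-vs-2 disagreement
hard = hard_label(votes, classes)   # discards it

model_probs = [0.6, 0.3, 0.1]       # hypothetical model output
loss_soft = cross_entropy(soft, model_probs)
loss_hard = cross_entropy(hard, model_probs)
```

With the hard label, the two "neutral" votes vanish from the training signal; with the soft label, the loss still rewards the model for placing probability mass on "neutral".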

Advances in salient areas of deep learning research now make it possible to tackle this challenge. DIALECT’s objectives are to devise a) new algorithms and insights to address extremely scarce data setups and biased labels; b) novel representations which integrate auxiliary sources of information, such as speech that complements text data; and c) new datasets with conversational data in its most natural form.

By integrating dialectal variation into models able to learn from scarce data and biased labels, the foundations will be established for fairer and more accurate NLU to break down language and literary barriers. I am privileged to carry out this integration as I have contributed to research in top venues on both cross-lingual learning and learning from biased labels.

Field of science

natural sciences > computer and information sciences > artificial intelligence > machine learning > deep learning

Funding scheme

HORIZON-ERC - HORIZON ERC Grants

Host institution

LUDWIG-MAXIMILIANS-UNIVERSITAET MUENCHEN

Net EU contribution

€ 1 997 815,00

Address

GESCHWISTER SCHOLL PLATZ 1
80539 MUNCHEN
Germany

Region

Bayern > Oberbayern > München, Kreisfreie Stadt

Activity type

Higher or Secondary Education Establishments

Total cost

€ 1 997 815,00

Beneficiaries (1)

LUDWIG-MAXIMILIANS-UNIVERSITAET MUENCHEN

Germany

Net EU contribution

€ 1 997 815,00
