Natural Language Understanding for non-standard languages and dialects

Project Information

DIALECT

Grant agreement ID: 101043235

DOI

10.3030/101043235

EC signature date 7 July 2022

Start date 1 October 2022

End date 30 September 2027

Funded under

European Research Council (ERC)

Total cost

€ 1 997 815,00

EU contribution

€ 1 997 815,00


Coordinated by

LUDWIG-MAXIMILIANS-UNIVERSITAET MUENCHEN
Germany

Project description

When AI language models and algorithms are trained on huge data sets, they are prone to bias because linguistic diversity is inadequately represented in that data. This excludes millions of people who converse in dialects or non-standard languages, and it also shuts them out of emerging technologies. The EU-funded DIALECT project will create algorithms that can handle high levels of input variation, so that diverse dialects can be incorporated into language technology. It will also broaden ground truth labels (i.e. the reference annotations against which a model's output is checked) in interactive learning by including elements of human uncertainty. The resulting methods will be less data-intensive and will make language processing more equitable and accurate.

Objective

Dialects are ubiquitous and for many speakers are part of everyday life. They carry important social and communicative functions. Yet, dialects and non-standard languages in general are a blind spot in research on Natural Language Understanding (NLU). Despite recent breakthroughs, NLU still fails to take linguistic diversity into account. This lack of modeling language variation results in biased language models with high error rates on dialect data. This failure excludes millions of speakers today and prevents the development of future technology that can adapt to such users.

To account for linguistic diversity, a paradigm shift is needed, away from data-hungry algorithms that learn passively from large data sets with single ground truth labels, which are known to be biased. To move beyond current learning practices, the key is to tackle the problem at both ends: variation in input data and bias in labels. With DIALECT, I propose such an integrated approach: to devise algorithms which aid transfer from rich variability in inputs, and interactive learning which integrates human uncertainty in labels. This will reduce the need for data and enable better adaptation and generalization.

Advances in salient areas of deep learning research now make it possible to tackle this challenge. DIALECT’s objectives are to devise a) new algorithms and insights to address extremely scarce data setups and biased labels; b) novel representations which integrate auxiliary sources of information, such as complementing text data with speech; and c) new datasets with conversational data in its most natural form.

By integrating dialectal variation into models able to learn from scarce data and biased labels, the foundations will be established for fairer and more accurate NLU to break down language and literary barriers. I am privileged to carry out this integration as I have contributed to research in top venues on both cross-lingual learning and learning from biased labels.

Fields of science (EuroSciVoc)

CORDIS classifies projects with EuroSciVoc, a multilingual taxonomy of fields of science, through a semi-automatic process based on NLP techniques. See: The European Science Vocabulary.

This project has not yet been classified with EuroSciVoc.

Keywords

Project’s keywords as indicated by the project coordinator. Not to be confused with the EuroSciVoc taxonomy (Fields of science)

Programme(s)

Multi-annual funding programmes that define the EU’s priorities for research and innovation.

HORIZON.1.1 - European Research Council (ERC) MAIN PROGRAMME

Topic(s)

Calls for proposals are divided into topics. A topic defines a specific subject or area for which applicants can submit proposals. The description of a topic comprises its specific scope and the expected impact of the funded project.

ERC-2021-COG - ERC CONSOLIDATOR GRANTS

Funding Scheme

Funding scheme (or “Type of Action”) inside a programme with common features. It specifies: the scope of what is funded; the reimbursement rate; specific evaluation criteria to qualify for funding; and the use of simplified forms of costs like lump sums.

HORIZON-ERC - HORIZON ERC Grants


Call for proposal

Procedure for inviting applicants to submit project proposals, with the aim of receiving EU funding.

ERC-2021-COG


Host institution

LUDWIG-MAXIMILIANS-UNIVERSITAET MUENCHEN

Net EU contribution

€ 1 997 815,00

Address

GESCHWISTER SCHOLL PLATZ 1
80539 München
Germany

Region

Bayern Oberbayern München, Kreisfreie Stadt

Activity type

Higher or Secondary Education Establishments


Total cost

€ 1 997 815,00

Beneficiaries (1)

LUDWIG-MAXIMILIANS-UNIVERSITAET MUENCHEN

Germany

Net EU contribution

€ 1 997 815,00
