
Morphologically-informed representations for natural language processing

Periodic Reporting for period 1 - MorphIRe (Morphologically-informed representations for natural language processing)

Reporting period: 2019-04-01 to 2021-03-31

Natural language processing (NLP) is a technology that we encounter frequently in the digital world: for example, it is involved when we use an automatic translation service, or when we type a question into a search engine and get back an answer extracted from the web. While this often works remarkably well for languages like English, the performance of such systems is significantly worse for less-researched languages like Basque, Finnish, or Polish. This is an important societal issue, as it contributes to a "digital language divide" where speakers who are not proficient in English are put at a disadvantage. The MorphIRe project worked on closing this gap by developing techniques that perform better on a broader range of the world's languages.

Today's NLP models mostly rely on artificial intelligence and machine learning: techniques that require large amounts of training data (e.g. sets of questions with their correct answers), which are then fed into an algorithm that "learns" to perform the task. Importantly, the techniques that are widely used today are indifferent to which language is being used: whether the task is performed on English or on Basque, the algorithms work exactly the same. In particular, they do not take into account the word-internal (i.e. "morphological") structure of these languages: whereas English tends to use separate words to express different grammatical and semantic concepts, morphologically richer languages like Basque can express these concepts within a single word form (compare English "because of the rain" with Basque "euriagatik").

The MorphIRe project provides direct evidence that we shouldn't ignore the morphological structure of languages when building NLP models, as it contributes to errors that current state-of-the-art NLP models make. It also proposes a new algorithm for word segmentation that better corresponds to morphological structure. By highlighting these problems in today's NLP models and working towards concrete solutions that can be integrated into these models, the MorphIRe project makes an important contribution towards improving NLP technology for a wider range of languages.
The MorphIRe project has analysed NLP models for 4 different tasks and 57 different languages with regard to the errors they make that can be traced back to morphological features. A main result of this large-scale analysis is that morphology is indeed an important source of error in today's NLP models, highlighting the need for models that take it into account more explicitly. A possible way to do this is via morphologically-aware segmentation, i.e. a way to split up words before further processing such that the splits reflect word-internal structure. To this end, the MorphIRe project has worked on segmentation algorithms that can do this in a highly multilingual setting, i.e. running on more than 100 languages at the same time. First attempts to apply these algorithms to machine translation in challenging scenarios (such as indigenous American languages like Aymara or Guaraní) have provided important signals for further revising and developing these algorithms.
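To make the idea of segmentation concrete, the following minimal Python sketch splits the Basque example "euriagatik" in two ways. It is an illustration only and not the project's algorithm, which this summary does not describe. The function `greedy_segment` and both toy vocabularies are hypothetical: one vocabulary holds arbitrary character chunks of the kind a purely statistical segmenter might produce, the other holds units that roughly correspond to morphemes (euri "rain", -a article, -gatik "because of").

```python
def greedy_segment(word, vocabulary):
    """Split `word` left to right, always taking the longest piece found
    in `vocabulary`; fall back to a single character if nothing matches."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocabulary:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry starts here: emit one character.
            pieces.append(word[start])
            start += 1
    return pieces


# Hypothetical toy vocabularies, invented for this illustration.
SUBWORD_VOCAB = {"eur", "iag", "at", "ik"}   # chunks that ignore morphology
MORPHEME_VOCAB = {"euri", "a", "gatik"}      # chunks aligned with morphemes

word = "euriagatik"
subword_split = greedy_segment(word, SUBWORD_VOCAB)
morpheme_split = greedy_segment(word, MORPHEME_VOCAB)

print(subword_split)
print(morpheme_split)
```

Both splits cover the same string, but only the second places its boundaries where the word-internal structure changes, which is the property a morphologically-aware segmenter aims for.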

The project has also produced a meta-study on how the scientific community engages with older literature compared with the most recent literature, for example older work that motivates the need for linguistically-informed approaches versus the very latest advances in artificial intelligence, which mostly do not make use of such approaches.

Results have been disseminated at high-profile international conferences, such as the Annual Meeting of the Association for Computational Linguistics (ACL), and all papers, code & datasets produced by this project are openly accessible and re-usable by the scientific community. The next step for exploiting this project's results is successfully applying the developed algorithms to a wide range of NLP tasks and languages and improving on the state of the art for them, which the project's researcher continues to work on.
Improving NLP technology for a wide range of languages has been the key objective of this project, with great potential benefits for society. The large-scale analysis of the role of morphology in today's NLP models has laid the groundwork for achieving this. It was recognized as an important contribution to the scientific debate by being awarded "best long paper" at the conference where it was published, which also served to considerably raise awareness of this issue in the wider scientific community. The work on improved segmentation algorithms, partly finished and expected to be published before the end of 2021, has the potential to realize some of those benefits.
Word representations in NLP: today's models (left) vs. morphologically-aware models (right)