If you use online services or run a business that relies on customer intelligence, chances are you’ve already benefitted from natural language processing (NLP) technology. Chatbots, sentiment analysis, advertising and even creditworthiness assessments are among the many applications that put the technology to good use. But there’s a catch: while it works well for a language like English, NLP starts to struggle as soon as it has to deal with morphologically rich languages.
I’m sorry but tusaatsiarunnanngittualuujunga
“Take a sentence like ‘I can’t hear you very well’. For the most part, morphemes – the smallest meaningful units – are just words that can be identified by looking at the whitespace. But it all becomes much more complicated if you look at a morphologically rich language like Inuktitut (an Inuit language). Here the same sentence would be expressed in a single word ‘tusaatsiarunnanngittualuujunga’,” says Bollmann, postdoctoral researcher and Marie Skłodowska-Curie fellow at the University of Copenhagen.
To adapt, most NLP models now integrate techniques like byte-pair encoding (BPE), which breaks the input down below the word level by identifying frequent character sequences. ‘Tusaa’, for instance, is very common and will be represented as a single unit. But that’s still not enough to make a proper NLP model for Inuktitut. According to Bollmann, the technique is doomed to fail because it doesn’t identify units in a way that is linguistically meaningful. “Many in the NLP community believe that BPE is all we need, that it will get better with enough data and eventually will be able to figure out the relevant structure. I tend to disagree: in my opinion, we need to model each linguistic structure more explicitly,” Bollmann explains.
With MorphIRe (Morphologically-informed representations for natural language processing), Bollmann uses deep learning with neural network architectures to learn representations grounded in morphemes, before applying them to state-of-the-art models for a variety of NLP tasks. His work won’t be completed until March 2021, but the research has already provided evidence that errors in current NLP algorithms can often be traced back to morphology.
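To make the byte-pair encoding idea mentioned above more concrete, here is a minimal Python sketch of how frequent character sequences can be merged into single units. It is a toy illustration and not the code used in MorphIRe or in any production tokenizer: the function names (`merge_pair`, `learn_bpe`, `segment`), the tie-breaking rule and the tiny repeated-word corpus are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each distinct word starts out as a tuple of single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken alphabetically (an assumption
        # made here only so that the result is deterministic).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split `word` into units by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: the frequent sequence from the article, repeated.
merges = learn_bpe(["tusaa"] * 10, num_merges=4)
print(segment("tusaatsiarunnanngittualuujunga", merges))
```

In this toy setting the frequent sequence ends up as the single unit ‘tusaa’, while the rest of the word stays split into short fragments chosen purely by frequency. That is the behaviour Bollmann criticises: nothing in the procedure knows where the morpheme boundaries actually are.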
“My objective now is to identify morphological structure in a way that is mostly language-independent,” Bollmann adds. “This is challenging for a lot of reasons, one of which being the lack of good, annotated resources for this task. To put it simply, if I built a system that identifies morphological structure, I would have a hard time evaluating how good it actually is because there is little data to compare the analyses to.”
Another key challenge for the project is to convince more researchers that its approach is actually useful. To be viable, a morphologically informed approach to NLP would need to compete with current state-of-the-art techniques using input representations trained on expensive hardware for days or even weeks. As Bollmann notes: “It requires a lot of time and resources to compete with these models. I am currently running some pilot experiments to select a few languages that will hopefully show how my proposed approach can improve on the current state of the art.”
Should he be successful, Bollmann foresees many possible applications from machine translation to search engines. But he still has a long way to go before he can consider such options.
Keywords: MorphIRe, language, natural language processing, NLP, morphologically rich languages, morpheme, byte-pair encoding, BPE