Project breaks new ground in AI to create ‘DNA of language’

With new approaches to machine learning, research provides language-independent representations of text that can tackle artificial intelligence hallucination.

Fundamental Research

Artificial intelligence (AI) hallucinations, in which a model generates content that is false and not based on real-world data, have become a trending topic with the roll-out of large language models such as ChatGPT and Bard. Through an EU-funded research project, computer scientists can now get closer to perfect natural language processing (NLP) in any language while avoiding fabricated information.

MOUSSE, or Multilingual, Open-text Unified Syntax-independent SEmantics, investigated new directions to improve the capabilities of multilingual semantic parsing, without the heavy requirement of annotating data for each different language.

“While powerful and impressive, large language models, like ChatGPT or Bard, still struggle to replicate the confidence and common sense that characterises humans. MOUSSE lays the foundations for this ambitious goal. It provides a huge repository of multilingual knowledge that can be used to ground the reasonings and outputs of these models and tackle the problematic phenomenon of hallucination,” states Roberto Navigli, Head of the Sapienza Natural Language Processing Group and MOUSSE project coordinator.

Using multilinguality as a resource

The extensive repository developed by MOUSSE is described by Navigli as ‘the DNA of language’, since it provides the basis to construct meaningful sentences in many languages. This is thanks to the main result achieved by the project: the ability to create the computational equivalent of the mental representations humans form of texts, but independently of the language. Navigli explains: “The computer forms an idea of the meaning of a sentence that abstracts away from the language and from the surface form, that is, the words through which that meaning is expressed.”

The more languages the team uses to express the semantics, the more they can corroborate the quality of the representation learnt. Conversely, once a representation is obtained from a sentence in one language, sentences in other languages can be produced to express the same meaning. “It looks very similar to machine translation, but it goes one step further: it provides a formal, structured proof of what the machine understood,” says Navigli.

By taking advantage of multilinguality, MOUSSE is helping to level the playing field in NLP research for all EU languages and hundreds of other languages. The multilingual repository can also be useful for language learners to improve their vocabulary and learn in a way that is based more on meaning than on individual words.
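The idea of a meaning representation that abstracts away from the surface form can be illustrated with a toy sketch. This is not the MOUSSE system or the BabelNet API: the concept identifiers, the tiny lexicon, the `Meaning` class and the `parse` and `generate` helpers are all hypothetical, and the parser assumes three content words in a fixed agent-predicate-patient order.

```python
from dataclasses import dataclass

# Hypothetical concept identifiers shared across languages.
# These are illustrative only, not real BabelNet synset IDs.
LEXICON = {
    "en": {"cat": "C:CAT", "eats": "C:EAT", "fish": "C:FISH"},
    "it": {"gatto": "C:CAT", "mangia": "C:EAT", "pesce": "C:FISH"},
}


@dataclass(frozen=True)
class Meaning:
    """A language-independent structure: concepts plus their roles."""
    predicate: str
    agent: str
    patient: str


def parse(tokens, lang):
    """Toy parser: maps three content words to concepts and roles.

    Assumes a fixed agent-predicate-patient word order.
    """
    agent, predicate, patient = (LEXICON[lang][token] for token in tokens)
    return Meaning(predicate=predicate, agent=agent, patient=patient)


def generate(meaning, lang):
    """Toy generator: expresses the same meaning with words of `lang`."""
    inverse = {concept: word for word, concept in LEXICON[lang].items()}
    return [inverse[meaning.agent],
            inverse[meaning.predicate],
            inverse[meaning.patient]]


english = parse(["cat", "eats", "fish"], "en")
italian = parse(["gatto", "mangia", "pesce"], "it")
```

In this sketch, the English and Italian inputs yield the same `Meaning` object, and `generate(english, "it")` produces the Italian words from a representation obtained in English. The structure itself can be inspected directly, which loosely mirrors the ‘structured proof’ of what the machine understood.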

Leveraging AI tools for the best outcome

The capabilities developed by MOUSSE are obtained in four main steps: word sense disambiguation, entity linking, semantic role labelling and semantic parsing. The results were achieved not only through deep learning but also by keeping the model and its outputs interpretable and manipulable.

In summary, the project was able to connect symbolic knowledge and neural networks, leading to an innovative neuro-symbolic approach. According to Navigli, this means it takes the best of both worlds: high performance and effectiveness from the neural models, and interpretability, manipulability and language independence from the symbolic part.

The symbolic knowledge is provided by multilingual knowledge graphs like BabelNet, a huge multilingual encyclopaedic computational dictionary that was an output of the MultiJEDI project, also coordinated by Navigli. The results of both projects have been engineered and made sustainable by his successful university spinoff company, Babelscape.