CORDIS - EU research results

Multilingual, Open-text Unified Syntax-independent SEmantics

Periodic Reporting for period 3 - MOUSSE (Multilingual, Open-text Unified Syntax-independent SEmantics)

Reporting period: 2020-06-01 to 2021-11-30

The exponential growth of the Web is resulting in vast amounts of online content. However, the information expressed therein is not within easy reach: what we typically browse is only an infinitesimal part of the Web. And even if we had time to read the whole Web, we could not understand it, as most of it is written in languages we do not speak. Computers, instead, have the power to process the entire Web. But in order to "read" it, that is, to perform machine reading, they still have to face the hard problem of Natural Language Understanding, i.e. automatically making sense of human language. To tackle this long-standing challenge in Natural Language Processing (NLP), the task of semantic parsing has recently gained popularity. Semantic parsing aims at creating structured representations of meaning for an input text. However, current semantic parsers require supervision, which binds them to the language of interest and hinders their extension to multiple languages. Here we propose a research program to investigate radically new directions for enabling multilingual semantic parsing, without the heavy requirement of annotating training data for each new language. The key intuitions of our proposal are treating multilinguality as a resource rather than an obstacle, and embracing the knowledge-based paradigm, which allows supervision in the machine learning sense to be replaced with efficacious use of lexical knowledge resources. In stage 1 of the project we will acquire a huge network of language-independent, structured semantic representations of sentences. In stage 2, we will leverage this resource to develop innovative algorithms that perform semantic parsing in any language. These two stages are mutually beneficial, progressively enriching less-resourced languages and contributing towards leveling the playing field for all languages. Enabling Natural Language Understanding across languages should have an impact on NLP and other areas of AI, plus a societal impact on language learners. An important benefit for society would be the ability of automatic systems to "explain" their understanding of text in an intelligible way.
The MOUSSE project aims at advancing the frontiers of Natural Language Understanding, a key area of NLP aimed at determining the meaning of text. Its two objectives map largely one-to-one onto WP1 and WP2. In the first part of the project we focused on Word Sense Disambiguation and Semantic Role Labeling as preliminary tasks for constructing and enabling multilingual semantic parsing systems which are independent of the language and as independent as possible of training data.

As regards WP1, namely the Creation of a Network of Semantic Representations, the following results have been obtained:

1) We have put forward effective, state-of-the-art approaches for integrating word and sense embeddings and creating a shared, multilingual space of words and meanings, independent of the starting language. This work started in 2017 with the extension of the word2vec neural architecture to senses (CoNLL 2017), followed by a bidirectional LSTM-based approach to sense embeddings called LSTMEmbed (ACL 2019) and ongoing work to be presented this year at AAAI.
2) We have created new multilingual lexical-semantic resources for representing syntagmatic information (SyntagNet) and predicate-argument information (VerbAtlas), both presented at EMNLP 2019.
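
As a rough illustration of what a shared multilingual space of words and meanings makes possible, the sketch below picks the sense whose vector is closest (by cosine similarity) to the average vector of the context words, whatever language those words come from. It is not the project's method: the three-dimensional vectors, the sense identifiers and the function names are all invented for this example.

```python
import math

# Toy vectors, invented for illustration only (not real embeddings).
# English and Italian words live in the same space as the senses.
WORD_VECTORS = {
    "bank": [0.5, 0.5, 0.0],    # English, ambiguous
    "money": [1.0, 0.0, 0.0],   # English
    "river": [0.0, 1.0, 0.0],   # English
    "banca": [0.9, 0.1, 0.0],   # Italian
    "denaro": [1.0, 0.0, 0.1],  # Italian
    "fiume": [0.0, 1.0, 0.1],   # Italian
}

# Hypothetical language-independent sense identifiers.
SENSE_VECTORS = {
    "bank%finance": [1.0, 0.0, 0.0],
    "bank%river": [0.0, 1.0, 0.0],
}


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def nearest_sense(context_words, candidate_senses):
    """Return the candidate sense closest to the averaged context vector."""
    known = [WORD_VECTORS[w] for w in context_words if w in WORD_VECTORS]
    context = average(known)
    return max(candidate_senses, key=lambda s: cosine(context, SENSE_VECTORS[s]))
```

Because words and senses share one space, the same `nearest_sense` call works for an English context such as `["money", "bank"]` and for an Italian one such as `["fiume"]`, with no language-specific training step.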

As regards WP2, namely Robust Multilingual Semantic Parsing, in order to be able to perform semantic parsing, two key tasks needed to be addressed multilingually: Word Sense Disambiguation and Semantic Role Labeling. During the first part of the project, we have made many key steps forward which pave the way to multilingual semantic parsing, namely:

1) We have proposed innovative techniques for creating silver multilingual training data, which address the so-called knowledge acquisition bottleneck. Train-O-Matic is an innovative method for bootstrapping the creation of sense-annotated training datasets (presented at EMNLP 2017; an extension is to appear in the Artificial Intelligence Journal, 2020). While the work presented at EMNLP addressed the task of Multilingual Word Sense Disambiguation (WSD), it is currently being extended to Multilingual Semantic Role Labeling and, later in the project, will be applied to Multilingual Semantic Parsing. A higher-performance approach, presented at ACL 2019, leverages Wikipedia categories and latent sense representations to produce large training data for nouns in multiple languages, with state-of-the-art performance in Word Sense Disambiguation. Additional approaches for the automatic creation of large training corpora have been explored (ACL 2017 and LREC 2018).
2) We have studied the impact of knowledge-based methods for learning sense distributions (AAAI 2018). Sense distributions are an essential prerequisite for the creation of large-scale datasets for Natural Language Understanding, and therefore for the long-term goal of multilingual semantic parsing.
3) We have introduced the first neural model based on recurrent neural networks, and in particular Long Short-Term Memory Models, for Word Sense Disambiguation (EMNLP 2017), which achieved state-of-the-art performance in WSD and provided a major breakthrough: the ability to perform zero-shot learning on multilingual WSD, that is, to train the system on English data (the only currently available in the field) and test it on arbitrary languages, thanks to the use of multilingual embeddings which share the semantics of words across languages.
4) We then moved to the task of Semantic Role Labeling (SRL), putting forward a novel dependency-based system (EMNLP 2019, available at http://verbatlas.org) which takes advantage of current English-only SRL resources (PropBank) and our large-scale multilingual verbal frame resource (VerbAtlas). SRL involves four main subtasks: predicate identification, predicate sense disambiguation, argument identification and argument role labelling.
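
The four SRL subtasks listed in the last item can be made concrete with a toy, rule-based decomposition over a dependency-parsed sentence. This is only a sketch of how the subtasks fit together, not the neural system described in this report: the lexicon `TOY_FRAMES`, the frame label `FRAME_EAT`, the role names and all function names are assumptions made up for the example, and they are not taken from VerbAtlas or PropBank.

```python
from dataclasses import dataclass, field

# Hypothetical toy lexicon: predicate lemma -> (frame label, role per dependency relation).
TOY_FRAMES = {
    "eat": ("FRAME_EAT", {"nsubj": "Agent", "obj": "Patient"}),
}


@dataclass
class Token:
    index: int
    lemma: str
    head: int    # index of the syntactic head, -1 for the root
    deprel: str  # dependency relation to the head


@dataclass
class PredicateAnnotation:
    predicate: int                             # token index of the predicate
    frame: str                                 # disambiguated predicate sense
    roles: dict = field(default_factory=dict)  # argument token index -> role


def identify_predicates(tokens):
    """Subtask 1: find the tokens that act as predicates."""
    return [t for t in tokens if t.lemma in TOY_FRAMES]


def disambiguate_predicate(token):
    """Subtask 2: choose the sense (frame) of a predicate."""
    return TOY_FRAMES[token.lemma][0]


def identify_arguments(tokens, predicate):
    """Subtask 3: take the predicate's direct dependents as candidate arguments."""
    return [t for t in tokens if t.head == predicate.index]


def label_argument(predicate, argument):
    """Subtask 4: assign a semantic role, or None if the toy lexicon has none."""
    return TOY_FRAMES[predicate.lemma][1].get(argument.deprel)


def label_sentence(tokens):
    """Run the four subtasks in sequence for every predicate in the sentence."""
    annotations = []
    for pred in identify_predicates(tokens):
        annotation = PredicateAnnotation(pred.index, disambiguate_predicate(pred))
        for arg in identify_arguments(tokens, pred):
            role = label_argument(pred, arg)
            if role is not None:
                annotation.roles[arg.index] = role
        annotations.append(annotation)
    return annotations
```

For a parsed sentence such as "cat eat fish", encoded as `[Token(0, "cat", 1, "nsubj"), Token(1, "eat", -1, "root"), Token(2, "fish", 1, "obj")]`, `label_sentence` yields one annotation for the predicate at index 1, with token 0 labelled as `Agent` and token 2 as `Patient`.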

Work on WP3 is currently ongoing. Besides the work on silver data creation (see Train-O-Matic and OneSeC above), experts on Chinese and Arabic are working on key issues concerning low-resourced languages.
As regards WP4, i.e. evaluation, we have studied the impact of word senses in downstream applications such as sentiment analysis and text classification ([4] presented at ACL 2017) and have released important multilingual datasets for WSD (Train-O-Matic, EuroSense, OneSeC, SyntagNet) and SRL (SyntagNet, VerbAtlas) which will represent key benchmarks for future research in the field.

Collaborations have been established with Prof. James Pustejovsky (Brandeis University), Luigi Di Caro (University of Torino), Rocco Tripodi (University of Venice) and Christophe Gravier (Université de Lyon) on different aspects of meaning representation and their interactions with words in arbitrary languages. Several papers are currently under submission at relevant conferences.
The results of the first part of MOUSSE have advanced the state of the art in the field in the following respects:

1) it is now possible to perform multilingual Word Sense Disambiguation without needing manually created training data, thanks to novel approaches to high-quality silver data creation.
2) knowledge-based approaches have been shown to rival neural supervised approaches thanks to the integration of lexical-semantic syntagmatic information.
3) it is now possible to perform Semantic Role Labeling in arbitrary languages, thanks to the availability of VerbAtlas, a novel verb resource which overcomes the issues of PropBank and related resources in the literature (scalability, language specificity, human readability) and encodes the semantics of verb predicates and their arguments in a language-independent manner, and of Semantic Role Labeling systems which exploit VerbAtlas.

It is expected that, by the end of the project, a full-fledged, high-performance (joint dependency- and span-based) multilingual Semantic Role Labeling approach will be put forward. An innovative multilingual semantic parsing approach, which will produce structured representations for input sentences, is also expected.