
Lexical Acquisition Across Languages

Periodic Reporting for period 4 - LEXICAL (Lexical Acquisition Across Languages)

Reporting period: 2020-03-01 to 2021-08-31

Due to the growing volume of textual information available in multiple languages, there is great demand for Natural Language Processing (NLP) techniques that can automatically process and manage multilingual texts, supporting information access and communication in core areas of society (e.g. healthcare, business, science). Many NLP tasks and applications rely on task-specific lexicons (e.g. dictionaries, word classifications) for optimal performance. Recently, automatic acquisition of lexicons from relevant texts has proved a promising, cost-effective alternative to manual lexicography. It has the potential to considerably enhance the viability and portability of NLP technology both within and across languages. However, this approach has been explored for only a small number of resource-rich languages, leaving the vast majority of the world's languages without useful technology. The ambitious goal of this project was to take research in lexical acquisition to the level where it can support multilingual NLP, including languages for which no parallel language resources (e.g. corpora, knowledge resources) are available. Building on an emerging line of research which uses mainly naturally occurring supervision (connections between languages) to guide cross-lingual NLP, we developed a radically novel approach to lexical acquisition. This approach is capable of transferring lexical knowledge from one language to another, as well as simultaneously learning it for a diverse set of languages, using new methodology that guides joint learning and inference with rich knowledge about cross-lingual connections. The project has not only created novel lexical acquisition technology but has also taken cross-lingual NLP a big step toward independence from parallel resources.
We have demonstrated that our approach can support fundamental tasks and applications aimed at broadening the global reach of NLP to areas where it is now critically needed.
This project aimed to develop entirely novel methodology for lexical classification across languages. Although techniques for cross-lingual NLP existed prior to the project, their success was mainly limited to resource-rich languages and to scenarios where sufficient parallel or annotated data were available. The idea of this project was to devise methodology that could be usefully applied even when such data resources are missing, either in part or completely. The key idea was to develop rich knowledge about typological language connections and to use that knowledge to guide the joint learning and inference of lexical information across languages. We developed novel methodology for representation learning that could capture the range of lexical information of interest to the project (predicate-argument information, selectional preferences and verb classes), both within and across languages. We improved the ability of representation learning to capture such monolingual, bilingual and multilingual information in a variety of ways. This included improving its capability to benefit from multimodal data, character- and word-level information, contextual and syntactic information, as well as from monolingual and cross-lingual constraints. We also improved the ability of representation learning to deal with resource-poor scenarios, e.g. those that suffer from unseen words. Finally, we demonstrated how such methods can be extended to integrate (and be constrained by) typological information, and how this information can be used to guide lexical acquisition. We evaluated the technical innovations using existing and novel evaluation resources, as well as by using them to improve the performance of many cross-lingual and multilingual application tasks. Our project resulted in many resources and techniques capable of capturing rich lexical information across languages, including low-resource ones.
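To make the idea of transferring lexical knowledge between embedding spaces concrete, here is a minimal sketch of one standard technique from this line of work (not the project's specific method): aligning two monolingual word-embedding spaces with an orthogonal mapping learned from a small seed dictionary of translation pairs (the Procrustes solution). The toy embeddings below are synthetic, assumed for illustration only.

```python
import numpy as np

def learn_orthogonal_mapping(src, tgt):
    """Procrustes solution: find orthogonal W minimising ||src @ W - tgt||_F.

    src, tgt: (n, d) arrays of embeddings for n seed translation pairs.
    Returns W = U @ Vt, where U, S, Vt = svd(src.T @ tgt).
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

# Toy setup: pretend the target-language space is an unknown rotation
# of the source-language space (a common simplifying assumption).
rng = np.random.default_rng(0)
d = 4
q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # hidden "true" rotation
src_emb = rng.normal(size=(6, d))              # 6 seed source embeddings
tgt_emb = src_emb @ q                          # their target translations

W = learn_orthogonal_mapping(src_emb, tgt_emb)
mapped = src_emb @ W
print(np.allclose(mapped, tgt_emb, atol=1e-8))
```

Because an orthogonal mapping preserves distances, nearest-neighbour search in the mapped space can then retrieve candidate translations for words outside the seed dictionary, which is what makes this family of methods attractive in low-resource settings.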
These are fundamental advances that the field can benefit from. We published the advances in the top conferences and journals in the field to ensure impact on the wider NLP community.
This project has pushed the frontiers of our understanding of language processing and extended our ability to apply NLP across languages. We have developed a novel, expressive approach in which joint learning and inference across languages is guided by rich knowledge about cross-lingual connections. This is a ground-breaking contribution, since multilingualism is one of the biggest current challenges in NLP. In addition to handling multiple languages, our model can also handle complex syntactic-semantic linguistic knowledge, advancing the capabilities of joint learning and inference models of language. We have mainly focussed on a core component at the heart of many NLP systems: the lexicon. Rich lexical representations that link together the syntax and semantics of verbs provide effective means to deal with many challenges (e.g. ambiguity, noise, data sparsity) in NLP. They are important for the many applications that benefit from information related to predicate-argument structure. We have provided improved means to tune and create such resources automatically (i.e. in a cost-effective manner) and have extended this approach to resource-poor languages and domains. Our techniques support more accurate prediction of the appropriate interpretation of text within languages and improve our ability to match syntactic and semantic variations across languages for applications such as machine translation. Ultimately, improved automatic information processing is beneficial to communication and can support key areas of society (e.g. science, healthcare, trade). We aim to extend these benefits to the global level. Our project also provided rich material for theoretical investigations, because it brings together insights about language connections and probabilistic verb knowledge in data. This can benefit the linguistic and cognitive sciences and can also lead to improvements in language education (e.g. second language learning).