Developing Multilingual Web-scale Language Technologies

Objective

This project aims to provide meaning to the web. MEANING will enhance current web applications by automatically increasing the linguistic depth and breadth of existing multilingual resources and by devising improved concept-based Natural Language Processing (NLP) technologies that use those resources. Current web access applications are based on words; MEANING will open the way for access to the Multilingual Web based on concepts, providing applications with capabilities that significantly exceed those currently available. MEANING will facilitate the development of concept-based, open-domain Internet applications. Furthermore, MEANING will supply a common conceptual structure to Internet documents, thus facilitating knowledge management of web content.

OBJECTIVES
To build the next generation of intelligent open-domain Human Language Technology (HLT) application systems, we need to solve two complementary intermediate tasks: Word Sense Disambiguation (WSD) and large-scale enrichment of Lexical Knowledge Bases. WSD is the task of assigning the appropriate meaning (sense) to a given word in a text or discourse, and it is one of the most important open problems in NLP. However, progress is difficult due to the following paradox:
1) In order to enrich Lexical Knowledge Bases, we need to acquire information from corpora that have been accurately tagged with word senses.
2) In order to achieve accurate WSD, we need far more linguistic and semantic knowledge than is available in current lexical knowledge bases (e.g. WordNets).
The major objective of MEANING is to provide innovative technology to solve this problem.
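To make the WSD task concrete, the following is a minimal sketch of a gloss-overlap disambiguator (a simplified Lesk-style heuristic). It is not the method used by MEANING; the sense labels, glosses and function names are hypothetical and chosen only for illustration. It picks the sense whose definition shares the most words with the surrounding sentence.

```python
# Toy sense inventory: hypothetical glosses, NOT taken from WordNet or EuroWordNet.
SENSES = {
    "bank": {
        "bank#finance": "an institution that accepts deposits and lends money",
        "bank#river": "sloping land beside a river or other body of water",
    }
}

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "and", "at", "by", "in", "of", "on", "or", "that", "the", "to"}


def tokens(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def disambiguate(word, sentence, inventory=SENSES):
    """Return the sense whose gloss shares the most words with the sentence."""
    context = tokens(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in inventory[word].items():
        overlap = len(context & tokens(gloss))
        if overlap > best_overlap:  # ties keep the first sense listed
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("bank", "She sat on the bank of the river and watched the water"))
print(disambiguate("bank", "The bank lends money to small businesses"))
```

The sketch also illustrates the paradox above: the heuristic can only be as accurate as the knowledge attached to each sense, and short glosses rarely overlap with real sentences, which is why richer lexical knowledge is needed.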

DESCRIPTION OF WORK
MEANING will develop concept-based technologies and resources through large-scale processing over the web, robust and fast machine learning algorithms, very large lexical resources and new strategies for combining them. MEANING will treat the web as a (huge) corpus to learn from, since even the largest conventional corpora available (e.g. the British National Corpus) are not large enough to yield reliable, sufficiently detailed information about language behaviour. Moreover, most European languages do not have large or diverse enough corpora available. We will use a combination of Machine Learning and novel Knowledge-Based techniques to enrich the structure of the WordNets in different domains (subsets of the web) in five European languages: English, Italian, Spanish, Catalan and Basque. MEANING will produce:
a) A Tool Set that, using the semantic knowledge of EuroWordNet, will automatically obtain from the web large collections of examples for each particular word sense.
b) A Tool Set for enriching EuroWordNet using the knowledge acquired automatically from the Web.
c) A Tool Set for accurately selecting the senses of the open-class words in the languages involved in the project.
MEANING will also develop a Multilingual Central Repository to maintain compatibility between WordNets of different languages and versions, old and new. The knowledge acquired from each language will be consistently uploaded to the Multilingual Central Repository and ported over to the local WordNets involved in the project. MEANING will also produce a semantically annotated corpus for each WordNet word sense, that is, a Multilingual Web corpus annotated with concept and domain labels.
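The upload-and-port cycle of the Multilingual Central Repository can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions: it supposes that the WordNets are linked through shared language-independent concept identifiers, and the class name, concept IDs, relation name and example words are all hypothetical rather than taken from the project.

```python
from collections import defaultdict


class CentralRepository:
    """Hypothetical sketch: knowledge is stored per concept, not per word."""

    def __init__(self):
        # concept_id -> {language code: set of words expressing the concept}
        self.synsets = defaultdict(dict)
        # concept_id -> set of (relation name, target concept_id)
        self.relations = defaultdict(set)

    def add_synset(self, concept_id, language, words):
        self.synsets[concept_id][language] = set(words)

    def upload(self, concept_id, relation, target_id):
        """Record knowledge acquired in any one language at the concept level."""
        self.relations[concept_id].add((relation, target_id))

    def port(self, language):
        """Express concept-level relations as word-level relations for one local wordnet."""
        local = []
        for source, rels in self.relations.items():
            for relation, target in sorted(rels):
                src_words = self.synsets[source].get(language)
                tgt_words = self.synsets[target].get(language)
                # Skip relations whose concepts are not lexicalised in this wordnet.
                if src_words and tgt_words:
                    local.append((sorted(src_words), relation, sorted(tgt_words)))
        return local


repo = CentralRepository()
repo.add_synset("c1", "en", ["dog"])
repo.add_synset("c1", "es", ["perro"])
repo.add_synset("c1", "eu", ["txakur"])
repo.add_synset("c2", "en", ["bark"])
repo.add_synset("c2", "es", ["ladrar"])

# A relation learned from English text is uploaded once...
repo.upload("c1", "related_to", "c2")

# ...and becomes available to every wordnet that lexicalises both concepts.
print(repo.port("es"))
print(repo.port("eu"))
```

Storing relations against concept identifiers rather than words is what would let knowledge learned in one language be reused in the others; in this sketch the Basque wordnet receives nothing because it has no entry for the second concept.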

Coordinator

UNIVERSITAT POLITECNICA DE CATALUNYA
Address
Jordi Girona 31
08034 Barcelona
Spain

Participants (3)

ISTITUTO TRENTINO DI CULTURA
Italy
Address
Via Santa Croce 77
38100 Trento

THE UNIVERSITY OF SUSSEX
United Kingdom
Address
Sussex House Falmer
BN1 9RH Falmer, Brighton, East Sussex

UNIVERSIDAD DEL PAIS VASCO/EUSKAL HERRIKO UNIBERTSITATEA
Spain
Address
Rectorado de UPV-EHU, Campus de Leioa, Barrio Sarriena s/n
48940 Leioa/Bizkaia