The Automatic Generation of Lexical Databases Analogous to WordNet

Final Report Summary - AUTOWORDNET (The Automatic Generation of Lexical Databases Analogous to WordNet)

WordNet is a large lexical database of English where nouns, verbs, adjectives and adverbs are grouped into sets of synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. WordNet has turned out to be an indispensable resource in the processing of natural language, and based on its model similar lexical databases have been created for many other languages.

However, creating such databases takes many years of work and is very costly. To investigate alternatives, the AutoWordNet project was conducted with the aim of automatically building a resource as similar as possible to WordNet, using unsupervised methods applied to raw text. The methodology is therefore largely language-independent; it was applied to four European languages, namely English, French, German, and Spanish.

To achieve the objective of generating a WordNet-like resource, three main steps were carried out, which yielded the following results:

1) Computing word similarities:
Starting from a large corpus of the respective language, three vector space approaches for computing semantically related words were implemented and compared. The first used raw corpora, the second syntactically annotated corpora, and the third singular value decomposition for dimensionality reduction of the semantic space. For evaluation, we used the standard TOEFL synonym dataset, the Princeton evocation dataset, and a newly created large dataset specifically designed for this purpose.
The evaluation showed that the method using singular value decomposition performed best and came close in performance to similarity judgements obtained from native speakers. Using this method, large thesauri of related words, each comprising about 60,000 words, were generated for all four project languages (English, French, German, and Spanish).
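The SVD-based approach described above can be sketched in a few lines. The toy corpus, window size, and dimensionality below are illustrative assumptions, not the project's actual parameters:

```python
# Sketch of the SVD-based word similarity approach: build a word-word
# co-occurrence matrix from a (toy, invented) corpus, reduce it with
# truncated SVD, and compare words by cosine similarity.
import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +/-2 word window
# (an illustrative choice of window size).
window = 2
M = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            M[idx[w], idx[corpus[j]]] += 1

# Truncated SVD: rows of U * S (first k columns) are dense word vectors.
U, S, Vt = np.linalg.svd(M)
k = 3
vectors = U[:, :k] * S[:k]

def similarity(w1, w2):
    v1, v2 = vectors[idx[w1]], vectors[idx[w2]]
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# "cat" and "dog" occur in parallel contexts, so their reduced vectors
# should come out similar.
print(similarity("cat", "dog"))
```

At project scale the matrix is sparse and very large, so an incremental or sparse SVD would replace the dense `np.linalg.svd` used here.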

2) Inducing word senses:
To identify each word’s senses, an algorithm for unsupervised word sense induction was developed which clusters local context vectors. This method was compared to a number of alternatives, e.g. maximizing sense descriptor dissimilarity (based on global co-occurrence vectors, i.e. vectors averaged over a full corpus) or capturing higher-order dependencies using independent component analysis. The comparison showed that clustering local context vectors led to the best results, so this method was used to automatically discover the senses of the words in the thesauri.
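The idea of inducing senses by clustering local context vectors can be illustrated as follows. The context vectors and the plain k-means routine are stand-ins invented for this sketch; the project's actual representation and clustering algorithm may differ:

```python
# Rough sketch of word sense induction: each occurrence of an ambiguous
# word (think "bank") is represented by a local context vector, and the
# occurrences are clustered into induced senses. All data is synthetic.
import numpy as np

rng = np.random.default_rng(0)

# Two invented groups of 2-d context vectors: "money" contexts and
# "river" contexts of the same ambiguous word.
money_contexts = rng.normal(loc=[5.0, 0.0], scale=0.5, size=(10, 2))
river_contexts = rng.normal(loc=[0.0, 5.0], scale=0.5, size=(10, 2))
X = np.vstack([money_contexts, river_contexts])

def kmeans(X, k, iters=20):
    # For this illustration, seed one centroid in each region; real
    # k-means would use random restarts or k-means++ initialisation.
    centroids = X[[0, -1]].copy()
    for _ in range(iters):
        # Assign each context vector to its nearest centroid ...
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # ... then move each centroid to the mean of its members.
        for c in range(k):
            if (labels == c).any():
                centroids[c] = X[labels == c].mean(axis=0)
    return labels

labels = kmeans(X, k=2)
# Occurrences from the same context group end up in the same cluster,
# i.e. each induced cluster corresponds to one sense.
print(labels)
```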

3) Discovering conceptual relations between words:
To identify the conceptual relations between words, the methodology for computing relational similarities introduced by Peter Turney was replicated and further refined. Essentially, a matrix of word pairs (rows) and their connecting words (columns) was built, and the word pairs were clustered. This led to clusters of word pairs sharing the same conceptual relation. The finding was that, although the clusters typically make sense, the automatically induced relations cannot be expected to match the set of relations known from WordNet.
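The pair-pattern matrix underlying this step, and the similarity structure that the clustering of word pairs exploits, can be sketched as follows. The word pairs, patterns, and counts below are invented toy data, not results from the project:

```python
# Sketch of a Turney-style pair-pattern matrix: rows are word pairs,
# columns are the connecting patterns that join them in text, and the
# cells hold (here: invented) co-occurrence counts. Word pairs with
# similar rows share a conceptual relation and end up clustered together.
import numpy as np

pairs = [("mason", "stone"), ("carpenter", "wood"),
         ("dog", "kennel"), ("bird", "nest")]
patterns = ["X works with Y", "X cuts Y", "X lives in Y", "X builds Y"]

# Toy counts: the first two pairs share a worker/material relation,
# the last two an animal/home relation.
M = np.array([
    [8, 5, 0, 1],   # mason / stone
    [7, 6, 0, 1],   # carpenter / wood
    [0, 0, 9, 2],   # dog / kennel
    [0, 0, 8, 3],   # bird / nest
], dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pairs with the same relation have nearly parallel pattern profiles;
# pairs with different relations have nearly orthogonal ones. Any
# standard clustering of the rows would therefore group them by relation.
same_relation = cosine(M[0], M[1])
diff_relation = cosine(M[0], M[2])
print(same_relation, diff_relation)
```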

The outcome of these activities has been published and disseminated as follows: The major resources, among them the above-mentioned large thesauri of related words for the four project languages, as well as the synonym test set for evaluation, were made available on the project website free of charge. The scientific results were published in 17 papers, with several more papers upcoming. 15 scientific presentations were given at conferences and on invitation, and a similar number of conferences and workshops were attended. Progress in the field was stimulated by co-editing three contributed volumes and by co-organizing seven thematically relevant international workshops which were co-located with major conferences (ACL, COLING, EACL, LREC, and TALN) and which brought together researchers interested in the field. One of the workshops included a shared task on multi-stimulus association, whose results are important for a better understanding of WordNet's concept of a "synonym set" (synset) when it is expressed in terms of corpus statistics rather than based on human intuitions. An ongoing activity is the co-editing of a special issue of the Journal of Natural Language Engineering on the topic of Machine Translation Using Comparable Corpora.

To promote wide dissemination, most of the papers and proceedings are open access publications, with many of them accessible via the Anthology of the Association for Computational Linguistics (ACL Anthology). To also encourage open access book publications in our field, together with two colleagues and in a framework provided by "Language Science Press" the open access book series "Translation and Multilingual Natural Language Processing" was initiated.

Concerning the impact of the project, there are two major outcomes: One is the fostering of cooperation among researchers, as mentioned above. Their contributions led to about 2000 pages of peer-reviewed scientific publications in books and proceedings. The other outcome is the scientific value of the project itself. Maciej Piasecki and colleagues have stated that "A wordnet – a rich repository of knowledge about words – is a key element of … language processing" and that "A language without a wordnet is at a severe disadvantage." However, despite considerable efforts, wordnets are not yet available for most languages, including many EU languages. This project has implemented a methodology for automatically generating resources replicating some aspects of WordNet for any language for which sufficiently large monolingual text corpora are available. This methodology has been applied to four European languages (English, French, German and Spanish). Thus, on the one hand, useful lexical resources of practical relevance have been produced. On the other hand, a manually created WordNet can be considered a collection of human intuitions about language. By replicating such intuitions automatically through corpus analysis, conclusions about human lexical acquisition can be drawn, in particular concerning word meaning, word senses, and conceptual relations among words. Thus the project gave a deeper insight into some aspects of human cognition.

Although the proposed methodology is unlikely to completely replace current manual techniques of compiling lexical databases in the near future, it should be useful for efficiently aggregating relevant information for subsequent human inspection, thereby making the manual work more efficient and less costly. This is of particular importance as the suggested methods should in principle be applicable to all languages, so that the potential savings multiply.

URL of project website: http://www.ftsk.uni-mainz.de/user/rapp/autowordnet/