
Multilingual Lexicon Extraction from Comparable Corpora

Periodic Report Summary 2 - MULTILEX (Multilingual Lexicon Extraction from Comparable Corpora)

Given large collections of parallel (i.e. translated) texts, there are well-known techniques to establish links between corresponding words across languages, thus extracting bilingual dictionaries from parallel corpora. This is done by successively applying a sentence-alignment and a word-alignment step. However, parallel texts are a scarce resource for most language pairs involving lesser-used languages. On the other hand, human second language acquisition does not seem to require exposure to large amounts of translated text, which indicates that there must be another way of crossing the language barrier. It appears that these human capabilities are based on exploiting comparable resources, i.e. texts or speech on related topics in different languages which are not translations of each other. Comparable (written or spoken) corpora are far more common than parallel corpora, thus offering the chance to overcome the data acquisition bottleneck.

Despite its cognitive motivation, the MULTILEX project does not attempt to simulate the complexities of human second language acquisition; instead, it aims to show that purely technical means suffice to automatically extract information on word and multiword translations from comparable corpora. The project pushes the boundaries of current approaches, which typically exploit correlations between co-occurrence patterns across languages, in several ways:

1) Eliminating the need for initial lexicons by using a bootstrapping approach which only requires a few seed translations.

2) Implementing a new methodology which first establishes alignments between comparable documents across languages, and then computes cross-lingual alignments between words and multiword-units.

3) Improving the quality of computed word translations by applying an interlingua approach, which, by relying on several pivot languages, allows a highly effective multi-dimensional cross-check.

4) Showing that, by looking at foreign citations, language translations can even be derived from a single monolingual text corpus.
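The co-occurrence-based approaches referred to above typically compare context vectors across languages: a source word's co-occurrence vector is mapped into the target language via a (possibly very small) seed lexicon and compared to the context vectors of candidate target words. A minimal sketch with toy data (the corpora, words, and seed lexicon below are illustrative, not the project's actual data or implementation):

```python
from collections import Counter

def context_vectors(corpus, window=2):
    """Count, for each word, its co-occurrences within a symmetric window."""
    vecs = {}
    for sent in corpus:
        for i, w in enumerate(sent):
            ctx = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
            vecs.setdefault(w, Counter()).update(ctx)
    return vecs

def translate_vector(vec, seed_lexicon):
    """Map a source-language context vector into target-language space
    using the seed lexicon; context words without a seed entry are dropped."""
    mapped = Counter()
    for word, count in vec.items():
        if word in seed_lexicon:
            mapped[seed_lexicon[word]] += count
    return mapped

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def best_translation(word, src_vecs, tgt_vecs, seed_lexicon):
    """Pick the target word whose context vector is most similar to the
    mapped context vector of the source word."""
    mapped = translate_vector(src_vecs[word], seed_lexicon)
    return max(tgt_vecs, key=lambda t: cosine(mapped, tgt_vecs[t]))
```

In a real system, raw counts would be replaced by an association measure, and in the bootstrapping setting of point 1 the seed lexicon would be grown iteratively from the highest-confidence translations found.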

In the reporting period (second half of the project), the following major subtasks were conducted:

A system for unsupervised word sense induction and disambiguation was implemented and compared to several existing systems. An optimized system was applied to subsets of the respective Wikipedia editions of all five project languages, namely Dutch, English, French, German, and Spanish. Subsequently, the algorithm for lexicon extraction from comparable corpora (as developed in the first reporting period) was applied to pairs of the disambiguated corpora, thus automatically identifying translations of word senses.

It has been shown that the extraction of multiword translations can be conducted using the same algorithm as for single words; the only additional requirement is that the multiword units be identified in a pre-processing step. For this purpose, we developed a tool based on the observation that multiword units typically exhibit characteristic patterns of part-of-speech tag sequences. The necessary part-of-speech tagging was conducted using TreeTagger, which is applicable to many languages.
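The pre-processing step described above can be sketched as matching the tag sequences of n-grams against a small pattern inventory. The tag names and patterns below are illustrative only; a real inventory depends on the tagset produced by TreeTagger for each language:

```python
# Illustrative tag-sequence patterns for multiword unit candidates.
MWU_TAG_PATTERNS = {
    ("ADJ", "NOUN"),
    ("NOUN", "NOUN"),
    ("NOUN", "PREP", "NOUN"),
}

def extract_mwus(tagged_sentence):
    """tagged_sentence: list of (word, tag) pairs.
    Return every n-gram whose tag sequence matches a known pattern."""
    found = []
    n = len(tagged_sentence)
    for length in sorted({len(p) for p in MWU_TAG_PATTERNS}):
        for i in range(n - length + 1):
            gram = tagged_sentence[i:i + length]
            if tuple(tag for _, tag in gram) in MWU_TAG_PATTERNS:
                found.append(" ".join(word for word, _ in gram))
    return found
```

Candidates found this way would then be fed into the same extraction algorithm as single words, typically after filtering by frequency.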

To evaluate the results for single words and for multiword units, we primarily used word equations as derived from word-aligned parallel corpora (whereby word alignment was conducted using standard Moses procedures) as well as word equations from Wikipedia interlanguage links.
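Given such gold word equations, evaluation reduces to checking each proposed translation against the set of accepted translations for its source word. A minimal sketch (the function name and data shapes are illustrative, not the project's actual evaluation code):

```python
def precision_at_1(proposed, gold):
    """proposed: dict mapping a source word to its top predicted translation.
    gold: dict mapping a source word to the set of accepted translations
    (e.g. derived from word-aligned parallel text or interlanguage links).
    Only source words present in the gold standard are evaluated."""
    evaluated = [w for w in proposed if w in gold]
    correct = sum(1 for w in evaluated if proposed[w] in gold[w])
    return correct / len(evaluated) if evaluated else 0.0
```
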

We also tested our algorithm for lexicon extraction on various text types, such as the Gigaword corpora from the Linguistic Data Consortium (newswire texts) and the Wacky corpora from the Web-as-a-Corpus initiative. Based on these very large monolingual corpora, we also implemented a method for extracting word translations by looking at citations of foreign words. The idea is that a foreign word's strongest associations (its most frequently co-occurring words) are likely to be its translations. This hypothesis could be confirmed, provided that a sufficient number of citations of the foreign word occur in the underlying corpus.
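The citation-based idea can be sketched as collecting the context words around each occurrence of a foreign word in the monolingual corpus and ranking them by co-occurrence frequency. In a real system an association measure (e.g. log-likelihood) would replace raw counts; the corpus, word list, and parameters below are illustrative:

```python
from collections import Counter

def citation_translations(corpus, foreign_words, window=5, top_k=3):
    """For each foreign word cited in a monolingual corpus, rank the
    co-occurring native words as translation candidates.
    corpus: list of tokenized sentences; foreign_words: set of citations."""
    assoc = {w: Counter() for w in foreign_words}
    for sent in corpus:
        for i, w in enumerate(sent):
            if w in assoc:
                ctx = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
                # Count only native context words as candidates.
                assoc[w].update(c for c in ctx if c not in foreign_words)
    return {w: [t for t, _ in cnt.most_common(top_k)]
            for w, cnt in assoc.items()}
```
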

Beyond the core project goals, we were also active in practical applications such as machine translation post-editing and its evaluation. Students were intensively involved in this work and contributed within the framework of teaching projects.

Dissemination of the scientific results was achieved through several publications. In the reporting period, two editions of the workshop series on “Building and Using Comparable Corpora” were co-organized at ACL 2017 in Vancouver (Canada) and at LREC 2018 in Miyazaki (Japan). In both editions, a shared task on “Identifying Parallel Sentences in Comparable Corpora” was conducted. The workshop proceedings are available online as open-access publications. An edition of another workshop series, namely the “Workshop on Hybrid Approaches to Translation”, was co-organized at the International Conference on Computational Linguistics (COLING 2016) in Osaka (Japan). All proceedings were published. In addition to a number of contributed presentations at international conferences, an invited plenary speech was given at the Conference on “Contemporary Issues in Data Science” (CIDAS) in Zanjan, Iran, along with an interview with an Iranian TV station.

The fellow has also accepted a number of invitations to Programme Committees of conferences and workshops, and to the scientific board of the journal Computer Speech and Language. He also supervised several theses and acted as an external reviewer for the PhD thesis of Othman Zennaki (Université Grenoble Alpes) on the topic of automatic creation of linguistic tools and resources from parallel corpora.