Community Research and Development Information Service - CORDIS

Periodic Report Summary 1 - MULTILEX (Multilingual Lexicon Extraction from Comparable Corpora)

Given large collections of parallel (i.e. translated) texts, establishing links between corresponding words across languages is a well-known technique for extracting bilingual dictionaries. This is done by applying a sentence-alignment step followed by a word-alignment step. However, parallel texts are a scarce resource for most language pairs involving lesser-used languages. Human second language acquisition, on the other hand, does not seem to require exposure to large amounts of translated text, which indicates that there must be another way of crossing the language barrier. Human capabilities appear to be based on comparable resources, i.e. texts or speech on related topics in different languages which are not translations of each other. Comparable (written or spoken) corpora are far more common than parallel corpora and thus offer a chance to overcome the data acquisition bottleneck.

Despite this cognitive motivation, the MULTILEX project does not attempt to simulate the complexities of human second language acquisition. Instead, it tries to show that information on word and multiword translations can be extracted automatically from comparable corpora by purely technical means. The aim is to push the boundaries of current approaches, which typically utilize correlations between co-occurrence patterns across languages, in several ways:
1) Eliminating the need for initial lexicons by using a bootstrapping approach which only requires a few seed translations.
2) Implementing a new methodology which first establishes alignments between comparable documents across languages, and then computes cross-lingual alignments between words and multiword-units.
3) Improving the quality of computed word translations by applying an interlingua approach, which, by relying on several pivot languages, allows a highly effective multi-dimensional cross-check.
4) Showing that, by looking at foreign-language citations, word translations can even be derived from a single monolingual text corpus.
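
The co-occurrence-based baseline that these aims build on can be illustrated with a minimal sketch. It is not the MULTILEX implementation: the toy corpora, the seed lexicon, the window size and the function names (cooccurrence_vectors, cosine, rank_translations) are all assumptions made for illustration. The idea is that a word's context vector is projected into the target language through the seed translations and then compared with the context vectors of all target words.

```python
from collections import Counter
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_translations(word, src_vectors, tgt_vectors, seed):
    """Rank target words by similarity to the source word's projected context vector."""
    projected = Counter()
    for context_word, count in src_vectors.get(word, {}).items():
        if context_word in seed:  # only seed words can cross the language barrier
            projected[seed[context_word]] += count
    known_targets = set(seed.values())
    scored = [(cosine(projected, vec), tgt)
              for tgt, vec in tgt_vectors.items() if tgt not in known_targets]
    return sorted(scored, reverse=True)


# Hypothetical toy corpora: related topics, not sentence-by-sentence translations.
english = [["our", "dog", "barks", "at", "night"], ["a", "dog", "barks"],
           ["the", "cat", "drinks", "milk"], ["every", "cat", "drinks", "warm", "milk"]]
german = [["der", "hund", "bellt", "laut"], ["ein", "hund", "bellt"],
          ["die", "katze", "trinkt", "milch"], ["eine", "katze", "trinkt", "milch"]]
# Hypothetical seed lexicon with three entries.
seed = {"barks": "bellt", "drinks": "trinkt", "milk": "milch"}

en_vectors = cooccurrence_vectors(english)
de_vectors = cooccurrence_vectors(german)
best_for_dog = rank_translations("dog", en_vectors, de_vectors, seed)[0][1]
best_for_cat = rank_translations("cat", en_vectors, de_vectors, seed)[0][1]
print(best_for_dog, best_for_cat)
```

On realistic data the raw counts would normally be replaced by an association measure, and the ranking is only as good as the coverage of the seed lexicon, which is why reducing the seed requirement is listed as an aim.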

In the reporting period (first half of the project), the following subtasks were implemented:

1) In order to have gold standard data for system evaluation, test sets of 1000 word equations (word translation pairs) each were developed for eight language pairs involving the languages Chinese, Dutch, English, French, German, Portuguese, Russian, Spanish and Ukrainian. Analogously, lists of multiword-units and their translations were also compiled. As our main resource we used Wikipedia, where word correspondences can be identified via the interlanguage links created by the Wikipedia authors.

2) A bootstrapping approach for bilingual lexicon extraction was implemented which reduces the need for seed dictionaries. This approach is based on the generation of multi-stimulus-associations and the identification of correspondences across languages.

3) Another approach for bilingual lexicon extraction is based on aligned comparable documents rather than on comparable corpora. We implemented a system which does not require a seed lexicon, but instead requires aligned comparable documents, i.e. pairs of documents in different languages. In each of these documents the most salient keywords are identified. Word alignment tools similar to those used for parallel corpus word alignment can then be successfully applied to the resulting pairs of keyword lists.
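
How pairs of keyword lists can yield word correspondences without a seed lexicon can be sketched as follows. This is an illustration under stated assumptions, not the project's system: keyword extraction is taken as given, the keyword lists are invented toy data, and a simple Dice coefficient over document pairs stands in for a full word-alignment tool. The function name dice_alignments is hypothetical.

```python
from collections import Counter
from itertools import product


def dice_alignments(keyword_pairs):
    """For each source keyword, pick the target keyword with the highest Dice score.

    keyword_pairs: list of (source_keywords, target_keywords), one entry per
    aligned document pair.
    """
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src_keywords, tgt_keywords in keyword_pairs:
        src_set, tgt_set = set(src_keywords), set(tgt_keywords)
        src_freq.update(src_set)                 # document pairs containing each source word
        tgt_freq.update(tgt_set)                 # document pairs containing each target word
        joint.update(product(src_set, tgt_set))  # document pairs containing both words
    best = {}
    for (src, tgt), together in joint.items():
        score = 2 * together / (src_freq[src] + tgt_freq[tgt])
        if score > best.get(src, ("", 0.0))[1]:
            best[src] = (tgt, score)
    return best


# Hypothetical keyword lists for five aligned English/German document pairs.
keyword_pairs = [
    (["dog", "bark"], ["hund", "bellen"]),
    (["dog", "cat"], ["hund", "katze"]),
    (["cat", "milk"], ["katze", "milch"]),
    (["bark", "tree"], ["bellen", "baum"]),
    (["milk", "tree"], ["milch", "baum"]),
]

alignments = dice_alignments(keyword_pairs)
for source_word, (target_word, score) in sorted(alignments.items()):
    print(source_word, target_word, round(score, 2))
```

The score rewards word pairs that appear in the same document pairs and rarely elsewhere, which is the same intuition that statistical word alignment applies to sentence pairs in parallel corpora.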

4) Dictionary translations can be determined not only directly, but also indirectly via a pivot language. If several pivot languages are utilized, potential translations can be confirmed through mutual cross-validation. In our implementation, we use three pivot languages and select the best translations through voting, i.e. translations that are proposed through more than one pivot language are preferred.
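
The voting scheme can be sketched in a few lines. The dictionaries below are invented toy data, and the function name pivot_vote and the choice of German as source, French as target and English, Spanish and Dutch as pivots are assumptions for illustration only. Each pivot language contributes at most one vote per candidate translation.

```python
from collections import Counter


def pivot_vote(word, src_to_pivot, pivot_to_tgt):
    """Count how many pivot languages propose each target-language translation.

    src_to_pivot and pivot_to_tgt map a pivot language name to a dictionary
    of the form {word: [translations]}.
    """
    votes = Counter()
    for pivot, src_dict in src_to_pivot.items():
        candidates = set()
        for pivot_word in src_dict.get(word, []):
            candidates.update(pivot_to_tgt[pivot].get(pivot_word, []))
        votes.update(candidates)  # a set, so each pivot votes once per candidate
    return votes.most_common()


# Hypothetical dictionaries: German -> pivot and pivot -> French.
german_to_pivot = {
    "english": {"Schloss": ["castle", "lock"]},
    "spanish": {"Schloss": ["castillo"]},
    "dutch": {"Schloss": ["kasteel"]},
}
pivot_to_french = {
    "english": {"castle": ["château"], "lock": ["serrure"]},
    "spanish": {"castillo": ["château"]},
    "dutch": {"kasteel": ["château"]},
}

ranking = pivot_vote("Schloss", german_to_pivot, pivot_to_french)
print(ranking)
```

In this toy case "château" is proposed via all three pivots while "serrure" is proposed via one, so the vote prefers the reading that the pivots agree on. A real system would still need a policy for ties and for words that are genuinely ambiguous.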

The scientific results were disseminated through several conference and journal articles. In addition, a special issue of the “Journal of Natural Language Engineering” (Cambridge University Press) on the topic of “Machine Translation Using Comparable Corpora” was co-edited and published, as was a book on “Hybrid Approaches to Machine Translation” (Springer). Two editions of the workshop series on “Building and Using Comparable Corpora” were co-organized, at ACL 2015 in Beijing and at LREC 2016 in Portoroz (Slovenia). The 2015 edition also included a shared task on “Identifying Parallel Sentences in Comparable Corpora”. Two editions of another workshop series, the “Workshop on Hybrid Approaches to Translation”, were co-organized as well, one at ACL 2015 in Beijing and the other at EAMT 2016 in Riga (Latvia). All proceedings were published in the ACL Anthology.

Contact

Sascha Hofmann (Administrative Managing Director)
Tel.: +49 7274 508 35111
Fax: +49 7274 508 35412

Subjects

Life Sciences
Record Number: 194260 / Last updated on: 2017-02-14