CORDIS - EU research results

Statistical Machine Translation Using Monolingual Corpora

Final Report Summary - MONOTRANS (Statistical machine translation using monolingual corpora)

As evidenced by a number of machine translation (MT) competitions, statistical MT produces encouraging results for language pairs where large corpora of previously translated texts are available for training. In practice, however, the availability of such data is often a severe bottleneck. In the MONOTRANS project, we therefore implemented an MT system that requires only a bilingual dictionary and monolingual text corpora, which considerably eases the data acquisition problem. We realised a two-step procedure. In the first step, we create a database of translation equivalents by extracting them from a pair of comparable monolingual corpora. In the second step, we translate new sentences by retrieving appropriate translation equivalents from the database and merging them using a combinatorial approach.

In the following, we give an overview of the various steps of the project.

Corpus acquisition: For our three project languages (English, German, and Spanish), we acquired the very large (> 1 billion words) Gigaword corpora from the Linguistic Data Consortium and the WaCky corpora from the Web-as-Corpus initiative. We also acquired the Google 1T 5-gram data, which provide the frequencies of word sequences of lengths 1 to 5 in web documents comprising a trillion words.

Acquisition of machine-readable dictionaries: To be able to make our results freely available, we decided not to use copyright-restricted commercial dictionaries for our research. Instead, we automatically generated fairly comprehensive dictionaries from the parallel Europarl corpus using the Moses toolkit. It should be noted, however, that our MT system would work equally well with any other word-form-oriented dictionary.

Generating thesauri of related words: Replacing words of the source or the target language by synonyms can lead to better translations. We implemented an algorithm capable of computing synonyms and related words by analysing the distributions of words in corpora. For implementation, we used an adapted version of latent semantic analysis. Large thesauri have been generated for all project languages.
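
To illustrate the idea of deriving related words from the distributions of words in corpora, the following is a minimal sketch. It does not reproduce the adapted latent semantic analysis used in the project; it substitutes plain co-occurrence counts compared by cosine similarity. The function names, the window size, and the toy sentences are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][sentence[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def related_words(word, vectors, top_n=3):
    """Rank all other words by similarity of their context vectors (assumed scoring)."""
    target = vectors.get(word, Counter())
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy corpus (hypothetical): "car" and "automobile" share identical contexts.
toy_sentences = [
    ["the", "car", "drives", "fast"],
    ["the", "automobile", "drives", "fast"],
    ["the", "banana", "tastes", "sweet"],
]
toy_vectors = cooccurrence_vectors(toy_sentences)
top_related = related_words("car", toy_vectors)
```

On this toy corpus, "automobile" ranks first for "car" because both words occur in the same contexts. A real thesaurus would need far larger corpora and some form of dimensionality reduction, as in the LSA variant mentioned above.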

Automatic expansion of dictionaries using comparable corpora: To make our dictionaries more comprehensive, we implemented a methodology for deriving terminology translations from comparable corpora. We used a technique utilising the similarity of co-occurrence patterns in different languages. By using small base dictionaries for bridging between languages, our procedure extends the methodology for constructing thesauri of related words to the multilingual case. We were able to obtain state-of-the-art results for all project languages.
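
The bridging idea described above can be sketched as follows, under simplifying assumptions: context vectors are sentence-level co-occurrence counts, the source vector is mapped into the target vocabulary through a small seed dictionary, and candidates are ranked by cosine similarity. All names and the toy data are hypothetical, and the project's actual similarity measure and weighting are not specified in this summary.

```python
from collections import Counter
from math import sqrt


def context_vector(word, sentences):
    """Count the words that co-occur with `word` in the same sentence."""
    vec = Counter()
    for sentence in sentences:
        if word in sentence:
            vec.update(w for w in sentence if w != word)
    return vec


def map_vector(vec, seed_dict):
    """Translate vector dimensions via the seed dictionary; drop unknown words."""
    mapped = Counter()
    for w, count in vec.items():
        if w in seed_dict:
            mapped[seed_dict[w]] += count
    return mapped


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def translation_candidates(word, src_sentences, tgt_sentences, seed_dict):
    """Rank target words by how well their contexts match the mapped source context."""
    src_vec = map_vector(context_vector(word, src_sentences), seed_dict)
    tgt_vocab = {w for sentence in tgt_sentences for w in sentence}
    scored = [(t, cosine(src_vec, context_vector(t, tgt_sentences))) for t in tgt_vocab]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy comparable corpora (hypothetical); "arzt" is missing from the seed dictionary.
german = [["arzt", "heilt", "patienten"], ["arzt", "arbeitet", "krankenhaus"]]
english = [["doctor", "heals", "patient"], ["doctor", "works", "hospital"],
           ["teacher", "works", "school"]]
seed = {"heilt": "heals", "patienten": "patient",
        "arbeitet": "works", "krankenhaus": "hospital"}
candidates = translation_candidates("arzt", german, english, seed)
```

Here "doctor" is ranked first for "arzt", since its English contexts coincide with the translated German contexts. The corpora need not be translations of each other; they only need to cover similar topics.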

Database of translation equivalents: For the three language pairs German-English, Spanish-English, and Spanish-German, dictionaries of translation equivalents were compiled. For this purpose, source language word sequences were translated word by word, generating all possible permutations with regard to word ambiguity and (optionally) word order. Each permutation was looked up in a database of target language n-grams, which also provides n-gram frequencies. The most frequent n-gram is considered to be the best translation.
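
The lookup procedure just described can be sketched as follows. The dictionary and the n-gram frequency table are tiny hypothetical stand-ins for the project's resources, and the function name and the pass-through of unknown words are assumptions.

```python
from itertools import permutations, product


def best_equivalent(source_words, dictionary, ngram_freq, reorder=True):
    """Return the most frequent target n-gram among all word-by-word translations.

    dictionary: source word -> list of possible target words
    ngram_freq: target n-gram (tuple of words) -> corpus frequency
    reorder:    if True, also try every word order of each candidate
    """
    # Unknown source words are passed through unchanged (an assumption).
    options = [dictionary.get(word, [word]) for word in source_words]
    best, best_freq = None, 0
    for choice in product(*options):          # all combinations of word senses
        orders = permutations(choice) if reorder else [choice]
        for candidate in orders:              # optionally all word orders
            freq = ngram_freq.get(candidate, 0)
            if freq > best_freq:
                best, best_freq = candidate, freq
    return best  # None if no candidate occurs in the n-gram database


# Hypothetical Spanish-English toy data.
toy_dictionary = {"casa": ["house", "home"], "roja": ["red"]}
toy_ngram_freq = {("red", "house"): 500, ("house", "red"): 4, ("red", "home"): 20}

with_reordering = best_equivalent(["casa", "roja"], toy_dictionary, toy_ngram_freq)
without_reordering = best_equivalent(["casa", "roja"], toy_dictionary,
                                     toy_ngram_freq, reorder=False)
```

With reordering enabled, the frequency table selects ("red", "house"), which resolves both the lexical ambiguity of "casa" and the adjective position. Note that the number of candidates grows quickly with sequence length, since every sense combination is multiplied by every word order.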

Translation engine: The translation engine is based on the methodology for identifying translation equivalents. It segments each source sentence into sequences of n-grams, identifies each n-gram's translation equivalent, and generates a full sentence translation by combining the translation equivalents. Whereas most other statistical MT systems use n-grams of length 3, our system can deal with variable n-gram length up to 5, backing off to shorter n-grams if data is sparse.
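
The segmentation with back-off to shorter n-grams might look roughly like the sketch below. It is a simplification: segments are chosen greedily from left to right and their equivalents are simply concatenated, whereas the project merges equivalents using a combinatorial approach whose details are not given here. The `equivalents` table, the function name, and the copying of unknown words are assumptions.

```python
def translate(sentence, equivalents, max_n=5):
    """Greedy longest-match translation with back-off from max_n-grams to unigrams.

    equivalents: source n-gram (tuple of words) -> target n-gram (tuple of words)
    """
    words = sentence.split()
    output, i = [], 0
    while i < len(words):
        # Try the longest n-gram first, then back off to shorter ones.
        for n in range(min(max_n, len(words) - i), 0, -1):
            chunk = tuple(words[i:i + n])
            if chunk in equivalents:
                output.extend(equivalents[chunk])
                i += n
                break
        else:
            # No equivalent of any length: copy the word unchanged (assumption).
            output.append(words[i])
            i += 1
    return " ".join(output)


# Hypothetical German-English equivalents database.
toy_equivalents = {
    ("das", "rote", "haus"): ("the", "red", "house"),
    ("das",): ("the",),
    ("ist",): ("is",),
    ("sehr", "alt"): ("very", "old"),
}
translated = translate("das rote haus ist sehr alt", toy_equivalents)
```

In this example the trigram "das rote haus" is matched as a unit, "ist" is found only after backing off to a unigram, and "sehr alt" is matched as a bigram.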

Evaluation: For each language pair, sample texts were translated and standard BLEU evaluation scores were computed. As an alternative, we also developed a new methodology for MT evaluation, called the backtranslation score, which is based on comparing an original text with its round-trip translation.
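
The round-trip comparison could be sketched as below. The exact definition of the backtranslation score is not given in this summary, so the sketch assumes one possible measure: the average clipped n-gram precision of the round-trip translation against the original. This is neither full BLEU nor necessarily the project's formula, and the `forward` and `backward` translator functions are hypothetical placeholders.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(reference, candidate, n):
    """Share of candidate n-grams also found in the reference (clipped counts)."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    ref = ngrams(reference, n)
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())


def backtranslation_score(text, forward, backward, max_n=2):
    """Translate forward, translate back, and compare with the original text."""
    roundtrip = backward(forward(text))
    ref, cand = text.split(), roundtrip.split()
    return sum(ngram_precision(ref, cand, n) for n in range(1, max_n + 1)) / max_n


# Stand-in "translators" for demonstration only.
perfect = backtranslation_score("the red house", str.upper, str.lower)
lossy = backtranslation_score("the red house", lambda s: s,
                              lambda s: s.replace("red", "crimson"))
```

A lossless round trip scores 1.0, while the lossy one scores lower because the substituted word breaks both a unigram and the surrounding bigrams. An appeal of a score of this kind is that it needs no reference translations, only the source text and a system for each direction.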