CORDIS - EU research results
Content archived on 2024-05-29

A Combinatorial Approach to Machine Translation

Final Activity Report Summary - COMTRANS (A combinatorial approach to machine translation)

In the COMTRANS project, a machine translation system for three European language pairs was designed and implemented. It realises a novel combinatorial approach to statistical machine translation and is based on machine learning, i.e. the system learns how to translate by being trained on large quantities of previously translated texts. Many resources constructed during the project have been made publicly available for download, among them the translation engine and automatically generated sample dictionaries for the language pairs English - German, English - French, and German - French. The underlying methodology is likewise applicable to other language pairs, provided the necessary training corpora are available.

The combinatorial approach was designed to overcome a number of well-known shortcomings of conventional machine translation systems. Conventional systems usually rely on dictionaries that are in essence mappings between individual words of the source language and the target language. The lexicon accounts only to a limited extent for the criteria needed to disambiguate ambiguous words and to handle differences in word order between the two languages. Instead, these important issues are dealt with in the translation engines. Because the engines tend to be compact and (even with data-oriented approaches) do not appropriately reflect the complexity of the problem, this approach generally does not account for the more fine-grained facets of word behaviour. This leads to wrong generalisations and, as a consequence, translation quality tends to be poor.

In the COMTRANS project, this problem is approached by using a new type of lexicon that is based not on individual words but on pairs of words. For each pair of adjacent words in the source language, the lexicon lists the possible translations in the target language together with information on the order and distance of the target words. The process of machine translation is then seen as a combinatorial problem: for all word pairs in a source sentence, all possible translations are retrieved from the lexicon, and those translations that lead to contradictions when constructing the target sentence are discarded. This process implicitly leads to word sense disambiguation and to language-specific reordering of words. As there are many more word pairs than individual words in a language, the information content of the new type of lexicon is considerably higher than that of conventional lexicons. On the one hand, this gives the potential for better translations. On the other hand, it is not realistically possible to construct such a lexicon manually. For this reason, an important part of the work was to develop a method capable of deriving such a lexicon automatically from previously translated texts.
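
The retrieve-and-discard idea described above can be illustrated with a minimal sketch. It is not the COMTRANS engine: the lexicon format, the function name `translate`, the toy English - French entries, and the two contradiction checks (a shared source word must receive the same translation in both pairs, and no two target words may occupy the same position) are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical toy lexicon, written by hand for this example.
# Key: two adjacent source words. Value: candidate translations as
# (target_word_1, target_word_2, offset), where offset is the position of
# target_word_2 minus the position of target_word_1
# (the sign encodes order, the magnitude encodes distance).
PAIR_LEXICON = {
    ("the", "central"): [("la", "centrale", 2), ("le", "central", 2)],
    ("central", "bank"): [("centrale", "banque", -1)],
}


def translate(source_words, lexicon):
    """Return every target sentence that can be built without contradiction."""
    pairs = list(zip(source_words, source_words[1:]))
    candidate_lists = [lexicon.get(pair, []) for pair in pairs]
    results = []
    # Try every combination of one candidate per source word pair.
    for combo in product(*candidate_lists):
        words = {}          # source index -> chosen target word
        positions = {0: 0}  # source index -> relative target position
        consistent = True
        for i, (first, second, offset) in enumerate(combo):
            # A source word shared by two pairs must be translated identically.
            if words.setdefault(i, first) != first:
                consistent = False
                break
            words[i + 1] = second
            positions[i + 1] = positions[i] + offset
        # Two target words cannot occupy the same position.
        if not consistent or len(set(positions.values())) != len(positions):
            continue
        order = sorted(words, key=positions.get)
        results.append(" ".join(words[i] for i in order))
    return results


print(translate(["the", "central", "bank"], PAIR_LEXICON))
```

In this toy case the candidate ("le", "central") is discarded because the second pair requires "centrale", which resolves the ambiguity, and the negative offset places "banque" before "centrale", which produces the French word order. Enumerating all combinations is exponential in sentence length, so a practical system would need a more efficient search.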

The combinatorial approach is related to phrase-based statistical machine translation. The major differences are that the combinatorial approach does not require the definition and explicit identification of phrases, that it allows for discontinuous word sequences (separated by wildcards), and that it has the potential to take into account a more comprehensive set of associations between words.
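
The summary does not say how discontinuous word sequences are represented. As one possible reading, the sketch below enumerates source word pairs separated by a bounded number of wildcard positions; the function name `word_pairs` and the `max_gap` parameter are hypothetical.

```python
def word_pairs(words, max_gap=1):
    """Yield (first, second, gap) for each ordered pair of words that has
    at most max_gap words (wildcards) between them."""
    for i, first in enumerate(words):
        # j runs over the positions that are at most max_gap + 1 steps ahead.
        for j in range(i + 1, min(i + max_gap + 2, len(words))):
            yield first, words[j], j - i - 1


for pair in word_pairs(["the", "central", "bank"], max_gap=1):
    print(pair)
```

A gap of 0 corresponds to adjacent words, while a gap of 1 corresponds to a pattern such as "the * bank". No phrase boundaries have to be identified to obtain such pairs.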