CORDIS - EU research results
Content archived on 2024-06-18

Testing the portability of techniques to handle dissimilar source and target languages in MT

Final Report Summary - ENEUS (Testing the portability of techniques to handle dissimilar source and target languages in MT)

The ENEUS project brought together expertise from three fields: linguistics, computer science and translation. Its main objective was to contribute to machine translation (MT) research by studying the portability of architectures and techniques for handling dissimilar source and target languages.

The ENEUS project assessed how well the Matxin machine translation architecture can be ported to different language pairs. We performed a complete assessment of the system, with analytic languages as the source and agglutinative languages as the target of the translation process. Matxin has proven a suitable architecture for translation between dissimilar languages because it is designed for deep analysis, with emphasis on morphosyntax. Its key strength is the flexibility to move information within a morphologically informed, named dependency tree enriched with chunk knowledge.

During the assessment exercise, we ported the existing Spanish-Basque system to the English-Basque pair, thus building the ENEUS rule-based MT (RBMT) prototype. At the end of the reporting period, the prototype, with a coverage of 35,000 entries, can handle simple affirmative, negative and interrogative sentences in the indicative tenses for all four subject-object paradigms, in both active and passive voice, as well as imperatives. Moreover, in their simplest forms, it can already handle relative clauses, completives, conditionals, a number of adverbial clauses (time, place and reason) and certain non-finite structures.

Translation from an analytic language into an agglutinative language is challenging for statistical machine translation (SMT) systems. Specialised techniques, namely segmentation and reordering, have been proposed in the field to address this scenario. The ENEUS project studied the agglutinative features and word order profiles of an analytic language, English, and three agglutinative languages: Basque, Finnish and Hungarian. We contributed a theoretical analysis of the potential gains of these techniques and proposed language-specific implementations through a series of segmentation, agreement and reordering rules.

This work clearly shows that not all agglutinative languages can be addressed equally in SMT. The level of agglutination and the features that get agglutinated vary from language to language. Among the languages studied, the use of postpositions as equivalents to English prepositions was the only feature shared by all three agglutinative languages. Several segmentation schemes have been proposed in the field: some split all morphemes in the agglutinated words, others merge all suffixes into one group, and yet others propose more subtle groupings. We have seen, however, that a more source-language-oriented approach might be possible and beneficial, as it could be tailored to improving one-to-one alignment rates.
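
The first two segmentation schemes mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the ENEUS implementation: it assumes each word has already been morphologically analysed into a stem followed by its suffixes, and the function names and the `+` marker for suffix tokens are hypothetical conventions chosen for the example. The sample word is the textbook Finnish form "taloissani" ("in my houses").

```python
from typing import List


def split_all(morphemes: List[str]) -> List[str]:
    """Scheme 1: every morpheme becomes its own token.

    `morphemes` must be non-empty: the stem followed by its suffixes.
    """
    stem, *suffixes = morphemes
    return [stem] + ["+" + s for s in suffixes]


def group_suffixes(morphemes: List[str]) -> List[str]:
    """Scheme 2: the stem stays alone and all suffixes merge into one token."""
    stem, *suffixes = morphemes
    return [stem] + (["+" + "".join(suffixes)] if suffixes else [])


# Assumed analysis: talo (house) + i (plural) + ssa (inessive) + ni (1sg possessive)
word = ["talo", "i", "ssa", "ni"]
print(split_all(word))       # ['talo', '+i', '+ssa', '+ni']
print(group_suffixes(word))  # ['talo', '+issani']
```

Splitting everything yields more target tokens to align against the source, while grouping keeps the token count closer to that of the source side; which is preferable depends on the language pair, as the paragraph above notes.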

Agreement within the prepositional phrase has also been explored. We have observed that requirements vary significantly from language to language, with language-specific patterns emerging. Coupled with segmentation, agreement replication in the source shows potential to improve one-to-one alignment training. Finally, we also saw that word order patterns do not differ considerably for the English-Finnish and English-Hungarian pairs, although the differences are significant for the English-Basque pair. We conclude, therefore, that reordering has low potential to improve alignment for the English-Finnish and English-Hungarian pairs, but could have a very positive impact on English-Basque translation. ENEUS SMT systems were built for all pairs following the outcomes of the analysis.
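
One simple way to put a number on how much word order differs between two aligned sentences is to count crossing alignment links. The sketch below is an illustrative measure only; the report does not state how the project quantified word order differences, and the function name and the toy alignments are assumptions made for this example. An alignment is given as a list of (source position, target position) pairs.

```python
from itertools import combinations
from typing import List, Tuple


def crossing_ratio(alignment: List[Tuple[int, int]]) -> float:
    """Fraction of alignment link pairs that cross (0.0 means fully monotone)."""
    links = sorted(set(alignment))
    pairs = list(combinations(links, 2))
    if not pairs:
        return 0.0
    # Two links cross when their source order and target order disagree.
    crossed = sum(
        1 for (s1, t1), (s2, t2) in pairs if (s1 - s2) * (t1 - t2) < 0
    )
    return crossed / len(pairs)


# Toy alignments (not project data): same order vs. last two words swapped.
monotone = [(0, 0), (1, 1), (2, 2)]
verb_final = [(0, 0), (1, 2), (2, 1)]
print(crossing_ratio(monotone))
print(crossing_ratio(verb_final))
```

A ratio close to zero over a corpus would suggest that reordering has little to offer for that pair, while a high ratio would point to a pair where reordering rules are worth the effort.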

A large-scale human evaluation campaign was run as part of the ENEUS outreach programme (www.ebaluatoia.org). Over 500 users helped compare four English-Basque MT systems developed during the ENEUS project (RBMT, baseline SMT, morphologically-savvy SMT, hybrid MT) as well as Google’s state-of-the-art translator. Pair-wise comparisons of the MT output for 500 sentences were collected for the 10 possible system pairings. Results show that the morphologically-savvy SMT system was on a par with Google’s translator, and that these two systems outperformed all the others.
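
The figure of 10 pairings follows directly from comparing five systems two at a time. The sketch below enumerates the pairings and shows one possible way to tally pair-wise judgements; the `tally` function and the judgement format are hypothetical and are not taken from the evaluation platform used in the campaign.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, Optional, Tuple

SYSTEMS = [
    "RBMT",
    "baseline SMT",
    "morphologically-savvy SMT",
    "hybrid MT",
    "Google",
]

# Every unordered pair of distinct systems: C(5, 2) = 10.
PAIRS = list(combinations(SYSTEMS, 2))
print(len(PAIRS))  # 10

Judgement = Tuple[str, str, Optional[str]]  # (system_a, system_b, winner or None for a tie)


def tally(judgements: Iterable[Judgement]) -> Counter:
    """Count how many pair-wise comparisons each system won; ties are ignored."""
    wins: Counter = Counter()
    for _a, _b, winner in judgements:
        if winner is not None:
            wins[winner] += 1
    return wins
```
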

We evaluated the English-Finnish and English-Hungarian SMT systems separately. We calculated automatic metrics for all systems and performed a human evaluation with the two best systems. Three evaluators per language pair compared the system outputs and identified the better translation. Additionally, one evaluator per target language performed a qualitative error analysis. TAUS granted us access to their DQF evaluation platform to run the comparative evaluation. For Finnish, the system trained on aligned lemmas won; adequacy and morphological issues were identified as the most prominent errors. For Hungarian, the baseline system won; adequacy and lexical selection errors were the most prominent. We have analysed the reasons for the negative results and have pinpointed technical aspects, as well as the high level of terminology and the short sentences in the corpus domain, as possible causes.

Finally, as part of the ENEUS exploitation plan, we are conducting further productivity tests to move the ENEUS systems towards industrial use. The best ENEUS system has been integrated within the Bologna Translation Service at Elhuyar (an EU-funded ICT PSP 4th Call, Theme 6: Multilingual Web project, ID 270915). We are testing the gain in translation productivity obtained by using machine translation output. Translators have been asked to translate and post-edit texts, and we are analysing the words-per-hour rates they achieve.
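
The productivity comparison described above boils down to simple arithmetic on word counts and elapsed time. The sketch below shows one way to compute it; the function names are hypothetical and the figures in the usage lines are illustrative placeholders, not results from the ENEUS tests.

```python
def words_per_hour(word_count: int, seconds: float) -> float:
    """Throughput of a translator for one task, in source words per hour."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return word_count * 3600.0 / seconds


def productivity_gain(translate_wph: float, postedit_wph: float) -> float:
    """Relative gain of post-editing over translating from scratch (0.2 = 20% faster)."""
    if translate_wph <= 0:
        raise ValueError("baseline rate must be positive")
    return (postedit_wph - translate_wph) / translate_wph


# Illustrative values only, not project results.
from_scratch = words_per_hour(word_count=500, seconds=3600)
post_edited = words_per_hour(word_count=600, seconds=3600)
print(productivity_gain(from_scratch, post_edited))
```

A negative gain would indicate that post-editing the MT output was slower than translating from scratch for that translator and text.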

Apart from the scientific and industrial contributions, the societal impact of the project is seen at various levels. The ENEUS prototypes will be made available to users through the Matxin website, powered by Elhuyar (http://matxin.elhuyar.org/). The ENEUS RBMT system is the first open-source English-Basque MT system available today. It will be made available to developers through the Matxin page on SourceForge (http://sourceforge.net/projects/matxin/?source=directory). Since its publication on SourceForge in May 2006, Matxin has been downloaded 2,493 times. Thanks to the ENEUS project, the system now makes it possible to build and research translation from English and Spanish, the source languages already implemented, into any other language. This is a powerful tool for researchers, lecturers and students in Natural Language Processing and, in particular, machine translation, as well as for industrial partners.

On a more general level, the outreach programme conducted during the ENEUS project, with the large-scale evaluation/competition as its primary activity, served to attract the general public to research and to raise awareness of European research and, specifically, of the Marie Curie Actions.