CORDIS - EU research results

Integration of Machine Translation Paradigms

Final Report Summary - IMTRAP (Integration of Machine Translation Paradigms)

Machine Translation (MT) is a highly interdisciplinary field, approached from the perspectives of engineering, computer science, informatics, statistics and linguistics. In recent years, MT quality has improved, as shown by freely available translators such as Google Translate, Bing Translator and Apertium. The technology behind these systems differs: it is either corpus-based, learning from large quantities of parallel text, or rule-based, relying on human-written rules and dictionaries. IMTraP has proposed and experimented with both architectures, with the objective of extracting the best of each and boosting cooperation between the two communities. In addition, the project has presented leading results in the new research line of neural MT with a character-based approach. The project focused on the three most spoken languages in the world, Chinese, Spanish and English, and on all translation combinations among them. These language pairs not only involve many economic and cultural interests, but also include some of the most relevant MT challenges, such as morphological, syntactic and semantic variations.
The project analyzed in depth how the statistical MT approach (one of the most popular corpus-based approaches) has been enhanced at each written linguistic level (i.e. orthographic, morphological, lexical, syntactic and semantic) and published the study (‘Statistical Machine Translation Enhancements through Linguistic Levels: A Survey’, ACM Computing Surveys, 2015). The study showed that: “the holistic statistical MT is still not able to correctly cover all the translation challenges that arise. Alternatively, instead of being general, each extension to statistical MT tends to focus on one particular challenge to achieve the desired enhancement, and these particular approaches are usually focused on one of the linguistic levels mentioned earlier”.
After this analysis, the project focused on integrating morphological, syntactic and semantic knowledge into either statistical or rule-based systems. Morphology integration has been studied by experimenting with different simplifications of the target language and then generating the full word form with classification techniques; see details in ‘Morphology Generation for Statistical Machine Translation using Deep Learning Techniques’, CoRR, 2016. For the challenging Chinese-Spanish pair, experiments show that simplifying only gender and number achieves improvements almost as good as simplifying to lemmas. This is an interesting result because it reduces the complexity of the classification task. We have successfully used classification techniques based on deep learning that reach 93% effectiveness in classifying number and 98% in classifying gender. In the final integration, statistical translation combined with morphology generation improves overall translation quality.
Syntactic knowledge has been introduced by integrating manual and statistical techniques in a hybrid MT system guided by a rule-based system, in the specific task of Chinese-to-Spanish translation. The manual procedure consisted of translating a source text and comparing the output translation with the source and the reference. From this observation, patterns were manually extracted in order to design rules. The procedure for extracting statistical rules, inspired by previous work by Sánchez-Cartagena et al., consists of three steps: first, aligning the given parallel corpora at the word level; second, extracting bilingual units; third, restricting the bilingual units using the bilingual dictionary of the rule-based system. Manual and automatic rules are then combined, giving preference to the manual rules. Further details can be found in the paper: ‘Description of the Chinese-to-Spanish Rule-Based Machine Translation System Developed with a Hybrid Combination of Human Annotation and Statistical Techniques’, ACM TALLIP, 2016.
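The statistical rule-extraction steps and the manual-first combination can be illustrated with a minimal sketch. It is not the project's implementation: it assumes the word alignment is already given, reduces a "bilingual unit" to a single aligned word pair, and represents rules as a plain mapping from a source pattern to a target pattern. All function names and the toy data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Unit = Tuple[str, str]  # (source word, target word); a simplification


def extract_units(src: List[str], tgt: List[str],
                  alignment: List[Tuple[int, int]]) -> Set[Unit]:
    """Step 2 (simplified): one bilingual unit per aligned word pair."""
    return {(src[i], tgt[j]) for i, j in alignment}


def restrict_units(units: Set[Unit],
                   dictionary: Dict[str, Set[str]]) -> Set[Unit]:
    """Step 3: keep only units licensed by the bilingual dictionary."""
    return {(s, t) for s, t in units if t in dictionary.get(s, set())}


def combine_rules(manual: Dict[str, str],
                  automatic: Dict[str, str]) -> Dict[str, str]:
    """Merge both rule sets; a manual rule overrides an automatic one."""
    combined = dict(automatic)
    combined.update(manual)  # manual entries win on conflicts
    return combined


# Toy data (hypothetical). Step 1, word alignment, is assumed done.
src = ["我", "喜欢", "猫"]
tgt = ["me", "gustan", "los", "gatos"]
alignment = [(0, 0), (1, 1), (2, 3)]
dictionary = {"我": {"yo", "me"}, "喜欢": {"querer"}, "猫": {"gato", "gatos"}}

units = extract_units(src, tgt, alignment)
restricted = restrict_units(units, dictionary)
rules = combine_rules({"ADJ NOUN": "NOUN ADJ"},
                      {"ADJ NOUN": "ADJ NOUN", "NUM NOUN": "NUM NOUN"})
```

In this toy run, the pair for 喜欢 is discarded because the dictionary does not license it, and the manual reordering rule replaces the automatic rule with the same source pattern.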
Semantic knowledge has been introduced by developing a methodology to address lexical disambiguation in a standard statistical phrase-based system. Similarity among source contexts is used to select appropriate translation units. The information is introduced as a novel feature of the phrase-based model (Figure 1), and it is used to select the translation units extracted from the training sentences most similar to the sentence being translated. The similarity is computed through a deep auto-encoder representation, which yields an effective low-dimensional embedding of the data and statistically significant improvements on two different tasks (English-to-Spanish and English-to-Hindi). Further details can be found in the paper: ‘A Deep Source-Context Feature for Lexical Selection in Statistical Machine Translation’, Pattern Recognition Letters, 2016.
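The idea of a source-context feature can be sketched as follows. This is only an illustration: the project computes similarity over a deep auto-encoder embedding, whereas the sketch substitutes a plain bag-of-words cosine similarity, and it picks a single best unit instead of feeding the score into a phrase-based decoder as one feature among several. The function names and example sentences are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def bow_vector(sentence: str) -> Counter:
    """Bag-of-words counts; a stand-in for the auto-encoder embedding."""
    return Counter(sentence.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def best_unit(input_sentence: str,
              candidates: List[Tuple[str, str]]) -> str:
    """Pick the target phrase whose training source sentence is most
    similar to the input. Each candidate is a pair
    (target phrase, source sentence it was extracted from)."""
    query = bow_vector(input_sentence)
    return max(candidates,
               key=lambda c: cosine(query, bow_vector(c[1])))[0]


# Two candidate translations of the ambiguous English word "bank".
candidates = [
    ("banco", "he deposited money in the bank"),
    ("orilla", "they sat on the bank of the river"),
]
choice = best_unit("she opened an account at the bank to save money",
                   candidates)
```

Here the financial training sentence shares more context with the input, so the sketch selects "banco".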
All these tasks (i.e. integration of morphology, syntax and semantics) have also been addressed from a holistic point of view by experimenting with the new neural MT paradigm, which addresses the MT challenge with an auto-encoding paradigm. In this line, we have proposed the character-based neural MT approach (Figure 2), published at ACL 2016 (‘Character-based Neural Machine Translation’), which is a leading publication in an emerging research area.
IMTraP has a direct impact on basic MT research and innovation by proposing and inspiring improved MT through an integration paradigm. The resulting improvement in the quality of European MT technologies, together with the integration of the rule-based and statistical MT communities, benefits policy makers, civil society, the European Civil Service and international relations (especially with the Asian community, since the project has placed emphasis on Chinese).

The most relevant outputs of the project are as follows. Results have been published in 14 ISI journal papers and 27 international conferences, books and other journals. The researcher has received approval of a technology disclosure, pending patenting, at I2R; organised 5 workshop editions to promote integration of the rule-based and statistical communities; gave 10 invited dissemination talks and taught 8 related courses; proposed, alone or in cooperation, accepted projects such as Ramón y Cajal (senior research programme) and the DeepVoice project, both funded by MINECO; submitted proposals that were not funded, such as those to the BBVA Foundation and to FET Open-RIA from the European Commission; and supervised 8 students (bachelor's, master's and PhD theses) on related topics. As a consequence of the project, relations between I2R and UPC have improved: there is an exchange of students, and an agreement between both institutions is in preparation that will ease future collaborations.
In summary, IMTraP focused on the problem of dynamically integrating the two most popular MT paradigms: rule-based and statistical. We integrated linguistic technologies, developed either for rule-based MT systems or for other natural language processing tasks, into statistical MT systems. These linguistic technologies included bilingual dictionaries, transfer rules, statistical parsing, word sense disambiguation, and morphological and syntactic analysis. The new paradigm provided solutions to current MT challenges such as unknown words, reordering and semantic ambiguities.