Pushing the boundaries of automatic translation

Our globalised, interconnected reality demands ever-smarter automatic translation tools. Using deep learning, an EU-funded team has delivered solutions for statistical machine translation.

Instant translation across European languages is pivotal for effective governance in the EU as well as for academic and commercial activities. Data-driven approaches based on machine-learning techniques are widely used to this end. Their basic knowledge is derived from a parallel corpus of texts and their translations. This means that a high level of translation quality is reached in domains with large parallel corpora, such as those of international and EU organisations. Conversely, numerous other domains that lack large parallel corpora, such as medical or legal literature, suffer from low and uneven translation quality. Employing a two-pronged approach, the EU-funded DASMT project improved knowledge acquisition for automatic translation. It focused on how to benefit from large out-of-domain parallel corpora in domain-specific translation systems, and on mining and appropriately weighing knowledge available from in-domain texts that are not parallel.

Deep learning: a challenge and an opportunity

The DASMT team initially became involved with deep learning, which requires graphics processing units (GPUs), by buying gaming PCs with consumer GPUs. Project coordinator Alexander Fraser comments: “These really looked like gaming machines with, for instance, external water cooling … yet we quickly determined that we needed to change our whole research programme to work with deep learning models for translation, which was a lot of effort in the second and the third year of the project and required significant server purchases. But this ultimately made a big difference in the impact we had.” The DASMT solutions have a direct impact on providers of translation services as well as an academic impact, since domain adaptation applies to all natural language processing systems and many areas of artificial intelligence research.

Holistic results for the realm of machine translation

DASMT improved translation into morphologically rich languages that use classifiers. Consequently, interest switched to neural machine translation (NMT), a new technology that overcomes some limitations of phrase-based statistical machine translation, the previous state of the art. Important work was done here on inflectional generalisation, on improving linguistic representation, and on fast training algorithms. Surprisingly, the researchers found themselves working on training machine translation systems without the use of any parallel data. Moreover, they researched document translation, utilising the full context and thus achieving better modelling. The project also focused on several under-resourced languages with few digital resources, such as Hiligaynon, an important language of the Philippines. Through special case studies with Upper Sorbian (a minority Slavic language of Germany) and Chuvash (a minority language of Russia), the team enriched the research on effective pretraining for unsupervised NMT. Finally, DASMT managed to create a high-performance hate speech detection system. The DASMT team has open-sourced its improved systems and is committed to communicating the results to the machine translation and multilingual natural language processing communities. “In the future, we will pursue further research funding from both European and national agencies, and we will also create a spin-off which has both commercial and non-profit focuses, since there is significant interest from both sectors in our improved multilingual models,” reveals Fraser.