Instant translation across European languages is pivotal for effective governance in the EU as well as for academic and commercial activities. Data-driven approaches based on machine learning techniques are widely used to this end. These systems derive their basic knowledge from a parallel corpus of texts and their translations. As a result, translation quality is high in domains with large parallel corpora, such as the documents of international and EU organisations. Conversely, numerous other domains that lack large parallel corpora, such as medical or legal literature, suffer from low translation quality. Employing a two-pronged approach, the EU-funded DASMT project improved knowledge acquisition for automatic translation. It focused on how to benefit from large out-of-domain parallel corpora in domain-specific translation systems, and on mining and appropriately weighting knowledge available from in-domain texts that are not parallel.
Deep learning: a challenge and an opportunity
The DASMT team's first step into deep learning, which requires graphics processing units (GPUs), was to buy gaming PCs with consumer GPUs. Project coordinator Alexander Fraser comments: “These really looked like gaming machines with, for instance, external water cooling … yet we quickly determined that we needed to change our whole research programme to work with deep learning models for translation, which was a lot of effort in the second and the third year of the project and required significant server purchases. But this ultimately made a big difference in the impact we had.” The DASMT solutions have a direct impact on providers of translation services as well as an academic impact, since domain adaptation applies to all natural language processing systems and many areas of artificial intelligence research.
Holistic results for the realm of machine translation
DASMT improved translation into morphologically rich languages that use classifiers. The project's interest then switched to neural machine translation (NMT), a new technology that overcomes some limitations of phrase-based statistical machine translation, the previous state of the art. Important work was done here on inflectional generalisation, improved linguistic representation and fast training algorithms. Surprisingly, the researchers found themselves working on training machine translation systems without the use of any parallel data. Moreover, they researched document translation, utilising the full context and thus achieving better modelling. The project also focused on several under-resourced languages, those with few digital resources, such as Hiligaynon, an important language of the Philippines. Through special case studies with Upper Sorbian (a minority Slavic language of Germany) and Chuvash (a minority language of Russia), the team enriched the research on effective pretraining for unsupervised NMT. Finally, DASMT managed to create a high-performance hate speech detection system. The DASMT team has open-sourced its improved systems and is committed to communicating the results to the machine translation and multilingual natural language processing communities. “In the future, we will pursue further research funding from both European and national agencies, and we will also create a spin-off which has both commercial and non-profit focuses, since there is significant interest from both sectors in our improved multilingual models,” reveals Fraser.
DASMT, statistical machine translation, automatic translation, translation, parallel corpora, deep learning, language, multilingual