Domain Adaptation for Statistical Machine Translation

Periodic Reporting for period 4 - DASMT (Domain Adaptation for Statistical Machine Translation)

Reporting period: 2020-06-01 to 2021-11-30

Rapid translation between European languages is a cornerstone of good
governance in the EU, and of great academic and commercial
interest. Data-driven approaches to machine translation based on
machine learning techniques are widely used and constitute the
state of the art. The basic knowledge source is a parallel corpus:
texts paired with their translations. For domains where large parallel
corpora are available, such as the proceedings of the European
Parliament, a high level of translation quality is reached. However,
in countless other domains where large parallel corpora are not
available, such as medical literature or legal decisions, translation
quality is unacceptably poor. Given the strong demand for automatic
translation capabilities, this is a problem of critical importance.

We worked on solving two basic problems of knowledge acquisition
for machine translation. The first problem is determining how to
benefit from large out-of-domain parallel corpora in domain-specific
translation systems. The second problem is mining and appropriately
weighting knowledge available from in-domain texts which are not
parallel.

Our work has resulted in a breakthrough in translation quality for the
vast number of domains with little parallel text available, and has a
direct impact on companies providing translation services. The
academic impact of our work has been large because solutions to
the challenge of domain adaptation apply to all natural language
processing systems and to numerous other areas of artificial
intelligence research based on machine learning approaches.

In the first year of the project, we carried out work on improving
translation into morphologically rich languages using classifiers. This
work was integrated into the Moses open-source statistical machine
translation system, which is widely used in both academic and
commercial environments. We published important follow-up papers on
this work at the beginning of the second year, in addition to carrying
out work on better linguistic modeling and on modeling transliteration.

In the second year of the project, we switched completely to neural
machine translation, a new technology overcoming some limitations of
the previous state of the art (phrase-based statistical machine
translation). We carried out important work here on both inflectional
generalization and improved linguistic representation, as well as on
fast training algorithms. We participated in an important machine
translation community shared task with excellent results; a particular
highlight was having the best English-to-German system for news
translation according to human judgments of the translation output.

In the third year of the project,
we developed new technology for automatically finding the translations
of terms which are not in our parallel training data. We also showed
how to leverage large in-domain corpora in tasks like this one and in
other natural language processing tasks. An interesting result of
this work is that we achieved very good performance in detecting the
sentiment of Spanish tweets without using any Spanish-language training
data (i.e. using only English-language training data).
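
To illustrate the idea of such cross-lingual transfer through a shared
multilingual embedding space, here is a minimal sketch; it is not the
project's actual system, and the encoder model and toy data are
illustrative assumptions:

```python
# Minimal sketch of zero-shot cross-lingual sentiment transfer:
# train on English labels only, classify Spanish text via a shared
# multilingual embedding space. The encoder choice is illustrative.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# English training data (toy examples).
en_texts = ["I love this!", "This is terrible.", "Great job!", "Awful service."]
en_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Because the encoder maps all languages into one vector space,
# a classifier fit on English embeddings transfers to Spanish.
clf = LogisticRegression().fit(encoder.encode(en_texts), en_labels)

es_tweets = ["Me encanta este producto", "Qué servicio tan horrible"]
print(clf.predict(encoder.encode(es_tweets)))  # expected: [1 0]
```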

In the third year of the project, we also began work in two important
new areas of research which we had not previously planned to
address. The first is training machine translation systems without the
use of any parallel data. This is an exciting development that allows
us to address translation tasks which we could not previously consider
due to a lack of training data. The second is research on document
translation, which uses information from the full document in the
translation process, allowing us to better model, e.g. an ambiguous
word in terms of the full context of the document rather than only the
sentence in which it occurs.
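
A common way to realize document-level translation is to concatenate
each sentence with its preceding context before it is fed to the
model; the sketch below illustrates this input construction. The
separator token and window size are illustrative assumptions, not the
project's exact setup:

```python
# Minimal sketch of the concatenation approach to document-level NMT:
# each source sentence is prefixed with its preceding sentences,
# joined by a separator token, so the encoder can disambiguate words
# using document context.
SEP = "<sep>"

def build_doc_inputs(doc_sentences, context_size=2):
    """Turn a document (list of sentences) into context-augmented inputs."""
    inputs = []
    for i, sent in enumerate(doc_sentences):
        context = doc_sentences[max(0, i - context_size):i]
        inputs.append(f" {SEP} ".join(context + [sent]))
    return inputs

doc = [
    "The bank approved the loan.",
    "It sits in a building by the river.",
    "Its tellers are friendly.",  # pronouns resolvable only in context
]
for x in build_doc_inputs(doc):
    print(x)
```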

In the fourth year of the project, we continued our previous lines of
research, but we also began a new research focus on multilingual
representation learning, publishing two highly cited papers on
language neutrality. We were also able to publish important findings
in the areas of document translation, covering unknown words by
just-in-time corpus mining, and addressing the specific problem of the
translation of anaphora (such as pronouns).

In the fifth year of the project, we expanded our focus on
multilingual representation learning to a number of under-resourced
languages. Our first work was on Hiligaynon, an important language of
the Philippines for which there are very few digital resources. We
extended our previous inflectional work to word-formation
phenomena. Through the creation of an adversarial test set, we showed
that the problem of coreference in machine translation has not yet
been solved. We studied character-level NMT models trained through
curriculum learning. We began a longer-term collaboration with the
communities of Upper and Lower Sorbian speakers (minority Slavic
languages of Germany), and organized a shared task at ACL-WMT focusing
on unsupervised translation between Upper Sorbian and German, where we
also achieved the best result against seven other research teams. The
latter result was based on research on effective pretraining for
unsupervised NMT.

In the sixth year of the project, we expanded our focus on
multilingual representation learning. We studied transfer learning for
cross-lingual tasks further, creating a high-performance hate speech
detection system. We further expanded our work on the translation of
rare and unseen word senses in a collaboration with Cambridge. We
studied domain adaptation for neural machine translation. We showed
that the use of document-level context in neural machine translation
allowed us to address the important problem of zero-resource domains
(domains which do not occur in the training data). We determined how
to adapt entities across languages and cultures in a collaboration
with the University of Maryland. We again organized the shared task at
ACL-WMT, this time focusing on Lower Sorbian while additionally
covering Upper Sorbian and Chuvash, a minority language of Russia.

We have improved the state of the art in phrase-based statistical
machine translation by incorporating classifiers for dealing with rich
morphology, and by studying linguistic and script-related
problems.

We have also improved the state of the art with respect to the use of
linguistic information in neural machine translation, the use of
bilingual word embeddings for finding the translations of unknown
terms in comparable corpora (such as Wikipedia), and the
cross-language adaptation of classifiers, particularly in the
unsupervised case.
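
A standard way to obtain such bilingual word embeddings is to align
two monolingual embedding spaces with an orthogonal (Procrustes)
mapping learned from a small seed dictionary, and then to translate an
unseen term by nearest-neighbour search in the target space. The
following sketch uses toy random data to show the mechanics; it is not
the project's actual mining pipeline:

```python
# Minimal sketch of term translation via bilingual word embeddings:
# learn an orthogonal map from source to target embedding space from
# seed word pairs (Procrustes), then translate an unseen term by
# cosine nearest neighbour in the target space. All data here is toy.
import numpy as np

def procrustes(X_src, Y_tgt):
    """Orthogonal W minimising ||X W - Y||_F, from seed word pairs."""
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt

def translate(term_vec, W, tgt_vocab, tgt_embs):
    """Map a source term into target space; return nearest target word."""
    mapped = term_vec @ W
    sims = (tgt_embs @ mapped) / (
        np.linalg.norm(tgt_embs, axis=1) * np.linalg.norm(mapped) + 1e-9)
    return tgt_vocab[int(np.argmax(sims))]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))   # source embeddings of seed pairs (toy)
Y = rng.normal(size=(100, 50))   # target embeddings of seed pairs (toy)
W = procrustes(X, Y)

tgt_vocab = ["fiebre", "hígado", "riñón"]  # toy target vocabulary
tgt_embs = rng.normal(size=(3, 50))
print(translate(rng.normal(size=50), W, tgt_vocab, tgt_embs))
```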

We scaled our approaches for mining terms
and parallel sentences to web scale. We showed how to use mined
terms and parallel sentences in neural machine translation systems to
solve the out-of-vocabulary problem which occurs when carrying out
domain adaptation. We completed the creation of a document translation
system, which uses document-level context in translation (rather than
considering each sentence in isolation), and showed how it models
domain, demonstrating in particular novel results for zero-resource
domains. Finally, we created state-of-the-art unsupervised machine
translation systems, i.e. systems which do not require any parallel
training data.
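
At the heart of such unsupervised systems is iterative
back-translation, in which two translation models bootstrap each other
from monolingual data alone. The schematic sketch below shows the loop
structure only; the Model class is a trivial stand-in, not a neural
translation model:

```python
# Schematic sketch of the iterative back-translation loop used in
# unsupervised NMT: a src->tgt and a tgt->src model train each other
# on synthetic parallel data built from monolingual text. Real systems
# use neural encoder-decoders initialised from pretrained multilingual
# representations; Model here is a placeholder so the loop runs.
class Model:
    def __init__(self):
        self.pairs = []                       # accumulated training pairs
    def train(self, sources, targets):
        self.pairs.extend(zip(sources, targets))
    def translate(self, sentences):
        return [s[::-1] for s in sentences]   # placeholder "translation"

mono_src = ["ein satz", "noch ein satz"]      # monolingual source text
mono_tgt = ["a sentence", "another one"]      # monolingual target text
s2t, t2s = Model(), Model()

for _ in range(3):  # each round improves both directions
    # Back-translate target monolingual data into synthetic sources,
    # pair them with the real targets to train src->tgt; vice versa.
    s2t.train(t2s.translate(mono_tgt), mono_tgt)
    t2s.train(s2t.translate(mono_src), mono_src)
```
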
Domain adaptation without knowing the domain