In the first year of the project, we carried out work on improving
translation to morphologically rich languages using classifiers. This work
was integrated into the Moses open-source statistical machine
translation system, which is widely used in both academic and
commercial environments. We had important follow-up papers on this work
at the beginning of the second year, in addition to carrying out work
on better linguistic modeling and on modeling transliteration.
In the second year of the project, we completely switched to neural
machine translation, a new technology overcoming some limitations of
the previous state of the art (which was phrase-based statistical
machine translation). We carried out important work here on both
inflectional generalization and improving linguistic representation,
as well as on fast training algorithms. We participated in an
important machine translation community shared task and had excellent
results (a particular highlight was having the best English-to-German
system for news translation according to human judgments of
the translation output).
In the third year of the project,
we developed new technology for automatically finding the translation
of terms which are not in our parallel training data. We also showed
how to leverage large in-domain corpora in tasks like this one and
other tasks in natural language processing. An interesting result of
this work is that we have very good performance on detecting the
sentiment of Spanish tweets without using Spanish-language training
data (i.e., using only English-language training data).
In the third year of the project, we also began to work on two important new areas of research that
we had not previously planned to address. The first is training
machine translation systems without the use of any parallel data. This
is an exciting development that will allow us to address translation
tasks which we were not able to previously consider due to lack of
training data. The second is research on
document translation, which uses information from the full
document in the translation process. This allows us, for example, to better
model an ambiguous word using the full context of the document
rather than only the sentence in which the word occurs.
In the fourth year of the project, we continued our previous research and also began a new research focus on multilingual
representation learning, publishing two highly cited papers on language neutrality. We were also able to publish important findings in the areas of document translation,
covering unknown words by just-in-time corpus mining,
and addressing the specific problem of translation of anaphora (such as pronouns).
In the fifth year of the project, we expanded our focus on multilingual representation learning to a number of under-resourced
languages. Our first work was on Hiligaynon, an important language of the
Philippines, for which there are very few digital resources. We extended our previous work on
inflection to word-formation phenomena. We showed that the problem of coreference in
machine translation has not been solved yet, through the creation of an adversarial test set.
We studied character-level NMT models trained through curriculum learning. We began a
longer-term collaboration with the communities of Upper and Lower Sorbian speakers (minority Slavic languages of
Germany), and organized a shared task at ACL-WMT focusing on unsupervised translation between Upper Sorbian and
German, where we also had the best result, ahead of seven other research teams.
The latter result was based on research on effective pretraining for unsupervised NMT.
In the sixth year of the project, we expanded our focus on multilingual representation learning.
We studied transfer learning for cross-lingual tasks further, creating a high performance hate
speech detection system. We further expanded our work on translation of rare and
unseen word senses in a collaboration with Cambridge. We studied domain adaptation for
neural machine translation. We showed that the use of document-level context in neural
machine translation allowed us to address the important problem of zero-resource domains (domains which do not
occur in the training data). We determined how to adapt entities across languages and
cultures in a collaboration with the University of Maryland. We organized the shared
task at ACL-WMT focusing on Lower Sorbian, but additionally covering Upper Sorbian again and also Chuvash, a
minority language of Russia.