Periodic Reporting for period 1 - SEBAMAT (Semantics-Based Machine Translation)
Reporting period: 2020-04-01 to 2022-03-31
A system for word sense disambiguation was designed and implemented on top of this baseline translation system. To annotate a text in a source language with word senses, it is first translated into the target language using the baseline system. Then a word alignment between source and target language is computed with the fast_align word aligner. The target-language words aligned with ambiguous source-language words then serve as sense descriptors for disambiguating them. Positive features of this word sense disambiguation system are that the sense labels are immediately clear to anyone proficient in both the source and the target language, and that the sense granularity is exactly what machine translation needs. On the downside, in cases where a source-language word and its target-language translation share the same ambiguity, no disambiguation is possible. However, this is not a problem in the context of machine translation, where the purpose is to select correct translations.
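The annotation step described above can be sketched as follows. This is a hypothetical illustration, not the project's actual code: the function name, the `word|sense` label format, and the example sentence are assumptions; only the "i-j" alignment format matches fast_align's standard output.

```python
# Sketch of sense annotation via translation + word alignment (hypothetical
# helper, not the project's implementation). Ambiguous source words are
# tagged with their aligned target word, which acts as the sense label.

def annotate_senses(source_tokens, target_tokens, alignment, ambiguous):
    """Append the aligned target word to each ambiguous source word."""
    # Parse fast_align-style output like "0-0 1-2 2-1" into a source->target map.
    links = {}
    for pair in alignment.split():
        i, j = map(int, pair.split("-"))
        links.setdefault(i, j)  # keep the first link per source token
    annotated = []
    for i, tok in enumerate(source_tokens):
        if tok.lower() in ambiguous and i in links:
            # The target word itself serves as a human-readable sense descriptor.
            annotated.append(f"{tok}|{target_tokens[links[i]]}")
        else:
            annotated.append(tok)
    return annotated

src = "He sat by the bank".split()
tgt = "Er saß am Ufer".split()          # German translation: "bank" -> "Ufer" (riverbank)
print(annotate_senses(src, tgt, "0-0 1-1 2-2 3-2 4-3", {"bank"}))
# → ['He', 'sat', 'by', 'the', 'bank|Ufer']
```

Note how the label `bank|Ufer` distinguishes the riverbank reading from the financial one (which would align to German "Bank"), which is exactly the granularity a translation system needs.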
Five language pairs of the well-known Europarl corpus were annotated with word senses. The translation quality of a neural machine translation system trained on these annotated corpora was compared to that of the high-quality baseline system. The baseline system performed slightly better, which means that in our specific setting word sense disambiguation does not bring any improvement. This confirms the results of previous attempts to introduce word sense disambiguation into machine translation.
As an alternative to word sense disambiguation, the Europarl corpora were annotated with semantic roles using a system provided by the Allen Institute for Artificial Intelligence. Again, a neural machine translation system was trained on the annotated corpora for five language pairs, and the results were compared to those of the baseline system. Here, the results using semantic role labeling were somewhat better for all five language pairs, indicating a consistent improvement.
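One way such role annotations can be folded into a training corpus is sketched below. The report does not specify the exact annotation format, so the `token#ROLE` scheme and the function name are assumptions; the BIO tag layout is the standard output format of semantic role labelers.

```python
# Hypothetical sketch: merging BIO-style semantic-role tags into the token
# stream so an NMT system sees the roles as part of its input. The "#ROLE"
# suffix format is an assumption for illustration.

def inline_srl(tokens, bio_tags):
    """Attach each token's role, e.g. 'cat' + 'B-ARG0' -> 'cat#ARG0'."""
    out = []
    for tok, tag in zip(tokens, bio_tags):
        if tag == "O":
            out.append(tok)                     # token outside any role span
        else:
            role = tag.split("-", 1)[1]         # drop the B-/I- prefix
            out.append(f"{tok}#{role}")
    return out

tokens = ["The", "cat", "ate", "the", "mouse"]
tags = ["B-ARG0", "I-ARG0", "B-V", "B-ARG1", "I-ARG1"]
print(" ".join(inline_srl(tokens, tags)))
# → The#ARG0 cat#ARG0 ate#V the#ARG1 mouse#ARG1
```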
A problem with self-learning machine translation systems is the data acquisition bottleneck: for each language pair, a large training corpus of human translations is required. To deal with this problem, we implemented multilingual neural machine translation systems, i.e. systems with more than one language on the source and usually also on the target side. Using the standard encoder/decoder architecture, these systems have been shown to produce language-agnostic contextual vectors at the interface between encoder and decoder. This implies that they can translate between combinations of languages they have not been trained on: for example, a multilingual system trained on French-English and Spanish-German can also translate between French-German and Spanish-English. For a number of language pairs, we could show that the translation quality of such systems can be surprisingly good.
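A common way to set up such a multilingual system, sketched below under the assumption that the project used a target-language-token scheme (the report does not say which mechanism was used), is to prefix every source sentence with a token naming the desired target language. One model then learns several directions from the combined data, and at inference time it can be asked for a direction never seen in training (zero-shot).

```python
# Minimal sketch (assumed setup, not the project's code) of multilingual
# training data with target-language tokens: the "<2xx>" prefix tells the
# model which language to produce.

def make_example(src_sentence, src_lang, tgt_lang, tgt_sentence=""):
    """Prefix the source with a target-language token like '<2de>'."""
    return (f"<2{tgt_lang}> {src_sentence}", tgt_sentence)

# Trained directions, mirroring the example in the text: fr-en and es-de.
train = [
    make_example("le chat dort", "fr", "en", "the cat sleeps"),
    make_example("el gato duerme", "es", "de", "die Katze schläft"),
]

# Zero-shot request at inference: French input, German output, although
# no fr-de sentence pair was ever seen during training.
zero_shot_input, _ = make_example("le chat dort", "fr", "de")
print(zero_shot_input)
# → <2de> le chat dort
```

The zero-shot ability rests on the language-agnostic encoder representations mentioned above: because French and Spanish inputs are mapped into a shared space, the decoder's German output path learned from Spanish-German data also works for French input.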