CORDIS - EU research results

Semantics-Based Machine Translation

Periodic Reporting for period 1 - SEBAMAT (Semantics-Based Machine Translation)

Reporting period: 2020-04-01 to 2022-03-31

Most current machine translation systems are corpus-based. They typically take the semantics of a text into account only insofar as it is implicit in the underlying text corpora. This is also true of recent neural machine translation systems, which, compared with standard phrase-based systems, tend to focus even more on fluency than on adequacy. The question is whether semantic knowledge can be put to better use. For example, it has been suggested that future machine translation systems should use information of the type "who is doing what to whom, when and why", which may require identifying the semantic roles of the items occurring in a sentence.
To move towards semantics-based machine translation, we propose to implement and evaluate three different approaches. The first approach is based on state-of-the-art machine translation but considers word senses rather than words: a word sense disambiguation system determines the word senses in large parallel text corpora, and a neural machine translation system is then trained on the word-sense-disambiguated corpora rather than the original ones. The second approach uses semantic role labeling to identify the semantic roles of the words in a sentence; here the neural machine translation system is trained on a corpus annotated with semantic roles. With both word sense disambiguation and semantic role labeling, the hope is that the respective annotation software does a better job than neural machine translation does implicitly, and that this improves translation quality. In contrast, the third approach tries to reduce the data acquisition bottleneck encountered with low-resource languages. It uses multilingual neural machine translation systems to translate between language pairs for which no parallel data is available.
Using the neural machine translation toolkit Marian NMT, a baseline neural machine translation system was implemented. The system is very flexible and can easily be used for any language pair for which sufficiently large parallel corpora are available. It was trained on 10 language pairs involving the project languages English, French, German, Greek and Spanish. For participation in the shared task on similar language translation at the Conference on Machine Translation (WMT 2021), the system was also trained in both directions of Portuguese-Spanish and Catalan-Spanish, and it achieved first and second places in this competition.
A system for word sense disambiguation was designed and implemented on top of this baseline translation system. To annotate a text in a source language with word senses, the text is first translated into the target language using the baseline system. Then a word alignment between the source and target text is computed using the fast_align word aligner. The target-language words aligned with ambiguous source-language words then serve as sense descriptors that disambiguate them. Positive features of this word sense disambiguation system are that the sense labels are very clear to anyone proficient in both the source and the target language, and that the sense granularity is exactly what is needed for machine translation. On the downside, no disambiguation is possible in cases where a source-language word and its target-language translation share the same ambiguity. However, this is not a problem in the context of machine translation, where the purpose is to select correct translations.
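The annotation step described above can be illustrated with a minimal sketch. It assumes that the alignment is available in the "i-j" index-pair format that fast_align writes (source index, then target index), and that a set of ambiguous source words is known in advance. The function names, the "|" separator and the example sentence are hypothetical choices for illustration and are not taken from the project.

```python
def parse_alignment(line):
    """Parse an alignment line such as '0-0 1-2' into (source, target) index pairs."""
    return [tuple(int(i) for i in pair.split("-")) for pair in line.split()]


def annotate_senses(src_tokens, tgt_tokens, alignment, ambiguous, sep="|"):
    """Append the aligned target word(s) as a sense label to ambiguous source words.

    `ambiguous` is an assumed, pre-compiled set of lower-cased source words;
    `sep` is a hypothetical separator between word and sense label.
    """
    # Collect, for every source position, the target positions aligned with it.
    aligned = {}
    for src_idx, tgt_idx in alignment:
        aligned.setdefault(src_idx, []).append(tgt_idx)

    annotated = []
    for i, token in enumerate(src_tokens):
        if token.lower() in ambiguous and i in aligned:
            # The aligned target words act as the sense descriptor.
            label = "_".join(tgt_tokens[t] for t in sorted(aligned[i]))
            annotated.append(f"{token}{sep}{label}")
        else:
            # Unambiguous or unaligned words are left unchanged.
            annotated.append(token)
    return annotated


src = "He sat on the bank".split()
tgt = "Er saß am Ufer".split()
links = parse_alignment("0-0 1-1 2-2 3-2 4-3")
print(" ".join(annotate_senses(src, tgt, links, ambiguous={"bank"})))
```

In this sketch the German word "Ufer" (river bank) becomes the sense label of "bank"; a sentence about a financial institution would instead yield a label such as "Bank", which is the case in which both languages share the ambiguity.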
Five language pairs of the well-known Europarl corpus were annotated with word senses. The translation quality of a neural machine translation system trained on these annotated corpora was compared with that of the high-quality baseline system. The results of the baseline system were slightly better, which means that in our specific setting word sense disambiguation does not bring any improvement. This confirms the results of previous work that tried to introduce word sense disambiguation into machine translation.
As an alternative to word sense disambiguation, the Europarl corpora were annotated with semantic roles using a system provided by the Allen Institute for Artificial Intelligence. Again, a neural machine translation system was trained on the annotated corpora for five language pairs, and the results were compared with those of the baseline system. Here the results using semantic role labeling were somewhat better for all five language pairs, indicating a consistent improvement.
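How the role annotations were encoded in the training corpus is not described here. As a hedged illustration, the sketch below assumes that the labeler returns one BIO tag per token (for example "B-ARG0", "I-ARG0", "B-V", "O") and attaches the role to each token with a hypothetical "|" separator. The function name and the example sentence are likewise illustrative assumptions.

```python
def attach_roles(tokens, bio_tags, sep="|"):
    """Attach semantic role labels to tokens, given one BIO tag per token.

    Assumed tag format: 'B-<ROLE>' or 'I-<ROLE>' for labeled tokens, 'O' otherwise.
    """
    if len(tokens) != len(bio_tags):
        raise ValueError("tokens and tags must have the same length")

    annotated = []
    for token, tag in zip(tokens, bio_tags):
        if tag == "O":
            # Tokens outside any argument span stay unchanged.
            annotated.append(token)
        else:
            # Drop the 'B-' / 'I-' prefix and keep the role, e.g. 'ARGM-TMP'.
            role = tag.split("-", 1)[1]
            annotated.append(f"{token}{sep}{role}")
    return " ".join(annotated)


tokens = ["The", "committee", "approved", "the", "report", "yesterday", "."]
tags = ["B-ARG0", "I-ARG0", "B-V", "B-ARG1", "I-ARG1", "B-ARGM-TMP", "O"]
print(attach_roles(tokens, tags))
```

A translation system would then be trained with such annotated sentences on the source side and the unannotated sentences on the target side.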
A problem with self-learning machine translation systems is the data acquisition bottleneck: for each language pair, a large training corpus of human translations is required. To deal with this problem, we implemented multilingual neural machine translation systems, i.e. systems which have more than one language on the source side and usually also on the target side. Built on the standard encoder/decoder architecture, these systems have been shown to produce language-agnostic contextual vectors at the interface between encoder and decoder. This implies that they can translate between language pairs they have not been trained on. For example, if a multilingual system was trained on French-English and Spanish-German, it can also translate French-German and Spanish-English. For a number of language pairs, we were able to show that the translation quality of such systems can be surprisingly good.
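One common way to build such a system is to prepend a token naming the desired target language to every source sentence. Whether the project used this exact scheme is not stated, so the sketch below is an assumption. It shows the tagging step and lists the untrained ("zero-shot") directions that follow from a set of trained ones. The tag format "<2xx>" and all function names are hypothetical.

```python
def tag_source(sentence, tgt_lang):
    """Prefix a source sentence with a target-language token (assumed format '<2xx>')."""
    return f"<2{tgt_lang}> {sentence}"


def build_training_pairs(corpora):
    """Merge several bilingual corpora into one tagged multilingual training set.

    `corpora` maps (source_lang, target_lang) to a list of (source, target) sentences.
    """
    pairs = []
    for (_, tgt_lang), sentence_pairs in corpora.items():
        for src, tgt in sentence_pairs:
            pairs.append((tag_source(src, tgt_lang), tgt))
    return pairs


def zero_shot_directions(trained):
    """List directions whose source and target languages were seen, but never together."""
    sources = {src for src, _ in trained}
    targets = {tgt for _, tgt in trained}
    return sorted(
        (s, t) for s in sources for t in targets
        if s != t and (s, t) not in trained
    )


corpora = {
    ("fr", "en"): [("Bonjour le monde", "Hello world")],
    ("es", "de"): [("Buenos días", "Guten Morgen")],
}
print(build_training_pairs(corpora))
print(zero_shot_directions(set(corpora)))
```

With the two training directions from the example above, the untrained directions are Spanish to English and French to German. A French sentence tagged with "<2de>" would then request a translation the system never saw in training.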
Progress beyond the state of the art mainly includes the following points. A strong baseline system for neural machine translation was implemented, which performed well in a shared task on machine translation of similar languages. An innovative system for word sense disambiguation, which can easily be adapted to many languages, was designed, implemented, and applied to the five SEBAMAT languages. A system for neural machine translation trained on word-sense-disambiguated corpora was implemented and evaluated. To investigate a different type of semantics as well, the English side of the Europarl corpus was annotated with semantic roles using the semantic role labeling system provided by the Allen Institute for Artificial Intelligence. A neural machine translation system was trained with this annotated corpus on the source-language side and the corresponding unannotated corpus on the target-language side. For five language pairs, a consistent improvement over the baseline system was achieved. Finally, it was demonstrated that a multilingual neural machine translation system greatly reduces the need for parallel corpora while only marginally affecting translation quality. In summary, the project has shown innovative ways to improve machine translation. Machine translation, being among the most popular services on the internet, is an important means of international communication and a useful support tool for human translators.