Global Under-Resourced MEdia Translation

Periodic Reporting for period 1 - GoURMET (Global Under-Resourced MEdia Translation)

Reporting period: 2019-01-01 to 2020-06-30

Machine translation (MT) is an increasingly important technology for supporting communication in a globalised world. Millions of users rely on MT either for assimilation, the use of raw MT output to make sense of foreign texts, or for dissemination, the use of MT output to create a draft translation which is then corrected and published. Although the uptake of MT technology has gradually increased over the last ten years, recent advances in neural machine translation (NMT) have generated significant interest in industry and led to very rapid adoption of the new paradigm (e.g. Google, Facebook, the UN, the World Intellectual Property Organization). Although these models have significantly advanced the state of the art, they are data intensive and require parallel corpora of many millions of human-translated sentences for training. Neural machine translation is currently unable to deliver usable translations for the vast majority of language pairs in the world. This is especially problematic for our user partners, the BBC and DW, who need access to fast and accurate translation for languages with very few resources.

The aim of GoURMET is to significantly improve the robustness and applicability of neural machine translation for low-resource language pairs and domains.

GoURMET has five objectives:
1. Advancing low-resource deep learning for natural language applications;
2. Development of high-quality machine translation for low-resource language pairs and domains;
3. Development of tools for media analysts and journalists;
4. Sustainable, maintainable platform and services;
5. Dissemination and communication of project results to stakeholders and user groups.

Achieving these aims requires advancing the state of the art in low-resource machine learning. We are investigating how to make translation significantly more robust, guided by the intuition that translated (or parallel) corpora contain enormous redundancies and are an inefficient way to learn to translate. Inspired by human learning, we will study methods of building up meaning compositionally, biasing the models to concentrate their capacity on patterns that are likely to generalise better to unseen sentences and are therefore more data efficient. We will also leverage another human capacity, the ability to “learn to learn”, or to build on knowledge learned in related tasks, by developing machine learning techniques such as transfer learning and data augmentation. This allows us to extract knowledge from monolingual and parallel resources from other languages and domains. The project combines fundamental research in deep learning with lower-risk, data-driven machine learning research in order to deliver useful products to our industry partners.
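One widely used data-augmentation technique in low-resource NMT of the kind described above is back-translation: monolingual target-language text is translated back into the source language by a reverse model, producing synthetic parallel pairs to supplement scarce human-translated data. The sketch below is illustrative only and is not taken from the project's codebase; `reverse_translate` is a hypothetical stand-in for a trained target-to-source model, and the sample sentences are invented.

```python
# Minimal sketch of back-translation data augmentation, a standard
# technique for low-resource NMT. The model call is a placeholder.

def reverse_translate(target_sentence: str) -> str:
    """Hypothetical target->source model; a real system would invoke
    a trained NMT model here instead of tagging the text."""
    return f"<synthetic-source of: {target_sentence}>"

def back_translate(monolingual_target: list[str]) -> list[tuple[str, str]]:
    """Turn monolingual target-language sentences into synthetic
    (source, target) pairs for training the source->target model."""
    return [(reverse_translate(t), t) for t in monolingual_target]

# Scarce human-translated data plus abundant monolingual data
# yields an augmented training set:
parallel = [("hello", "habari")]
mono = ["habari ya asubuhi", "karibu sana"]
augmented = parallel + back_translate(mono)
```

The augmented corpus mixes genuine and synthetic pairs; in practice the synthetic side is noisy, which is why the synthetic source is generated (rather than the target), keeping the target side clean for the decoder to learn from.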
Objective 1: Advancing low-resource deep learning for natural language applications

Our progress with regard to this objective has been significant. The strong research component of the project is reflected in our scientific publications (at the time of writing, 18 publications shared in an open-access manner via OpenAIRE, the European Open Science Initiative). Further work is under review or published as preprints or theses. We have released 22 repositories of research software accompanying these publications. All project outputs, such as software, datasets and trained models, are available at this URL:

Objective 2: Development of high-quality machine translation for under-resourced language pairs and domains

We address this objective by pursuing research into improving the data collection pipeline. We have released 10 data sets so far, both parallel and monolingual. We further ensure progress towards this objective through a nine-month cycle of building, delivering, and evaluating translation models for low-resource languages. We have completed two rounds of translation model building: in round one we delivered translation models into and out of English for Gujarati, Bulgarian, Turkish and Swahili; in round two, for Amharic, Kyrgyz, Serbian and Tamil. We prioritise languages which are strategically important for the BBC and DW; from that shortlist, the research partners then select languages which are interesting for their research and offer a variety of resources to work with. These models are released to the public.

Objective 3: Development of tools for analysts and journalists

Our progress towards this objective has been largely achieved through the development of the GoURMET translation platform. The platform is built on a serverless AWS architecture and currently supports seven of the eight languages delivered in the project. It is available via an API, and a demo front end is linked from the GoURMET website.
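To give a concrete sense of API access, the sketch below shows how a client might construct a translation request. The endpoint URL, JSON field names, and language codes are assumptions for illustration only, not the platform's documented interface; the request is built but deliberately not sent.

```python
import json
import urllib.request

# Hypothetical endpoint -- the real GoURMET API URL is not reproduced here.
API_URL = "https://api.example.org/translate"

def make_request(text: str, source: str, target: str) -> urllib.request.Request:
    """Build a POST request carrying a JSON translation payload.
    Field names ('q', 'source', 'target') are illustrative assumptions."""
    payload = json.dumps({"q": text, "source": source, "target": target}).encode()
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = make_request("Habari ya dunia", "sw", "en")
# A real client would now call urllib.request.urlopen(req) and parse the
# JSON response; omitted here because the endpoint is hypothetical.
```

A serverless architecture suits this access pattern well: each stateless translation request can be routed to an on-demand function, so capacity scales with newsroom demand rather than being provisioned per language.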
Scientific and Technological Impact:

Our main research focus during the project is machine translation, itself an active research field in which we can achieve substantial academic impact.
Our project has already spurred research in low-resource languages by creating test sets for the best-known machine translation shared task, WMT. We created a task for Gujarati in 2019 and for Tamil in 2020. We are working with Translators without Borders and the Masakhane project to promote future African-language shared tasks. Our data sets and models for low-resource languages provide building blocks for other researchers to continue our work.

Global Content Creation:

In 2017, the BBC World Service experienced its biggest expansion since the 1940s, with the goal of bringing its independent journalism to millions more people around the world, including in places where media freedom is under threat.
In order to make this expansion sustainable and affordable in the long term, the BBC and DW are committed to developing automated tools to augment their journalists' workflows. Machine translation speeds up content creation by providing an automated first pass at reversioning the text, allowing journalists to focus on retelling the story instead of performing rote translation first. This is, however, only the start of the benefit of machine translation for content creation. Another significant benefit is that journalists will be able to monitor the output of colleagues from other language teams, keeping up to date with news in related areas even where they do not speak the language.

Multilingual Media Monitoring and the Future of Journalism:

The rise of the Internet and social media have contributed to the democratisation of news. News and information have never been more easily and rapidly available, and the lines are getting blurred between news makers and news takers – readers have become sources and opinion makers in the news. Issues such as terrorism and global warming affect all countries around the world, and in order to understand the viewpoints and the concerns of a broad range of people it is essential that our media monitoring tools include language capabilities with a broader reach than has been traditionally possible.
This is where GoURMET's multilingual capabilities are key. We will be able to deliver coverage of countries at the forefront of the news, such as Nigeria and India, providing a truly global perspective.