Community Research and Development Information Service - CORDIS

H2020

HimL Report Summary

Project ID: 644402
Funded under: H2020-EU.2.1.1.4.

Periodic Reporting for period 1 - HimL (Health in my Language)

Reporting period: 2015-02-01 to 2016-07-31

Summary of the context and overall objectives of the project

The goal of the HimL project is to increase timely access to important public health information by making it available to consumers in their own language. We do this using high quality machine translation optimised for semantic fidelity. This addresses two distinct but related problems:

1) A large proportion of public health information is local. While general information about diabetes may be available in Polish, Romanian or Urdu, what a Polish speaker in Aberdeen, a Romanian speaker in Edinburgh or an Urdu speaker in Glasgow needs is local information about diabetes services and care, in their own language.
2) Best practice in public health care can change as a result of new research studies and/or meta-analyses. Recommendations in line with best practice need to be made available to consumers as soon as possible, through timely, accurate translation into their own language.

The first problem reflects the translation needs of national and regional public health services such as our partner NHS 24, while the second reflects the translation needs of trans-national health information providers such as our partner COCHRANE. We will address these needs by applying recent advances in machine translation (MT) to make their texts available in a timely fashion to a much wider range of language communities.
Fully automatic translation systems will be created to translate public health information from English into Czech, German, Polish and Romanian. This particular set of languages has been chosen because of the needs of our user partners, and because it covers the three major families of European languages (Slavic, Germanic and Romance). What is more, all four of these languages are classified as having “weak or no support” or “fragmentary support” according to the META-NET language white papers.

The innovations in MT that we adapt, integrate and apply in the project include:
1) Domain adaptation to build MT systems tailored to the public health domain in terms of terminology, register and reading level;
2) Semantically aware MT to improve translation fidelity, along with semantic evaluation to tune the systems for fidelity and serve as specific automated progress metrics;
3) Morphology prediction to translate from morphologically impoverished English to the morphologically richer target languages considered in this project.

Throughout the life of the project we will transfer the above improvements in MT from lab-based systems to live on-line health care services which are highly trusted and have large numbers of users. This project will allow these multi-lingual services to expand coverage to more content and to new languages, becoming more widely useful to European healthcare consumers. The HimL project places a great deal of emphasis on the deployment, evaluation and dissemination of current MT research, and the user partners are committed to proving both the usefulness and the impact of the project by running extensive user acceptance testing and collecting web traffic and web user feedback.

In summary, the implementational objectives of the project are:
1) Collate the latest research on high accuracy machine translation, to develop systems which are measurably more reliable, for our particular domain, than baseline state-of-the-art models.
2) Deploy translation engines as services with a simple interface and scalable performance.
3) Integrate translation functionality seamlessly into the content management workflow of two high-profile on-line healthcare information providers.
4) Add translation functionality to their websites, carefully managing user expectation and evaluating user satisfaction.
5) Comprehensively measure the impact of this new functionality on the services provided by NHS 24 and COCHRANE.

By achieving these more concrete objectives we will be taking steps towards achieving our global objectives:
1) Increase the accuracy of machine translation, making it more reliable and more widely useful.
2) Increase the availability and reliability of local public health information to recent immigrants.
3) Increase access to the latest best practices in health care information to people whose first language is not currently well supported.
4) Decrease the cost to public services of maintaining large amounts of multi-lingual content.

Work performed from the beginning of the project to the end of the period covered by the report and main results achieved so far

The workplan for HimL envisaged a system release in each year of the project, which would be integrated into each of the partners' websites by the end of the third quarter, then evaluated in the final quarter. We planned a phased incorporation of new technologies into the releases, evaluating the innovations as they are integrated.

At the time of the report, we have already integrated and evaluated the Year 1 (Y1) systems and are currently working on the integration of the Year 2 (Y2) systems. The Y2 systems incorporate domain adaptation in each of the language pairs, morphological processing for Czech and German, and the "core fidelity" technique to reduce the occurrence of clear semantic errors. Our aim for this year's system integration is to have all these components working together correctly in the translation server, so that they can be used to translate the user partners' websites. The translation is incorporated into the publication process employed by each of these partners, although at this stage of the project the translations only appear on non-public, development versions of the sites.

Our evaluation for Y1 trialled a new human-assisted method for semantic evaluation of machine translation, aiming to measure how much meaning is preserved in the translation and to pinpoint the errors in the translation.
Building on our experiences with this trial, we will develop the method further in Y2, using it to compare different types of system. We will also apply other types of evaluation, both human and automatic, to measure our progress between Y1 and Y2, and inform system development for the final year.

Behind the system building and integration efforts, the HimL partners have been developing and selecting technologies that could be used in the Y3 systems. In the "data and adaptation" work package, our main focus so far has been on harvesting suitable training data, and choosing the best techniques for making sure that the translations produced are appropriate for the domain (i.e., adapted). In the second half of the project we will also turn our attention to how best to use the large amounts of monolingual medical text to enhance our translation systems. In the "semantics" work package we have been working on various techniques to discourage the systems from producing semantically incorrect translations, for example by automatically removing the source of such translations from the models altogether, by exploiting automatic semantic analysers, and by applying techniques for dealing correctly with negation. In the "morphology" work package we have been improving our models for prediction of correct morphology in German and Czech, and extending them to Romanian and Polish.

Since the HimL project began there has been an important development in machine translation research which we are tracking carefully. We refer to the emergence of "neural network" or "deep learning" models for machine translation, known as "neural machine translation" (NMT). In evaluation campaigns (where researchers compare their systems against others on standard data sets) in 2015 and 2016, NMT systems have in many cases out-performed earlier types of MT systems. It is still early days for NMT, and there are many practical problems in deploying such systems, as well as questions about how well they will perform on specific domains, and about their potential bias towards fluency at the expense of adequacy. However, the field is moving very rapidly and HimL cannot afford to ignore this development. We are already building NMT systems with HimL data sets to evaluate against existing MT systems.

Progress beyond the state of the art and expected potential impact (including the socio-economic impact and the wider societal implications of the project so far)

Within HimL, we aim to make progress beyond state-of-the-art in five areas, which we describe below.

Data and Adaptation. We aim to create translation systems which are tailored for public health information text. To do this we gather all possible resources (starting with those collected for the previous EU project, Khresmoi) that could be useful in this domain. We supplement these resources with general purpose texts commonly used for translation systems, such as the European parliamentary proceedings (Europarl).
Since these general purpose texts are much larger than the in-domain texts we need to apply domain-adaptation techniques to prevent incorrect senses from being preferred over the correct in-domain senses in the translation output. The biggest problem with lack of in-domain data, however, is generally out-of-vocabulary words, particularly with regard to technical terms. To address this problem we employ additional terminology resources wherever possible and supplement these by mining terms from non-parallel texts.
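
As an illustration only, one simple form of domain adaptation is to interpolate translation probabilities from a small in-domain model with those from a large general model, giving more weight to the in-domain model. The sketch below is not taken from the HimL systems: the function name, the weight and the toy probabilities are all assumptions made for the example.

```python
def interpolate(in_domain, general, weight=0.8):
    """Linearly interpolate two translation probability tables.

    `weight` is the share given to the in-domain table (an assumed value).
    """
    targets = set(in_domain) | set(general)
    return {
        t: weight * in_domain.get(t, 0.0) + (1 - weight) * general.get(t, 0.0)
        for t in targets
    }


# Toy (invented) English -> German probabilities for the word "discharge".
# The general model prefers the electrical sense; the in-domain model
# prefers the senses that occur in health texts.
general = {"Entladung": 0.7, "Entlassung": 0.3}
in_domain = {"Entlassung": 0.6, "Ausfluss": 0.4}

mixed = interpolate(in_domain, general, weight=0.8)
best = max(mixed, key=mixed.get)
print(best)
```

With these toy numbers the in-domain sense wins even though the general model alone would have chosen the out-of-domain sense.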

Semantically Motivated MT. Central to our approach is the idea that translation of public health information needs to preserve the meaning of the source sentence, even if this may mean sacrificing some fluency. To this end, we apply recent research from QTLeap and other EU and non-EU projects, and use robust semantic evaluation metrics to validate our approaches. We incorporate shallow semantic parsing and semantic role labelling (SRL) into the translation systems, in order to ensure that translations which do not preserve the "who did what to whom" and appropriate polarity are penalised by the model. We also incorporate fidelity checks into the system using shallow syntactic information to reduce predictable errors.
Finally, we improve lexical semantics through existing large-scale high-quality dictionaries.
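
To make the idea of penalising polarity errors concrete, the following minimal sketch flags a candidate translation whose negation status differs from that of the source sentence. It is a deliberately crude stand-in: it matches surface tokens rather than using semantic parsing or SRL, and the word lists, function names and penalty value are assumptions for illustration, not the HimL implementation.

```python
# Toy negation markers (assumed, far from complete).
SOURCE_NEGATORS = {"not", "no", "never"}
TARGET_NEGATORS = {"nicht", "kein", "keine", "nie"}  # German


def has_negation(tokens, negators):
    """Return True if any token is a known negation marker."""
    return any(tok.lower() in negators for tok in tokens)


def polarity_penalty(source_tokens, target_tokens, penalty=1.0):
    """Return `penalty` when source and target disagree on negation, else 0."""
    src = has_negation(source_tokens, SOURCE_NEGATORS)
    tgt = has_negation(target_tokens, TARGET_NEGATORS)
    return penalty if src != tgt else 0.0


source = "Do not take this medicine with alcohol".split()
good = "Nehmen Sie dieses Arzneimittel nicht mit Alkohol ein".split()
bad = "Nehmen Sie dieses Arzneimittel mit Alkohol ein".split()

print(polarity_penalty(source, good))  # negation preserved
print(polarity_penalty(source, bad))   # negation lost
```

In a real system such a score would be one feature among many, so that a fluent translation which drops the negation is ranked below a less fluent one which keeps it.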

Morphology. The target languages in HimL (German, Czech, Polish and Romanian) all exhibit a degree of morphological complexity not found in the source language (English) and all possess case systems.
In order to generate accurate translations in these languages, therefore, we need to have mechanisms in place to ensure that the correct morphological variants are chosen. To achieve this, we refine techniques mainly developed by LMU Munich and Charles University for use in German and Czech, and apply them to all four HimL target languages.
These techniques include both the corrective approach (depfix) and the two-step approach, where we first translate to a simplified target language representation, then use a prediction model to generate the correct morphology.
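
The two-step approach can be sketched as follows, under strong simplifying assumptions. Step one is assumed to have already produced a simplified target representation (lemmas plus morphological features); step two is shown as a plain lookup table, whereas a real system would use a trained prediction model. The `Token` type, the table and the example are hypothetical.

```python
from typing import NamedTuple


class Token(NamedTuple):
    lemma: str
    features: tuple  # e.g. ("dat", "pl"); empty for uninflected words


# Toy inflection table standing in for a trained morphology model.
INFLECTIONS = {
    ("der", ("dat", "pl")): "den",
    ("Kind", ("dat", "pl")): "Kindern",
    ("der", ("nom", "pl")): "die",
    ("Kind", ("nom", "pl")): "Kinder",
}


def generate_surface(tokens):
    """Step two: map each (lemma, features) pair to a surface form.

    Falls back to the bare lemma when the pair is not in the table.
    """
    return [INFLECTIONS.get((t.lemma, t.features), t.lemma) for t in tokens]


# Assumed step-one output for the English phrase "with the children".
simplified = [
    Token("mit", ()),
    Token("der", ("dat", "pl")),
    Token("Kind", ("dat", "pl")),
]

surface = " ".join(generate_surface(simplified))
print(surface)
```

Separating the steps lets the translation model work with a smaller vocabulary of lemmas, leaving the choice of case and number endings to a component that can be trained on target-language text.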

Deployment. Staged deployment of our lab-based models is led by Lingea. They are implementing a simple API, similar to Google’s translation API, which is used directly by the content management systems of NHS 24 and Cochrane, to ensure that translation functionality is tightly and seamlessly integrated into the live websites. On the client side, NHS 24 and Cochrane ensure that the multi-lingual translation functionality is easy to use and that user expectations about the quality of machine translation are appropriately managed.
Once new functionality is deployed, we will run a marketing campaign which will create enough traffic to gather feedback for evaluation. Engaging with the public in this way will also help us to gain insight into the reactions and attitudes to MT content in this domain.

Evaluation. In the evaluation package, our aims are to measure the effectiveness of our improvements in MT, as well as to feed back diagnostic information on the MT systems we create. Evaluation ranges from fully automatic, to full-blown trials of MT "in the wild". We are developing new automatic and human-assisted semantic metrics to assist us in tuning our systems towards fidelity, based on previous work from Edinburgh and Charles University in this area. We also participate in relevant open evaluation campaigns to benchmark our research in MT in the public health information domain. Our user partners (i.e. NHS 24 and Cochrane) have translation needs which we will serve with state-of-the-art translation systems, thereby validating recent advances in the technology. NHS 24 and Cochrane will perform extensive user acceptance tests and, when our translation functionality goes live, they will gather valuable data on the reaction to fully automatic translation in our domain. This will enable us to assess the usefulness and impact of MT in demanding, real-life use cases.

In order to discuss the impact of the project, we focus on two different areas: the societal and economic impact, and the scientific and technological impact.

Within this project, the targets are two types of consumer-oriented public health information services that are critical to the well-being of citizens in the EU: local health information (enabling citizens to obtain information that is geographically useful and relevant to the facilities available to them), and best practice (enabling citizens to effectively evaluate and choose which medical treatments are appropriate to them). These correspond to our two user partners: NHS 24 and Cochrane. In the first case, the translation need arises since such health information is generally produced in the national language of the country, and many immigrant residents do not have a good command of this language. For the second type of information, this may well be produced in a widely spoken language like English, but should be available in as many languages as possible in order to maximise reach. The availability of both these types of public health information for each citizen in her or his own language is of critical importance for their well-being.

Even though the eventual impact of HimL in the public health domain will be significant, the application of high accuracy machine translation is certainly not limited to public health. Public information services provided by social services, law enforcement, immigration and many others would also benefit. Not only would they be able to expand the number of people that can access their services, they would also save money on expensive and slow human translation, and it would be far easier to maintain translations of documents which are constantly changing.

Cochrane reviews, and in particular their associated plain language summaries, are critical resources for healthcare consumers with a particular information need. A large amount of Cochrane content has been translated into Spanish and French, supported by national governments in both cases. The effect in terms of access to the Cochrane Summaries consumer portal was profound, with Spain changing from a minimal number of accesses in a year, to being one of the most active countries in terms of hits. The availability of the Spanish summaries resulted in a substantial increase in the number of visits from Spanish language countries, and these now make up 25% of all visits to the consumer portal. Both examples highlight the unmet need of non-English speakers (both consumers and health professionals) to access health information in their own language, and the potential of translations to address this gap. Both the French and Spanish projects have, however, faced difficulties with sustainability because they rely on high-cost professional translation, and it is clear that they need to explore other approaches if they want to be able to continue their translation efforts reliably.

The HimL project will therefore carry out work that is vital to Cochrane’s strategy, and will dramatically increase the access to Cochrane’s valuable medical information. In the short-term, the availability of high quality fully automatic machine translation will allow Cochrane to disseminate its information to the language communities in the project, and through that, to increase impact and capacity in Eastern Europe, which is a region that is currently underrepresented within Cochrane and a strong priority within Cochrane’s overall organisational strategy. In the longer-term, a successfully carried out Innovation Action in the HimL project will enable Cochrane to determine how to begin producing information in many more languages than it currently serves.

The HimL project targets translation from English into four languages (Romanian, German, Czech and Polish) which have a range of different resources, but have relatively low translation quality. We will bring together advances in domain adaptation, treatment of morphology and semantic fidelity in order to create translation systems which can be deployed for public health information. We expect the advances in translation quality we aim to create to be relevant for many other European languages which share similar properties, especially free word order and rich morphology.

But translation success for any language pair requires that core problems of language processing be addressed, as this project plans to do. This will not only impact the narrow application of machine translation, but also aid other natural language applications, such as question answering, dialogue systems, summarisation, text analytics, information retrieval and information extraction.

Finally, we note that HimL will aim to link with the rest of the program created to address the EU's ICT17 "Cracking the Language Barrier" call. This call aims to deliver improvements in translation quality, but these are only really useful if they can prove their mettle in "real" industry and application driven use-scenarios. To this end, HimL engages in partnership with the Research and Innovation project QT21. This partnership ensures that QT21 technology advances will be easily available to HimL to stress test the technologies in industry-driven application scenarios. Technologies developed in QT21 may excel in the lab, or in the shared task, but the key question is whether they can still excel in a real-world scenario.
