
Advanced Natural Language Interface for multilingual Text Generation in Healthcare


The ANTHEM project aims to build a prototype that translates medical diagnoses from French or Dutch into Dutch, French or German. The diagnosis, expressed in natural language, will also be encoded in a standard classification and will therefore be suitable for computer processing. The prototype, in the form of an Application Programming Interface (API), can be embedded in healthcare information systems.

The project aims to develop a portable prototype of a natural language interface that enables the user to enter medical diagnoses in Dutch or French into a healthcare information system. The prototype will translate the diagnoses into Dutch, French and German and at the same time encode them in ICD-9/10-CM, an international standardised classification scheme of diseases. The prototype will be delivered as an Application Programming Interface for further integration into other healthcare systems by third-party developers. A secondary aim is to apply the same sublanguage approach to the analysis of the standardised medical diagnostic expressions used in disease classification systems, in order to facilitate mapping to other systems and smooth the transition from older to newer versions (e.g., ICD-9-CM to ICD-10-CM).

The workplan includes:

1. the collation, structuring and tagging of corpora of medical diagnoses,
2. the modelling of this medical sublanguage using the interface structure of the CAT2 formalism developed during EUROTRA,
3. the representation of ICD-9-CM expressions using typed feature logic which by means of inheritance will support a hierarchical classification of terms,
4. the creation of a medical term lexicon in a format that makes it also accessible for other applications,
5. the development of software modules able to analyse the sublanguage input and to create an abstract semantic representation, to translate it into Dutch, French and German and to generate the relevant ICD-9/10-CM code,
6. the integration of the prototype into two existing healthcare systems and subsequent testing in a real medical environment.
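
Item 3 of the workplan relies on inheritance to obtain a hierarchical classification of terms. The Python sketch below illustrates that idea only; it is not CAT2 and not the project's typed feature logic. The class `TermType`, its feature names (`chapter`, `block`, `organ`, `category`, `course`) and the three example types are assumptions made for illustration. The ranges 390-459, 410-414 and 410 are the ICD-9-CM circulatory chapter, the ischaemic heart disease block and the acute myocardial infarction category.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TermType:
    """A node in a hypothetical type hierarchy of diagnostic terms."""
    name: str
    parent: Optional["TermType"] = None
    features: dict = field(default_factory=dict)

    def resolved(self) -> dict:
        """Return this type's features, with missing values inherited from ancestors."""
        inherited = self.parent.resolved() if self.parent else {}
        # Locally declared features override inherited ones.
        return {**inherited, **self.features}

    def is_a(self, other: "TermType") -> bool:
        """True if `other` is this type or one of its ancestors."""
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False


# Illustrative hierarchy: chapter -> block -> category.
circulatory = TermType("circulatory_disease", features={"chapter": "390-459"})
ischaemic = TermType(
    "ischaemic_heart_disease", circulatory, {"block": "410-414", "organ": "heart"}
)
acute_mi = TermType(
    "acute_myocardial_infarction", ischaemic, {"category": "410", "course": "acute"}
)

if __name__ == "__main__":
    print(acute_mi.resolved())
    print(acute_mi.is_a(circulatory))
```

The most specific type declares only what distinguishes it (`category`, `course`); the chapter, block and organ come from its ancestors, which is what makes the classification hierarchical.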

Approach and Methodology

The approach is to convert the input statements into a language-independent semantic representation, which will be used as an interlingua for the translation process. SNOMED (Systematised Nomenclature of Medicine) codes, which combine seven types of medical elements (topography, morphology, aetiology, function, disease, procedure and occupation), are the basic components of this semantic representation. Besides translation, the semantic representation of a given statement will also be used to generate the relevant illness code according to the ICD-9-CM classification.
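
The interlingua idea can be sketched in a few lines of Python: the input is analysed once into a language-independent representation, and both the translations and the classification code are derived from that representation rather than from the source text. This is a toy lookup, not the CAT2-based analysis the project describes. The identifiers `T:myocardium` and `M:infarct` are placeholders, not real SNOMED codes, and the tables `ANALYSIS`, `GENERATION` and `ICD9CM` as well as the functions `analyse` and `process` are hypothetical names introduced for the example.

```python
# Interlingua: a set of (SNOMED axis, placeholder code) pairs.
# The codes are placeholders for illustration, not real SNOMED identifiers.
MYOCARDIAL_INFARCTION = frozenset(
    {("topography", "T:myocardium"), ("morphology", "M:infarct")}
)

# Analysis lexicons for the two input languages (Dutch, French).
ANALYSIS = {
    "nl": {"myocardinfarct": MYOCARDIAL_INFARCTION},
    "fr": {"infarctus du myocarde": MYOCARDIAL_INFARCTION},
}

# Generation lexicons for the three output languages.
GENERATION = {
    "nl": {MYOCARDIAL_INFARCTION: "myocardinfarct"},
    "fr": {MYOCARDIAL_INFARCTION: "infarctus du myocarde"},
    "de": {MYOCARDIAL_INFARCTION: "Myokardinfarkt"},
}

# Classification table keyed on the same representation (illustrative entry).
ICD9CM = {MYOCARDIAL_INFARCTION: "410"}


def analyse(text: str, source_lang: str) -> frozenset:
    """Map an input diagnosis to its language-independent representation."""
    return ANALYSIS[source_lang][text.strip().lower()]


def process(text: str, source_lang: str) -> dict:
    """Translate a diagnosis and encode it, both via the interlingua."""
    meaning = analyse(text, source_lang)
    return {
        "translations": {lang: table[meaning] for lang, table in GENERATION.items()},
        "icd9cm": ICD9CM[meaning],
    }


if __name__ == "__main__":
    print(process("Infarctus du myocarde", "fr"))
```

Because every output is keyed on the representation, adding a target language or a second classification scheme means adding one table, without touching the analysis side.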

The diagnoses used as input statements are expressed in a well-delimited sublanguage with a high rate of nominalisation, so the input is well suited to unambiguous machine processing. Starting from corpora of medical diagnosis texts, the medical sublanguage will be modelled and implemented in the CAT2 representation formalism. The project will use and adapt existing CAT2 lingware to build the translation modules.

Some PROLOG/CAT2 predicates will be rewritten in C to optimise performance and to facilitate the integration of the API into a host application written in C.

Exploitation and Future Prospects

Most medical databases are only suited to the registration of factual knowledge, without any indication of the links between the facts or their rationale. Only natural language provides the user with sufficient expressiveness to record diagnoses in enough detail. The prototype will perform the difficult task of transcribing the original diagnosis into a format suitable for computer processing. Hence this interface will allow the user to work with a healthcare information system with the same flexibility as its paper-based counterpart.

The design of the prototype as an Application Programming Interface ensures portability and the possibility of integration into various applications. In the test phase the API will be embedded in two different systems: one, MEDIDOC, is an on-line healthcare information system; the other is used by the Belgian army to encode medical diagnoses in batch mode. The results will validate the approach, and with further adaptation the modelled linguistic knowledge may be reused for texts in other medical subdomains.


University Hospital Gent
De Pintelaan 185
9000 Gent

Participants (6)

Centre Universitaire

Datasoft Management NV


Hopital Militaire

Universite de Liege

Universität des Saarlandes
Martin-Luther-Straße 14
66111 Saarbrücken