The aim of the EMIR project is to validate, by means of a prototype, a linguistic and statistical approach to the indexing of free text and the multilingual querying of textual databases. The final goal is to let users query, in their own language, text databases written in various languages. It will also be possible to query, simultaneously and in a single language, databases containing texts in several different languages.
A feasibility study is being carried out into the automatic indexing of free text and the multilingual querying of text databases. At the end of the study, the tools and utilities designed for these purposes will have been embodied in a demonstration prototype. To develop it, existing tools will be used for tasks such as automatic indexing, which is based on statistical methods and on a linguistic treatment that employs morphological and syntactic analysis. The automatic indexing method produces, as part of the formatted database, a statistical model that can be used during the query answering phase to sort documents according to a relevance hierarchy. Monolingual queries in natural language can use a reformulation expert system which has a large vocabulary stock at its disposal. Work has started from an existing English/French prototype, which is being extended to an English/German pair; this requires the development of an analyzer for German. The French/German pair will follow, resulting in a trilingual query system. The methods and tools could then be applied to other languages. Multilingual text databases will be employed.
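The relevance ordering described above can be illustrated with a small sketch. It assumes a plain TF-IDF weighting, which is a common statistical model and not necessarily the one used in EMIR; the function names `tokenize`, `build_index` and `rank` are hypothetical, and the linguistic treatment (morphology, syntax) is replaced here by simple lowercasing and whitespace splitting.

```python
import math
from collections import Counter


def tokenize(text):
    # Stand-in for linguistic analysis: lowercase and keep alphabetic tokens.
    return [w for w in text.lower().split() if w.isalpha()]


def build_index(docs):
    """Indexing phase: per-document term counts plus an IDF table."""
    counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    n = len(docs)
    idf = {t: math.log(n / doc_freq[t]) + 1.0 for t in doc_freq}
    return counts, idf


def rank(query, counts, idf):
    """Query phase: return ids of matching documents, most relevant first."""
    q_terms = tokenize(query)
    scored = []
    for doc_id, c in enumerate(counts):
        score = sum(c[t] * idf.get(t, 0.0) for t in q_terms)
        if score > 0:
            scored.append((score, doc_id))
    # Sort by descending score, breaking ties by document id.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in scored]
```

The statistical model (here the `idf` table) is computed once at indexing time and stored with the database, so that answering a query only requires a lookup and a sort.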
A first prototype of the bilingual French-English query system has been developed. It is based on word-for-word translation.
A second prototype, capable of taking multi-word units and expressions into account, is currently in the experimental stage.
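The difference between the two prototypes can be sketched as follows. This is a minimal illustration under assumed data structures, not the project's implementation: the lexicons, their entries and the function `translate_query` are hypothetical. Multi-word expressions are matched first, longest match winning, and any remaining word falls back to word-for-word translation.

```python
# Hypothetical French-to-English lexicons, for illustration only.
WORD_LEXICON = {
    "pomme": ["apple"],
    "de": ["of"],
    "terre": ["earth", "ground"],
    "rouge": ["red"],
}
PHRASE_LEXICON = {
    ("pomme", "de", "terre"): ["potato"],
}


def translate_query(words, word_lexicon, phrase_lexicon, max_len=3):
    """Return one list of candidate translations per translated unit."""
    out = []
    i = 0
    while i < len(words):
        matched = False
        # Try the longest multi-word expression starting at position i.
        for n in range(min(max_len, len(words) - i), 1, -1):
            key = tuple(words[i:i + n])
            if key in phrase_lexicon:
                out.append(phrase_lexicon[key])
                i += n
                matched = True
                break
        if not matched:
            # Word-for-word fallback; unknown words are kept unchanged.
            out.append(word_lexicon.get(words[i], [words[i]]))
            i += 1
    return out
```

With an empty phrase lexicon the function behaves like a purely word-for-word translator, which shows why an expression such as "pomme de terre" is poorly served by that strategy alone.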
The final version of the bilingual prototype, integrating both kinds of translation, will be ready at the end of 1993. At the same time, a first version of the German monolingual prototype has been developed. It is based on a linguistic analysis whose morphological component includes the treatment of single-word compounds and relies on the full term dictionary. The syntactic analysis includes grammatical disambiguation and a simplified recognition of dependency relations.
The system developed within the project must be domain independent. When a new domain is processed, little work is needed to adapt the dictionaries, and the user is helped in this adaptation by tools developed inside the project. More specifically, a semi-automatic method has been developed to extract compounds and their translations from texts that have already been translated.
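The text does not say how the semi-automatic extraction works. One plausible ingredient, shown here purely as an assumption, is to score co-occurrences between a source term and target words across sentence-aligned translated texts, and to propose the best candidates to a human for validation. The function `candidate_translations` and the use of the Dice coefficient are illustrative choices, and the sketch handles only single-token source terms.

```python
from collections import Counter


def candidate_translations(aligned_pairs, source_term, top_n=3):
    """Propose target words that co-occur with source_term.

    aligned_pairs: list of (source_sentence, target_sentence) strings.
    Returns up to top_n (word, dice_score) tuples, best first.
    """
    src_count = 0            # sentences containing the source term
    tgt_counts = Counter()   # sentences containing each target word
    co_counts = Counter()    # sentence pairs containing both
    for src, tgt in aligned_pairs:
        tgt_words = set(tgt.lower().split())
        tgt_counts.update(tgt_words)
        if source_term in src.lower().split():
            src_count += 1
            co_counts.update(tgt_words)
    # Dice coefficient: 2 * |both| / (|source| + |target|).
    scored = [
        (2 * co / (src_count + tgt_counts[w]), w)
        for w, co in co_counts.items()
    ]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(w, round(score, 2)) for score, w in scored[:top_n]]
```

The output is a ranked list of suggestions only; the "semi-automatic" part is that a person accepts or rejects each proposed pair before it enters the dictionary.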
In order to demonstrate the generality of the approach, experiments are conducted on three languages: English, French, and German. Work on the English-French and French-German pairs is currently in progress. The German parser has been developed within the framework of the project. This parser specifically handles the splitting of compounds, which is crucial for information retrieval systems.
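Compound splitting matters for retrieval because a query for "Datenbank" should also find documents that contain "Datenbankabfrage". The sketch below assumes a simple dictionary-driven, longest-head-first strategy with an optional linking "s"; it is not the EMIR parser, and `split_compound`, `LINKERS` and the sample lexicon are hypothetical.

```python
LINKERS = ("", "s")  # assumed linking elements between compound parts


def _split(word, lexicon, min_len):
    # A known word is not decomposed further.
    if word in lexicon:
        return [word]
    # Try the longest possible head first.
    for i in range(len(word) - min_len, min_len - 1, -1):
        head = word[:i]
        if head not in lexicon:
            continue
        for linker in LINKERS:
            rest = word[i:]
            if not rest.startswith(linker):
                continue
            rest = rest[len(linker):]
            if len(rest) < min_len:
                continue
            parts = _split(rest, lexicon, min_len)
            if parts:
                return [head] + parts
    return None


def split_compound(word, lexicon, min_len=3):
    """Split a compound into known parts; return [word] if no split exists."""
    word = word.lower()
    parts = _split(word, lexicon, min_len)
    return parts if parts else [word]
```

Indexing each returned part alongside the full word lets a query on one component retrieve documents in which it only appears inside a compound.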