Bilingual Automatic Parallel Indexing and Classification

Objective

Most scientific and technical articles published today are written in English, and the number of English documents collected and maintained by information brokers and providers such as bibliographic database producers, libraries and publishers has increased rapidly. However, a significant number of documents remain available only in the native language of the author. Smart indexing processes are one way to provide reliable and accurate access to this information: they ensure consistent indexing across languages and allow the information to be presented multilingually.

The objective of the BINDEX project is to support information producers by integrating an existing generic solution, the AUTINDEX system, which automatically indexes and classifies documents in English and German, into the production processes of two user organisations. AUTINDEX takes advantage of sophisticated language processing technologies and of existing special-purpose language resources such as thesauri, classification schemes and large lexicons, which have to be adapted to the specific user requirements. Owing to its modular design, the outcome of BINDEX will comprise various software utilities for monolingual indexing and classification in English and German as well as for parallel bilingual indexing and classification, together with appropriate APIs to facilitate the integration of AUTINDEX into existing workflow environments.

Objectives:
The objective of BINDEX is to support information producers by integrating a generic solution, the AUTINDEX system, which automatically indexes and classifies documents in English and German, into the production processes of the two users. The expected advantage is a quicker, cheaper and more consistent population of information repositories. AUTINDEX takes advantage of sophisticated language processing technologies and of existing special-purpose language resources such as thesauri, classification schemes and large lexicons, which have to be adapted to the specific user requirements. Owing to its modular design, the outcome of the project will be various mature, applicable software utilities for monolingual indexing and classification in English and German as well as for parallel bilingual indexing and classification, together with appropriate APIs to facilitate the integration of AUTINDEX into existing workflow environments.

Work description:
The aim of the BINDEX trial is to integrate the prototype of the AUTINDEX system, which automatically indexes and classifies bilingual documents, into the production processes of the two users involved. The outcome will be a mature, applicable software utility for bilingual (English and German) indexing and classification. Thanks to a modular approach and well-defined APIs, the system could easily be extended to cover other languages as well. The AUTINDEX approach is based on a controlled vocabulary and advanced natural language processing technologies. The controlled vocabulary is provided by a classical thesaurus together with a specialised bilingual dictionary, which merges the IAI German-English and English-German transfer dictionaries with a so-called conversion dictionary that maps the different descriptor types of one language onto those of the other. The linguistic processing provides all the information necessary to assign thesaurus concepts to the words of a document, including multiword units (i.e. to index it). It does so by performing a morpho-syntactic analysis followed by term recognition based on shallow parsing combined with statistical techniques. Classification of documents is likewise based on the output of the linguistic processing and on the classification schemes already in use at the users' sites. Within this trial, the AUTINDEX system will be adapted to the requirements of the two users involved: in a first step the monolingual modules of the complete system will be adapted and improved, and in a second phase the bilingual component will be enhanced. All modules will then be at the same functional level. The whole system will be implemented as a web service; appropriate multilingual user interfaces will therefore be developed, as well as APIs to integrate the system into the production cycle of potential users.
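The controlled-vocabulary idea described above can be illustrated with a minimal sketch. It is not the AUTINDEX implementation: the thesaurus entries, the conversion dictionary and all function names are hypothetical, and the morpho-syntactic analysis and shallow parsing are replaced by a plain longest-match lookup over listed surface variants, with a simple occurrence count standing in for the statistical techniques.

```python
import re
from collections import Counter

# Hypothetical toy thesaurus: surface variant (lowercase, may be a
# multiword unit) -> English descriptor. A real system would derive the
# variants through morphological analysis instead of listing them.
THESAURUS_EN = {
    "electric motor": "Electric motors",
    "electric motors": "Electric motors",
    "heat pump": "Heat pumps",
    "heat pumps": "Heat pumps",
    "corrosion": "Corrosion",
}

# Hypothetical conversion dictionary: English descriptor -> German descriptor.
CONVERSION_EN_DE = {
    "Electric motors": "Elektromotoren",
    "Heat pumps": "Wärmepumpen",
    "Corrosion": "Korrosion",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-zäöüß]+", text.lower())


def index_document(text, thesaurus, max_len=3):
    """Return (descriptor, count) pairs, most frequent first.

    At each position the longest token sequence (up to max_len tokens)
    that is a known thesaurus variant is matched, so multiword units
    take precedence over their individual words.
    """
    tokens = tokenize(text)
    counts = Counter()
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in thesaurus:
                counts[thesaurus[candidate]] += 1
                i += n
                break
        else:
            i += 1  # no variant starts at this token
    return counts.most_common()


def parallel_index(text, thesaurus, conversion):
    """Attach the target-language descriptor (or None) to each hit."""
    return [(d, conversion.get(d), c)
            for d, c in index_document(text, thesaurus)]


if __name__ == "__main__":
    sample = ("Electric motors drive the heat pump. "
              "Corrosion damages an electric motor.")
    for source, target, count in parallel_index(
            sample, THESAURUS_EN, CONVERSION_EN_DE):
        print(source, "|", target, "|", count)
```

In this sketch the conversion dictionary is applied to descriptors rather than to the words of the text, which is one way to keep the English and German index entries aligned; how AUTINDEX actually combines the transfer and conversion dictionaries is not specified here.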

Milestones:
Three milestones can be identified: the first marks the further improved German indexing and classification component of the AUTINDEX system, the second the elaborated English component, and the third the bilingual (English and German) component. All three components will be integrated into the workflow of the two users involved in the trial and will be evaluated intensively. The expected result is a usable, near-market software package. A demonstrator will also be available.

Funding Scheme

ACM - Preparatory, accompanying and support measures

Coordinator

FACHINFORMATIONSZENTRUM TECHNIK E.V.
Address
Ostbahnhofstrasse 13
60314 Frankfurt
Germany

Participants (2)

GESELLSCHAFT ZUR FOERDERUNG DER ANGEWANDTEN INFORMATIONSFORSCHUNG E.V.
Germany
Address
Martin-Luther-Strasse 14
66111 Saarbruecken
THE INSTITUTION OF ELECTRICAL ENGINEERS
United Kingdom
Address
Savoy Place
WC2 0BL London