Conceptual Retrieval of Information using Semantic dicTionary in three Languages


The CRISTAL project addresses the area of text retrieval and indexing. The project will develop a multilingual (French, English and Italian) natural language interface for retrieving monolingual (French) text from a corpus of newspaper articles. The system will integrate linguistic methods and information retrieval techniques.

The goal is to provide access to textual information by matching query and text concepts rather than by string or keyword matching. Thus the project will provide the ability to search for an idea, without requiring knowledge of the texts examined or mastery of a cryptic query language. The project will reuse an existing conceptual dictionary, Dicologique, as well as results from other projects: SIMPR, PLUS and COBALT. The project will be carried out along two development axes.

1. The adaptation of the French conceptual dictionary, Dicologique, a device that maps natural language lexemes into concepts. This involves expanding the structure to accommodate multilingualism (English and Italian) and carrying out the semantic analysis of English and Italian subsets. The software tools needed to consult and update the conceptual dictionary will be built during the project.
2. The development of a concept-based information retrieval environment that includes an indexing module, a search engine and a dialogue management module. The interface will first accept a natural language query and then refine and disambiguate this query through a dialogue with the user. Concept-based retrieval will be investigated.

The consortium is composed of industrial partners, research organisations and a user. The user will provide real corpora and will participate in the requirements specification. The project will demonstrate and evaluate the techniques and tools through final end-user applications.

Approach and Methodology

The conceptual dictionary Dicologique is organised as a multi-hierarchical tree structure in which each leaf corresponds to a word or a phrase. A variety of types are used to characterise the nodes. These types enable the grouping of concepts by topic, the building of links (IS_A, SORT_OF, PART_OF...), the linking of near-synonyms, and the grouping of concepts by characteristics (size, shape...).
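The structure described above can be pictured with a minimal sketch. It is not the Dicologique data model: the class name `ConceptDictionary`, its methods and the sample entries are all invented for illustration. It only shows what "multi-hierarchical" implies in practice, namely that a node may have several parents and that links carry a type such as IS_A or PART_OF.

```python
class ConceptDictionary:
    """Toy multi-hierarchical structure (hypothetical, not Dicologique itself)."""

    def __init__(self):
        # node name -> link type -> set of target node names
        self.links = {}

    def add_link(self, source, link_type, target):
        """Record a typed link such as IS_A or PART_OF from source to target."""
        self.links.setdefault(source, {}).setdefault(link_type, set()).add(target)
        self.links.setdefault(target, {})  # register the target node as well

    def ancestors(self, name, link_type="IS_A"):
        """Return every node reachable by repeatedly following one link type."""
        seen, stack = set(), [name]
        while stack:
            current = stack.pop()
            for parent in self.links.get(current, {}).get(link_type, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Invented sample entries: "chien" has two IS_A parents.
dico = ConceptDictionary()
dico.add_link("chien", "IS_A", "canide")
dico.add_link("chien", "IS_A", "animal_domestique")
dico.add_link("canide", "IS_A", "animal")
dico.add_link("roue", "PART_OF", "voiture")
```

Keeping the link type as a key makes it possible to traverse one relation at a time, so that an IS_A query never follows a PART_OF link by accident.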

The conceptual dictionary will be extended to multilingualism by giving English and Italian equivalents for each concept of the studied subset, on a one-to-one basis. Where exact word equivalents are not available, phrases will be used. These direct links between words will avoid creating three isomorphic semantic structures for the three languages used in the query input.
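One way to read this design is as a single concept structure with a per-language lexicon attached to it. The sketch below assumes that reading; the concept identifiers, the `LEXICON` table and both functions are hypothetical, and the word pairs are illustrative rather than taken from the project.

```python
# (language, lexeme) -> concept identifier. A phrase stands in where a
# language has no exact one-word equivalent for the concept.
LEXICON = {
    ("fr", "journal"): "C_NEWSPAPER",
    ("en", "newspaper"): "C_NEWSPAPER",
    ("it", "giornale"): "C_NEWSPAPER",
    ("fr", "dépaysement"): "C_DISORIENTATION",
    ("en", "change of scenery"): "C_DISORIENTATION",
    ("it", "spaesamento"): "C_DISORIENTATION",
}


def concept_for(lang, lexeme):
    """Map a word or phrase in a given language to its shared concept."""
    return LEXICON.get((lang, lexeme.lower()))


def french_equivalents(lang, lexeme):
    """Return the French lexemes attached to the same concept, if any."""
    concept = concept_for(lang, lexeme)
    if concept is None:
        return []
    return sorted(
        word
        for (entry_lang, word), entry_concept in LEXICON.items()
        if entry_lang == "fr" and entry_concept == concept
    )
```

Because all three languages point at the same concept identifiers, only the lexicon grows when a language is added; the concept hierarchy itself is not duplicated.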

The parser will comprise a morphological analyser, a syntactic analyser and an interpretation component. The main purpose of parsing is to disambiguate the syntactic senses of words, both in the document texts and in the natural language queries. The interpretation component of the parser will then map the syntactic output to the conceptual dictionary. Ambiguities will be resolved whenever possible from the context of the document or of the query conversation; otherwise they will be resolved by questioning the user. The dialogue manager will be simpler than in other systems because the expected user response is constrained. The demonstrator of the Esprit project PLUS is the starting point for the dialogue module.
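The "context first, user second" strategy can be sketched as follows. This is an assumption-laden illustration, not the project's algorithm: the overlap score, the function `disambiguate`, the `ask_user` callback and the sample senses are all invented. The constrained user response is modelled by offering only the remaining candidate senses as choices.

```python
def disambiguate(word, senses, context_concepts, ask_user):
    """Pick a sense for `word`.

    senses: dict mapping a sense label to the set of concepts related to it.
    context_concepts: concepts already seen in the document or dialogue.
    ask_user: callback (word, options) -> chosen option, used as a fallback.
    """
    if not senses:
        raise ValueError(f"no candidate senses for {word!r}")
    # Score each sense by how many related concepts appear in the context.
    scores = {label: len(related & context_concepts)
              for label, related in senses.items()}
    best = max(scores.values())
    winners = sorted(label for label, score in scores.items() if score == best)
    if best > 0 and len(winners) == 1:
        return winners[0]  # the context settles it
    # Otherwise ask, restricting the answer to the tied candidates.
    return ask_user(word, winners)


# Invented example: French "avocat" can mean a lawyer or an avocado.
SENSES = {
    "avocat/juriste": {"tribunal", "justice", "proces"},
    "avocat/fruit": {"fruit", "cuisine", "salade"},
}


def first_option(word, options):
    """Stand-in for a real dialogue turn: always picks the first option."""
    return options[0]
```

With a context such as `{"tribunal", "juge"}` the lawyer sense is chosen without any question; with an empty context the callback is invoked with both labels.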

To enable multilingual access to the text database, the documents will be indexed monolingually but queries will be processed multilingually. The concepts extracted from the query are substituted by their target (French) equivalents, which are then used in the indexing process. A formal notion of semantic distance will be defined during the project, and a threshold will allow concepts that are too distant to be filtered out of the matching process.
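Since the formal notion of semantic distance is still to be defined, the sketch below substitutes a simple stand-in: the number of links on the shortest path between two concepts in an undirected concept graph. The graph contents, the function names and the threshold value are assumptions made for illustration only.

```python
from collections import deque


def build_graph(pairs):
    """Build an undirected adjacency map from (concept, concept) links."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def semantic_distance(graph, start, goal):
    """Shortest path length in links (a stand-in metric), or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def filter_matches(graph, query_concept, indexed_concepts, threshold):
    """Keep indexed concepts within `threshold` of the query, nearest first."""
    kept = []
    for concept in indexed_concepts:
        dist = semantic_distance(graph, query_concept, concept)
        if dist is not None and dist <= threshold:
            kept.append((concept, dist))
    return sorted(kept, key=lambda item: (item[1], item[0]))


# Invented hierarchy fragment.
GRAPH = build_graph([
    ("voiture", "vehicule"), ("camion", "vehicule"), ("vehicule", "objet"),
    ("pomme", "fruit"), ("fruit", "objet"),
])
```

Under these assumptions, a query on "voiture" with a threshold of 2 keeps "voiture" and "camion" and discards "pomme", which lies four links away.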

Exploitation and Future Prospects

The project is carried out by an industrially based consortium and the coordinator has a solid reputation. A follow-up to the project could turn the prototype into a commercial product. The project aims at a generic application that provides electronic access to current information. Potential applications include access to remote data banks over network services such as Minitel "information kiosks" and direct access to bulk data distributed on CD-ROM. The approach is domain independent, and the system could also be adapted for public information suppliers or for engineering purposes (technical documentation, maintenance manuals, test reports, ...).

The major improvement compared to off-the-shelf products results from the combination of:

1. multilinguality: the user is able to access information in a foreign language without needing a perfect knowledge of that language,
2. the ability to access information in free natural language,
3. the ability to search for an idea as opposed to keyword matching.

The project expects scientific results in the fields of man-machine communication, dialogue management and conceptual dictionary building. It will test the theoretical models developed in previous years in a concrete commercial domain. Cooperation is foreseen with other LRE indexing and information retrieval projects.


Cap Gemini Innovation
86/90 Rue Thiers
92513 Boulogne-Billancourt

Participants (5)

Cap Volmac

Consiglio Nazionale delle Ricerche (CNR)
Via Della Faggiola 32
56100 Pisa

L'Europeenne de Donnees

University of Manchester Institute of Science and Technology (UMIST)
Sackville Street
M60 1QD Manchester
United Kingdom