
Construction, augmentation and use of knowledge bases from natural language documents

Objective

COBALT is concerned with the problem of capturing factual and definitional knowledge from machine-readable textual sources for assertion in an existing knowledge base. The general aim of the project is to demonstrate how different state-of-the-art natural language engineering technologies and pieces of software can be integrated to build a system that supports better exploitation of information in the financial domain through enhancements to a knowledge base. A message routing application will be built to demonstrate one of the possible uses of the tools and techniques developed in the project.

The technical goal is to improve the performance of off-the-shelf text categorisation systems by integrating current text categorisation techniques with state-of-the-art knowledge representation and selected natural language parsing and understanding techniques. The project will exploit existing technology, research results and software modules and concentrate R&D efforts on integration issues.

The aim is thus to achieve innovative European results in text categorisation by building on leading technology in this field and on state-of-the-art NLP technology. An Interest Group composed of major Italian banks will be set up in the course of the project to provide input and feedback.

The general background of the project is the widely acknowledged difficulty of setting up, augmenting and using large knowledge-based systems (LKBSs), which arises from the impossibility of manually encoding all the information to be stored (the so-called knowledge acquisition bottleneck).

The project is based on two assumptions:
for many LKBS applications, a great deal of the necessary knowledge already exists as printed or computer-readable text;
the state of the art in artificial intelligence (AI) and computational linguistics (CL) allows (semi-)automatic processing of such texts to translate their basic semantic content into a knowledge representation language (KRL) suitable for many different high-level LKBS applications.

The prototype COBALT system will classify and store each new item in a knowledge base (KB) according to a defined hierarchical structure of categories, to be used for application-specific storing, summarising and retrieval tasks. From a functional point of view the prototype will thus belong to the class of text categorisation systems. The basic idea is to exploit text categorisation for a first-level, broad categorisation of items and for selecting relevant text portions to be analysed later with natural language understanding techniques (parsing and semantic analysis). The combined results of the two analyses will enhance the original KB and thus form the basis for a more accurate, second-level category assignment.
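The two-level flow described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the COBALT design: keyword counting stands in for the first-level text categoriser, a single regular expression stands in for parsing and semantic analysis, and every name in it (BROAD_KEYWORDS, ACQUISITION, KnowledgeBase, the category labels) is a hypothetical choice made for the example.

```python
import re
from dataclasses import dataclass, field

# Hypothetical keyword lists for the first-level (broad) categories.
BROAD_KEYWORDS = {
    "mergers": {"acquire", "acquires", "merger", "takeover"},
    "earnings": {"profit", "profits", "earnings", "dividend"},
}

# Hypothetical pattern standing in for parsing and semantic analysis.
ACQUISITION = re.compile(r"(?P<buyer>[A-Z]\w+) acquires (?P<target>[A-Z]\w+)")


@dataclass
class KnowledgeBase:
    """A toy KB: a set of (predicate, arg, ...) tuples."""
    facts: set = field(default_factory=set)

    def assert_fact(self, predicate, *args):
        self.facts.add((predicate, *args))


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(sentence):
    return set(re.findall(r"[a-z]+", sentence.lower()))


def first_level(text):
    """Broad categorisation plus selection of the relevant sentences."""
    sentences = split_sentences(text)
    scores = {
        cat: sum(len(tokens(s) & kws) for s in sentences)
        for cat, kws in BROAD_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return None, []
    relevant = [s for s in sentences if tokens(s) & BROAD_KEYWORDS[best]]
    return best, relevant


def second_level(broad, relevant, kb):
    """Analyse only the selected sentences, enhance the KB, refine the category."""
    new_facts = []
    for sentence in relevant:
        match = ACQUISITION.search(sentence)
        if match:
            new_facts.append(("acquires", match["buyer"], match["target"]))
    for fact in new_facts:
        kb.assert_fact(*fact)
    if broad == "mergers" and new_facts:
        return "mergers/acquisition"
    return broad


def categorise(text, kb):
    broad, relevant = first_level(text)
    if broad is None:
        return None
    return second_level(broad, relevant, kb)
```

The point of the split is that the cheap first pass decides which sentences are worth the more expensive analysis, and the facts produced by that analysis are what allow the finer category to be assigned.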

A running demonstration will be developed in the first phases of the project and will evolve to further levels of complexity following an incremental life cycle. The basic language for the prototype system will be English. A feasibility study for the adaptation of the prototype to other languages and other application domains is envisaged within the project.

Currently the European IT industry has no significant presence in text categorisation technology and its applications: the main products come from the USA. COBALT will start from state-of-the-art results and extend them with new technologies, greater benefits and more functionality.

The main results of COBALT will be:

a practical test of how current state-of-the-art approaches in NLP and AI can support the transfer of knowledge recorded in natural language texts (financial domain) into knowledge bases;
the definition and prototypical implementation of a basic technology for advanced text categorisation tasks.

Experimentation in the intelligent routing of online financial news will be carried out; the R&D activity in COBALT can therefore be expected to be directly exploited in the realisation of a new generation of intelligent routing applications. The basic technology, however, will be able to support the development of a wide range of applications based on text categorisation. Some potential application areas are:

automated information filtering and routing, extended to domains other than finance, for people and companies that use news wire feeds or generic text;
text classification for information vending services, performed either automatically or interactively with domain experts;
intelligent navigation of large archives and databases, for example CD-ROM navigation with simple hypertext capabilities derived from a KB description and structuring of the CD contents.

Quinary plans to exploit the project results in two ways:

product development: the new COBALT product on information routing will be added to Quinary's existing line of banking products.
technology transfer services: the project results are a natural extension of the collection of technologies Quinary is able to support in consulting, training and systems development services.

UMIST will exploit the results internally to enhance its teaching of natural language processing and knowledge engineering techniques, and as background technology for future research and development projects, for example in robust text processing. STEP Informatique envisages enhancing the Legal Advisory Systems developed in the ESPRIT II NOMOS Project (in which it is a partner) with COBALT-derived techniques, and is ready to take part in creating a commercial product of this type engineered from the results of the COBALT project.

Coordinator

Quinary SpA
Address
Via Crivelli 51/1
20121 Milano
Italy

Participants (2)

Step Informatique
France
Address
20 Rue Martel
75010 Paris
University of Manchester Institute of Science and Technology (UMIST)
United Kingdom
Address
Sackville Street
M60 1QD Manchester