EUROWORDNET will produce a multilingual database for use in a variety of applications, including machine-aided translation and high-quality information retrieval. The database will establish basic semantic relations between words for several European languages. The wordnets will then be linked to the American wordnet for English to derive a shared top-ontology. By providing easy access to words and related meanings, the resulting resources will allow the terms used in a user enquiry to be expanded to any set of closely related terms in a language, which improves the recall of information retrieval.
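The query expansion described above can be illustrated with a minimal sketch. It assumes a hand-made table of related terms; the table `RELATED`, the function `expand_query` and the example words are hypothetical and are not taken from the EuroWordNet database or its interface.

```python
from typing import Dict, List, Set

# Toy relation table, invented for illustration only (not EuroWordNet data).
RELATED: Dict[str, Set[str]] = {
    "car": {"automobile", "motorcar"},
    "automobile": {"car"},
}


def expand_query(terms: List[str], related: Dict[str, Set[str]]) -> List[str]:
    """Return each query term followed by its directly related terms, without duplicates."""
    expanded: List[str] = []
    seen: Set[str] = set()
    for term in terms:
        # Keep the original term first, then add related terms in a stable order.
        for candidate in [term, *sorted(related.get(term, set()))]:
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded


print(expand_query(["car", "hire"], RELATED))
# ['car', 'automobile', 'motorcar', 'hire']
```

Searching with the expanded list matches documents that mention 'automobile' but not 'car', which is where the gain in recall would come from. Terms without an entry in the table, such as 'hire' here, pass through unchanged.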
The database has been designed with a number of important characteristics:
- it is multi-lingual,
- it can handle language-specific information extracted from diverse resources,
- it provides a formal system usable in information retrieval applications as well as in the development of more complex knowledge bases of the future.
The database also has some innovative features, such as:
- an 'interlingual index', a pool of concepts (the superset of all language-specific concepts), where all shared language independent information is stored,
- a facility to label relations such as disjunction, conjunction, factivity, reversal and negation,
- encoding of explicit semantic relations across parts-of-speech,
- different interpretations of particular WordNet 1.5 relations,
- addition of new relations.
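As a rough illustration of how an interlingual index can connect language-specific wordnets, the sketch below links word meanings in different languages through a shared concept identifier. It is a simplification under stated assumptions: the class names, the field names, the identifier `ili-dog` and the rule of one index record per word meaning are all hypothetical and do not describe the actual EuroWordNet data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Synset:
    """A language-specific word meaning (all field names are hypothetical)."""
    language: str
    words: List[str]
    ili_id: str  # assumed: exactly one interlingual index record per meaning
    # Language-internal relations as (relation name, target synset key) pairs.
    relations: List[Tuple[str, str]] = field(default_factory=list)


class MultilingualDB:
    """Toy store that groups synsets by their interlingual index record."""

    def __init__(self) -> None:
        self.synsets: Dict[str, Synset] = {}
        self.by_ili: Dict[str, List[str]] = {}

    def add(self, key: str, synset: Synset) -> None:
        self.synsets[key] = synset
        self.by_ili.setdefault(synset.ili_id, []).append(key)

    def equivalents(self, key: str, target_language: str) -> List[str]:
        """Words in target_language that share the index record of synset `key`."""
        ili_id = self.synsets[key].ili_id
        words: List[str] = []
        for other_key in self.by_ili.get(ili_id, []):
            other = self.synsets[other_key]
            if other.language == target_language:
                words.extend(other.words)
        return words


db = MultilingualDB()
db.add("en:dog", Synset("en", ["dog"], "ili-dog", [("has_hyperonym", "en:animal")]))
db.add("nl:hond", Synset("nl", ["hond"], "ili-dog"))
db.add("it:cane", Synset("it", ["cane"], "ili-dog"))
db.add("es:perro", Synset("es", ["perro"], "ili-dog"))

print(db.equivalents("en:dog", "nl"))  # ['hond']
```

Because each language links only to the shared index and never directly to another language, a wordnet for a further language could be added without touching the existing ones.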
For each relation, language-specific test sentences are provided, with examples, to verify the relation between word pairs. Guides for coding the semantic relations in each language will be published. These manuals provide a check-list and explain how the tests should be applied to derive the semantic relations between word meanings. Finally, the database will have a specialised interface to cope with the complexity of a multi-lingual semantic database.
The coding of the first subset of the most fundamental meanings, the so-called 'base concepts', is in progress. Base concepts are used to define more specific concepts and their meanings in the languages involved. They have the most relations and occupy major positions in semantic hierarchies or taxonomies. A common set of base concepts has been defined by applying similar criteria across all the languages involved (English, Dutch, Italian and Spanish).
The User Requirement and the Market
As direct user involvement in the project is quite modest, a larger user group of interested companies and institutions, currently with 35 members, has been established. The purpose of this group is to create a wider awareness of the use of this type of resource and to establish co-operation with other groups with a common interest. Each member of the user group receives key deliverables and data samples and is asked to provide feedback.
A number of different types of user have been identified:
- publishers, interested either in providing the initial resources or in the development of similar products (dictionaries, thesauri etc.),
- research institutes and R&D departments of companies working in the field of knowledge engineering or linguistic databases,
- organisations using or applying similar resources in the development of services or products which need multi-lingual semantic resources, such as WWW search engines,
- end-users interested in products helping them to manage their information resources.
The Way Ahead
The core of the database will be finished and extended with an official version of the merged top-ontology, and the results will be verified.
The aim is to produce a rich, high-quality coding of semantic relations and equivalence relations for a common set of about 5,000 base concepts in the four languages by mid-1997. By the end of the year this first subset will have been verified and made available for testing. The resources will be tested and demonstrated in an information retrieval system by Novell, one of the partners in the consortium.
Discussion fora and workshops have been arranged, where the project will lead discussion on the design, validation and standardisation of multi-lingual semantic databases.
Funding Scheme
CSC - Cost-sharing contracts