Community Research and Development Information Service - CORDIS


The paper presents a language-independent approach to controlled-vocabulary keyword assignment using the EUROVOC thesaurus. Because EUROVOC is multilingual, the keywords assigned to a document written in one language can be displayed in all eleven official European Union languages. Mapping documents written in different languages to the same multilingual thesaurus also allows cross-language document comparison. The thesaurus descriptors are assigned by a statistical system that is trained on a collection of manually indexed documents: for each descriptor, it identifies a large number of lemmas that are statistically associated with that descriptor. During assignment, these associated lemmas are used to produce a ranked list of the EUROVOC terms most likely to be good keywords for a given document. The paper also describes the challenges of this task and discusses the results achieved with the fully functional prototype.
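
The two stages described above (learning lemma-descriptor associations from manually indexed documents, then ranking descriptors for a new document) can be illustrated with a minimal sketch. This is not the system described in the paper: the abstract does not state which association statistic is used, so the sketch assumes a simple log-ratio of a lemma's relative frequency under a descriptor to its relative frequency in the whole collection. The function names, the `min_weight` and `top_n` parameters, the descriptor labels and the toy training data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_associations(indexed_docs, min_weight=0.0):
    """Learn lemma weights per descriptor from (lemmas, descriptors) pairs.

    Assumption: weight = log(P(lemma | descriptor) / P(lemma)); only lemmas
    whose weight exceeds min_weight are kept as "associated" lemmas.
    """
    background = Counter()                 # lemma counts over all documents
    per_descriptor = defaultdict(Counter)  # lemma counts per descriptor
    for lemmas, descriptors in indexed_docs:
        background.update(lemmas)
        for descriptor in descriptors:
            per_descriptor[descriptor].update(lemmas)

    total = sum(background.values())
    associations = {}
    for descriptor, counts in per_descriptor.items():
        descriptor_total = sum(counts.values())
        weights = {}
        for lemma, count in counts.items():
            ratio = (count / descriptor_total) / (background[lemma] / total)
            weight = math.log(ratio)
            if weight > min_weight:
                weights[lemma] = weight
        associations[descriptor] = weights
    return associations


def assign_descriptors(lemmas, associations, top_n=5):
    """Return up to top_n (descriptor, score) pairs, best first."""
    doc_counts = Counter(lemmas)
    scores = {}
    for descriptor, weights in associations.items():
        score = sum(weights[lemma] * n
                    for lemma, n in doc_counts.items() if lemma in weights)
        if score > 0:
            scores[descriptor] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical, already lemmatised training documents with manual descriptors.
training = [
    (["fish", "quota", "fishing", "vessel"], ["FISHERIES"]),
    (["fish", "catch", "vessel", "port"], ["FISHERIES"]),
    (["tax", "income", "rate"], ["TAXATION"]),
    (["tax", "vessel", "duty"], ["TAXATION"]),
]

associations = train_associations(training)
ranked = assign_descriptors(["fish", "vessel", "quota"], associations)
print(ranked)  # FISHERIES is the only descriptor with a positive score
```

Because the descriptors are thesaurus entries rather than words of the document's language, the same ranked list could be rendered in any language for which the thesaurus provides labels; the lemmatisation step that produces the input lists is language-specific and is left out of this sketch.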

Additional information

Bibliographic Reference: An oral report given at: XVII Congress of Spanish Society for Natural Language Processing. Organised by: Spanish Society for Natural Language Processing. Given at: Jaen (ES), 12-14 September 2001