Community Research and Development Information Service - CORDIS

Abstract

An approach to calculating the semantic similarity of documents written in the same or in different languages is presented. The similarity calculation is achieved by representing the document contents in a language-independent way, using the descriptor terms of the multilingual thesaurus EUROVOC, and by then calculating the distance between these documents. While EUROVOC is a carefully handcrafted knowledge structure, all procedures use statistical techniques. The method was applied to a collection of 3318 English and Spanish parallel texts and evaluated by measuring the number of times the translation of a given document was identified as the most similar document. The very good results showed the feasibility and usefulness of the approach.
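The pipeline described above can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: each document is taken to be already mapped to a sparse vector of weighted EUROVOC descriptors, cosine similarity stands in for the unspecified distance measure, and the descriptor identifiers (`D001` and so on), weights, and function names are hypothetical placeholders rather than real EUROVOC data or the authors' code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse descriptor-weight vectors (dicts)."""
    dot = sum(weight * b[desc] for desc, weight in a.items() if desc in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, candidates):
    """Return the id of the candidate vector most similar to the query."""
    return max(candidates, key=lambda doc_id: cosine_similarity(query, candidates[doc_id]))


def translation_hit_rate(source_docs, target_docs):
    """Fraction of source documents whose translation (same id) ranks first."""
    hits = sum(
        1 for doc_id, vec in source_docs.items()
        if most_similar(vec, target_docs) == doc_id
    )
    return hits / len(source_docs)


# Hypothetical toy collection: descriptor ids are language-independent,
# so English and Spanish documents share the same vector space.
english = {
    "doc1": {"D001": 0.9, "D002": 0.4},
    "doc2": {"D003": 0.8, "D004": 0.5},
}
spanish = {
    "doc1": {"D001": 0.7, "D002": 0.6},
    "doc2": {"D003": 0.9, "D004": 0.3, "D002": 0.1},
}

rate = translation_hit_rate(english, spanish)
print(f"Translation ranked first for {rate:.0%} of documents")
```

Because both languages are projected onto the same descriptor identifiers, no translation step is needed before comparison; the evaluation simply counts how often a document's own translation is its nearest neighbour.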

Further information on the Third International Conference on Intelligent Text Processing and Computational Linguistics is available on the World Wide Web at: http://www.cicling.org/2002

Additional information

Authors: STEINBERGER R, JRC, IPSC, CSCF, Ispra (IT); POULIQUEN B, JRC, IPSC, CSCF, Ispra (IT); HAGMAN J, JRC, IPSC, CSCF, Ispra (IT)
Bibliographic Reference: An oral report given at: The Third International Conference on Intelligent Text Processing and Computational Linguistics. Held in: Mexico City (MX), 17-23 February 2002