Community Research and Development Information Service - CORDIS

CORLEONE, Core Linguistic Entity Online Extraction

Funded under: FP7-ICT

Abstract

This report presents CORLEONE (Core Linguistic Entity Online Extraction), a pool of loosely coupled, general-purpose, lightweight linguistic processing resources that can be used independently to identify core linguistic entities and their features in free text. Currently, CORLEONE consists of five processing resources: (a) a basic tokenizer, (b) a tokenizer that performs fine-grained token classification, (c) a component for morphological analysis, (d) a memory-efficient, database-like dictionary look-up component, and (e) a sentence splitter. Linguistic resources for several languages are provided. Additionally, CORLEONE includes a comprehensive library of string distance metrics relevant to the task of name variant matching. CORLEONE has been developed in the Java programming language and makes heavy use of state-of-the-art finite-state techniques. Notably, CORLEONE components are used as basic linguistic processing resources in ExPRESS, a pattern matching engine based on regular expressions over feature structures, and in the real-time news event extraction system developed at the JRC by the Web Mining and Intelligence Action of IPSC. This report constitutes an end-user guide for CORLEONE and provides some scientifically interesting details of how it was implemented.
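
To illustrate what a string distance metric for name variant matching might look like, the following is a minimal Java sketch of the classic Levenshtein edit distance with a length-normalised similarity score. It is not taken from CORLEONE: the abstract does not say which metrics the library contains or what its API looks like, so the class name NameVariantDistance and the methods levenshtein and similarity are hypothetical names chosen for this example only.

```java
// Illustrative sketch only; NOT the CORLEONE API. All names here are hypothetical.
final class NameVariantDistance {

    private NameVariantDistance() {}

    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]; 1.0 means the two strings are identical.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) levenshtein(a, b) / maxLen;
    }

    public static void main(String[] args) {
        // Two spellings of the same surname differ by one substitution.
        System.out.println(levenshtein("Piskorski", "Piskorsky")); // prints 1
        System.out.println(similarity("Piskorski", "Piskorsky"));
    }
}
```

In a name variant matching setting, two name strings would typically be treated as candidate variants when such a similarity score exceeds a chosen threshold; the appropriate threshold, and whether edit distance is the right metric at all, depends on the language and the data.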

Additional information

Authors: PISKORSKI J, European Commission, Joint Research Centre, Institute for the Protection and the Security of the Citizen, Ispra (IT)
Bibliographic Reference: EUR 23393 EN (2008), 32pp. Free of charge
Availability: http://bookshop.europa.eu/is-bin/INTERSHOP.enfinity/WFS/EU-Bookshop-Site/en_GB/-/EUR/ViewPublication-Start?PublicationKey=LBNA23393 (Catalogue Number: LB-NA-23393-EN-C)
Record Number: 200910503 / Last updated on: 2009-12-11
Category: PUBLICATION
Original language: en
Available languages: en