Bibliographic Texts Compositional Analysis

Project information

BIBLIOTECA

Grant agreement ID: 2023

Closed project

Start date 10 January 1994

End date 9 May 1995

Funded under

Specific programme of research and technological development (EEC) in the field of telematic systems in areas of general interest - Libraries -, 1990-1994

Total cost

No data

EU contribution

No data

Coordinated by

Universidad Complutense de Madrid
Spain

Objective

This project attempts to integrate work in the fields of:

automatic retrospective conversion of library catalogues;
OCR/ICR technology;
natural language processing.

This project attempts to define and implement tools for intelligent recognition, analysis and transformation of information contained in a variety of library-type documents.

The project developed a toolbox that allows analysis of the field/subfield structure underlying bibliographic references, dictionary entries, indexes of scientific periodicals, etc. An intelligent document recognition system has been developed to enhance existing OCR/ICR technology through better image pre-processing, segmentation and incremental feedback from analysis of the documents. A structured analysis procedure breaks texts down into progressively shorter units until individual informational elements are made explicit. These elements are then transferred to SGML, which enables a normalised structure model to be developed.
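The breakdown of a reference into fields and subfields, followed by transfer to markup, can be illustrated with a minimal sketch. It is not the BIBLIOTECA toolbox itself: the reference layout ("Authors (Year). Title. Source."), the function names and the SGML-style element names (`ref`, `author`, `year`, `title`, `source`) are all assumptions chosen for illustration.

```python
import re
from xml.sax.saxutils import escape

# Assumed layout for one class of references: "Authors (Year). Title. Source."
# A real system would need one such field parser per document class.
REFERENCE = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+"
    r"(?P<title>.+?)\.\s+(?P<source>.+?)\.?$"
)


def split_authors(authors: str) -> list[str]:
    """Break the author field into subfields, one per author."""
    return [a.strip() for a in re.split(r";|\band\b", authors) if a.strip()]


def analyse(reference: str) -> dict:
    """Split a reference into fields, then split the author field further."""
    match = REFERENCE.match(reference.strip())
    if match is None:
        raise ValueError(f"reference does not fit the assumed layout: {reference!r}")
    fields = match.groupdict()
    fields["authors"] = split_authors(fields["authors"])
    return fields


def to_sgml(fields: dict) -> str:
    """Emit the explicit elements as SGML-style markup (hypothetical element names)."""
    lines = ["<ref>"]
    for author in fields["authors"]:
        lines.append(f"  <author>{escape(author)}</author>")
    for name in ("year", "title", "source"):
        lines.append(f"  <{name}>{escape(fields[name])}</{name}>")
    lines.append("</ref>")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "Smith, J.; Jones, A. (1992). Parsing catalogue cards. Journal of Library Automation 5, 10-20."
    print(to_sgml(analyse(sample)))
```

The two-stage split (fields first, then subfields within the author field) mirrors the idea of shortening units until each informational element is explicit; the sample reference is invented.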

The toolbox has been tested on indexes and references in scientific periodicals, tables of contents and catalogue cards.

Impact and results:

The BIBLIOTECA toolbox will substantially decrease the cost, time and effort involved in creating and updating bibliographic databases by replacing manual analysis and keyboard entry with intelligent document reading, using scanning, OCR/ICR and artificial intelligence techniques.

The benefits and results of this project include:

creation of keyword indexes from tables of contents and indexes in books;
creation of article databases from content pages in serials;
creation of citation indexes from bibliographic references;
investigation of the possibility of more advanced 'intelligent' systems for indexing and classification;
automatic transformation of card files into standard formats.

Deliverables

The main deliverable from this project is a system which can:

produce keyword indexes from tables of contents and book indexes;
create article databases from contents pages in serials;
generate citation indexes from bibliographic references;
provide bibliographic databases from printed bibliographical dictionaries;
automatically transform card files into standard formats.

Development software and certain specifications, lists and technical documents were categorised as restricted.

Deliverables in the public domain are:

Detailed and top level work plans;
Technical reports: text selection criteria, draft framework, field structure, integrity criteria;
Appraisal: strengths, weaknesses, costs, benefits, performance;
Testing, examples and results;
Project reports (including final).

Technical approach:

The project was organised into seven workpackages:

Corpus selection and analysis;
Intelligent document recognition, enhanced with pre-processing functions, high level character segmentation and incremental processing;
Automatic field compositional analysis - outputting field parsers for different document classes;
Context based error detection and correction;
Conversion to SGML and translation to bibliographic database, CD-ROM and MARC variant;
Training, testing and evaluation;
Dissemination of information and exploitation.

The project used a 'rapid prototyping' methodology for software development.

Key issues:

BIBLIOTECA integrated work in the fields of:

automatic retrospective conversion of library catalogues;
OCR/ICR technology;
natural language processing

to develop a toolbox for analysis of informal and formal field/subfield structures underlying indexes of periodicals, dictionaries, tables of contents and bibliographical references.

It then resolved the following issues:

Adaptation of OCR/ICR to librarians' needs and upgrade of image pre-processing, segmentation and feedback from document analysis.
Devising a flexible system for analysis of semi-formatted documents, with contextual error detection, correction and feedback.
Breakdown of texts into explicit elements suitable for transformation into SGML (Standard Generalised Markup Language), which can facilitate conversion to other formats, such as MARC.
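Contextual error detection and correction, as listed above, can be sketched for a single field. The example below is an illustration only and does not reproduce the project's method: once analysis has identified a field as a publication year, that context justifies mapping letters that OCR commonly confuses with digits back to digits and rejecting implausible values. The confusion table, the year range and the function name are assumptions.

```python
# Assumed confusion table: letters that OCR output may contain in place of digits.
DIGIT_LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)


def correct_year(raw: str, earliest: int = 1450, latest: int = 1995) -> tuple[str, bool]:
    """Return (value, ok) for a string already identified as a year field.

    The field context supplies the constraint: four ASCII digits within a
    plausible range. If the corrected candidate does not satisfy it, the raw
    text is returned unchanged and flagged for review.
    """
    candidate = raw.strip().translate(DIGIT_LOOKALIKES)
    ok = (
        len(candidate) == 4
        and candidate.isascii()
        and candidate.isdigit()
        and earliest <= int(candidate) <= latest
    )
    return (candidate if ok else raw, ok)


if __name__ == "__main__":
    for text in ("l99O", "19S4", "abcd"):
        print(text, "->", correct_year(text))
```

Returning a flag rather than raising an error lets a caller feed rejected values back to the recognition stage, in the spirit of the feedback loop described above.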

Documentation is available from the contact below and from http://www.csic.es/cbic/teca.htm.

Fields of science (EuroSciVoc)

CORDIS classifies projects with EuroSciVoc, a multilingual taxonomy of fields of science, through a semi-automatic process based on NLP techniques. See: The European Science Vocabulary.

Programme(s)

Multi-annual funding programmes that define the EU's priorities for research and innovation.

FP3-LIBRARIES - Specific programme of research and technological development (EEC) in the field of telematic systems in areas of general interest - Libraries -, 1990-1994

Topic(s)

Calls for proposals are divided into topics. A topic defines a specific subject or area for which applicants can submit proposals. The description of a topic comprises its specific scope and the expected impact of the funded project.

4.17 - New bibliographic record products and services applying internationally recognized standards

Call for proposals

Procedure for inviting applicants to submit project proposals with the aim of receiving EU funding.

Data not available

Funding scheme

Funding scheme (or "type of action") within a programme with common features. The funding scheme specifies the scope of what is funded, the reimbursement rate, the specific evaluation criteria to qualify for funding, and the simplified forms of cost coverage, such as lump sums.

Data not available

Coordinator

Universidad Complutense de Madrid

EU contribution

No data

Address

Edificio Filosofia B., Ciudad Universitaria
28040 Madrid
Spain

Total cost

No data

Participants (6)

Biblioteca Nazionale Napoli

Italy

EU contribution

No data

Address

Piazza Plebiscito
80100 Napoli

Total cost

No data

C.BIC/CISC

Spain

EU contribution

No data

Address

Total cost

No data

Database Informatica SpA

Italy

EU contribution

No data

Address

Total cost

No data

Instituto Cervantes

Spain

EU contribution

No data

Address

Total cost

No data

Matra Cap Systèmes

France

EU contribution

No data

Address

Total cost

No data

Thamus Consorzio per la Linguistica Computazionale

Italy

EU contribution

No data

Address

Via Mercantesse 3
20021 Baranzate di Bollate

Total cost

No data
