Telematics for Libraries - Projects
Updated: 15 JUN 99
Project Number and Title: 1047 - MARC Optical Recognition (MORE)
FP 3/ IV
Keywords: OCR/ICR; retrospective conversion; library catalogues; structure recognition; character recognition
- New bibliographic record products and services applying internationally recognised standards (Theme: 17)
- The project's key goals were to evaluate the feasibility of OCR/ICR as an approach to the retrospective conversion of printed library catalogues, through:
- development of a prototype tool;
- integration of the prototype into a production environment;
- testing and assessment of methods under real conditions.
- The retrospective conversion of library catalogues depends equally on character conversion of the data and on coding of the data's structure. Previous work investigated OCR but with only limited automatic treatment of structure and formatting. Taking a printed national bibliography as its source of records, the project used state-of-the-art OCR/ICR tools and integrated them with an approach to structure recognition based on ODA (Office Document Architecture) in order to generate high-quality, UNIMARC-formatted records.
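To illustrate how character conversion and structure coding fit together, the sketch below takes one line of already-recognised text and codes its structure as UNIMARC-style fields. It is not the project's ODA-based method: the entry layout, the regular expression, the sample entry, and the function name `parse_entry` are all hypothetical. Only the tags and subfields (200 $a, 210 $a/$c/$d, 700 $a/$b) follow standard UNIMARC bibliographic usage.

```python
import re

# Hypothetical layout of one printed entry (an assumption for illustration):
#   Surname, Forename. Title. - Place : Publisher, Year.
ENTRY_PATTERN = re.compile(
    r"^(?P<surname>[^,]+),\s*(?P<forename>[^.]+)\.\s*"
    r"(?P<title>.+?)\.\s*-\s*"
    r"(?P<place>[^:]+?)\s*:\s*(?P<publisher>[^,]+),\s*(?P<year>\d{4})\.?$"
)


def parse_entry(text):
    """Map one recognised entry to UNIMARC-style (tag, subfields) pairs.

    Returns None when the text does not fit the expected layout, so the
    caller can pass the entry to an operator instead of guessing.
    """
    match = ENTRY_PATTERN.match(text.strip())
    if match is None:
        return None
    g = {name: value.strip() for name, value in match.groupdict().items()}
    return [
        ("200", {"a": g["title"]}),                                      # title proper
        ("210", {"a": g["place"], "c": g["publisher"], "d": g["year"]}),  # publication
        ("700", {"a": g["surname"], "b": g["forename"]}),                # personal name
    ]


# Invented sample entry, not taken from the 'Bibliographie de Belgique 1973'.
record = parse_entry(
    "Dupont, Jean. Histoire de la ville. - Bruxelles : Editions Exemple, 1973."
)
```

A fixed pattern like this only works when every entry follows the same layout, which is the homogeneity condition the project itself places on the catalogues it can process.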
- MORE was divided into three phases: specification, development and evaluation. Within the phases, tasks were scheduled over seven workpackages, including:
- Technical specifications;
- Structure recognition;
- Character recognition;
- Testing & acceptance of software;
- Production test.
- The system converts printed catalogues directly into machine-readable form via OCR. The tools for character and structure recognition can be configured to process any catalogue that has a sufficiently homogeneous structure.
- When errors or other exceptions occur, the image of the original document is displayed with the problem highlighted, together with the best-estimate solution and alternatives. Verified data is converted to high-quality, UNIMARC-formatted records.
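The exception handling described above can be pictured as a confidence check on each recognised token. This is a minimal sketch rather than the project's implementation: the `Candidate` type, the `review_item` function, the confidence scores, and the 0.90 threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    confidence: float  # assumed recogniser score between 0.0 and 1.0


def review_item(candidates, threshold=0.90):
    """Return (best, alternatives, needs_review) for one recognised token."""
    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
    best = ranked[0]
    alternatives = [c.text for c in ranked[1:]]
    # Flag the token for the operator when even the best reading is uncertain.
    return best.text, alternatives, best.confidence < threshold


best, alternatives, needs_review = review_item([
    Candidate("Bruxelles", 0.62),
    Candidate("Bruxe11es", 0.71),
    Candidate("Bruxclles", 0.40),
])
# best is "Bruxe11es"; it is flagged for review, with the other two
# readings offered to the operator as alternatives.
```

Keeping the lower-ranked readings, instead of discarding them, is what lets the operator pick the correct form with a single choice when the recogniser's best estimate is wrong.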
- The prototype was tested under production conditions using the 'Bibliographie de Belgique 1973', selected because its records pre-dated current layout standards. Nevertheless, the success of the tests clearly demonstrated the viability and potential of the method.
- The main technical issues explored were:
- Role and use of dictionaries, both generic and derived from the specific application;
- Analysis and modelling of library catalogue data structures;
- Integration of structure and character recognition tools.
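One way to picture the role of dictionaries is as a correction step applied to each recognised token. The sketch below uses approximate string matching from Python's standard `difflib` module, which is a substitute technique and not the one used in the project. The word lists, the `correct_token` function, and the 0.8 cutoff are hypothetical.

```python
import difflib

# Tiny sample dictionaries, invented for illustration.
GENERIC_WORDS = ["histoire", "ville", "bibliographie"]   # general vocabulary
CATALOGUE_WORDS = ["Bruxelles", "Louvain", "Namur"]      # application-specific terms


def correct_token(token, dictionaries, cutoff=0.8):
    """Return the closest dictionary entry, or the token unchanged if none is close."""
    # An exact hit in any dictionary means the token needs no correction.
    for words in dictionaries:
        if token in words:
            return token
    # Otherwise try each dictionary in priority order for a near match.
    for words in dictionaries:
        matches = difflib.get_close_matches(token, words, n=1, cutoff=cutoff)
        if matches:
            return matches[0]
    return token


corrected = correct_token("Bruxclles", [CATALOGUE_WORDS, GENERIC_WORDS])
```

Listing the application-specific dictionary first gives catalogue terms such as place names priority over general vocabulary when both contain a near match.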
Impact and results
- The project will permit the extension and application of existing techniques to other domains of document processing in library catalogues.
- The results include:
- specifications of record structure analysis and recognition;
- a prototype workstation for OCR/ICR and structure recognition of printed library catalogue records;
- sample conversions of printed national bibliographic records;
- a report on the feasibility and cost-effectiveness of the approach.
- Input accuracy, targeted at 99.8%, is comparable to double-keying standards. Input speed, however, is much greater, and the treatment of errors is more immediate and informative, with document handling largely eliminated.
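As a rough illustration of what the 99.8% target implies, the sketch below converts an accuracy figure into an expected number of residual errors. It assumes the figure is measured per character, which this summary does not state, and the 500-character record length is an arbitrary example value.

```python
def expected_errors(characters, accuracy=0.998):
    """Expected number of wrong characters at a given per-character accuracy."""
    return characters * (1.0 - accuracy)


# Under these assumptions, a 500-character record carries about one error.
per_record = expected_errors(500)
```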
- The method is technically and commercially feasible as the basis for a catalogue conversion system. As such, it would be expected to at least halve human involvement in the process.
- The production-tested prototype can be adopted as a commercial-grade workstation for the retrospective conversion (RECON) of printed library catalogues.
- Software design and specification documents are deliverables of the project but have restricted availability.
- Other published reports cover:
- An evaluation of the prototype;
- An evaluation of tests on the 'Bibliographie de Belgique 1973'.
Name of Institution/Organisation
Title, First Name, Name
18, rue Saint Denis
Postal Code / City: F - 75025 PARIS CEDEX 01
+33-1 44 76 86 20
+33-1 44 76 86 39
Name of Institution/Organisation
- Centre de Recherche Informatique, Nancy
- Bibliothèque Royale Albert 1er