The project studies the feasibility of scanning the contents pages of technical journals so that information on the individual items in each issue can be included in online library catalogues. The principal goal is a feasibility study, but a pilot prototype will also be developed.
It also looks at the feasibility of integrating the text of journal articles into online catalogues and the possibility of generating the text of catalogue entries through scanning of contents pages. The results of the project will not be specific to any one library but ought to be applicable to any online library catalogue.
Impact and results:
RIDDLE has demonstrated that the automatic capture of journal contents pages, and even journal data, for inclusion in OPACs is feasible.
The prototype provides a sound basis for the development of a commercially available system, especially if ported to a Windows PC.
In addition to a demonstration system, the following reports are in the public domain:
User and Technical Requirements;
Communications and Transmission Issues.
The project was divided into logical workpackages:
User and technical requirements, in which library staff, users and other interested parties were canvassed. Needs were determined for catalogue entries in terms of information content and format:
accuracy and speed of availability;
ease of operation;
cost, performance and efficiency of a suitable system;
Scanning, which reviewed the state-of-the-art of hardware and software in the library environment;
Scanned Image to Text, which examined conversion to meaningful text (while preserving typographical information) via OCR/ICR;
Translation of contents-page text to online library catalogue format, which identified the relevant parts of contents pages and automatically converted them to OPAC format;
Communications, which dealt with problems of transmitting data to remote OPACs over public networks;
Exploitation, which culminated in the production of a prototype system. Means of exploitation via a market study were examined, along with possible barriers, such as intellectual property rights, technology issues and resource commitment;
Information dissemination using traditional media, such as articles and conferences, and electronically on the World Wide Web.
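The translation workpackage above can be illustrated with a minimal sketch. It is not the RIDDLE implementation: the line layout ("Title / Authors ..... page"), the names ContentsEntry, parse_contents and to_catalogue_record, and the record field names are all assumptions made for illustration, and real contents pages vary far more than this pattern allows.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical layout of one OCR'd contents line (an assumption, not from RIDDLE):
#   "Title / Author A, Author B ..... 123"
ENTRY_PATTERN = re.compile(
    r"^(?P<title>.+?)\s+/\s+(?P<authors>.+?)[\s.]+(?P<page>\d+)\s*$"
)


@dataclass
class ContentsEntry:
    title: str
    authors: List[str]
    start_page: int


def parse_contents(ocr_text: str) -> List[ContentsEntry]:
    """Keep only lines that look like article entries; skip headings and noise."""
    entries = []
    for line in ocr_text.splitlines():
        match = ENTRY_PATTERN.match(line.strip())
        if match is None:
            continue  # not an article entry under the assumed layout
        authors = [a.strip() for a in match.group("authors").split(",") if a.strip()]
        entries.append(
            ContentsEntry(
                title=match.group("title").strip(),
                authors=authors,
                start_page=int(match.group("page")),
            )
        )
    return entries


def to_catalogue_record(entry: ContentsEntry, journal: str, issue: str) -> Dict[str, object]:
    """Map an entry to a flat record; the field names are illustrative only."""
    return {
        "journal": journal,
        "issue": issue,
        "title": entry.title,
        "authors": entry.authors,
        "start_page": entry.start_page,
    }
```

Calling parse_contents on the recognised text of a page and passing each entry to to_catalogue_record yields one record per article. A real system would then map these records to the target catalogue's own format.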
The current state of the art in scanning technology has been investigated. International industry and formal standards, such as SGML (Standard Generalized Markup Language, ISO 8879:1986) for text markup, the TIFF image file format and JPEG for image compression, have also been examined in this context. Having analysed the current technical possibilities for image capture, recognition and storage, the project explored the integration of such data into the online catalogue and developed a prototype system.
The main technical issues are:
- The integration of scanned images into online catalogues;
- The specification of a general, non-application-specific solution to imaging problems in library online catalogues.
Documentation is available from the contact below and at http://www.cwi.nl/cwi/projects/riddle.html.
OX11 0QX Didcot
1098 SJ Amsterdam