CONCERTATION MEETING ON DIGITAL IMAGING
held on 7 November 1994 at the European Commission, Luxembourg
This is a summary report of the concertation meeting on the subject of digital imaging held by the European Commission's Libraries Unit (DG XIII-E-3) at the Jean Monnet Building in Luxembourg on 7 November 1994. The full text of the papers presented will be published separately.
Experience on the state of the art and on 12 imaging developments was presented by 16 experts, most of whom are directly involved in collaborative R&D projects. The day was judged by those present to be successful and worthwhile. As a result, the Unit expects to organise further such days on different topics.
Five main technology areas were covered (listed below). Papers were presented on these technologies by invited external experts (updating and scene-setting papers), and by representatives of projects, both those funded by the Commission and those that are not. A number of common issues were identified, some of which may be worthy of specific further actions. This is being considered within the Libraries Unit.
There are currently many library research projects in Europe which are using, or investigating the use of, digital imaging. Several of these projects are funded by the European Commission's Libraries Programme. This Concertation Day was organised to give members of some of these projects an opportunity to exchange experiences, for the benefit of all concerned. Particular emphasis was placed on allowing opportunities for discussion. It is the second Concertation Day organised by the Libraries Unit on issues of interest to several project teams.
The day was organised by the Libraries Unit with assistance from Marc Fresko of Applerace Limited, who also drafted this summary.
SUMMARY OF PRESENTATIONS
The day's agenda comprised five topics divided into fifteen presentations. The five topics were:
- Intelligent Character Recognition;
- Networking and Delivery;
- Image Retrieval and Recognition.
As a detailed account of each paper is given in the accompanying publication, this section consists of a summary of the main issues and common themes which arose.
The pace of developments in scanner technology is relatively gentle. However, costs are falling with increased manufacturing volumes; desktop drum scanners are a particular case. A significant standard, TWAIN, is now adhered to by a number of manufacturers; this standard makes it easier to use different scanners with one application system.
Research by the RIDDLE project suggests that most libraries do not have a scanner, but that they are prepared to purchase a flatbed device if necessary. Libraries have some special needs for some applications, notably a bookedge when scanning bound materials, low light exposures when scanning ancient works and a careful consideration of the drop-out colours in some applications. Sometimes, scanners' standard features (intended for commercial use) are unhelpful, and have to be deactivated. Frequently the scanning will be intended for Intelligent Character Recognition (ICR) which introduces another set of complexities; see below.
Ease of use remains a concern, as (in some applications at least) scanning can be difficult and labour intensive. To an extent, the operators have to adapt their way of working to the demands of the technology.
Scanning from microfilm is difficult for several reasons. As discovered by project INCIPIT, problems are particularly acute when scanning microfilm of incunabula. As work is concentrated largely on selected pages, one film does not contain long runs of images which are consistently sized and positioned. The project team has addressed these problems in two ways:
- by developing new operational quality standards (e.g. regarding positioning) for microfilming, to ensure that future microfilms are suitable for efficient scanning rather than just for archival;
- by bypassing many of the microfilm scanners' automatic features in order to gain more control over the scanning process to ensure that the image is faithful to the original.
Unsurprisingly, copyright remains an issue which causes concern for many potential applications. This is not further examined here, as it has been studied elsewhere.
Storage media have developed, and are continuing to develop fast - possibly faster than any of the other key technologies. The developments are in the direction of:
- increasing storage densities;
- increasing speeds;
- decreasing costs;
- greater diversity of choice.
This is perhaps just as well, for imaging projects are notorious for creating large files, and demanding correspondingly large amounts of storage. Some standards are applicable, but typically they are rapidly overtaken, having a commercial lifetime far shorter than the media they refer to. Libraries with projects involving long term storage should ensure that media refreshing and/or long-term migration is planned and budgeted. Libraries with numerous projects should also consider in-house "rules" or standards for storage (and compression and formats etc.) to ensure that different imagebases can be accessed consistently in the future.
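To give a sense of scale for the storage demands mentioned above, the following is a minimal sketch of the arithmetic, not a figure taken from any of the projects. It assumes an uncompressed raster image, a hypothetical A4 page scanned at 300 dpi, and a hypothetical 650-million-byte disc; the function names are illustrative only.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw size in bytes of a scanned page, before any compression."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def images_per_medium(capacity_bytes, image_bytes):
    """Whole images that fit on one medium, ignoring file-system overhead."""
    return capacity_bytes // image_bytes


# Assumptions for illustration only: A4 page size in inches, 300 dpi,
# and a disc holding 650 million bytes.
A4 = (8.27, 11.69)
DISC_BYTES = 650 * 10**6

bitonal = uncompressed_bytes(*A4, dpi=300, bits_per_pixel=1)   # black and white
colour = uncompressed_bytes(*A4, dpi=300, bits_per_pixel=24)   # 24-bit colour

print("bitonal page:", bitonal, "bytes,", images_per_medium(DISC_BYTES, bitonal), "per disc")
print("colour page: ", colour, "bytes,", images_per_medium(DISC_BYTES, colour), "per disc")
```

Under these assumptions a colour page needs 24 times the space of a bitonal one, which is why in-house rules on resolution, bit depth and compression have such a large effect on storage budgets.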
CD-R technologies have seen particularly rapid growth due to spectacular price reductions and the growing availability of CD-ROM drives. They are already used by some projects (ELISE, British Library's PIX).
Despite the advances in storage technologies, cost-effective architectures for large databases with distributed users were a widespread concern. This is examined below in the section on networking.
Intelligent Character Recognition
The accuracy of Intelligent Character Recognition (ICR) has increased significantly in recent years. For example, a survey conducted at the University of Las Vegas showed improvements in character accuracy in several leading packages averaging 10% in 1994 and 25% to 50% in 1993.
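This report does not state how the survey measured character accuracy. One common definition, assumed here purely for illustration, counts the character insertions, deletions and substitutions needed to turn the recognised text into the correct text (the edit distance) and relates that count to the length of the correct text. A minimal sketch, with hypothetical function names:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def character_accuracy(truth, recognised):
    """Fraction of ground-truth characters not in error (assumed definition)."""
    if not truth:
        raise ValueError("ground-truth text must not be empty")
    errors = edit_distance(truth, recognised)
    return (len(truth) - errors) / len(truth)


# Hypothetical example: the letter 'l' misread as the digit '1'.
print(character_accuracy("library", "1ibrary"))
```

With a definition of this kind, small percentage differences matter: at 98% accuracy a page of 2,000 characters still contains about 40 errors to be corrected by hand.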
At the same time, the effective speed of ICR has increased as processing power at the desktop has become more affordable. ICR therefore holds increasing attractions for several library applications, but there remain problems.
One of the least tractable problems is that of multiple concurrent character sets. Particularly with catalogue cards and bibliographies, libraries frequently have to deal with several alphabets simultaneously, sometimes including archaic diacritics such as in Ancient Greek. No commercial package even deals with all the possible characters and diacritics in the Latin alphabet, an aspect investigated in detail by the FACIT project. Scientific and other special characters pose similar difficulties. There is hope that Unicode-based software will eventually help to address these problems, at least in part; but there is no timetable in sight.
The typical experience is that any library projects involving ICR encounter problems specific to the application. For example,
- the FACIT project has catalogue cards which have been produced on different typewriters (and hence with different fonts) over a long period, and also experiences problems when cards are too thick for the scanner feeder;
- in the FASTDOC project an ICR system is configured to determine automatically, say, the location of paper titles on a particular journal contents page, but the configuration will have to be repeated every time the journal's format changes; and the ICR operator will have no warning that this is needed. In the context of handling large volumes of journals, this is troublesome;
- the RIDDLE project encountered difficulties using ICR on journal titles where the title is printed in a light colour on a dark background.
Networking and Delivery
Wide area networking technologies, though developing apace, are lagging behind most or all of the other key technologies (LANs, processing power, storage etc.) in terms of price/performance improvements. Wide area communications with very high bandwidths remain prohibitively expensive for most libraries. This has severe implications for distributed applications involving either large quantities of images or large images or both; and the implications are even worse for moving images and so-called "multimedia" documents.
This leads to uncertainty when planning distributed image-based applications. Illustratively, should images be held centrally, at the expense of high communications costs, or should they be duplicated and distributed, at the cost of one-time distribution and the management of redundancy? To date these questions have not been answered definitively.
The difficulties of using ISDN due to inadequate standardisation were acknowledged. In practice, ISDN can be used successfully, but if two sites do not have identical communications equipment they may not be able to communicate via ISDN.
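The effect of limited wide area bandwidth on image delivery can be illustrated with a simple calculation. The sketch below is an idealised estimate only: it assumes the standard 64 kbit/s rate of a single ISDN B channel, ignores protocol overhead and compression, and uses hypothetical image sizes and function names.

```python
ISDN_B_CHANNEL_BPS = 64_000  # nominal rate of one ISDN B channel, in bits per second


def transfer_seconds(size_bytes, link_bits_per_second):
    """Idealised time to send a file over a link, with no overhead."""
    return size_bytes * 8 / link_bits_per_second


# Hypothetical image sizes, for illustration only.
examples = {
    "compressed bitonal page (50 kB)": 50_000,
    "uncompressed bitonal page (1 MB)": 1_000_000,
    "uncompressed colour page (26 MB)": 26_000_000,
}

for label, size in examples.items():
    one = transfer_seconds(size, ISDN_B_CHANNEL_BPS)
    two = transfer_seconds(size, 2 * ISDN_B_CHANNEL_BPS)
    print(f"{label}: {one:.0f} s on one B channel, {two:.0f} s on two")
```

Estimates of this kind feed directly into the choice between holding images centrally and distributing copies: the longer each remote retrieval takes, the stronger the case for local duplication.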
Image Retrieval and Recognition
Techniques for image recognition are at a very early stage. Content-based approaches such as IBM's QBIC are still at the research stage. The HISTORIA project (which had not started at the time of the Concertation Day) plans to experiment with the storage and retrieval of heraldic images according to their colour, shape and texture. Today, however, many approaches rely on objective and/or subjective key data, keywords and/or phrases being specified by qualified indexers, often using an agreed thesaurus. It follows that different collections of images will typically have been indexed using different vocabularies at least, and different levels of detail at worst.
Additionally, different collections' index or catalogue entries will generally have different record formats. This has led the Van Eyck project to work on the concepts of "umbrella list" (effectively a concordance between different terms) and "core record" to allow otherwise incompatible collections to be treated together.
The Van Eyck project has also recently published a survey of 34 image indexing approaches in use around the world.
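The "umbrella list" idea described above, a concordance between the terms used by different collections, can be sketched as a simple lookup table. The following is an illustration only and is not the Van Eyck project's design: the vocabulary, the record layout (an identifier plus a list of index terms standing in for a "core record") and all names are hypothetical.

```python
# Hypothetical umbrella list: each collection-specific term maps to one shared term.
UMBRELLA = {
    "coat of arms": "heraldry",
    "armorial bearings": "heraldry",
    "blazon": "heraldry",
    "portrait": "portraiture",
    "likeness": "portraiture",
}


def to_umbrella(term):
    """Normalise a term and map it to its umbrella term, if one is listed."""
    key = term.strip().lower()
    return UMBRELLA.get(key, key)


def search(collections, query):
    """Return (collection, record id) pairs whose terms share the query's umbrella term."""
    target = to_umbrella(query)
    hits = []
    for name, records in collections.items():
        for record in records:
            if target in {to_umbrella(t) for t in record["terms"]}:
                hits.append((name, record["id"]))
    return hits


# Two hypothetical collections indexed with different vocabularies.
collections = {
    "A": [{"id": "a1", "terms": ["Coat of arms"]}],
    "B": [{"id": "b1", "terms": ["blazon"]}, {"id": "b2", "terms": ["landscape"]}],
}
print(search(collections, "armorial bearings"))
```

A query phrased in either collection's vocabulary then retrieves matching records from both, provided each record exposes at least the minimal shared fields.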
A number of issues were identified which are of key importance to a wide range of library imaging applications. They are:
- Suitability of scanners for library applications (as they are generally developed with commercial documents in mind).
- Ease of use and repeatability of performance of scanners.
- Lifetime of optical storage media and drives, and techniques for refreshing and preserving data stored on such media.
- Recognition of diacritics, non-Latin characters and scientific symbols by ICR systems.
- High costs of wide area networking, with the concomitant difficulty of designing an optimal storage architecture for distributed imaging applications.
- Difficulties in using ISDN, as it does not always work if more than one supplier's equipment is used.
- Automated image recognition is still at a very early stage of development.
- Problems with image indexing (based on textual descriptions) owing to the lack of standard practices.
These issues will be considered for further investigation.