METADATA ENGINE

Objective

Metadata play a significant role in digital preservation. First, in conjunction with emerging standards (such as XML, EAD, Dublin Core, or RDF), they are among the most promising ways to keep digital material "alive" over years and decades. Second, metadata are needed for all kinds of resource discovery, i.e. for accessing and using digital collections in a user-friendly way. The METADATA ENGINE (METAe) project picks up these considerations and will develop software modules that automate metadata capture by introducing layout and document analysis as a key technology for digitisation software. METAe will dramatically enhance the quality of creating and maintaining digital collections of printed material such as books and journals.

Objectives:
The METAe project will address the need for automated generation of metadata during the conversion of printed documents, and will thereby make large-scale digitisation of printed material, such as books and journals, more reliable in terms of digital preservation, more cost-effective in terms of automation, and more user-oriented in terms of future applications.
To achieve these aims, the METADATA ENGINE project will
(1) introduce layout and document analysis to be employed as a key technology in future digitisation software,
(2) develop capturing and conversion tools for the automated recording and generation of administrative and descriptive metadata,
(3) develop an omnifont OCR-engine specialising in processing old European typefaces of the 19th century,
(4) strictly adhere to emerging standards in the fields of digital preservation and resource description, such as XML, EAD, TEI, or ISO 12083,
(5) develop an XML search engine capable of retrieving the tagged full text and the images.

Work description:
The METAe project will develop a software package which extensively automates and improves the generation of metadata by applying new technologies for character, layout and document recognition, and converts the captured information into XML documents. These XML files will serve as a basis for a variety of applications, such as new XML search engines, navigation tools, electronic books, audio books, or the automated production of HTML, XHTML, PDF or PS files.
The METAe package consists of
(1) an input module for scanning printed material and importing existing bibliographic metadata,
(2) an omnifont character recognition module (OCR engine) specialising in typefaces of the 19th century,
(3) a document analysis module capable of classifying pages according to their physical and logical structure (items such as title pages, table of contents pages, etc., will be recognised automatically),
(4) a page layout analysis module capable of analysing and segmenting page elements such as page numbers, headings, captions, footnotes, pictures, highlighted phrases, or graphical separators,
(5) a knowledge base providing a controlled vocabulary and rules for the recognition process (for example, that the table of contents is, in most cases, headed "Contents"),
(6) a conversion module assembling an XML document containing all recognised metadata, and
(7) an export module for the XML-enriched document and the scanned image.
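The interplay of the knowledge base and the conversion module can be illustrated with a minimal sketch. It is not the METAe implementation: the vocabulary entries, the element names (`document`, `page`, `heading`, `text`) and the functions `classify_page` and `assemble_document` are assumptions made up for illustration, and the pages are hypothetical sample data.

```python
import xml.etree.ElementTree as ET

# Hypothetical controlled vocabulary: recognised heading -> logical page type.
# The actual METAe knowledge base is not specified in this description.
VOCABULARY = {
    "contents": "table_of_contents",
    "table of contents": "table_of_contents",
    "inhalt": "table_of_contents",
    "index": "index",
    "preface": "preface",
}


def classify_page(heading: str) -> str:
    """Map a recognised page heading to a logical page type."""
    return VOCABULARY.get(heading.strip().lower(), "body")


def assemble_document(pages):
    """Build one XML tree from (page number, heading, text) tuples."""
    root = ET.Element("document")
    for number, heading, text in pages:
        page = ET.SubElement(
            root, "page", n=str(number), type=classify_page(heading)
        )
        ET.SubElement(page, "heading").text = heading
        ET.SubElement(page, "text").text = text
    return root


# Hypothetical output of the OCR and layout analysis steps.
pages = [
    (1, "Contents", "Chapter I, page 3"),
    (3, "Chapter I", "It was a dark night."),
]
xml_string = ET.tostring(assemble_document(pages), encoding="unicode")
print(xml_string)
```

Keeping the vocabulary in a plain lookup table, separate from the assembly code, mirrors the idea of a knowledge base that can be extended per language without touching the conversion step.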
The XML documents will be generated according to emerging standards for digital preservation and the electronic interchange of information such as RDF, DC, EAD, TEI, or ISO 12083.
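As a rough illustration of what descriptive metadata in one of these standards can look like, the sketch below serialises a few Dublin Core (DC) elements with Python's standard library. The wrapper element `metadata`, the function `dublin_core_record` and all field values are hypothetical; only the Dublin Core element names and namespace URI come from the standard, and the source does not say how METAe itself encodes its records.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Wrap a dict of Dublin Core element names and values in XML."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields.items():
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return root


# Hypothetical bibliographic data for a digitised 19th-century book.
record = dublin_core_record(
    {
        "title": "An Example Book",
        "creator": "Example, Author",
        "date": "1850",
        "language": "de",
    }
)
dc_xml = ET.tostring(record, encoding="unicode")
print(dc_xml)
```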
In order to introduce a wide public to the new features of accessing and browsing images and XML-marked full texts, a METAe search engine and web application will be developed as well.
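The benefit of searching XML-marked full text rather than plain text is that a query can be restricted to one structural element and still lead back to the page image. The following sketch assumes a made-up document structure (`page` elements with `n` and `image` attributes) and a hypothetical `search` function; it says nothing about how the METAe search engine is actually built.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged full text with references to the scanned images.
SAMPLE = """<document>
  <page n="1" image="0001.tif">
    <heading>Contents</heading>
    <text>Chapter one begins on page three.</text>
  </page>
  <page n="3" image="0003.tif">
    <heading>Chapter One</heading>
    <text>The contents of the chest were lost.</text>
    <footnote>See the preface.</footnote>
  </page>
</document>"""


def search(root, term, tag):
    """Return (page number, image file) for pages whose <tag> contains term."""
    needle = term.lower()
    hits = []
    for page in root.iter("page"):
        # Only elements of the requested type are inspected.
        if any(needle in (el.text or "").lower() for el in page.iter(tag)):
            hits.append((page.get("n"), page.get("image")))
    return hits


root = ET.fromstring(SAMPLE)
heading_hits = search(root, "contents", "heading")
text_hits = search(root, "contents", "text")
print(heading_hits)
print(text_hits)
```

The same word yields different pages depending on whether it is searched in headings or in running text, which a search over untagged text could not distinguish.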

Milestones:
1. The METADATA ENGINE will be the main software package for the automated generation of descriptive, administrative and technical metadata during the digital conversion process and the assembling of XML documents.
2. The METAe OCR engine will be an omnifont OCR engine specialising in Fraktur and old European typefaces of the 19th century. Historical dictionaries for five European languages will complement this OCR engine.
3. The METAe search engine and web application will be developed in order to show the new possibilities for retrieving and accessing digitally converted documents which have been processed by the METADATA ENGINE.

Funding Scheme

CSC - Cost-sharing contracts

Coordinator

LEOPOLD FRANZENS UNIVERSITAET INNSBRUCK
Address
Innrain 52
6020 Innsbruck
Austria

Participants (12)

ABBYY EUROPE GMBH
Germany
Address
Anglerstrasse 6
80339 Muenchen
BIBLIOTECA STATALE A. BALDINI
Italy
Address
Via Di Villa Sacchetti 5
00197 Roma
BIBLIOTHEQUE NATIONALE DE FRANCE
France
Address
Quai Francois Mauriac
75706 Paris
CCS COMPACT COMPUTER SYSTEME GMBH
Germany
Address
Schwanenwik 32
22087 Hamburg
FRIEDRICH-EBERT-STIFTUNG E.V.
Germany
Address
Godesberger Allee 149
53175 Bonn
INTERUNIVERSITAERES INSTITUT FUER INFORMATIONSSYSTEME ZUR UNTERSTUETZUNG SEHGESCHAEDIGTER STUDIERENDER
Austria
Address
Altenbergerstrasse 69
4040 Linz
KARL-FRANZENS-UNIVERSITAET GRAZ
Austria
Address
Universitaetsplatz 3
8010 Graz
NATIONAL LIBRARY OF NORWAY, RANA DIVISION
Norway
Address
Finsetveien 2
8607 Mo I Rana
SCUOLA NORMALE SUPERIORE
Italy
Address
Piazza Dei Cavalieri 7
56126 Pisa
THE UNIVERSITY OF HERTFORDSHIRE
United Kingdom
Address
College Lane
AL10 9AB Hatfield, Hertfordshire
UNIVERSIDAD DE ALICANTE
Spain
Address
Lugar Campo Rabasa 99
03690 San Vicente Del Raspeig (Alicante)
UNIVERSITA DEGLI STUDI DI FIRENZE
Italy
Address
Piazza San Marco 4
50121 Firenze