Objective
Metadata play a significant role in digital preservation. Firstly, in conjunction with emerging standards (such as XML, EAD, Dublin Core or RDF), they are among the most promising ways to keep digital material "alive" over years and decades. Secondly, metadata are needed for all kinds of resource discovery, i.e. for using and accessing digital collections in a user-friendly way. The METADATA ENGINE (METAe) project picks up these considerations and will develop software modules that automate metadata capture by introducing layout and document analysis as a key technology for digitisation software. METAe will dramatically improve the creation and maintenance of digital collections of printed material such as books and journals.
Objectives:
The METAe project will address the need for automated generation of metadata during the conversion of printed documents. It will thereby make large-scale digitisation of printed material, such as books and journals, more reliable in terms of digital preservation, more cost-effective in terms of automation, and more user-oriented in terms of future applications.
In order to achieve these aims the METADATA ENGINE project will
(1) introduce layout and document analysis to be employed as a key technology in future digitisation software,
(2) develop capturing and conversion tools for the automated recording and generation of administrative and descriptive metadata,
(3) develop an omnifont OCR-engine specialising in processing old European typefaces of the 19th century,
(4) strictly obey emerging standards in the fields of digital preservation and resource description, such as XML, EAD, TEI, or ISO 12083,
(5) develop an XML search engine capable of retrieving the tagged full text and the images.
Work description:
The METAe project will develop a software package which extensively automates and improves the generation of metadata by applying new technologies for character, layout and document recognition, and converts the captured information into XML documents. These XML files will serve as a basis for a variety of applications, such as new XML search engines, navigation tools, electronic books, audio books, or the automated production of HTML, XHTML, PDF or PS files.
The METAe package consists of
(1) an input module for scanning printed material and importing existing bibliographic metadata,
(2) an omnifont character recognition module (OCR-engine) specialising in typefaces of the 19th century,
(3) a document analysis module capable of classifying pages according to their physical and logical structure (items such as title pages, table of contents pages, etc., will be recognised automatically),
(4) a page layout analysis module capable of analysing and segmenting page elements such as page numbers, headings, captions, footnotes, pictures, highlighted phrases, or graphical separators,
(5) a knowledge base providing a controlled vocabulary and rules for the recognition process (for example, that the table of contents is, in most cases, headed "Contents"),
(6) a conversion module assembling an XML document containing all recognised metadata,
(7) an export module for the XML-enriched document and the scanned image.
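The role of the knowledge base can be illustrated with a minimal Python sketch. It assumes that the controlled vocabulary maps normalised page headings to logical page types. The vocabulary entries, the type names and the functions `normalise` and `classify_page` are hypothetical and are not taken from the METAe software.

```python
import re

# Hypothetical controlled vocabulary: heading keyword -> logical page type.
# The entries are illustrative assumptions, not METAe's actual knowledge base.
VOCABULARY = {
    "contents": "table_of_contents",
    "table of contents": "table_of_contents",
    "inhalt": "table_of_contents",
    "inhaltsverzeichnis": "table_of_contents",
    "index": "index",
    "register": "index",
    "preface": "preface",
    "vorwort": "preface",
}


def normalise(heading: str) -> str:
    """Lower-case a heading and strip punctuation and surplus whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", heading.lower())
    return " ".join(cleaned.split())


def classify_page(heading: str, default: str = "body") -> str:
    """Return the logical page type for the first heading found on a page."""
    return VOCABULARY.get(normalise(heading), default)
```

A real recognition process would presumably combine such lookups with layout evidence (position, type size, page order). The exact-match lookup here only shows how a vocabulary entry could drive the classification of a page.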
The XML documents will be generated according to emerging standards for digital preservation and the electronic interchange of information such as RDF, DC, EAD, TEI, or ISO 12083.
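As an illustration of what assembling such a document might look like, the following Python sketch wraps Dublin Core elements and recognised pages in one XML tree using the standard library. The Dublin Core namespace is the published one. The `document`, `metadata`, `page` and `text` element names, the attribute names and both functions are assumptions made for this example and do not describe the actual METAe output format.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def build_document(metadata: dict, pages: list) -> ET.Element:
    """Assemble one XML tree from bibliographic metadata and recognised pages.

    The <document>, <metadata>, <page> and <text> element names are
    hypothetical; only the dc:* elements follow a published standard.
    """
    root = ET.Element("document")
    meta = ET.SubElement(root, "metadata")
    for name, value in metadata.items():
        # Qualified name in ElementTree's "{namespace}local" notation.
        ET.SubElement(meta, f"{{{DC_NS}}}{name}").text = value
    for number, page in enumerate(pages, start=1):
        node = ET.SubElement(root, "page", {
            "n": str(number),
            "type": page["type"],    # logical page type from document analysis
            "image": page["image"],  # reference to the scanned image file
        })
        ET.SubElement(node, "text").text = page["text"]
    return root


def to_xml(root: ET.Element) -> str:
    """Serialise the tree to a Unicode string."""
    return ET.tostring(root, encoding="unicode")
```

Keeping the image reference as an attribute of each page is one way to let later applications move between the tagged text and the scanned image.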
In order to introduce a wide public to the new features of accessing and browsing images and XML-marked full texts, a METAe search engine and web application will be developed as well.
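The benefit of searching tagged full text rather than plain text can be sketched in Python as follows. The sample document, its element and attribute names and the `search` function are hypothetical. The sketch shows only that a query can be restricted to one kind of page element and that each hit can point back to the scanned image.

```python
import xml.etree.ElementTree as ET

# A hand-written sample; element and attribute names are hypothetical.
SAMPLE = """
<document>
  <page n="1" type="title_page" image="0001.tif">
    <heading>Reise nach Tirol</heading>
  </page>
  <page n="2" type="body" image="0002.tif">
    <heading>Innsbruck</heading>
    <paragraph>Die Stadt liegt am Inn.</paragraph>
    <footnote>Vgl. die Karte von Tirol.</footnote>
  </page>
</document>
"""


def search(root, term, element=None):
    """Yield (page number, image file, element tag) for every match.

    If `element` is given, only page elements with that tag are searched.
    """
    needle = term.casefold()
    for page in root.iter("page"):
        for node in page:
            if element is not None and node.tag != element:
                continue
            if needle in (node.text or "").casefold():
                yield page.get("n"), page.get("image"), node.tag


root = ET.fromstring(SAMPLE.strip())
hits = list(search(root, "tirol"))                          # all element types
heading_hits = list(search(root, "tirol", element="heading"))  # headings only
```

Here the unrestricted query matches a heading on page 1 and a footnote on page 2, while the restricted query returns only the heading.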
Milestones:
1. The METADATA ENGINE will be the main software package for the automated generation of descriptive, administrative and technical metadata during the digital conversion process and the assembling of XML documents.
2. The METAe OCR engine will be an omnifont OCR engine specialising in Fraktur and old European typefaces of the 19th century. Historical dictionaries for five European languages will complement this OCR engine.
3. The METAe search engine and web application will be developed in order to show the new possibilities for retrieving and accessing digitally converted documents which have been processed by the METADATA ENGINE.
Call for proposal
Data not available
Funding Scheme
CSC - Cost-sharing contracts
Coordinator
6020 INNSBRUCK
Austria