Collaboratory for Annotation, Indexing and Retrieval of Digitised Historical Archive Material

A software component was implemented that allows users of the system to annotate COLLATE objects (digitised documents and document annotations). Users classify their annotations themselves, using so-called discourse structure relations (DSR). The basic set of relations (link types) provided was derived from discourse structure theory in computational linguistics. Applying it enabled the project to follow the users' scientific discourse about documents through interrelated, nested annotations. Methods for document retrieval have been developed that use both document metadata (cataloguing, keyword indexing) and the discussion threads about a document (its annotations). Full-text search in annotations identifies annotation threads and the documents they belong to by merging all annotations of a document and treating them as an extension of that document. Context-based retrieval, on the other hand, performs an in-depth analysis of the discourse to find out which statements were made about a document and to what degree other users agree or disagree with them. The more agreement a statement receives, the more it can be regarded as a fact rather than a subjective point of view. The statements made about a document are then used for retrieval. Whereas document retrieval using metadata and annotation search is integrated into the COLLATE prototype, context-based retrieval will be implemented and evaluated after the end of the project.
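The two annotation-based retrieval ideas described above can be illustrated with a minimal Python sketch. It is not the COLLATE implementation: the relation labels `support` and `oppose`, the `Annotation` class and the scoring formula are assumptions made for illustration, since the text does not list the actual DSR set or say how agreement is computed.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical relation labels; the real COLLATE DSR set is not given in the text.
SUPPORT, OPPOSE = "support", "oppose"


@dataclass
class Annotation:
    text: str
    relation: str = ""  # relation to the parent annotation ("" for a top-level statement)
    replies: List["Annotation"] = field(default_factory=list)


def extended_text(document_text, annotations):
    """Merge a document with all of its nested annotations into one
    searchable text, treating the annotations as an extension of the document."""
    parts = [document_text]
    stack = list(annotations)
    while stack:
        annotation = stack.pop()
        parts.append(annotation.text)
        stack.extend(annotation.replies)
    return "\n".join(parts)


def agreement(statement):
    """Assumed scoring rule: a value strictly between -1 and 1 that grows
    with the amount of (itself well-received) support a statement gets."""
    total = mass = 0.0
    for reply in statement.replies:
        sign = {SUPPORT: 1.0, OPPOSE: -1.0}.get(reply.relation)
        if sign is None:
            continue  # relations that neither agree nor disagree are ignored
        weight = 1.0 + agreement(reply)  # a contested reply counts for less
        total += sign * weight
        mass += weight
    return total / (1.0 + mass)
```

A full-text query would then run over `extended_text(...)` instead of the document text alone, while `agreement(...)` could rank statements by how close they are to being accepted as fact.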

XML CONTENT MANAGER (XMLCM) is software comprising a set of functions that can be used to create, instantiate and manage metadata associated with multimedia documents. This requires two phases: designing the Document Type Definition (DTD), and implementing the basic functions needed to create and manage documents expressed in XML and stored in an XML-compliant DBMS. Existing tools that facilitate the design and development of DTDs are exploited. The software is a generic framework for managing data represented in XML and metadata represented in RDF. It is a complete Web-enabled system that provides an open, robust and extensible technology. It supplies the classical operations of a generic DBMS and can be attached directly to the Internet without complex server scripting or gateway administration. The objective of XMLCM is to establish a highly reliable, scalable and integrated open environment in which users develop reusable Web Services solutions incorporating data in any format. By managing data in XML, the product removes restrictions on data sources, transaction types, deployment and scalability. Users can wrap, link and run Web Services from both legacy and dynamic data sources and deliver the results via a browser, PDA or cell phone. Aside from traditional local access, XMLCM offers Web Services access that reaches scalability and interoperability levels not achievable by other paradigms. The principal uses of XMLCM are the integration of heterogeneous persistence layers (RDBMS, XML-based repositories) and the XML representation of metadata models. Because XML can represent the information in all components (links between data, semantics and knowledge), we are working to replace the RDBMS with XMLCM, used as an XML database, to store and retrieve data, information and knowledge.
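To make the "classical operations of a generic DBMS" over XML metadata concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the idea, not the XMLCM API: the class `XmlStore`, its method names and the element names used in the usage note are all hypothetical, and DTD validation is left out because the standard-library parser does not validate.

```python
import xml.etree.ElementTree as ET


class XmlStore:
    """Hypothetical in-memory stand-in for an XML-compliant store."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, xml_text):
        # ET.fromstring raises ParseError if the record is not well-formed XML.
        self._docs[doc_id] = ET.fromstring(xml_text)

    def get(self, doc_id):
        return ET.tostring(self._docs[doc_id], encoding="unicode")

    def delete(self, doc_id):
        self._docs.pop(doc_id, None)

    def query(self, path, text):
        """Return the ids of all records with an element at `path` whose text equals `text`."""
        return sorted(
            doc_id
            for doc_id, root in self._docs.items()
            if any(element.text == text for element in root.findall(path))
        )
```

For example, after `store.put("d1", "<record><type>censorship decision</type></record>")`, the call `store.query("./type", "censorship decision")` returns `["d1"]`.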

DIGITAL WATERMARKING MODULE: tools for author/provider authentication (copyright watermarking) and document authentication (integrity watermarking). Upon request from the client, the watermarking server retrieves the watermark, confirms or denies the integrity of the document (integrity watermark) and sends the client the retrieved copyright information, if any. A challenging security issue for COLLATE was to protect the property rights of the digital repository and prevent misuse, i.e. to guarantee owner authentication (copyright watermarking) and data authentication (integrity watermarking) of the documents. We implemented a watermarking engine based on a protection scheme with a private server detector and symmetric copyright and integrity watermarking algorithms. Using the implemented watermarking tools, the content suppliers of the COLLATE project were able to embed copyright information into all documents of the digital repository with a robust watermarking algorithm, while a fragile watermarking algorithm ensured their integrity. The most important part of our work was to tune both algorithms to the characteristics of the COLLATE-specific document types in the collection and to the combination of copyright and integrity watermarking. These specifically developed algorithms can be tuned to meet not only the most important security requirements but also those of feasibility and complexity, without restricting the flexibility with which COLLATE users work. Further development, adapting the algorithms to other document types and to new application domains with even higher security requirements, and marketing of the product are major concerns of the IPSI department MERIT and a spin-off company.
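The idea of a fragile, symmetric integrity watermark can be sketched as follows. This is a generic textbook-style scheme (a keyed HMAC-SHA256 tag hidden in the least significant bits of the first pixels), chosen here for illustration; it is not the COLLATE algorithm, whose details the text does not give. The function names and the flat list of 8-bit grey values are assumptions.

```python
import hashlib
import hmac

TAG_BITS = 256  # length of an HMAC-SHA256 tag in bits


def _expected_tag(key, pixels):
    # The tag covers everything except the bits that will carry the tag itself.
    covered = bytes(p & 0xFE for p in pixels[:TAG_BITS]) + bytes(pixels[TAG_BITS:])
    return hmac.new(key, covered, hashlib.sha256).digest()


def embed_integrity_mark(key, pixels):
    """Return a copy of `pixels` (ints 0..255) with the keyed tag hidden in
    the least significant bit of each of the first TAG_BITS pixels."""
    if len(pixels) < TAG_BITS:
        raise ValueError("image too small to carry the tag")
    tag = _expected_tag(key, pixels)
    bits = [(byte >> (7 - i)) & 1 for byte in tag for i in range(8)]
    head = [(p & 0xFE) | b for p, b in zip(pixels, bits)]
    return head + list(pixels[TAG_BITS:])


def verify_integrity_mark(key, pixels):
    """True only if the hidden tag matches the image content under `key`."""
    if len(pixels) < TAG_BITS:
        return False
    stored = bytearray()
    for start in range(0, TAG_BITS, 8):
        byte = 0
        for p in pixels[start:start + 8]:
            byte = (byte << 1) | (p & 1)
        stored.append(byte)
    return hmac.compare_digest(bytes(stored), _expected_tag(key, pixels))
```

Because the same secret key is used for embedding and detection, verification has to run where the key is held, which matches the server-side detector described above. Any change to a pixel makes verification fail, which is the intended behaviour of a fragile mark; a robust copyright watermark needs a different kind of embedding and is not covered by this sketch.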

WWW-based collaboratory: an integrated software system for archives, researchers and end-users working with digitised historic-cultural material, which provides functions for cataloguing, content-based in-depth indexing and annotation of documents, and content-based retrieval. The COLLATE system supports distributed user groups in their collaborative in-depth indexing and content-based search of digitised documents. It integrates:
- A complex working environment with easy access to a large digital repository, providing document- and task-specific input forms for detailed cataloguing, indexing and annotation of these documents.
- An advanced retrieval mechanism that allows content- and context-based access to documents and annotations, combining various types of search functions (e.g., direct database access, attribute-value searches and full-text search) and exploiting both user-created and automatically generated metadata.
- An innovative annotation-based collaboration facility that manages communication between users (e.g., mutual requests and task assignments), allows them to build a discourse in the form of nested annotations, and notifies the virtual team about the state of their collaborative work.
COLLATE is one of the first fully working collaboratories in the Humanities. The various prototype versions developed within the last two project years have been used successfully by real-life users. Film experts set up, indexed and annotated a large document collection on film history, and they plan to continue this work after the end of the project together with selected cooperation partners. Although COLLATE addresses the user requirements of the chosen example domain, the technology developed is largely generic and easily adaptable to other application domains that are similarly information-intensive and profit from collaborative knowledge work.

DOCUMENT PRE-PROCESSING & CLASSIFICATION SYSTEM: software components for the automatic detection of the layout structure of documents and for the automatic classification of documents on the basis of their layout structure. Once the layout structure has been found for a set of training documents, another learning tool is employed to induce rules for the automatic classification of documents on the basis of spatial and perceptual factors. Lastly, the logical components of the document can be identified by associating some layout components with the corresponding human-perceptible meaning. Effective daily processing of large amounts of paper documents in office environments requires the application of semantic-based indexing techniques when paper documents are transformed into electronic format. For this purpose a combination of XML and knowledge technologies can be used. XML distinguishes between data, its structure and its semantics, allowing the exchange of data elements that carry descriptions of their meaning, usage and relationships. Moreover, the combination with XSLT enables any browser to render the original layout structure of the paper documents accurately. However, an effective transformation of paper documents into XML format is a complex process involving several steps. The document pre-processing and classification system developed in the project applies knowledge technologies to many document processing steps, namely rule-based systems for the semantic indexing of documents and machine learning techniques for extracting the necessary knowledge. The system can be used as an effective tool for the automated annotation of paper documents with a relatively regular layout structure.

WISDOM++ is a document processing system for digitised textual paper documents. First, the scanned image is segmented into basic layout components (non-overlapping rectangular blocks enclosing content portions), which are classified according to the type of their content (e.g., text, graphics, etc.). Second, layout analysis is performed to detect structures among the blocks. The result is a tree-like structure that associates the content of a document with a hierarchy of layout components, such as blocks, lines and paragraphs. Third, the classification step identifies the membership class of a document (e.g. censorship decision, newspaper article, etc.) using automatically learned rules. Document image understanding then maps the layout structure onto the logical structure, which associates the content with a hierarchy of logical components, such as the name of the censor in a censorship document. At this point, OCR can be applied only to those textual components that are of interest for the application domain, and their content can be stored for future retrieval. The result of the document analysis is an XML document that makes the document image retrievable. As an example, we can automatically identify a document as a censorship document coming from a specific authority and can additionally identify, e.g., the name of the censor.
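The data flow of these steps (layout tree, rule-based classification, XML output) can be sketched in Python. This is not WISDOM++ code: the `Component` class, the element and attribute names, and in particular the hand-written rule in `classify` are hypothetical stand-ins, whereas the system described above learns its classification rules automatically from training documents.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Component:
    kind: str                                    # e.g. "page", "paragraph", "line", "block"
    content: str = "text"                        # content type of the component
    box: Tuple[int, int, int, int] = (0, 0, 0, 0)  # x, y, width, height
    children: List["Component"] = field(default_factory=list)


def blocks(component):
    """Yield the leaf components of the layout tree."""
    if not component.children:
        yield component
    for child in component.children:
        yield from blocks(child)


def classify(page):
    """Made-up rule for illustration: a wide text block near the top of the
    page marks a censorship decision; everything else is a newspaper article."""
    _, _, page_w, page_h = page.box
    for block in blocks(page):
        _, y, w, _ = block.box
        if block.content == "text" and y < 0.15 * page_h and w > 0.5 * page_w:
            return "censorship_decision"
    return "newspaper_article"


def to_xml(component):
    """Serialise the layout hierarchy as nested XML elements."""
    x, y, w, h = component.box
    element = ET.Element(
        component.kind,
        content=component.content, x=str(x), y=str(y), width=str(w), height=str(h),
    )
    for child in component.children:
        element.append(to_xml(child))
    return element


def analyse(page):
    """Return an XML description of the page, tagged with its document class."""
    root = ET.Element("document", {"class": classify(page)})
    root.append(to_xml(page))
    return ET.tostring(root, encoding="unicode")
```

Calling `analyse(page)` on a segmented page yields one XML document holding both the class and the layout hierarchy, which is the form in which the analysis result becomes searchable.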
