DOCUMENT PRE-PROCESSING & CLASSIFICATION SYSTEM: software components that automatically detect the layout structure of documents and classify documents on the basis of that structure. Once the layout structure has been found for a set of training documents, another learning tool is employed to induce rules for the automatic classification of documents on the basis of spatial and perceptual factors. Lastly, the logical components of a document can be identified by associating layout components with their corresponding human-perceptible meaning.
Effective daily processing of large amounts of paper documents in office environments requires semantic-based indexing techniques to be applied while the paper documents are transformed into electronic format. For this purpose, XML can be combined with knowledge technologies. XML distinguishes between data, its structure, and its semantics, allowing the exchange of data elements that carry descriptions of their meaning, usage, and relationships. Moreover, in combination with XSLT, it enables XSLT-capable browsers to render the original layout structure of the paper documents accurately. However, transforming paper documents into XML format effectively is a complex process involving several steps.
The document pre-processing and classification system developed in the project applies knowledge technologies to many document processing steps: rule-based systems perform the semantic indexing of documents, and machine learning techniques extract the knowledge that these rules require. The system can be used as an effective tool for the automated annotation of paper documents with a relatively regular layout structure.
WISDOM++ is a document processing system for digitised textual paper documents. First, the scanned image is segmented into basic layout components (non-overlapping rectangular blocks enclosing content portions), which are classified according to the type of their content (e.g., text, graphics, etc.). Second, layout analysis is performed to detect structures among blocks. The result is a tree-like structure, which associates the content of a document with a hierarchy of layout components, such as blocks, lines, and paragraphs. Third, the classification step identifies the membership class of a document (e.g., censorship decision, newspaper article, etc.) using automatically learned rules.
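The three steps above can be pictured with a small Python sketch. It is illustrative only and does not reproduce WISDOM++ internals: the `LayoutNode` class, its field names, and the hand-written `classify` rule (including its thresholds and class label) are assumptions standing in for the tree produced by layout analysis and for a rule that the system would learn from training documents.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class LayoutNode:
    """A rectangular layout component (hypothetical representation)."""
    kind: str           # e.g. "page", "paragraph", "line", "block"
    content_type: str   # e.g. "text", "graphics", "mixed"
    x: int              # left edge, in pixels
    y: int              # top edge, in pixels
    width: int
    height: int
    children: List["LayoutNode"] = field(default_factory=list)

    def leaves(self) -> Iterator["LayoutNode"]:
        """Yield the basic blocks at the bottom of the hierarchy."""
        if not self.children:
            yield self
        for child in self.children:
            yield from child.leaves()


def classify(page: LayoutNode) -> str:
    """Hand-written stand-in for an automatically learned rule (assumed)."""
    blocks = list(page.leaves())
    # Spatial test: a wide text block located in the top 15% of the page.
    has_heading = any(
        b.content_type == "text"
        and b.y < 0.15 * page.height
        and b.width > 0.5 * page.width
        for b in blocks
    )
    has_graphics = any(b.content_type == "graphics" for b in blocks)
    if has_heading and not has_graphics:
        return "censorship_decision"
    return "unknown"


# A toy page: one paragraph containing a single wide text block near the top.
heading = LayoutNode("block", "text", x=100, y=50, width=800, height=40)
paragraph = LayoutNode("paragraph", "text", 100, 50, 800, 40, [heading])
page = LayoutNode("page", "mixed", 0, 0, 1000, 1400, [paragraph])
label = classify(page)
```

Keeping the geometry on every node means a rule can test spatial relations (position, size, alignment) at any level of the hierarchy, not only on the basic blocks.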
Document image understanding maps the layout structure into the logical structure, which associates the content with a hierarchy of logical components, such as the name of the censor in a censorship document. At this point, OCR needs to be applied only to those textual components that are of interest for the application domain, and their content can be stored for future retrieval purposes.
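As a minimal sketch of this step, the Python below labels layout components and then runs OCR only on the labels of interest. Everything in it is an assumption made for illustration: the `Component` class, the "bottom 10% of the page" rule, the label names, and the `ocr` callable, which stands in for whatever OCR engine is actually used.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple


@dataclass
class Component:
    """A layout component with an optional logical label (hypothetical)."""
    content_type: str            # e.g. "text", "graphics"
    y: int                       # top edge, in pixels
    height: int
    label: Optional[str] = None  # logical meaning, filled in below


def assign_labels(components: List[Component], page_height: int) -> None:
    """Toy layout-to-logical mapping; a real rule would be learned."""
    for c in components:
        if c.content_type != "text":
            continue  # only textual components receive a logical label here
        if c.y + c.height > 0.9 * page_height:
            c.label = "censor_name"  # assumed: signature area at page bottom
        else:
            c.label = "body"


def ocr_selected(
    components: List[Component],
    wanted: Set[str],
    ocr: Callable[[Component], str],
) -> List[Tuple[str, str]]:
    """Run OCR only on components whose logical label is of interest."""
    return [(c.label, ocr(c)) for c in components if c.label in wanted]


components = [
    Component("text", y=100, height=900),
    Component("graphics", y=1000, height=200),
    Component("text", y=1300, height=60),
]
assign_labels(components, page_height=1400)

ocr_calls = []


def fake_ocr(component: Component) -> str:
    """Placeholder OCR engine that records which components it was given."""
    ocr_calls.append(component)
    return "<recognised text>"


results = ocr_selected(components, {"censor_name"}, fake_ocr)
```

In this toy run the OCR placeholder is invoked once, for the bottom text block only; the body text and the graphics block are skipped.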
The result of the document analysis is an XML document that makes the document image retrievable. As an example, we can automatically identify a document as being a censorship document coming from a specific authority and can additionally identify, e.g., the name of the censor.
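To make the idea of a retrievable XML result concrete, the following Python sketch serialises a classified document and then queries it. The element and attribute names (`document`, `component`, `class`, `label`) and the sample values are hypothetical; the actual XML vocabulary produced by the system is not specified here.

```python
import xml.etree.ElementTree as ET


def to_xml(doc_class, components):
    """Serialise a document class and its labelled components to XML."""
    root = ET.Element("document", {"class": doc_class})
    for comp in components:
        element = ET.SubElement(
            root,
            "component",
            {
                "label": comp["label"],
                # Geometry is kept so the original layout can be rendered.
                "x": str(comp["x"]),
                "y": str(comp["y"]),
                "width": str(comp["width"]),
                "height": str(comp["height"]),
            },
        )
        element.text = comp["text"]  # OCR output for this component
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml(
    "censorship_decision",
    [
        {"label": "authority", "x": 100, "y": 50,
         "width": 800, "height": 40, "text": "EXAMPLE AUTHORITY"},
        {"label": "censor_name", "x": 100, "y": 1300,
         "width": 400, "height": 60, "text": "EXAMPLE NAME"},
    ],
)

# Retrieval: parse the XML and look up a logical component by its label.
parsed = ET.fromstring(xml_text)
censor = parsed.find("component[@label='censor_name']").text
```

Because both the document class and the logical labels are stored as attributes, a simple path query is enough to retrieve, for example, all censorship documents or the censor named in one of them.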