Community Research and Development Information Service - CORDIS

H2020

READ Report Summary

Project ID: 674943
Funded under: H2020-EU.1.4.1.3.

Periodic Reporting for period 1 - READ (Recognition and Enrichment of Archival Documents)

Reporting period: 2016-01-01 to 2016-12-31

Summary of the context and overall objectives of the project

The history of Europe is preserved in its archives. Thousands of shelf-kilometres containing billions of documents provide a true picture of the everyday life (and struggles) of European citizens from the Middle Ages to the present day. But this treasure is hidden from the public - not because it cannot be accessed physically or digitized, but simply because until now it has not been possible to search archival material in the way that we expect: by searching the full text of historical documents which are handwritten in variants of historical languages and which have highly sophisticated layouts such as registers and tables.
The H2020 project READ (Recognition and Enrichment of Archival Documents) will revolutionize access to historical collections from archives and libraries by supporting cutting-edge research in Pattern Recognition, Computer Vision, Natural Language Processing and Digital Humanities. In particular, Handwritten Text Recognition and Keyword Spotting are key technologies in which European universities are at the forefront of research. These technologies are made available via the service platform “Transkribus”. It offers the world’s first implementation of a freely available Handwritten Text Recognition engine, which can be trained on medieval handwriting found in famous codices in the same way as on the individual handwriting of famous persons of the 20th century. The main European scripts can be trained and recognised, as can Hebrew, Arabic or Bangla.
The main objective of the READ Virtual Research Environment is to set up a European e-infrastructure for historical documents that enables users to recognize documents, extract information from them, annotate them and finally make them available to other platforms and repositories. The Virtual Research Environment “Transkribus” aims to provide benefits for all user groups involved in the “eco-system” of historical documents: archives and libraries, as content holders, get the chance to enrich their documents on a large scale with full-text transcription and searching; (digital) humanities scholars are enabled to work intensively with historical documents in a sheltered and highly specialized environment; computer scientists are supported with large-scale datasets and reference data for research directly connected to real-world challenges; and finally the public can enjoy the benefits of accessing digital archives.

Work performed from the beginning of the project to the end of the period covered by the report and main results achieved so far

Our work in the first year of the project focused on two main areas:
First of all, we carried out a range of activities to make the project and the technology known to our four target groups. This started with a three-day conference combining the public kick-off meeting of the project with a convention meeting of the co:op project. More than 150 people from over 20 countries took part in the conference. Videos of the presentations are online and are an important resource for dissemination activities. Reactions to this conference were highly positive and opened the door to many archives and research groups. Dissemination activities were continued on several channels. Among the most important were workshops: more than 20 were organized by several groups in the project and held in a number of countries (Austria, France, Germany, Netherlands, Finland, Denmark, Norway, Italy, Switzerland, United Kingdom, Spain). Hundreds of people took part in these workshops and became familiar with the expert tool from the Transkribus platform.
Based on the overwhelming interest of archives and research groups in the project, we were able to conclude 25 Memoranda of Understanding (MoUs) with institutions from all over Europe. These MoUs provide an excellent framework for cooperation. Among the signatories are the Hessian State Archive (Germany), the Archivio Storico Ricordi (Italy), the Huygens Institute for the History of the Netherlands (Netherlands), the Alfred Escher Foundation (Switzerland) and The Linnean Society (United Kingdom), to mention just a few.
As a result, the Transkribus platform now has more than 5,000 registered users, representing archivists, librarians, researchers, scholars and public users (family historians) from all over Europe and abroad.
Our second focus was the implementation of the Transkribus platform, integrating a number of tools developed by the research groups in the project. Special attention was given to defining interfaces and data exchange formats, to setting up application servers for easy deployment of the individual tools (which run on different operating systems and are written in different programming languages) and to tackling the challenge of storing and processing millions of image files. As a highlight, the award-winning Handwritten Text Recognition engine from the CITLab team of the University of Rostock has been implemented in the Transkribus platform. This engine provided the best results in the scientific competitions on historical handwriting held at the leading conferences in the field (ICDAR 2015, ICFHR 2014).

Progress beyond the state of the art and expected potential impact (including the socio-economic impact and the wider societal implications of the project so far)

To cope with the challenge of providing a comprehensive Research Environment for all kinds of historical documents, a large number of tools and services need to be developed. Nevertheless, there are two main issues which have to be resolved and on which research needs to focus. The first is the improvement of existing HTR engines, based either on Hidden Markov Models (HMMs) or on Recurrent Neural Networks (RNNs), which was carried out by two groups in the project. Results at the ICDAR 2017 competition on HTR showed that excellent figures can already be achieved on historical handwriting. Clear improvements have also been made by making the HTR engine completely independent of the reading order of the characters, so that Hebrew, Arabic or other “right to left” alphabets can now be trained and recognized.
The second focus was the layout analysis task. This is one of the bottlenecks of current processing, because historical archival documents often have a very irregular and hard-to-process layout structure. Not only do characters run across lines (long ascenders and descenders), but the writing is often not straight, sometimes changes direction completely (e.g. notes and marginalia), or follows complex tables with dozens or even hundreds of cells on one page. One of the main priorities of the project was to create the largest reference dataset for historical layout analysis ever made public in the document image analysis research field. Moreover, the READ project team introduced a new concept of “baselines” instead of “line regions”, which significantly simplifies the creation of reference data compared to current approaches.
