CORDIS - EU research results

Digital Bridge: Optical Character Recognition for Early Printed Books in Latin

Periodic Reporting for period 1 - LatinOCR (Digital Bridge: Optical Character Recognition for Early Printed Books in Latin)

Reporting period: 2015-03-01 to 2016-08-31

The objective of Durham's Digital Bridge Latin OCR project was to prove the viability of developing open-source Optical Character Recognition (OCR) software for early modern Latin texts as a not-for-profit business venture. In the course of the project, the software was successfully developed and a company limited by guarantee, Rescribe Ltd, was incorporated to administer and market the technology. A Beta testing phase with several partners strongly indicated that demand for this type of service exists within the UK and the rest of Europe.

The project was managed under three main headings: (1) the development of the requisite software, (2) the incorporation of the not-for-profit company and (3) outreach to potential clients and users of the software.

(1) The software development was handled jointly by the software developer and the project manager, both hired in-house, as well as the PI and Senior Advisor on the project.
The software package developed is based on the Free and Open Source Software program Tesseract, initiated by Hewlett-Packard and further developed under Google's auspices. Whilst Tesseract can be successfully used for a suite of modern languages, it had to be adapted specifically to the orthographic peculiarities of Latin and to the historic glyphs of the manuscript tradition still present in the early modern printing alphabets. This training process was completed in successive steps over the entire project period, and the results were continuously tested and further improved. The full set of characters now supported can be found on the official webpage, where the software can be downloaded for free: https://latinocr.org
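
As an illustration of how training data of this kind is typically used, the following is a minimal sketch of calling the Tesseract command-line program from Python. It is not the project's own tooling. The language code "lat", the file names, and the helper names build_tesseract_command and run_ocr are assumptions made for the example; the "txt" and "pdf" output configurations and the --tessdata-dir option are standard in recent Tesseract releases, but may not be available in older ones.

```python
import shutil
import subprocess


def build_tesseract_command(image, output_base, lang="lat",
                            tessdata_dir=None, formats=("txt", "pdf")):
    """Return the argument list for one Tesseract run; nothing is executed.

    lang is assumed to match the name of an installed .traineddata file
    ("lat" here is an assumption for the Latin training data).
    """
    cmd = ["tesseract", str(image), str(output_base), "-l", lang]
    if tessdata_dir is not None:
        # Point Tesseract at a directory holding the downloaded training data.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    # Each name selects an output configuration, e.g. raw text or searchable PDF.
    cmd += list(formats)
    return cmd


def run_ocr(image, output_base, **options):
    """Run Tesseract on one page image; requires tesseract on the PATH."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image, output_base, **options),
                   check=True)
```

Calling run_ocr("page.png", "page", tessdata_dir="./tessdata") would, under these assumptions, write page.txt and page.pdf next to the input image.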

Developing the software for business purposes and making it accessible to the broader academic community required a two-fold approach. While the tools used internally needed to be as effective as possible in order to tackle the challenges associated with OCR on early modern Latin printed texts, the resources published for the public needed to be easier to use and explained in detail for less tech-savvy users. The development of the software was therefore accompanied by the successive testing and documentation of additional open source software programs that help an inexperienced user to apply our software.

(2) The not-for-profit company was successfully incorporated in June under the name "Rescribe Ltd". The board is composed of the PI, Prof. Barbara Graziosi, the Senior Advisor, Dr. Peter Heslin, and another university employee, Mr Michael Bath. The incorporation of the not-for-profit company was completed in cooperation with Durham University's Business center. As the company was intended to operate on a not-for-profit basis, the university's usual spin-out model for commercialising outcomes of academic research projects did not prove suitable to our aims. In previous cases, research outcomes were commercialized through shareholder companies, under strong protection of the Intellectual Property initially owned by the university. In order for Rescribe to be able to utilize the software developed by university employees, a licensing agreement was drafted which allowed the company to utilize and publish the software without IP restrictions. Instead of receiving shares in the company, the university is represented on the board. This new type of business spin-out sets a useful precedent beyond the classic commercialization model used for university research outcomes.

(3) The outreach combined both interpersonal and public approaches. A number of libraries and archives were invited to submit sample texts for a Beta testing phase. Amongst these were UCL's Rare Book Collection, the library of the Society of the Middle Temple, the College of Royal Psychologists and the Salamanca Group. Each partner was invited to submit three documents and received the OCR results in the form of searchable PDFs, raw text and encoded files free of charge, alongside a report analysing the OCR results for each text. In return, the partners were sent a survey, the results of which were used to inform the business strategy. The lynchpin of the public outreach is the project's internet presence. The not-for-profit company and the free, downloadable software are purposely represented on two separate websites (https://rescribe.xyz for the business, https://latinocr.org for the software), even though both are presented by the same team. As a consequence, the free software is conceptually separated from the business venture and has its own identity on the web. Links to the free software download are disseminated amongst relevant internet fora and interested individuals.