We often look at history books as the most valuable recollections of society’s past struggles and breakthroughs. But as far as details go, nothing really beats the millions of events, stories and names discussed in newspapers every single day. As such, newspapers are an integral part of our cultural heritage. They need to be digitised and stored, which explains why libraries across Europe have been stepping up their efforts and will continue to do so over the coming years.
But current digitisation methods are not without their drawbacks. As Antoine Doucet, professor and researcher at the University of La Rochelle, points out: “Much remains to be done for the collections to be truly available to ordinary citizens and humanities scholars, so that they can benefit from the new possibilities of digital methods for their research.”
There are several problems at hand, which Doucet has been aiming to overcome with funding under the NewsEye (A Digital Investigator for Historical Newspapers) project: the low quality of digitised newspapers, the lack of adequate search and analysis tools, and the dizzying amount of information available, which calls for new ways to help users find what they’re looking for.
The first issue stems from the fact that most library collections were digitised decades ago. Applying optical character recognition (OCR) to such archives often results in poor-quality output. This is problematic, as users of historical newspapers need high-quality text recognition results in order to search, find and browse through relevant content. NewsEye overcomes this problem by combining advanced technologies for text recognition, layout analysis, article separation and other related tasks.
Furthermore, Doucet and his team developed semantic tools that enrich the text with data such as named entities (people, companies, countries, etc.) or events. These can then be linked to external data sources like Wikidata, which helps provide more accurate search results that even cross language barriers.
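To make the idea of cross-language search through entity linking more concrete, here is a minimal sketch. It is not NewsEye's implementation: the lookup table, the sample articles and the function names are all hypothetical, and a real system would obtain the links from an entity recognition and linking step rather than a hand-written dictionary. The sketch only shows why indexing articles by a shared identifier, such as a Wikidata ID, lets a query in one language find articles written in another.

```python
from collections import defaultdict

# Hypothetical, hand-written mapping from surface forms to Wikidata IDs.
# In practice these links would come from an entity-linking step.
SURFACE_TO_QID = {
    "vienna": "Q1741",
    "wien": "Q1741",
    "vienne": "Q1741",
    "helsinki": "Q1757",
    "helsingfors": "Q1757",
}

# Made-up sample articles in four languages (illustration only).
ARTICLES = {
    "en-001": "Floods reported near Vienna this week",
    "de-001": "Hochwasser bei Wien gemeldet",
    "fr-001": "Un concert donné à Vienne, en Autriche",
    "sv-001": "Nytt bibliotek i Helsingfors",
}


def build_index(articles, surface_to_qid):
    """Map each entity ID to the set of articles that mention it."""
    index = defaultdict(set)
    for doc_id, text in articles.items():
        for token in text.lower().split():
            # Strip simple punctuation before the dictionary lookup.
            qid = surface_to_qid.get(token.strip(".,;:!?"))
            if qid:
                index[qid].add(doc_id)
    return index


def search(index, surface_to_qid, query):
    """Resolve the query to an entity ID, then return matching articles."""
    qid = surface_to_qid.get(query.lower())
    return sorted(index.get(qid, set())) if qid else []


INDEX = build_index(ARTICLES, SURFACE_TO_QID)
print(search(INDEX, SURFACE_TO_QID, "Vienna"))
```

In this toy setting, the English query "Vienna" also retrieves the German and French articles, because all three surface forms resolve to the same identifier.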
Enhanced research potential
“Semantic enrichment provides powerful search capabilities and supports further analysis of the content. The applied methods are strongly based on statistical approaches and avoid dependencies on external dictionaries or high-level linguistic analysis. This makes our tools applicable to a wide range of languages,” Doucet says.
This is indeed a great step forward. Users of historical newspapers need effective tools to index and search newspaper content in various ways to discover topics, trends and patterns. Such tools were largely non-existent before NewsEye, and those that existed failed to cope with the noisy, low-quality OCR results.
This brings us to the third problem: state-of-the-art tools for text analysis are not adapted to the needs of historical newspaper users. NewsEye fills this gap with Dynamic Text Analysis tools. These support interactive queries to discover different viewpoints, subtopics or trends concerning the selected topic, the named entity, the newspaper, the timeframe, etc. Together, they provide insights into the newspaper collection in a contextualised and comparative manner.
Last but not least, users interested in historical questions and needing to deal with billions of items will benefit from the project’s so-called Personal Research Assistant. Doucet explains: “The Assistant will autonomously investigate newspaper content on behalf of the user and will report on findings which it assesses as potentially interesting. It will also provide a transparently presented rationale for how the assessment was made so the findings can be understood and verified by the user.”
All NewsEye tools are available on the project website. Many of them are well on their way to being fully exploited and sustained, and Doucet intends to eventually make them useful beyond the scope of newspaper research. Funding has already been granted for such exploration, in the context of further projects at the regional, national and European level.
NewsEye, history, historic newspaper, research, OCR, text analysis