NewsEye: A Digital Investigator for Historical Newspapers

Periodic Reporting for period 2 - NewsEye (NewsEye: A Digital Investigator for Historical Newspapers)

Reporting period: 2019-05-01 to 2022-01-31

Newspapers collect information about cultural, political and social events in a more detailed way than any other public record. Since their beginnings in the 17th century they have recorded billions of events, stories and names, in almost every language, in every country and on every day. Newspapers have always been an important medium for the dissemination of public and political opinions, literary works, essays and art. This thematic wealth puts them at center stage for anyone interested in European cultural heritage.

In recent decades, tens of millions of newspaper pages from European libraries have been digitized and made available online, and national libraries will intensify their digitization efforts in the coming years. There is a large demand for access to historical newspapers. At this very moment, probably thousands of European citizens are accessing digitized versions of historical newspapers through digital library services. While the broad public shows a general interest in this historical and cultural resource, it is of crucial importance for many humanities scholars.

The NewsEye project involves national libraries, humanities and social science research groups and computer science research groups. It addressed a number of challenges, which resulted in significant scientific advances, in several directions:
- in text recognition, text analysis, natural language processing, computational creativity and natural language generation, with regard to historical newspapers but also more universally,
- in digital newspaper research, addressing a number of editorial issues like OCR and article separation,
- in digital humanities, with respect to huge amounts of text material, availability of useful tools and possibilities of searching and browsing,
- in history, in terms of analyzing historical assets with new methods across different language corpora.
The project advanced the state of the art and produced open science outputs in all the directions listed above. These outputs consist of numerous tools, datasets, trained models, scientific publications, videos, screencasts, podcasts, etc. All of them were made publicly available in a sustainable way and inventoried on the project website. Most of these outputs culminate in the NewsEye platform, which demonstrates how these results can be combined into an innovative user interface for cross-lingual historical newspaper analysis.

NewsEye has been successful in its communication and dissemination aims and achievements. Concrete exploitation leads were established early on and pursued throughout. The NewsEye events (conferences, workshops, trainings, hackathons, etc.) attracted numerous participants from various user groups. NewsEye has paved the way for future research to be undertaken in the European Commission’s Horizon Europe and Digital Europe programs, bridging the gap between computer science, cultural heritage and digital humanities (and their funding streams). The NewsEye project has proven the value and necessity of unlocking the utility of historical newspaper data through a concerted effort combining expertise in digital cultural heritage, digital humanities and computer science.
By building on one of the largest and most significant digital collections of cultural heritage in Europe, the core NewsEye objective was to deliver innovative tools and services that significantly improve the way historical newspapers can be accessed, explored and analyzed, with the aim of widespread use and large impact. The project created a valuable, inexpensive, and immediately useful NewsEye toolbox and demonstrator platform for assisting users of all types, available as open science through the project's GitHub repository (https://github.com/NewsEye/), while public datasets and models were made available through Zenodo. The developed workflow is composed of four main layers, each providing advanced techniques and tools for:
- Text Recognition and Article Separation, extracting the layout of newspapers (e.g. articles and graphical regions) from digitized newspapers and transforming the content to textual format, providing full articles through automatic layout analysis, text recognition and article separation.
- Semantic Text Enrichment, enhancing the utility of the newspaper collections by enriching the texts with higher-level semantic annotation using named-entity recognition. Extracted named entities were linked to external references (such as Wikipedia) across languages, with the goal of supporting multilingual analysis. This layer also provided event detection, in support of pattern discovery from textual contents.
- Dynamic Text Analysis, providing tools to exploit the enriched data for more elaborate analysis of user-selected newspaper content, supporting interactive queries to discover different viewpoints, sub-topics or trends concerning the selected topic, named entity, newspaper, timeframe or other category, so as to provide insights into the newspaper collection in a contextualized and comparative manner.
- Intelligent analysis and reporting (“Personal Research Assistant”), providing an alternative, “intelligent” interface to the other tools and the data, carrying out iterative cycles of analysis and reporting to the user in natural language. The user could authorize the Personal Research Assistant to investigate a given topic (or time window, newspaper, etc.) on the user’s behalf, with the Assistant reporting back on findings that it assesses as potentially interesting for the user, in natural language and in a transparent manner so that the findings can be understood and verified by the user. Given the European context, we were able not only to analyze newspapers written in multiple languages but also to report on the findings in multiple languages; to this end, the Assistant used multilingual natural language generation (NLG) to produce textual descriptions of the results obtained by the Investigator.
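To make the flow between the layers listed above more concrete, the following is a minimal illustrative sketch in Python. It is not the NewsEye implementation: every name in it (`Article`, `GAZETTEER`, `enrich`, `analyze`, `report`, `run`) is hypothetical, the first layer is assumed to have already produced article text, a tiny hand-written lookup table stands in for trained named-entity recognition and linking models, and a fixed sentence template stands in for multilingual natural language generation.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Article:
    """Output assumed from layer 1 (text recognition and article separation)."""
    newspaper: str
    year: int
    text: str
    entities: list = field(default_factory=list)


# Hypothetical toy gazetteer: surface forms in several languages mapped to one
# external reference. A real system would use trained NER and linking models.
GAZETTEER = {
    "Vienna": "https://en.wikipedia.org/wiki/Vienna",
    "Wien": "https://en.wikipedia.org/wiki/Vienna",
    "Helsinki": "https://en.wikipedia.org/wiki/Helsinki",
}


def enrich(article):
    """Layer 2 stand-in: attach linked entities found in the article text."""
    tokens = [token.strip(".,;:!?") for token in article.text.split()]
    article.entities = sorted({GAZETTEER[t] for t in tokens if t in GAZETTEER})
    return article


def analyze(articles):
    """Layer 3 stand-in: count articles mentioning each entity, per year."""
    counts = Counter()
    for article in articles:
        for entity in article.entities:
            counts[(article.year, entity)] += 1
    return counts


def report(counts):
    """Layer 4 stand-in: template-based sentences describing the counts."""
    lines = []
    for (year, entity), n in sorted(counts.items()):
        name = entity.rsplit("/", 1)[-1]
        lines.append(f"In {year}, {name} was mentioned in {n} article(s).")
    return lines


def run(articles):
    """Chain the three stand-in layers over already-recognized articles."""
    return report(analyze([enrich(a) for a in articles]))
```

Calling `run` on a handful of `Article` objects returns one sentence per (year, entity) pair. Because "Wien" and "Vienna" resolve to the same reference in the toy gazetteer, mentions in different languages are counted together, which is the point of linking entities across languages.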

The NewsEye consortium further involved experts whose role was to ensure (i) additional technical expertise in the above-mentioned aspects, (ii) access to and enrichment of digitized newspapers, (iii) insight and experience in using historical newspapers as a rich cultural heritage resource for the understanding of developments in society, economy and politics, (iv) use cases with the aim to address important humanities’ research desiderata and gain experience and feedback to guide iterative development of the NewsEye demonstrator, and (v) strong dissemination and viable paths towards wider adoption and sustainability of the developed tools.

All the results and outputs of the project are available on the project website, notably with data sets, publications and source code inventoried under its "Open Science" tab.