CORDIS - EU research results

NewsEye: A Digital Investigator for Historical Newspapers

Deliverables

Automatic Text Recognition (final)

Reports on software tools and modules, including documentation, for Automatic Text Recognition. Technical reports on further development and innovative adaptation of algorithms and methods for Automatic Text Recognition.

Dissemination, communication and exploitation of results (e) (final)

The PEDR will be delivered at M3, and the project will follow through by maintaining a rolling plan of activities to disseminate and exploit project results, including reports or publications for each event on a particular topic. This deliverable includes rapid dissemination channels in the form of blog posts, tweets and other online media, as well as more traditional dissemination outputs (conference papers, scholarly articles). At M12, M24 and M36 we will provide yearly reports on the execution of the PEDR as well as on all dissemination and communication events organized during the project. Main dissemination and communication events are planned at M3, M14, M24, M25, M26 and M30, but will be reported on yearly together with smaller-scale events. This deliverable, under the lead of WP7 by BNF after M36, will provide details on the dissemination, communication and exploitation of results during the project extension.

Layout Analysis (final)

Reports on software tools and modules, including documentation, for Layout Analysis. Technical reports on further development and innovative adaptation of algorithms and methods for Layout Analysis.

Usability/Fit for research purpose test of tools and user interfaces (c) (final)

The deliverables will report on testing the methods, tools and interfaces to the core. They are the result of collaboration on the mock-ups and prototypes and of workshop/hackathon participation with the computer science groups and the libraries, as indicated in Task T7.4, providing extensive feedback on tools and methods. UIBKICH will supervise the production of reports in preparation for, and as a follow-up to, the tools (prototypes, beta versions and publishable tools) along the timeline of WP7. The final version is due at M34, with a possible update at M45.

Contextualized Case Studies for academic use (d) (final)

The deliverables will report on the four digital humanities case studies prepared by using already existing methods and tools as well as the ones to be developed in this project, showing progress and improvement of search and research outcomes. UIBKICH will be responsible for the case studies on migration, UHDH for the case study on nationalisms and revolutions, UNIVIE for the case study on media and journalism, and UPVM for the case study on gender. The members of the DH group will furthermore compare and contrast the results of the case studies in order to show how newspapers work both as a space for change and as a space for stability, while addressing the relationship between press, politics and society in different regions and languages across Europe, thus showing the transformation of our societies.

The deliverables will (a) include thorough literature and background research for each of the case studies; (b) work with the semantically enriched text, as well as application/utilization of the developed dynamic text analysis features in different languages, in order to improve the quality of the case studies; and (c) show how the developed tools contribute to change and continuity discussions for European societies.

Draft reports will be delivered at M6 and complete reports at M12, while final reports, to be submitted for publication in renowned humanities and digital humanities journals, will be completed at M24 and M36.

Personal Research Assistant: Explainer (b) (final)

This deliverable describes the Explainer component. The first version (M24) will be able to produce initial descriptions of the strategies, goals and decisions of the Investigator, while the second version (M36) describes the final version. The final version is due at M36, with a possible update at M45.

Article separation (c) (final)

Reports on software tools and modules, including documentation, for Article Separation. Technical reports on further development and innovative adaptation of algorithms and methods for Article Separation; journal and research paper submissions on new, preferably machine-learning-based (neural), algorithms and technologies for Article Separation, along with the inherently used Layout Analysis, Text Line Detection and Automatic Text Recognition. The final version is due at M36, with a possible update at M45.

Event detection (final)

Report on the level of completion of the event detection tool. The first version, at M24, presents the state of the art in event detection, relying on the detection of events based solely on document content, using string-based multilingual approaches based on rhetoric and specificities of the news genre, as previously developed at ULR. The second version, at M36, will integrate contrastive knowledge from other documents. The final version is due at M36, with a possible update at M45.

Personal Research Assistant: Reporter (c) (final)

This deliverable describes the Reporter component and how it is used. The first version (M12) will be capable of some simple natural language generation, using relatively rigid document structures and mechanisms for talking about the results of tools produced in WP3 and WP4 during year one. The second version (M24) will have more elaborate document structuring and will be able to report more flexibly on a wider range of analysis results; it will also include a first version of summarization of textual contents. The third version of the deliverable (M36) will describe the final version with full functionality. The final version is due at M36, with a possible update at M45.

Use of project results for the general public (b) (final)

The deliverables will report on the texts, podcasts and social media activities by the digital humanities group. UNIVIE will supervise the podcast production, UPVM the linking with Wikipedia, and UHDH the social media activities.

NewsEye Demonstrator (c) (final)

Reports and software on the development of the NewsEye Demonstrator, a web-based user interface for the tools developed in WP3 and WP4 and for the Personal Research Assistant (WP5). Tools for the user interface of WP3 will be provided at M12, while the complete minimum viable product (MVP) will be delivered at M24 and the final version at M36. The final version is due at M36, with a possible update at M45.

Sustainability plan (c) (final)

The project will conceptualize a sustainability strategy for long-term access to the tools and data generated by the project, to be planned in full detail at M26, under implementation at M36 and fully implemented at M45.

Stance detection (final)

Reports on the level of completion of the software tool for stance detection. The first version, at M12, will rely on standard state-of-the-art methods, and the second version, at M24, contains our principal research contribution, robust to noise and language-independent.

Showcase case studies for the user interface (b) (final)

The deliverables will consist of texts, videos, statistics, search paths, how-tos, etc. on the user interface and on the project homepage. All partners of the digital humanities group will contribute to the deliverable.

Personal Research Assistant: Investigator (c) (final)

The deliverable describes the Investigator tool. In the first iteration (M12), the Investigator will be capable of planning, forming and running some queries using analysis tools developed in parallel in WP3 and WP4, and of interacting with the user in simple ways to continue the investigation. In the second iteration (M24), the Investigator will also be able to create strategies for investigation, to analyze the results obtained and to adjust its strategy accordingly. The third iteration (M36) describes the final version with full functionality. The final version is due at M36, with a possible update at M45.

Advanced tool to query the enriched data sets (final)

Report on the software to query the data sets. The first version is delivered early, at M6, to allow querying the data set as soon as possible, without the semantic enrichment produced in other deliverables of WP3. The second version, at M12, reports on the software to analyze the data and the enriched data sets; it allows querying both the data set and the enriched data set, including the semantic text enrichment to be produced in the rest of WP3 (D3.1-D3.3).

Data models (d) (final)

Regular reports providing a detailed description of the data models, formats and specifications used in the project, including publicly available example data.

Data collection and preservation (d) (final)

Report and data collection

Comparative analysis of data between contexts (b) (final)

Reports on the developed methods and tools for dynamic comparative analysis of data between given contexts. The first version, at M24, describes the methods to extract sets of characteristics that describe similarities or contrasts between document groups, and the second version, at M36, describes the final methods to extract contrasting characteristics from groups of documents, integrated with work on intelligible descriptions. The final version is due at M36, with a possible update at M45.

Educational material for teachers, pupils and lay historians (b) (final)

The deliverables consist of prototypes of the educational material at M24 and the online published material at M36. While all partners of the digital humanities group will contribute to the production of the material, UHDH will supervise the production of material for teachers, UPVM for pupils and students, and UIBKICH for lay historians, in different languages. A report on educational material prototypes will be delivered at M24; the final report will be delivered at M36.

Analysis of data in a given context (c) (final)

Reports on the level of completion of the software tool for dynamic analysis of data in a given context. The first version, at M12, will consist of tools for building multilingual topic models, topic hierarchies and dynamic topic models, and for using them to analyze articles in the initial dataset. The second version, at M24, contains document analysis methods for article similarity and link discovery to suggest related articles, combining multilingual, hierarchical and dynamic topic models. The third version, at M36, contains document analysis methods refined on the basis of feedback from their use in the Personal Research Assistant and of an evaluation of their integration with intelligible descriptions. The final version is due at M36, with a possible update at M45.

NE recognition and linking (final)

Reports on the level of completion of the software tool to recognize and link named entities (NEs). The first version, at M12, will rely on standard state-of-the-art methods, and the second version, at M24, contains our principal research contribution, robust to noise and language-independent.

Intelligible representation of statistical analysis (b) (final)

Reports on the methods and tools for outputting human-intelligible representations based on the outputs of the statistical models developed in T4.1 and T4.2. The first version, at M24, describes the methods that provide intelligible names/descriptions of topics and extracted characteristics for use in the Personal Research Assistant, and the second version, at M36, describes the final methods to provide intelligible descriptions, refined after integration in the Personal Research Assistant. The final version is due at M36, with a possible update at M45.

Project website (to be continuously updated)

The project will maintain a website that will act as a portal for the communications activities. In M1 a web page will be published to advertise and announce the project. By M8 the full website structure will be in place, integrating social media (such as Twitter) channels. The website will be maintained throughout the duration of the project and content will be contributed by all project partners.

Data management plan

The NewsEye project will contribute to the open research data pilot. According to the guidelines for Research Data Management of Horizon 2020 (http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf), a Data Management Plan will be written during the first six months, explaining what data will be generated, collected, shared and curated during the project as well as after the project's end. It will consider the different kinds of research outcomes (WP6) and data (WP2-5) resulting from the project. One important goal of NewsEye is to make its data findable, accessible, interoperable and reusable (FAIR).

Publications

Exploring Entities in Event Detection as Question Answering

Authors: Boros, Emanuela; Moreno, Jose G.; Doucet, Antoine
Published in: Proceedings of the 44th European Conference on Information Retrieval (ECIR), 2022
Publisher: Springer
DOI: 10.5281/zenodo.5779941

L3i at SemEval-2022 Task 11: Straightforward Additional Context for Multilingual Named Entity Recognition

Authors: Emanuela Boros, Carlos-Emiliano Gonzalez-Gallardo, Jose G. Moreno, Antoine Doucet
Published in: International Workshop on Semantic Evaluation (SemEval), Issue Task 11, 2022
Publisher: ACL
DOI: 10.5281/zenodo.6369947

A Multilingual Dataset for Named Entity Recognition, Entity Linking and Stance Detection in Historical Newspapers

Authors: Ahmed Hamdi; Elvys Linhares Pontes; Emanuela Boros; Thi Tuyet Hai Nguyen; Günter Hackl; Jose G. Moreno; Antoine Doucet
Published in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, Page(s) 2328–2334
Publisher: ACM
DOI: 10.1145/3404835.3463255

Assessing and Minimizing the Impact of OCR Quality on Named Entity Recognition

Authors: Ahmed Hamdi; Axel Jean-Caurant; Nicolas Sidere; Mickaël Coustaty; Antoine Doucet
Published in: Proceedings of the 24th International Conference on Theory and Practice of Digital Libraries, TPDL 2020, Issue 12246, 2020, Page(s) 87–101
Publisher: Springer
DOI: 10.1007/978-3-030-54956-5_7

Alleviating Digitization Errors in Named Entity Recognition for Historical Documents

Authors: Emanuela Boros; Ahmed Hamdi; Elvys Linhares Pontes; Luis Adrián Cabrera-Diego; Jose G. Moreno; Nicolas Sidere; Antoine Doucet
Published in: Proceedings of the 24th Conference on Computational Natural Language Learning (CoNLL), 2020, Page(s) 431–441
Publisher: ACL
DOI: 10.18653/v1/2020.conll-1.35

Exploring Entities in Event Detection as Question Answering

Authors: Boros, Emanuela; Moreno, Jose G.; Doucet, Antoine
Published in: European Conference on Information Retrieval (ECIR 2022), 2022, Page(s) 65-79, ISBN 978-3-030-99735-9
Publisher: Springer
DOI: 10.1007/978-3-030-99736-6_5

Grammatical Profiling for Semantic Change Detection

Authors: Giulianelli, Mario; Kutuzov, Andrey; Pivovarova, Lidia
Published in: Proceedings of the 25th Conference on Computational Natural Language Learning (CoNLL 2021), 2021
Publisher: ACL
DOI: 10.18653/v1/2021.conll-1.33

Multilingual Epidemic Event Extraction

Authors: Mutuvi, Stephen; Boros, Emanuela; Doucet, Antoine; Lejeune, Gaël; Jatowt, Adam; Odeo, Moses
Published in: Proceedings of the 23rd International Conference on Asian Digital Libraries (ICADL), Issue 13133, 2021, Page(s) 139–156
Publisher: Springer
DOI: 10.5281/zenodo.5779966

Transformer-based Methods for Recognizing Ultra Fine-grained Entities (RUFES)

Authors: Boros, Emanuela; Doucet, Antoine
Published in: Thirteenth Text Analysis Conference (TAC 2020), 2021
Publisher: NIST
DOI: 10.5281/zenodo.4555778

Information Extraction from Invoices

Authors: Ahmed Hamdi; Elodie Carel; Aurelie Joseph; Mickael Coustaty; Antoine Doucet
Published in: International Conference on Document Analysis and Recognition ICDAR 2021, Issue 12822, 2021, Page(s) 699–714
Publisher: Springer
DOI: 10.1007/978-3-030-86331-9_45

Event Detection with Entity Markers

Authors: Emanuela Boros; Jose G. Moreno; Antoine Doucet
Published in: Proceedings of the 43rd European Conference on Information Retrieval (ECIR 2021), Issue 12657, 2021, Page(s) 233–240
Publisher: Springer
DOI: 10.1007/978-3-030-72240-1_20

An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish

Authors: Quan Duong; Mika K Hämäläinen; Simon Hengchen
Published in: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), 2020, Page(s) 240–248
Publisher: ACL
DOI: 10.5281/zenodo.4242890

Dataset for Temporal Analysis of English-French Cognates

Authors: Frossard, Esteban; Coustaty, Mickael; Doucet, Antoine; Jatowt, Adam; Hengchen, Simon
Published in: Proceedings of the 12th Language Resources and Evaluation Conference, 2020, Page(s) 855–859
Publisher: European Language Resources Association
DOI: 10.5281/zenodo.3693650

NewsEye: A digital investigator for historical newspapers

Authors: Doucet, Antoine; Gasteiner, Martin; Granroth-Wilding, Mark; Kaiser, Max; Kaukonen, Minna; Labahn, Roger; Moreux, Jean-Philippe; Muehlberger, Guenter; Pfanzelter, Eva; Therenty, Marie-Eve; Toivonen, Hannu; Tolonen, Mikko
Published in: 15th Annual International Conference of the Alliance of Digital Humanities Organizations, DH 2020, 2020
Publisher: ADHO
DOI: 10.5281/zenodo.3895269

Robust Named Entity Recognition and Linking on Historical Multilingual Documents

Authors: Emanuela Boros; Elvys Linhares Pontes; Luis Adrián Cabrera-Diego; Ahmed Hamdi; José Moreno; Nicolas Sidère; Antoine Doucet
Published in: Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Issue 2696, 2020, Page(s) 1-17
Publisher: CEUR
DOI: 10.5281/zenodo.4068074

Using a Frustratingly Easy Domain and Tagset Adaptation for Creating Slavic Named Entity Recognition Systems

Authors: Cabrera-Diego, Luis Adrián; Moreno, Jose G.; Doucet, Antoine
Published in: Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing (BSNLP at ACL), 2021, Page(s) 98–104
Publisher: ACL
DOI: 10.5281/zenodo.4730477

SpaceWars: A Web Interface for Exploring the Spatio-temporal Dimensions of WWI Newspaper Reporting

Authors: Gutehrlé, Nicolas; Harlamov, Oleg; Karimi, Farimah; Wei, Haoyu; Jean-Caurant, Axel; Pivovarova, Lidia
Published in: Proceedings of the 6th International Workshop on Computational History (HistoInformatics 2021), 2021
Publisher: CEUR
DOI: 10.5281/zenodo.5566463

Disappearing Discourses: Avoiding anachronisms and teleology with data-driven methods in studying digital newspaper collections

Authors: Zosa, Elaine; Hengchen, Simon; Marjanen, Jani; Pivovarova, Lidia; Tolonen, Mikko
Published in: Digital Humanities in the Nordic countries (DHN 2020), 2020
Publisher: Institute of Literature, Folklore and Art
DOI: 10.5281/zenodo.3631613

Atténuer les erreurs de numérisation dans la reconnaissance d'entités nommées pour les documents historiques

Authors: Boros, Emanuela; Hamdi, Ahmed; Linhares Pontes, Elvys; Cabrera-Diego, Luis Adrián; Moreno, José G.; Sidere, Nicolas; Doucet, Antoine
Published in: Conférence en Recherche d’Informations et Applications - CORIA 2021, French Information Retrieval Conference, 2021
Publisher: ARIA
DOI: 10.24348/coria.2021.mini_24

Neural Machine Translation with BERT for Post-OCR Error Detection and Correction

Authors: Thi Tuyet Hai Nguyen; Adam Jatowt; Nhu-Van Nguyen; Mickael Coustaty; Antoine Doucet
Published in: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, 2020, Page(s) 333–336
Publisher: ACM
DOI: 10.1145/3383583.3398605

Post-OCR Error Detection by Generating Plausible Candidates

Authors: Thi-Tuyet-Hai Nguyen, Adam Jatowt, Mickael Coustaty, Nhu-Van Nguyen, Antoine Doucet
Published in: 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019, Page(s) 876-881, ISBN 978-1-7281-3014-9
Publisher: IEEE
DOI: 10.1109/ICDAR.2019.00145

Elastic Embedded Background Linking for News Articles with Keywords, Entities and Events.

Authors: Luis Adrián Cabrera-Diego, Emanuela Boros, Antoine Doucet
Published in: Text REtrieval Conference (TREC) 2021, Issue News Track, 2022
Publisher: NIST
DOI: 10.5281/zenodo.6334523

Opening Digitized Newspapers for Different User Groups - Successes and Challenges

Authors: Juha Rautiainen
Published in: IFLA World Library and Information Congress 2019, 2019
Publisher: IFLA
DOI: 10.5281/zenodo.3403158

A Baseline Document Planning Method for Automated Journalism

Authors: Leo Leppänen; Hannu Toivonen
Published in: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), 2021, Page(s) 101–111
Publisher: ACL
DOI: 10.5281/zenodo.4694492

Personal Research Assistant for Online Exploration of Historical News

Authors: Lidia Pivovarova; Axel Jean-Caurant; Jari Avikainen; Khalid Alnajjar; Mark Granroth-Wilding; Leo Leppänen; Elaine Zosa; Hannu Toivonen
Published in: Proceedings of the 42nd European Conference on IR Research, Issue 12036, 2020, Page(s) 481–485, ISBN 9783030454418
Publisher: Springer
DOI: 10.1007/978-3-030-45442-5_62

Slav-NER: the 3rd Cross-lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic languages

Authors: Piskorski, Jakub; Babych, Bogdan; Kancheva, Zara; Kanishcheva, Olga; Lebedeva, Maria; Marcinczuk, Michał; Nakov, Preslav; Osenova, Petya; Pivovarova, Lidia; Pollak, Senja; Přibáň, Pavel; Radev, Ivaylo; Robnik-Šikonja, Marko; Starko, Vasyl; Steinberger, Josef; Yangarber, Roman
Published in: Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing, 2021, Page(s) 122–133
Publisher: ACL
DOI: 10.5281/zenodo.4635585

When to Use OCR Post-correction for Named Entity Recognition?

Authors: Vinh-Nam Huynh; Ahmed Hamdi; Antoine Doucet
Published in: Proceedings of the 14th International Conference on Data Analytics in Logistics (ICDAL 2020), Issue 12504, 2020, Page(s) 33–42, ISBN 9783030644512
Publisher: Springer
DOI: 10.1007/978-3-030-64452-9_3

A Comparison of Unsupervised Methods for Ad hoc Cross-Lingual Document Retrieval

Authors: Elaine Zosa; Mark Granroth-Wilding; Lidia Pivovarova
Published in: Proceedings of the Workshop on Cross-Language Search and Summarization of Text and Speech (CLSSTS2020), 2020, Page(s) 32-37
Publisher: ACL
DOI: 10.5281/zenodo.3751036

Transformer-based Methods with #Entities for Detecting Emergency Events on Social Media

Authors: Emanuela Boros, Nhu Khoa Nguyen, Gaël Lejeune, Mickaël Coustaty, Antoine Doucet
Published in: Text REtrieval Conference (TREC) 2021, Issue Incident Streams Track, 2022
Publisher: NIST
DOI: 10.5281/zenodo.6334513

Simple ways to improve NER in every language using markup

Authors: Luis Adrián Cabrera-Diego; Moreno, J. G.; Doucet, A.
Published in: Proceedings of the 2nd International Workshop on Cross-Lingual Event-Centric Open Analytics Co-Located with the 30th The Web Conference (WWW 2021), 2021, ISSN 1613-0073
Publisher: CEUR-WS
DOI: 10.5281/zenodo.4680998

Digging Deeper into the Finnish Parliamentary Protocols – Using a Lexical Semantic Tagger for Studying Meaning Change of Everyman's Rights (allemansrätten)

Authors: Kettunen, Kimmo; La Mela, Matti
Published in: Proceedings of the Digital Humanities in the Nordic Countries (5th Conference), 2020, Page(s) 63–80
Publisher: Institute of Literature, Folklore and Art
DOI: 10.5281/zenodo.3676371

Introducing the HIPE 2022 Shared Task: Named Entity Recognition and Linking in Multilingual Historical Documents

Authors: Ehrmann, Maud; Romanello, Matteo; Doucet, Antoine; Clematide, Simon
Published in: European Conference on Information Retrieval (ECIR 2022), 2022, Page(s) 347–354, ISBN 978-3-030-99739-7
Publisher: Springer
DOI: 10.1007/978-3-030-99739-7_44

Event Related Document Retrieval with Multilingual Real World Event Representation

Authors: Guillaume Bernard, Cyrille Suire, Cyril Faucher, Antoine Doucet
Published in: Proceedings of the 20th International Semantic Web Conference (ISWC), 2021
Publisher: CEUR-WS
DOI: 10.5281/zenodo.5900742

Three-part diachronic semantic change dataset for Russian

Authors: Andrey Kutuzov; Lidia Pivovarova
Published in: Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021, 2021, Page(s) 7-13
Publisher: ACL
DOI: 10.18653/v1/2021.lchange-1.2

ICDAR 2019 Competition on Post-OCR Text Correction

Authors: Christophe Rigaud; Antoine Doucet; Mickaël Coustaty; Jean-Philippe Moreux
Published in: 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019, ISBN 978-1-7281-3015-6
Publisher: IEEE
DOI: 10.1109/icdar.2019.00255

Multilingual Dynamic Topic Model

Authors: Zosa, Elaine; Granroth-Wilding, Mark
Published in: Proceedings - Natural Language Processing in a Deep Learning World (RANLP), 2019, Page(s) 1388–1396
Publisher: RANLP
DOI: 10.26615/978-954-452-056-4_159

Visual Topic Modelling for NewsImage Task at MediaEval 2021

Authors: Lidia Pivovarova, Elaine Zosa
Published in: Working Notes Proceedings of the MediaEval 2021 Workshop, 2021
Publisher: CEUR-WS
DOI: 10.5281/zenodo.5900719

Linking Named Entities across Languages using Multilingual Word Embeddings

Authors: Elvys Linhares Pontes; Jose G. Moreno; Antoine Doucet
Published in: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2020, Page(s) 329–332
Publisher: ACM
DOI: 10.1145/3383583.3398597

Can Umlauts Ruin Your Research in Digitized Newspaper Collections? A NewsEye Case Study on 'The Dark Sides of War' (1914–1918)

Authors: Klaus, Barbara
Published in: Proceedings of the Digital Humanities in the Nordic Countries (5th Conference), Issue 2612, 2020, Page(s) 267–274
Publisher: Institute of Literature, Folklore and Art
DOI: 10.5281/zenodo.4686731

Large Scale Analysis of Semantic and Temporal Aspects in Cultural Heritage Collection's Search

Authors: Sumikawa, Yasunobu; Jatowt, Adam; Doucet, Antoine; Moreux, Jean-Phillippe
Published in: 2019 Joint Conference on Digital Libraries (JCDL), Urbana-Champaign, Illinois, June 2-6, 2019, Issue yearly, 2019, Page(s) 77-86, ISBN 978-1-7281-1547-4
Publisher: IEEE Computer Society
DOI: 10.1109/jcdl.2019.00021

Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing

Authors: Nguyen, Thi-Tuyet-Hai; Jatowt, Adam; Coustaty, Mickael; Nguyen, Nhu-Van; Doucet, Antoine
Published in: 2019 Joint Conference on Digital Libraries (JCDL), Urbana-Champaign, Illinois, June 2-6, 2019, Issue yearly, 2019, Page(s) 29-38, ISBN 978-1-7281-1547-4
Publisher: IEEE Computer Society
DOI: 10.1109/jcdl.2019.00015

Towards Data-Driven Generation of Visualizations for Automatically Generated News Articles

Authors: Rola Alhalaseh, Myriam Munezero, Miika Leinonen, Leo Leppänen, Jari Avikainen, Hannu Toivonen
Published in: Proceedings of the 22nd International Academic Mindtrek Conference on - Mindtrek '18, Issue yearly, 2018, Page(s) 100-109, ISBN 9781-450365895
Publisher: ACM Press
DOI: 10.1145/3275116.3275131

An Analysis of the Performance of Named Entity Recognition over OCRed Documents

Authors: Hamdi, Ahmed; Jean-Caurant, Axel; Sidere, Nicolas; Coustaty, Mickael; Doucet, Antoine
Published in: 2019 Joint Conference on Digital Libraries (JCDL), Urbana-Champaign, Illinois, June 2-6, 2019, Issue yearly, 2019, Page(s) 333-334, ISBN 978-1-7281-1547-4
Publisher: IEEE Computer Society
DOI: 10.1109/jcdl.2019.00057

Impact Analysis of Document Digitization on Event Extraction

Authors: Nhu Khoa Nguyen; Emanuela Boroş; Gaël Lejeune; Antoine Doucet
Published in: Proceedings of the 4th Workshop on Natural Language for Artificial Intelligence (NL4AI 2020), Issue 2735, 2020, Page(s) 17–28
Publisher: CEUR-WS
DOI: 10.5281/zenodo.4734267

Scalable and Interpretable Semantic Change Detection

Authors: Syrielle Montariol; Matej Martinc; Lidia Pivovarova
Published in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2021, Page(s) 4642–4652
Publisher: ACL
DOI: 10.18653/v1/2021.naacl-main.369

Word Clustering for Historical Newspapers Analysis

Authors: Lidia Pivovarova; Jani Marjanen; Elaine Zosa
Published in: Proceedings of the Workshop on Language Technology for Digital Historical Archives, 2019, Page(s) 3-10
Publisher: ACL Bulgaria
DOI: 10.26615/978-954-452-059-5_002

Multilingual Epidemiological Text Classification: A Comparative Study

Authors: Stephen Mutuvi; Emanuela Boros; Antoine Doucet; Adam Jatowt; Gaël Lejeune; Moses Odeo
Published in: Proceedings of the 28th International Conference on Computational Linguistics (COLING), 2020, Page(s) 6172–6183
Publisher: ACL
DOI: 10.18653/v1/2020.coling-main.543

Impact of OCR Quality on Named Entity Linking

Authors: Elvys Linhares Pontes; Ahmed Hamdi; Nicolas Sidere; Antoine Doucet
Published in: International Conference on Asia-Pacific Digital Libraries 2019, 2019, Page(s) 102–115, ISBN 978-3-030-34058-2
Publisher: Springer
DOI: 10.1007/978-3-030-34058-2_11

Entity Linking for Historical Documents: Challenges and Solutions

Authors: Pontes, Elvys Linhares; Cabrera-Diego, Luis Adrián; Moreno, José G.; Boros, Emanuela; Hamdi, Ahmed; Sidère, Nicolas; Coustaty, Mickaël; Doucet, Antoine
Published in: Proceedings of the 22nd International Conference on Asia-Pacific Digital Libraries (ICADL 2020), Issue 12504, 2020, Page(s) 215–231, ISBN 9783030644512
Publisher: Springer
DOI: 10.1007/978-3-030-64452-9_19

Clustering Ideological Terms in Historical Newspaper Data with Diachronic Word Embeddings

Authors: Jani Pekka Marjanen; Lidia Pivovarova; Elaine Zosa; Jussi Kurunmäki
Published in: HistoInformatics 2019: International Workshop on Computational History 2019, part of TPDL 2019, 2019
Publisher: Springer
DOI: 10.5281/zenodo.3689466

Evaluating the Robustness of Embedding-Based Topic Models to OCR Noise

Authors: Elaine Zosa, Stephen Mutuvi, Mark Granroth-Wilding, Antoine Doucet
Published in: International Conference on Asian Digital Libraries (ICADL), 2021, ISBN 978-3-030-91668-8
Publisher: Springer
DOI: 10.1007/978-3-030-91669-5_30

Topic Modelling Discourse Dynamics in Historical Newspapers

Authors: Marjanen, Jani; Zosa, Elaine; Hengchen, Simon; Pivovarova, Lidia; Tolonen, Mikko
Published in: Proceedings of the 5th Conference Digital Humanities in the Nordic Countries (DHN 2020), 2020, Page(s) 63-77
Publisher: CEUR-WS
DOI: 10.5281/zenodo.5648114

Benchmarks for Unsupervised Discourse Change Detection

Authors: Duong, Quan; Pivovarova, Lidia; Zosa, Elaine
Published in: Proceedings of the 6th International Workshop on Computational History (HistoInformatics 2021), Issue 2981, 2021
Publisher: Springer
DOI: 10.5281/zenodo.5780033

Capturing Evolution in Word Usage: Just Add More Clusters?

Authors: Matej Martinc; Syrielle Montariol; Elaine Zosa; Lidia Pivovarova
Published in: WWW '20: Companion Proceedings of the Web Conference 2020, 2020, Page(s) 343-349
Publisher: ACM
DOI: 10.1145/3366424.3382186

A Dataset for Multi-lingual Epidemiological Event Extraction

Authors: Mutuvi, Stephen; Doucet, Antoine; Lejeune, Gael; Odeo, Moses
Published in: Proceedings of the 12th Language Resources and Evaluation Conference, 2020, Page(s) 4139–4144
Publisher: European Language Resources Association
DOI: 10.5281/zenodo.3709626

Not All Comments are Equal: Insights into Comment Moderation from a Topic-Aware Model

Authors: Elaine Zosa; Ravi Shekhar; Mladen Karan; Matthew Purver
Published in: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), 2021, Page(s) 1652–1662
Publisher: RANLP
DOI: 10.5281/zenodo.5648098

EMBEDDIA at SemEval-2022 Task 8: Investigating Sentence, Image, and Knowledge Graph Representations for Multilingual News Article Similarity

Authors: Elaine Zosa, Emanuela Boros, Boshko Koloski, Lidia Pivovarova
Published in: Proceedings of SemEval-2022 Workshop Task 8, 2022
Publisher: ACL
DOI: 10.5281/zenodo.6369944

Token-Level Multilingual Epidemic Dataset for Event Extraction

Authors: Stephen Mutuvi; Emanuela Boros; Antoine Doucet; Gaël Lejeune; Adam Jatowt; Moses Odeo
Published in: Proceedings of the 25th International Conference on Theory and Practice of Digital Libraries (TPDL), Issue 12866, 2021, Page(s) 55–59
Publisher: Springer
DOI: 10.5281/zenodo.5780019

Evaluating Sequence-to-Sequence Models for Handwritten Text Recognition

Authors: Johannes Michael, Roger Labahn, Tobias Grüning, Jochen Zöllner
Published in: 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019, Page(s) 1286-1293, ISBN 978-1-7281-3014-9
Publisher: IEEE
DOI: 10.1109/icdar.2019.00208

L3i_LBPAM at the FinSim-2 task: Learning Financial Semantic Similarities with Siamese Transformers

Authors: Nhu Khoa Nguyen; Emanuela Boros; Gaël Lejeune; Antoine Doucet; Thierry Delahaut
Published in: Companion Proceedings of the Web Conference, 2020, Page(s) 302–306
Publisher: ACM
DOI: 10.5281/zenodo.4734321

The Helsinki Digital Humanities Hackathon: Two Perspectives on Multidisciplinary Historical Newspapers Research in a Hackathon Context

Authors: Ros, Ruben; Oberbichler, Sarah
Published in: Proceedings of the Twin Talks 2 and 3 Workshops at DHN 2020 and DH 2020, 2020, Page(s) 66–74
Publisher: Institute of Literature, Folklore and Art
DOI: 10.5281/zenodo.3689228

Multilingual Topic Labelling of News Topics using Ontological Mapping

Authors: Elaine Zosa, Lidia Pivovarova, Michele Boggia, Sardana Ivanova
Published in: European Conference on Information Retrieval (ECIR), 2022
Publisher: Springer
DOI: 10.5281/zenodo.6334491

Étude comparative de méthodes de classification multilingue appliquées à l'épidémiologie

Authors: Mutuvi, Stephen; Boros, Emanuela; Doucet, Antoine; Lejeune, Gaël; Jatowt, Adam; Odeo, Moses
Published in: COnférence en Recherche d'Informations et Applications - CORIA 2021, French Information Retrieval Conference, 2021
Publisher: ARIA
DOI: 10.5281/zenodo.4734471

A Comprehensive Extraction of Relevant Real-World-Event Qualifiers for Semantic Search Engines

Authors: Guillaume Bernard, Cyrille Suire, Cyril Faucher, Antoine Doucet
Published in: International Conference on Theory and Practice of Digital Libraries (TPDL), 2021, Page(s) 153-164, ISBN 978-3-030-86323-4
Publisher: Springer
DOI: 10.1007/978-3-030-86324-1_19

A Method for Wavelet-Based Time Series Analysis of Historical Newspapers

Authors: Avikainen, Jari
Published in: 2019
Publisher: University of Helsinki
DOI: 10.5281/zenodo.3628262

"Wir dürfen wieder Österreicher sein!" Die Rolle der Tagespresse in österreichischen Nation-Building-Prozessen 1945–1948 – eine quantitative Analyse ausgewählter digitaler Zeitungskorpora samt Vorschlägen zur didaktischen Umsetzung

Authors: Stefan Patrick Hechl
Published in: 2021
Publisher: Universität Innsbruck
DOI: 10.5281/zenodo.4468295

Wortvektoren

Authors: Laasch, Bastian Marc
Published in: 2018
Publisher: University of Rostock
DOI: 10.18453/rosdok_id00002309

Embeddings built on 19th century newspapers from Finland

Authors: Lidia Pivovarova, Elaine Zosa, Jani Marjanen
Published in: 2019
Publisher: Zenodo
DOI: 10.5281/zenodo.3557480

Doing historical research with digital newspapers – perspectives of DH scholars

Authors: Sarah Oberbichler, Eva Pfanzelter, Stefan Hechl, Jani Marjanen
Published in: Europeana Tech, Issue 16: Newspapers, 2021
Publisher: Europeana

Using LDA and Jensen-Shannon Distance (JSD) to group similar newspaper articles

Authors: Sarah Oberbichler
Published in: 2020
Publisher: Zenodo
DOI: 10.5281/zenodo.3887193

The Book of Abstracts for What’s Past is Prologue: The NewsEye International Conference.

Authors: Antti Kanner, Eetu Mäkelä, Jani Marjanen, Mikko Tolonen, Sarah Oberbichler, Quan Duong, Lidia Pivovarova, Dilawar Ali, Steven Verstockt, Étienne Ollion, Rubing Shen, Matthias Arnold, David Brown, Raven Adam, Saranya Balasubramanian, Vera Maria Charvat, Manfred Füllsack, Jörn Kleinert, Hanna Misera, Nenad Pantelic, Jakob Sonnberger, Georg Vogeler, Alessandra De Mulder, Heikki K
Published in: 2021
Publisher: Zenodo
DOI: 10.5281/zenodo.5167375

Covid-19 et grippe espagnole: Quand la presse du XXe siècle rappelle celle de 2020

Authors: Nejma Omari, Antoine Doucet
Published in: 2020
Publisher: The Conversation

Annotation Guidelines for Named Entity Recognition, Entity Linking and Stance Detection (v3.1)

Authors: Ahmed Hamdi, Elvys Linhares Pontes, Antoine Doucet
Published in: 2021
Publisher: Zenodo
DOI: 10.5281/zenodo.4574199

NewsEye Policy Brief

Authors: NewsEye consortium
Published in: 2020
Publisher: Zenodo
DOI: 10.5281/zenodo.4291895

Assessing the Impact of OCR Noise on Multilingual Event Detection over Digitised Documents

Authors: Emanuela Boros, Nhu Khoa Nguyen, Gaël Lejeune, Antoine Doucet
Published in: International Journal on Digital Libraries, 2022, ISSN 1432-5012
Publisher: Springer Verlag
DOI: 10.1007/s00799-022-00325-2

The expansion of isms, 1820-1917: Data-driven analysis of political language in digitized newspaper collections

Authors: Jani Marjanen; Jussi Antero Kurunmäki; Lidia Pivovarova; Elaine Zosa
Published in: Journal of Data Mining & Digital Humanities, HistoInformatics, Issue 6159, 2020, ISSN 2416-5999
Publisher: EPIsciences
DOI: 10.5281/zenodo.4447025

A Multilingual Study of Multi-Sentence Compression using Word Vertex-Labeled Graphs and Integer Linear Programming

Authors: Linhares Pontes, Elvys; Huet, Stéphane; Torres Moreno, Juan Manuel; Gouveia da Silva, Thiago; Carneiro Linhares, Andréa
Published in: Computación y Sistemas, Issue 24 (2), 2020, ISSN 2007-9737
Publisher: IPN
DOI: 10.13053/cys-24-2-3335

Integrated interdisciplinary workflows for research on historical newspapers: Perspectives from humanities scholars, computer scientists, and librarians

Authors: Sarah Oberbichler; Emanuela Boros; Antoine Doucet; Jani Marjanen; Eva Pfanzelter; Juha Rautiainen; Hannu Toivonen; Mikko Tolonen
Published in: Journal of the Association for Information Science and Technology, Issue 73 (2), 2022, Page(s) 225–239, ISSN 2330-1643
Publisher: John Wiley and Sons Ltd
DOI: 10.1002/asi.24565

In Depth Analysis of the Impact of OCR Errors on Named Entity Recognition and Linking

Authors: Ahmed Hamdi, Elvys Linhares Pontes, Nicolas Sidère, Mickaël Coustaty, Antoine Doucet
Published in: Natural Language Engineering, 2022, Page(s) 1-24, ISSN 1351-3249
Publisher: Cambridge University Press
DOI: 10.1017/s1351324922000110

Digital interfaces of historical newspapers: opportunities, restrictions and recommendations

Authors: Eva Pfanzelter; Sarah Oberbichler; Jani Marjanen; Pierre-Carl Langlais; Stefan Hechl
Published in: Journal of Data Mining and Digital Humanities, Volume on HistoInformatics, Issue 6121, 2021, ISSN 2416-5999
Publisher: EPIsciences
DOI: 10.5281/zenodo.4446818

Als eine andere Epidemie die Welt in Atem hielt: Die Spanische Grippe 1918/19 in der österreichischen Presse

Authors: Sarah Oberbichler, Stefan Hechl, Eva Pfanzelter
Published in: Tiroler Chronist - Fachblatt von und für Chronisten in Nord-, Süd- und Osttirol, Issue 154, 2020, Page(s) 15-22, ISSN 1990-9799
Publisher: Tiroler Bildungsforum

A data-driven approach to studying changing vocabularies in historical newspaper collections

Authors: Hengchen, Simon; Ros, Ruben; Marjanen, Jani; Tolonen, Mikko
Published in: Digital Scholarship in the Humanities, Issue 36, 2021, Page(s) 109–126, ISSN 2055-7671
Publisher: Oxford University Press
DOI: 10.5281/zenodo.5783070

Survey of Post-OCR Processing Approaches

Authors: Thi Tuyet Hai Nguyen; Adam Jatowt; Mickaël Coustaty; Antoine Doucet
Published in: ACM Computing Surveys, Issue 54(6), 2022, Page(s) 1–37, ISSN 0360-0300
Publisher: Association for Computing Machinery, Inc.
DOI: 10.1145/3453476

A National Public Sphere? Analyzing the Language, Location, and Form of Newspapers in Finland, 1771–1917

Authors: Jani Marjanen; Ville Vaara; Antti Kanner; Hege Roivainen; Eetu Mäkelä; Leo Lahti; Mikko Tolonen
Published in: Journal of European Periodical Studies, Issue 4 (1), 2019, Page(s) 55–78, ISSN 2506-6587
Publisher: ESPRit (European Society for Periodical Research)
DOI: 10.21825/jeps.v4i1.10483

MELHISSA: a multilingual entity linking architecture for historical press articles

Authors: Elvys Linhares Pontes; Luis Adrián Cabrera-Diego; José G. Moreno; Emanuela Boros; Ahmed Hamdi; Antoine Doucet; Nicolas Sidère; Mickaël Coustaty
Published in: International Journal on Digital Libraries, 2021, ISSN 1432-5012
Publisher: Springer Verlag
DOI: 10.1007/s00799-021-00319-6

Topic-specific corpus building: A step towards a representative newspaper corpus on the topic of return migration using text mining methods

Authors: Sarah Oberbichler, Eva Pfanzelter
Published in: Journal of Digital History, 2021
Publisher: De Gruyter

Tracing Discourses in Digital Newspaper Collections: A Contribution to Digital Hermeneutics while Investigating 'Return Migration' in Historical Press Coverage

Authors: Sarah Oberbichler, Eva Pfanzelter
Published in: Digitised Newspapers – A New Eldorado for Historians?, 2022, ISBN 9783110729214
Publisher: De Gruyter Oldenbourg

Crossing or Intersecting the Emperor’s Desk with digitized Newspaper Data: Entity-source-networks in the late Habsburg Empire

Authors: Martin Gasteiner, Andreas Enderlin
Published in: Digitised Newspapers – A New Eldorado for Historians?, 2022, ISBN 9783110729214
Publisher: De Gruyter Oldenbourg

ICPR 2020 Competition on Text Block Segmentation on a NewsEye Dataset

Authors: Johannes Michael; Max Weidemann; Bastian Laasch; Roger Labahn
Published in: Proceedings of ICPR International Workshops and Challenges (2020), Issue 12668, 2021, Page(s) 405–418
Publisher: Springer
DOI: 10.1007/978-3-030-68793-9_30

International: From Legal to Civic Discourse and Beyond in the Nineteenth Century

Authors: Jani Marjanen, Ruben Ros
Published in: Nationalism and Internationalism Intertwined - A European History of Concepts Beyond the Nation State, 2022, Page(s) 60-85, ISBN 978-1-80073-314-5
Publisher: Berghahn

Adaptive Edit-Distance and Regression Approach for Post-OCR Text Correction

Authors: Thi-Tuyet-Hai Nguyen, Mickaël Coustaty, Antoine Doucet, Adam Jatowt, Nhu-Van Nguyen
Published in: Maturity and Innovation in Digital Libraries - 20th International Conference on Asia-Pacific Digital Libraries, ICADL 2018, Hamilton, New Zealand, November 19-22, 2018, Proceedings, Issue 11279, 2018, Page(s) 278-289, ISBN 978-3-030-04256-1
Publisher: Springer International Publishing
DOI: 10.1007/978-3-030-04257-8_29

Evaluating the Impact of OCR Errors on Topic Modeling

Authors: Stephen Mutuvi, Antoine Doucet, Moses Odeo, Adam Jatowt
Published in: Maturity and Innovation in Digital Libraries - 20th International Conference on Asia-Pacific Digital Libraries, ICADL 2018, Hamilton, New Zealand, November 19-22, 2018, Proceedings, Issue 11279, 2018, Page(s) 3-14, ISBN 978-3-030-04256-1
Publisher: Springer International Publishing
DOI: 10.1007/978-3-030-04257-8_1

National Sentiment: Nation Building and Emotional Language in Nineteenth-Century Finland

Authors: Jani Marjanen
Published in: Lived Nation as the History of Experiences and Emotions in Finland, 1800-2000, 2021, Page(s) 61–83, ISBN 978-3-030-69881-2
Publisher: Springer
DOI: 10.1007/978-3-030-69882-9_3
