
"Foundations for Temporal Retrieval, Exploration and Analytics in Web Archives"

Final Report Summary - ALEXANDRIA (Foundations for Temporal Retrieval, Exploration and Analytics in Web Archives)

The ALEXANDRIA project developed the models, tools and techniques needed to explore and analyze Web archives in a meaningful way. It made significant technological advances in building information systems for large-scale temporal collections such as Web archives, news archives and Wikipedia. In the first half of the project we partnered with the Internet Archive to obtain large real-world datasets, such as the entire German and UK Web from 1998-2013. We built an infrastructure to access such large data sources efficiently, to derive meaningful sub-collections for closer examination, and to extract knowledge items and semantic annotations. These tools have been adopted by the wider digital library community, helped by influential research articles and by outreach activities such as hackathons and workshops.

The Alexandria team proposed novel information retrieval techniques to make such temporal collections searchable, and therefore usable, by social scientists and humanities researchers. In this context, we focused on 1) evolution-aware entity-based enrichment and indexing, 2) aggregating social networks and streams, 3) temporal retrieval and ranking, and 4) collaborative exploration and analytics.

Related to 1), given the importance of Wikipedia (the largest and most popular general reference work on the Web), we found that a pressing issue was the enrichment of such knowledge sources with knowledge from archived content. To address this, we developed methods that use high-quality news sources to enrich Wikipedia content by suggesting articles with novel and important facts, and we explored how to address the problem of missing and outdated citations by suggesting citations from external temporal news collections. This has become a research direction in its own right, with a number of follow-up studies by other groups and ongoing collaboration with Wikimedia researchers. We also developed ArchiveSpark, motivated by the need to give researchers outside computer science easier ways to access and analyze Web archives.

Related to 2), we focused on entity- and event-oriented aggregation approaches and on constructing innovative, publicly available datasets that can foster further research on related problems. Because of ethical and privacy concerns, we decided to provide only fully anonymized metadata and semantic information for social media datasets, for example TweetsKB, a large-scale, fully anonymized knowledge base built from a large collection of tweets spanning more than five years. We also collected and released a large dataset of news articles from highly popular media agencies that covers worldwide events over a full year.

Related to 3), Alexandria researchers proposed ranking models that take both the current popularity and the historical significance of events into account when ranking historical documents, and extensively investigated temporal diversification approaches that help users get an overview of results across time and topics. We also proposed a state-of-the-art queryless news ranking model that reconciles news event popularity and historical cues for ranking a daily batch from a large news dataset, and proposed novel retrieval methods to improve navigational search in Web archives.

Related to 4), we studied how to support collaborative and complex exploration and analysis processes, and how to leverage users' exploration and analysis processes to improve the Web archive. With ArchiveWeb, we provide a collaborative search and sharing platform with search history and website archiving functionality. Together with the LogCanvas visualization of search histories, it allows users to collaboratively construct, improve and explore Web collections.

Throughout the project, ALEXANDRIA-related research stimulated lively discussion and fruitful scientific exchange within a larger scientific community at our yearly ALEXANDRIA workshops. These workshops brought together communities involved in Web archiving, digital preservation, digital humanities and information retrieval to encourage a closer dialogue between researchers from these different disciplines and institutions.