Final Report Summary - ALEXANDRIA (Foundations for Temporal Retrieval, Exploration and Analytics in Web Archives)
annotations. These tools have been adopted by the wider digital library community, helped by influential research articles and by outreach activities such as hackathons and workshops.
The Alexandria team proposed novel information retrieval techniques to make such temporal collections searchable and therefore usable by social scientists and humanities researchers. In this context, we focused on 1) evolution-aware entity-based enrichment and indexing, 2) aggregating social networks and streams, 3) temporal retrieval and ranking, and 4) collaborative exploration and analytics.
Related to 1), given the importance of Wikipedia (the largest and most popular general reference work on the Web), we found that a pressing issue was the enrichment of such knowledge sources with knowledge from archived content. To address this, we developed methods that use high-quality news sources to enrich Wikipedia content by suggesting articles with novel and important facts, and explored how to address the problem of missing and outdated citations by suggesting citations from external temporal news collections. This has become a research direction in its own right, with a number of follow-up studies from other groups and ongoing collaboration with Wikimedia researchers. We also developed ArchiveSpark, motivated by the need to give researchers outside computer science who work with Web archives easier methods for access and analysis.
Related to 2), we focused on entity- and event-oriented aggregation approaches and on constructing innovative, publicly available datasets that can foster further research on related problems. Due to ethical and privacy concerns, we decided to provide only fully anonymized metadata and semantic information for social media datasets, for example a large-scale, fully anonymized knowledge base (called TweetsKB) for a large collection of tweets spanning more than five years. We also collected and released a large dataset of news articles from highly popular media agencies that covers worldwide events over a full year.
Related to 3), Alexandria researchers proposed ranking models that take both the current popularity and the historical significance of events into account when ranking historical documents, and extensively investigated temporal diversification approaches that help users get an overview of results across time and topics. We also proposed a state-of-the-art queryless news ranking model that reconciles news event popularity and historical cues for ranking a daily batch of a large news dataset, and proposed novel retrieval methods to improve navigational search in Web archives.
Related to 4), we were interested in studying how to support collaborative and complex exploration and analysis processes, and how to leverage users' exploration and analysis processes to improve the Web archive. With ArchiveWeb, we provide a collaborative search and sharing platform with search history and website archiving functionality. Together with the LogCanvas search history visualization, it lets users collaboratively construct, improve and explore Web collections.
Throughout the whole period, ALEXANDRIA-related research gave rise to lively discussions and fruitful scientific exchange within a larger scientific community at our yearly ALEXANDRIA workshops. These workshops brought together communities involved in Web Archiving, Digital Preservation, Digital Humanities and Information Retrieval to encourage a closer dialogue between researchers from these different disciplines and institutions.