Community Research and Development Information Service - CORDIS


ALEXANDRIA Report Summary

Project ID: 339233
Funded under: FP7-IDEAS-ERC
Country: Germany

Mid-Term Report Summary - ALEXANDRIA (Foundations for Temporal Retrieval, Exploration and Analytics in Web Archives)

The ALEXANDRIA project aims to develop the models, tools and techniques necessary to explore and analyze Web archives in a meaningful way. ALEXANDRIA is making significant technological advances in building information systems for large-scale temporal collections such as Web archives, news archives and Wikipedia. In the first half of the project we partnered with the Internet Archive to obtain large real-world datasets, such as the entire German and UK Web from 1998 to 2013. We built an infrastructure to access such large data sources efficiently, to derive meaningful sub-collections for closer examination, and to extract knowledge items and semantic annotations. These tools have been adopted by the wider digital library community, facilitated by high-impact research articles and by outreach activities such as hackathons and workshops.

The ALEXANDRIA team proposed novel information retrieval techniques that make such temporal collections searchable, and therefore usable, by social scientists and humanities researchers. Current search algorithms tend to be biased towards recent, fresh results. However, this is not always desirable when searching collections that span decades. To this end, we have designed algorithms and built systems that accommodate such historical search intents. Research on supporting search over such large collections has long been hampered by the technical challenge of indexing them. Broadly speaking, the sheer size of these collections creates a high technical entry barrier to making them searchable. In another strand of our research, we therefore focus on identifying only the important documents in the archive, and only the relevant content within those documents, for indexing. This reduces the data to a size that is manageable with the available computational resources and thus lowers the entry barrier for search. We have also built search systems based on these ideas; they are accessible online and have been published in reputable venues.

A related technical issue in supporting search, exploration and knowledge creation over such collections is that the quality of the results has to be evaluated by humans. In information systems this is typically done either through controlled user studies or through crowdsourcing. The larger the scale of the evaluation, the more confident we can be in our results. To this end, we have made fundamental contributions to cost-effective crowdsourcing that allow us to scale up human evaluations.

Finally, we have made important contributions in using information from news collections that is missing or underrepresented in knowledge bases like Wikipedia. Temporal collections, it turns out, hold a great deal of information that is missing from knowledge bases and crucial to them. In our research, we exploit these collections as knowledge sources to enrich Wikipedia. Specifically, we try to identify Wikipedia pages that are underrepresented and recommend potentially important facts extracted from news collections. We also contribute to improving citations, a frequently acknowledged problem in the collaboratively edited Wikipedia. Specifically, we support existing facts in Wikipedia with citations from credible sources, thereby improving its overall knowledge quality.

