Final Report Summary - COMDATA (Infrastructures for Community-Based Data Management)
During this project, my team and I worked on both physical data storage and logical data abstractions. On the physical data storage side, our initial efforts focused on designing and implementing diplodocus[RDF], currently the fastest RDF data management system. We also devised new data distribution mechanisms and developed several new data partitioning algorithms. We extended diplodocus[RDF] to work in massively parallel cloud environments and to handle provenance data. We also led an international effort to assess the extent to which NoSQL stores could be used to store Linked Data in the cloud.
The second part of my project, Logical Data Abstractions, also showed considerable progress. My team and I had three papers accepted at the top conferences in the domain. ZenCrowd, a system using crowdsourcing and probabilistic reasoning to link textual data to the Linked Data Cloud, was accepted at the World Wide Web conference. We also published a paper at SIGIR (the top venue in Information Retrieval) devising new search methods for Linked Open Data (LOD) that combine standard inverted indices with structured graph search. Finally, we devised a new system called Pick-a-Crowd to integrate logical abstractions using crowdsourcing and social networks, which was presented at the World Wide Web conference.
In terms of results and outcomes, this project will make it possible to handle community-based data better, in terms of both efficiency and effectiveness. With the recent commitment of many governments (including Switzerland and several EU countries) to publishing their data online in open formats, and the excitement generated by Linked Data formats in several scientific and industrial domains, we can expect a multitude of large-scale, interlinked data sets to proliferate on the Internet in the near future. The information infrastructures required to efficiently query, combine, and manipulate those data sets are currently missing. In this context, it is essential to tackle the two problems described above, physical data storage and logical data abstractions, now, in order to foster the development of efficient and robust future infrastructures capable of supporting a wide range of higher-level applications on those data sets. In particular, techniques to better handle federated queries and heterogeneous data sets are urgently needed.