CORDIS - EU research results

Foundations of Web Data Management

Final Report Summary - WEBDAM (Foundations of Web Data Management)

The Webdam ERC grant was a 5-year project that started in December 2008. The goal was to develop a formal model for Web data management that would open new horizons for the development of the Web in a well-principled way, enhancing its functionality, performance, and reliability.

Information of interest may be found on the Web in a variety of forms, in many systems, with different access protocols. For instance, a typical user may have information on many devices (smartphone, laptop, TV box, etc.), in many systems (mailers, blogs, web sites, etc.), and on many social networks (Facebook, Picasa, etc.). This same user may have access to more information from family, friends, associations, and companies, or from organizations (tax, health, etc.). The control and management of this diversity are today beyond the skills of casual users. Facing similar issues, companies see the cost of managing and integrating information skyrocketing. The thesis of Webdam is that this diversity of data can be managed using a distributed knowledge base that handles data and meta-data, as well as access control and localization information, in a single holistic setting. We believe that complex Web data management tasks that currently require deep expertise will be greatly facilitated by the automatic reasoning of the knowledge base's inference engine. We have obtained fundamental results in that direction and started experimenting with a prototype system.

Distributed knowledge base and Webdamlog

As a foundation for managing distribution, we studied a model of a distributed knowledge base that handles data and meta-data, as well as access control and localization, in a single integrated setting. The main contribution is a novel rule-based language, Webdamlog, featuring the new concept of delegation. Using delegation, peers can exchange knowledge and distribute computation. We have implemented a system supporting Webdamlog, studied optimization techniques adapted to that setting, and evaluated the performance of the system, notably in the presence of access control.
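
To give a rough intuition of delegation, the following is a minimal Python sketch and not Webdamlog itself. It assumes that delegation can be approximated as one peer installing a simple copy rule at another peer, which then evaluates the rule on its own facts and sends the results back. The names `Peer`, `delegate`, `run_delegations`, and the example relations `photos` and `album` are hypothetical and do not come from the project.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    name: str
    facts: dict = field(default_factory=dict)      # relation name -> set of tuples
    delegated: list = field(default_factory=list)  # rules installed by other peers

    def add_fact(self, relation, row):
        self.facts.setdefault(relation, set()).add(row)


def delegate(network, owner, remote, relation, target_relation):
    """Install at `remote` the toy rule: target_relation@owner(x) :- relation@remote(x)."""
    network[remote].delegated.append((relation, owner, target_relation))


def run_delegations(network):
    """Each peer evaluates the rules delegated to it and ships derived facts to the head peer."""
    for peer in network.values():
        for relation, head_peer, head_relation in peer.delegated:
            # Copy to a tuple so the source set is not mutated while iterating.
            for row in tuple(peer.facts.get(relation, ())):
                network[head_peer].add_fact(head_relation, row)


# Hypothetical two-peer network: alice asks bob to feed her album from his photos.
network = {name: Peer(name) for name in ("alice", "bob")}
network["bob"].add_fact("photos", ("beach.jpg",))
delegate(network, owner="alice", remote="bob", relation="photos", target_relation="album")
run_delegations(network)
album = network["alice"].facts["album"]
```

In this sketch the computation runs at the peer that holds the data, and only derived facts travel back to the peer that wrote the rule. A real system would also have to handle recursion, updates, and access control, none of which is modelled here.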

Imprecise data and Probabilistic XML

Data from the Web are imprecise and uncertain. To manage this imprecision in a well-principled way, we have made significant advances in the field of probabilistic databases, and specifically in probabilistic XML (XML is a semi-structured data model and the standard for data exchange on the Web). We have introduced new tractable probabilistic models for representing uncertain hierarchical information, and carried out in-depth studies of query evaluation, aggregation, and updates in various probabilistic XML models.
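
As an illustration of what uncertain hierarchical information can look like, here is a minimal Python sketch under a simplifying assumption: each node of a tree is present independently with a given probability, conditional on its parent being present. This is one simple model chosen for the example, not necessarily one of the models introduced in the project, and the names `Node` and `expected_count` are hypothetical. The function computes a basic aggregate, the expected number of nodes carrying a given label.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    prob: float = 1.0  # probability the node is present, given that its parent is
    children: list = field(default_factory=list)


def expected_count(node, label, reach=1.0):
    """Expected number of nodes named `label`, using linearity of expectation.

    `reach` is the probability that the parent of `node` is present.
    """
    here = reach * node.prob  # probability that this node is present
    total = here if node.label == label else 0.0
    for child in node.children:
        total += expected_count(child, label, here)
    return total


# Hypothetical document: a person with an uncertain email and an uncertain
# address that may itself contain a second email.
doc = Node("person", 1.0, [
    Node("email", 0.8),
    Node("address", 0.5, [Node("email", 0.5)]),
])
expected_emails = expected_count(doc, "email")  # 0.8 + 0.5 * 0.5
```

Under the independence assumption, the probability that a node is present is the product of the probabilities along its path from the root, which is why a single traversal of the tree suffices for this aggregate.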

Business artifacts and Collaborative workflows

When supporting complex activities in a Web setting, one typically has to organize the cooperation between possibly many systems, and notably the sequencing of their tasks. The specification of such sequencings, sometimes referred to as choreography, is poorly understood. We pursued an original approach that models tasks with pieces of data called business artifacts (following IBM terminology). The evolution of an artifact is constrained by rules on the evolution of its data. Using this approach, we carried out foundational work to understand the intrinsic nature of workflows shared by collaborative systems.

Webdam has stressed education. In particular,

• A textbook (advanced undergraduate or graduate level) on Web data management was published by Cambridge University Press [98] in 2009. The book is available for free on the Webdam Web site at http://webdam.inria.fr/wordpress/index.html.

• A book, “Sciences des données”, was published by Fayard [97] in 2012. The book is available for free at http://lecons-cdf.revues.org/506 and, in an English translation by Liz Libbrecht, at http://lecons-cdf.revues.org/558.



Major difficulties encountered

Complexity of the model
In the early stages of the project, much work was devoted to the Active XML model, and important results were obtained. When considering issues such as trust, access rights, or provenance in the context of social data, the tree structure of the Active XML model turned out to complicate some of the issues. We therefore refocused on the relational model and Datalog-style languages for some of the more recent work, e.g. the work around Webdamlog. See Section A.6.11. Future work should reconcile the two approaches.

Localization

In a first phase, the Webdam project involved researchers of the Institut National de Recherche en Informatique et en Automatique (Inria), from the former teams Gemo/Leo at University Paris Sud (now the Oak team) and Dahu at École normale supérieure de Cachan (ENS Cachan). The project also rapidly involved researchers at Télécom ParisTech, around Pierre Senellart, especially on probabilistic data. Keeping the group focused in such a distributed environment turned out to be more complicated than expected. After a couple of years, the research was concentrated at ENS Cachan and Télécom ParisTech.

Human Resources

People are the main asset of a project such as Webdam. The main reason for the success of Webdam was that we could bring together some incredible talents. This brought an unexpected issue: while we could offer only temporary positions, talented people were offered permanent positions elsewhere (in top places such as Oxford, Tel Aviv, UCSD...) and naturally accepted them. Although this was a difficulty, it was also an opportunity for Webdam to grow by developing collaborations with some of the members who left. Such collaborations are notably ongoing with Tel Aviv University and UCSD.