Mob Data Sourcing

Final Report Summary - MODAS (Mob Data Sourcing)

Crowd-based data sourcing is a new and powerful data procurement paradigm that engages Web users to collectively contribute data, analyze information and share opinions. It democratizes data collection, cuts the reliance of companies and researchers on stagnant, overused datasets, and has great potential to revolutionize our information world. Yet success has so far been limited to a handful of projects such as Wikipedia and IMDb. This stems notably from the difficulty of managing huge volumes of data and large numbers of users whose quality and reliability are uncertain. Every initiative has had to tackle, almost from scratch, the same non-trivial challenges. The resulting ad hoc solutions, even when successful, are application-specific and rarely shareable. Our goal in the MoDaS project was to develop solid scientific foundations for Web-scale data sourcing. Such a principled approach is essential to obtain knowledge of superior quality, to carry out the task more effectively and automatically, to reuse solutions, and thereby to accelerate the practical adoption of this new technology. In pursuit of this goal, we investigated the logical, algorithmic, and methodological foundations for managing large-scale crowd-sourced data and for developing applications over such information.

A major contribution of the research is the development of a novel model for data-centric crowd sourcing, which we call crowd mining. To understand the importance of this new model, observe that a key challenge in crowd-based data management is that human knowledge forms an open world, and it is thus difficult to know what kind of information to look for. Classic database research has addressed this problem with data mining techniques that identify interesting data patterns. These techniques, however, are not suitable for the crowd, mainly because of properties of human memory, such as the tendency to remember simple trends and summaries rather than exact details. Following these observations, MoDaS developed, for the first time, the foundations of crowd mining. We defined the formal settings for crowd mining; based on these, we designed a framework of generic components for choosing the best questions to ask the crowd and for mining significant patterns from the answers. We suggested generic implementations for these components and tested the resulting algorithm's performance on benchmarks that we designed for this purpose. Our algorithms consistently outperformed the alternative baseline algorithms. Encouraged by the success of this direction, we then explored a novel approach that broadens crowd data sourcing by enabling users to pose general questions, to mine the crowd for potentially relevant data, and to receive concise, relevant answers that represent frequent, significant data patterns.
Our approach is based on (1) a simple generic model that captures both ontological knowledge and the individual history or habits of crowd members, from which frequent patterns are mined; (2) a query language in which users can declaratively specify their information needs and the data patterns of interest; (3) efficient query evaluation algorithms, which enable mining semantically concise answers while minimizing the number of questions posed to the crowd; and (4) an implementation of these ideas that mines the crowd through an interactive user interface. Experimental results with both a real-life crowd and synthetic data demonstrate the feasibility and effectiveness of the approach.
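
To make the two generic components mentioned above more concrete (choosing the next question for the crowd, and mining significant patterns from the answers), the following minimal Python sketch shows one way they could fit together. It is an illustration only and not the MoDaS algorithm: the names RuleStats, next_question and significant_patterns, the 0-to-1 answer scale, the averaging of answers, the standard-error heuristic, and the example patterns are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from math import inf, sqrt
from statistics import mean, stdev
from typing import Dict, List


@dataclass
class RuleStats:
    """Answers collected so far for one candidate pattern (hypothetical structure).

    Each answer is one crowd member's estimate, on an assumed 0-to-1 scale,
    of how often the pattern holds in their own habits.
    """
    answers: List[float] = field(default_factory=list)

    def estimate(self) -> float:
        # Aggregate the crowd by simple averaging (an assumption of this sketch).
        return mean(self.answers) if self.answers else 0.0

    def std_error(self) -> float:
        # With fewer than two answers the estimate is treated as fully uncertain.
        if len(self.answers) < 2:
            return inf
        return stdev(self.answers) / sqrt(len(self.answers))


def next_question(stats: Dict[str, RuleStats], threshold: float) -> str:
    """Return the pattern whose significance is least settled.

    Heuristic: large standard error relative to the distance between the
    current estimate and the threshold means one more answer is most useful.
    """
    def unsettledness(name: str) -> float:
        s = stats[name]
        se = s.std_error()
        gap = abs(s.estimate() - threshold)
        if se == inf or gap == 0:
            return inf
        return se / gap

    return max(stats, key=unsettledness)


def significant_patterns(stats: Dict[str, RuleStats], threshold: float) -> List[str]:
    """Return, in sorted order, the patterns whose average answer reaches the threshold."""
    return sorted(
        name for name, s in stats.items()
        if s.answers and s.estimate() >= threshold
    )


# Made-up example data: three candidate patterns, three crowd answers each.
stats = {
    "sore throat -> ginger tea": RuleStats([0.8, 0.9, 0.7]),
    "headache -> coffee": RuleStats([0.1, 0.2, 0.1]),
    "cold -> chicken soup": RuleStats([0.4, 0.7, 0.5]),
}

print(next_question(stats, threshold=0.5))         # cold -> chicken soup
print(significant_patterns(stats, threshold=0.5))  # ['cold -> chicken soup', 'sore throat -> ginger tea']
```

In this toy run the crowd clearly agrees on the first two patterns, so the next question goes to the third one, whose average sits close to the threshold while the individual answers disagree. A real system would also need to decide which crowd member to ask and when to stop asking; both are left out of this sketch.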

We believe that the crowd mining framework developed in MoDaS is precisely the technological breakthrough needed to open the way to a new and otherwise unattainable universe of knowledge in a wide range of applications, from scientific to social and economic ones.