
Exploiting User-Generated Geospatial Content Streams

Final Report Summary - GEOSTREAM (Exploiting User-Generated Geospatial Content Streams)

Executive Summary:
The overall goal of the GEOSTREAM project was to develop techniques and tools for collecting, integrating, analysing and publishing user-generated geospatial content. In particular, the underlying idea and motivation is to harvest geospatial content made available by users on the Web to enhance and enrich services related, for example, to mobile travel guides and geomarketing. To that end, the work undertaken in the project has focused on designing and developing a set of services and tools that cover different aspects of this processing workflow, from data collection and integration to content authoring and publishing. The main parts of the work have focused on the four dimensions described below.

The first step, which sets the foundation for the whole approach, has been to collect, integrate and analyse user-generated geospatial content from Web sources. For this purpose, a set of popular Web sources has been identified and selected. In particular, we have selected sources that provide data about Points of Interest (Wikipedia, OpenStreetMap, Wikimapia, Foursquare), photos (Flickr, Panoramio), events (Eventful, Last.fm) and text messages (Twitter). To retrieve data from these sources, respective clients were developed, taking into consideration the corresponding access methods (e.g. RESTful APIs, SPARQL endpoints) and the imposed restrictions (e.g. number of queries, size of returned results). For storing and managing the data, a relational data model and schema was used. For the content store, we used PostgreSQL, with the PostGIS extension for geospatial support. Subsequently, the raw content collected from these Web sources undergoes an integration and analysis process, in order to address the problems of heterogeneity and low quality. First, to deal with semantic heterogeneity across sources, we specified a common category classification and we implemented a method to automatically map categories used in the original sources to this common category hierarchy. Second, we implemented an entity matching algorithm to identify matching entities in various sources, in order to avoid duplicates as well as to combine partial information for obtaining more complete entity representations. For this purpose, we combined semantic matching with the criterion of spatial proximity to increase the accuracy of the matching. Finally, a clustering process was implemented that groups entities together (by type, category and proximity), thus finding high-density regions of interest and obtaining aggregated results to deal with noise and outliers. After this process of integration and analysis, the results are indexed by a Solr server to facilitate search and exploration.

The second step has been to extend this process toward a novel approach following a map-reduce-like architecture and the use of browser-based computation. For this purpose, the work focused on two main aspects. The first aspect was to implement the tasks for data collection and clustering in JavaScript in order to allow their execution on the client side, i.e. by the users' browsers. The second aspect was to parallelize the execution of these processes, transforming the original centralized workflow into one where the server is responsible for splitting the overall task into smaller tasks, then assigning them to available clients (i.e. user browsers) for execution, and finally collecting and integrating the results. The benefit of this new approach is that it allows moving some of the more time-consuming parts of the process from the server to the clients, parallelizing the work to be done.

In parallel, a third dimension of the work involved designing and implementing a set of authoring tools and services. In particular, the goal was to support users in the process of creating spatio-textual content. This included two main parts. The first part comprised a geocoding engine, through which textual documents can be parsed to identify and indicate relevant locations on the map. The second part comprised a Web application through which a user can annotate Points of Interest, as well as create collections of them (e.g. to describe a trip in a travel guide).

Finally, the fourth part of the work built upon the previous ones to design and develop an API and a mobile application allowing GEOSTREAM users to manage and create geospatial and multimedia content. This includes services for finding available content based on the user's location (automatically retrieved via the mobile device's GPS), annotating this content with further information (e.g. textual descriptions, tags, images) and organizing it in (potentially nested) collections. The provided services also support sharing content with other users and receiving notifications based on the user's profile.

In addition to the technical developments in the project, as described above, the work also focused on the following aspects: (a) investigation of terms of use and licensing issues related to the third-party content collected from the selected Web sources; (b) conducting user surveys to evaluate the developed components and to collect feedback for further improvements; (c) dissemination activities, to communicate the project's results to potential users and other interested parties, both in the scientific community (via publications at international conferences) and in industry (via participation and presentations in exhibitions and fairs); (d) exploitation plans, by identifying and specifying the possibilities for reusing and integrating the components and services developed in the project into commercial products and services of the SME partners.
Project Context and Objectives:
The advent of Web 2.0 has placed individuals, their information and their social connections to the centre of the stage. The Web has evolved from an academic and business tool to one of the prime means for casual everyday people to network, communicate, publish and exchange all sorts of different data, and collaborate on projects and activities. Thus, in addition to professionally-produced material, the public has also been allowed and encouraged to make its content available online. The ever-increasing popularity of open online communities and social-networking sites has meant a dramatic drop in the barrier-to-entry for casual computer users to generate and upload their content on the Web.

The volumes of such User-Generated Content (UGC) are already staggering and constantly growing. For instance, Wikipedia includes, among others, information about thousands of places and Points of Interest (POIs). Inspired by its success, OpenStreetMap and Wikimapia have emerged as collaborative mapping projects, both having a user base that exceeds a million users, resulting in a large pool of crowdsourced geospatial data. An abundance of geospatial entities can also be obtained from various other services (e.g. Google Places), from user check-ins in social networks (e.g. Foursquare), from Web sites providing information about events (e.g. Eventful), etc. This new wealth of sources and content opens up new opportunities and challenges for improving, enriching and enhancing applications and services in the geospatial domain, such as location-based services, trip planning or geomarketing.

Acknowledging this geospatial data tsunami, one of the main challenges that need to be addressed is how to utilize such content to provide useful information and knowledge in geospatial services and applications. This is especially crucial for SMEs providing products and services in relevant sectors, in order to maintain and further strengthen their competitive standing. However, no clear technological solutions are available that allow for straightforward exploitation. Below, we list the main problems and challenges to be faced:

- Content is authored by largely untrained individuals and their actions are almost always voluntary; hence, results may not be accurate. Thus, methods are needed to assign quality measures for examined content, align datasets with each other, and combine many observations to derive more accurate information.

- A myriad of Web applications producing geospatial data exists today, each having its distinct data model and means to access the data, making it difficult to combine data from multiple sources. Comparing and integrating data from various sources requires the creation of corresponding tools. Metadata and schemas describing each source play an essential role in this.

- When assessing the quality of datasets, many tasks are rather labour intensive, especially when dealing with user-generated content. Consequently, with the focus of the project being on user-generated content, the user should be engaged not only in content creation, but also in its quality assessment.

- User-generated content may never replace quality data sources but may be used to complement them. The most important value of user-generated content may lie in what it can tell about local activities in various geographic locations that go unnoticed by the world's media. It is in that area that user-generated content may offer the most interesting, lasting, and compelling value to businesses. This requires the development of techniques for the integration of user-generated content with high-quality data sources. Aspects to be considered include provenance of the data and metadata that could allow reasoning about quality.

- Despite the huge amounts of available data, there still remain areas and applications where further -or, often, periodic- geospatial data collection is required. The SMEs could use the Web 2.0 trend to engage the public or specific user groups in the collection of specific content. For example, when collecting experiences during a trip, it would be helpful to record such information in a structured way and in relation to an existing travel guide. Web-based tools need to be developed for this purpose, with emphasis on being simple to use in order to engage the broadest possible audience in this process.

- Often, collecting content has the implicit intent of disseminating this content again in aggregated form. In this sense, content collection should go hand in hand with publication. Taking the travel guide example, this could result in content being published in real time by the users for the users. In this regard, the explosion of smart phone usage provides an adequate data collection and distribution means.

Facing these challenges, the main objectives set for the GEOSTREAM project include the development of techniques and tools for managing user-generated geospatial content, supporting users in authoring such content, and using this content for providing related services. More specifically, the main objectives of the project are as follows:

- Collection, integration and mining of crowdsourced geospatial content: the purpose of this part of the work is to lay the foundation by designing and implementing the content store and the components for collecting user-generated geospatial data from Web sources, and subsequently providing the basic functionalities for integration and analysis of this content.

- Web computing framework: the objective of this part of the work is to extend the work done for data collection and integration toward a novel approach that will be based on a map-reduce-like architecture and on browser-based computation, making it possible to parallelize the tasks and to transfer parts of the computation from the server to the clients.

- Authoring tools: this objective concerns the design and development of a set of tools and services for facilitating users in authoring geospatial content. The developed tools include a geocoding service as a main part, and a Web application where users can create rich spatio-textual content.

- Mobile computing framework: this objective refers to the design and development of an API and mobile application for providing functionalities that allow mobile users (e.g. an author on a trip or a tourist) to create geospatial and multimedia content, combining it with the user-generated content collected and analysed from other Web sources.

Pursuing the objectives outlined above, the work in GEOSTREAM has materialized the following concrete results:

- A compiler for user-generated geospatial content that supports data collection, fusion and integration of large amounts of data from various available sources in order to produce integrated datasets, assess their quality, and relate them to existing data from other more authoritative sources (e.g. governmental, commercial). The developed tools and services are meant to provide an easy-to-use technology that produces good quality datasets that can be used in various existing commercial products of the SMEs involved in the project.

- A Web computing framework that allows any Web user to be involved in the process of compiling user-generated geospatial content. Utilizing browser-based computing and an adapted map-reduce approach, workloads for compiling user-generated geospatial content are sent to browsers for execution at the client-side. This provides a means to add computing power to data compilation tasks, helping the SMEs to collect and analyse content faster and easier.

- Web-based authoring tools for generating rich geospatial content: this covers the requirement of SMEs to develop specific tools that will allow their customers to provide feedback in terms of geospatial data, i.e. to relate rich content (text, audio and images) to locations. For this purpose, a simple-to-use Web interface has been developed that can be used by as many (non-expert) users as possible in order to collect as much relevant geospatial content as possible. The data collected in this way can subsequently be used for user-generated travel guides or other related products and services.

- A live mobile guide platform for the collection and delivery of content and services in mobile phones: this platform relies on the results of the content collection and analysis workflow, combining and extending it with a content annotation and authoring environment running on a mobile computing context. This mobile computing platform serves as the basis of future mobile applications that can be developed by the SMEs to further handle and exploit this rich geospatial content.



Project Results:
The work in the GEOSTREAM project has materialized in four concrete results, which are described below.

I. Compiler for user-generated geospatial content

This module provides the foundation for collecting, integrating and analysing user-generated geospatial content from the Web, which is subsequently used to enhance and enrich other applications, potentially combined with content generated by the users of those applications themselves. Specifically, this module addresses the following aspects: (a) clients for accessing and retrieving content from selected Web sources; (b) a common category hierarchy for classifying the collected content, and a process for semi-automatically classifying incoming data to those classes; (c) an entity deduplication process for detecting multiple records of a Point of Interest found in different sources; (d) density-based clustering for mining regions of interest. Next, we present each of these functionalities in more detail.

In the context of the project, we are interested in four types of (geo-)content:
- Points of Interest (POIs), which include places that users would like to visit (e.g. museums, monuments) or use services from (e.g. restaurants, shops, banks, train stations);
- photos, which allow users to obtain visual information about a location;
- events, e.g. concerts, exhibitions, which users may be interested to attend;
- text feeds, in particular, Twitter messages, which may provide further information about a location (e.g. topics, sentiments, events).

For this purpose, we have selected a set of Web sources, and we have developed clients for obtaining such crowdsourced information from them. In particular, we have considered sources that are very popular and widely used, and which provide such types of content, making it accessible in an explicit manner (e.g. via a link to download data, a query interface, or an API). Specifically, the following sources have been used in the project: DBpedia, OpenStreetMap, Wikimapia, Foursquare, Google Places (for POIs); Flickr, Panoramio (for photos); Eventful, Last.fm (for events); Twitter (for text messages). All the selected sources provide a structured means to retrieve content, in particular either via a SPARQL query endpoint (i.e. querying RDF data) or, in most cases, via a REST API. The retrieved content is typically provided in XML and/or JSON format. For each of these sources, a client was developed to collect data for given locations. An important issue to consider when developing these clients is to respect the limitations that each source imposes, which basically take two forms: (a) imposing an upper limit on the number of requests allowed for a given time period and/or (b) imposing an upper limit on the number of results retrieved for each request.

A further issue that increases the complexity of the retrieval process is the different names and types of input and output parameters supported by each source and API, although there are also some basic common ones. The main input parameters involve:
- Location: In all cases, it is possible to specify the search area of interest as a query parameter for filtering data. Typically, this is done by specifying the corners of a bounding box (i.e. min/max values of latitude and longitude). An exception is Google Places, in which a point and a search radius are used.
- Type: Some sources allow filtering retrieved data by their type, i.e. to specify a category or keyword as query parameter.
- Date: Some sources allow retrieving data within specified time periods.
- Language: Some sources allow restricting the retrieval process to data in a particular language.
- Format: Some sources allow specifying the desired format of the response (e.g. xml, json).
- Page number/size: Typically, sources use paging to limit the number of results returned per request. In such cases, it is possible (up to some limit) to use these parameters to retrieve results from subsequent pages beyond the first one.
- API key: Commercial sources providing access to their data through an API require first that the user registers to acquire an API key, which is used then as a parameter in all API method calls. This allows the data providers to monitor the retrieval process, enforce limitations to data access, compute statistics, etc. When registering, the user needs to provide information such as the type of his/her organization and purpose of use (academic, commercial, etc.).

Similarly, the information returned in the response includes:
- Label: A label of the retrieved entity.
- Type: One or more categories the retrieved entity belongs to.
- Description: A short text providing summary information for the retrieved entity.
- URL: A link to the Wikipedia entry for this entity or to the official page.
- Location: The location of the entity specified as latitude and longitude.
- Address: Address of the entity in the form of country, city, street, postal code, etc.
- Tags: Tags associated with the entity.
- Likes/ratings/comments: User ratings and comments for the entity.
- Photos: URL(s) of image(s) associated with the retrieved entity. In some cases, photos are also provided in different resolutions. A thumbnail may also be provided.
- Date: The date that e.g. a photo was taken or uploaded, or that an event takes place.
- Opening hours: Opening hours that may apply for the entity.
Moreover, each source assigns an internal, unique ID to each entry, which is included in the response. We store and use these IDs, in conjunction with the source ID as prefix, to uniquely identify the retrieved entities. Availability of information for these attributes varies significantly, both across sources and within each source for different entities.
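To make the retrieval workflow more concrete, the following JavaScript sketch outlines a generic paged client for a hypothetical bounding-box-based REST source; the endpoint URL, parameter names, limits and delay are illustrative assumptions and do not correspond to any specific source listed above.

```javascript
// Sketch of a generic client for a bounding-box-based REST source.
// Endpoint, parameter names and limits are hypothetical placeholders.
const SOURCE_ID = "examplesource";
const ENDPOINT = "https://api.example.com/pois";   // placeholder URL
const PAGE_SIZE = 100;                             // per-request result limit
const MAX_PAGES = 10;                              // paging limit imposed by the source
const REQUEST_DELAY_MS = 1000;                     // crude rate limiting between requests

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function collect(bbox, apiKey) {
  const entities = [];
  for (let page = 1; page <= MAX_PAGES; page++) {
    const params = new URLSearchParams({
      minLat: String(bbox.minLat), minLon: String(bbox.minLon),
      maxLat: String(bbox.maxLat), maxLon: String(bbox.maxLon),
      page: String(page), pageSize: String(PAGE_SIZE),
      format: "json", key: apiKey,
    });
    const response = await fetch(`${ENDPOINT}?${params}`);
    if (!response.ok) break;                       // stop on errors or quota exhaustion
    const items = (await response.json()).results ?? [];
    for (const item of items) {
      entities.push({
        id: `${SOURCE_ID}:${item.id}`,             // source ID used as prefix for uniqueness
        label: item.name,
        categories: item.categories ?? [],
        lat: item.lat, lon: item.lon,
        raw: item,                                 // keep the original record as well
      });
    }
    if (items.length < PAGE_SIZE) break;           // last page reached
    await sleep(REQUEST_DELAY_MS);                 // respect the source's request limits
  }
  return entities;
}
```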

Typically, one of the main attributes used in virtually all sources to describe an entity is its type(s) or category(-ies), which indicates the nature or usage of the entity, e.g. museum, restaurant, shopping mall, metro station. This information is very important for allowing users to search and browse the available content. However, there is no single, common ontology or classification scheme used across all sources. Instead, each source typically employs its own classification or, as is frequent practice in crowdsourced content, does not strictly specify a predefined set of hierarchically organized categories but rather allows users to freely categorize resources, e.g. by assigning tags to them, using either predefined classes, classes created by other users, or classes they introduce themselves. Although this practice makes it easier, simpler and faster for users to generate and contribute content, it results in very high diversity and heterogeneity among the various data sources, or even within each single source. Therefore, before data collected from different sources can be processed and analysed, a reconciliation mechanism is required to map these various taxonomies to a common classification scheme. To address this issue, we have defined a category hierarchy using as a basis the taxonomy used by Foursquare, which we found to have a similar focus to our needs, but with some modifications to make it easier to accommodate other sources as well. Further, we have limited the maximum depth to three levels (e.g. Athletics Sports -> Gym Fitness -> Yoga Studio) in order to make it simpler and easier to use. Although this category hierarchy is used for further integration and analysis, as well as later on for faceted browsing to explore the available content, we do not replace the original categories or tags assigned to a resource, but keep them alongside the mapped categories so that they remain available to the user when searching. The top-level classes of the GEOSTREAM category hierarchy are outlined below:
- Professional: includes professional points of interest, such as factories, companies, breweries, etc.
- Entertainment: includes all types of points of interest for entertainment, from cafés to nightlife spots, theatres, casinos, etc.
- Food: includes any type of cuisines & restaurants.
- Services: includes any type of services such as taxis, car rental, emergency services, banks/ATMs, etc.
- Religion: includes any type of religious places, such as churches, places of worship, cemeteries etc.
- Culture: includes historic monuments, museums, etc.
- Athletics and Sports: includes stadiums, gyms, etc.
- Travel and Transport: includes transportation points such as bus stops, metro stations, etc., as well as places for accommodation.
- Shops: includes any type of shops.
- Places: includes points of interest such as parks, lakes, fountains, etc.
- Education: includes schools, universities, colleges, etc.

Having specified a common category hierarchy for use in GEOSTREAM, the next step was to compute mappings from the categories used by each source to this common classification. This task, i.e. schema/ontology mapping, is one of the most fundamental and recurring problems in many areas, such as resource discovery, data integration, data migration, query translation, peer to peer networks, agent communication, and has been the focus of a lot of research over many years. The underlying techniques include tokenization, semantic matching based on dictionaries that provide synonyms and hypernyms, string similarities, etc. Thus, we have leveraged such techniques to implement a service for semi-automatically classifying incoming data. The system matches each incoming POI to one or more categories in the internal schema, specifying a degree of match; then, the administrator can view and edit (i.e. accept, reject or modify) the results via a Web user interface.
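As a rough illustration of the underlying idea (not the actual implementation), the following JavaScript sketch scores a source category against the common hierarchy using tokenization, a small synonym table and token overlap; the synonym table and threshold are assumptions made for the example.

```javascript
// Sketch: score a source category against the common GEOSTREAM categories
// using tokenization, a small synonym table and token overlap (Jaccard).
const SYNONYMS = { cafe: "coffee", cinema: "movie", pub: "bar" };   // illustrative only

function tokens(label) {
  return label
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0)
    .map((t) => SYNONYMS[t] ?? t);
}

function jaccard(a, b) {
  const sa = new Set(a), sb = new Set(b);
  const inter = [...sa].filter((t) => sb.has(t)).length;
  const union = new Set([...sa, ...sb]).size;
  return union === 0 ? 0 : inter / union;
}

// Returns candidate mappings above a threshold, to be reviewed by an administrator.
function mapCategory(sourceCategory, commonCategories, threshold = 0.5) {
  const src = tokens(sourceCategory);
  return commonCategories
    .map((c) => ({ category: c, score: jaccard(src, tokens(c)) }))
    .filter((m) => m.score >= threshold)
    .sort((a, b) => b.score - a.score);
}

// Example: a source category "Coffee Shop" scored against a few common classes.
console.log(mapCategory("Coffee Shop", ["Food", "Entertainment", "Coffee Shop", "Shops"]));
```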

The next step in the process of integrating and reconciling data collected from the different sources is to match entities across them. This problem arises from the fact that often the same entities, especially popular POIs, appear in many sources, perhaps with different, complementary or sometimes even conflicting representations. Similar to the problem of category mappings addressed in the previous section, this is again an important, recurring and widely studied problem in the area of information integration, appearing in the literature under various terms, such as entity matching, record linkage, data deduplication, etc. Finding matching entities across sources, or even within the same source, is important for various reasons. First, it is often the case that different sources provide complementary information, thus identifying matching entities allows to build a richer and more complete entity profile (e.g. finding photos from one source, comments and ratings from another, etc.). Second, it can help to detect, and possibly even correct, errors, in the case that conflicting information for the same entity is found in various sources. Automatic correction can be achieved when fusion rules are available, e.g. based on an assessment of the quality and trustworthiness of each source, such as "always prefer values obtained from Wikipedia over other sources", or based on a voting scheme, e.g. keep the value provided by most sources (if found in more than two). Third, having identified duplicates is useful when users are searching and browsing the available content, since allowing duplicates in the search results can easily deteriorate the user experience and cause frustration. The challenge here, as in the case of reconciling different category hierarchies, arises from the fact that there are no unique, common identifiers for the entities to be used by all sources. Moreover, the name of an entity may appear with slight variations, or even misspellings, in different sources. Thus, the typical approach is to define some measure of similarity between entities, and then consider as matches the cases where this similarity exceeds a specified threshold. In our case, since we are dealing with geospatial entities, we take into consideration for the matching process two attributes: (a) the location of an entity and (b) its name/label. More specifically, two entities are considered to match if: (a) their distance is below a specified threshold, and (b) their name similarity is above a specified threshold.
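A minimal JavaScript sketch of this matching rule is given below, using the haversine distance for spatial proximity and a normalized Levenshtein similarity for the names; the concrete thresholds are illustrative assumptions rather than the values tuned in the project.

```javascript
// Two entities match if they are spatially close AND their names are similar.
function haversineMeters(a, b) {
  const R = 6371000, rad = Math.PI / 180;
  const dLat = (b.lat - a.lat) * rad, dLon = (b.lon - a.lon) * rad;
  const h = Math.sin(dLat / 2) ** 2 +
            Math.cos(a.lat * rad) * Math.cos(b.lat * rad) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(h));
}

function levenshtein(s, t) {
  const m = s.length, n = t.length;
  const d = Array.from({ length: m + 1 }, (_, i) => [i, ...Array(n).fill(0)]);
  for (let j = 0; j <= n; j++) d[0][j] = j;
  for (let i = 1; i <= m; i++)
    for (let j = 1; j <= n; j++)
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                         d[i - 1][j - 1] + (s[i - 1] === t[j - 1] ? 0 : 1));
  return d[m][n];
}

function nameSimilarity(a, b) {
  const s = a.toLowerCase(), t = b.toLowerCase();
  const maxLen = Math.max(s.length, t.length);
  return maxLen === 0 ? 1 : 1 - levenshtein(s, t) / maxLen;
}

// Thresholds (100 m, 0.8) are placeholders for illustration.
function isMatch(e1, e2, maxDistM = 100, minNameSim = 0.8) {
  return haversineMeters(e1, e2) <= maxDistM &&
         nameSimilarity(e1.label, e2.label) >= minNameSim;
}
```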

Content provided by users in a volunteered manner may often be of low accuracy and quality, e.g. containing misspellings or incomplete or erroneous information. However, although this holds for individual pieces of information, the vast volumes of data and the fact that the same or similar information can be found in various sources and/or by various users make it possible to derive meaningful, high-quality information by aggregating the data and considering statistical results. Another motivation for aggregating content is the increasing interest in moving from the notion of Points of Interest (POIs) to Regions of Interest (ROIs). This is driven by the fact that users often prefer areas with a high density of POIs in order to have many alternatives to pick from and to be able to visit multiple places within a shorter period of time. In our case, the main underlying technique for achieving this task of identifying ROIs is clustering. Clustering groups together items that are similar according to some specified notion of similarity. It comprises one of the most widely used techniques in statistical data analysis. Several clustering techniques and algorithms have been developed over the years. Among the most typical ones are: (a) centroid-based clustering, where groups of objects are created around central points, and (b) density-based clustering, where clusters are defined as areas of higher density. For our purposes, we have used density-based clustering, since the intuition is to identify areas with a high concentration of particular entities of interest. More specifically, the clustering algorithm we have implemented and used is based on the well-known DBSCAN algorithm with some adaptations and modifications.
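The following JavaScript sketch shows a plain DBSCAN over a set of POIs, using a simple planar distance for brevity; the project's implementation adapts and modifies the algorithm (e.g. for geographic distances and aggregation), which is not reflected here.

```javascript
// Minimal DBSCAN sketch: returns a cluster index per point, -1 for noise/outliers.
function dbscan(points, eps, minPts) {
  const labels = new Array(points.length).fill(undefined); // undefined = unvisited
  const NOISE = -1;
  let clusterId = 0;

  const dist = (a, b) => Math.hypot(a.lat - b.lat, a.lon - b.lon); // planar, for brevity
  const neighbors = (i) =>
    points.map((_, j) => j).filter((j) => j !== i && dist(points[i], points[j]) <= eps);

  for (let i = 0; i < points.length; i++) {
    if (labels[i] !== undefined) continue;
    const seeds = neighbors(i);
    if (seeds.length + 1 < minPts) { labels[i] = NOISE; continue; } // not a core point
    labels[i] = clusterId;
    const queue = [...seeds];
    while (queue.length > 0) {
      const q = queue.shift();
      if (labels[q] === NOISE) labels[q] = clusterId;       // border point joins the cluster
      if (labels[q] !== undefined) continue;                // already assigned elsewhere
      labels[q] = clusterId;
      const qNeighbors = neighbors(q);
      if (qNeighbors.length + 1 >= minPts) queue.push(...qNeighbors); // core point: expand
    }
    clusterId++;
  }
  return labels;
}
```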

II. Web computing platform

In this part, we have developed a framework that utilizes browser-based computing for the compilation of user-generated geospatial content. More specifically, we have designed and implemented a novel application of the map-reduce computation concept in connection with browser computing to accomplish mining and integration of user-generated content by means of Web site visitors who contribute their "expertise" and computing power (browser) to perform this task. The design of our architecture is inspired by the well-known map-reduce computing framework. Borrowing from this idea, we implemented a system that can split large jobs into smaller tasks, offering an API for clients to contribute to individual tasks. Given that most of the tasks are spatially bound, we have an inherent and intuitive way to split jobs into smaller tasks, by applying a grid mask to the area we are interested in, and creating tasks, each of which handles a unique grid cell.

The motivation behind this approach is to build a scalable system that can grow its throughput alongside its user-base and, therefore, its needs. To accomplish this, we can assign the more time-consuming parts of the system to clients in order to offload the server and promote parallelization. Hence, the aim was to implement a pluggable architecture that includes a master server that can coordinate long running processes and delegate tasks to modular workers that know how to handle each job. The number of workers can be adjusted according to the load of the system and should in most cases provide near linear scalability.

One of the biggest advantages of this approach is that there is no need to explicitly write parallel code. All that is required is to build a task assignment system that will split a job into smaller tasks, and to make sure that the individual tasks depend on one another as little as possible. This last condition is crucial for achieving better performance when increasing the number of workers, since new workers will not have to wait for previous ones to finish their job before they can start processing their respective task.

We have identified two main areas where our system can benefit from this approach: (a) the online collection of available user-contributed data from the various external providers, and (b) the clustering of the collected data to identify regions of interest, i.e. regions with a high density of POIs of the same category.

The online collection process is not especially computationally expensive, since the main task is just to contact all the external sources, provide the correct parameters and download the available data. However, due to the large number of network requests involved, the process can be extremely time consuming. Furthermore, writing parallel code to simultaneously access all these sources is not trivial. As described earlier, our approach removes the need to write parallel code, since, if needed, we can simply increase the number of running workers to achieve higher throughput.

The process of detecting regions of interest is more computationally expensive. Although this process can also be parallelized by dividing the original area into smaller parts, there exists a tight coupling between the outputs of the smaller tasks, as the clusters of each individual cell eventually have to be merged. This therefore provides a good case for benchmarking our approach in scenarios where the individual tasks are not completely independent.

The main goal was to design a generic mechanism that, given an area and a task, will initially divide the area into smaller regions and then assign the task and each of the small regions to available workers for processing. This mechanism is exposed via a programmatic Ruby API and an HTTP endpoint so that it can coordinate not only standard Ruby scripts (for example our existing download clients for Flickr, Panoramio, etc.) but also workers implemented in any kind of programming language or platform. To demonstrate this, we have re-implemented our download clients in Ruby to be conformant with the new API and we implemented from scratch the same clients in JavaScript. This allows both clients to run simultaneously and in as many "copies" as needed. For example, for a given area, we can initiate two Ruby Panoramio clients, and open three browsers to load the JavaScript Panoramio clients, resulting in having a total of five clients collecting Panoramio data for the area at the same time.
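To illustrate how a browser-based worker interacts with the coordinating server, the sketch below shows a simple JavaScript polling loop; the endpoint paths, task fields and the runTask placeholder are assumptions made for illustration, not the actual GEOSTREAM API.

```javascript
// Placeholder dispatcher: in the real system this would invoke e.g. the JavaScript
// data collection clients or the JavaScript DBSCAN implementation, depending on task.type.
async function runTask(task) {
  return { taskId: task.id, status: "done" };
}

// Worker loop: pull the next task from the coordinator, run it client-side, post the result.
async function workerLoop(serverUrl) {
  while (true) {
    const res = await fetch(`${serverUrl}/tasks/next`);     // request the next pending task
    if (res.status === 204) break;                          // queue is empty, stop
    const task = await res.json();                          // e.g. { id, type, bbox }
    const result = await runTask(task);
    await fetch(`${serverUrl}/tasks/${task.id}/result`, {   // return the result to the coordinator
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(result),
    });
  }
}
```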

The data collection process uses a top-down approach. As the data distribution is not known beforehand, the splitting of an area into smaller cells has to be done as new data is downloaded. The process begins by adding the original bounding box of the area that we want to collect data from into a queue. As long as there are items in the queue, workers can dequeue bounding boxes from it for processing. Processing starts by checking the number of results already collected inside this bounding box (for example from previous runs of the search or from queries for bounding boxes that enclosed this one, i.e. ancestors). If the number of existing results inside the examined bounding box is greater than or equal to a given threshold, then we immediately split it into quadrants. Otherwise, we start querying the external source for new data within this bounding box, by making requests that will download at most a maximum number of results at a time, until reaching the limit imposed by the source or retrieving all available data. If we manage to collect as many results as the upper limit, then it is likely that more results are available, so this bounding box is marked accordingly in order to be then split into four quadrants that are, in turn, enqueued for processing.
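The following JavaScript sketch illustrates this top-down splitting loop; the thresholds are placeholders, and the download parameter stands in for a paged source client such as the one sketched earlier.

```javascript
// Split a bounding box into its four quadrants.
function quadrants(b) {
  const midLat = (b.minLat + b.maxLat) / 2, midLon = (b.minLon + b.maxLon) / 2;
  return [
    { minLat: b.minLat, minLon: b.minLon, maxLat: midLat,   maxLon: midLon },
    { minLat: b.minLat, minLon: midLon,   maxLat: midLat,   maxLon: b.maxLon },
    { minLat: midLat,   minLon: b.minLon, maxLat: b.maxLat, maxLon: midLon },
    { minLat: midLat,   minLon: midLon,   maxLat: b.maxLat, maxLon: b.maxLon },
  ];
}

function contains(bbox, e) {
  return e.lat >= bbox.minLat && e.lat <= bbox.maxLat &&
         e.lon >= bbox.minLon && e.lon <= bbox.maxLon;
}

// Top-down collection: dense or saturated boxes are split and re-enqueued.
// `download(bbox)` is a paged source client; thresholds are illustrative.
async function collectArea(rootBbox, download, { existingThreshold = 500, sourceLimit = 1000 } = {}) {
  const queue = [rootBbox];
  const collected = [];
  while (queue.length > 0) {
    const bbox = queue.shift();
    // Results already known inside this box (simplification of "previous runs or ancestors").
    const existing = collected.filter((e) => contains(bbox, e)).length;
    if (existing >= existingThreshold) {
      queue.push(...quadrants(bbox));              // dense area: split immediately
      continue;
    }
    const results = await download(bbox);          // paged download from the external source
    collected.push(...results);
    if (results.length >= sourceLimit) {
      queue.push(...quadrants(bbox));              // source limit hit: likely more data, split
    }
  }
  return collected;
}
```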

We also implemented a JavaScript version of the DBSCAN clustering algorithm to identify regions of interest in a given area. Here, the main concept is to split a given area either iteratively or recursively into quadrants, and then run the algorithm within these smaller cells. The clustering procedure, as opposed to the data collection, follows a bottom-up approach. In this case, the data are known a priori, so the entire area can be split based on some criteria before starting the assignments to the workers. The only parameter required for the split is the total number of POIs enclosed within a grid cell. Given this number, a recursive operation begins that checks if the current bounding box contains more POIs than this threshold. If it does, it is split into four quadrants and the recursion continues. Once this process is done, we start assigning cells to workers. The grid cells can be viewed as a quad-tree with the cell corresponding to the entire area as the root and the smallest cells as leaves. In this scheme, only leaf cells need to be assigned to workers for clustering. Intermediate nodes only need to run a merge procedure once all their children are complete. The merge procedure can be run by the scheduling system or assigned to a worker. In our implementation we decided to leave the merging to the main infrastructure, i.e. the scheduling system.
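A compact JavaScript sketch of this bottom-up set-up is given below; quadrants() and contains() are as in the previous sketch, clusterLeaf stands for the worker-side clustering (e.g. the DBSCAN sketch above), and mergeClusters is a placeholder for the merge procedure joining clusters across cell boundaries.

```javascript
// Build a quad-tree over the area until each leaf holds at most maxPoints POIs.
// quadrants() and contains() are defined in the previous sketch.
function buildQuadTree(bbox, points, maxPoints) {
  const inside = points.filter((p) => contains(bbox, p));
  if (inside.length <= maxPoints) {
    return { bbox, points: inside, children: null };        // leaf: handed to a worker
  }
  return {
    bbox,
    points: null,
    children: quadrants(bbox).map((q) => buildQuadTree(q, inside, maxPoints)),
  };
}

// Collect the leaf cells that will be assigned to workers.
function leaves(node) {
  return node.children ? node.children.flatMap(leaves) : [node];
}

// Cluster each leaf and merge results bottom-up at the intermediate nodes.
function clusterTree(node, clusterLeaf, mergeClusters) {
  if (!node.children) return clusterLeaf(node.points);
  return node.children
    .map((c) => clusterTree(c, clusterLeaf, mergeClusters))
    .reduce((acc, cur) => mergeClusters(acc, cur));
}
```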

A particularity in our approach is the use of a browser as a computing platform. Data collection and computation tasks are delegated to the connected search clients (browsers) and the server only coordinates the process. As a side effect, wherever data retrieval from external sources is needed, no actual data is downloaded directly by the server, but all traffic is handled by the clients, effectively balancing the data over the network.

III. Rich authoring tool

In contrast to the previous two parts, which deal with collecting, integrating and analysing already existing geospatial content made available by users on various Web sources, the goal of this part is to explore and address the opposite direction: how to enable and support users in generating such content in the first place. In particular, we have pursued two main directions. The first was to provide a "rich" authoring tool, that is, a Web-based application that combines a map interface and word processing capabilities to allow users to create and edit structured spatio-textual content in an easy way. This requires some modest effort from the user to edit the content, e.g. by filling in forms and using drag-and-drop actions, but the created content is richer in terms of structure and semantics and hence can be more easily and accurately organized, searched and further exploited. The second direction was a "light" authoring tool, where the basic idea is to minimize, or even completely eliminate, the user's involvement, by starting with a text document that is already available and trying to automatically identify locations described in it. Again, this is combined with a map interface, where automatically computed results are presented to the user in the form of suggestions for validation or selection. An essential part of this tool is a geocoding service.

Eventually, both modules were combined in an integrated Web application, so that users have a single point of access for content authoring, and can select, based on their needs and use cases, which steps of the process to perform. In addition to the Web user interface, we have provided an API that can be used to search, create and edit content from other applications, so that the implemented functionalities can be extended or customized to fit the exact needs of the SME partners.

In GEOSTREAM, as well as in many other similar applications or research work in the literature, the main functionalities and services revolve most often around the notion of Points of Interest (POIs), such as monuments, restaurants, shops, train stations, museums, parks, etc. Numerous lists of such POIs exist, both free and commercial. However, these typically include some basic, "objective" (in the sense of not being user-dependent) information about a POI, such as its location, name, category, creation date, etc. What can add valuable information to these descriptions is augmenting them with content denoting user perspectives, experiences and opinions about these POIs, such as ratings, comments, recommendations about when to visit or how to get there, etc. Hence, our goal in designing and implementing the rich authoring environment was to allow users to augment POI descriptions with their own information. For this purpose, and given the concept of mobile travel guides as the main motivating example in GEOSTREAM, we used the concept "visit" to represent an augmented description of a POI from the perspective of a particular user. Furthermore, POIs may be related and appear together in a collection under some criterion, e.g. the POIs that a traveller visited during her trip. To model this aspect, we use the concept of "trip", which constitutes a (potentially ordered) set of visits to particular POIs. Consequently, the data model for the rich authoring environment comprises four main entities: users, POIs, visits, and trips. Nevertheless, the design and functionalities of the tools are not limited to this particular use case of mobile travel guides but can be adapted in a straightforward way to similar applications that manage collections of user-augmented location descriptions.
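For illustration, the data model can be pictured with plain JavaScript objects as below; the field names are assumptions chosen for the example, not the exact schema used in the project.

```javascript
// Illustrative shape of the four main entities: users, POIs, visits, and trips.
const user  = { id: "u1", name: "Alice" };
const poi   = { id: "osm:123", label: "Acropolis Museum", category: "Culture",
                lat: 37.9684, lon: 23.7285 };
// A "visit" augments a POI with one user's perspective.
const visit = { id: "v1", userId: user.id, poiId: poi.id,
                rating: 5, comment: "Go early in the morning to avoid queues." };
// A "trip" is a (potentially ordered) collection of visits.
const trip  = { id: "t1", userId: user.id, title: "A weekend in Athens",
                visitIds: [visit.id] };
```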

Furthermore, we also considered an alternative method for obtaining user content that requires minimal involvement by the user, instead extracting geospatial information automatically from text documents. This approach is aimed, for example, at authors writing a travel guide using some word processing software or travellers writing on a blog to express and share their experiences from their trip. However, as we wish to view these functionalities as complementary rather than as alternative, mutually exclusive ways for a user to create content, we have combined both parts in a single workflow and Web application, where the user can choose which steps to perform.

In a nutshell, the process is as follows. The user starts with an existing text document that contains mentions of one or more locations (among other things). This is given as input to the application, which automatically identifies and extracts the mentioned locations, mapping them to POIs found in the GEOSTREAM database (or other external sources that have been indicated). Then, instead of simply annotating the original document with a specific markup indicating the geocoded words or phrases, it uses these extracted POIs as input for the rich authoring tool, i.e. allowing the user, in the next stage, to add more information regarding them. The central processing component of the light authoring environment is the geocoding engine, which identifies locations in text documents that appear in the GEOSTREAM database (or other external data sources configured for use). The geocoding engine employs machine learning to perform its task, i.e. to detect and geocode spatial entities in natural language text. More specifically, it includes a sentence detector, a tokenizer, a location finder, a parts-of-speech (POS) tagger, a chunker, and a parser. For its implementation, the Apache OpenNLP Java library has been used.

The geocoding engine processes natural language text input by the user by performing the following main operations (a simplified illustrative sketch follows the list):
1. Sentence detection: identifies the sentence boundaries of the input text by determining the punctuation characters that mark the end of a sentence.
2. Sentence splitting: each sentence is further subdivided into words (tokens) using a tokenizer specific for the particular natural language.
3. Tagging: each token of a sentence is tagged with part-of-speech tags.
4. Spatial entity detection: entity detection is performed and locations are extracted by the “name finder” component, which looks up the examined token in the GEOSTREAM database (or another provided source of POIs).
5. Entity annotation: the location(s) found in the text are hyperlinked with relevant geodata from the back-end database.
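The actual engine builds on the Apache OpenNLP Java library and trained language models; as a highly simplified, self-contained illustration of the pipeline in JavaScript, the sketch below replaces the name finder with a plain gazetteer lookup against an in-memory list standing in for the GEOSTREAM places database (tokenization and POS tagging are omitted).

```javascript
// Tiny in-memory gazetteer standing in for the GEOSTREAM places database.
const GAZETTEER = [
  { label: "Hyde Park", lat: 51.5073, lon: -0.1657 },
  { label: "Big Ben",   lat: 51.5007, lon: -0.1246 },
];

function geocodeText(text) {
  // 1. Sentence detection: split on sentence-ending punctuation.
  const sentences = text.split(/(?<=[.!?])\s+/);
  const annotations = [];
  for (const sentence of sentences) {
    // 2.-4. Tokenization and POS tagging omitted; the "name finder" is reduced
    // to a case-insensitive gazetteer lookup within each sentence.
    for (const place of GAZETTEER) {
      if (sentence.toLowerCase().includes(place.label.toLowerCase())) {
        // 5. Entity annotation: link the mention to the matching location record.
        annotations.push({ mention: place.label, sentence, location: place });
      }
    }
  }
  return annotations;
}

console.log(geocodeText("We walked north of Hyde Park. Later we saw Big Ben."));
```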

IV. Live mobile computing platform

In this part, the main goal was to investigate innovative concepts for geospatial content streams created by taking advantage of the advanced capabilities and features of smartphones. More specifically, the aim was to increase the value of already existing user-generated content by making implicit geospatial aspects visible and usable. Another focus was the collection, fusion and provision of live user-generated content. To that end, we have developed a framework to provide a common API usable from Web authoring tools as well as mobile tools. In order to realize multiple use cases from different domains, the generic framework comprises functions that all use cases have in common. In particular, in all scenarios studied, new data is captured, linked to already existing data, and visualized to the user. A user management service and a flexible subscription service enable near real-time communication and collaborative authoring.

The live mobile guide prototype demonstrates all the functionalities offered by the framework. To cover different use cases, the application was designed as a blueprint for developers to test and experiment with the functions provided and to monitor the created content. By adding an application-specific user interface design, the application can be adapted for numerous use cases.

The concept phase started with use cases from the tourism domain, with travellers and travel book authors being the main target groups. Since the objective of GEOSTREAM is to collect, share and repurpose user-generated geospatial content, two of the major aspects studied were communication and collaboration, and linking and combining content from different authors. When examining the use cases and functionalities necessary or desirable for travellers and travel book authors, it is almost immediately clear that there is a substantial overlap.

Consider the following examples. A traveller walks around the city and creates new content, which she shares with friends or other travellers; she also recommends places to visit to friends, comments on content from other people, or combines elements for her own diary. She would also subscribe to news about a place to receive real-time notifications on information updates. A travel book author researches, collects, reviews and recombines geo-related information for a new book or a book update. When on the go, she would create detailed information, share it with her co-author, assign information verification or collection tasks, answer such tasks, or exchange advice. These individual requirements were compiled into a set of more or less use-case-independent features, describing general objects and actions for using geospatial content. This set formed the basis for the document model and framework that were developed.

Working with user-generated content involves different functionalities that can be summarized into the following main features. First of all, it should be possible to record new content: independent of the type of content, new entities like structured documents, places, tasks, events, media or tags need to be created. Second, the new content has to be visible to the user, who needs to be able to view the data independently of location and time. The content needs to be linked to already existing content in order to communicate and get in touch. This leads to the sharing of data with friends, followers, co-authors within the company, or the public. It is also important to be able to revisit content, to digitize and link older existing content, and to review data in order to check and correct information.

These actions depend on the user's situation. During the creation of content, different context parameters are used to structure the content and add metadata for further processing. The captured parameters, such as location, time, viewing direction, or app context, enable linking to existing published geo-content. Also, when searching for information, the user's situation (e.g. location or time) is used to present only the desired results. Live data fusion is used to link the data in near real time and enables vivid communication and collaboration among users.

Based on the presented functions, a general document model was developed to represent the user-generated content. It is a general concept for saving and linking all kinds of entities like places, images, documents or media. A "Document" is used to model single "bits" of information, like a short message, a picture, a paragraph, a task, a form, or a short video, as well as more complex documents consisting of a sequence of other documents (e.g. a short story, a slide show, or the contents of a book). A document always contains two types of information: metadata and the actual content. The content property carries the actual information in the form of text, audio, video, image data, etc. The metadata consists of general properties and features describing the user situation more closely. For example, the creation time is taken from the current situation and, together with the duration, the time span of an event can be declared. Further, the location can be described more precisely by a location object, which contains a label, tags and the geometry. The geometry can be a single point to describe an exact position or a polygon to define an area. The content part is used to model a single (short) document or a document collection and always consists of exactly one of the presented properties. All of the previously mentioned entities can be modelled with the content property. Simple notes are saved as text; more complex information, such as opening hours with different table cells, can be represented in JSON format. Media documents like images, videos or audio can be included directly.

Furthermore, there exist two ways of linking documents together: direct and indirect. The optional properties ids and docs are used for directly linking documents. For example, one document may contain the ids of two other documents, one containing an image and one containing a text note. These two documents are thus directly linked together, they know about each other, and the connecting document may additionally be linked to a location. Documents within a document collection inherit the metadata from their collection; however, all metadata properties can be overridden individually by a single document. Another possibility is to link documents indirectly using tags. Tags can be character strings taken from a common taxonomy, temporal information (e.g. season or weekday), or they can be geo-based, such that content objects are linked by a common location or neighbourhood. Two documents (e.g. a picture and a place description) can be tagged with the same location, which means the image document and the text note (somehow) belong together.
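To make the document model more tangible, the following JavaScript objects sketch a photo document, a note document and a connecting document; the exact property names and the placement of the ids property are assumptions made for illustration, following the description above.

```javascript
// A photo document: metadata (time, location with label, tags and geometry) plus one content property.
const photoDoc = {
  id: "doc-101",
  metadata: {
    created: "2014-08-02T10:15:00Z",
    location: { label: "Old Town Square", tags: ["prague"],
                geometry: { type: "Point", coordinates: [14.4208, 50.0875] } },
  },
  content: { image: "https://example.org/photos/square.jpg" },   // placeholder URL
};

// A note document, indirectly linked to the photo via the shared location/tags.
const noteDoc = {
  id: "doc-102",
  metadata: {
    created: "2014-08-02T10:20:00Z",
    location: { label: "Old Town Square", tags: ["prague"] },
  },
  content: { text: "Great view of the astronomical clock from the cafe terrace." },
};

// A connecting document that links the photo and the note directly by their ids,
// forming a small collection whose metadata the children inherit by default.
const collectionDoc = {
  id: "doc-103",
  metadata: { created: "2014-08-02T10:25:00Z" },
  content: { ids: [photoDoc.id, noteDoc.id] },
};
```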

Using the document model described above, different kinds of use cases can be tackled, including use cases from entirely different domains. Just to mention some examples, the following users can benefit from the concept:
- Travel Agency: recommendation and planning
- First Aid Helper: alerting and coordinating volunteers to help in emergency situations
- Marketing/Real Estate Agency: evaluation of property objects
- Citizen Reporter: report accidents, potholes
- Craft Enterprise: coordination and control of work orders.

The developed document model was implemented in a framework to provide a common API (in the form of a set of Web services) usable from Web authoring tools as well as mobile tools. The following aspects play a major role in the development:
- Separating the GEOSTREAM content from publicly available third party content
- Providing a flexible approach supporting a variety of use cases
- Supporting collaborative authoring, using (near) real-time notifications

The third-party content is saved in a different repository, so that the data is kept separate and independent. Further, it is also possible to partition the GEOSTREAM content itself, so that different people can use the storage without interference; different work projects can be created without their information crossing over. The functions are divided into separate modules, so that a specific combination and configuration is possible for every use case.

The framework consists of four separate services: (a) Content Service, (b) Subscription Service, (c) Authentication Service, and (d) Profile Service. These are briefly described next. The content service is used to flexibly manage documents tagged with geospatial, temporal, as well as other semantic data. Documents are structured in a very generic way to support a wide spectrum of use cases. The content service supports management functions (e.g. upload, list, download, and remove) as well as retrieval functionality (e.g. search). The subscription service provides a way for users to specify their interest in specific locations, in certain types of documents, in topics, or in a combination of these aspects. If a new document matching the search criteria specified in a notification subscription is uploaded to the content service, the user is notified and can retrieve the new information. The authentication service handles registration and login for users; it generates a session token to ensure the authentication of the user. The profile service manages the user profiles: the user can view and edit her own profile and retrieve other profiles if she is authorized, and invitations for other users can be created and accepted. In addition, a location service matches a string of text against the GEOSTREAM places database (generated from third-party user-generated content) and proposes a set of locations referenced in the input text. A user can access the services through a Web or mobile application.
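The sketch below illustrates how a client application might combine these services (login, subscribe, upload); all endpoint paths and payload fields are hypothetical and serve only to show the intended interplay of the services, not the framework's actual API.

```javascript
// Hypothetical client-side walkthrough of the services; paths and fields are placeholders.
async function demo(baseUrl) {
  // Authentication service: log in and obtain a session token.
  const login = await fetch(`${baseUrl}/auth/login`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ user: "alice", password: "secret" }),
  });
  const { token } = await login.json();
  const auth = { Authorization: `Bearer ${token}`, "Content-Type": "application/json" };

  // Subscription service: register interest in new documents near a location.
  await fetch(`${baseUrl}/subscriptions`, {
    method: "POST", headers: auth,
    body: JSON.stringify({ near: { lat: 48.2082, lon: 16.3738, radius: 500 }, type: "note" }),
  });

  // Content service: upload a new document; matching subscribers get notified.
  await fetch(`${baseUrl}/documents`, {
    method: "POST", headers: auth,
    body: JSON.stringify({
      metadata: { location: { label: "Stephansplatz",
                  geometry: { type: "Point", coordinates: [16.3738, 48.2082] } } },
      content: { text: "Street festival going on right now." },
    }),
  });
}
```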

Potential Impact:
I. Impact

It is widely recognized that the efficient use of content is of pivotal importance to the competitiveness of the European economy. To that end, GEOSTREAM creates technology to support a series of services that allow one to exploit the ever-increasing treasure of user-generated geospatial content: (a) geospatial data harvesting exploits already existing content by providing intelligent data mining and fusion algorithms; (b) tool-based crowdsourcing focuses on creating easy-to-use software that supports the user in the generation of such content, e.g. forum software for travel guide publishers; (c) live guides provide a means to collect geospatial data in the field and fuse them in real time.

More specifically, the project provides the participating SMEs with efficient means for tapping into the ever increasing stream of user-generated geospatial data to improve and diversify their existing products and services, thus becoming more innovative and competitive. This includes: (a) enriching and diversifying their content by adding user-generated content from the Web and from their own customer base via advanced and easy to use authoring tools; (b) upgrading their mobile guide technology by introducing a live data collection and publishing channel; (c) linking their data to related user-generated geospatial data on the Web, and producing aggregated collections; (d) improving their products with novel geospatial search services and geoblogging software; (e) improving geospatial business intelligence in the area of geomarketing services.

In more detail, in the GEOSTREAM project, via the SMEs participating in the consortium, three main domains are investigated and addressed. The first is mobile travel guides. There already exists a wide range of traditional travel guides that have also been converted to electronic guides. However, these guides suffer from the following disadvantages: (i) electronic guides are mere conversions of the printed books, (ii) content is controlled by the authors, and (iii) although user feedback is collected in forums, it is unstructured and cannot be used directly in guide production. Hence, content and service development in this direction will be positively affected by the GEOSTREAM project in the following ways:
- Novel content: (a) user-contributed content will be structured and available for future publications; (b) existing user-generated geospatial content will become a content source in publications representing, e.g. tourist perception of places
- Novel Applications: (a) A rich user-generated geospatial content stream that will be used to structure and direct customer feedback in forums; (b) a mobile guide that will allow for direct collection and communication with users/customers resulting in live mobile travel guides.

The second main domain concerns services related to geomarketing. Enhancing business data with geocoordinates makes it possible to visualize address data spatially and, by using statistical procedures, detect correlations and patterns in the data so that improved strategic decisions can be taken by management. In the area of geomarketing, automated compilation as well as manual and flexible processing and maintenance of addresses, including appurtenant spatial information (hence, address coordinates), can be regarded as the basic prerequisites for performing spatial Business Intelligence (BI) analysis. For over 15 years, these spatial BI analyses and spatial information systems have offered customers significant added value for optimizing processes in the areas of marketing strategy, sales force supervision, route optimization, site planning, marketing and controlling. Approximately 70% of the data existing in companies exhibits a geographic component. Ideally, this can be a complete address, replete with street number. However, it is very common for geospatial information to be incomplete, in the form of location information or a short description of the location, such as a point of interest (e.g. "across from Big Ben," "north of Hyde Park," etc.), or rough information for a city district. Such imprecise information cannot be augmented with coordinates using customary geocoding methods, as there are no reference objects in address databases. In order to augment addresses as completely as possible in spite of this, the challenge lies in converting even these imprecise and irregular spatial data into location coordinates that are as precise as possible. Localization of this imprecise input can only be made possible by resorting to user-generated content.

The third domain involves geospatial data publishing, and more particularly services related to real estate, fleet management and asset tracking. In this case, the main objective is to integrate user-contributed data with other existing content and services, such as routing. In particular, the techniques and tools developed in GEOSTREAM can benefit existing applications in the following ways: (a) by increasing the available content, and hence increasing the usage of the application and the corresponding site's traffic; (b) by integrating novel content authoring means that provide users with a simple way to add content to the application, and (c) by creating a mobile client application for live content collection and sharing.

II. Dissemination

Throughout the duration of the GEOSTREAM project, the partners have carried out a series of activities to promote the project and its results and outcomes. From the very beginning of the project, the main activities have included:
- designing a project logo;
- creating a public project web site;
- designing a project brochure and leaflet;
- promoting the project by using every partner's own network (i.e. customers, collaborators).

The GEOSTREAM logo, as well as the brochure and leaflet, can be found at the public project web site (http://geocontentstream.eu/). The project logo has been designed to combine all relevant aspects of the project's purpose: content, user, geospatial, stream. The project's public web site provides all basic information about the project, such as its main focus and objectives, the participating organisations, the public deliverables, the scientific publications, and the achieved results. The GEOSTREAM brochure and leaflet have been designed to give a brief description of the project's content; the brochure is more detailed, while the leaflet is aimed at being more concise. Furthermore, to maximise the impact of the dissemination efforts, the members of the consortium have decided that every project partner will exploit their own network and their communication, marketing and sales channels in order to present the GEOSTREAM project to the relevant communities as well as to their clients and business partners in public and private business. Additionally, on every project partner's homepage, including those of both RTD performers and SME participants, the GEOSTREAM project is mentioned with a short description of the project's focus and a reference to the GEOSTREAM web site.

To promote the results and outcomes of GEOSTREAM, the project partners have carried out and planned several activities, including:
- Presentations and publications (workshops, conferences)
- Tradeshows
- Product Integration (commercial spatial solution): Proof of concept, integrating GEOSTREAM output in commercial geospatial solutions

In terms of scientific publications, the following list includes the works that were published before the end of the project:
- D. Sacharidis, D. Skoutas, G. Skoumas. Continuous Monitoring of Nearest Trajectories. In Proc. of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Dallas, Texas, USA, November 4-7, 2014.
- G. Lamprianidis, D. Skoutas, G. Papatheodorou, D. Pfoser. Extraction, Integration and Exploration of Crowdsourced Geospatial Content from Multiple Web Sources. In Proc. of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Dallas, Texas, USA, November 4-7, 2014 (Demo paper).
- G. Lamprianidis, D. Skoutas, G. Papatheodorou, D. Pfoser. Extraction, Integration and Analysis of Crowdsourced Points of Interest from Multiple Web Sources. In Proc. of the 3rd ACM SIGSPATIAL International Workshop on Crowdsourced and Volunteered Geographic Information, Dallas, Texas, USA, November 4, 2014.
- G. Skoumas, D. Skoutas, A. Vlachaki. Efficient identification and approximation of k-nearest moving neighbors. In Proc. of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 254-263, Orlando, FL, USA, November 5-8, 2013.

In addition to the aforementioned scientific publications, WIGeoGIS (SME participant) has presented the GEOSTREAM project at the two events listed below:
- WIGeoGIS KnowledgeDay 2014, Munich (June 2014)
- WIGeoGIS Clients Meeting 2014, Vienna (October 2014)
Both of these events take place every year; the participants are WIGeoGIS clients (from public and private business, across various sectors) as well as data and research partners. Around 100 to 130 people typically attend these events to be informed about client solutions as well as about WIGeoGIS research activities and product innovation. In addition to presentations, WIGeoGIS solutions and research results are demonstrated at workstations in live web applications. It is already planned to present the final project results and achievements at these events in 2015 in the same way.

The GEOSTREAM project partner Michael Müller Verlag has realised and planned the following promotion activities:
- Book Fair in Frankfurt (October 2013, October 2014), Germany
- Book Fair in Leipzig (March 2014, March 2015), Germany
- ITB Berlin (March 2014, March 2015; MMV is official partner), Germany
The Book Fair in Frankfurt is one of the world's leading book fairs, with more than 400,000 visitors, 9,000 journalists and around 7,200 exhibitors from more than 100 countries. The Book Fair in Leipzig is almost the same size as the Frankfurt Book Fair: over 400,000 visitors, approximately 6,500 journalists and almost 4,400 exhibitors from more than 90 countries. The ITB is likewise one of the biggest trade shows in its sector. Michael Müller Verlag is going to present its whole range of travel guides (print, eBook and apps) there. In 2014, Michael Müller Verlag was an official partner and sponsored the fair with a free ITB Berlin travel guide. The ITB counts more than 320,000 visitors, 10,000 journalists and almost 20,000 exhibitors from more than 180 countries.

Additionally, MMV uses further marketing channels to promote its participation in the GEOSTREAM project and the project's output. Activities include:
- GEOSTREAM area at the official MMV forum
- Presentation on Facebook as part of MMV travel guides
- Newsletter with special feature focus on GEOSTREAM
- Press events and media travel events with particular attention to the GEOSTREAM features in MMV apps.

Furthermore, GEOSTREAM partner TALENT will participate in the following trade fairs in Greece, where the results of the GEOSTREAM project will be presented:
- Posidonia International Shipping Exhibition; next event: June 2015
- InfoCom media; next event: October 2015

III. Exploitation

Next, we outline the exploitation plans and actions in terms of integrating and using the tools and services developed in GEOSTREAM in order to enhance and enrich the applications and services of the SME participants.

The data gathering API, one of the essential results of the GEOSTREAM project, will be technically integrated into several products that the GEOSTREAM SME partners already have on the market. In particular, MMV will use the API in all of its apps and will enrich all travel guides published by Michael Müller Verlag with data and information generated from the GEOSTREAM API and sources. The apps available from Michael Müller Verlag support all three common mobile platforms (iOS, Android and Windows Mobile).

WIGeoGIS will use the GEOSTREAM API to bring social media content together with commercial location planning. WIGeoGIS is convinced that social media content will gain importance not only in Business-to-Consumer (B2C) but also in Business-to-Business (B2B) applications, for example location planning by retail companies or the distribution of advertising media. In some commercial applications developed by WIGeoGIS, the integration of social media and user-generated content has already been implemented. One characteristic example is a web-based application used for the assessment of shopping centre locations; this application is up and running and has been used by a WIGeoGIS client since August 2014. WIGeoGIS supplies up to 30 clients with web-based geomarketing applications, so there is wide potential to include user-generated content and to enrich these solutions with the GEOSTREAM API.

TALENT will also enrich existing products with GEOSTREAM output. TALENT is a provider of mobile applications focusing on maritime navigation and cultural guides. Three products in TALENT's range can be meaningfully complemented by the gathered user-generated content: the Nautilus Charts, the Cruiser off-line navigators, and the Cultural Navigator of Athens. The main features of the Nautilus Charts application include: map browsing with zoom in/out and pan facilities; full-detail maps as distributed in official ENCs, locally available on the device; GPS position and sailing trail. The main features of the Cruiser off-line navigators include: routing facilities; support for various map styles; additional data layers that can be superimposed on demand; accompanying information (texts, pictures, etc.) that can be attached to data points. The main features of the Cultural Navigator of Athens include: a mobile navigator for 250 points of cultural interest in Athens; use of interactive maps, guided narration and Augmented Reality.

In GEOSTREAM, the main focus was on a tourism-related scenario; nevertheless, similar applications in other areas, e.g. geomarketing, can also benefit. Below we present some further use cases that the SME partners have identified, in order to provide a wider view of the opportunities for exploiting the project results.

a) Private Industry (B2B)
i. Retail:
- Expansion/Location Planning: Evaluation of locations; Evaluation of potential of demand (e.g. planning of new stores/locations)
- Marketing: Analysis of local potential of demand (e.g. market entry strategy)
ii. Advertising / Outdoor advertising:
- Distribution: Evaluation of locations of billboards, city lights, advertising columns
iii. Real Estate:
- Market Research: Evaluation of developments of regional real estate markets; Forecast of development of rent levels
- Distribution: Evaluation of real estate property locations; definition of target groups for real estate properties.

b) Public business (B2C)
i. Tourism:
- Websites of Tour Operators, Tourism Associations
- Websites and information platforms of Municipalities, Cities
- Touristic information websites focused on specific topics or target groups
ii. Civil Service
- Marketing research for Cities, Municipalities, administrative regions and units
- Information platforms of administrative units
iii. Usage in Research and Market Research

List of Websites:
http://geocontentstream.eu/