Final Report Summary - SEMDATA (Semantic Data Management)
The Semantic Web has been defined as an extension of the current World Wide Web, in which information is published and interlinked so that its structure and semantics (meaning) can be exploited by both humans and machines. To foster the realization of the Semantic Web, the World Wide Web Consortium (W3C) has developed, over the last decade, a set of languages to represent metadata (RDF) and ontologies (RDF Schema and the OWL variants), and to execute queries over data (SPARQL). Research in the early years was mostly concerned with the definition and implementation of these languages, the development of the corresponding support technologies, and applications in various domains.
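To make the data model concrete, the following minimal sketch models RDF statements as plain Python 3-tuples (subject, predicate, object) and answers a SPARQL-style basic graph pattern over them. The names (ex:Alice, ex:knows) are illustrative only; real applications use full IRIs and an RDF library rather than this toy matcher.

```python
# Illustrative only: RDF triples modeled as plain Python 3-tuples
# (subject, predicate, object); IRIs abbreviated to short prefixed names.
triples = {
    ("ex:Alice", "ex:knows", "ex:Bob"),
    ("ex:Bob",   "ex:knows", "ex:Carol"),
    ("ex:Alice", "rdf:type", "ex:Person"),
}

def match(graph, s=None, p=None, o=None):
    """Return triples matching a basic graph pattern; None plays the
    role of a SPARQL variable and matches anything."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# Roughly analogous to: SELECT ?who WHERE { ex:Alice ex:knows ?who }
friends = [o for (_, _, o) in match(triples, s="ex:Alice", p="ex:knows")]
# friends == ["ex:Bob"]
```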
During recent years, Semantic Web technologies have been increasingly adopted by mainstream corporations (e.g. in online publishing, healthcare), by governments worldwide thanks to the rise of public-sector information reuse and open data initiatives, and by scientific communities (e.g. Life Sciences, Geography, Astronomy). Major search engine providers and social network sites have recognized the benefits of using semantic data, launching services that leverage semantic data on the Web to improve the end-user experience (e.g. Google's and Facebook's Knowledge Graphs, the schema.org initiative). The Linked Open Data community movement has also shown that it is possible to expose large amounts of RDF data on the Web. One sign of this uptake is that approximately one third of the participants in the most recent International Semantic Web Conference (ISWC 2017) were affiliated with companies.
Semantic data management refers to a range of techniques for the manipulation and usage of data based on its meaning. In this project, we understand semantic data as data expressed in RDF, the lingua franca of linked open data and hence the default data model for annotating data on the Web. Now that the foundational concepts and technologies are in place, and a critical amount of semantic, linked data has been published, the next crucial step concerns maturity and quality. Indeed, many aspects of Linked Data management face great challenges in mastering the varying maturity and quality of the data sources exposed online. One reason for this state of affairs is the ‘publish-first-refine-later’ philosophy promoted by the Linked Open Data movement, compounded by the open, decentralized nature of the environment in which we operate. While both have led to remarkable growth in the amount of data made available online, the maturity and quality of the actual data, and of the links connecting data sets, is something that the community is often left to resolve. In this context, maturity and quality are understood under the general term ‘fitness-for-use’, which covers features such as data completeness, the presence of (obvious) inconsistencies, timeliness, and relevance to the application domain. These features may span different data sets, given the prominent role that linkage plays in Linked Data management.
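One of the fitness-for-use dimensions mentioned above, completeness, can be sketched as a simple check: flag entities of a given type that lack a property the application considers mandatory. The sketch below reuses the plain-tuple representation of RDF triples; all names (ex:Person, ex:email) are hypothetical and stand in for whatever an application's quality policy requires.

```python
# Hypothetical completeness check over RDF triples modeled as 3-tuples.
# The required property (ex:email) is an application-level assumption,
# not part of any real data set.
triples = {
    ("ex:Alice", "rdf:type", "ex:Person"),
    ("ex:Alice", "ex:email", "alice@example.org"),
    ("ex:Bob",   "rdf:type", "ex:Person"),
}

def incomplete_entities(graph, entity_type, required_property):
    """Return entities of entity_type that have no value for
    required_property, i.e. entities failing the completeness check."""
    entities = {s for (s, p, o) in graph
                if p == "rdf:type" and o == entity_type}
    covered = {s for (s, p, o) in graph if p == required_property}
    return sorted(entities - covered)

missing = incomplete_entities(triples, "ex:Person", "ex:email")
# missing == ["ex:Bob"]: Bob has no ex:email, so he fails the check
```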
In the context of this project, secondments have allowed project partners to produce results in the areas of curation, preservation and provenance, dynamicity, and efficiency and scalability. These results include:
- Several tutorials, prepared and run jointly by project beneficiaries and additional third parties with whom these beneficiaries often collaborate. The tutorials have been taught at Semantic Web and data management conferences, have been generally well attended, and all the materials are available online. Three MOOCs have also been prepared on topics related to semantic data management.
- Several workshops have been run on SemDATA topics, including a Dagstuhl workshop on crowdsourcing (2014) and another one on Citizen Science (2017).
- SemDATA beneficiaries have been active in W3C groups strongly related to SemDATA (the Spatial Data on the Web working group and the RDF Stream Processing community group).
- A good number of scientific papers and a book directly resulting from the work done during SemDATA secondments. These include several surveys: on temporal ontologies, on entity resolution techniques for the Web of Data, and on large-scale reasoning with semantic data (with a comprehensive and systematic overview and analysis of mass-parallelization techniques applied to a variety of reasoning methods), as well as more focused contributions advancing the state of the art in the areas involved in SemDATA.
Among the main contributions that can be highlighted are:
- Time and space representation, and reasoning, in semantic data
- Ontology evolution
- Temporal ontologies
- Crowdsourcing, provenance and quality in semantic data
- Entity resolution on the Web of Data
- Licensing in the Web of Data
- Theoretical aspects of SPARQL provenance and Ontology-Based Data Access
- Contributions to the SSN Ontology
- RDF Stream Processing representation and query languages, and their semantics
- RDF Stream Processing systems, using Big Data infrastructure
- Corpus of social network data
- Parallel semantic query evaluation and reasoning
- Massively parallel argumentation techniques
- Parallel spatial and temporal reasoning