Semantic tools for digital libraries

Final Report Summary - SEMLIB (Semantic tools for digital libraries)

SEMLIB partners are small to medium-sized enterprises (SMEs) with different experiences and target markets that regard linked data and the semantic web as potentially key technologies to advance their commercial products, drive innovation and bring added value to their customers.

In this context, the SEMLIB project focuses on addressing three main needs:

- to enhance SME products with the capability of understanding and exporting linked data;
- to enable end-user involvement in creating structured data, enriching online content and linking it to the web of data;
- to make it easier for users to discover data and information by offering recommendations based on both user behaviours and semantics of the content.

The goal of the SMEs was to obtain both configurable software modules to be plugged into their data workflows and the know-how needed to develop the solutions further and bring them to market.

Two research and technology development (RTD) providers, UNIVPM (Italy) and NUIG (Ireland), both with long academic experience in semantic web development, participated in the project and produced the two main outcomes:

Pundit (see http://thepund.it for details) is a tool to semantically augment web pages. It supports different kinds of annotations, from simple comments to automatically extracted tags, typed relations and RDF graph composition. Pundit comes as a JavaScript library to be included in web applications or as a bookmarklet to install locally and annotate the web while surfing. With the tool, users create links to the web of data (e.g. to DBpedia, Freebase, WordNet, Europeana and other possible sources) and can also use their own vocabularies and 'computational ontologies'.

Pundit allows collaborative annotation and exposes data to be consumed via SPARQL endpoints or a dedicated REST API. The Pundit final release is described in D3.3.

SLDR (see http://sldr.deri.ie/ for details) is a linked data recommendation engine. The system was designed to allow users to compute recommendations from linked data (i.e. RDF). It does so by using the SPARQL query language to declaratively specify the data from which recommendations are computed, and then stores those recommendations as linked data.

For the computation of recommendations, the system uses a complex library of recommendation algorithms. Recommendations can be based on preferences and/or similarities between the entities selected in the initial SPARQL query and retain their linkage to the data input into the system. The SLDR final release has been described in D4.3.

During the last phase of the project, each SME successfully integrated the RTD results with its own web application, resulting in online demonstration digital libraries with annotation and recommendation capabilities. As the project has concluded and the initial feedback collected from the market is very promising, each SME is now in the process of bringing the SEMLIB results to a production-ready state and integrating them into its business.

Project context and objectives:

Motivating scenario

The term digital library refers to a wide array of different organisations and collections that share the common trait of exposing digital content to a community of users. Digital libraries are applied in many different contexts ranging from academic institutions to public libraries, archives, museums and industries.

The content varies depending on the organisation: it can be either reproductions of physical objects or content that is 'born digital'. Usually a digital library has two main functions: to store and preserve content, and to deliver it to end users, whether human consumers or other software applications.

When the project started, digital libraries (DLs) took only limited advantage of the benefits that modern computing technologies offer.

To overcome this bottleneck, research and development for digital libraries include processing, dissemination, storage, search and analysis of all types of digital information. Traditional digital library applications are nowadays facing three new key challenges:

1. size and searching: due to the large volume of data that digital libraries can store, it has become very difficult for users to find and retrieve relevant content;
2. interoperability: re-using, re-purposing and re-mixing digital objects in heterogeneous environments is becoming a primary need of users. Additionally, due to the rapidly increasing level of integration of IT infrastructures, digital libraries are now consumed and manipulated at the same time by human actors, by machines and by other software applications;
3. linking: the value of digital libraries multiplies according to the relations that link their objects to other digital objects, also outside the boundaries of a single digital repository.

Semantic web technologies are a promising solution to meet the challenges that nowadays are faced by all digital libraries and digital content repositories, by:

1. helping users to manage large and heterogeneous data sets with semantically enhanced searching;
2. maximising interoperability of content by making semantics of the digital objects and their relations explicit and tractable by machines and software agents;
3. enriching the context around the objects by semantically annotating the objects themselves and their links with other content, within and outside the boundaries of a library (linked data).

Semantic technologies support more flexible information management than that offered by classic digital libraries. Information about library resources can be composed from heterogeneous sources, including contributions from the communities of library users and mash-ups of disconnected systems. Although semantic web technologies have already proven valuable, Semantic Digital Library software applications have only just begun to emerge on the market and to reach the maturity needed to enter the industry. Nevertheless, the path along which traditional digital libraries will evolve into Semantic Digital Libraries has become evident.

Objectives

A first objective of the project was to make existing SME products more interoperable and compatible with the web of data. The benefits are easier integration of new data coming from the LOD cloud, as well as the possibility of publishing new structured data as linked data, thus making such data visible and reusable. This was done in the first months of the project by developing so-called Linked Data connectors: specific modules that, building on existing open-source software such as D2RQ, bridge the gap between traditional databases and linked data / SPARQL technologies.
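The idea behind such a connector can be illustrated with a minimal sketch. It is not the SEMLIB connector code and does not use D2RQ or its mapping language: it simply assumes a hypothetical `books` table in an in-memory SQLite database, a hypothetical base URI under example.org, and Dublin Core terms as the target vocabulary, and it serialises each row as N-Triples.

```python
import sqlite3

BASE = "http://example.org/library/"   # hypothetical base URI for published resources
DC = "http://purl.org/dc/terms/"       # Dublin Core terms namespace


def escape_literal(value: str) -> str:
    """Escape a string so it can be used inside an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def rows_to_ntriples(conn: sqlite3.Connection) -> list[str]:
    """Map every row of the (hypothetical) books table to two N-Triples lines."""
    triples = []
    rows = conn.execute("SELECT id, title, creator FROM books ORDER BY id")
    for book_id, title, creator in rows:
        subject = f"<{BASE}book/{book_id}>"
        triples.append(f'{subject} <{DC}title> "{escape_literal(title)}" .')
        triples.append(f'{subject} <{DC}creator> "{escape_literal(creator)}" .')
    return triples


def demo_connection() -> sqlite3.Connection:
    """Build a small in-memory database standing in for an SME's catalogue."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, creator TEXT)")
    conn.executemany(
        "INSERT INTO books VALUES (?, ?, ?)",
        [(1, "Divina Commedia", "Dante Alighieri"),
         (2, 'The "Prince"', "Niccolò Machiavelli")])
    return conn


if __name__ == "__main__":
    for line in rows_to_ntriples(demo_connection()):
        print(line)
```

A real connector would typically describe the table-to-vocabulary mapping declaratively and answer SPARQL queries on demand rather than dumping triples, but the row-to-triple correspondence is the same.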

However, the main goal of the project has been to provide two novel and cutting-edge software solutions to support end-user semantic annotation of contents and semantic recommendation of related contents.

The main positive effects that the SEMLIB results achieved for the SMEs are:

1) to make it possible for digital libraries to exploit user-generated content (and structured data) to enrich their collection descriptions;
2) to allow library managers and collection holders to enrich their document metadata at any time, thus making content more searchable and easier to use;
3) to enrich the end-user experience by giving users access not only to the data available within the boundaries of a single library, but also to the data in all the libraries powered by one of the SME products and to the enormous and constantly growing amount of structured data publicly available on the web (linked data).

Finally, an objective was to integrate the two software modules into the SMEs' development environments and web applications. This gave the SMEs the opportunity to demonstrate the potential of the results to their customers and business partners, as well as to collect valuable feedback.

Motivated by the good impact that such novel features had on the public, SMEs are now in the process of planning future activities: further customisation of the tools and deployment of the RTD results into production-ready environments.

Project results:

In this project, the following main results have been achieved:

- The enhancement of the SMEs' products to fully support linked data and expose machine-readable data on their websites. Work done in work package (WP)2: Web of data connectors.
- The development of a semantic web based annotation system, which provides pluggable components to be integrated in the SMEs' web applications, thus enabling users to create and manage their annotations. Developed by UNIVPM. Work done in WP3.
- The development of a hybrid recommendation system, which makes use of semantic web and linked data, providing a REST web service to query recommendations and an administrative interface to manage recommendation jobs. Developed by NUIG. Work done in WP4.
- The integration of the two systems into the SMEs' demonstration web applications and the evaluation of the resulting annotation and recommendation functionalities. Work done in WP5 and WP6.

Pundit

The Pundit web annotation system (Grassi et al., 2012; Nucci et al., 2012), developed by UNIVPM, provides annotation functionalities over web pages. Pundit is a novel semantic annotation system (Andrews et al., 2012) and is completely based on Semantic Web technologies. Annotations can be simple comments, tags, or more complex pieces of knowledge that include relations among different kinds of entities (text excerpts, images, places, persons and concepts). Annotations produced with Pundit are represented in RDF and are attached to web content at a custom granularity level: annotations can be attached to entire web pages, to images contained in pages or to fragments of texts and images (e.g. a line of text or a paragraph, or a polygonal region inside an image).
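To make the notion of an RDF annotation attached to a text fragment more concrete, the following is a minimal sketch and not Pundit's actual data model. The vocabulary under example.org, the property names (`annotatedBy`, `hasTarget`, `isPartOf`, `mentions`) and the character-offset fragment identifier are all assumptions introduced for illustration.

```python
from dataclasses import dataclass

EX = "http://example.org/annotation-vocab#"   # hypothetical vocabulary, not Pundit's


@dataclass(frozen=True)
class TextFragment:
    """A text excerpt inside a web page, addressed by character offsets."""
    page_url: str
    start: int   # offset of the first character of the excerpt
    end: int     # offset just past the last character of the excerpt

    def uri(self) -> str:
        # Illustrative fragment identifier; a real system may address
        # fragments differently.
        return f"{self.page_url}#char={self.start},{self.end}"


def annotation_triples(annotation_uri, author_uri, fragment,
                       predicate_uri, object_uri):
    """Return the triples describing one typed-relation annotation."""
    frag = fragment.uri()
    return [
        (annotation_uri, EX + "annotatedBy", author_uri),
        (annotation_uri, EX + "hasTarget", frag),
        (frag, EX + "isPartOf", fragment.page_url),
        (frag, predicate_uri, object_uri),   # the statement made by the user
    ]


def to_ntriples(triples) -> str:
    """Serialise triples whose terms are all URIs as N-Triples."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


if __name__ == "__main__":
    fragment = TextFragment("http://example.org/page.html", 10, 42)
    print(to_ntriples(annotation_triples(
        "http://example.org/annotation/1",
        "http://example.org/user/alice",
        fragment,
        EX + "mentions",
        "http://dbpedia.org/resource/Dante_Alighieri")))
```

Keeping the fragment as its own resource means the same excerpt can carry several statements, and the link to a DBpedia resource is what connects the page to the web of data.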

Thanks to the flexible and semantic-aware RDF data model, knowledge enclosed in users' annotations can be consumed as a semantic graph: Pundit provides both an ad hoc REST API and standard SPARQL endpoints.

The Pundit client is an annotation authoring tool that can be embedded into web pages. This component is decoupled from the server-side annotation store and communicates with it via a REST API only. Alternative clients, both for reading and for displaying annotations, can be developed relatively easily by SMEs to address specific needs.
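As a hedged illustration of how a client might consume annotations through a standard SPARQL endpoint, the sketch below builds an HTTP GET request following the SPARQL protocol (the query is passed in the `query` parameter). The endpoint URL and the vocabulary are hypothetical placeholders, not Pundit's real addresses or schema, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://example.org/sparql"        # hypothetical endpoint URL
EX = "http://example.org/annotation-vocab#"   # hypothetical vocabulary


def build_query(page_url: str) -> str:
    """Build a SELECT query listing annotations that target a given page."""
    # Reject characters that are not allowed inside a SPARQL IRI reference.
    if any(ch in page_url for ch in '<>"{}|^`\\ \n'):
        raise ValueError("page_url contains characters not allowed in an IRI")
    return (
        "SELECT ?annotation ?fragment WHERE {\n"
        f"  ?annotation <{EX}hasTarget> ?fragment .\n"
        f"  ?fragment <{EX}isPartOf> <{page_url}> .\n"
        "}"
    )


def build_request(endpoint: str, query: str) -> Request:
    """Prepare a SPARQL protocol GET request asking for JSON results."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


if __name__ == "__main__":
    request = build_request(ENDPOINT, build_query("http://example.org/page.html"))
    print(request.full_url)
    # Sending it would be done with urllib.request.urlopen(request) against
    # a real endpoint.
```

Because the interface is a standard one, any SPARQL-aware tool or library could be used in place of this hand-built request.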

Online resources

A website dedicated to Pundit, containing extensive up-to-date documentation (included in this deliverable as well) has been put online at http://thepund.it/

Furthermore, in the online demo section (see http://thepund.it/demo.php) a screencast and a slide presentation quickly introduce users to the main functionalities of the tool, and a fully functioning instance of Pundit can be tested live.

Releases and current development builds are available at http://metasound.dibet.univpm.it/release_bot/ and will be updated as the development goes on.

The source code of both client and server modules is hosted in a Git repository provided by Net7 and accessible by all project members.

SLDR

SLDR is the result of research and development work conducted at DERI (Policarpio et al., 2012; Fossati et al., 2012). It is an innovative recommendation system based on semantic web technologies that combines different recommendation strategies into a configurable hybrid system (Burke, 2007; Cantador et al., 2011; Heitmann et al., 2010).
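One common way to combine strategies in a hybrid recommender is a weighted sum of per-strategy scores. The source does not state which hybridisation technique SLDR uses, so the sketch below is only an illustration of the general idea; the strategy names, scores and weights are invented for the example.

```python
def hybrid_scores(strategy_scores: dict[str, dict[str, float]],
                  weights: dict[str, float]) -> list[tuple[str, float]]:
    """Combine per-strategy item scores into one ranking by weighted sum.

    strategy_scores maps a strategy name to {item: score in [0, 1]};
    weights maps a strategy name to a non-negative weight.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one strategy needs a positive weight")
    combined: dict[str, float] = {}
    for strategy, scores in strategy_scores.items():
        weight = weights.get(strategy, 0.0) / total   # normalise the weights
        for item, score in scores.items():
            combined[item] = combined.get(item, 0.0) + weight * score
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(combined.items(), key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    scores = {
        "behaviour": {"a": 0.9, "b": 0.2},   # e.g. derived from user activity
        "content": {"b": 0.8, "c": 0.5},     # e.g. derived from content similarity
    }
    for item, score in hybrid_scores(scores, {"behaviour": 0.5, "content": 0.5}):
        print(item, round(score, 3))
```

Making the weights a parameter is what makes such a scheme configurable: an application can favour behaviour-based or content-based evidence without changing the individual strategies.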

The SLDR system was designed to compute recommendations based on semantic web data (i.e. RDF). It does so by utilising the SPARQL query language in order to declaratively specify data which should be used to generate recommendations.

To compute these recommendations, SLDR uses a complex library of recommendation algorithms. Recommendations can be based on preferences and/or similarities between the entities selected in the initial SPARQL query. Because they are output as RDF, recommendations retain their linkage to the data that was originally input into the system. This cyclic relationship, shown in the diagrams above and below, allows recommendations to coexist with, and be reused alongside, other linked data.
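The cycle described here (select data with SPARQL, compute recommendations, write them back as RDF) can be sketched as follows. This is not SLDR's implementation: Jaccard similarity stands in for its algorithm library, the rows are hard-coded as if they were the result of the SPARQL query shown, and the `likes` and `similarTo` properties belong to a hypothetical vocabulary.

```python
from collections import defaultdict
from itertools import combinations

REC = "http://example.org/rec-vocab#"   # hypothetical vocabulary

# Query that would declaratively select the input data (not executed here).
PREFERENCE_QUERY = f"SELECT ?user ?item WHERE {{ ?user <{REC}likes> ?item . }}"


def jaccard_similarities(rows):
    """Compute item-item Jaccard similarity from (user, item) result rows."""
    users_by_item = defaultdict(set)
    for user, item in rows:
        users_by_item[item].add(user)
    similarities = {}
    for a, b in combinations(sorted(users_by_item), 2):
        shared = users_by_item[a] & users_by_item[b]
        if shared:   # only keep pairs with at least one user in common
            union = users_by_item[a] | users_by_item[b]
            similarities[(a, b)] = len(shared) / len(union)
    return similarities


def recommendations_as_ntriples(similarities, threshold=0.5):
    """Emit symmetric similarTo triples for pairs at or above the threshold."""
    lines = []
    for (a, b), score in sorted(similarities.items()):
        if score >= threshold:
            lines.append(f"<{a}> <{REC}similarTo> <{b}> .")
            lines.append(f"<{b}> <{REC}similarTo> <{a}> .")
    return lines


if __name__ == "__main__":
    item = "http://example.org/item/{}".format
    user = "http://example.org/user/{}".format
    rows = [(user(1), item(1)), (user(1), item(2)),
            (user(2), item(1)), (user(2), item(2)),
            (user(3), item(2)), (user(3), item(3))]
    for line in recommendations_as_ntriples(jaccard_similarities(rows)):
        print(line)
```

Since the output uses the same item URIs as the input, the resulting triples can be loaded next to the original data and queried together with it.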

The complete SLDR system was developed in Java for interoperability across the various setups that the SMEs had already established. A number of different software technologies are currently also being used within SLDR, including for example:

- Apache Mahout (a distributed machine learning library)
- Apache Hadoop
- Apache Lucene
- OpenRDF Sesame (RDF triple store)
- Glassfish Jersey REST API.

More details about the recommender, its use and underlying technology can be found in D4.2 and D4.3.

Online resources

A website has been created to collect the information and links needed to test and install the software: http://sldr.deri.ie/

A screencast video introduces the functionalities and shows how to use the web interface from an end-user perspective.

Each SME customised and integrated the RTD results into their applications and business workflows.

As the foreground of the project we consider the software modules that result from such integration: even though they share a common base, each module is the full property of the respective SME.

In deliverable D5.1 each SME provides a description of its own foreground.

Potential impact:

The linked data web is emerging as a global source of rich and interconnected data, where applications can both write and read structured descriptions and browse conceptual graphs of things.

SEMLIB partners, supported by feedback and exchanges with both commercial partners and research projects, believe that there is a strong opportunity in leveraging the web of data.

The results of the project are novel software components that provide two main advantages to the SMEs' current products.

The semantic annotation system enables manual or semi-automatic linking from published web content to the web of data. This allows knowledge created by system managers or by end-users to become part of a larger graph. Interestingly, such a graph is based on well-understood standards, thus allowing a variety of open-source tools to be applied to visualise, browse and analyse the data.

One of the objectives of the project was to make this component easily configurable for different use cases. Annotation is frequently a domain-dependent activity that needs proper terminology and settings. The software addresses this with a flexible configuration. Such a configuration, loaded on the fly from annotated web pages, defines which vocabularies to include and which modules to activate, thus making it possible to tailor the annotation functionalities to the specific target users.

The recommender system makes it possible to gain immediate, tangible value from such a graph, where users' annotations coexist with external linked data. An intermediate result of the SEMLIB project (the web of data connectors, WP2) enabled the SMEs' web applications to expose data directly as linked data and, together with data created by users via semantic annotations, to feed the recommender system and obtain different kinds of recommendations, e.g. based on user behaviour, on semantic distance or on content similarity.

Both systems have been successfully deployed and integrated by the SMEs into demonstration applications:

Net7 has integrated the annotation system and the recommender into its Muruca DL suite and is already adopting them in commercial and research projects. The clear need and appreciation shown for such instruments in the humanities, as well as in other domains, is a positive result of the project that is pushing Net7 to develop the tools further.

IN2 published a demo version of ON:meedi:a demonstrating annotation capabilities of images and text. The recommender system has been integrated into Followtheplace.com to improve social interaction among users and their pictures.

Knowledge Hives is currently integrating the two software components in the digi.me platform, a collaborative smart bookmarks management system. In addition, the annotation tool was enhanced with support for Civet (a technology for entity extraction owned by Knowledge Hives).

Liberologico deployed the two components as additional plug-ins of their DSite CMS, used by a considerable number of public and private institutions in Italy, and is in the process of offering them as add-ons to customers.

The results of the project proved to be applicable and useful in specific business projects and to bring clear advantages to the SMEs' solutions. It is also clear that, while from a research viewpoint the expectations were met, further work is needed to improve scalability and to perform detailed user interaction design based on specific applications. This is especially needed if the SMEs want to deliver their products as worldwide services.

List of websites:

SEMLIB project website: http://www.semlibproject.eu/
Annotation system website: http://thepund.it
Recommendation system website: http://sldr.deri.ie/
