
Identification and exploitation of user communities for efficiently handling large-scale folksonomies

Final Report Summary - COMMUNITY FOLKS (Identification and exploitation of user communities for efficiently handling large-scale folksonomies)

Social media applications, such as blogs, multimedia sharing sites, question answering systems, wikis and online forums, are growing at an unprecedented rate and are estimated to generate a significant amount of the content currently available on the Web. This has exponentially increased the amount of information that is available to users, from videos on sites like YouTube and MySpace, to pictures on Flickr, music on Last.fm, blogs on Blogger, and so on. This content is no longer categorised according to pre-defined taxonomies (or ontologies). Rather, a new trend called social (or folksonomic) tagging has emerged and quickly become the most popular way to describe content within Web 2.0 websites. Unlike taxonomies, which impose a hierarchical categorisation of content, folksonomies empower end users by enabling them to freely create and choose the tags that best describe a piece of information (a picture, a document, a blog entry, a video clip, etc.). However, although folksonomies are extremely easy to use and overcome important weak points of classical taxonomy systems, they suffer from some disadvantages.

More specifically:

-Owing to language variability, folksonomies lack precision, because different users behave differently when labelling resources. In fact, test searches on real folksonomies show that misspellings, new words, abbreviations and acronyms are widely used to label resources. As a result, a user who queries a folksonomy receives a complete set of answers only if her vocabulary is rich enough to submit different versions of the query covering the whole set of tags used to label the resources she is searching for. This assumption is clearly unrealistic, because users' vocabulary is often quite limited.

-Folksonomies are flat structures: it is hard to imagine real people crafting complex structures of tags to describe their objects (posts, photos, etc.) or objects from other sites.

As a consequence, while folksonomies play a relevant role in tagging and, therefore, in creating the metadata associated with content, they give no support for browsing, a task for which hierarchies of pre-classified concepts work well. Asking end users to come up with complex structures or relationships of tags is unrealistic, since they would need a very large vocabulary spanning many specialised domains.

To address the drawbacks mentioned above, it would be extremely useful to define and build a system capable of interactively supporting users as they surf a folksonomy.

This project is situated precisely in this context and aims to investigate whether it is possible to build a system for effectively searching and browsing user-generated content that has been freely described using tags.

In order to assist users in finding content of interest within a folksonomy, new approaches, inspired by traditional recommender systems, have been developed. These often exploit an underlying tag similarity measure: whenever a user labels a resource or searches for it using a set of tags, they suggest new tags to be added to the resource label or to the user query, on the basis of their similarity to the tags originally chosen by the user. They do so to increase the chances of finding relevant content in these settings.

Various metrics have been used to compute tag similarity, including, for instance, cosine similarity, the Jaccard coefficient and Pearson correlation. Some of the approaches exploiting these metrics have proved to achieve excellent results; however, they do so only if the underlying folksonomy is already dense, and they operate by making it even denser. We observe that this density assumption does not hold in practice; rather, most real-life folksonomies exhibit a power-law distribution of tag usage, with a few tags labelling most resources and most tags labelling just a few resources. This means that, in practical cases, classical similarity measures fail to compute the similarity of tags belonging to the long tail of low-popularity tags.
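The long-tail problem can be illustrated with a minimal sketch. The toy folksonomy, the tag names and the helper functions `jaccard` and `cosine` below are assumptions made purely for illustration and are not taken from the project; the sketch only covers two of the classical measures, computed on the sets of resources that each tag labels.

```python
from math import sqrt

# Toy folksonomy (hypothetical data): each tag maps to the set of resources it labels.
tag_resources = {
    "python": {"r1", "r2", "r3", "r4"},
    "programming": {"r1", "r2", "r3", "r5"},
    "pythonn": {"r6"},  # long-tail misspelling, used once
    "py": {"r7"},       # long-tail abbreviation, used once
}


def jaccard(a, b):
    """Jaccard coefficient of two resource sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine similarity of two binary tag vectors, given as resource sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


# Popular tags share resources, so both measures return a non-zero score.
popular_jaccard = jaccard(tag_resources["python"], tag_resources["programming"])
popular_cosine = cosine(tag_resources["python"], tag_resources["programming"])

# Long-tail tags share no resource, so both measures return zero,
# even though a human reader would consider the two tags related.
tail_jaccard = jaccard(tag_resources["pythonn"], tag_resources["py"])
tail_cosine = cosine(tag_resources["pythonn"], tag_resources["py"])

print(popular_jaccard, popular_cosine, tail_jaccard, tail_cosine)
```

Both measures depend entirely on co-occurrence: two rarely used tags that never label the same resource score zero regardless of how related they are, which is the behaviour described above for sparse folksonomies.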

To address this problem, we have studied and proposed a novel metric, based on a mutual reinforcement principle, specifically developed to capture similarity in large-scale folksonomies. We have provided an efficient realisation of this similarity metric, and we have shown that it accurately assesses the extent to which two folksonomy tags are related to each other, even when they are drawn from the long tail of low-popularity tags. Furthermore, we have quantitatively assessed its quality by comparing it against the classical similarity measures widely accepted in the literature on three real large-scale folksonomies, namely BibSonomy, MovieLens and CiteULike.

Furthermore, we studied and proposed a new approach that transparently induces the creation of a dense folksonomy, thus supporting the effective retrieval of resources by construction. At the core of our approach lies our tag similarity metric, which is used to recommend tags both when labelling resources and when querying the folksonomy: when a user labels a new resource, or submits a query to retrieve resources, the metric is used to automatically expand the user-selected tag set with those tags, already present within the folksonomy, that are (i) most similar to those she initially submitted, and (ii) among those most widely used in the folksonomy.
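The expansion step can be sketched as follows. This is an illustrative sketch only, not the project's implementation: the function `expand_tags`, its parameters `top_k` and `min_usage`, the way the two criteria are combined (a usage threshold followed by ranking on similarity) and all sample scores are assumptions. The similarity scores are supplied as a plain dictionary, so any tag similarity metric could fill it.

```python
def expand_tags(user_tags, similarity, usage, top_k=3, min_usage=2):
    """Expand user_tags with similar, widely used tags already in the folksonomy.

    similarity: dict mapping (tag, tag) pairs to a score (hypothetical input).
    usage:      dict mapping each tag to the number of resources it labels.
    """
    def sim(a, b):
        # Treat the similarity table as symmetric; unknown pairs score 0.
        return similarity.get((a, b), similarity.get((b, a), 0.0))

    scores = {}
    for candidate, count in usage.items():
        # Criterion (ii): keep only tags that are widely enough used.
        if candidate in user_tags or count < min_usage:
            continue
        # Criterion (i): score a candidate by its best similarity to any user tag.
        best = max((sim(t, candidate) for t in user_tags), default=0.0)
        if best > 0.0:
            scores[candidate] = best

    # Rank by similarity, breaking ties by usage and then alphabetically.
    ranked = sorted(scores, key=lambda c: (-scores[c], -usage[c], c))
    return list(user_tags) + ranked[:top_k]


# Hypothetical usage counts and similarity scores.
usage = {"python": 40, "programming": 55, "py": 1, "pyhton": 1, "snake": 12, "java": 30}
similarity = {
    ("py", "python"): 0.9,
    ("py", "pyhton"): 0.95,
    ("py", "programming"): 0.6,
    ("py", "snake"): 0.2,
}

expanded = expand_tags(["py"], similarity, usage, top_k=2)
print(expanded)
```

In this sketch the rare misspelling `pyhton` is never suggested despite its high similarity score, because it fails the usage threshold; the query is instead steered towards tags that already label many resources, which is what makes the folksonomy denser over time.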

We have conducted an extensive experimental evaluation on three large-scale datasets, namely BibSonomy, MovieLens and CiteULike. The results demonstrate that our metric operates effectively even in very sparse settings, where traditional metrics (including cosine similarity, SimRank and Latent Semantic Indexing) fail.

In particular, we succeed in creating denser folksonomies, within which searches can be answered with both better accuracy and better coverage.

It is worth noting that the proposed research will impact diverse research fields, including the latest research on:

(i) approaches aiming to help folksonomy users to browse a folksonomy with the support of external data sources (such as thesauri or pre-defined taxonomies);
(ii) approaches based on data mining and statistical tools conceived to discover hidden relationships among tags in a folksonomy; and
(iii) approaches exploiting social network theories on folksonomies.