Data Mining Algorithms in Practice

Periodic Reporting for period 4 - DMAP (Data Mining Algorithms in Practice)

Reporting period: 2020-08-01 to 2021-01-31

Our investigation has improved the state of the art for several important problems (e.g. topic learning, random utility model (RUM) learning, and online social network analysis, among others) that are at the heart of the online systems that drive the technological advancements of our age. The results of our project have been published in top-tier conferences and journals.

We have worked on several problems:
- we have improved the efficiency of topic learning in the natural LDA document model; these models make it possible to learn the topics of documents, in order to classify, or cluster, them in a completely unsupervised manner;
- we have studied discrete choice models, and we have given several new algorithms for learning random utility models that improve over the state of the art. These algorithms make it possible to guess which element a random user would prefer, given a generic set of options;
- we have given efficient algorithms for sampling users in social networks; these algorithms make it possible to efficiently compute the fraction of users that have a favorable opinion of a given movie or song, that is, they make it possible to efficiently learn properties of the users in a network;
- we have provided several algorithms that improve significantly over the state of the art for computing the frequency of motifs and graphlets in online social networks, that is, for assessing the frequency of the micro-structures in a large social network;
- we have also provided several algorithms for fair optimization, e.g. algorithms that make it possible to cluster people into groups so that no group contains too large, or too small, a fraction of people with a certain protected characteristic, in order for each group to be balanced and fair;
- moreover, we have also studied several other problems in the same area.
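
The balance condition in the fair-clustering item above can be made concrete with a small check. This is only a sketch of the constraint itself, not of the project's clustering algorithms; the function name `is_balanced`, the bounds `low` and `high`, and the sample names are hypothetical and chosen for illustration.

```python
def protected_fraction(members, protected):
    """Fraction of a group's members that have the protected characteristic."""
    return sum(1 for m in members if m in protected) / len(members)

def is_balanced(clusters, protected, low, high):
    """True if every non-empty cluster has a protected fraction within [low, high]."""
    return all(
        low <= protected_fraction(members, protected) <= high
        for members in clusters
        if members
    )

# Hypothetical data: eight people, four of whom have the protected characteristic.
protected = {"ann", "bob", "eve", "fay"}
fair = [["ann", "bob", "cat", "dan"], ["eve", "fay", "gus", "hal"]]
unfair = [["ann", "bob", "eve", "fay"], ["cat", "dan", "gus", "hal"]]

print(is_balanced(fair, protected, low=0.4, high=0.6))
print(is_balanced(unfair, protected, low=0.4, high=0.6))
```

A fair clustering algorithm would search for a good clustering among those that pass such a check; the check alone says nothing about clustering quality.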

The importance of the above problems has grown as the machine learning revolution has evolved in the last few years.

The first set of results we would like to highlight here deals with LDA document models, that is, models that make it possible to reconstruct, in an unsupervised manner, the topics that make up a document corpus. In "A Reduction for Efficient LDA Topic Reconstruction", we show that one can improve (in running time, and in sample complexity) over existing algebraic algorithms for document topic modeling, through some simple combinatorial algorithms. The cornerstone of our result is an algebraic rule to transform mixed-topic document distributions into single-topic distributions; with these simpler distributions, learning the topics becomes a simpler task, which can be carried out more efficiently.
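
To illustrate the difference between the two kinds of document distributions mentioned above, the following sketch samples documents from the standard LDA generative process (each word may come from a different topic) and from a single-topic model (every word comes from the same topic). It does not implement the reduction from the paper. The toy topics, the Dirichlet parameters `alpha`, and all function names are assumptions made for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw a probability vector from a Dirichlet distribution via gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_word(topic, rng):
    """Draw one word from a topic, given as a dict mapping word -> probability."""
    words = list(topic)
    return rng.choices(words, weights=[topic[w] for w in words])[0]

def lda_document(topics, alpha, length, rng):
    """Mixed-topic document: a per-document topic mixture, then a topic per word."""
    mixture = sample_dirichlet(alpha, rng)
    doc = []
    for _ in range(length):
        k = rng.choices(range(len(topics)), weights=mixture)[0]
        doc.append(sample_word(topics[k], rng))
    return doc

def single_topic_document(topics, topic_probs, length, rng):
    """Single-topic document: one topic is drawn once and generates every word."""
    k = rng.choices(range(len(topics)), weights=topic_probs)[0]
    return [sample_word(topics[k], rng) for _ in range(length)]

# Hypothetical toy topics with disjoint vocabularies.
topics = [
    {"goal": 0.5, "team": 0.5},
    {"vote": 0.5, "law": 0.5},
]
rng = random.Random(0)
print(lda_document(topics, alpha=[0.5, 0.5], length=8, rng=rng))
print(single_topic_document(topics, topic_probs=[0.5, 0.5], length=8, rng=rng))
```

In the single-topic sample, all words come from one vocabulary, which is what makes the topics easier to read off from such documents.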

The second set of results deals with learning patterns of user behavior. We have studied various classes of the so-called Random Utility Model, which has been used for decades for representing how rational users pick a choice in a set. In this model, a user is a permutation (that is, a ranking) over all the possible choices; the system picks some set of choices and presents it to a random user, who picks the one choice in the set that she likes best (i.e. the one ranked highest in her permutation). This model is particularly significant for, e.g., advertisements on the web: given a distribution over the users and a set of choices, we observe an empirical distribution of the element chosen in the set by a random user. Our goal is to learn this distribution for a generic set of choices, by querying some small number of sets.
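
The choice rule described above can be sketched in a few lines. This is a minimal illustration of the model, not of the learning algorithms; the population of three user types, their weights, and the function names are hypothetical.

```python
from collections import Counter

def choose(ranking, offered):
    """Return the item of `offered` that appears earliest in the user's ranking."""
    offered = set(offered)
    for item in ranking:
        if item in offered:
            return item
    raise ValueError("the offered set shares no item with the ranking")

def choice_distribution(users, weights, offered):
    """Distribution of the chosen item when a user type is drawn with the given weights."""
    total = sum(weights)
    dist = Counter()
    for ranking, w in zip(users, weights):
        dist[choose(ranking, offered)] += w / total
    return dict(dist)

# Hypothetical population: three user types ranking items a, b, c, d (best first).
users = [("a", "b", "c", "d"), ("b", "c", "a", "d"), ("d", "c", "b", "a")]
weights = [0.5, 0.3, 0.2]

print(choice_distribution(users, weights, {"a", "c"}))
print(choice_distribution(users, weights, {"b", "c", "d"}))
```

The learning problem is the reverse direction: the population is hidden, and only the choice distributions of the queried sets are observed.
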
In "Discrete Choice, Permutations, and Reconstruction", we have studied this general learning problem from a theoretical perspective, obtaining algorithms, and some strong lower bounds. In "Learning a Mixture of Two Multinomial Logits", we have studied a more restricted class of Random Utility Models (that is, a mixture of Multinomial Logits), obtaining some quite simple algorithms. In fact, our results can be seen as an explanation of why neural networks work so well in identifying mixtures of multinomial logits.
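
For reference, a multinomial logit assigns each item a positive weight and picks an item from the offered set with probability proportional to its weight; a mixture of two such models uses the first with probability `p` and the second otherwise. The sketch below only evaluates these choice probabilities and does not reproduce the learning algorithm from the paper; the weights and the mixing probability are made-up values.

```python
def mnl_prob(weights, offered, item):
    """Probability that a multinomial logit with the given weights picks `item`."""
    if item not in offered:
        return 0.0
    return weights[item] / sum(weights[j] for j in offered)

def mixture_prob(w1, w2, p, offered, item):
    """Choice probability under a mixture: first model with prob. p, second otherwise."""
    return p * mnl_prob(w1, offered, item) + (1 - p) * mnl_prob(w2, offered, item)

# Hypothetical item weights for the two components.
w1 = {"a": 3.0, "b": 1.0, "c": 1.0}
w2 = {"a": 1.0, "b": 1.0, "c": 2.0}

for item in ("a", "b"):
    print(item, mixture_prob(w1, w2, p=0.5, offered={"a", "b"}, item=item))
```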

The third set of results revolves around the basic question of understanding the opinions of users in a social network: e.g. what fraction of people in this social network study computer science? Or, what is the approximate average star-rating of movie X in this social network? These questions can be answered by sampling nodes uniformly at random from the social network. For understandable privacy concerns, however, online social networks severely restrict the type of access to their users/nodes. In "On Sampling Nodes in a Network", we have shown how one can sample nodes almost uniformly at random from a social network efficiently; the algorithms that we propose are extremely simple to implement. In "On the Complexity of Sampling Vertices Uniformly from a Graph", we have shown that the aforementioned simple algorithms are, in fact, optimal.
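
As an illustration of sampling under restricted access, the sketch below uses a Metropolis-Hastings random walk, a textbook technique whose stationary distribution is uniform on a connected undirected graph. The text above does not say which algorithms the papers propose, so this should be read as an example of the general approach rather than as the project's method. The only access assumed is a hypothetical `neighbors(u)` call returning the neighbor list of a node; the toy graph and the number of steps are arbitrary.

```python
import random
from collections import Counter

def mh_walk_sample(neighbors, start, steps, rng):
    """Run a Metropolis-Hastings walk for `steps` steps and return the final node.

    A move from u to a uniformly chosen neighbor v is accepted with probability
    min(1, deg(u) / deg(v)), which corrects the bias toward high-degree nodes.
    """
    u = start
    for _ in range(steps):
        nbrs_u = neighbors(u)
        v = rng.choice(nbrs_u)
        if rng.random() < min(1.0, len(nbrs_u) / len(neighbors(v))):
            u = v
    return u

# Hypothetical toy graph (undirected, connected, with unequal degrees).
graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}

rng = random.Random(0)
samples = 2000
counts = Counter(mh_walk_sample(graph.__getitem__, 0, 30, rng) for _ in range(samples))
for node in sorted(graph):
    print(node, counts[node] / samples)
```

With near-uniform samples, a question such as "what fraction of users study computer science" reduces to averaging a 0/1 answer over the sampled nodes.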

We have also studied many other machine learning and data mining problems and topics, including: top-k list learning, graphlet counting, sketching schemes, models for online user consumption, and optimization under fairness constraints.
The improvements induced by our results are of two main types: we have made solutions more efficient (in both time and energy), and we have reduced the amount of data and work required for learning several models.
We have given these improvements for several of the aforementioned problems, e.g. LDA topic reconstruction, Random Utility Model learning, social network sampling, top-k list learning, graphlet and motif counting, and optimization under fairness constraints.