
Provably-Correct Efficient Algorithms for Clustering

Periodic Reporting for period 1 - PEAC (Provably-Correct Efficient Algorithms for Clustering)

Reporting period: 2017-03-01 to 2019-02-28

Machine learning and data analysis play an increasingly important role in the decisions made in our everyday life.
Yet, we still understand very little about some of the basic tools used to process the data and extract the information that
leads to these decisions.
For example, a keystone problem in machine learning is to cluster a dataset into groups such that data elements in the same
group share common features. This is a fundamental problem, as it makes it possible to identify data elements that might not at first appear
very similar, and so it is used to detect communities in social networks, classify genes according to their expression patterns, or
divide a digital image into distinct regions. While there is a large body of experimental work on heuristics for solving clustering problems,
much less is known from a theoretical perspective. But how can we trust machine learning or data analysis approaches if we do not understand the
behavior of some of the basic tools that are used to extract information from the data? Furthermore, how can we rely on the decisions made by
machine learning algorithms if we do not understand which data they were based on?
In this project, we have made significant progress towards analyzing and providing performance guarantees for popular clustering heuristics
by focusing on specific inputs arising in machine learning and data analysis scenarios. We have analysed very simple heuristics, such as
local search, on these types of inputs. Moreover, we have designed new algorithms that are nearly as fast as widely-used heuristics
while outputting solutions with provable properties. For example, we have shown how to speed up local search techniques while preserving
the quality of the solutions they output.
We have also shown why some of the popular heuristics are much more efficient than others.
Finally, we have made progress on the understanding of the complexity of some important clustering problems.
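As a concrete illustration of the kind of widely-used heuristic discussed here, Lloyd's algorithm for k-means alternates between assigning each point to its nearest center and moving each center to the mean of its assigned points. The sketch below is ours, kept deterministic for simplicity; it is not taken from the project's publications:

```python
import math

def lloyds(points, k, iters=20):
    """A minimal sketch of Lloyd's k-means heuristic.

    Starts from the first k points as centers; in practice a careful
    initialization such as k-means++ is preferred.
    """
    centers = list(points[:k])
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its old center).
        centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers
```

Each iteration can only decrease the k-means cost, which is what makes the heuristic fast in practice; without further assumptions on the input, however, it offers no guarantee on the quality of the clustering it converges to.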
The PEAC project has enabled the development of algorithm design for data analysis and clustering problems at the University of
Copenhagen, allowed me to acquire new expertise in data structures, and further developed my expertise in
approximation algorithms.
For example, our collaboration on data structures has led to a publication at a top conference (FOCS), bringing together my
expertise in planar graphs and the expertise of the University of Copenhagen in data structures.
Our work on clustering and data analysis has been in three main steps.
First, we have shown how to speed up the classic local search heuristic while maintaining its approximation guarantees for low-dimensional
Euclidean inputs. Such inputs are very frequent in image processing.
This was done using new techniques and by improving our understanding of the problem through new structural properties.
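The underlying single-swap local search heuristic, before any speed-up, can be sketched as follows for k-median: start from any k centers and repeatedly replace one center by a non-center point whenever the swap lowers the cost. The function names are ours and this is a minimal illustration, not the project's accelerated variant:

```python
import math
from itertools import product

def cost(points, centers):
    # k-median cost: sum over points of the distance to the nearest center.
    return sum(min(math.dist(p, c) for c in centers) for p in points)

def local_search_kmedian(points, k):
    """Minimal single-swap local search sketch for k-median."""
    centers = list(points[:k])  # any initialization works
    improved = True
    while improved:
        improved = False
        # Try swapping each current center for each non-center point;
        # accept any swap that strictly lowers the cost.
        for i, candidate in product(range(k), points):
            if candidate in centers:
                continue
            trial = centers[:i] + [candidate] + centers[i + 1:]
            if cost(points, trial) < cost(points, centers):
                centers = trial
                improved = True
    return centers
```

The cost strictly decreases with every accepted swap, so the procedure terminates at a local optimum; the naive implementation above re-evaluates the full cost for every candidate swap, which is exactly the kind of overhead the speed-up results target.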
The second step consisted in analyzing the performance guarantees of classic heuristics for hierarchical clustering problems
and providing better algorithms for these problems. We have shown that one of the heuristics outperforms the others and provided
an algorithm that outperforms this heuristic. We have also provided a new approach that outperforms all previous approaches
when asked to recover an underlying "hidden" hierarchical clustering.
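Classic hierarchical clustering heuristics of the kind analyzed here are typically linkage-based: they build the hierarchy bottom-up by repeatedly merging the two closest clusters. The sketch below uses average linkage as one representative example (an assumption on our part, since the text does not name the specific heuristics compared):

```python
import math

def average_linkage(points):
    """Bottom-up hierarchical clustering sketch using average linkage.

    Returns a binary tree as nested tuples; a leaf is the 1-tuple (i,)
    holding a point index.
    """
    clusters = [[i] for i in range(len(points))]
    tree = [(i,) for i in range(len(points))]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average
        # pairwise distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the best pair, both as index sets and as tree nodes.
        merged = clusters[a] + clusters[b]
        node = (tree[a], tree[b])
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
        tree = [t for i, t in enumerate(tree) if i not in (a, b)] + [node]
    return tree[0]
```

On well-separated inputs the top split of the returned tree separates the underlying groups, which is the kind of recovery behavior the guarantees in this step speak to.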
In the last step, we have taken a more theoretical approach and improved upon the best known approximation algorithm for a
different hierarchical clustering problem.
While the best known algorithm was known to achieve an O(log^{3/2} n) approximation, we showed that it actually achieves an O(log^{1/2} n) approximation.
In addition, we have provided (conditional) lower bounds for the complexity of classic clustering problems (such as k-median and k-means)
in low-dimensional Euclidean space. This was done in a close collaboration with young researchers at the University of Copenhagen.
All this work will be presented at a top conference (SODA).
Most state-of-the-art approaches to clustering rely on empirical studies of
their performance. By proving structural properties of their behavior and characterizing the types
of inputs for which these heuristics perform well, we have given practitioners a better
understanding of what they can expect from the output of these techniques, and so a better understanding of the
information they are dealing with. In addition, we have provided new algorithms that are competitive
with state-of-the-art heuristics in terms of running time and, moreover, recover natural hierarchical
clustering structures when they exist. A detailed explanation complemented with experimental results was published
at a top machine learning conference (NIPS), so the dissemination to practitioners is well under way.
Finally, we have also made progress towards understanding the complexity of these problems by improving upon
the best known approximation algorithm for hierarchical clustering and by providing lower bounds on the running time
required to compute exact solutions to the classic k-means problem, even in very simple scenarios.