Periodic Reporting for period 1 - PEAC (Provably-Correct Efficient Algorithms for Clustering)

Reporting period: 2017-03-01 to 2019-02-28

Summary of the context and overall objectives of the project

Machine learning and data analysis are playing an increasingly important role in the decisions made in our everyday life. Yet we still understand very little about some of the basic tools used to process and extract information from the data that lead to these decisions.

For example, a keystone problem in machine learning is to cluster a dataset into groups such that data elements in the same group have common features. This is a fundamental problem because it makes it possible to identify data elements that might not at first appear very similar, and so it is used to detect communities in social networks, classify genes according to their expression patterns, or divide a digital image into distinct regions.

While there is a large body of experimental work on heuristics for solving clustering problems, a lot less is known from a theoretical perspective. But how can we trust machine learning or data analysis approaches if we do not understand the behavior of some of the basic tools that are used to extract information from the data? Furthermore, how can we rely on the decisions made by machine learning algorithms if we do not understand which data they were based on?

In this project, we have made significant progress towards analyzing and providing performance guarantees for popular clustering heuristics by focusing on specific inputs arising in machine learning and data analysis scenarios. We have analysed very simple heuristics, such as local search, on these types of inputs. Moreover, we have designed new algorithms that are nearly as fast as widely used heuristics while outputting solutions with provable properties.
For example, we have shown how to speed up local search techniques while preserving the quality of the solution output. We have also shown why some of the popular heuristics are much more efficient than others. Finally, we have made progress on the understanding of the complexity of some important clustering problems.

Work performed from the beginning of the project to the end of the period covered by the report and main results achieved so far

The PEAC project has made it possible to develop the design of algorithms for data analysis and clustering problems at the University of Copenhagen, and has allowed me to acquire new expertise in data structures and to further develop my expertise in approximation algorithms. For example, our collaboration on data structures has led to a publication at a top conference (FOCS), bringing together my expertise on planar graphs and the expertise of the University of Copenhagen on data structures.

Our work on clustering and data analysis has proceeded in three main steps.

First, we have shown how to speed up the classic local search heuristic while maintaining its approximation guarantees for low-dimensional Euclidean inputs. These inputs are very frequent in image processing. This has been done using new techniques and by improving our understanding of the problem through new structural properties.

The second step consisted of analyzing the performance guarantees of classic heuristics for hierarchical clustering problems and providing a better algorithm for this problem. We have shown that one of the heuristics outperforms the other and provided an algorithm that outperforms this heuristic.
We have also provided a new approach that outperforms all previous approaches when asked to recover an underlying "hidden" hierarchical clustering.

In the last step, we have taken a more theoretical approach and improved upon the best known approximation guarantee for a different hierarchical clustering problem. While the best known algorithm was known to achieve an O(log^{3/2} n) approximation, we showed that it actually achieves an O(log^{1/2} n) approximation. In addition, we have provided (conditional) lower bounds for the complexity of classic clustering problems (such as k-median and k-means) in low-dimensional Euclidean space. This was done in close collaboration with young researchers at the University of Copenhagen. All this work will be presented at a top conference (SODA).

Progress beyond the state of the art and expected potential impact (including the socio-economic impact and the wider societal implications of the project so far)

Most of the state-of-the-art approaches for clustering have relied on empirical studies of their performance. By proving structural properties of their behavior and characterizing the types of inputs for which these heuristics perform well, we have provided practitioners with a better understanding of what they can expect from the output of these techniques, and so a better understanding of the information they are dealing with. In addition, we have provided new algorithms that are competitive with state-of-the-art heuristics in terms of running time and that also make it possible to recover natural hierarchical clustering structures when they exist.
A detailed explanation complemented with experimental results was published at a top machine learning conference (NIPS), and so the dissemination to practitioners appears adequate. Finally, we have also made progress towards the understanding of the complexity of these problems by improving upon the best known approximation guarantee for hierarchical clustering and providing lower bounds on the running time required to compute exact solutions for the classic k-means problem in very simple scenarios.
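
For readers unfamiliar with the local search heuristic mentioned in this summary, the following is a minimal sketch of the textbook single-swap variant for the k-median objective. It is an illustration under stated assumptions, not the sped-up algorithm developed in the project: centers are restricted to input points, distances are Euclidean, and the names `cost`, `local_search_k_median`, `eps` and `seed` are hypothetical choices made for this example.

```python
import math
import random


def cost(points, centers):
    """k-median cost: sum over all points of the distance to the nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)


def local_search_k_median(points, k, eps=1e-9, seed=0):
    """Textbook single-swap local search (illustrative, not the project's algorithm).

    Assumption: candidate centers are the input points themselves.
    Starting from k random centers, repeatedly replace one center by one
    non-center point whenever this lowers the cost by more than eps.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    best = cost(points, centers)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in centers:
                    continue
                # Try swapping center i for the candidate point.
                trial = centers[:i] + [candidate] + centers[i + 1:]
                trial_cost = cost(points, trial)
                if trial_cost < best - eps:
                    centers, best = trial, trial_cost
                    improved = True
                    break
            if improved:
                break
    return centers, best


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(local_search_k_median(data, k=2))
```

Each pass of this naive version evaluates up to k times n candidate swaps, and each evaluation recomputes the full cost, which is the kind of overhead that motivates speeding up local search while keeping its guarantees.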