Exact Mining from In-Exact Data

Final Report Summary - MININEXACT (Exact Mining from In-Exact Data)

User, academic, and industrial data are increasingly stored in the 'cloud'. Along the way the data may undergo a number of transformations: anonymization (e.g. medical data), right-protection (e.g. music), or compression (e.g. JPEG/MPEG). The stored data are therefore only modified instances of the original data. While this is a non-issue for multimedia or user data, for sensitive scientific or medical data it is desirable to quantify the amount of distortion or, better still, to provide transformations that do not change the data utility. For example, after anonymizing medical data, researchers would like to build the same classification rules as on the original data, rules that indicate with equal accuracy whether a person has a particular disease. In this project, we study how to design transformations that modify data (anonymize, right-protect, and compress) while leaving the resulting data as useful as the original.

We have studied the following problems:
1) Fast data compression that distorts the outcome of machine learning and database operations as little as possible.
2) Joint right-protection and anonymization of data, with provable guarantees on the data utility.
3) Leveraging ensembles of random projections to provide fast learning algorithms with probabilistic guarantees on the mining capacity of the distorted data.
4) Fast recommendation algorithms that balance interpretability and accuracy of the prediction models on noisy data.
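
As a rough illustration of the idea behind item 3, the sketch below shows how an ensemble of Gaussian random projections can approximately preserve the Euclidean distance between two points while working in a much lower dimension. It is a generic Johnson-Lindenstrauss-style example, not the project's own algorithm: the function names, the dimensions (`d`, `k`, `n_proj`), and the choice to average squared distances over the ensemble are all assumptions made for illustration.

```python
import math
import random


def random_projection_matrix(d, k, rng):
    """Return a k x d Gaussian matrix scaled by 1/sqrt(k) (hypothetical helper)."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, x):
    """Map a d-dimensional vector to k dimensions."""
    return [sum(r_i * x_i for r_i, x_i in zip(row, x)) for row in matrix]


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def ensemble_distance(x, y, matrices):
    """Average the squared projected distance over the ensemble, then take the root."""
    squared = [distance(project(m, x), project(m, y)) ** 2 for m in matrices]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative sizes only: 200-dimensional data, 10 projections of 20 dimensions each.
rng = random.Random(0)
d, k, n_proj = 200, 20, 10
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
y = [rng.gauss(0.0, 1.0) for _ in range(d)]
matrices = [random_projection_matrix(d, k, rng) for _ in range(n_proj)]

true_dist = distance(x, y)
approx_dist = ensemble_distance(x, y, matrices)
relative_error = abs(approx_dist - true_dist) / true_dist
print(f"true={true_dist:.3f} approx={approx_dist:.3f} rel_err={relative_error:.3f}")
```

Because each projected squared distance is an unbiased estimate of the true squared distance, averaging over more projections tightens the estimate. That is the sense in which a guarantee on the distorted data is probabilistic rather than exact.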