Periodic Reporting for period 2 - RobustStats (Robust statistical methodology and theory for large-scale data)
Reporting period: 2023-04-01 to 2024-09-30
The RobustStats proposal introduces new statistical methodology and theory for a range of important contemporary Big Data challenges. In transfer learning, we wish to make inference about a target data population, but some (typically, most) of our training data come from a related but distinct source distribution. The central goal is to find appropriate ways to exploit the relationship between the source and target distributions.
Missing and corrupted data play an ever more prominent role in large-scale data sets because the proportion of cases with no missing attributes is typically small. We are addressing key challenges of testing the form of the missingness mechanism, and handling heterogeneous missingness and corruptions in classification labels.
The robustness of a statistical procedure is intimately linked to model misspecification. We will advocate for two approaches to studying model misspecification, one via the idea of regarding an estimator as a projection onto a model, and the other via oracle inequalities.
Finally, we will introduce new methods for robust inference with large-scale data based on the idea of data perturbation. Such approaches are attractive ways of exploring a space of distributions in a model-free way, and we will show that aggregation of the results of carefully-selected perturbations can be highly effective.
1) In transfer learning, we seek robustness to the fact that much of our data does not come from the 'ideal' target distribution. Such problems arise in many natural, practical settings: for instance, we may wish to understand the effectiveness of a treatment on a particular subgroup of the population, while still exploiting information about its efficacy on the wider population under study. In the context of binary classification, we have developed a new transfer learning method that is very flexible in the types of relationship it permits between the source and target populations. We establish that this algorithm achieves the optimal rate of convergence while adapting to the unknown parameters governing the transfer relationship, thereby providing a further layer of robustness for this important contemporary data challenge.
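The idea of borrowing strength from plentiful source data while trusting scarce target data more can be caricatured by a distance-weighted nearest-neighbour vote. This is only an illustrative sketch, not the adaptive procedure of the paper below; the down-weight `w_source` applied to source points is a hypothetical tuning parameter standing in for the transfer relationship.

```python
def weighted_knn_predict(x, source, target, k=3, w_source=0.5):
    """Classify x by a k-nearest-neighbour vote over the pooled sample,
    down-weighting source points relative to target points.
    source/target: lists of (feature, label) pairs; features are scalars here."""
    pooled = [(xi, yi, w_source) for xi, yi in source] + \
             [(xi, yi, 1.0) for xi, yi in target]
    pooled.sort(key=lambda t: abs(x - t[0]))  # nearest first
    votes = {}
    for _, yi, w in pooled[:k]:
        votes[yi] = votes.get(yi, 0.0) + w
    return max(votes, key=votes.get)

source = [(-2, 0), (-1, 0), (1, 1), (2, 1)]   # plentiful source data
target = [(-1.5, 0), (1.5, 1)]                # scarce target data
```

For instance, `weighted_knn_predict(1.2, source, target)` returns 1: the three nearest neighbours all carry label 1, with the target point's vote counting double that of a source point.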
Main publication:
Reeve, H. W. J., Cannings, T. I. and Samworth, R. J. (2021) Adaptive transfer learning. Ann. Statist., 49, 3618-3649.
2) In missing data, perhaps the most important issue is to establish the relationship between the data and the missingness mechanism. Independence of these random aspects, known as the Missing Completely At Random (MCAR) hypothesis, affords great simplification in subsequent analysis. Although tests do exist for specialised parametric (e.g. Gaussian) models, we have studied the question of testing MCAR in a fundamental, nonparametric way. We elucidate precisely which alternatives can be detected (i.e. for which the test has power greater than its nominal size). Moreover, we introduce a new measure of departure from the MCAR hypothesis that quantifies a test's power, and prove optimality guarantees for our proposed procedure. Other projects have concerned the development of statistical procedures for high-dimensional principal component analysis and high-dimensional changepoint estimation that can be applied when some of our data are missing.
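To convey the flavour of what MCAR rules out: under MCAR, the distribution of an observed variable should not depend on whether another variable is missing. A minimal classical diagnostic in this spirit (not the nonparametric test of the paper below) compares the two groups with a Welch-type two-sample statistic:

```python
from statistics import mean, variance

def mcar_t_statistic(x, y_missing):
    """Welch-type t statistic comparing x-values for cases where y is
    missing versus observed. Under MCAR the two groups share a common
    distribution, so a large |t| is evidence against MCAR."""
    a = [xi for xi, m in zip(x, y_missing) if not m]  # y observed
    b = [xi for xi, m in zip(x, y_missing) if m]      # y missing
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(b) - mean(a)) / se

# Missingness of y depends strongly on x: clear evidence against MCAR.
x = [1, 2, 3, 4, 10, 11, 12, 13]
miss = [False, False, False, False, True, True, True, True]
```

Here `mcar_t_statistic(x, miss)` is large (about 9.9), whereas replacing the x-values in the missing group with values resembling the observed group would drive the statistic towards zero.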
Main publications:
Berrett, T. B. and Samworth, R. J. (2023) Optimal nonparametric testing of Missing Completely At Random, and its connections to compatibility. Ann. Statist., 51, 2170-2193.
Zhu, Z., Wang, T. and Samworth, R. J. (2022) High-dimensional principal component analysis with heterogeneous missingness. J. Roy. Statist. Soc., Ser. B, 84, 2000-2031.
Follain, B., Wang, T. and Samworth, R. J. (2022) High-dimensional changepoint estimation with heterogeneous missingness. J. Roy. Statist. Soc., Ser. B, 84, 1023-1055.
3) Model misspecification: Robustness and model misspecification are intimately linked. While great historical focus has been placed on consistency and rates of convergence under correct model specification, there is increasing recognition of the need to understand the performance of statistical procedures under as broad a class of data generating mechanisms as possible. In the context of a shape-constrained estimation problem (specifically the estimation of an S-shaped function), we establish projection and oracle inequality theories for the S-shaped least squares estimator; this requires a delicate analysis because the class of S-shaped functions is not convex (since the inflection point is unknown). In another project, I have considered the problem of robustness to departures from Gaussianity in linear regression, and have developed a method that is typically significantly more efficient than ordinary least squares.
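The S-shaped least squares estimator itself must contend with an unknown inflection point, but the flavour of shape-constrained least squares projection can be seen in its simplest relative, isotonic regression, which is computed exactly by the standard pool-adjacent-violators algorithm (a textbook algorithm, not code from the paper below):

```python
def isotonic_ls(y):
    """Least squares projection of the sequence y onto the cone of
    non-decreasing sequences, via pool-adjacent-violators.
    Each block is stored as [sum, count]; adjacent blocks whose means
    violate monotonicity are merged."""
    blocks = []
    for v in y:
        blocks.append([v, 1])
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit
```

For example, `isotonic_ls([3, 1, 2])` pools all three observations into a single block with mean 2, while `isotonic_ls([1, 3, 2])` pools only the last two.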
Main publications:
Feng, O. Y., Chen, Y., Han, Q., Carroll, R. J. and Samworth, R. J. (2022) Nonparametric, tuning-free estimation of S-shaped functions. J. Roy. Statist. Soc., Ser. B, 84, 1324-1352.
Feng, O. Y., Kao, Y.-C., Xu, M. and Samworth, R. J. (2024) Optimal convex M-estimation via score matching. https://arxiv.org/abs/2403.16688
4) Data perturbation is a very effective way of exploring a space of distributions, and can yield reliable conclusions under very weak conditions. We have developed a new method for high-dimensional semi-supervised learning based on the careful aggregation of results obtained from studying random projections of the data.
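A heavily simplified caricature of the axis-aligned projection-and-aggregation idea is sketched below. For determinism it enumerates all coordinate-pair projections rather than sampling them at random, and it scores each projection using only the class-mean separation of the labelled points; the actual Sharp-SSL procedure in the paper below is considerably more sophisticated and also exploits the unlabelled data.

```python
from itertools import combinations

def top_coordinates(labelled, n_keep=2):
    """Score every axis-aligned 2-D projection by the class-mean
    separation of the labelled points within it, aggregate the scores
    coordinate-wise, and keep the highest-scoring coordinates."""
    d = len(labelled[0][0])

    def cmean(cls, j):  # mean of coordinate j within class cls
        vals = [x[j] for x, y in labelled if y == cls]
        return sum(vals) / len(vals)

    diff = [cmean(1, j) - cmean(0, j) for j in range(d)]
    scores = [0.0] * d
    for i, j in combinations(range(d), 2):
        sep = (diff[i] ** 2 + diff[j] ** 2) ** 0.5  # separation in projection (i, j)
        scores[i] += sep
        scores[j] += sep
    return set(sorted(range(d), key=lambda j: -scores[j])[:n_keep])

# two labelled points per class in 6 dimensions; signal in coordinates 0 and 1
labelled = [([0, 0, 0, 0, 0, 0], 0), ([0, 0, 0, 0, 0, 0], 0),
            ([4, 4, 0, 0, 0, 0], 1), ([4, 4, 0, 0, 0, 0], 1)]
```

On this toy example, aggregation across projections correctly identifies coordinates 0 and 1 as the signal coordinates, since every projection containing a signal coordinate contributes a positive separation score.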
Main publication:
Wang, T., Dobriban, E., Gataric, M. and Samworth, R. J. (2024+) Sharp-SSL: Selective high-dimensional axis-aligned random projections for semi-supervised learning. J. Amer. Statist. Assoc., to appear.
Some of my other projects cut across several themes from the grant. These include a recent monograph on Approximate Message Passing:
Feng, O. Y., Venkataramanan, R., Rush, C. and Samworth, R. J. (2022) A unifying tutorial on Approximate Message Passing. Foundations and Trends in Machine Learning, 15, 335-536.
I am also currently working on a major book project entitled 'Modern Statistical Methods and Theory'. This book is co-authored with Rajen Shah and will be published by Cambridge University Press. Finally, I mention two projects on a recent research interest of mine: subgroup selection. Here, in the context of a clinical trial, we may observe heterogeneity in the performance of a treatment across a population, and wish to know whether it is safe to authorise the treatment for a particular subgroup, even though that subgroup was chosen after seeing the data:
Müller, M. M., Reeve, H. W. J., Cannings, T. I. and Samworth, R. J. (2023) Isotonic subgroup selection. https://arxiv.org/abs/2305.04852
Reeve, H. W. J., Cannings, T. I. and Samworth, R. J. (2023) Optimal subgroup selection. Ann. Statist., 51, 2342-2365.
I regard the book project ('Modern Statistical Methods and Theory') as the culmination of my thoughts and research over several years. This book is nearing completion, and I anticipate that it will have a significant impact on the field, both in terms of research and the way in which the subject is taught. Of course, I continue to work on the other topics listed in the proposal.