
Statistical Mechanics of Learning

Periodic Reporting for period 4 - SMILE (Statistical Mechanics of Learning)

Reporting period: 2022-03-01 to 2023-02-28

Computers are now able to process language so efficiently that they can answer relatively complicated questions or generate poems. This progress is primarily due to the development of artificial “deep neural networks”. Nowadays, “deep learning” is revolutionizing our lives, prompting an economic battle between technological giants and profoundly impacting society. As attractive and powerful as it is, however, many agree that deep learning remains largely an empirical field that lacks a theoretical understanding of its capabilities and limitations. The algorithms used to "train" these neural networks explore a high-dimensional and non-convex energy landscape that eludes most of the present methodology in learning theory. The hope is that, with a better theoretical understanding, we could build safer, more reliable and better-performing systems.

In this project, we use advanced methods of statistical mechanics to develop a theoretical understanding of deep neural networks and their behaviour. We develop simplified models where learning performance can be analyzed and predicted mathematically. The overall goal is to make these models as realistic as possible and capture an extensive range of the behaviour observed empirically in deep learning. Analyzing how the performance depends on various tunable parameters brings a theoretical understanding of the principles behind the empirical success of deep neural networks. The synergy between the theoretical statistical physics approach and scientific questions from machine learning enables a leap forward in our understanding of learning from data.
The results of this project were published in 32 journal papers (including in PNAS, Phys. Rev. X, Reviews of Modern Physics, and Advances in Mathematics) and 23 papers in major machine learning conference proceedings (such as NeurIPS and ICML). Our work treats various aspects of the theory of learning with neural networks, with approaches that most often stem from the statistical physics of disordered systems. We computed the performance of neural networks as a function of the number of training samples for a range of simple models of data generation. We established regions where optimal performance is achievable with efficient algorithms and regions where it is algorithmically hard. We investigated for what range of parameters neural networks perform strictly better than kernel methods and clarified the role of overparametrization in learning. We proposed message-passing algorithms for training neural networks and analyzed gradient descent algorithms in challenging high-dimensional and non-convex settings.
We advanced the state of the art in a variety of directions. Among the most prominent is the analysis of neural networks with one small hidden layer, where the data are generated by a teacher network. We obtained a detailed understanding of the statistically achievable generalization error. We developed algorithms that are conjectured to be optimal among all polynomial-time algorithms and characterized when they reach the optimal performance. We also analyzed the performance of stochastic gradient descent in these networks, with a particular focus on the overparametrized regime.
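
To make the teacher-student setting concrete, the following minimal sketch (illustrative only, not the project's models or code) generates labels with a fixed "teacher" network with one small hidden layer and trains a "student" of the same architecture by plain stochastic gradient descent, measuring the generalization error on fresh data as the number of training samples grows. All sizes, the tanh activation, the learning rate and the number of epochs are arbitrary choices for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    d, k = 50, 3                         # input dimension and hidden width (illustrative)

    # Fixed "teacher" network that generates the labels.
    W_t = rng.standard_normal((k, d))
    v_t = rng.standard_normal(k)

    def net(X, W, v):
        # Two-layer network: f(x) = v . tanh(W x / sqrt(d))
        return np.tanh(X @ W.T / np.sqrt(d)) @ v

    def gen_error(W, v, n_test=20000):
        # Mean-squared generalization error estimated on fresh Gaussian inputs.
        X = rng.standard_normal((n_test, d))
        return np.mean((net(X, W, v) - net(X, W_t, v_t)) ** 2)

    def train_student(n_samples, lr=0.05, epochs=100):
        X = rng.standard_normal((n_samples, d))
        y = net(X, W_t, v_t)
        W = rng.standard_normal((k, d))  # randomly initialized student
        v = rng.standard_normal(k)
        for _ in range(epochs):
            for i in rng.permutation(n_samples):   # one-sample SGD on the squared loss
                h = np.tanh(W @ X[i] / np.sqrt(d))
                err = v @ h - y[i]
                grad_v = err * h
                grad_W = err * np.outer(v * (1 - h ** 2), X[i]) / np.sqrt(d)
                v -= lr * grad_v
                W -= lr * grad_W
        return gen_error(W, v)

    for n in [100, 300, 1000]:
        print(f"n = {n:4d}   generalization error ~ {train_student(n):.4f}")

In the project's theoretical analyses, curves of this kind are obtained analytically in the high-dimensional limit rather than simulated.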

We also significantly advanced the rigorous establishment of the methods stemming from the physics of disordered systems. In particular, we proved that the replica-method results for the optimal generalization error in the single-layer perceptron are exact and compared them thoroughly, for a range of models, to the best known algorithmic performance.
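
One standard ingredient of such computations, quoted here only for illustration: for a teacher perceptron producing labels y = sign(w* · x) on i.i.d. Gaussian inputs, the generalization error of a student with weight vector \hat{w} depends only on the angle between the two vectors,

    \epsilon_g \;=\; \frac{1}{\pi}\,\arccos\!\left(\frac{\hat{w}\cdot w^{*}}{\lVert \hat{w}\rVert\,\lVert w^{*}\rVert}\right),

so that the replica computation of the optimal error reduces to characterizing the typical value of this overlap as a function of the number of samples per input dimension.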

We also advanced significantly the theoretical understanding of the performance of gradient descent algorithms in high-dimensional non-convex landscapes. We found a way to analyze their performance via dynamical mean-field theory and computed the exact signal-to-noise ratio needed for good performance. Interestingly, we thus unveiled a region of parameters where spurious local minima exist and can trap the dynamics, yet randomly initialized dynamics avoids them.
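
The role of the signal-to-noise ratio can be seen on a toy high-dimensional non-convex problem (a simplified illustration, not the models analyzed in the project): randomly initialized projected gradient ascent on a rank-one spiked random matrix recovers the planted direction only when the spike strength exceeds a threshold. The dimension, step size and number of iterations below are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(1)
    N = 400                                       # dimension (illustrative)

    def overlap_after_gradient_ascent(snr, steps=2000, lr=0.05):
        # Rank-one spiked symmetric matrix: Y = (snr/N) x* x*^T + Wigner noise.
        x_star = rng.choice([-1.0, 1.0], size=N)
        G = rng.standard_normal((N, N))
        Y = (snr / N) * np.outer(x_star, x_star) + (G + G.T) / np.sqrt(2 * N)
        # Projected gradient ascent on x^T Y x over the sphere |x|^2 = N,
        # started from a random (uninformative) initialization.
        x = rng.standard_normal(N)
        x *= np.sqrt(N) / np.linalg.norm(x)
        for _ in range(steps):
            x = x + lr * Y @ x
            x *= np.sqrt(N) / np.linalg.norm(x)
        return abs(x @ x_star) / N                # overlap with the planted signal

    for snr in [0.5, 1.0, 1.5, 2.0, 3.0]:
        print(f"snr = {snr:3.1f}   overlap ~ {overlap_after_gradient_ascent(snr):.2f}")

Dynamical mean-field theory gives closed equations for the evolution of such dynamics directly in the limit of large dimension.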

Looking at the interplay between the network architecture, the training algorithm and the structure of the data, we identified cases where overparametrization reduces the number of samples that training algorithms need to achieve good performance. We showed how to generalize the analysis to take the structure of the data into account, to the extent that, for simple neural networks, we can theoretically characterize the learning curves for realistic datasets.
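
An empirical learning curve of the kind these analyses aim to characterize can be measured directly. The sketch below (an illustration, not the project's analysis) fits a random-features ridge regression, i.e. a simple one-hidden-layer network with frozen random first-layer weights, on the scikit-learn digits dataset and records the test error as the number of training samples grows; the feature dimension, regularization strength and tanh feature map are arbitrary choices.

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Small real dataset; binary task: is the digit even or odd?
    X, y = load_digits(return_X_y=True)
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    y = np.where(y % 2 == 0, 1.0, -1.0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=500, random_state=0)

    # Random-features map: x -> tanh(F x / sqrt(d)), with frozen random weights F.
    d, p = X.shape[1], 300
    F = rng.standard_normal((p, d))

    def features(Z):
        return np.tanh(Z @ F.T / np.sqrt(d))

    for n in [50, 100, 200, 400, 800, len(y_tr)]:
        model = Ridge(alpha=1.0).fit(features(X_tr[:n]), y_tr[:n])
        test_error = np.mean(np.sign(model.predict(features(X_te))) != y_te)
        print(f"n = {n:4d}   test error ~ {test_error:.3f}")
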
(Figure: physrevx-10-011057.png)