Periodic Reporting for period 4 - DLT (Deep Learning Theory: Geometric Analysis of Capacity, Optimization, and Generalization for Improving Learning in Deep Neural Networks)
Reporting period: 2023-01-01 to 2023-12-31
(fig1.png) A mathematical theory of deep learning aims to quantify the relationships between three key elements in learning with neural networks: a) the representational power and the approximation errors of artificial neural networks as parametric sets of hypotheses, b) the properties and consequences of the training methods or optimization procedures that are used to select a hypothesis based on training data, and c) the performance of the trained neural networks at test time on new data, i.e. their generalization performance.
(fig2.png) An artificial neural network is a composition of simple parametric functions (neurons), which together can represent complex relationships. The top row illustrates how the input values (here pixel locations x, colored with a picture C(x) of Max Planck) are mapped by one layer φ1 or two layers φ2 ∘ φ1 of neurons into output values (new pixel locations). The lower row illustrates how the input space is broken into regions on which the function is linear. Such geometric-combinatorial decompositions can be used to investigate important properties of the networks (e.g. possible advantages of different architectures) and of the trained functions (e.g. decision boundaries or smoothness).
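To make this decomposition concrete, the following minimal sketch (a toy illustration in Python/NumPy, not code from the project; all sizes and seeds are illustrative assumptions) enumerates the activation regions that a small, randomly initialized ReLU network induces on a two-dimensional input square. Within each region, where a fixed subset of neurons is active, the network computes an affine function.

import numpy as np

# Minimal sketch (not the project's code): count the activation regions that a
# small, randomly initialized ReLU network induces on a grid over a 2D input
# square. Within each region (fixed on/off pattern of the neurons) the network
# computes an affine function.

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((8, 2)), rng.standard_normal(8)   # layer 1: 2 -> 8
W2, b2 = rng.standard_normal((8, 8)), rng.standard_normal(8)   # layer 2: 8 -> 8

# Sample the input square [-1, 1]^2 on a fine grid.
xs = np.linspace(-1.0, 1.0, 400)
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)    # shape (N, 2)

h1 = grid @ W1.T + b1
a1 = np.maximum(h1, 0.0)                                        # ReLU activations
h2 = a1 @ W2.T + b2

# The activation pattern (which neurons are active) labels the linear region.
pattern = np.concatenate([h1 > 0, h2 > 0], axis=1)
regions = np.unique(pattern, axis=0)
print(f"distinct activation patterns on the grid: {len(regions)}")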
We published 50+ articles, including 20 at the machine learning conferences ICML, ICLR, and NeurIPS; 12 at conferences such as ISIT, Allerton, MSML, and GSI; and 20 in journals such as SIAM SIAGA, JMLR, Information Geometry, and FoCM, or in books such as Mathematical Aspects of Deep Learning. We presented this research in 100+ invited talks at workshops, conferences, and seminars, including 7 keynotes and plenary lectures, in addition to 50+ presentations at workshops, conferences, and general public outreach events. These results have served as the basis for several subsequent research endeavours by us and others in theoretical deep learning, particularly those highlighting geometric and combinatorial aspects of learning with neural networks.
Within this project we created multiple platforms for research, training, and dissemination, in particular the Deep Learning Theory meeting and the Mathematical Machine Learning Seminar, which has hosted 150+ talks in the reporting period. Among the organized events we highlight the Deep Learning Theory kickoff workshop in early 2019 and the co-organized Mathematics of Machine Learning Conference at ZiF in 2021, in addition to further conference sessions and collaboration programs. The project has had a significant synergistic footprint, particularly through the close interface we maintained with the Math Machine Learning group at UCLA, the co-creation of the Math of Data Initiative at MPI MiS, and the interface with other machine learning research stakeholders such as the DFG priority programme 2298 on Theoretical Foundations of Deep Learning and the School of Embedded Composite Artificial Intelligence in Leipzig and Dresden.
Optimization theory for neural networks. Training neural networks involves non-convex optimization problems and practical methodologies for which a theoretical footing has been elusive. In this project we obtained a series of results illuminating the interplay between training data, network architectures, parameter optimization, and capacity control in neural networks. These provide a theoretical explanation for the success of some of these methodologies and capture nuanced characteristics of the optimization dynamics in training neural networks.
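As a toy illustration of the non-convexity of the training problem (an assumed setup written in PyTorch for brevity, not the project's experiments; the target function, network size, and seeds are arbitrary choices), the following sketch trains the same small ReLU network with plain gradient descent from different random initializations; the runs generally end up at different parameter configurations and possibly different final losses.

import torch

# Minimal sketch (assumed setup, not the project's experiments): train the same
# small ReLU network several times with plain gradient descent, starting from
# different random initializations. Because the loss is non-convex, the runs
# generally reach different parameter configurations and final losses.

torch.manual_seed(0)
x = torch.linspace(-1.0, 1.0, 32).unsqueeze(1)
y = torch.abs(x)                       # a simple piecewise linear target

def train(seed: int, steps: int = 2000, lr: float = 0.05) -> float:
    torch.manual_seed(seed)            # the only difference between runs
    model = torch.nn.Sequential(
        torch.nn.Linear(1, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1)
    )
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

for seed in (1, 2, 3):
    print(f"init seed {seed}: final training loss {train(seed):.5f}")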
Regularization in neural networks. One of the puzzles in deep learning is why overparametrized networks can overfit the training data and yet perform well at test time. A possible explanation is that the training procedures are biased towards solutions with good properties. In this project we obtained results describing the biases of gradient descent training of neural networks depending on various key factors, including the training time and the parameter initialization. For a wide variety of network architectures we further obtained quantitative descriptions of spectral biases, that is, how a learning algorithm implicitly decomposes a learning problem into several components that are learned at different rates.
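The mechanism behind such spectral biases can be illustrated in the simplest possible setting, gradient descent on a least-squares problem (a toy model in NumPy; the numbers are chosen for illustration and are not taken from the project's results): the error along each spectral component of the problem decays at a rate governed by the corresponding singular value, so some components are learned much faster than others.

import numpy as np

# Minimal sketch of the mechanism behind a spectral bias (an assumed toy model,
# not the project's analysis): for gradient descent on the least-squares loss
# 0.5 * ||A w - y||^2, the residual along the i-th left singular vector of A
# decays like (1 - lr * s_i^2)^t, so components with larger singular values
# (for neural networks, typically the low-frequency ones) are learned faster.

rng = np.random.default_rng(0)
n = 5
s = np.array([2.0, 1.0, 0.5, 0.25, 0.1])           # prescribed singular values
U, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthonormal bases
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(s) @ V.T

y = U @ np.ones(n)          # target with equal weight on every spectral component
w = np.zeros(n)
lr = 0.2

for t in range(1, 201):
    w -= lr * A.T @ (A @ w - y)                    # plain gradient descent
    if t in (10, 50, 200):
        residual = np.abs(U.T @ (A @ w - y))       # error per spectral component
        print(f"step {t:3d}: residual per component {np.round(residual, 3)}")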