CORDIS - EU research results

Dynamics-Aware Theory of Deep Learning

Periodic Reporting for period 1 - DYNASTY (Dynamics-Aware Theory of Deep Learning)

Reporting period: 2022-10-01 to 2025-03-31

Recent advances in deep learning (DL) technologies have had a major impact on industry and society, and have transformed many branches of science and engineering. However, the mathematical framework of these methods largely deviates from the classical settings of optimization and statistical learning theory, and, more importantly, their behavior in practical applications does not conform to much of the received wisdom of existing machine learning theory. Hence, the vast majority of current DL techniques stand as poorly understood black-box algorithms, without formal theoretical guarantees.

This lack of theoretical understanding has two crucial drawbacks: (i) In the absence of theoretical guidelines, designing neural networks for a given problem is essentially a time- and energy-consuming process of ‘trial and error’. A sound and relevant theoretical understanding of DL techniques, one that can identify both the strengths and the weaknesses of the methods, is therefore crucial for designing and developing improved algorithms. (ii) Given that DL-based algorithms are increasingly deployed in daily-life applications (e.g. speech recognition, face recognition, autonomous driving, medicine), relying on the predictions of a technology whose behavior is not well understood can be dangerous in many respects, engendering several risks, e.g. in terms of privacy.

A sound and unified theory for deep learning would liberate the design process of DL-based algorithms from being an ad-hoc trial and error approach and enable principled development of better-understood and reliable algorithms. Hence, our main goal is to develop a mathematically sound and practically relevant theory for DL, which will ultimately provide usable theoretical guidelines and principled design routines for DL practitioners.
To achieve our goal, our main approach is to view the problem from a "dynamical systems" perspective. This perspective is natural, as the optimization algorithms used in deep learning systems are predominantly iterative and hence constitute dynamical systems that evolve over time. From this perspective, we have so far analyzed learning algorithms from two points of view:

1) Heavy-tail properties: Heavy-tailed distributions are likely to produce observations that are very large in magnitude and far from the mean; hence, they are often used to model phenomena that exhibit outliers. Despite their ‘daunting’ connotation, heavy tails are ubiquitous in virtually every domain: in the context of machine learning, recent studies (some of which were co-authored by the PI) have shown that heavy tails naturally emerge in various ways and, contrary to their perceived image, can in fact be beneficial for performance.
In this first direction, we rigorously analyzed the emergence and the impact of heavy tails in stochastic optimization. Our results have revealed several surprising phenomena:

• We investigated the mechanism by which heavy tails emerge in stochastic optimization. We showed that certain popular heuristics in “online” stochastic optimization (e.g. cyclic step-sizes) have a direct impact on the tails of the parameters produced by the optimization algorithm. We then focused on the more realistic “offline” case, where the optimizer only has access to finitely many training points. We showed that in this case the iterates have “approximate” heavy tails, in the sense that they behave increasingly heavy-tailed as the number of data points grows. These two results shed more light on why and how heavy tails emerge in stochastic optimization.

• We proved for the first time that the relation between heavy tails and generalization error is not monotonic: depending on several factors (which we explicitly identified), heavy tails can either harm or improve generalization. This has clarified the general picture and brought new insights into algorithmic design.
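Both phenomena above can be illustrated with a toy simulation (a hedged sketch, not the setting of the project's papers; the step sizes, the curvature distribution, and the estimator settings are all illustrative choices). One-dimensional SGD on a random quadratic is a Kesten-type recursion, whose stationary distribution becomes heavy-tailed once the multiplicative factor can occasionally exceed 1 in magnitude; a standard Hill estimator applied to the iterates then reports a heavier tail (a smaller tail index) for the larger step size:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_quadratic(eta, n_steps=200_000):
    """One-dimensional SGD on a random quadratic. The update
    x <- (1 - eta*h)*x + eta*b is a Kesten-type recursion whose stationary
    distribution becomes heavy-tailed once |1 - eta*h| can exceed 1."""
    x = 0.0
    h = rng.exponential(1.0, size=n_steps)   # random minibatch curvature
    b = rng.normal(0.0, 1.0, size=n_steps)   # additive gradient noise
    xs = np.empty(n_steps)
    for k in range(n_steps):
        x = (1.0 - eta * h[k]) * x + eta * b[k]
        xs[k] = x
    return xs[n_steps // 2:]                 # discard burn-in

def hill_estimator(samples, k=2000):
    """Hill estimator of the tail index from the k largest magnitudes.
    Smaller estimates indicate heavier tails."""
    a = np.sort(np.abs(samples))[::-1]       # descending order statistics
    logs = np.log(a[:k + 1])
    return 1.0 / np.mean(logs[:k] - logs[k])

tame = sgd_quadratic(eta=0.1)   # contractive regime: light tails
wild = sgd_quadratic(eta=1.4)   # occasional expansions: heavy tails
print(hill_estimator(tame), hill_estimator(wild))
```

With the small step size the factor 1 - eta*h is almost always contractive and the iterates stay near-Gaussian; with the large step size the estimated tail index drops sharply, mirroring the step-size effect described above.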

2) In the second direction, we focused on the “geometry” of machine learning algorithms. We extended the state-of-the-art in terms of understanding how the topological properties of stochastic optimization algorithms affect the generalization error. In particular, we have proved novel error bounds for both continuous-time and discrete-time optimization algorithms. During this process, we discovered novel topological constructs for optimizers for the first time, and rigorously linked them to the generalization error.

These developments have led to 16 publications so far, each advancing the state of the art and all of them published at highly competitive, top-tier venues such as NeurIPS, ICML, COLT, ALT, and TMLR. Among these we would like to highlight the following outcomes.

1. “Novel tools for computing the topological properties of stochastic optimizers as performance metrics.”

We introduced reliable topological complexity measures that provably bound the generalization error (the main quantity of interest). These measures are computationally friendly and enabled us to propose simple yet effective algorithms for computing generalization indices. Moreover, we showed that our framework extends to different application domains, tasks, and architectures. Our experimental results demonstrated that the new complexity measures correlate highly with the generalization error on industry-standard architectures such as transformers and deep graph networks, and our approach consistently outperformed existing topological bounds across a wide range of datasets, models, and optimizers. This is a significant improvement over the state-of-the-art theory, which (i) mostly applies to simplified network architectures on rather simple computer vision datasets and (ii) is based on quantities whose computation introduces an unrealistic computational burden.
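As a rough illustration of the kind of trajectory-based topological quantity involved (the measures in the paper are more refined, e.g. α-weighted lifetime sums for discrete-time trajectories; the function below is a hypothetical sketch, not the authors' implementation): the 0-dimensional persistence lifetimes of a finite point cloud coincide with the edge lengths of its Euclidean minimum spanning tree, so a simple complexity proxy can be computed cheaply:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def ph0_total_persistence(points, alpha=1.0):
    """Sum of alpha-powered 0-dimensional persistence lifetimes of a point
    cloud. For Vietoris-Rips filtrations these lifetimes coincide with the
    edge lengths of a Euclidean minimum spanning tree, which makes the
    quantity computable in near-quadratic time."""
    d = squareform(pdist(points))            # pairwise distance matrix
    mst = minimum_spanning_tree(d)           # sparse matrix of MST edges
    return float(np.sum(mst.data ** alpha))  # mst.data = the edge lengths

# Illustrative "trajectories": a concentrated vs. a dispersed point cloud.
rng = np.random.default_rng(0)
tight = rng.normal(0.0, 0.1, size=(200, 10))
spread = rng.normal(0.0, 1.0, size=(200, 10))
print(ph0_total_persistence(tight), ph0_total_persistence(spread))
```

A more dispersed trajectory yields a larger total persistence, which is the intuition behind linking such topological quantities to generalization behavior.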

Related reference:
R. Andreeva, B. Dupuis, R. Sarkar, T. Birdal, U. Şimşekli, "Topological Generalization Bounds for Discrete-Time Stochastic Optimization Algorithms", NeurIPS 2024.

2. “A novel compression algorithm for neural networks by using heavy tails.”

By exploiting the links between compressibility and heavy tails, we developed a novel compression algorithm with provable compression guarantees. The proposed approach requires only a very minor modification of the original optimization algorithm, namely injecting the “right” heavy-tailed noise into the optimization process. This renders our approach very practical, as its computational complexity is virtually the same as that of the original optimization process (a major gain compared to regularization-based approaches, which significantly increase the computational burden). Moreover, the algorithm emerged purely from theory, in contrast to a large body of compression algorithms that are mostly based on heuristics and typically have little or no theoretical justification.
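A minimal sketch of the underlying idea (not the paper's algorithm: here Cauchy noise, i.e. 1-stable, stands in for the “right” heavy-tailed noise, and the quadratic loss and all constants are illustrative): iterates driven by heavy-tailed noise concentrate their magnitude in a few coordinates, so they lose little under simple magnitude pruning, whereas Gaussian-driven iterates do not:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sgd(noise_fn, d=1000, eta=0.05, n_steps=2000):
    """Gradient descent on the quadratic loss ||x||^2 / 2 with injected
    noise drawn from noise_fn. Heavy-tailed noise makes the stationary
    iterate compressible: a few coordinates dominate its magnitude."""
    x = np.zeros(d)
    for _ in range(n_steps):
        x = x - eta * x + eta * noise_fn(d)
    return x

def pruning_error(x, keep_ratio=0.1):
    """Relative L2 error after keeping only the largest-magnitude entries
    (magnitude pruning) and zeroing out the rest."""
    k = int(keep_ratio * x.size)
    dropped = np.argsort(np.abs(x))[:-k]     # indices of zeroed-out entries
    return np.linalg.norm(x[dropped]) / np.linalg.norm(x)

x_heavy = noisy_sgd(lambda d: rng.standard_cauchy(d))  # heavy-tailed noise
x_gauss = noisy_sgd(lambda d: rng.normal(0.0, 1.0, d))  # Gaussian baseline
print(pruning_error(x_heavy), pruning_error(x_gauss))
```

Keeping only 10% of the entries, the heavy-tailed iterate incurs a much smaller relative error than the Gaussian one, which is the compressibility effect the compression guarantees build on.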

Related reference:
Y. Wan, M. Barsbey, A. Zaidi, U. Simsekli, "Implicit Compressibility of Overparametrized Neural Networks Trained with Heavy-Tailed SGD", ICML 2024.