Periodic Reporting for period 4 - TheoryDL (Practically Relevant Theory of Deep Learning)
Reporting period: 2020-08-01 to 2021-01-31
Translating text, searching and organizing images based on textual content, chat-bots, and self-driving cars are all examples of technologies that rely heavily on Deep Learning.
To the general audience, new technologies tend to look like magic.
What is unique about deep learning is that this technology looks like magic even to data scientists.
The goal of the TheoryDL project is to demystify deep learning by providing a deep (pun intended) theoretical understanding of this technology, and in particular of both its potential and its limitations.
The significance of this goal is twofold. First, I believe it is dangerous to rely on a technology that we do not understand. Second, a better theoretical understanding should enable us to improve existing algorithms. Of particular interest is the ability to come up with faster algorithms that are not of a brute-force nature. Current algorithms contain many brute-force components, and therefore the "power" of deep learning is concentrated in a few industrial companies that have the data and computing resources. A better theoretical understanding may lead to a democratization of this technology.
"
At a high level, perhaps the most important result is the coupling between deep learning and gradient-based algorithms: we have shown that analyzing neural networks independently of the algorithm used to train them is not the right approach.
On this theme, we first performed a systematic study of failures of deep learning.
People tend to publicize success stories, but failures are even more interesting, since laying down the boundaries of a technology enables us to better understand why and when it works. We have identified cases in which gradient-based training of deep networks fails miserably. Interestingly, the failures are due neither to overfitting/underfitting nor to spurious local minima or a plethora of saddle points. They are rather due to more subtle issues, such as insufficient information in the gradients or bad signal-to-noise ratios.
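As an illustration of the kind of failure driven by uninformative gradients, consider the classical parity problem. The sketch below is not code from the project; it is a minimal PyTorch experiment, under the assumption that a small fully-connected network is trained by gradient descent to predict the parity of d random bits, a setting where the gradient is known to carry exponentially little signal about the target.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 30                        # number of input bits; the target is their parity
n_train, n_test = 4096, 4096

def parity_data(n):
    # Uniform +/-1 inputs; the label is the product of all coordinates (full parity).
    X = torch.randint(0, 2, (n, d)).float() * 2 - 1
    return X, X.prod(dim=1, keepdim=True)

X_train, y_train = parity_data(n_train)
X_test, y_test = parity_data(n_test)

model = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2001):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X_train), y_train)
    loss.backward()
    opt.step()
    if step % 500 == 0:
        with torch.no_grad():
            test_acc = (model(X_test).sign() == y_test).float().mean().item()
        print(f"step {step:4d}  train loss {loss.item():.3f}  test acc {test_acc:.2f}")

# Test accuracy typically stays near chance (0.5): the gradient carries almost no
# information about which of the exponentially many parity functions is the target.
```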
This direction led us to an important observation: weight sharing is crucial for the optimization of deep networks. We proved that without weight sharing, deep learning can essentially learn only low frequencies, and completely fails to learn mid and high frequencies. Weight sharing enables a form of coarse-to-fine training.
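A one-dimensional toy illustration of this frequency bias (again not the project's code, and a much simpler setting than the one analyzed in the papers): the same fully-connected network, which has no weight sharing, is trained to fit a sine of increasing frequency; low frequencies are typically fit quickly while high frequencies barely improve within the same training budget.

```python
import torch
import torch.nn as nn

def fit_frequency(k, steps=3000, seed=0):
    """Train a small fully-connected network to regress sin(k * pi * x) on [-1, 1]."""
    torch.manual_seed(seed)
    x = torch.linspace(-1, 1, 1024).unsqueeze(1)
    y = torch.sin(k * torch.pi * x)
    net = nn.Sequential(nn.Linear(1, 128), nn.ReLU(),
                        nn.Linear(128, 128), nn.ReLU(),
                        nn.Linear(128, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), y)
        loss.backward()
        opt.step()
    return loss.item()

for k in (1, 4, 32):
    print(f"frequency k={k:2d}  final MSE {fit_frequency(k):.4f}")

# Low frequencies (k=1, 4) typically reach near-zero error within the budget,
# while the high-frequency target (k=32) remains poorly approximated.
```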
From there, we were able to define generative hierarchical models for which there exist provably efficient algorithms that also work well in practice.
We continued with a series of papers, culminating in the foundations of a general theory of deep learnability with gradient-based algorithms, phrased in the language of statistical queries.
- We identified the connection between approximation, depth separation and learnability in neural networks (Malach, Yehudai, S., Shamir, 2021).
- We proved the well-known "lottery ticket hypothesis", showing that pruning is all you need for building a deep network (Malach, Yehudai, S., Shamir, 2021); a toy illustration of training only a pruning mask follows this list.
- We derived a computational separation between convolutional and fully-connected networks (Malach & S., 2020).
- We derived a novel general theory of deep learning connecting hardness of approximation and hardness of learning through a new concept, the "Variance" of hypothesis classes (Malach & S., 2020).
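The following sketch illustrates the "pruning is all you need" idea referenced above. It is not the construction or algorithm from the paper; it is a minimal, hypothetical PyTorch example in the spirit of mask-training methods, in which the weights stay frozen at their random initialization and only a per-weight score, used to select which weights to keep, is trained.

```python
import torch
import torch.nn as nn

class PrunedLinear(nn.Module):
    """Linear layer whose weights are frozen at random initialization; only a
    per-weight score is trained, and the forward pass keeps the top-k fraction."""
    def __init__(self, in_f, out_f, keep=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) / in_f ** 0.5,
                                   requires_grad=False)
        self.scores = nn.Parameter(torch.randn(out_f, in_f) * 0.01)
        self.keep = keep

    def forward(self, x):
        k = int(self.keep * self.scores.numel())
        threshold = self.scores.flatten().kthvalue(self.scores.numel() - k + 1).values
        mask = (self.scores >= threshold).float()
        # Straight-through estimator: binary mask in the forward pass,
        # identity with respect to the scores in the backward pass.
        mask = mask + self.scores - self.scores.detach()
        return nn.functional.linear(x, self.weight * mask)

torch.manual_seed(0)
net = nn.Sequential(PrunedLinear(20, 256), nn.ReLU(), PrunedLinear(256, 1))
X = torch.randn(512, 20)
y = (X[:, :2].sum(dim=1, keepdim=True) > 0).float() * 2 - 1   # a simple toy target

# Only the scores are optimized; the random weights are never updated.
opt = torch.optim.Adam([p for p in net.parameters() if p.requires_grad], lr=1e-2)
for step in range(500):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(X), y)
    loss.backward()
    opt.step()
print("final loss (mask-only training):", loss.item())

# The loss typically drops well below its initial value even though no weight is
# ever trained, i.e. a useful subnetwork is found inside the random network.
```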