Periodic Reporting for period 2 - NN-OVEROPT (Neural Network: An Overparametrization Perspective)
Reporting period: 2023-11-01 to 2024-10-31
Importance for society: Our work does not have any direct consequence for society; however, it makes progress towards providing a theoretical foundation for modern machine learning systems. A better theoretical understanding of modern machine-learning systems would eventually lead to better algorithm design across various applications of machine learning. That would result in more interpretable systems, which can be further modified to develop bias-free algorithms.
Overall Objective: In these works, our main goal is to understand optimization and generalization when training a machine learning model with gradient descent, stochastic gradient descent, and noisy gradient descent.
SGD for the least squares objective. In later work, we show that our result can be extended to a class of non-convex problems as well. We also study the optimal control formulation of mirror descent and mirror Langevin and show that, for convex optimization tasks, mirror descent and mirror Langevin solve certain optimal control problems. In my current and future work, my goal is to show the connection between vanilla SGD and the various SDEs that we consider in our previous works. This will allow us to directly apply our results to analyzing the generalization bound for SGD on a class of non-convex problems.
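As an informal illustration of the kind of correspondence we study (not the analysis from the papers themselves), the sketch below compares a vanilla mini-batch SGD update on a least-squares objective with an Euler-Maruyama discretization of a noisy gradient-flow SDE; the dimensions, step size, and diffusion scale sigma are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def grad(w, idx):
    # gradient of the least-squares loss 0.5/|idx| * ||X[idx] w - y[idx]||^2
    return X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)

eta, sigma, T, batch = 0.01, 0.05, 2000, 10
w_sgd = np.zeros(d)   # vanilla mini-batch SGD iterate
w_sde = np.zeros(d)   # Euler-Maruyama iterate of dW_t = -grad f(W_t) dt + sigma dB_t
full = np.arange(n)

for _ in range(T):
    idx = rng.choice(n, size=batch, replace=False)
    w_sgd = w_sgd - eta * grad(w_sgd, idx)
    w_sde = w_sde - eta * grad(w_sde, full) + np.sqrt(eta) * sigma * rng.normal(size=d)

print("SGD distance to w_true:", np.linalg.norm(w_sgd - w_true))
print("SDE distance to w_true:", np.linalg.norm(w_sde - w_true))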
(i) We show that the effects of various complicated explicit regularizers can be obtained simply by injecting noise into the model. Noise injection can explicitly make the solution sparse in the case of over-parametrized linear regression, minimize the nuclear norm of the solution in the case of over-parametrized matrix factorization, and drive the solution towards wider minima in the case of neural network training. We also demonstrate in our experiments that our method consistently beats vanilla gradient descent and vanilla stochastic gradient descent in training deep neural networks. (A toy sketch of the noise-injection idea is given after this list.)
(ii) In the next series of works, we provided the first algorithmic-stability-based generalization bounds for heavy-tailed SGD on convex and non-convex functions. We also prove that the interaction between the tail decay coefficient and generalization behavior is non-monotonic, and that for some choice of the tail decay coefficient we obtain the best generalization behavior. The analysis is based on results from applied probability theory for Lévy-process SDEs; these results were not known earlier in the literature. In an extension, we propose a unified theory to derive algorithmic stability bounds for discrete-time Markov chains. This analysis is not driven by the theory of SDEs; instead, it works directly on the discrete-time Markov chain, which covers vanilla SGD. (A toy sketch of heavy-tailed SGD is given after this list.)
(iii) In another work, we show a connection between optimal control and mirror descent as well as mirror Langevin. We show that running mirror descent or mirror Langevin on a class of convex problems directly solves an optimal control problem whose cost is associated with the loss function and its Fenchel dual. (A toy sketch of the mirror-map/Fenchel-dual pairing is given after this list.)
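For item (i), the following toy sketch illustrates the general flavour of noise injection in over-parametrized linear regression with a Hadamard (u * v) parametrization; the specific injection scheme, the parametrization, and all hyperparameters here are illustrative assumptions and may differ from the scheme analysed in the paper.

import numpy as np

rng = np.random.default_rng(1)
n, d = 40, 200                           # over-parametrized regime: many more parameters than samples
X = rng.normal(size=(n, d))
w_star = np.zeros(d); w_star[:5] = 1.0   # sparse ground truth
y = X @ w_star

u = np.full(d, 0.1)                      # Hadamard parametrization of the predictor: w = u * v
v = np.full(d, 0.1)
eta, sigma, T = 0.01, 0.05, 5000         # sigma is the (assumed) injected-noise scale

for _ in range(T):
    # inject Gaussian noise into the parameters before evaluating the gradient
    un = u + sigma * rng.normal(size=d)
    vn = v + sigma * rng.normal(size=d)
    r = X @ (un * vn) - y                # residual at the perturbed parameters
    u = u - eta * (X.T @ r) * vn / n
    v = v - eta * (X.T @ r) * un / n

w = u * v
print("largest |w| coordinates:", np.sort(np.argsort(-np.abs(w))[:5]))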
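For item (ii), here is a toy sketch of an SGD-like recursion driven by symmetric alpha-stable noise, using scipy's levy_stable sampler; the tail index alpha plays the role of the tail decay coefficient, and the step-size scaling, noise scale, and objective are illustrative assumptions rather than the exact setting analysed in the papers.

import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(2)
n, d = 100, 20
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad(w):
    # full gradient of the least-squares loss 0.5/n * ||A w - b||^2
    return A.T @ (A @ w - b) / n

def heavy_tailed_sgd(alpha, eta=0.05, T=2000, scale=0.01):
    # gradient recursion driven by symmetric alpha-stable noise;
    # alpha = 2 recovers Gaussian noise, smaller alpha means heavier tails
    w = np.zeros(d)
    for _ in range(T):
        noise = levy_stable.rvs(alpha, 0, scale=scale, size=d)
        w = w - eta * grad(w) + eta ** (1.0 / alpha) * noise
    return w

for alpha in (2.0, 1.8, 1.5):
    w = heavy_tailed_sgd(alpha)
    print("alpha =", alpha, " final loss =", 0.5 * np.mean((A @ w - b) ** 2))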
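For item (iii), the snippet below is a toy illustration of the mirror-map/Fenchel-dual pairing that underlies mirror descent (entropic mirror descent on the probability simplex with a linear loss); it is not the optimal control formulation itself, and the mirror map, loss, and step size are illustrative choices.

import numpy as np

rng = np.random.default_rng(3)
d = 5
c = rng.normal(size=d)           # linear loss f(x) = <c, x> over the probability simplex

# Mirror map phi(x) = sum_i x_i log x_i (negative entropy). Its Fenchel dual is
# phi*(theta) = log sum_i exp(theta_i), and grad phi* (the softmax) maps dual
# iterates back to primal iterates on the simplex.
def grad_phi_star(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

eta, T = 0.1, 200
theta = np.zeros(d)              # dual (mirror) variable
for _ in range(T):
    x = grad_phi_star(theta)     # primal iterate on the simplex
    g = c                        # grad f(x); constant here because f is linear in x
    theta = theta - eta * g      # mirror-descent step in the dual (mirror) space
    # mirror Langevin would additionally inject Gaussian noise into this dual-space step

print("final primal iterate:", grad_phi_star(theta))   # mass concentrates on argmin_i c_i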
In the last phase of the fellowship, we would like to show that the iterates of vanilla SGD converge to an SDE that we study in our other works. We would also like to obtain direct results on training a multi-layer perceptron with SGD in a teacher-student setting and obtain a recovery guarantee for that setting. We would also try to extend our work on optimal control to a class of non-convex problems.
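As a purely illustrative indication of the planned teacher-student experiments (the widths, sample size, and learning rate below are assumptions, not settings from the project), a two-layer ReLU student trained with SGD on data generated by a fixed teacher of the same architecture could be set up as follows.

import numpy as np

rng = np.random.default_rng(4)
d, k, n = 30, 4, 1000                    # input dimension, network width, sample size (all assumed)

# Teacher: a fixed two-layer ReLU network that generates the labels
W_t = rng.normal(size=(k, d)) / np.sqrt(d)
a_t = rng.normal(size=k)
X = rng.normal(size=(n, d))
y = np.maximum(X @ W_t.T, 0.0) @ a_t

# Student: same architecture, trained from a random initialization with mini-batch SGD
W = rng.normal(size=(k, d)) / np.sqrt(d)
a = rng.normal(size=k)
eta, batch = 0.02, 32
for _ in range(5000):
    idx = rng.choice(n, size=batch, replace=False)
    h = np.maximum(X[idx] @ W.T, 0.0)            # hidden-layer activations
    err = h @ a - y[idx]                         # residual on the mini-batch
    a = a - eta * h.T @ err / batch
    W = W - eta * ((err[:, None] * a) * (h > 0)).T @ X[idx] / batch

print("student train MSE:", np.mean((np.maximum(X @ W.T, 0.0) @ a - y) ** 2))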
Economic Impact: No direct impact
Societal Impact: No direct impact