Periodic Reporting for period 4 - BeyondBlackbox (Data-Driven Methods for Modelling and Optimizing the Empirical Performance of Deep Neural Networks)
Reporting period: 2021-07-01 to 2021-12-31
Deep neural networks therefore present a key technology for further economic growth. Yet, this key technology is still hard to use, e.g. for small and medium enterprises, due to its sensitivity to good hyperparameter settings and choices of neural architecture.
This ERC grant aims to change this by making deep learning much easier to use through automated machine learning (AutoML): automated methods that adjust the learning method itself, removing the need for manual tuning (and thus for expensive and often unavailable deep learning experts).
A key aspect in this work is efficiency: while previous *blackbox* hyperparameter optimization and neural architecture search methods are slow, this project goes *beyond the black box* to substantially speed up these processes.
Specifically, the project aims to develop efficient multi-fidelity Bayesian optimization approaches that reason across datasets, across training epochs of neural networks, and across subsets of large datasets, in order to enable much faster hyperparameter optimization and neural architecture search.
Method Goal 1: Exploiting data-driven priors in Bayesian optimization
We developed methods to effectively exploit large quantities of algorithm performance data from previous datasets [5, 9, 33], and also developed tooling to make such data accessible in the first place [28, 18, 45]. Combined with a methodology for using dataset subsets [1, 13, 53], this allowed us to win the 2nd AutoML competition [24]. Moreover, we implemented the equations given in WP 1.3 to successfully tune support vector machines, obtaining up to 100-fold speedups [4, 22]. We also published the first work that allows the specification of free-form priors in Bayesian optimization [51] and the first work that allows learning across developer adjustments [31], and we explored meta-learning in the context of neural architecture search [37].
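To illustrate how a user-specified prior over promising configurations can enter Bayesian optimization (in the spirit of [51]), the following is a minimal sketch: the expected-improvement acquisition is multiplied by a prior density whose influence decays over iterations. The names, decay schedule, and toy data are illustrative assumptions, not the project's exact formulation.

```python
# Minimal sketch: prior-weighted expected improvement (EI), in the spirit of
# placing user priors over good regions in Bayesian optimization [51].
# The decay schedule and all names are illustrative.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(mu, sigma, best):
    """Standard EI for minimization, given the GP posterior mean/std."""
    sigma = np.maximum(sigma, 1e-12)
    z = (best - mu) / sigma
    return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def prior_weighted_ei(X_cand, gp, best, prior_pdf, t, beta=10.0):
    """EI multiplied by the user prior; the prior's influence fades as t grows."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    return expected_improvement(mu, sigma, best) * prior_pdf(X_cand) ** (beta / max(t, 1))

# Toy usage: 1-D search space, user prior centered on x = 0.2.
X_obs = np.array([[0.1], [0.5], [0.9]])
y_obs = np.array([0.3, 0.2, 0.6])
gp = GaussianProcessRegressor().fit(X_obs, y_obs)
prior = lambda X: norm.pdf(X[:, 0], loc=0.2, scale=0.1)
X_cand = np.linspace(0, 1, 101).reshape(-1, 1)
scores = prior_weighted_ei(X_cand, gp, y_obs.min(), prior, t=3)
print(X_cand[np.argmax(scores)])  # candidate proposed for the next evaluation
```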
Method Goal 2: Graybox Bayesian optimization
We worked on methods for modeling learning curves and exploiting them in a Bayesian optimization setting [2, 13, 21, 22]. However, concurrent work demonstrated excellent performance with a simpler, bandit-based method; we therefore shifted our focus to improving that bandit method and developed model-based bandit methods that improved the state of the art in hyperparameter optimization of deep neural networks. For example, our BOHB approach has yielded speedups of more than 50x compared to traditional black-box optimization methods [13]. We also developed efficient and reproducible benchmark problems to further evaluate gray-box methods [48]. As part of our work on extracting SGD state features, we found an issue with the popular Adam optimizer; the proposed fix is now the PI's most highly cited paper [11].
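The bandit backbone that BOHB [13] builds on can be illustrated with a minimal successive-halving loop: many configurations are evaluated on a small budget (e.g., few epochs), and only the best fraction advances to the next, larger budget. In this sketch, random sampling and a synthetic objective stand in for BOHB's model-based candidate proposals and real network training.

```python
# Minimal sketch of the successive-halving backbone behind Hyperband/BOHB [13].
# Random sampling and a synthetic objective replace BOHB's model-based proposals
# and actual network training.
import random

def evaluate(config, budget):
    """Stand-in for training for `budget` epochs and returning validation loss."""
    noise = random.gauss(0.0, 0.05) / budget        # low budgets are noisier
    return (config["log_lr"] + 3.0) ** 2 + noise    # optimum near lr = 1e-3

def successive_halving(n_configs=27, min_budget=1, eta=3):
    configs = [{"log_lr": random.uniform(-6, -1)} for _ in range(n_configs)]
    budget = min_budget
    while len(configs) > 1:
        ranked = sorted(configs, key=lambda c: evaluate(c, budget))
        configs = ranked[: max(1, len(configs) // eta)]   # keep the top 1/eta
        budget *= eta                                      # and give them more budget
    return configs[0]

print(successive_halving())  # e.g. {'log_lr': -3.02}
```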
Method Goal 3: Hyperparameter control
We worked on improving stochastic optimization with AdamW [11] as a basis for controlling the optimization process with reinforcement learning (RL). To ease the application of RL, we worked on automated hyperparameter optimization for RL algorithms [26, 27], including adapting hyperparameters over time. We clearly demonstrated the benefit of hyperparameter optimization in the RL domain by finding hyperparameters so good that they allowed the agent to break the MuJoCo simulator [12].
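For reference, the core difference between L2-regularized Adam and AdamW [11] is that the weight-decay term is decoupled from the adaptive gradient update. The sketch below shows a single NumPy update step; the variable names and the exact scaling of the decay term follow common implementations and are illustrative.

```python
# Minimal sketch of one AdamW update step [11]: weight decay is applied
# directly to the weights (decoupled) instead of being added to the gradient.
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # adaptive gradient step
    w = w - lr * weight_decay * w                 # decoupled weight decay
    return w, m, v

# Toy usage on a single parameter vector.
w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    grad = 2 * w                                  # gradient of ||w||^2
    w, m, v = adamw_step(w, grad, m, v, t)
print(w)
```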
One part of our work that was not covered by the original work packages but was requested by the reviewers of the original proposal is the optimization of neural architectures. We therefore invested substantial effort into this new field of neural architecture search (NAS) and made substantial contributions to it [7, 10, 20, 25, 37, 39, 40, 42, 43, 47].
Application Goal: Computationally inexpensive auto-tuned deep learning, even for large datasets
To improve the applicability of deep learning, we worked on the hyperparameter optimization of augmentation and regularization techniques [44, 50]. We achieved the application goal by releasing the open-source automated deep learning library Auto-PyTorch [49], which allows efficient hyperparameter optimization and training of deep networks even on large datasets. This democratizes deep learning by allowing non-ML-experts to achieve state-of-the-art ML results.
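As a usage illustration, the following is a minimal sketch of fitting a tabular classifier with Auto-PyTorch; it assumes the TabularClassificationTask API of recent releases, and argument names or defaults may differ between versions.

```python
# Minimal sketch: automated model search with Auto-PyTorch [49].
# Assumes the TabularClassificationTask API of recent releases; argument names
# may differ between versions.
from autoPyTorch.api.tabular_classification import TabularClassificationTask
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

api = TabularClassificationTask()
api.search(
    X_train=X_train, y_train=y_train,
    X_test=X_test, y_test=y_test,
    optimize_metric="accuracy",
    total_walltime_limit=300,          # seconds for the whole search
    func_eval_time_limit_secs=50,      # seconds per pipeline evaluation
)
y_pred = api.predict(X_test)
print(api.score(y_pred, y_test))
```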
1. We published the first book on AutoML and established AutoML as a field.
2. We developed AdamW, a neural network optimizer closely related to Adam that decouples weight decay from the gradient-based update and has become the de-facto standard for training large transformer models.
3. We demonstrated that, when a broad hyperparameter space is considered and optimized automatically with AutoML, simple neural networks can excel on tabular data and even outperform traditional ML techniques such as gradient boosting. This new state-of-the-art performance opens up many new opportunities for deep learning on tabular data.
4. We introduced BOHB, an efficient combination of Hyperband and Bayesian optimization that has been widely adopted for general multi-fidelity hyperparameter optimization problems.
5. We devised the concept of tabular NAS benchmarks and wrote a series of papers on the topic, laying a strong foundation for the scientific study of NAS.
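To illustrate the tabular NAS benchmark concept from item 5: architectures in a small, enumerable search space are trained once, and their results are stored in a lookup table, so that NAS methods can afterwards be compared via cheap table lookups instead of repeated training. The sketch below is purely illustrative and not tied to the format of any specific released benchmark.

```python
# Minimal sketch of the tabular NAS benchmark idea: precompute a mapping
# (architecture -> performance) once, then evaluate NAS methods by lookups.
# Purely illustrative; not the format of any specific released benchmark.
import itertools, random

OPS = ["conv3x3", "conv1x1", "skip"]

def train_and_evaluate(arch):
    """Stand-in for actually training the architecture and measuring accuracy."""
    return 0.90 + 0.01 * arch.count("conv3x3") + random.gauss(0, 0.002)

# Offline, expensive phase: enumerate and train every 4-operation cell once.
table = {arch: train_and_evaluate(arch) for arch in itertools.product(OPS, repeat=4)}

# Online, cheap phase: benchmark a NAS method (here: random search) via lookups.
best = max(random.sample(list(table), 20), key=table.get)
print(best, table[best])
```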