Periodic Reporting for period 4 - ScaleML (Elastic Coordination for Scalable Machine Learning)
Reporting period: 2023-09-01 to 2024-02-29
1) Conceptual Goal: Develop coordination abstractions and algorithms which scale by design, and show how they enable provably efficient and provably correct computation;
2) Applied Goal: Make these elastic coordination abstractions practical, by using them to scale machine learning tasks in multi-threaded and multi-machine environments.
The plan for achieving these goals was to pursue two large targeted projects, each divided into theoretical and practical components. These are summarized in the attached table, following the original proposal.
Project 1: Elastic Coordination for Multi-Threaded Systems
The goal of this first subproject is to understand the theoretical underpinnings of the interplay between iterative algorithms, in particular those used to train machine learning models, and the scheduling of their iterations, which is closely tied to their scalability. We then extend these theoretical findings to practical implementations.
In 2019, as one of our first results, we published the first full analysis of classic iterative algorithms (such as single-source shortest-paths) under relaxed schedulers.
Efficiency Guarantees for Parallel Incremental Algorithms under Relaxed Schedulers
Dan Alistarh, Nikita Koval, and Giorgi Nadiradze.
In ACM SPAA 2019.
We subsequently applied relaxed scheduling to belief propagation, a classic approximate inference algorithm:
Relaxed Scheduling for Scalable Belief Propagation
Vitaly Aksenov, Dan Alistarh, and Janne H. Korhonen.
In NeurIPS 2020.
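To give a flavor of this setting: a relaxed scheduler (for example, a MultiQueue) returns a task that is close to, but not necessarily, the highest-priority one, trading strict priority order for scalability. The sketch below runs a Dijkstra-style single-source shortest-paths computation on top of such a relaxed scheduler, simulated sequentially by popping from the better of two randomly chosen heaps. It is a toy illustration only, not the analysis or implementation from the papers above, and all function and parameter names are ours.

```python
import heapq
import random

def relaxed_sssp(graph, source, num_queues=4, seed=0):
    """Single-source shortest paths driven by a relaxed priority scheduler.

    graph: dict mapping node -> list of (neighbor, weight) pairs.
    Instead of one exact priority queue, pending relaxation tasks are
    spread over several heaps; each step pops from the better of two
    randomly chosen heaps, so tasks may run slightly out of priority
    order, yet the computed distances remain exact.
    """
    rng = random.Random(seed)
    dist = {source: 0}
    queues = [[] for _ in range(num_queues)]
    heapq.heappush(queues[0], (0, source))

    while any(queues):
        # Relaxed delete-min: compare the tops of two random non-empty heaps.
        non_empty = [q for q in queues if q]
        q1, q2 = rng.choice(non_empty), rng.choice(non_empty)
        best = q1 if q1[0] <= q2[0] else q2
        d, node = heapq.heappop(best)
        if d > dist.get(node, float("inf")):
            continue  # stale task: a shorter path was already found
        for neighbor, weight in graph[node]:
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(rng.choice(queues), (nd, neighbor))
    return dist

# Example: shortest distances from node "a" in a small weighted graph.
g = {"a": [("b", 1), ("c", 4)], "b": [("c", 1), ("d", 5)],
     "c": [("d", 1)], "d": []}
print(relaxed_sssp(g, "a"))  # {'a': 0, 'b': 1, 'c': 2, 'd': 3}
```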
An unexpectedly fruitful extension of this project has been to flip the script and apply learning techniques to classical data structures. In particular, in a series of papers, we have started investigating concurrent data structures that adapt to (or “learn”) the access distribution of the data, thereby circumventing worst-case lower bounds; a toy illustration of this adaptivity principle follows the publication list below. This line of work has resulted in the following three publications, each of which has been highlighted via an award or a journal “special issue” invitation:
Non-Blocking Concurrent Interpolation Search
Trevor Brown, Aleksandar Prokopec and Dan Alistarh.
In PPOPP 2020. Best Paper Award.
In Search of the Fastest Concurrent Union-Find Algorithm
Dan Alistarh, Alexander Fedorov and Nikita Koval.
In OPODIS 2019. Best Paper Award.
The Splay-List: A Distribution-Adaptive Concurrent Skip-List
Vitaly Aksenov, Dan Alistarh, Alexandra Drozdova, Amirkeivan Mohtashami.
In DISC 2020. Invited to the Special Issue of “Distributed Computing” for DISC 2020.
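To make the notion of distribution adaptivity concrete, the toy sketch below uses a classical move-to-front list: frequently accessed keys migrate toward the head, so the cost of a lookup tracks the access distribution rather than the worst case. This is only a simple sequential illustration of the principle; it is not the concurrent interpolation search, union-find, or splay-list structures from the papers above, and the class and method names are ours.

```python
class MoveToFrontList:
    """A self-adjusting (distribution-adaptive) lookup structure.

    Every successful lookup moves the key to the front, so frequently
    accessed keys end up near the head and become cheap to find, while
    rarely accessed keys pay more. The splay-list applies a similar
    principle to a concurrent skip-list by adjusting element heights.
    """
    def __init__(self, keys):
        self._items = list(keys)

    def contains(self, key):
        for i, k in enumerate(self._items):
            if k == key:
                # Self-adjustment: promote the accessed key to the front.
                self._items.insert(0, self._items.pop(i))
                return True
        return False

lst = MoveToFrontList(range(1000))
for _ in range(10_000):
    lst.contains(7)   # a "hot" key quickly migrates to the front
print(lst._items[0])  # 7
```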
Project 2: Scalable Machine Learning for Multi-Node Systems
The goal of this second subproject is to develop theory and implementations for distributed multi-node machine learning, based on the general idea of elastic coordination. Our first result was the first implementation of a framework to efficiently support communication compression for the training of deep neural networks; a minimal sketch of the compression idea follows the citation below. This work was presented at Supercomputing (SC) 2019, the premier venue for high-performance computing:
SparCML: High-Performance Sparse Communication for Machine Learning
Cedric Renggli, Saleh Ashkboos, Mehdi Aghagolzadeh, Dan Alistarh and Torsten Hoefler.
In Supercomputing (SC) 2019.
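The principle behind communication compression is to send only a compact summary of each gradient, for instance its largest-magnitude entries, and to aggregate these summaries across workers. The sketch below illustrates this with plain top-k sparsification in NumPy; it is a single-process toy illustration of the principle, not SparCML's API or its sparse allreduce implementation, and all function names are ours.

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep only the k largest-magnitude gradient entries.

    Returns (indices, values); all other entries are treated as zero, so
    only ~k index/value pairs need to be communicated instead of the
    full dense gradient.
    """
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def aggregate(sparse_grads, dim):
    """Sum sparse contributions from all workers into a dense gradient."""
    total = np.zeros(dim)
    for idx, vals in sparse_grads:
        np.add.at(total, idx, vals)   # scatter-add handles repeated indices
    return total

# Toy example: 4 workers, 1000-dimensional gradients, keep the top 1%.
rng = np.random.default_rng(0)
grads = [rng.standard_normal(1000) for _ in range(4)]
sparse = [topk_sparsify(g, k=10) for g in grads]
approx = aggregate(sparse, dim=1000) / len(grads)
exact = np.mean(grads, axis=0)
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))  # compression error
```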
A parallel line of work was a paper studying asynchronous training of machine learning models, in particular addressing the practical problem of straggler (slow) processes in large-scale training of neural networks:
Taming Unbalanced Training Workloads in Deep Learning with Partial Collectives
Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Dan Alistarh, and Torsten Hoefler.
In PPOPP 2020.
We then extended this collaboration to propose a new training method that allows nodes to make progress partially asynchronously, avoiding global synchronization barriers (a simplified sketch of the underlying periodic-averaging idea follows the citation below):
Breaking Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging
Shigang Li, Tal Ben-Nun, Dan Alistarh, Salvatore Di Girolamo, Nikoli Dryden, and Torsten Hoefler.
Accepted to IEEE Transactions on Parallel and Distributed Systems (to appear).
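The key idea behind such wait-avoiding schemes is to let workers take several local optimization steps and only periodically average their models, rather than synchronizing after every gradient step. The sketch below simulates the simplest variant of this idea, local SGD with periodic averaging over all workers, on a least-squares toy problem. It illustrates the principle only and does not reproduce the group-based, wait-avoiding protocol of the paper; all function and parameter names are ours.

```python
import numpy as np

def local_sgd(num_workers=4, local_steps=8, rounds=50, lr=0.1, dim=10, seed=0):
    """Local SGD with periodic model averaging, simulated in one process.

    Each simulated worker takes `local_steps` gradient steps on its own
    data shard before all models are averaged, instead of synchronizing
    after every single step. The loss is a simple least-squares objective.
    """
    rng = np.random.default_rng(seed)
    truth = rng.standard_normal(dim)            # shared ground-truth model
    shards = []
    for _ in range(num_workers):
        A = rng.standard_normal((32, dim))
        b = A @ truth + 0.01 * rng.standard_normal(32)
        shards.append((A, b))

    models = [np.zeros(dim) for _ in range(num_workers)]
    for _ in range(rounds):
        for w, (A, b) in enumerate(shards):
            x = models[w]
            for _ in range(local_steps):        # local, unsynchronized steps
                grad = A.T @ (A @ x - b) / len(b)
                x = x - lr * grad
            models[w] = x
        avg = np.mean(models, axis=0)           # periodic averaging round
        models = [avg.copy() for _ in range(num_workers)]
    return models[0], truth

model, truth = local_sgd()
print(np.linalg.norm(model - truth))  # small: all workers converge jointly
```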
On the lower-bound side, we investigated the inherent limits of learning in a distributed environment in which nodes may have slightly different local data distributions, in the following work, published in ICML 2020:
On the Sample Complexity of Adversarial Multi-Source PAC Learning
Nikola Konstantinov, Elias Frantar, Dan Alistarh, Christoph H. Lampert.
In ICML 2020.
Further in this area, in the second reporting period we published six new results in the proceedings of top machine learning conferences (ICLR, ICML, and NeurIPS 2021):
Foivos Alimisis, Peter Davies, Dan Alistarh:
Communication-Efficient Distributed Optimization with Quantized Preconditioners.
ICML 2021: 196-206
Peter Davies, Vijaykrishna Gurunanthan, Niusha Moshrefi, Saleh Ashkboos, Dan Alistarh:
New Bounds For Distributed Mean Estimation and Variance Reduction.
ICLR 2021
Zeyuan Allen-Zhu, Faeze Ebrahimianghazani, Jerry Li, Dan Alistarh:
Byzantine-Resilient Non-Convex Stochastic Gradient Descent.
ICLR 2021
Giorgi Nadiradze, Amirmojtaba Sabour, Peter Davies, Shigang Li, Dan Alistarh:
Asynchronous Decentralized SGD with Quantized and Local Updates.
NeurIPS 2021: 6829-6842
Foivos Alimisis, Peter Davies, Bart Vandereycken, Dan Alistarh:
Distributed Principal Component Analysis with Limited Communication.
NeurIPS 2021: 2823-2834
Janne H. Korhonen, Dan Alistarh:
Towards Tight Communication Lower Bounds for Distributed Optimisation.
NeurIPS 2021: 7254-7266
We have made excellent progress so far on both the practical and the theoretical side. On the practical side, our techniques are now among the fastest options for scalable, large-scale distributed training of neural networks; specifically, the CGX framework has been deployed by Genesis Cloud, a sustainable cloud provider. On the theoretical side, our techniques have established some of the first tight bounds for distributed optimization problems in the corresponding models.
Our general goal has been to unify the implementations accompanying the individual papers into a cohesive communication-compression framework, released as open source and supporting major machine learning frameworks such as TensorFlow and PyTorch. We have achieved this goal via the award-winning CGX framework, and we plan to follow up on this via an ERC Proof-of-Concept grant.
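To illustrate how a communication-compression layer typically plugs into an existing training framework, the sketch below registers a gradient-compression communication hook on a PyTorch DistributedDataParallel model using PyTorch's built-in hook mechanism. This is a generic PyTorch example of the integration pattern, written for illustration; it is not CGX's API or implementation.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

# Assumes the script is launched with torchrun, one process per GPU.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()
ddp_model = DDP(model, device_ids=[torch.cuda.current_device()])

# Compress gradients to fp16 before the allreduce (halving communication
# volume); the hook decompresses back to full precision after aggregation.
ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)

# Training proceeds as usual; compression is transparent to the training loop.
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
inputs, targets = torch.randn(32, 1024).cuda(), torch.randn(32, 1024).cuda()
loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
loss.backward()
optimizer.step()
```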