Periodic Reporting for period 4 - ScaleML (Elastic Coordination for Scalable Machine Learning)
Reporting period: 2023-09-01 to 2024-02-29
1) Conceptual Goal: Develop coordination abstractions and algorithms which scale by design, and show how they enable provably efficient and provably correct computation;
2) Applied Goal: Make these elastic coordination abstractions practical, by using them to scale machine learning tasks in multi-threaded and multi-machine environments.
The plan for achieving these goals was to pursue two large targeted projects, each divided into theoretical and practical components. These are summarized in the attached table, following the original proposal.
Project 1: Elastic Coordination for Multi-Threaded Systems
The goal of this first subproject is to understand the theoretical underpinnings of the interplay between iterative algorithms, in particular those used to train machine learning models, and the scheduling of their iterations, which is closely tied to their scalability. We then extend these theoretical findings to practical implementations.
In 2019, as one of our first results, we published the first full analysis of classic iterative algorithms (such as single-source shortest-paths) under relaxed schedulers.
Efficiency Guarantees for Parallel Incremental Algorithms under Relaxed Schedulers
Dan Alistarh, Nikita Koval, and Giorgi Nadiradze.
In ACM SPAA 2019.
We subsequently applied relaxed scheduling to belief propagation, a classic approximate inference algorithm:
Relaxed Scheduling for Scalable Belief Propagation
Vitaly Aksenov, Dan Alistarh, and Janne H. Korhonen.
In NeurIPS 2020.
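To give a flavor of this setting: a relaxed scheduler (for example, a MultiQueue) returns a task that is close to, but not necessarily, the highest-priority one, trading strict priority order for scalability. The sketch below runs a Dijkstra-style single-source shortest-paths computation on top of such a relaxed scheduler, simulated sequentially by popping from the better of two randomly chosen heaps. It is a toy illustration only, not the analysis or implementation from the papers above, and all function and parameter names are ours.

```python
import heapq
import random

def relaxed_sssp(graph, source, num_queues=4, seed=0):
    """Single-source shortest paths driven by a relaxed priority scheduler.

    graph: dict mapping node -> list of (neighbor, weight) pairs.
    Instead of one exact priority queue, pending relaxation tasks are
    spread over several heaps; each step pops from the better of two
    randomly chosen heaps, so tasks may run slightly out of priority
    order, yet the computed distances remain exact.
    """
    rng = random.Random(seed)
    dist = {source: 0}
    queues = [[] for _ in range(num_queues)]
    heapq.heappush(queues[0], (0, source))

    while any(queues):
        # Relaxed delete-min: compare the tops of two random non-empty heaps.
        non_empty = [q for q in queues if q]
        q1, q2 = rng.choice(non_empty), rng.choice(non_empty)
        best = q1 if q1[0] <= q2[0] else q2
        d, node = heapq.heappop(best)
        if d > dist.get(node, float("inf")):
            continue  # stale task: a shorter path was already found
        for neighbor, weight in graph[node]:
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(rng.choice(queues), (nd, neighbor))
    return dist

# Example: shortest distances from node "a" in a small weighted graph.
g = {"a": [("b", 1), ("c", 4)], "b": [("c", 1), ("d", 5)],
     "c": [("d", 1)], "d": []}
print(relaxed_sssp(g, "a"))  # {'a': 0, 'b': 1, 'c': 2, 'd': 3}
```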
An unexpectedly fruitful extension of this project has been to flip the script and apply learning techniques to classical data structures. In particular, in a series of papers, we have started investigating concurrent data structures that adapt to (or “learn”) the access distribution of the data, thereby circumventing worst-case lower bounds; a toy illustration of this adaptivity principle follows the publication list below. This line of work has resulted in the following three publications, each of which has been highlighted via an award or a journal “special issue” invitation:
Non-Blocking Concurrent Interpolation Search
Trevor Brown, Aleksandar Prokopec and Dan Alistarh.
In PPOPP 2020. Best Paper Award.
In Search of the Fastest Concurrent Union-Find Algorithm
Dan Alistarh, Alexander Fedorov and Nikita Koval.
In OPODIS 2019. Best Paper Award.
The Splay-List: A Distribution-Adaptive Concurrent Skip-List
Vitaly Aksenov, Dan Alistarh, Alexandra Drozdova, Amirkeivan Mohtashami.
In DISC 2020. Invited to the Special Issue of “Distributed Computing” for DISC 2020.
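To make the notion of distribution adaptivity concrete, the toy sketch below uses a classical move-to-front list: frequently accessed keys migrate toward the head, so the cost of a lookup tracks the access distribution rather than the worst case. This is only a simple sequential illustration of the principle; it is not the concurrent interpolation search, union-find, or splay-list structures from the papers above, and the class and method names are ours.

```python
class MoveToFrontList:
    """A self-adjusting (distribution-adaptive) lookup structure.

    Every successful lookup moves the key to the front, so frequently
    accessed keys end up near the head and become cheap to find, while
    rarely accessed keys pay more. The splay-list applies a similar
    principle to a concurrent skip-list by adjusting element heights.
    """
    def __init__(self, keys):
        self._items = list(keys)

    def contains(self, key):
        for i, k in enumerate(self._items):
            if k == key:
                # Self-adjustment: promote the accessed key to the front.
                self._items.insert(0, self._items.pop(i))
                return True
        return False

lst = MoveToFrontList(range(1000))
for _ in range(10_000):
    lst.contains(7)   # a "hot" key quickly migrates to the front
print(lst._items[0])  # 7
```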
Project 2: Scalable Machine Learning for Multi-Node Systems
The goal of this second subproject is to develop theory and implementations for distributed multi-node machine learning, based on the general idea of elastic coordination. Our first result was the first implementation of a framework to efficiently support communication compression for the training of deep neural networks; a minimal sketch of the compression idea follows the citation below. This work was presented at Supercomputing (SC) 2019, the premier venue for high-performance computing:
SparCML: High-Performance Sparse Communication for Machine Learning
Cedric Renggli, Saleh Ashkboos, Mehdi Aghagolzadeh, Dan Alistarh and Torsten Hoefler.
In Supercomputing (SC) 2019.
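The principle behind communication compression is to send only a compact summary of each gradient, for instance its largest-magnitude entries, and to aggregate these summaries across workers. The sketch below illustrates this with plain top-k sparsification in NumPy; it is a single-process toy illustration of the principle, not SparCML's API or its sparse allreduce implementation, and all function names are ours.

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep only the k largest-magnitude gradient entries.

    Returns (indices, values); all other entries are treated as zero, so
    only ~k index/value pairs need to be communicated instead of the
    full dense gradient.
    """
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def aggregate(sparse_grads, dim):
    """Sum sparse contributions from all workers into a dense gradient."""
    total = np.zeros(dim)
    for idx, vals in sparse_grads:
        np.add.at(total, idx, vals)   # scatter-add handles repeated indices
    return total

# Toy example: 4 workers, 1000-dimensional gradients, keep the top 1%.
rng = np.random.default_rng(0)
grads = [rng.standard_normal(1000) for _ in range(4)]
sparse = [topk_sparsify(g, k=10) for g in grads]
approx = aggregate(sparse, dim=1000) / len(grads)
exact = np.mean(grads, axis=0)
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))  # compression error
```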
A parallel line of work was a paper studying asynchronous training of machine learning models, in particular addressing the practical problem of straggler (slow) processes in large-scale training of neural networks:
Taming Unbalanced Training Workloads in Deep Learning with Partial Collectives
Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Dan Alistarh, and Torsten Hoefler.
In PPOPP 2020.
We then extended this collaboration to propose a new training method that allows nodes to make progress partially asynchronously, avoiding global synchronization barriers (a simplified sketch of the underlying periodic-averaging idea follows the citation below):
Breaking Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging
Shigang Li, Tal Ben-Nun, Dan Alistarh, Salvatore Di Girolamo, Nikoli Dryden, and Torsten Hoefler.
Accepted to IEEE Transactions on Parallel and Distributed Systems (to appear).
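The key idea behind such wait-avoiding schemes is to let workers take several local optimization steps and only periodically average their models, rather than synchronizing after every gradient step. The sketch below simulates the simplest variant of this idea, local SGD with periodic averaging over all workers, on a least-squares toy problem. It illustrates the principle only and does not reproduce the group-based, wait-avoiding protocol of the paper; all function and parameter names are ours.

```python
import numpy as np

def local_sgd(num_workers=4, local_steps=8, rounds=50, lr=0.1, dim=10, seed=0):
    """Local SGD with periodic model averaging, simulated in one process.

    Each simulated worker takes `local_steps` gradient steps on its own
    data shard before all models are averaged, instead of synchronizing
    after every single step. The loss is a simple least-squares objective.
    """
    rng = np.random.default_rng(seed)
    truth = rng.standard_normal(dim)            # shared ground-truth model
    shards = []
    for _ in range(num_workers):
        A = rng.standard_normal((32, dim))
        b = A @ truth + 0.01 * rng.standard_normal(32)
        shards.append((A, b))

    models = [np.zeros(dim) for _ in range(num_workers)]
    for _ in range(rounds):
        for w, (A, b) in enumerate(shards):
            x = models[w]
            for _ in range(local_steps):        # local, unsynchronized steps
                grad = A.T @ (A @ x - b) / len(b)
                x = x - lr * grad
            models[w] = x
        avg = np.mean(models, axis=0)           # periodic averaging round
        models = [avg.copy() for _ in range(num_workers)]
    return models[0], truth

model, truth = local_sgd()
print(np.linalg.norm(model - truth))  # small: all workers converge jointly
```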
On the lower-bound side, we investigated the inherent limits of learning in a distributed environment in which nodes may have slightly different local data distributions, in the following work, published in ICML 2020:
On the Sample Complexity of Adversarial Multi-Source PAC Learning
Nikola Konstantinov, Elias Frantar, Dan Alistarh, Christoph H. Lampert.
In ICML 2020.
Further in this area, in the second reporting period we published six new results in the proceedings of top machine learning conferences (ICLR, ICML, and NeurIPS 2021):
Foivos Alimisis, Peter Davies, Dan Alistarh:
Communication-Efficient Distributed Optimization with Quantized Preconditioners.
ICML 2021: 196-206
Peter Davies, Vijaykrishna Gurunanthan, Niusha Moshrefi, Saleh Ashkboos, Dan Alistarh:
New Bounds For Distributed Mean Estimation and Variance Reduction.
ICLR 2021
Zeyuan Allen-Zhu, Faeze Ebrahimianghazani, Jerry Li, Dan Alistarh:
Byzantine-Resilient Non-Convex Stochastic Gradient Descent.
ICLR 2021
Giorgi Nadiradze, Amirmojtaba Sabour, Peter Davies, Shigang Li, Dan Alistarh:
Asynchronous Decentralized SGD with Quantized and Local Updates.
NeurIPS 2021: 6829-6842
Foivos Alimisis, Peter Davies, Bart Vandereycken, Dan Alistarh:
Distributed Principal Component Analysis with Limited Communication.
NeurIPS 2021: 2823-2834
Janne H. Korhonen, Dan Alistarh:
Towards Tight Communication Lower Bounds for Distributed Optimisation.
NeurIPS 2021: 7254-7266
We have made excellent progress so far on both the practical and the theoretical side. On the practical side, our techniques are now among the fastest options for scalable, large-scale distributed training of neural networks; specifically, the CGX framework has been deployed by Genesis Cloud, a sustainable cloud provider. On the theoretical side, our techniques have established some of the first tight bounds for distributed optimization problems in the corresponding models.
Our general goal has been to unify the implementations accompanying the individual papers into a cohesive communication-compression framework, released as open source and supporting major machine learning frameworks such as TensorFlow and PyTorch. We have achieved this goal via the award-winning CGX framework, and we plan to follow up on this via an ERC Proof-of-Concept grant.
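To illustrate how a communication-compression layer typically plugs into an existing training framework, the sketch below registers a gradient-compression communication hook on a PyTorch DistributedDataParallel model using PyTorch's built-in hook mechanism. This is a generic PyTorch example of the integration pattern, written for illustration; it is not CGX's API or implementation.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

# Assumes the script is launched with torchrun, one process per GPU.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()
ddp_model = DDP(model, device_ids=[torch.cuda.current_device()])

# Compress gradients to fp16 before the allreduce (halving communication
# volume); the hook decompresses back to full precision after aggregation.
ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)

# Training proceeds as usual; compression is transparent to the training loop.
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
inputs, targets = torch.randn(32, 1024).cuda(), torch.randn(32, 1024).cuda()
loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
loss.backward()
optimizer.step()
```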