CORDIS - EU research results

FastML: Efficient and Cost-Effective Distributed Machine Learning

Periodic Reporting for period 1 - FastML (FastML: Efficient and Cost-Effective Distributed Machine Learning)

Reporting period: 2024-05-01 to 2025-10-31

Training state-of-the-art deep neural networks (especially large language models) is now dominated by distributed computation across many GPUs, where frequent, high-volume communication of gradients and parameters creates network bottlenecks. These bottlenecks waste compute and energy, driving up the monetary and environmental cost of training; even 10–20% inefficiency becomes material at the scale of modern runs (e.g. GPT-class models). The prevailing industry response—hardware overprovisioning of interconnects—is expensive and limits accessibility. In short, there is a pressing need for software-level methods that cut distribution overheads without sacrificing model quality, thereby reducing costs and broadening access to cutting-edge training.
FastML aims to deliver a unified, high-performance software framework that makes communication-efficient training a “first-class citizen” in mainstream ML stacks (PyTorch first, with pathways to TensorFlow/JAX/MXNet). It targets both data-parallel and model-parallel regimes, integrating quantization/sparsity-based consistency relaxations that provably lower the amount of information exchanged while preserving convergence and accuracy.
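The consistency relaxations mentioned above can be illustrated with a minimal sketch of an unbiased stochastic gradient quantizer, in the spirit of quantized-gradient methods. This is not FastML's actual algorithm; the function name, level count, and numbers are hypothetical.

```python
import random

def quantize_stochastic(grad, levels=4, seed=0):
    # Illustrative (hypothetical, not FastML's scheme): map each gradient
    # value onto a small grid of `levels` points in [-absmax, +absmax],
    # rounding up or down stochastically so the quantizer is unbiased in
    # expectation. Communicating grid indices needs far fewer bits than
    # full-precision values.
    rng = random.Random(seed)
    absmax = max(abs(g) for g in grad) or 1.0
    step = 2 * absmax / (levels - 1)
    out = []
    for g in grad:
        pos = (g + absmax) / step        # position on the quantization grid
        lo = int(pos)
        p_up = pos - lo                  # probability of rounding up
        q = min(levels - 1, lo + (1 if rng.random() < p_up else 0))
        out.append(q * step - absmax)
    return out

q = quantize_stochastic([0.31, -0.8, 0.05, 0.77])
# quantized values stay within the tensor's original dynamic range
assert all(-0.8 - 1e-9 <= v <= 0.8 + 1e-9 for v in q)
```

In practice such schemes trade a controlled amount of gradient noise for a large reduction in bytes on the wire, which is the "provably lower information exchange while preserving convergence" trade-off the text refers to.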
By lowering training time and enabling cheaper hardware choices, FastML can reduce total training cost for enterprises and cloud users; this translates into direct operational cost savings on GPU hours and networking, and improved return-on-investment for AI initiatives.
The FastML Proof-of-Concept project developed LLMQ, an end-to-end CUDA/C++ framework enabling efficient large language model training on consumer-grade GPUs. The work focused on addressing the key bottlenecks of commodity hardware—limited device memory and slow inter-GPU communication—through a comprehensive set of optimizations.

Key technical work performed includes: implementation of selective activation checkpointing spanning from non-matrix-multiplication layers to full transformer blocks; development of efficient CPU offloading strategies for optimizer states, residuals, and gradients; creation of custom cudaMemcpy-based communication primitives that significantly outperform standard NCCL collectives on consumer hardware; and implementation of an FP8 training pipeline with dynamic tensor-level scaling compatible with both Ada (RTX 40xx) and Blackwell (RTX 50xx) architectures.
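The benefit of selective activation checkpointing can be made concrete with a rough peak-memory model: without checkpointing, every layer's activations are kept for the backward pass; checkpointing every k layers stores only segment boundaries plus one recomputed segment. All numbers below are illustrative, not LLMQ's actual accounting.

```python
def activation_memory_gb(layers, act_per_layer_gb, checkpoint_every=None):
    # Rough peak activation-memory model (hypothetical numbers, not
    # LLMQ's real bookkeeping). Without checkpointing, all `layers`
    # activations are held for backward. With checkpointing every k
    # layers, only the checkpoint boundaries are stored, plus one
    # k-layer segment that is recomputed at a time during backward.
    if checkpoint_every is None:
        return layers * act_per_layer_gb
    checkpoints = layers // checkpoint_every
    return (checkpoints + checkpoint_every) * act_per_layer_gb

full = activation_memory_gb(32, 0.5)                      # 32 * 0.5 = 16.0 GB
ckpt = activation_memory_gb(32, 0.5, checkpoint_every=4)  # (8 + 4) * 0.5 = 6.0 GB
assert full == 16.0 and ckpt == 6.0
```

This is the lever that, combined with CPU offloading of optimizer states and gradients, lets multi-billion-parameter models fit within 16-24 GB of device memory.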
Main achievements include: training 7B parameter models on a single 16GB mid-range gaming card (RTX 5060Ti); training 32B parameter models on workstations equipped with 4× RTX 4090 GPUs; achieving 50-70% Model FLOPs Utilization (MFU), rivaling production-scale systems on datacenter GPUs; and establishing pilot partnerships with A1/Exoscale, Verda/DataCrunch, NVIDIA, and HP. The framework was released as open-source software at https://github.com/IST-DASLab/llmq.
LLMQ advances beyond existing approaches in several significant ways:
Memory efficiency on consumer hardware: While existing frameworks like LLaMA-Factory require full ZeRO-3 offloading with large batch sizes for acceptable performance on large models, LLMQ achieves superior throughput through partial offloading at moderate batch sizes, enabled by dramatically lower per-iteration overheads.
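Why lower per-iteration overheads allow moderate batch sizes can be seen in a toy iteration-time model: a fixed per-step cost (offload transfers, optimizer step) is amortized over the batch, so a framework with high fixed overhead needs very large batches to stay efficient. All numbers here are hypothetical, not measured LLMQ figures.

```python
def tokens_per_second(batch_tokens, compute_s_per_token, fixed_overhead_s):
    # Toy model (illustrative numbers only): iteration time is per-token
    # compute plus a fixed per-iteration overhead. The overhead term is
    # what partial offloading and cheap optimizer steps shrink; the
    # smaller it is, the less batch size matters for throughput.
    iter_time = batch_tokens * compute_s_per_token + fixed_overhead_s
    return batch_tokens / iter_time

# same moderate batch, low-overhead vs high-overhead framework
lean = tokens_per_second(8192, 1e-4, 0.2)   # overhead mostly amortized
heavy = tokens_per_second(8192, 1e-4, 2.0)  # overhead dominates the step
assert lean > heavy
```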
Communication optimization for peer-to-peer-limited systems: Consumer GPUs lack peer-to-peer PCIe access and must route traffic through host memory. LLMQ introduces a novel cudaMemcpy-based reduce-scatter algorithm that separates arithmetic from data movement, enabling the copy engine to operate in parallel with compute kernels. This approach achieves nearly 2× speedup over NCCL-based communication on 4× RTX 4090 configurations (7,800 vs 4,300 tokens/second for 14B models).
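The data-movement pattern of a host-staged reduce-scatter can be sketched in pure Python. This only shows the semantics (device-to-host copies, host-side reduction, chunk-wise copy back); the real LLMQ primitive issues cudaMemcpy transfers on the copy engine so they overlap with compute, which a simulation cannot capture.

```python
def host_staged_reduce_scatter(device_buffers):
    # Simulate reduce-scatter when GPUs lack peer-to-peer access (sketch,
    # not LLMQ's CUDA code): every device copies its buffer to the host,
    # the host sums element-wise per chunk, and each device receives back
    # only its own reduced chunk. Separating the summation (arithmetic)
    # from the copies (data movement) is what lets the copy engine run
    # concurrently with compute kernels in the real implementation.
    n = len(device_buffers)
    chunk = len(device_buffers[0]) // n
    host = [list(buf) for buf in device_buffers]      # stage 1: device -> host
    reduced = []
    for rank in range(n):                             # stage 2: host-side sums
        lo, hi = rank * chunk, (rank + 1) * chunk
        reduced.append([sum(host[d][i] for d in range(n)) for i in range(lo, hi)])
    return reduced                                    # stage 3: chunk r -> device r

bufs = [[1, 2, 3, 4], [10, 20, 30, 40]]
assert host_staged_reduce_scatter(bufs) == [[11, 22], [33, 44]]
```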
Weight caching strategy: Contrary to traditional ZeRO recommendations, LLMQ demonstrates that on consumer hardware without direct GPU-to-GPU communication, enabling sharded model weights before sharded gradients reduces total communication by caching weights on the host memory after the first forward pass.
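The communication saving from host-side weight caching can be seen with back-of-envelope accounting. The model and numbers below are purely illustrative assumptions, not LLMQ's measured traffic.

```python
def gather_bytes_per_step(param_bytes, n_gpus, cache_weights_on_host):
    # Toy communication model (hypothetical, not LLMQ's accounting) for
    # sharded weights without peer-to-peer: each step the full weights
    # must be assembled for forward and backward. Without host caching,
    # every device re-gathers the remote shards on each pass; with the
    # weights cached in host memory after the first forward, each device
    # just reads the full copy back (one host->device transfer).
    if cache_weights_on_host:
        return param_bytes
    remote_fraction = (n_gpus - 1) / n_gpus
    return 2 * param_bytes * remote_fraction  # gather for forward + backward

weights = 14e9 * 2  # e.g. a 14B model in BF16 (illustrative)
no_cache = gather_bytes_per_step(weights, 4, cache_weights_on_host=False)
cached = gather_bytes_per_step(weights, 4, cache_weights_on_host=True)
assert cached < no_cache
```

Under these assumptions the cached strategy moves roughly two thirds of the bytes of the naive one per step, which is why sharding weights before gradients pays off once peer-to-peer is unavailable.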
FP8 training without delayed scaling: LLMQ uses just-in-time tensor-level absmax-scaling for FP8 conversion, guaranteeing no value clipping even with rapidly changing tensor statistics, while achieving up to 55% speedup over BF16 training for sufficiently large models.
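Just-in-time absmax scaling can be sketched as follows: the scale is derived from the tensor's current absolute maximum so the largest value lands exactly at the edge of the FP8 dynamic range, so no value can clip, unlike delayed scaling, which reuses a scale from past iterations. This is a simplified model (real kernels also round mantissas to the FP8 grid, omitted here); 448 is the maximum finite value of the FP8 E4M3 format.

```python
def fp8_absmax_scale(tensor, fp8_max=448.0):
    # Just-in-time tensor-level absmax scaling (sketch, not LLMQ's CUDA
    # kernel): recompute the scale from the tensor's current statistics
    # at every conversion. Because scale = fp8_max / absmax, the largest
    # element maps exactly to fp8_max and nothing exceeds the FP8 range,
    # even if the tensor's statistics changed abruptly since last step.
    absmax = max(abs(x) for x in tensor) or 1.0
    scale = fp8_max / absmax
    scaled = [x * scale for x in tensor]  # values that would be cast to FP8
    assert all(abs(s) <= fp8_max for s in scaled)  # guaranteed: no clipping
    return scaled, scale

scaled, scale = fp8_absmax_scale([0.1, -3.5, 2.0])
assert scale == 448.0 / 3.5
```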
Efficiency exceeding professional hardware: On 32B model training, LLMQ achieves 51% MFU on 4× RTX 4090 compared to only 29% MFU on 4× professional L40S GPUs, demonstrating that optimized software can overcome hardware limitations.