This repository provides an extended implementation of DiLoCo (Distributed Low-Communication Training) and several communication-efficient optimizers for large-scale model training.
It is part of the broader exalsius stack, which enables scheduling and orchestrating distributed training workloads across geo-distributed GPU resources.
Traditional large-model training assumes high-bandwidth interconnects within data centers.
This work explores how to train effectively across heterogeneous, geographically distributed clusters by reducing synchronization frequency and communication volume between model replicas.
- Extends the original DiLoCo implementation with additional optimizers and momentum compression techniques
- Supports training across heterogeneous GPUs with varying compute capabilities and memory configurations
- Integrates seamlessly into the exalsius framework for cross-cluster and cross-cloud scheduling
- Reduces communication cost by combining infrequent synchronization with frequency-based momentum decomposition (see the sketch after this list)
- Supports transformer-based and CNN architectures for NLP and vision workloads
- To be published at the NeurIPS 2025 DynaFront Workshop (preprint)
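
At its core, the DiLoCo approach is a two-level optimization: each replica takes many local optimizer steps between synchronizations, and an outer momentum optimizer applies the averaged parameter deltas ("pseudo-gradients") to the shared model. The following is a minimal, single-process sketch of that pattern; the toy model, hyperparameters, and in-process replica loop are illustrative stand-ins, not the repository's implementation.

```python
# Minimal single-process sketch of DiLoCo-style two-level optimization.
# K worker replicas each take H local AdamW steps; only every H steps are the
# accumulated parameter deltas ("pseudo-gradients") averaged and applied by an
# outer SGD-with-Nesterov-momentum optimizer. All names and hyperparameters
# below are illustrative, not the repository's defaults.
import copy
import torch

def make_model():
    return torch.nn.Linear(16, 1)

K, H, OUTER_ROUNDS = 4, 50, 10          # replicas, inner steps, outer rounds
global_model = make_model()
outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7,
                            momentum=0.9, nesterov=True)

for _ in range(OUTER_ROUNDS):
    # Each replica starts from the current global parameters.
    deltas = [torch.zeros_like(p) for p in global_model.parameters()]
    for _ in range(K):
        local_model = copy.deepcopy(global_model)
        inner_opt = torch.optim.AdamW(local_model.parameters(), lr=1e-3)
        for _ in range(H):                  # communication-free inner loop
            x = torch.randn(32, 16)
            y = torch.randn(32, 1)
            loss = torch.nn.functional.mse_loss(local_model(x), y)
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()
        # Accumulate this replica's averaged parameter delta (global - local).
        for d, pg, pl in zip(deltas, global_model.parameters(),
                             local_model.parameters()):
            d += (pg.detach() - pl.detach()) / K
    # Outer step: treat the averaged delta as a gradient. In a real
    # geo-distributed run this averaging is the only communication
    # (e.g. an all-reduce every H steps).
    outer_opt.zero_grad()
    for p, d in zip(global_model.parameters(), deltas):
        p.grad = d
    outer_opt.step()
```

Because the per-replica delta averaging is the only exchange between workers, increasing the number of local steps H directly reduces synchronization frequency and, with it, communication volume.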
This implementation supports a variety of architectures across domains:
| Domain | Model | Description |
|---|---|---|
| Vision | BigGAN | Generative Adversarial Network for high-fidelity image synthesis |
| Vision | ResNet | Convolutional neural network for image classification and feature extraction |
| Language | GPT-Neo | Transformer-based autoregressive language model |
| Language | GPT-NeoX | Large-scale, distributed GPT variant optimized for scalability |
| Speech | Wav2Vec 2.0 | Self-supervised speech representation model |
Additional models can be integrated in `diloco_training/models/`.
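
The exact integration hook depends on the package layout, so treat the following as a loose, hypothetical illustration of a new model module placed under `diloco_training/models/` (the module name, the `build_model` entry point, and its signature are assumptions; follow the conventions of the existing model modules when adding a real architecture).

```python
# Hypothetical example of a new model module, e.g.
# diloco_training/models/tiny_cnn.py. The build_model() entry point and its
# signature are assumptions for illustration only; mirror the existing
# modules in diloco_training/models/ for an actual integration.
import torch.nn as nn

def build_model(num_classes: int = 10) -> nn.Module:
    """Return a small CNN for vision workloads."""
    return nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(32, num_classes),
    )
```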
The following optimizers are (or will be) supported in this repository:
| Optimizer | Status | Description |
|---|---|---|
| DiLoCo | ✅ | Distributed Low-Communication baseline optimizer |
| DCT-Momentum | ✅ | Momentum decomposition via Discrete Cosine Transform (DCT); see the sketch below the table |
| TBA | ⏳ | Additional optimizers under development (to be announced) |
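
DCT-Momentum follows the decoupled-momentum idea of splitting the momentum buffer in the frequency domain: only the strongest DCT components are communicated, while the remainder stays on the worker as a local residual. The sketch below illustrates that compression step in isolation; it is not the repository's optimizer, and the top-k budget, tensor shape, and function names are illustrative.

```python
# Illustrative sketch of frequency-based momentum compression: keep only the
# top-k DCT coefficients of a momentum tensor for communication and retain
# the rest locally as a residual. Not the repository's implementation.
import numpy as np
from scipy.fft import dct, idct

def compress_momentum(momentum: np.ndarray, k: int):
    """Split momentum into a k-coefficient 'fast' part and a local residual."""
    coeffs = dct(momentum.ravel(), norm="ortho")
    # Keep the k largest-magnitude frequency components.
    idx = np.argsort(np.abs(coeffs))[-k:]
    kept = np.zeros_like(coeffs)
    kept[idx] = coeffs[idx]
    fast = idct(kept, norm="ortho").reshape(momentum.shape)   # transmitted
    residual = momentum - fast                                # stays local
    return (idx, coeffs[idx]), fast, residual

momentum = np.random.randn(1024).astype(np.float32)
(payload_idx, payload_vals), fast, residual = compress_momentum(momentum, k=64)
# Only `payload_idx` and `payload_vals` (64 indices + values) would need to be
# communicated instead of the full 1024-element momentum buffer.
print(payload_vals.shape, np.linalg.norm(residual) / np.linalg.norm(momentum))
```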
This repository can be used standalone or as part of the exalsius distributed AI platform, which coordinates and scales training workloads across multiple geo-distributed GPU resources.
Within that context, training jobs can be:
- Scheduled automatically across geographically distributed GPU nodes
- Monitored through exalsius observability components
- Executed efficiently on heterogeneous infrastructures with low-bandwidth interconnects
This enables scalable, communication-efficient training beyond the boundaries of traditional data centers.
For more details on the exalsius platform, visit the exalsius documentation.
Before getting started, ensure you have the following installed:
- Python 3.12 — Required for running the application and dependencies
- uv — Dependency management and packaging tool
To maintain code quality and enforce consistent style, we suggest using a pre-commit hook. Install and use it as follows:

```bash
# Install the pre-commit hook
pre-commit install

# (Optional) Run the hooks manually on all files
pre-commit run --all-files
```

The Makefile includes several targets to streamline common development and deployment tasks:
| Target | Description | Command |
|---|---|---|
| format | Format the codebase using black + isort | `make format` |
| lint | Run the ruff linter to check for code style issues | `make lint` |
| test | Run the formatter, linter, and test suite | `make test` |
| build | Build the Docker image | `make build` |
```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

# Create and activate a virtual environment
uv venv
source .venv/bin/activate

# Install dependencies (dev + test)
uv sync --dev --extra test
```

To execute the test suite after setting up uv:
```bash
uv run --dev pytest
```

An initial version of this implementation was used for the following publication. If you use this code in your research, please cite:
```bibtex
@article{nedelkoski2025distributed,
  title={Distributed Low-Communication Training with Decoupled Momentum Optimization},
  author={Nedelkoski, Sasho and Acker, Alexander and Kao, Odej and Becker, Soeren and Scheinert, Dominik},
  journal={NeurIPS 2025 - DynaFront 2025: Dynamics at the Frontiers of Optimization, Sampling, and Games Workshop},
  year={2025}
}
```
