High Performance · Easy to Use · Built for Gaming GPUs
Documentation · 中文文档 · Benchmarks · Examples
RoundPipe is a large-model DNN training framework that lets you train huge models on consumer-grade GPUs. On a single 24 GB GPU, you can run full fine-tuning of 32B-parameter models, LoRA fine-tuning of models up to 235B, and sequences of 64K+ tokens, with throughput approaching datacenter-class hardware.
- Train bigger than ever: Full fine-tune 32B models or LoRA fine-tune up to 235B on a single 24 GB GPU. Up to 7× the maximum sequence length of PyTorch FSDP.
- High performance: Push an RTX 4090 close to A800 NVLink-class throughput. Up to 6× faster than FSDP Offload in typical workloads.
- Linear multi-GPU scaling: Scale to multiple GPUs within a node without rewriting your training loop. Throughput grows linearly while max sequence length per GPU stays unchanged.
- Feels like PyTorch: Sequential programming interface with a low learning curve. Works well in Jupyter Notebook for rapid iteration.
- General by design: No constraints on layer structure, training flow, or parameter update strategy.
- Portable across accelerators: Pure PyTorch implementation. Runs on NVIDIA, AMD, and Ascend platforms.
All benchmarks below are measured on a single node with 8 GPUs. "OOM" means the framework cannot fit the model under that configuration.
Maximum sequence length (tokens):

| GPU · Framework | Qwen3-1.7B | Llama3.1-8B | Qwen3-32B | Qwen3-235B (LoRA) |
|---|---|---|---|---|
| 4090 · FSDP Offload | 11 K | 11 K | OOM | OOM |
| 4090 · RoundPipe | 73 K | 49 K | 28 K | 31 K |
| A800 · FSDP | 39 K | 29 K | 11 K | OOM |
| A800 · RoundPipe | 288 K | 226 K | 126 K | 118 K |
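The headline sequence-length claim can be sanity-checked against the table above. The snippet below is illustrative arithmetic on the published (rounded) table values, not an additional measurement:

```python
# Max sequence length (tokens) on the RTX 4090, taken from the table above.
fsdp_offload_len = 11_000  # Qwen3-1.7B, FSDP Offload
roundpipe_len = 73_000     # Qwen3-1.7B, RoundPipe

ratio = roundpipe_len / fsdp_offload_len
print(f"RoundPipe supports ~{ratio:.1f}x the sequence length of FSDP Offload")
# → ~6.6x with these rounded values
```

With the rounded table values the ratio comes out near 6.6×; the "up to 7×" headline presumably reflects unrounded measurements or a different configuration.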
Training throughput:

| GPU · Framework | Qwen3-1.7B | Llama3.1-8B | Qwen3-32B | Qwen3-235B (LoRA) |
|---|---|---|---|---|
| 4090 · FSDP Offload | 35,074 | 4,071 | OOM | OOM |
| 4090 · RoundPipe | 65,417 | 24,275 | 5,516 | 1,820 |
| A800 · FSDP | 85,829 | 29,148 | 3,455 | OOM |
| A800 · RoundPipe | 84,692 | 28,427 | 6,301 | 1,796 |
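The "up to 6× faster than FSDP Offload" claim corresponds directly to the throughput table above. The snippet below just re-derives that ratio from the table; it is not a separate benchmark:

```python
# Throughput on the RTX 4090 for Llama3.1-8B, taken from the table above.
fsdp_offload = 4_071
roundpipe = 24_275

speedup = roundpipe / fsdp_offload
print(f"RoundPipe is {speedup:.2f}x faster than FSDP Offload")
# → 5.96x, i.e. the "up to 6x" headline
```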
RoundPipe throughput scaling with GPU count:

| GPUs | Qwen3-1.7B | Llama3.1-8B | Qwen3-32B | Qwen3-235B (LoRA) |
|---|---|---|---|---|
| 1 | 8,881 | 3,142 | 740 | 480 |
| 2 | 17,026 | 6,259 | 1,476 | 808 |
| 4 | 33,178 | 12,278 | 2,897 | 1,281 |
| 8 | 65,417 | 24,275 | 5,516 | 1,820 |
Max sequence length per GPU stays constant across all GPU counts (73 K, 49 K, 28 K, and 31 K tokens for the four models, respectively).
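The linear-scaling claim can be quantified from the table above by comparing each throughput figure to ideal linear scaling from the single-GPU number. This is derived arithmetic on the table's Qwen3-1.7B column, not a separate measurement:

```python
# RoundPipe throughput for Qwen3-1.7B at each GPU count, from the table above.
throughput = {1: 8_881, 2: 17_026, 4: 33_178, 8: 65_417}

for n, t in throughput.items():
    # Fraction of ideal linear scaling relative to the single-GPU baseline.
    eff = t / (n * throughput[1])
    print(f"{n} GPU(s): {eff:.0%} of linear")
```

Scaling efficiency stays above 90% out to 8 GPUs (about 92% at 8), which is what the "throughput grows linearly" bullet refers to.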
RoundPipe throughput across accelerator platforms:

| Device | Qwen3-1.7B | Llama3.1-8B | Qwen3-32B | Qwen3-235B (LoRA) |
|---|---|---|---|---|
| AMD W7800 | 17,852 | 5,915 | 1,450 | 665 |
| Ascend 910B | 50,599 | 23,253 | 5,028 | 459 |
| RTX 4090 | 65,417 | 24,275 | 5,516 | 1,820 |
Install with pip:

```bash
pip install roundpipe
```

Requirements: Python ≥ 3.8, PyTorch ≥ 2.4.
See the example/ directory for interactive Jupyter notebooks.
Full documentation is available at itcarrot.github.io/RoundPipe.
Chinese documentation is available at itcarrot.github.io/RoundPipe/index.zh.html.
RoundPipe is licensed under the LGPL-3.0.