CPU adaptation of Karpathy's autoresearch — autonomous ML research on commodity hardware, no GPU required.
The original autoresearch requires an NVIDIA GPU (tested on H100). This fork removes that requirement entirely, letting you run autonomous ML research experiments on any machine with a CPU. Perfect for learning, experimentation, and tinkering without cloud GPU costs.
- Features
- What Changed vs Original
- Requirements
- Quick Start
- Benchmark
- How It Works
- Tuning for Your Hardware
- Key Findings
- Contributing
- License
- Zero GPU requirement — Runs on any x86_64 CPU
- Auto-detection — Uses CUDA if available, falls back to CPU seamlessly
- CPU-tuned defaults — Optimized batch size, sequence length, and time budget for CPU
- Backward compatible — All changes work on both CPU and CUDA
- Minimal dependencies —
torch+cpuonly, zero NVIDIA packages needed
| Area | Original | This fork |
|---|---|---|
| Attention | Flash Attention 3 (kernels package) |
PyTorch native scaled_dot_product_attention |
| Mixed precision | bf16 autocast (CUDA) | fp32 (bf16 autocast causes ~400x backward slowdown on CPU) |
| Compilation | torch.compile with CUDA backend |
Disabled on CPU (inductor backend crashes) |
| Device | Hardcoded cuda |
Auto-detected via get_device() — works on both CPU and CUDA |
| Defaults | DEPTH=8, SEQ_LEN=2048, BATCH=128 | DEPTH=4, SEQ_LEN=512, BATCH=8 (tuned for CPU) |
| Time budget | 5 min | 30 min (compensates for slower hardware) |
| Sliding window | SSSL pattern via FA3 | Full causal attention (no sliding window) |
| Dependencies | torch[cu128] + kernels |
torch+cpu (explicit CPU-only index, zero NVIDIA packages) |
All changes are backward-compatible — if a CUDA GPU is detected, the code uses CUDA paths automatically.
- CPU: Any x86_64 processor (tested on Intel Core Ultra 7 265, 20 cores)
- RAM: 8GB minimum, 16GB+ recommended
- Disk: ~2GB for dataset + dependencies
- Python: 3.10+
- No GPU needed
# 1. Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Install dependencies
uv sync
# 3. Download data and train tokenizer (one-time, ~2 min)
uv run prepare.py --num-shards 4
# 4. Run a training experiment (~40 min: 30 min training + eval)
uv run train.pyIntel Core Ultra 7 265, 20 cores, 128GB RAM:
| Metric | Value |
|---|---|
| val_bpb | 1.955 |
| Training time | 307s (5 min budget) |
| Step time | ~12s |
| Throughput | ~5,000 tok/sec |
| Peak memory | 2.2 GB |
| Parameters | 11.5M |
| Steps completed | 32 |
With the default 30 min TIME_BUDGET, expect ~140 steps and better val_bpb (~1.5-1.7).
Same as the original — three files:
prepare.py— data prep, tokenizer, dataloader, evaluation metric. Not modified by the agent.train.py— GPT model, Muon+AdamW optimizer, training loop. The agent edits this file.program.md— instructions for the AI agent.
Point your AI agent (Claude, GPT, Codex, etc.) at program.md and let it run experiments autonomously. Each experiment modifies train.py, trains for the time budget, checks val_bpb, and keeps or discards the change.
If training is too slow or you want faster iteration:
| Parameter | File | Default | Try |
|---|---|---|---|
DEPTH |
train.py | 4 | 2 (smaller model, faster steps) |
MAX_SEQ_LEN |
prepare.py | 512 | 256 (4x less attention compute) |
DEVICE_BATCH_SIZE |
train.py | 8 | 4 or 2 (less memory per step) |
TIME_BUDGET |
prepare.py | 1800 | 300 (quick 5 min test) |
TOTAL_BATCH_SIZE |
train.py | 2^16 | 2^14 (fewer grad accum steps) |
For much smaller hardware, consider using TinyStories dataset — lower entropy text produces better results with tiny models.
bf16 autocast on CPU is broken for training. torch.amp.autocast(device_type="cpu", dtype=torch.bfloat16) causes ~400x backward pass slowdown. Forward pass is fine, but autograd backward is catastrophically slow. This fork uses fp32 with contextlib.nullcontext() instead.
torch.compile inductor crashes on CPU. The inductor backend fails during C++ code generation for this model architecture. This fork skips compilation on CPU entirely. Training still works in eager mode.
- Tokenizer cache integrity: The tokenizer is stored as a Python pickle at
~/.cache/autoresearch/tokenizer/tokenizer.pkl. A SHA-256 hash is written alongside it at creation time and verified on every load. Do not import tokenizer caches from untrusted sources. - Data provenance: Dataset shards are downloaded from HuggingFace over HTTPS. No content-integrity checksums are verified beyond TLS.
- Autonomous execution: The
program.mdworkflow instructs AI agents to run indefinitely with local git operations. Always run on a dedicated branch in an isolated environment.
Contributions are welcome! Please open an issue first to discuss what you'd like to change.
- Fork the repository
- Create a feature branch (
git checkout -b feat/my-feature) - Test with
uv run train.pyto verify - Commit your changes
- Push to the branch and open a Pull Request
Apache-2.0 — same as the original.
- Karpathy's autoresearch — the original project
- nanochat — the training code this is based on
- miolini/autoresearch-macos (MacOS)
- trevin-creator/autoresearch-mlx (MacOS MLX)
- jsegov/autoresearch-win-rtx (Windows)
- andyluo7/autoresearch (AMD ROCm)