- RunPod account (runpod.io) with credits loaded ($25 minimum)
- `runpodctl` CLI installed for file transfers

Generate an SSH key if you don't have one:

```bash
ssh-keygen -t ed25519
```

Copy your public key and paste it into RunPod dashboard > Settings > SSH Public Keys:

```bash
cat ~/.ssh/id_ed25519.pub
```

Install and configure the CLI:

```bash
brew install runpod/runpodctl/runpodctl
runpodctl config --apiKey <YOUR_API_KEY>
```

Via the RunPod dashboard (runpod.io):

- Click Deploy > GPU Pod
- Select GPU: H100 SXM 80GB (or A100 80GB)
- Template: `runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04` (torch + CUDA pre-installed)
- Container disk: 70-80GB (the default 20GB is too small — models + packages won't fit)
- Volume: 50GB persistent disk (mounts at `/workspace`, used for final results only)
- Click Deploy
Important: Use a PyTorch template so torch is pre-installed. Never `pip install torch` on the pod. The large container disk avoids relying on the slow `/workspace` network storage for day-to-day operations.
Package locally (on your Mac):

```bash
tar czf /tmp/bitnet-code.tar.gz \
  --exclude='.git' --exclude='models' --exclude='build' \
  --exclude='distill/checkpoints' --exclude='__pycache__' \
  -C ~/Documents/BitNet .
```

Transfer via runpodctl:

```bash
runpodctl send /tmp/bitnet-code.tar.gz
```

On the pod, receive it:

```bash
runpodctl receive <CODE>
```

Get the SSH command from the pod's page on the RunPod dashboard:

```bash
ssh <pod-id>@ssh.runpod.io -i ~/.ssh/id_ed25519
```

Note: RunPod's SSH proxy is interactive only (no rsync, no remote commands).
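If you prefer a scripted packager over the raw `tar` invocation, the same exclusion rules can be sketched with the stdlib `tarfile` module. This is an illustration, not part of the repo; only the exclusion list is taken from the command above:

```python
import os
import pathlib
import tarfile
import tempfile

# Exclusion list copied from the tar command above
EXCLUDES = {".git", "models", "build", "distill/checkpoints", "__pycache__"}

def should_skip(relpath: str) -> bool:
    """True if any excluded path is a path-prefix of this member."""
    parts = pathlib.PurePosixPath(relpath).parts
    for ex in EXCLUDES:
        ex_parts = pathlib.PurePosixPath(ex).parts
        if parts[:len(ex_parts)] == ex_parts:
            return True
    return False

def pack(src: str, dest: str) -> list:
    """Create a gzipped tarball of src honoring EXCLUDES; returns kept members."""
    kept = []
    with tarfile.open(dest, "w:gz") as tar:
        for path in sorted(pathlib.Path(src).rglob("*")):
            rel = path.relative_to(src).as_posix()
            if should_skip(rel):
                continue
            tar.add(path, arcname=rel, recursive=False)
            kept.append(rel)
    return kept
```

Matching on path components (rather than substrings) mirrors how `tar --exclude` treats `distill/checkpoints` as a directory prefix.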
Key principle: Keep everything on the local container disk (`/root`). The `/workspace` volume is network-attached and painfully slow for pip installs, model downloads, and general I/O. Only use `/workspace` to copy final results for persistence.
```bash
# 1. Unpack code to local disk
mkdir -p /root/BitNet && cd /root/BitNet && tar xzf ~/bitnet-code.tar.gz

# 2. Create venv with system site packages (inherits pre-installed torch)
python3 -m venv --system-site-packages venv
source venv/bin/activate

# 3. Install remaining packages (torch already in template)
pip install transformers datasets safetensors

# 4. Verify
python3 -c "import torch; print(torch.cuda.is_available())"
nvidia-smi
```

| Parameter | 0.5B (validation) | 3B Model | 7B Model | Why |
|---|---|---|---|---|
| `--batch_size` | 8 | 4 | 2 | Largest that fits in VRAM |
| `--accumulation_steps` | 4 | 8 | 16 | Effective batch 32 |
| `--max_length` | 256 | 1024 | 512 | Longer = better quality (drop if OOM) |
| `--max_steps` | 5000 | 10000 | 10000 | Ternary weights need many steps |
| `--lr` | 5e-4 | 5e-4 | 5e-4 | Higher than default for ternary |
| `--tau` | 2.0 | 2.0 | 2.0 | Sharper teacher signal |
| `--save_every` | 500 | 500 | 500 | Checkpoints are ~12GB each |
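The `--batch_size` / `--accumulation_steps` pairs in the table all multiply out to the same effective batch of 32, since gradients are accumulated across micro-batches before each optimizer step. A quick sanity check:

```python
# (batch_size, accumulation_steps) per model size, from the table above
configs = {"0.5B": (8, 4), "3B": (4, 8), "7B": (2, 16)}

for name, (batch, accum) in configs.items():
    effective = batch * accum  # micro-batches accumulated per optimizer step
    print(f"{name}: {batch} x {accum} = {effective}")
    assert effective == 32
```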
Key insights:
- `tau=2.0` >> `tau=5.0` — the sharper distillation signal works much better through the ternary bottleneck
- `lr=5e-4` >> `lr=1e-4` — ternary weights need a stronger gradient signal
- Always validate the pipeline with 0.5B first (~45 min) before expensive 3B/7B runs
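Why a lower `tau` means a sharper signal: distillation targets are `softmax(logits / tau)`, so a smaller temperature yields a lower-entropy (more peaked) teacher distribution. A self-contained illustration with made-up logits:

```python
import math

def softmax_with_temperature(logits, tau):
    """Distillation target: softmax of temperature-scaled logits."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [4.0, 2.0, 1.0, 0.5]  # made-up teacher logits
sharp = entropy(softmax_with_temperature(logits, tau=2.0))
soft = entropy(softmax_with_temperature(logits, tau=5.0))
print(f"entropy at tau=2.0: {sharp:.3f}, at tau=5.0: {soft:.3f}")
assert sharp < soft  # lower tau -> sharper (lower-entropy) teacher distribution
```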
Validation run (0.5B teacher):

```bash
cd /root/BitNet && nohup python3 distill/distill.py \
  --device cuda \
  --teacher_model Qwen/Qwen2.5-0.5B \
  --dataset alpaca \
  --batch_size 8 \
  --accumulation_steps 4 \
  --max_length 256 \
  --max_steps 5000 \
  --lr 5e-4 \
  --tau 2.0 \
  --save_every 500 \
  > distill_log.txt 2>&1 &
```

Test checkpoints to verify quality before investing in longer runs:

```bash
python distill/test_checkpoint.py distill/checkpoints/step_1000.pt -n 50 -t 0.7 -r 1.3
python distill/inspect_logits.py distill/checkpoints/step_1000.pt
```

3B run:

```bash
cd /root/BitNet && nohup python3 distill/distill.py \
  --device cuda \
  --teacher_model Qwen/Qwen2.5-3B \
  --student_model Qwen/Qwen2.5-3B \
  --dataset slimorca \
  --batch_size 4 \
  --accumulation_steps 8 \
  --max_length 1024 \
  --max_steps 10000 \
  --lr 5e-4 \
  --tau 2.0 \
  --save_every 500 \
  > distill_log.txt 2>&1 &
```

7B run:

```bash
cd /root/BitNet && nohup python3 distill/distill.py \
  --device cuda \
  --teacher_model Qwen/Qwen2.5-7B \
  --student_model Qwen/Qwen2.5-7B \
  --dataset slimorca \
  --batch_size 2 \
  --accumulation_steps 16 \
  --max_length 512 \
  --epochs 2 \
  --max_steps 5000 \
  --lr 1e-4 \
  --save_every 500 \
  > distill_log.txt 2>&1 &
```

The script auto-detects multiple GPUs via FSDP. Launch with torchrun:

```bash
cd /root/BitNet && nohup torchrun --nproc_per_node=<NUM_GPUS> distill/distill.py \
  --device cuda \
  --teacher_model Qwen/Qwen2.5-3B \
  --student_model Qwen/Qwen2.5-3B \
  --dataset slimorca \
  --batch_size 8 \
  --accumulation_steps 2 \
  --max_length 512 \
  --epochs 2 \
  --max_steps 5000 \
  --lr 1e-4 \
  --save_every 500 \
  > distill_log.txt 2>&1 &
```

Monitor progress:

```bash
# Raw scrolling log
tail -f /root/BitNet/distill_log.txt

# Live color terminal dashboard (curses, press q to quit)
python distill/dashboard.py distill/training_log.jsonl

# Generate PNG plot of training curves
pip install matplotlib  # first time only
python distill/plot.py distill/training_log.jsonl -o distill/training_plot.png

# GPU stats
nvidia-smi
```

From the pod:

```bash
runpodctl send /root/BitNet/distill/checkpoints/final.pt
runpodctl send /root/BitNet/distill/training_log.jsonl  # optional, for plotting
```

On your local machine:

```bash
runpodctl receive <CODE>

# One command: export → GGUF → quantize → chat
python distill/deploy.py distill/checkpoints/final.pt
```

Important: Stop the pod from the RunPod dashboard to avoid ongoing charges.

- Stop: Preserves the `/workspace` volume, stops billing for the GPU.
- Terminate: Deletes everything except the persistent volume.
| GPU | Price/hr | 3B→3B (5k steps) | 7B→7B (5k steps) |
|---|---|---|---|
| A100 80GB | ~$1.04 | ~$4-8 | ~$12-24 |
| H100 SXM 80GB | ~$2.49 | ~$5-10 | ~$10-20 |
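A back-of-envelope cost check consistent with the table; the rates are the per-hour prices above, and the run hours are your own estimate:

```python
def run_cost(rate_per_hr: float, hours: float) -> float:
    """Simple rate x time estimate, rounded to cents."""
    return round(rate_per_hr * hours, 2)

# e.g. a 3B run on A100 at ~$1.04/hr taking roughly 4-8 hours:
print(run_cost(1.04, 4), "-", run_cost(1.04, 8))  # 4.16 - 8.32, matching ~$4-8
```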
Slow pip install / disk ops: You're installing to `/workspace` (network storage). Everything should be on `/root` (local container disk). Set the container disk to 70-80GB when creating the pod.

"No module named 'torch'" / slow training / disk full: Never `pip install torch` — the template already has it. Use `python3 -m venv --system-site-packages venv` so the venv inherits the pre-installed CUDA-optimized torch. Installing torch manually wastes ~5GB of disk and may install a CPU-only or mismatched CUDA build, causing 2x slower training.
OOM errors: Reduce `--batch_size` (try 4 or 2) or `--max_length` (try 256).

Pod disconnected: Training continues if launched with `nohup ... &`. Reconnect and `tail -f /root/BitNet/distill_log.txt`.

Not enough disk space: The container disk defaults to 20GB, which is too small. Set it to 70-80GB when creating the pod so models, packages, and checkpoints fit on fast local storage.