- RunPod account (runpod.io) with credits loaded ($25 minimum)
- `runpodctl` CLI installed for file transfers

Generate an SSH key if you don't have one:

```bash
ssh-keygen -t ed25519
```

Copy your public key and paste it into RunPod dashboard > Settings > SSH Public Keys:

```bash
cat ~/.ssh/id_ed25519.pub
```

Install and configure the CLI:

```bash
brew install runpod/runpodctl/runpodctl
runpodctl config --apiKey <YOUR_API_KEY>
```

Via the RunPod dashboard (runpod.io):

- Click Deploy > GPU Pod
- Select GPU: H100 SXM 80GB (or A100 80GB)
- Template: `runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04` (torch + CUDA pre-installed)
- Container disk: 70-80GB (the default 20GB is too small — models + packages won't fit)
- Volume: 50GB persistent disk (mounts at `/workspace`, used for final results only)
- Click Deploy
Important: Use a PyTorch template so torch is pre-installed. Never `pip install torch` on the pod. The large container disk avoids relying on the slow `/workspace` network storage for day-to-day operations.
Package locally (on your Mac):

```bash
tar czf /tmp/bitnet-code.tar.gz \
  --exclude='.git' --exclude='models' --exclude='build' \
  --exclude='distill/checkpoints' --exclude='__pycache__' \
  -C ~/Documents/BitNet .
```

Transfer via runpodctl:

```bash
runpodctl send /tmp/bitnet-code.tar.gz
```

On the pod, receive it:

```bash
runpodctl receive <CODE>
```

Get the SSH command from the pod's page on the RunPod dashboard:

```bash
ssh <pod-id>@ssh.runpod.io -i ~/.ssh/id_ed25519
```

Note: RunPod's SSH proxy is interactive only (no rsync, no remote commands).
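If you prefer a scripted packager over the raw `tar` invocation, the same exclusion rules can be sketched with the stdlib `tarfile` module. This is an illustration, not part of the repo; only the exclusion list is taken from the command above:

```python
import os
import pathlib
import tarfile
import tempfile

# Exclusion list copied from the tar command above
EXCLUDES = {".git", "models", "build", "distill/checkpoints", "__pycache__"}

def should_skip(relpath: str) -> bool:
    """True if any excluded path is a path-prefix of this member."""
    parts = pathlib.PurePosixPath(relpath).parts
    for ex in EXCLUDES:
        ex_parts = pathlib.PurePosixPath(ex).parts
        if parts[:len(ex_parts)] == ex_parts:
            return True
    return False

def pack(src: str, dest: str) -> list:
    """Create a gzipped tarball of src honoring EXCLUDES; returns kept members."""
    kept = []
    with tarfile.open(dest, "w:gz") as tar:
        for path in sorted(pathlib.Path(src).rglob("*")):
            rel = path.relative_to(src).as_posix()
            if should_skip(rel):
                continue
            tar.add(path, arcname=rel, recursive=False)
            kept.append(rel)
    return kept
```

Matching on path components (rather than substrings) mirrors how `tar --exclude` treats `distill/checkpoints` as a directory prefix.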
Key principle: Keep everything on the local container disk (`/root`). The `/workspace` volume is network-attached and painfully slow for pip installs, model downloads, and general I/O. Only use `/workspace` to copy final results for persistence.
```bash
# 1. Unpack code to local disk
mkdir -p /root/BitNet && cd /root/BitNet && tar xzf ~/bitnet-code.tar.gz

# 2. Create venv with system site packages (inherits pre-installed torch)
python3 -m venv --system-site-packages venv
source venv/bin/activate

# 3. Install remaining packages (torch already in template)
pip install transformers datasets safetensors

# 4. Verify
python3 -c "import torch; print(torch.cuda.is_available())"
nvidia-smi
```

| Parameter | 0.5B (validation) | 3B Model | 7B Model | Why |
|---|---|---|---|---|
| `--batch_size` | 8 | 4 | 2 | Largest that fits in VRAM |
| `--accumulation_steps` | 4 | 8 | 16 | Effective batch 32 |
| `--max_length` | 256 | 1024 | 512 | Longer = better quality (drop if OOM) |
| `--max_steps` | 5000 | 10000 | 10000 | Ternary weights need many steps |
| `--lr` | 5e-4 | 5e-4 | 5e-4 | Higher than default for ternary |
| `--tau` | 2.0 | 2.0 | 2.0 | Sharper teacher signal |
| `--save_every` | 500 | 500 | 500 | Checkpoints are ~12GB each |
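The `--batch_size` / `--accumulation_steps` pairs in the table all multiply out to the same effective batch of 32, since gradients are accumulated across micro-batches before each optimizer step. A quick sanity check:

```python
# (batch_size, accumulation_steps) per model size, from the table above
configs = {"0.5B": (8, 4), "3B": (4, 8), "7B": (2, 16)}

for name, (batch, accum) in configs.items():
    effective = batch * accum  # micro-batches accumulated per optimizer step
    print(f"{name}: {batch} x {accum} = {effective}")
    assert effective == 32
```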
Key insights:
- `tau=2.0` >> `tau=5.0` — the sharper distillation signal works much better through the ternary bottleneck
- `lr=5e-4` >> `lr=1e-4` — ternary weights need a stronger gradient signal
- Always validate the pipeline with 0.5B first (~45 min) before expensive 3B/7B runs
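Why a lower `tau` means a sharper signal: distillation targets are `softmax(logits / tau)`, so a smaller temperature yields a lower-entropy (more peaked) teacher distribution. A self-contained illustration with made-up logits:

```python
import math

def softmax_with_temperature(logits, tau):
    """Distillation target: softmax of temperature-scaled logits."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [4.0, 2.0, 1.0, 0.5]  # made-up teacher logits
sharp = entropy(softmax_with_temperature(logits, tau=2.0))
soft = entropy(softmax_with_temperature(logits, tau=5.0))
print(f"entropy at tau=2.0: {sharp:.3f}, at tau=5.0: {soft:.3f}")
assert sharp < soft  # lower tau -> sharper (lower-entropy) teacher distribution
```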
Validation run (0.5B teacher):

```bash
cd /root/BitNet && nohup python3 distill/distill.py \
  --device cuda \
  --teacher_model Qwen/Qwen2.5-0.5B \
  --dataset alpaca \
  --batch_size 8 \
  --accumulation_steps 4 \
  --max_length 256 \
  --max_steps 5000 \
  --lr 5e-4 \
  --tau 2.0 \
  --save_every 500 \
  > distill_log.txt 2>&1 &
```

Test checkpoints to verify quality before investing in longer runs:

```bash
python distill/test_checkpoint.py distill/checkpoints/step_1000.pt -n 50 -t 0.7 -r 1.3
python distill/inspect_logits.py distill/checkpoints/step_1000.pt
```

3B run:

```bash
cd /root/BitNet && nohup python3 distill/distill.py \
  --device cuda \
  --teacher_model Qwen/Qwen2.5-3B \
  --student_model Qwen/Qwen2.5-3B \
  --dataset slimorca \
  --batch_size 4 \
  --accumulation_steps 8 \
  --max_length 1024 \
  --max_steps 10000 \
  --lr 5e-4 \
  --tau 2.0 \
  --save_every 500 \
  > distill_log.txt 2>&1 &
```

7B run:

```bash
cd /root/BitNet && nohup python3 distill/distill.py \
  --device cuda \
  --teacher_model Qwen/Qwen2.5-7B \
  --student_model Qwen/Qwen2.5-7B \
  --dataset slimorca \
  --batch_size 2 \
  --accumulation_steps 16 \
  --max_length 512 \
  --epochs 2 \
  --max_steps 5000 \
  --lr 1e-4 \
  --save_every 500 \
  > distill_log.txt 2>&1 &
```

The script auto-detects multiple GPUs via FSDP. Launch with torchrun:

```bash
cd /root/BitNet && nohup torchrun --nproc_per_node=<NUM_GPUS> distill/distill.py \
  --device cuda \
  --teacher_model Qwen/Qwen2.5-3B \
  --student_model Qwen/Qwen2.5-3B \
  --dataset slimorca \
  --batch_size 8 \
  --accumulation_steps 2 \
  --max_length 512 \
  --epochs 2 \
  --max_steps 5000 \
  --lr 1e-4 \
  --save_every 500 \
  > distill_log.txt 2>&1 &
```

Monitor progress:

```bash
# Raw scrolling log
tail -f /root/BitNet/distill_log.txt

# Live color terminal dashboard (curses, press q to quit)
python distill/dashboard.py distill/training_log.jsonl

# Generate PNG plot of training curves
pip install matplotlib  # first time only
python distill/plot.py distill/training_log.jsonl -o distill/training_plot.png

# GPU stats
nvidia-smi
```

From the pod:

```bash
runpodctl send /root/BitNet/distill/checkpoints/final.pt
runpodctl send /root/BitNet/distill/training_log.jsonl  # optional, for plotting
```

On your local machine:

```bash
runpodctl receive <CODE>

# One command: export → GGUF → quantize → chat
python distill/deploy.py distill/checkpoints/final.pt
```

Important: Stop the pod from the RunPod dashboard to avoid ongoing charges.

- Stop: Preserves the `/workspace` volume, stops billing for the GPU.
- Terminate: Deletes everything except the persistent volume.
| GPU | Price/hr | 3B→3B (5k steps) | 7B→7B (5k steps) |
|---|---|---|---|
| A100 80GB | ~$1.04 | ~$4-8 | ~$12-24 |
| H100 SXM 80GB | ~$2.49 | ~$5-10 | ~$10-20 |
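A back-of-envelope cost check consistent with the table; the rates are the per-hour prices above, and the run hours are your own estimate:

```python
def run_cost(rate_per_hr: float, hours: float) -> float:
    """Simple rate x time estimate, rounded to cents."""
    return round(rate_per_hr * hours, 2)

# e.g. a 3B run on A100 at ~$1.04/hr taking roughly 4-8 hours:
print(run_cost(1.04, 4), "-", run_cost(1.04, 8))  # 4.16 - 8.32, matching ~$4-8
```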
Slow pip install / disk ops: You're installing to `/workspace` (network storage). Everything should be on `/root` (local container disk). Set the container disk to 70-80GB when creating the pod.

"No module named 'torch'" / slow training / disk full: Never `pip install torch` — the template already has it. Use `python3 -m venv --system-site-packages venv` so the venv inherits the pre-installed CUDA-optimized torch. Installing torch manually wastes ~5GB of disk and may install a CPU-only or mismatched CUDA build, causing 2x slower training.
OOM errors: Reduce `--batch_size` (try 4 or 2) or `--max_length` (try 256).

Pod disconnected: Training continues if launched with `nohup ... &`. Reconnect and `tail -f /root/BitNet/distill_log.txt`.

Not enough disk space: The container disk defaults to 20GB, which is too small. Set it to 70-80GB when creating the pod so models, packages, and checkpoints fit on fast local storage.