- AWS CLI installed and configured (`aws configure`)
- An AWS account with EC2 GPU instance quota (request a g5 quota increase if needed)
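To check your current quota from the CLI before launching, something like the following works (a sketch; quota code `L-DB2E81BA` for "Running On-Demand G and VT instances" is assumed, so confirm it in the Service Quotas console; the value is measured in vCPUs):

```bash
# Check the G/VT On-Demand vCPU quota in the current region
# (quota code L-DB2E81BA assumed; a g5.2xlarge needs 8 vCPUs)
aws service-quotas get-service-quota \
--service-code ec2 \
--quota-code L-DB2E81BA \
--query "Quota.{Name:QuotaName,vCPUs:Value}" \
--output table
```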
```bash
aws ec2 create-key-pair --key-name bitnet-distill --query 'KeyMaterial' --output text > ~/.ssh/bitnet-distill.pem
chmod 400 ~/.ssh/bitnet-distill.pem
```

Find your VPC ID:

```bash
aws ec2 describe-vpcs --query "Vpcs[0].VpcId" --output text
```

Create the security group (replace vpc-XXXXX with your VPC ID):

```bash
aws ec2 create-security-group \
--group-name bitnet-distill-sg \
--description "SSH access for BitNet distillation" \
--vpc-id vpc-XXXXX \
--query 'GroupId' --output text
```

Allow SSH access (replace sg-XXXXX with the security group ID returned above):

```bash
aws ec2 authorize-security-group-ingress \
--group-id sg-XXXXX \
--protocol tcp --port 22 --cidr 0.0.0.0/0
```
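Note: 0.0.0.0/0 opens SSH to the whole internet. A tighter variant (a sketch, assuming `curl` is available on your machine) restricts ingress to your current public IP:

```bash
# Allow SSH only from your current public IP
aws ec2 authorize-security-group-ingress \
--group-id sg-XXXXX \
--protocol tcp --port 22 \
--cidr "$(curl -s https://checkip.amazonaws.com)/32"
```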
Find the latest Deep Learning GPU AMI for Ubuntu 22.04:

```bash
aws ec2 describe-images \
--owners amazon \
--filters "Name=name,Values=Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 22.04)*" \
--query "Images | sort_by(@, &CreationDate) | [-1].{ID:ImageId,Name:Name}" \
--output table
```

Check current spot prices for g5.2xlarge:

```bash
aws ec2 describe-spot-price-history \
--instance-types g5.2xlarge \
--product-descriptions "Linux/UNIX" \
--max-items 3 \
--query "SpotPriceHistory[*].{AZ:AvailabilityZone,Price:SpotPrice}" \
--output table
```
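To pick the cheapest AZ programmatically rather than reading the table (a sketch; note that SpotPrice comes back as a string, so this lexicographic sort is only reliable while prices share the same digit format):

```bash
# Print the availability zone with the lowest recent g5.2xlarge spot price
aws ec2 describe-spot-price-history \
--instance-types g5.2xlarge \
--product-descriptions "Linux/UNIX" \
--max-items 10 \
--query "sort_by(SpotPriceHistory, &SpotPrice)[0].AvailabilityZone" \
--output text
```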
Find the cheapest availability zone, then get its subnet ID:

```bash
aws ec2 describe-subnets \
--filters "Name=vpc-id,Values=vpc-XXXXX" "Name=availability-zone,Values=eu-west-1a" \
--query "Subnets[0].SubnetId" --output textReplace ami-XXXXX, sg-XXXXX, and subnet-XXXXX with your values:
aws ec2 run-instances \
--image-id ami-XXXXX \
--instance-type g5.2xlarge \
--key-name bitnet-distill \
--security-group-ids sg-XXXXX \
--subnet-id subnet-XXXXX \
--associate-public-ip-address \
--block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":100,"VolumeType":"gp3"}}]' \
--instance-market-options '{"MarketType":"spot","SpotOptions":{"SpotInstanceType":"persistent","InstanceInterruptionBehavior":"stop"}}' \
--tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=bitnet-distill}]' \
--query "Instances[0].{InstanceId:InstanceId,State:State.Name}" \
--output table
```

Get the public IP once it's running:

```bash
aws ec2 describe-instances \
--instance-ids i-XXXXX \
--query "Reservations[0].Instances[0].PublicIpAddress" --output text# Upload the BitNet repo (excluding large files)
```bash
# Upload the BitNet repo (excluding large files)
rsync -az --exclude='.git' --exclude='models' --exclude='build' \
--exclude='distill/checkpoints' --exclude='__pycache__' --exclude='*.pyc' \
-e "ssh -i ~/.ssh/bitnet-distill.pem" \
/path/to/BitNet/ ubuntu@<IP>:~/BitNet/
# Install dependencies
ssh -i ~/.ssh/bitnet-distill.pem ubuntu@<IP> \
"cd ~/BitNet && pip install -r distill/requirements.txt"Verify GPU is working:
ssh -i ~/.ssh/bitnet-distill.pem ubuntu@<IP> \
"nvidia-smi --query-gpu=name,memory.total --format=csv,noheader"| Parameter | A10G 24GB | A100 80GB | Why |
|---|---|---|---|
--batch_size |
2 | 4-8 | Largest that fits in VRAM |
--accumulation_steps |
16 | 4-8 | Target effective batch size 32 |
--max_length |
512 | 1024 | Longer context = better quality (drop if OOM) |
--save_every |
500 | 200 | Checkpoints are ~12GB each |
Key insight: effective batch size = `batch_size` × `accumulation_steps` (e.g., 2 × 16 = 32 with the A10G settings above). Target 32 for stable training. Too small (e.g., 16) gives noisier gradients; too large wastes compute without quality gain.
Launch the single-GPU run (nohup keeps it alive across SSH disconnects; output goes to distill_log.txt):

```bash
ssh -i ~/.ssh/bitnet-distill.pem ubuntu@<IP> \
"cd ~/BitNet && nohup python3 distill/distill.py \
--device cuda \
--teacher_model Qwen/Qwen2.5-3B \
--dataset slimorca \
--batch_size 2 \
--accumulation_steps 16 \
--max_length 512 \
--epochs 2 \
--max_steps 5000 \
--lr 1e-4 \
--save_every 500 \
> distill_log.txt 2>&1 &"Uses PyTorch FSDP to shard the student model across all GPUs. Launch with torchrun:
ssh -i ~/.ssh/bitnet-distill.pem ubuntu@<IP> \
"cd ~/BitNet && nohup torchrun --nproc_per_node=8 distill/distill.py \
--device cuda \
--teacher_model Qwen/Qwen2.5-3B \
--student_model Qwen/Qwen2.5-3B \
--dataset slimorca \
--batch_size 4 \
--accumulation_steps 4 \
--max_length 1024 \
--epochs 2 \
--max_steps 5000 \
--lr 1e-4 \
--save_every 200 \
> distill_log.txt 2>&1 &"Note: With 8 GPUs, effective batch = batch_size × accumulation_steps × 8 = 128. Adjust --accumulation_steps accordingly. Only rank 0 prints logs and saves checkpoints.
Monitor progress:
```bash
# Raw scrolling log
ssh -i ~/.ssh/bitnet-distill.pem ubuntu@<IP> "tail -f ~/BitNet/distill_log.txt"
# Live color terminal dashboard (curses, press q to quit)
ssh -i ~/.ssh/bitnet-distill.pem ubuntu@<IP> \
"cd ~/BitNet && python distill/dashboard.py distill/training_log.jsonl"
# Generate PNG plot of training curves
ssh -i ~/.ssh/bitnet-distill.pem ubuntu@<IP> \
"cd ~/BitNet && pip install matplotlib && python distill/plot.py"
# GPU stats
ssh -i ~/.ssh/bitnet-distill.pem ubuntu@<IP> "nvidia-smi"scp -i ~/.ssh/bitnet-distill.pem \
ubuntu@<IP>:~/BitNet/distill/checkpoints/final.pt \
distill/checkpoints/final_3b.pt
# One command: export → GGUF → quantize → chat
python distill/deploy.py distill/checkpoints/final_3b.pt
```

Important: stop the instance to avoid ongoing charges.

```bash
aws ec2 stop-instances --instance-ids i-XXXXX
```

To restart later:

```bash
aws ec2 start-instances --instance-ids i-XXXXX
# Get the new public IP (changes on restart)
aws ec2 describe-instances \
--instance-ids i-XXXXX \
--query "Reservations[0].Instances[0].PublicIpAddress" --output textTo permanently delete (terminates instance and releases storage):
aws ec2 terminate-instances --instance-ids i-XXXXX
```
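Note: the instance was launched with a persistent spot request, so terminating it alone can let EC2 launch a replacement. Cancel the spot request as well (a sketch, assuming a single matching request):

```bash
# Find the spot request tied to the instance, then cancel it
aws ec2 describe-spot-instance-requests \
--filters "Name=instance-id,Values=i-XXXXX" \
--query "SpotInstanceRequests[0].SpotInstanceRequestId" --output text
aws ec2 cancel-spot-instance-requests --spot-instance-request-ids sir-XXXXX
```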
After downloading the checkpoint, deploy with one command:

```bash
python distill/deploy.py distill/checkpoints/final_3b.pt -o models/distilled-bitnet-3b
```

This auto-detects the tokenizer, exports to HuggingFace format, converts to GGUF, quantizes, and launches interactive chat.
To produce a high-quality 7B-class BitNet model, we self-distill (7B → 7B BitNet). The ternary weight constraint is the bottleneck, not architecture size, so the student should match ~90-95% of the teacher's quality.
| Component | VRAM |
|---|---|
| Teacher (7B, fp16, frozen) | ~14 GB |
| Student (7B, fp32, trainable) | ~28 GB |
| Gradients (fp32) | ~22 GB |
| AdamW optimizer states (2x fp32) | ~44 GB |
| Activations (grad checkpointing) | ~5-10 GB |
| Total | ~113-118 GB |
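These rows follow from bytes-per-parameter arithmetic (fp16 = 2 bytes/param, fp32 = 4 bytes/param, AdamW adds two fp32 states = 8 bytes/param). A rough sketch below; the gradient and optimizer rows are mutually consistent at roughly 5.5B trainable parameters, which is an assumption on my part (e.g., if some weights are excluded from training):

```bash
# Back-of-envelope VRAM arithmetic for the table above
python3 - <<'EOF'
TEACHER = 7e9        # frozen fp16 params
STUDENT = 7e9        # fp32 params
TRAINABLE = 5.5e9    # assumed; implied by the ~22/~44 GB rows
print(f"teacher fp16: {TEACHER * 2 / 1e9:.0f} GB")
print(f"student fp32: {STUDENT * 4 / 1e9:.0f} GB")
print(f"gradients fp32: {TRAINABLE * 4 / 1e9:.0f} GB")
print(f"AdamW states (2x fp32): {TRAINABLE * 8 / 1e9:.0f} GB")
EOF
```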
This does not fit on a single A100 80GB. Two approaches:

Option A: multi-GPU. Hardware: AWS p4d.24xlarge (8x A100 80GB; the run uses 2-3 GPUs).
```bash
python distill/distill.py \
--device cuda \
--teacher_model Qwen/Qwen2.5-7B \
--student_model Qwen/Qwen2.5-7B \
--dataset slimorca \
--batch_size 2 \
--accumulation_steps 16 \
--max_steps 10000 \
--save_every 500
```

Requires: script changes for multi-GPU (DeepSpeed ZeRO-2 or manual model placement across GPUs).
| | Estimate |
|---|---|
| Time | 24-48 hours |
| Spot price | ~$12/hr |
| Total cost | $290-580 |
Option B: single GPU. Load the teacher in 8-bit quantization (bitsandbytes) and offload optimizer states to CPU RAM:
| Component | VRAM |
|---|---|
| Teacher (7B, 8-bit quantized) | ~7 GB |
| Student (7B, fp32) | ~28 GB |
| Gradients | ~22 GB |
| Optimizer states → CPU offload | ~0 GB GPU |
| Activations | ~5-10 GB |
| Total | ~62-67 GB → fits on A100 80GB |
```bash
python distill/distill.py \
--device cuda \
--teacher_model Qwen/Qwen2.5-7B \
--student_model Qwen/Qwen2.5-7B \
--teacher_quantize 8bit \
--optimizer_offload cpu \
--dataset slimorca \
--batch_size 1 \
--accumulation_steps 32 \
--max_steps 10000 \
--save_every 500
```

Requires: script changes for 8-bit teacher loading (bitsandbytes) and optimizer CPU offload.
| | Estimate |
|---|---|
| Time | 48-72 hours (slower from CPU offload) |
| Spot price | ~$1-2/hr (single A100 80GB) |
| Total cost | $100-200 |
Alternatively, use a bigger teacher: same as Option A/B but with `--teacher_model Qwen/Qwen2.5-14B`. Slightly better quality than self-distillation, but requires more VRAM (Option A hardware).
Start with Option B for cost efficiency. If quality isn't sufficient, move to Option A with a 14B teacher.
| Instance | GPU | VRAM | Spot Price* | Best For |
|---|---|---|---|---|
| g5.xlarge | A10G | 24 GB | ~$0.40/hr | 1.5B-3B teacher (batch_size=2) |
| g5.2xlarge | A10G | 24 GB | ~$0.63/hr | 3B teacher (more CPU/RAM) |
| p4d.24xlarge | 8x A100 | 8x 80 GB | ~$12/hr | 7B-14B teacher |
*Spot prices vary by region and time. Check current prices with the `describe-spot-price-history` command above.
CUDA out of memory: Reduce `--batch_size` (try 1) or `--max_length` (try 256).

Spot instance interrupted: Use `--resume_from` to continue from the last checkpoint:

```bash
ssh -i ~/.ssh/bitnet-distill.pem ubuntu@<IP> \
"cd ~/BitNet && nohup python3 distill/distill.py \
--device cuda \
--teacher_model Qwen/Qwen2.5-3B \
--dataset slimorca \
--resume_from distill/checkpoints/step_1000.pt \
--batch_size 2 --accumulation_steps 16 --max_steps 5000 \
> distill_log.txt 2>&1 &"SSH connection refused: Wait 30-60 seconds after starting the instance for SSH to come up.
No GPU quota: Request a quota increase for "Running On-Demand G and VT instances" in the AWS Service Quotas console for your region.
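The request can also be filed from the CLI (a sketch; quota code `L-DB2E81BA` is assumed to be the one for "Running On-Demand G and VT instances", and the value is vCPUs, so 8 covers one g5.2xlarge):

```bash
# Request 8 vCPUs of G/VT On-Demand capacity
aws service-quotas request-service-quota-increase \
--service-code ec2 \
--quota-code L-DB2E81BA \
--desired-value 8
```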