
Training infra #1

Merged
es617 merged 12 commits into main from training-infra on Apr 19, 2026
Conversation

@es617 es617 commented Apr 19, 2026

No description provided.

es617 added 12 commits April 10, 2026 17:17
- prepare_data.py: converts bank to Apple FM training JSONL (19k train / 3k eval)
- train_adapter.ipynb: Colab notebook with Drive integration
- train_cloud.sh: CLI script for SSH-based cloud training
- README documenting LoRA background, setup, training options, QLoRA future work
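The bank-to-JSONL conversion in prepare_data.py can be sketched roughly as below. The exact schema the Apple FM adapter toolkit expects is not shown in this PR, so the chat-style messages layout here is an assumption, not the actual format:

```python
import json

def to_training_jsonl(records, path):
    """Write (prompt, response) pairs as one JSON object per line.

    The messages schema is an assumed stand-in for whatever layout
    the Apple FM training pipeline actually consumes.
    """
    with open(path, "w", encoding="utf-8") as f:
        for prompt, response in records:
            row = {
                "messages": [
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": response},
                ]
            }
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

A split of roughly 19k train / 3k eval lines would then just be two calls with two record slices.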
Adds --adapter flag to hunch CLI, QLoRA/LoRA benchmark approaches
in run.py, source filtering in prepare_data.py, and training
notebooks for LoRA, fp16 LoRA, and QLoRA experiments.
Works around a TGOnDeviceInferenceProviderService disk leak in which each
process invocation caches ~160MB of the adapter. Batch mode loads the
adapter once and runs all prompts in a single process, so 4 runs of 100
prompts leave 1 cached copy on disk instead of 400.
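The shape of the batch-mode workaround, stripped of the real CLI plumbing, is just amortizing one expensive load across all prompts. `load_adapter` and `generate` below are hypothetical callables, not the actual APIs in this repo:

```python
def run_batch(prompts, load_adapter, generate):
    """Serve every prompt from a single process.

    load_adapter is called exactly once, so the ~160MB adapter cache
    entry is created once per batch instead of once per prompt.
    """
    adapter = load_adapter()  # one cache entry for the whole batch
    return [generate(adapter, p) for p in prompts]
```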
Reviewed all non-exact results across 5 approaches × 4 runs.
Added accepted alternates for placeholder variations, flag reordering,
and equivalent commands.
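Treating flag reordering as equivalent during review amounts to canonicalizing commands before comparing them. A minimal sketch, under the simplifying assumption that any `-`-prefixed token is a flag that may consume the single following non-flag token as its value (the real alternates.json review was manual, not this code):

```python
import shlex

def normalized(cmd):
    """Canonicalize a shell command so flag order does not matter."""
    parts = shlex.split(cmd)
    prog, rest = parts[0], parts[1:]
    flags, args = [], []
    i = 0
    while i < len(rest):
        tok = rest[i]
        if tok.startswith("-"):
            # assume the flag takes the next token as a value
            # when that token is not itself a flag
            if i + 1 < len(rest) and not rest[i + 1].startswith("-"):
                flags.append((tok, rest[i + 1]))
                i += 2
            else:
                flags.append((tok, ""))
                i += 1
        else:
            args.append(tok)
            i += 1
    return (prog, tuple(args), tuple(sorted(flags)))
```

Two commands are then accepted as equivalent when their normalized forms match.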
…sults

- QLoRA training on Mac via native Metal kernels (bitsandbytes PR #1875)
  ~34 min for 20 epochs on M3, 3.4GB GPU, ~7x slower than T4
- MPS GradScaler fix for fp16 gradient overflow
- Flat checkpoint format for export compatibility
- Benchmark review criteria documented in REVIEW_CRITERIA.md
- MPS adapter benchmark approaches added to run.py
- Updated alternates.json with manual review of 28 runs
- TRAINING.md rewritten: Mac + Colab paths, memory breakdowns, accuracy table
- Removed failed eval cells from notebooks
Label masking was the main accuracy issue: the training loop computed
loss over prompt tokens, wasting adapter capacity. Now only assistant
response tokens contribute to the loss. This closed the MPS vs T4 gap
entirely — Mac-trained adapters now match T4 quality (~86% with retrieval).
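The masking fix described above follows the common PyTorch/Hugging Face convention of setting ignored label positions to -100 (the default `ignore_index` of `CrossEntropyLoss`). A minimal sketch, assuming the prompt occupies a known prefix of the token sequence; the repo's actual training loop is not shown in this PR:

```python
IGNORE_INDEX = -100  # skipped by CrossEntropyLoss(ignore_index=-100)

def mask_prompt_labels(input_ids, prompt_len):
    """Build labels so only assistant-response tokens contribute to loss.

    Prompt positions get IGNORE_INDEX; response positions keep their
    token ids, so adapter capacity is spent only on the response.
    """
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])
```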

Also: flat checkpoint format, conditional compress_statistics for MPS,
batch_size default 8, better logging granularity.
- bench_mps.py: structured benchmark for Metal vs CPU fallback comparison
- TRAINING.md: ~5GB GPU peak (not 3.4GB), LoRA T4 OOM is system RAM not GPU,
  accuracy table updated with latest results
- train_qlora_full.py: log every 20 steps instead of 100 for shorter runs
- README: link to TRAINING.md instead of gitignored README.md
- TRAINING.md: inline disk leak workaround instead of referencing
  uncommitted file, update file listing, fix GPU number
- main.swift: fix batch loop indentation
- .gitignore: exclude bench_mps_results.jsonl
@es617 es617 merged commit 7b34975 into main Apr 19, 2026
1 check passed
@es617 es617 deleted the training-infra branch April 19, 2026 18:38