
Training infra #1

Merged
es617 merged 12 commits into main from training-infra on Apr 19, 2026
Conversation

@es617 es617 commented Apr 19, 2026

No description provided.

es617 added 12 commits April 10, 2026 17:17
- prepare_data.py: converts bank to Apple FM training JSONL (19k train / 3k eval)
- train_adapter.ipynb: Colab notebook with Drive integration
- train_cloud.sh: CLI script for SSH-based cloud training
- README documenting LoRA background, setup, training options, QLoRA future work
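The bank-to-JSONL conversion in prepare_data.py can be sketched roughly as below. The exact schema the Apple FM adapter toolkit expects is not shown in this PR, so the chat-style messages layout here is an assumption, not the actual format:

```python
import json

def to_training_jsonl(records, path):
    """Write (prompt, response) pairs as one JSON object per line.

    The messages schema is an assumed stand-in for whatever layout
    the Apple FM training pipeline actually consumes.
    """
    with open(path, "w", encoding="utf-8") as f:
        for prompt, response in records:
            row = {
                "messages": [
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": response},
                ]
            }
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

A split of roughly 19k train / 3k eval lines would then just be two calls with two record slices.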
Adds --adapter flag to hunch CLI, QLoRA/LoRA benchmark approaches
in run.py, source filtering in prepare_data.py, and training
notebooks for LoRA, fp16 LoRA, and QLoRA experiments.
Works around a TGOnDeviceInferenceProviderService disk leak in which each
process invocation caches ~160MB of the adapter. Batch mode loads the
adapter once and runs all prompts in a single process, so 4 runs of 100
prompts leave 1 cached copy on disk instead of 400.
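The shape of the batch-mode workaround, stripped of the real CLI plumbing, is just amortizing one expensive load across all prompts. `load_adapter` and `generate` below are hypothetical callables, not the actual APIs in this repo:

```python
def run_batch(prompts, load_adapter, generate):
    """Serve every prompt from a single process.

    load_adapter is called exactly once, so the ~160MB adapter cache
    entry is created once per batch instead of once per prompt.
    """
    adapter = load_adapter()  # one cache entry for the whole batch
    return [generate(adapter, p) for p in prompts]
```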
Reviewed all non-exact results across 5 approaches × 4 runs.
Added accepted alternates for placeholder variations, flag reordering,
and equivalent commands.
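Treating flag reordering as equivalent during review amounts to canonicalizing commands before comparing them. A minimal sketch, under the simplifying assumption that any `-`-prefixed token is a flag that may consume the single following non-flag token as its value (the real alternates.json review was manual, not this code):

```python
import shlex

def normalized(cmd):
    """Canonicalize a shell command so flag order does not matter."""
    parts = shlex.split(cmd)
    prog, rest = parts[0], parts[1:]
    flags, args = [], []
    i = 0
    while i < len(rest):
        tok = rest[i]
        if tok.startswith("-"):
            # assume the flag takes the next token as a value
            # when that token is not itself a flag
            if i + 1 < len(rest) and not rest[i + 1].startswith("-"):
                flags.append((tok, rest[i + 1]))
                i += 2
            else:
                flags.append((tok, ""))
                i += 1
        else:
            args.append(tok)
            i += 1
    return (prog, tuple(args), tuple(sorted(flags)))
```

Two commands are then accepted as equivalent when their normalized forms match.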
…sults

- QLoRA training on Mac via native Metal kernels (bitsandbytes PR #1875)
  ~34 min for 20 epochs on M3, 3.4GB GPU, ~7x slower than T4
- MPS GradScaler fix for fp16 gradient overflow
- Flat checkpoint format for export compatibility
- Benchmark review criteria documented in REVIEW_CRITERIA.md
- MPS adapter benchmark approaches added to run.py
- Updated alternates.json with manual review of 28 runs
- TRAINING.md rewritten: Mac + Colab paths, memory breakdowns, accuracy table
- Removed failed eval cells from notebooks
Label masking was the main accuracy issue: the training loop computed
loss over prompt tokens, wasting adapter capacity. Now only assistant
response tokens contribute to the loss. This closed the MPS vs T4 gap
entirely — Mac-trained adapters now match T4 quality (~86% with retrieval).
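The masking fix described above follows the common PyTorch/Hugging Face convention of setting ignored label positions to -100 (the default `ignore_index` of `CrossEntropyLoss`). A minimal sketch, assuming the prompt occupies a known prefix of the token sequence; the repo's actual training loop is not shown in this PR:

```python
IGNORE_INDEX = -100  # skipped by CrossEntropyLoss(ignore_index=-100)

def mask_prompt_labels(input_ids, prompt_len):
    """Build labels so only assistant-response tokens contribute to loss.

    Prompt positions get IGNORE_INDEX; response positions keep their
    token ids, so adapter capacity is spent only on the response.
    """
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])
```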

Also: flat checkpoint format, conditional compress_statistics for MPS,
batch_size default 8, better logging granularity.
- bench_mps.py: structured benchmark for Metal vs CPU fallback comparison
- TRAINING.md: ~5GB GPU peak (not 3.4GB), LoRA T4 OOM is system RAM not GPU,
  accuracy table updated with latest results
- train_qlora_full.py: log every 20 steps instead of 100 for shorter runs
- README: link to TRAINING.md instead of gitignored README.md
- TRAINING.md: inline disk leak workaround instead of referencing
  uncommitted file, update file listing, fix GPU number
- main.swift: fix batch loop indentation
- .gitignore: exclude bench_mps_results.jsonl
@es617 es617 merged commit 7b34975 into main Apr 19, 2026
1 check passed
@es617 es617 deleted the training-infra branch April 19, 2026 18:38