Merged
- prepare_data.py: converts bank to Apple FM training JSONL (19k train / 3k eval)
- train_adapter.ipynb: Colab notebook with Drive integration
- train_cloud.sh: CLI script for SSH-based cloud training
- README documenting LoRA background, setup, training options, QLoRA future work
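The real schema prepare_data.py emits is not shown here, so this is only a minimal sketch assuming a chat-style JSONL format (one sample per line, user/assistant message pairs) and a hypothetical `to_training_record` helper:

```python
import json

# Hypothetical record shape -- the actual Apple FM training schema may differ.
def to_training_record(question, answer):
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }

def write_jsonl(samples, path):
    # One JSON object per line, which is what "training JSONL" implies.
    with open(path, "w", encoding="utf-8") as f:
        for question, answer in samples:
            f.write(json.dumps(to_training_record(question, answer)) + "\n")

# Illustrative split into train/eval files (the PR uses a 19k/3k split).
samples = [("list files", "ls"), ("show disk usage", "df -h")]
write_jsonl(samples[:1], "train.jsonl")
write_jsonl(samples[1:], "eval.jsonl")
```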
Adds --adapter flag to hunch CLI, QLoRA/LoRA benchmark approaches in run.py, source filtering in prepare_data.py, and training notebooks for LoRA, fp16 LoRA, and QLoRA experiments.
Works around TGOnDeviceInferenceProviderService disk leak where each process invocation caches ~160MB of the adapter. Batch mode loads the adapter once and runs all prompts in a single process. 4 runs of 100 prompts = 1 cached copy instead of 400.
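The saving from the batch-mode workaround can be sketched with the numbers above (~160MB cached per process invocation). The loader and generator below are hypothetical stand-ins for the Swift on-device calls; only the amortization idea is from the source:

```python
ADAPTER_CACHE_MB = 160    # ~160MB leaked to disk per process invocation
RUNS, PROMPTS = 4, 100

# Per-prompt mode: each prompt spawns a fresh process, each caching the adapter.
per_prompt_copies = RUNS * PROMPTS                         # 400 cached copies
per_prompt_leak_mb = per_prompt_copies * ADAPTER_CACHE_MB  # ~64,000 MB

# Batch mode: one process loads the adapter once and serves every prompt.
def run_batch(prompts, load_adapter, generate):
    adapter = load_adapter()          # single cached copy for the whole batch
    return [generate(adapter, p) for p in prompts]

batch_copies = 1
print(per_prompt_copies, "->", batch_copies)  # 400 -> 1
```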
Reviewed all non-exact results across 5 approaches x 4 runs. Added accepted alternates for placeholder variations, flag reordering, and equivalent commands.
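The kinds of alternates accepted (placeholder variations, flag reordering) suggest a normalize-then-compare grader. This is a hypothetical sketch, not the actual alternates.json logic: placeholders like `<file>` and `FILE` are canonicalized and short flags are sorted so equivalent commands compare equal.

```python
import re

def canon_token(t):
    # Treat <file>, FILE, and file as the same placeholder spelling.
    m = re.fullmatch(r"<(\w+)>", t)
    if m:
        return m.group(1).lower()
    return t.lower() if t.isupper() else t

def normalize(cmd):
    tokens = cmd.split()
    if not tokens:
        return []
    head, rest = tokens[0], tokens[1:]
    # Sort flags so "-n -i" and "-i -n" normalize identically.
    flags = sorted(t for t in rest if t.startswith("-"))
    args = [canon_token(t) for t in rest if not t.startswith("-")]
    return [head, *flags, *args]

def matches(output, expected, alternates=()):
    # Exact answer plus any manually accepted alternates, compared normalized.
    return any(normalize(output) == normalize(c) for c in (expected, *alternates))
```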
…sults
- QLoRA training on Mac via native Metal kernels (bitsandbytes PR #1875): ~34 min for 20 epochs on M3, 3.4GB GPU, ~7x slower than T4
- MPS GradScaler fix for fp16 gradient overflow
- Flat checkpoint format for export compatibility
- Benchmark review criteria documented in REVIEW_CRITERIA.md
- MPS adapter benchmark approaches added to run.py
- Updated alternates.json with manual review of 28 runs
- TRAINING.md rewritten: Mac + Colab paths, memory breakdowns, accuracy table
- Removed failed eval cells from notebooks
Label masking was the main accuracy issue: the training loop computed loss over prompt tokens, wasting adapter capacity. Now only assistant response tokens contribute to the loss. This closed the MPS vs T4 gap entirely — Mac-trained adapters now match T4 quality (~86% with retrieval). Also: flat checkpoint format, conditional compress_statistics for MPS, batch_size default 8, better logging granularity.
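The masking fix described above follows the standard ignore-index convention (PyTorch's cross-entropy skips positions labeled -100). A minimal sketch with plain lists and made-up token ids, not the PR's actual training loop:

```python
IGNORE_INDEX = -100  # positions with this label contribute nothing to the loss

def build_labels(input_ids, prompt_len):
    # Labels copy the input ids, but every prompt position is masked so only
    # assistant-response tokens are trained on.
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]

prompt_ids = [101, 2023, 2003]     # hypothetical tokenized user prompt
response_ids = [1996, 3437, 102]   # hypothetical tokenized assistant response
input_ids = prompt_ids + response_ids
labels = build_labels(input_ids, len(prompt_ids))
# labels == [-100, -100, -100, 1996, 3437, 102]
```

Before the fix, `labels` would equal `input_ids` everywhere, so gradient signal (and adapter capacity) was spent reproducing the prompt instead of the response.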
- bench_mps.py: structured benchmark for Metal vs CPU fallback comparison
- TRAINING.md: ~5GB GPU peak (not 3.4GB); LoRA T4 OOM is system RAM, not GPU; accuracy table updated with latest results
- train_qlora_full.py: log every 20 steps instead of 100 for shorter runs
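A "structured benchmark" writing JSONL rows per backend might look like the following. This is a generic harness in the spirit of bench_mps.py, with trivial stand-in workloads; the real script, its fields, and its workloads are assumptions here:

```python
import json
import statistics
import time

def bench(name, fn, warmup=2, repeats=5):
    # Warm up first so one-time setup cost doesn't skew the timings.
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    # Median is robust to a single slow outlier run.
    return {"backend": name, "median_s": statistics.median(times)}

def run(backends, out_path="bench_results.jsonl"):
    rows = [bench(name, fn) for name, fn in backends.items()]
    with open(out_path, "a", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return rows

# Stand-in workloads; the real comparison would run the same model op on
# the Metal (mps) backend vs the CPU fallback.
rows = run({"mps": lambda: sum(range(10_000)),
            "cpu": lambda: sum(range(10_000))})
```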
- README: link to TRAINING.md instead of the gitignored README.md
- TRAINING.md: inline the disk-leak workaround instead of referencing an uncommitted file; update file listing; fix GPU number
- main.swift: fix batch loop indentation
- .gitignore: exclude bench_mps_results.jsonl