
Add optimized training: 14% faster (107→92 ms/step on M3 Max)#21

Closed
tomdif wants to merge 2 commits into maderix:main from tomdif:ane-training-optimizations

Conversation


@tomdif tomdif commented Mar 3, 2026

Summary

Adds an optimized train_opt variant alongside the existing train_large, achieving significant speedups on M3 Max with stories110M. No changes to the original training code — this is purely additive.

Benchmarks (M3 Max, stories110M, steady-state):

| Metric | train_large | train_opt | Δ |
|---|---|---|---|
| ms/step (macOS 15.4) | 82.9 | 74.0 | −10.7% |
| ms/step (macOS 15.3) | 107.2 | 92.1 | −14.1% |
| ANE util (15.4) | 7.1% | 7.9% | +11% |
| IO ms/step (15.4) | 3.8 | 1.7 | −55% |
| TFLOPS (15.4) | 1.12 | 1.25 | +12% |

Note: macOS 15.4 Sequoia update improved ANE scheduling significantly (~23% faster baseline vs 15.3).

What's changed

New files:

  • training/stories_cpu_ops_opt.h — NEON-vectorized Adam optimizer + vectorized embedding ops
  • training/train_opt.m — Optimized training loop with all improvements below

Modified files:

  • training/stories_io.h — Added io_read_raw_fp16() helper for raw fp16 memcpy from IOSurface
  • training/Makefile — Added train_opt build target

Optimizations

1. NEON-vectorized Adam optimizer (~3x faster Adam)

4-wide NEON intrinsics with vrsqrteq_f32 plus one Newton-Raphson iteration for a fast reciprocal square root. A scalar tail handles the non-multiple-of-4 remainder. A significant win across 110M parameters.

2. fp16 activation & gradient caching (~3.4ms saved per step)

  • Forward activations (xnorm, attn_out, x2norm, silu_out): stored as _Float16* via raw memcpy from IOSurface, skipping fp16→fp32 NEON conversion on the main thread. Conversion deferred to dW dispatch blocks.
  • Backward gradients (dh1, dh3, dq, dk, dv): same pattern — raw fp16 read on main thread, convert in concurrent dispatch blocks.
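
The win is about *when* the conversion runs, not how: the main thread only memcpys raw fp16 bytes out of the IOSurface, and the widening to fp32 happens later inside the concurrent dW blocks (a single NEON vcvt per lane on device). A portable software decode of one binary16 value, as a stand-in for that deferred conversion:

```c
#include <stdint.h>
#include <string.h>

/* Decode one IEEE-754 binary16 value to float. Illustrative only --
 * on device this is a hardware fp16->fp32 convert; the optimization
 * is deferring it off the main thread. */
static float half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        if (man == 0) {
            bits = sign;                          /* signed zero */
        } else {                                  /* subnormal */
            int e = -1;
            do { man <<= 1; e++; } while (!(man & 0x400u));
            man &= 0x3FFu;
            bits = sign | ((uint32_t)(127 - 15 - e) << 23) | (man << 13);
        }
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | (man << 13);  /* inf / NaN */
    } else {
        bits = sign | ((exp - 15 + 127) << 23) | (man << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```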

3. Pre-allocated per-step buffers (~132 malloc/free eliminated per step)

LayerCaptures struct pre-allocates all 11 fp32 + 5 fp16 dW capture buffers per layer at startup. Dispatch blocks just memcpy into pre-allocated memory instead of malloc+memcpy+free.
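
A minimal mock of that pattern (slot counts match the description, but the field names, the single shared element count, and the `captures_init`/`capture_fp32` helpers are illustrative, not the PR's actual struct):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { N_FP32_CAPS = 11, N_FP16_CAPS = 5 };

/* All capture buffers for one layer, allocated once at startup. */
typedef struct {
    float    *fp32_caps[N_FP32_CAPS];
    uint16_t *fp16_caps[N_FP16_CAPS];
    size_t    elems;
} LayerCaptures;

/* One-time allocation; returns 0 on success, -1 on OOM. */
static int captures_init(LayerCaptures *lc, size_t elems) {
    lc->elems = elems;
    for (int i = 0; i < N_FP32_CAPS; i++)
        if (!(lc->fp32_caps[i] = malloc(elems * sizeof(float))))
            return -1;
    for (int i = 0; i < N_FP16_CAPS; i++)
        if (!(lc->fp16_caps[i] = malloc(elems * sizeof(uint16_t))))
            return -1;
    return 0;
}

/* Hot path: the dispatch block just memcpys into preallocated memory
 * instead of malloc + memcpy + free on every step. */
static void capture_fp32(LayerCaptures *lc, int slot, const float *src) {
    memcpy(lc->fp32_caps[slot], src, lc->elems * sizeof(float));
}
```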

4. Concurrent dW dispatch queue

Changed the weight-gradient queue from DISPATCH_QUEUE_SERIAL to DISPATCH_QUEUE_CONCURRENT. Individual sgemm calls are dispatched independently. Added setenv("VECLIB_MAXIMUM_THREADS", "2", 1) to prevent cblas thread oversubscription.
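
A portable sketch of the serial-to-concurrent change, using pthreads in place of libdispatch (the `DWJob` struct and the sum-of-squares body are stand-ins for the real per-matrix sgemm calls):

```c
#include <pthread.h>
#include <stdlib.h>

/* One independent weight-gradient job; contents are illustrative. */
typedef struct {
    const float *a;
    float       *out;
    int          n;
} DWJob;

static void *dw_worker(void *arg) {
    DWJob *j = (DWJob *)arg;
    float s = 0.0f;                      /* stand-in for an sgemm call */
    for (int i = 0; i < j->n; i++)
        s += j->a[i] * j->a[i];
    *j->out = s;
    return NULL;
}

/* Dispatch each job on its own thread instead of queueing them behind
 * one another; with cblas capped at 2 threads per call, the jobs don't
 * oversubscribe the cores. */
static void run_dw_concurrent(DWJob *jobs, int njobs) {
    pthread_t *tids = malloc(sizeof(pthread_t) * (size_t)njobs);
    for (int i = 0; i < njobs; i++)
        pthread_create(&tids[i], NULL, dw_worker, &jobs[i]);
    for (int i = 0; i < njobs; i++)
        pthread_join(tids[i], NULL);
    free(tids);
}
```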

5. Dead read elimination (~1.1ms saved)

Removed unnecessary h1/h3 reads from fwdFFN IOSurface output during forward pass — backward already copies directly via io_copy.

6. Vectorized embedding ops

  • embed_lookup_opt: memcpy rows + a single vDSP_mtrans (vs. a scalar scatter).
  • embed_backward_opt: vDSP_mtrans + vDSP_vadd per token row.
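
The gather half of that can be sketched portably (the `embed_lookup_rows` name is hypothetical; the PR additionally transposes the result with one vDSP_mtrans call, omitted here):

```c
#include <string.h>

/* Copy whole embedding rows with memcpy instead of a scalar scatter
 * loop: out[t] gets row tokens[t] of the (vocab x dim) table. */
static void embed_lookup_rows(const float *table, int dim,
                              const int *tokens, int ntok, float *out) {
    for (int t = 0; t < ntok; t++)
        memcpy(out + (size_t)t * dim,
               table + (size_t)tokens[t] * dim,
               sizeof(float) * (size_t)dim);
}
```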

7. Optional Metal GPU for dW (off by default)

MPSMatrixMultiplication for weight gradients on GPU. Disabled by default because it causes memory bandwidth contention with ANE on M3 Max (~28ms regression). Available via --metal flag for testing on other hardware.

Build & test

```
make train_opt
./train_opt stories110M.bin             # default: no Metal
./train_opt stories110M.bin --metal     # opt-in Metal GPU
./train_opt stories110M.bin --steps 100 # custom step count
```

Drops in alongside train_large — existing code untouched.

Test plan

  • Verified build on M3 Max (macOS Sequoia 15.3 + 15.4)
  • Benchmarked against train_large baseline (10.7% faster on 15.4, 14.1% on 15.3)
  • Confirmed no data races in concurrent dispatch (x2norm/xnorm converted on main thread)
  • Verified Metal path with --metal flag (functional but slower due to bandwidth contention)
  • Fixed model path handling — now accepts argv[1] like train_large
  • Test on M1/M2/M4 variants

🤖 Generated with Claude Code

New train_opt target with NEON-vectorized Adam, fp16 activation/gradient
caching, concurrent dW dispatch, pre-allocated buffers, and optional
Metal GPU support. Tested on M3 Max with stories110M.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tomdif tomdif force-pushed the ane-training-optimizations branch from 3b188f5 to 09e9c99 on March 3, 2026 at 13:08
train_opt had a hardcoded MODEL_PATH that didn't match the working
directory, causing fallback to random init. Now accepts positional
model path argument (e.g., ./train_opt stories110M.bin).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

tomdif commented Mar 3, 2026

Closing this — after benchmarking train_large_ane on the same machine (M3 Max, macOS 15.4), it already achieves 72.6 ms/step and 1.45 TFLOPS via ANE offloads (classifier/softmax/rmsnorm_bwd). Our CPU-side optimizations (NEON Adam, fp16 cache, etc.) landed at 74.5 ms/step and 1.25 TFLOPS — same ballpark but a less impactful approach since the real gains come from moving work onto the ANE rather than optimizing CPU paths.

No point adding complexity that doesn't improve on what you already have. Nice work on the ANE offload approach.

@tomdif tomdif closed this Mar 3, 2026
dev-erik added a commit to dev-erik/ANE that referenced this pull request Mar 3, 2026
…timized training (train_opt), double-buffered async ANE training (train_double_buffer), Qwen2.5-0.5B LLM inference (inference/). Added get_path() env var support and SEC_FLAGS to all new targets. Skipped PR maderix#22 (binary blob risk).