
Add optimized training: 14% faster (107→92 ms/step on M3 Max)#21

Closed
tomdif wants to merge 2 commits into maderix:main from tomdif:ane-training-optimizations

Conversation


@tomdif tomdif commented Mar 3, 2026

Summary

Adds an optimized train_opt variant alongside the existing train_large, achieving significant speedups on M3 Max with stories110M. No changes to the original training code — this is purely additive.

Benchmarks (M3 Max, stories110M, steady-state):

| Metric | train_large | train_opt | Δ |
|---|---|---|---|
| ms/step (macOS 15.4) | 82.9 | 74.0 | −10.7% |
| ms/step (macOS 15.3) | 107.2 | 92.1 | −14.1% |
| ANE util (15.4) | 7.1% | 7.9% | +11% |
| IO ms/step (15.4) | 3.8 | 1.7 | −55% |
| TFLOPS (15.4) | 1.12 | 1.25 | +12% |

Note: macOS 15.4 Sequoia update improved ANE scheduling significantly (~23% faster baseline vs 15.3).

What's changed

New files:

  • training/stories_cpu_ops_opt.h — NEON-vectorized Adam optimizer + vectorized embedding ops
  • training/train_opt.m — Optimized training loop with all improvements below

Modified files:

  • training/stories_io.h — Added io_read_raw_fp16() helper for raw fp16 memcpy from IOSurface
  • training/Makefile — Added train_opt build target

Optimizations

1. NEON-vectorized Adam optimizer (~3x faster Adam)

4-wide NEON intrinsics with vrsqrteq_f32 plus one Newton-Raphson iteration for a fast reciprocal square root. A scalar tail handles the non-multiple-of-4 remainder. A significant win across 110M parameters.

2. fp16 activation & gradient caching (~3.4ms saved per step)

  • Forward activations (xnorm, attn_out, x2norm, silu_out): stored as _Float16* via raw memcpy from IOSurface, skipping fp16→fp32 NEON conversion on the main thread. Conversion deferred to dW dispatch blocks.
  • Backward gradients (dh1, dh3, dq, dk, dv): same pattern — raw fp16 read on main thread, convert in concurrent dispatch blocks.
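
The win is about *when* the conversion runs, not how: the main thread only memcpys raw fp16 bytes out of the IOSurface, and the widening to fp32 happens later inside the concurrent dW blocks (a single NEON vcvt per lane on device). A portable software decode of one binary16 value, as a stand-in for that deferred conversion:

```c
#include <stdint.h>
#include <string.h>

/* Decode one IEEE-754 binary16 value to float. Illustrative only --
 * on device this is a hardware fp16->fp32 convert; the optimization
 * is deferring it off the main thread. */
static float half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        if (man == 0) {
            bits = sign;                          /* signed zero */
        } else {                                  /* subnormal */
            int e = -1;
            do { man <<= 1; e++; } while (!(man & 0x400u));
            man &= 0x3FFu;
            bits = sign | ((uint32_t)(127 - 15 - e) << 23) | (man << 13);
        }
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | (man << 13);  /* inf / NaN */
    } else {
        bits = sign | ((exp - 15 + 127) << 23) | (man << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```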

3. Pre-allocated per-step buffers (~132 malloc/free eliminated per step)

LayerCaptures struct pre-allocates all 11 fp32 + 5 fp16 dW capture buffers per layer at startup. Dispatch blocks just memcpy into pre-allocated memory instead of malloc+memcpy+free.
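
A minimal mock of that pattern (slot counts match the description, but the field names, the single shared element count, and the `captures_init`/`capture_fp32` helpers are illustrative, not the PR's actual struct):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { N_FP32_CAPS = 11, N_FP16_CAPS = 5 };

/* All capture buffers for one layer, allocated once at startup. */
typedef struct {
    float    *fp32_caps[N_FP32_CAPS];
    uint16_t *fp16_caps[N_FP16_CAPS];
    size_t    elems;
} LayerCaptures;

/* One-time allocation; returns 0 on success, -1 on OOM. */
static int captures_init(LayerCaptures *lc, size_t elems) {
    lc->elems = elems;
    for (int i = 0; i < N_FP32_CAPS; i++)
        if (!(lc->fp32_caps[i] = malloc(elems * sizeof(float))))
            return -1;
    for (int i = 0; i < N_FP16_CAPS; i++)
        if (!(lc->fp16_caps[i] = malloc(elems * sizeof(uint16_t))))
            return -1;
    return 0;
}

/* Hot path: the dispatch block just memcpys into preallocated memory
 * instead of malloc + memcpy + free on every step. */
static void capture_fp32(LayerCaptures *lc, int slot, const float *src) {
    memcpy(lc->fp32_caps[slot], src, lc->elems * sizeof(float));
}
```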

4. Concurrent dW dispatch queue

Changed the weight-gradient queue from DISPATCH_QUEUE_SERIAL to DISPATCH_QUEUE_CONCURRENT. Individual sgemm calls are dispatched independently. Added setenv("VECLIB_MAXIMUM_THREADS", "2", 1) to prevent cblas thread oversubscription.
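
A portable sketch of the serial-to-concurrent change, using pthreads in place of libdispatch (the `DWJob` struct and the sum-of-squares body are stand-ins for the real per-matrix sgemm calls):

```c
#include <pthread.h>
#include <stdlib.h>

/* One independent weight-gradient job; contents are illustrative. */
typedef struct {
    const float *a;
    float       *out;
    int          n;
} DWJob;

static void *dw_worker(void *arg) {
    DWJob *j = (DWJob *)arg;
    float s = 0.0f;                      /* stand-in for an sgemm call */
    for (int i = 0; i < j->n; i++)
        s += j->a[i] * j->a[i];
    *j->out = s;
    return NULL;
}

/* Dispatch each job on its own thread instead of queueing them behind
 * one another; with cblas capped at 2 threads per call, the jobs don't
 * oversubscribe the cores. */
static void run_dw_concurrent(DWJob *jobs, int njobs) {
    pthread_t *tids = malloc(sizeof(pthread_t) * (size_t)njobs);
    for (int i = 0; i < njobs; i++)
        pthread_create(&tids[i], NULL, dw_worker, &jobs[i]);
    for (int i = 0; i < njobs; i++)
        pthread_join(tids[i], NULL);
    free(tids);
}
```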

5. Dead read elimination (~1.1ms saved)

Removed unnecessary h1/h3 reads from fwdFFN IOSurface output during forward pass — backward already copies directly via io_copy.

6. Vectorized embedding ops

  • embed_lookup_opt: memcpy rows + a single vDSP_mtrans (vs. a scalar scatter).
  • embed_backward_opt: vDSP_mtrans + vDSP_vadd per token row.
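
The gather half of that can be sketched portably (the `embed_lookup_rows` name is hypothetical; the PR additionally transposes the result with one vDSP_mtrans call, omitted here):

```c
#include <string.h>

/* Copy whole embedding rows with memcpy instead of a scalar scatter
 * loop: out[t] gets row tokens[t] of the (vocab x dim) table. */
static void embed_lookup_rows(const float *table, int dim,
                              const int *tokens, int ntok, float *out) {
    for (int t = 0; t < ntok; t++)
        memcpy(out + (size_t)t * dim,
               table + (size_t)tokens[t] * dim,
               sizeof(float) * (size_t)dim);
}
```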

7. Optional Metal GPU for dW (off by default)

MPSMatrixMultiplication for weight gradients on GPU. Disabled by default because it causes memory bandwidth contention with ANE on M3 Max (~28ms regression). Available via --metal flag for testing on other hardware.

Build & test

```
make train_opt
./train_opt stories110M.bin             # default: no Metal
./train_opt stories110M.bin --metal     # opt-in Metal GPU
./train_opt stories110M.bin --steps 100 # custom step count
```

Drops in alongside train_large — existing code untouched.

Test plan

  • Verified build on M3 Max (macOS Sequoia 15.3 + 15.4)
  • Benchmarked against train_large baseline (10.7% faster on 15.4, 14.1% on 15.3)
  • Confirmed no data races in concurrent dispatch (x2norm/xnorm converted on main thread)
  • Verified Metal path with --metal flag (functional but slower due to bandwidth contention)
  • Fixed model path handling — now accepts argv[1] like train_large
  • Test on M1/M2/M4 variants

🤖 Generated with Claude Code

New train_opt target with NEON-vectorized Adam, fp16 activation/gradient
caching, concurrent dW dispatch, pre-allocated buffers, and optional
Metal GPU support. Tested on M3 Max with stories110M.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tomdif tomdif force-pushed the ane-training-optimizations branch from 3b188f5 to 09e9c99 on March 3, 2026 at 13:08
train_opt had a hardcoded MODEL_PATH that didn't match the working
directory, causing fallback to random init. Now accepts positional
model path argument (e.g., ./train_opt stories110M.bin).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

tomdif commented Mar 3, 2026

Closing this — after benchmarking train_large_ane on the same machine (M3 Max, macOS 15.4), it already achieves 72.6 ms/step and 1.45 TFLOPS via ANE offloads (classifier/softmax/rmsnorm_bwd). Our CPU-side optimizations (NEON Adam, fp16 cache, etc.) landed at 74.5 ms/step and 1.25 TFLOPS — same ballpark but a less impactful approach since the real gains come from moving work onto the ANE rather than optimizing CPU paths.

No point adding complexity that doesn't improve on what you already have. Nice work on the ANE offload approach.

@tomdif tomdif closed this Mar 3, 2026
dev-erik added a commit to dev-erik/ANE that referenced this pull request Mar 3, 2026
…timized training (train_opt), double-buffered async ANE training (train_double_buffer), Qwen2.5-0.5B LLM inference (inference/). Added get_path() env var support and SEC_FLAGS to all new targets. Skipped PR maderix#22 (binary blob risk).