bridge: function parameter IOSurfaces — 30% faster than spatial packing (76.9ms vs 110ms/step) #22
Closed
fspecii wants to merge 1 commit into maderix:main
Conversation
Adds a second dynamic weight approach to the bridge alongside the existing
BLOBFILE compile path. Instead of packing weights into the spatial dimension
of a single large input tensor and slicing them inside MIL (the training_dynamic/
approach), weights are declared as native MIL function parameters backed by
persistent IOSurfaces:
```
// training_dynamic/ approach: spatial packing
func main<ios18>(tensor<fp32, [1, DIM, 1, SEQ + 4*DIM]> x) {
    Wq = slice_by_size(x=x, begin=..., size=...);  // per-weight slice overhead
    ...
}

// this PR: native function parameters
func main<ios18>(tensor<fp16, [1, K, 1, M]> x, tensor<fp16, [1, N, K]> W) { ... }
```
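For concreteness, here is a sketch of the offset arithmetic the spatial-packing layout implies. The weight ordering and the helper names are assumptions for illustration, not taken from the bridge source; the dimensions are the Stories110M values quoted in this PR.

```c
#include <assert.h>

/* Hypothetical offset math for the packed input layout
   [1, DIM, 1, SEQ + 4*DIM] used by training_dynamic/: the SEQ activation
   columns come first, then each of the 4 attention weights occupies DIM
   columns. The exact weight order inside the packed tensor is assumed. */
enum { DIM = 768, SEQ = 256 };      /* Stories110M dims from this PR */

/* Start column of the i-th packed weight matrix (i in 0..3). */
static int packed_weight_begin(int i) {
    return SEQ + i * DIM;
}

/* Total width of the packed spatial dimension. */
static int packed_width(void) {
    return SEQ + 4 * DIM;           /* 256 + 4*768 = 3328 columns */
}
```

Every one of those slices must be carved out inside MIL on each step, which is exactly the overhead the function-parameter approach removes.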
New API:

- `ane_bridge_compile_dyn()` — compile with `n_weights` IOSurface parameters
- `ane_bridge_write_weight()` — write fp16 into a weight IOSurface (~0.001 ms)
- `ane_bridge_write_weight_f32()` — write fp32 with NEON fp16 conversion
- `ane_bridge_copy_io()` — direct output→input copy, no CPU round-trip
- `ane_bridge_begin_realtime()` / `ane_bridge_end_realtime()` — 90.6% p99 jitter reduction
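`ane_bridge_write_weight_f32()` converts fp32 to fp16 on the way into the IOSurface, NEON-vectorized in the bridge. As an illustration only — not the bridge's actual code — here is a scalar sketch of the same IEEE-754 binary16 conversion (round-to-nearest-even; subnormal results flushed to signed zero, overflow to infinity, NaN handling omitted):

```c
#include <stdint.h>
#include <string.h>

/* Scalar fp32 -> fp16 bit conversion (normals only). The real bridge
   does the equivalent several lanes at a time with NEON intrinsics. */
static uint16_t f32_to_f16_bits(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);                      /* reinterpret bits */
    uint32_t sign = (x >> 16) & 0x8000u;
    int32_t  exp  = (int32_t)((x >> 23) & 0xFFu) - 127 + 15;
    uint32_t mant = x & 0x7FFFFFu;
    if (exp <= 0)  return (uint16_t)sign;          /* flush to signed zero */
    if (exp >= 31) return (uint16_t)(sign | 0x7C00u); /* overflow -> inf */
    /* Round mantissa from 23 to 10 bits, nearest-even. */
    uint32_t m   = mant >> 13;
    uint32_t rem = mant & 0x1FFFu;
    if (rem > 0x1000u || (rem == 0x1000u && (m & 1u))) {
        if (++m == 0x400u) {                       /* mantissa carried out */
            m = 0;
            if (++exp >= 31) return (uint16_t)(sign | 0x7C00u);
        }
    }
    return (uint16_t)(sign | ((uint32_t)exp << 10) | m);
}
```

Because the conversion happens during the IOSurface write, the model itself can stay fp16 end to end while the trainer keeps fp32 master weights.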
Compile cache fix: for parameter-based models the ANE compiler writes only net.plist (no data file). try_cache_restore now requires net.plist alone; the data file is saved and restored conditionally, for BLOBFILE models that do produce it.
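The fixed restore check can be sketched as follows; the function and path names here are hypothetical, kept close to the PR's description (net.plist required, data optional):

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical sketch of the fixed cache-restore check (names assumed,
   not taken from the bridge source): a cached compile is valid iff
   net.plist exists; the data blob is optional and is only restored when
   a BLOBFILE-path compile actually produced one. */
static bool file_exists(const char *path) {
    struct stat st;
    return stat(path, &st) == 0;
}

static bool try_cache_restore_sketch(const char *net_plist_path,
                                     const char *data_path,
                                     bool *restore_data) {
    if (!file_exists(net_plist_path))
        return false;                       /* net.plist: only hard requirement */
    *restore_data = file_exists(data_path); /* data restored only if saved */
    return true;
}
```

The previous behavior treated the missing data file as a cache miss, so parameter-based models recompiled on every launch.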
Also removes the pre-built libane_bridge.dylib binary from version control.
Performance vs spatial packing (Stories110M, 12 layers, M-series):

- training_dynamic/ (slice approach): 110 ms/step
- function parameter approach (this PR): 76.9 ms/step (30% faster)
The slice/reshape/transpose overhead per weight matrix explains the gap.
Both compile once at startup; weight updates are IOSurface writes in both cases.
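As a quick sanity check on the headline figure, the measured step times imply:

```c
/* Fraction of per-step time saved: 110 ms -> 76.9 ms gives ~0.30,
   i.e. the 30% in the PR title. Times are from this PR's benchmark. */
static double step_time_reduction(double packed_ms, double params_ms) {
    return (packed_ms - params_ms) / packed_ms;
}
```

Since the weight-update path costs the same in both variants, essentially all of the saved time comes out of the per-step MIL graph.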
Tested: test_bridge.m — 15/15 assertions across all new API functions.
dev-erik added a commit to dev-erik/ANE that referenced this pull request on Mar 3, 2026:
…timized training (train_opt), double-buffered async ANE training (train_double_buffer), Qwen2.5-0.5B LLM inference (inference/). Added get_path() env var support and SEC_FLAGS to all new targets. Skipped PR maderix#22 (binary blob risk).
What
An alternative dynamic weight approach where weights are declared as native MIL function parameters backed by persistent IOSurfaces, rather than packed into the spatial dimension of a single large input tensor.
Two approaches compared
- Current: training_dynamic/ — spatial dimension packing
- This PR — native function parameters
The spatial packing approach adds slice_by_size + reshape + transpose overhead for every weight matrix per kernel call. With 4 weights per attention kernel, that is 12 extra ops before the actual matmul.

Performance
Same hardware (M-series), same model (Stories110M, 12 layers, DIM=768, SEQ=256):

| Approach | Time per step |
| --- | --- |
| training_dynamic/ (spatial packing) | 110 ms |
| This PR (function parameters) | 76.9 ms |

30% faster per step. Both compile once at startup, both update weights via IOSurface writes (~0.001 ms).
New bridge API
Cache fix
try_cache_restore previously required a data file that ANE never creates for parameter-based models (only net.plist is written to tmpDir). Fixed to require only net.plist; data is saved/restored conditionally for BLOBFILE models.

Tested
test_bridge.m — 15/15 assertions across all new API functions.