bridge: function parameter IOSurfaces — 30% faster than spatial packing (76.9ms vs 110ms/step)#22

Closed
fspecii wants to merge 1 commit into maderix:main from fspecii:param-iosurface-dynamic-weights

Conversation

@fspecii

@fspecii fspecii commented Mar 3, 2026

What

This PR implements an alternative dynamic-weight approach: weights are declared as native MIL function parameters backed by persistent IOSurfaces, instead of being packed into the spatial dimension of a single large input tensor.

Two approaches compared

Current training_dynamic/ — spatial dimension packing

// Input: [1, DIM, 1, SEQ + 4*DIM] — activations at sp[0:SEQ], weights at sp[SEQ:]
func main<ios18>(tensor<fp32, [1, DIM, 1, SEQ+4*DIM]> x) {
    Wq  = slice_by_size(x=x, begin=[0,0,0,SEQ], size=[1,DIM,1,DIM]); // overhead
    Wq2 = reshape(shape=[1,1,DIM,DIM], x=Wq);                        // overhead
    // xn2 = normalized activations, reshaped (elided)
    xnt = transpose(perm=[0,1,3,2], x=xn2);                          // overhead
    qm  = matmul(x=xnt, y=Wq2);
    ...
}

This PR — native function parameters

// Each weight is its own IOSurface argument, no unpacking needed
func main<ios18>(tensor<fp16,[1,K,1,M]> x, tensor<fp16,[1,N,K]> W) {
    out = matmul(x=W, y=x);
} -> (out);

The spatial packing approach adds slice_by_size + reshape + transpose overhead for every weight matrix per kernel call. With 4 weights per attention kernel, that is 12 extra ops (4 weights × 3 ops) before the actual matmul.

Performance

Same hardware (M-series), same model (Stories110M, 12 layers, DIM=768, SEQ=256):

Approach                              ms/step   Kernels
training_dynamic/ (spatial packing)   110.0     9
Function parameters (this PR)         76.9      74

30% faster per step. Both compile once at startup, both update weights via IOSurface writes (~0.001ms).

New bridge API

// Compile once — weights declared as MIL function parameters
ANEKernelHandle *ane_bridge_compile_dyn(
    const char *mil_text, size_t mil_len,
    int n_inputs, const size_t *input_sizes,
    int n_weights, const size_t *weight_sizes,
    size_t output_size);

// Update weights — direct IOSurface write, ~0.001ms
void ane_bridge_write_weight(ANEKernelHandle *k, int idx,
                              const void *fp16_data, size_t bytes);
void ane_bridge_write_weight_f32(ANEKernelHandle *k, int idx,
                                  const float *fp32_data, size_t count);

// Chain kernels without CPU round-trip
void ane_bridge_copy_io(ANEKernelHandle *src, int src_out_idx,
                         ANEKernelHandle *dst, int dst_in_idx);

// 90.6% p99 jitter reduction (plain p99=35ms → with RT task p99=3.3ms)
void ane_bridge_begin_realtime(void);
void ane_bridge_end_realtime(void);

Cache fix

try_cache_restore previously required a data file that ANE never creates for parameter-based models (only net.plist is written to tmpDir). Fixed to require only net.plist; data is saved/restored conditionally for BLOBFILE models.

Tested

test_bridge.m — 15/15 assertions:

  • compile_dyn, write_weight (ratio=2.0 confirmed), write_weight_f32, copy_io (1→2→4 chain), begin/end_realtime, compile cache hit (compile count does not increment on second call with same MIL), free

Adds a second dynamic weight approach to the bridge alongside the existing
BLOBFILE compile path. Instead of packing weights into the spatial dimension
of a single large input tensor and slicing them inside MIL (the training_dynamic/
approach), weights are declared as native MIL function parameters backed by
persistent IOSurfaces:

  // training_dynamic/ approach: spatial packing
  func main<ios18>(tensor<fp32, [1, DIM, 1, SEQ + 4*DIM]> x) {
      Wq = slice_by_size(x=x, begin=..., size=...);  // overhead
      ...

  // this PR: native function parameters
  func main<ios18>(tensor<fp16,[1,K,1,M]> x, tensor<fp16,[1,N,K]> W) { ... }

New API:
  ane_bridge_compile_dyn()      — compile with n_weights IOSurface parameters
  ane_bridge_write_weight()     — write fp16 to weight IOSurface (~0.001ms)
  ane_bridge_write_weight_f32() — write fp32 with NEON conversion
  ane_bridge_copy_io()          — direct output→input copy, no CPU round-trip
  ane_bridge_begin/end_realtime() — 90.6% p99 jitter reduction

Compile cache fix: ANE only writes net.plist for parameter-based models (no
data file). try_cache_restore now checks net.plist only; data is saved/restored
conditionally for BLOBFILE models that do produce it.

Also removes the pre-built libane_bridge.dylib binary from version control.

Performance vs spatial packing (Stories110M, 12 layers, M-series):
  training_dynamic/ (slice approach): 110ms/step
  function parameter approach:         76.9ms/step  (-30%)

The slice/reshape/transpose overhead per weight matrix explains the gap.
Both compile once at startup; weight updates are IOSurface writes in both cases.

Tested: test_bridge.m — 15/15 assertions across all new API functions.
dev-erik added a commit to dev-erik/ANE that referenced this pull request Mar 3, 2026
…timized training (train_opt), double-buffered async ANE training (train_double_buffer), Qwen2.5-0.5B LLM inference (inference/). Added get_path() env var support and SEC_FLAGS to all new targets. Skipped PR maderix#22 (binary blob risk).
@fspecii fspecii closed this by deleting the head repository Mar 3, 2026