bridge: function parameter IOSurfaces — 30% faster than spatial packing (76.9ms vs 110ms/step)#22

Closed
fspecii wants to merge 1 commit into maderix:main from fspecii:param-iosurface-dynamic-weights

Conversation

@fspecii

@fspecii fspecii commented Mar 3, 2026

What

This PR implements an alternative dynamic-weight approach: weights are declared as native MIL function parameters backed by persistent IOSurfaces, instead of being packed into the spatial dimension of a single large input tensor.

Two approaches compared

Current training_dynamic/ — spatial dimension packing

// Input: [1, DIM, 1, SEQ + 4*DIM] — activations at sp[0:SEQ], weights at sp[SEQ:]
func main<ios18>(tensor<fp32, [1, DIM, 1, SEQ+4*DIM]> x) {
    Wq  = slice_by_size(x=x, begin=[0,0,0,SEQ], size=[1,DIM,1,DIM]); // overhead
    Wq2 = reshape(shape=[1,1,DIM,DIM], x=Wq);                        // overhead
    // xn2 = normalized activations, reshaped (elided)
    xnt = transpose(perm=[0,1,3,2], x=xn2);                          // overhead
    qm  = matmul(x=xnt, y=Wq2);
    ...
}

This PR — native function parameters

// Each weight is its own IOSurface argument, no unpacking needed
func main<ios18>(tensor<fp16,[1,K,1,M]> x, tensor<fp16,[1,N,K]> W) {
    out = matmul(x=W, y=x);
} -> (out);

The spatial packing approach adds slice_by_size + reshape + transpose overhead for every weight matrix per kernel call. With 4 weights per attention kernel, that is 12 extra ops (4 weights × 3 ops) before the actual matmul.

Performance

Same hardware (M-series), same model (Stories110M, 12 layers, DIM=768, SEQ=256):

Approach                              ms/step   Kernels
training_dynamic/ (spatial packing)   110.0     9
Function parameters (this PR)         76.9      74

30% faster per step. Both compile once at startup, both update weights via IOSurface writes (~0.001ms).

New bridge API

// Compile once — weights declared as MIL function parameters
ANEKernelHandle *ane_bridge_compile_dyn(
    const char *mil_text, size_t mil_len,
    int n_inputs, const size_t *input_sizes,
    int n_weights, const size_t *weight_sizes,
    size_t output_size);

// Update weights — direct IOSurface write, ~0.001ms
void ane_bridge_write_weight(ANEKernelHandle *k, int idx,
                              const void *fp16_data, size_t bytes);
void ane_bridge_write_weight_f32(ANEKernelHandle *k, int idx,
                                  const float *fp32_data, size_t count);

// Chain kernels without CPU round-trip
void ane_bridge_copy_io(ANEKernelHandle *src, int src_out_idx,
                         ANEKernelHandle *dst, int dst_in_idx);

// 90.6% p99 jitter reduction (plain p99=35ms → with RT task p99=3.3ms)
void ane_bridge_begin_realtime(void);
void ane_bridge_end_realtime(void);

Cache fix

try_cache_restore previously required a data file that ANE never creates for parameter-based models (only net.plist is written to tmpDir). Fixed to require only net.plist; data is saved/restored conditionally for BLOBFILE models.

Tested

test_bridge.m — 15/15 assertions:

  • compile_dyn, write_weight (ratio=2.0 confirmed), write_weight_f32, copy_io (1→2→4 chain), begin/end_realtime, compile cache hit (compile count does not increment on second call with same MIL), free

Adds a second dynamic weight approach to the bridge alongside the existing
BLOBFILE compile path. Instead of packing weights into the spatial dimension
of a single large input tensor and slicing them inside MIL (the training_dynamic/
approach), weights are declared as native MIL function parameters backed by
persistent IOSurfaces:

  // training_dynamic/ approach: spatial packing
  func main<ios18>(tensor<fp32, [1, DIM, 1, SEQ + 4*DIM]> x) {
      Wq = slice_by_size(x=x, begin=..., size=...);  // overhead
      ...

  // this PR: native function parameters
  func main<ios18>(tensor<fp16,[1,K,1,M]> x, tensor<fp16,[1,N,K]> W) { ... }

New API:
  ane_bridge_compile_dyn()      — compile with n_weights IOSurface parameters
  ane_bridge_write_weight()     — write fp16 to weight IOSurface (~0.001ms)
  ane_bridge_write_weight_f32() — write fp32 with NEON conversion
  ane_bridge_copy_io()          — direct output→input copy, no CPU round-trip
  ane_bridge_begin/end_realtime() — 90.6% p99 jitter reduction

Compile cache fix: ANE only writes net.plist for parameter-based models (no
data file). try_cache_restore now checks net.plist only; data is saved/restored
conditionally for BLOBFILE models that do produce it.

Also removes the pre-built libane_bridge.dylib binary from version control.

Performance vs spatial packing (Stories110M, 12 layers, M-series):
  training_dynamic/ (slice approach): 110ms/step
  function parameter approach:         76.9ms/step  (-30%)

The slice/reshape/transpose overhead per weight matrix explains the gap.
Both compile once at startup; weight updates are IOSurface writes in both cases.

Tested: test_bridge.m — 15/15 assertions across all new API functions.
dev-erik added a commit to dev-erik/ANE that referenced this pull request Mar 3, 2026
…timized training (train_opt), double-buffered async ANE training (train_double_buffer), Qwen2.5-0.5B LLM inference (inference/). Added get_path() env var support and SEC_FLAGS to all new targets. Skipped PR maderix#22 (binary blob risk).
@fspecii fspecii closed this by deleting the head repository Mar 3, 2026