
hexagon: optimization for HMX mat_mul#21554

Merged
max-krasnyansky merged 9 commits into ggml-org:master from aizip:feat/hmx-optimization on Apr 14, 2026

Conversation

@njsyw1997
Contributor

Overview

This PR introduces two additional optimizations for the Hexagon HMX backend:

  1. Enable asynchronous HMX execution
    HMX computations are now executed asynchronously, allowing them to overlap with HVX dequantization and DMA stages within the pipeline. Previously, synchronous HMX calls blocked the main thread and limited parallelism.

  2. Automatic shape search for mat_mul_qk_0_d16a32_out_stationary()
    The auto-tuning logic is extended to the out-stationary pipeline path. This functionality was previously only available for non out-stationary paths.

Additional Information

  • Improved auto-tuning strategy
    The previous strategy maximized mc * nc, effectively reducing the number of DMA calls. While this works well for FP16 matmul, it does not accurately model the cost of quantized matmul.

    In quantized matmul:

    • Weight tensors require both dequantization and shuffling
    • Activation tensors require only shuffling

    Profiling on the 8 Elite Gen 5 indicates that loading quantized weights is approximately 1.5× more expensive than loading activations. Although this is a rough estimate, it produces good enough results.

Benchmark on 8 Elite Gen 5

Master

| model | size | params | backend | ngl | threads | cpu_mask | cpu_strict | poll | n_batch | dev | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | pp512 | 123.54 ± 0.28 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | tg128 | 14.70 ± 0.04 |

Commit a521c91 (HMX Async)

| model | size | params | backend | ngl | threads | cpu_mask | cpu_strict | poll | n_batch | dev | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | pp512 | 130.68 ± 0.12 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | tg128 | 14.67 ± 0.06 |

Commit ef501f8 (HMX async and auto-tuning)

| model | size | params | backend | ngl | threads | cpu_mask | cpu_strict | poll | n_batch | dev | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | pp512 | 138.56 ± 0.75 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | tg128 | 14.92 ± 0.07 |

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes. Used for adding tests, logs and creating scripts to filter logs.

@njsyw1997 njsyw1997 requested a review from a team as a code owner April 7, 2026 10:36
@github-actions github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and Hexagon labels Apr 7, 2026
@max-krasnyansky
Member

@njsyw1997 thanks for the updates. I'm going to follow up (review/etc.) asap.
Wanted to merge this #21705 and another HMX-specific update on top first.

@njsyw1997 njsyw1997 force-pushed the feat/hmx-optimization branch 2 times, most recently from 0aedac2 to a51d65b Compare April 11, 2026 23:43
@njsyw1997
Contributor Author

njsyw1997 commented Apr 12, 2026

@max-krasnyansky
Rebase finished. Big improvement.

| model | size | params | backend | ngl | threads | cpu_mask | cpu_strict | poll | n_batch | fa | dev | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | 1 | HTP0 | 0 | pp512 | 176.92 ± 0.16 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | 1 | HTP0 | 0 | tg128 | 20.77 ± 0.89 |

@max-krasnyansky
Member

max-krasnyansky commented Apr 12, 2026

@njsyw1997
Yep. I've been playing with it in parallel (with you) as well :)
Here is my branch with slightly cleaner rebase and some more optimizations from one of my colleagues.
https://github.com/qualcomm/llama.cpp/tree/hexagon-async-hmx

That branch also switches everything to use HMX intrinsics
(see HEXAGON_Tools/19.0.04/Tools/target/hexagon/include/hmx_hexagon_protos.h).
We're going to respin the toolchain soon with the next version of Hexagon SDK that will include HMX docs.

I wanted to iterate on the hmx-worker thing a bit more. I'm thinking we can either make it a lot simpler or maybe use the workpool with slight updates (i.e. we just need a mode where thread-0 does not run jobs), but we can do that as a follow-up.
I tested this branch on X-Elite, S24U, S25+, S26+ and I see very nice gains everywhere.

We also have a version that does weight transpose differently (using shuffles) but it needs a bit more work and ideally should move to model load time during the repack.

Please reset to that branch and push, it includes all your commits, and we can merge.
(btw you might want to clean up your commit ID, it's got a funky email).

@njsyw1997 njsyw1997 force-pushed the feat/hmx-optimization branch from 0895350 to 0d79977 Compare April 12, 2026 01:58
@njsyw1997
Contributor Author

@max-krasnyansky
Oh, that's a proxy email provided by GitHub. I will use my real email as the ID in the future.
Yes, I agree we should iterate on it. It is not an elegant solution. Do you have any roadmap or rough plan for the hexagon backend?

@max-krasnyansky
Member

> @max-krasnyansky Do you have any roadmap or any rough plan for the hexagon backend?

Yep, mostly the usual suspects. Three main areas:

  1. Op and data type coverage improvements: missing ops for Qwen3.5 (GDN, etc.), ...; Q4_1, ...
  2. Performance improvements: better VTCM alloc, better HMX utilization, FlashAttention optimizations, op fusion, ...
  3. Platform support improvements: Linux with IQ-8/9/10, Arduino (Ventuno-Q), ...

@max-krasnyansky
Member

max-krasnyansky commented Apr 13, 2026

@njsyw1997
I got a new working version of the hmx-worker replacement: an hmx-queue that mimics the dma-queue interface.
Removed extra thread wakeup round-trips, etc. It looks really nice and is simpler.
I'll test it some more and start a new PR tomorrow.

njsyw1997 and others added 8 commits April 14, 2026 10:42
Introduce hmx-worker (dedicated thread for HMX compute) to overlap HMX
matmul with HVX dequant/DMA stages in the pipeline path, replacing the
previous synchronous HMX calls that blocked the main thread.
Store the boolean in a local variable to avoid loading the atomic twice
…rface

Simplifies the overall implementation, reduces thread wakeup round-trips.
@njsyw1997
Contributor Author

Can we have another approval for merging?

@max-krasnyansky
Member

> Can we have another approval for merging?

Yes, I was about to send an update.
Can you please merge this and try on your setup https://github.com/qualcomm/llama.cpp/tree/hexagon-async-hmx
(well, not merge but reset to it, I just did a rebase with latest master).
I ended up iterating on the hmx-queue thing and it looks really nice and reduces latencies as I mentioned.

If that works well for you we'll ping @lhez and merge.

@max-krasnyansky
Member

> https://github.com/qualcomm/llama.cpp/tree/hexagon-async-hmx

btw I'm going to re-write the workpool into a workqueue that follows the same pattern.
We can benefit from submitting multiple jobs (say quant + hvx_mm) without having to go through multiple wakeup/sleep/wakeup thread cycles.

@njsyw1997 njsyw1997 force-pushed the feat/hmx-optimization branch from 0d79977 to c2b48b8 Compare April 14, 2026 19:12
@njsyw1997
Contributor Author

Done.
I am also preparing a new PR for flash attention and for supporting more quantization formats on the HMX side. After merging I will rebase everything on these changes.

@max-krasnyansky
Member

@lhez or @ggml-org/maintainers need the second approval here please

@max-krasnyansky
Member

> Done. I am also preparing a new PR for flash attention and supporting more quantization formats on HMX side. After merging I will rebase everything on these changes.

I was just doing a quick model sweep on S26+ with this PR.
Nice bumps across the board. I see similar improvements on X-Elite, S24U, S25+ as well.

qwen3-0.6b-Q4_0
  master
  prompt eval time =     277.78 ms /   204 tokens (    1.36 ms per token,   734.40 tokens per second)
         eval time =     744.84 ms /    63 runs   (   11.82 ms per token,    84.58 tokens per second)

  hmx-async
  prompt eval time =     274.12 ms /   204 tokens (    1.34 ms per token,   744.20 tokens per second)
         eval time =     741.37 ms /    63 runs   (   11.77 ms per token,    84.98 tokens per second)

qwen3-4b-Q4_0
  master
  prompt eval time =    1489.46 ms /   204 tokens (    7.30 ms per token,   136.96 tokens per second)
         eval time =    2915.77 ms /    63 runs   (   46.28 ms per token,    21.61 tokens per second)

  hmx-async
  prompt eval time =    1358.25 ms /   204 tokens (    6.66 ms per token,   150.19 tokens per second)
         eval time =    2921.44 ms /    63 runs   (   46.37 ms per token,    21.56 tokens per second)

  prompt eval time =    5174.48 ms /   732 tokens (    7.07 ms per token,   141.46 tokens per second) << faster on longer prompt
         eval time =    3235.20 ms /    63 runs   (   51.35 ms per token,    19.47 tokens per second)

qwen3.5-08b-Q4_0
  master
  prompt eval time =     795.61 ms /   206 tokens (    3.86 ms per token,   258.92 tokens per second)
         eval time =    4161.33 ms /    63 runs   (   66.05 ms per token,    15.14 tokens per second)

  hmx-async
  prompt eval time =    2539.91 ms /   732 tokens (    3.47 ms per token,   288.20 tokens per second) << faster on longer prompt
         eval time =    4485.94 ms /    63 runs   (   71.21 ms per token,    14.04 tokens per second)

gemma-4-e2b-Q4_0
  master
  prompt eval time =     983.40 ms /   202 tokens (    4.87 ms per token,   205.41 tokens per second)
         eval time =    2267.28 ms /    63 runs   (   35.99 ms per token,    27.79 tokens per second)

  hmx-async
  prompt eval time =     972.69 ms /   202 tokens (    4.84 ms per token,   206.64 tokens per second)
         eval time =    2286.67 ms /    63 runs   (   36.30 ms per token,    27.55 tokens per second)

LFM2.5-1.2B-Q4_0
  hmx-async
  prompt eval time =    1065.26 ms /   767 tokens (    1.39 ms per token,   720.01 tokens per second)
         eval time =     924.45 ms /    63 runs   (   14.67 ms per token,    68.15 tokens per second)

OLMoE-7B-Q4_0
  master
  prompt eval time =    1353.83 ms /   212 tokens (    6.39 ms per token,   156.59 tokens per second)
         eval time =    1244.10 ms /    63 runs   (   19.75 ms per token,    50.64 tokens per second)

  hmx-async
  prompt eval time =    1333.71 ms /   212 tokens (    6.32 ms per token,   158.21 tokens per second)
         eval time =    1293.38 ms /    63 runs   (   20.53 ms per token,    48.71 tokens per second)

  prompt eval time =    4952.02 ms /   758 tokens (    6.53 ms per token,   153.07 tokens per second)
         eval time =    1415.80 ms /    63 runs   (   22.47 ms per token,    44.50 tokens per second)

@max-krasnyansky max-krasnyansky merged commit 5d14e5d into ggml-org:master Apr 14, 2026
47 of 50 checks passed
mengqin pushed a commit to mengqin/llama.cpp that referenced this pull request Apr 20, 2026
* hexagon: add async HMX worker

Introduce hmx-worker (dedicated thread for HMX compute) to overlap HMX
matmul with HVX dequant/DMA stages in the pipeline path, replacing the
previous synchronous HMX calls that blocked the main thread.

* hexagon: cost-based VTCM chunk search for out-stationary matmul

* hexagon: fix futex race in hmx_worker_drain
Store the boolean in a local variable to avoid loading the atomic twice

* hex-mm: hmx optimize scatter/transpose and use HMX intrinsics

* hex-vmem: drop vmem limit a touch under 3GB on v73

* hexagon: add fwd declaration of htp_context

* hex-hmx: replace hmx-worker with hmx-queue that mimics dma-queue interface

Simplifies the overall implementation, reduces thread wakeup round-trips.

* hex-mm: add debug log to hmx work func called from hmx-queue

* Update hmx-queue.h

Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>

---------

Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Apr 21, 2026
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Apr 23, 2026