
hexagon: optimization for HMX mat_mul#21554

Merged
max-krasnyansky merged 9 commits into ggml-org:master from aizip:feat/hmx-optimization on Apr 14, 2026

Conversation

@njsyw1997
Contributor

Overview

This PR introduces two additional optimizations for the Hexagon HMX backend:

  1. Enable asynchronous HMX execution
    HMX computations are now executed asynchronously, allowing them to overlap with HVX dequantization and DMA stages within the pipeline. Previously, synchronous HMX calls blocked the main thread and limited parallelism.

  2. Automatic shape search for mat_mul_qk_0_d16a32_out_stationary()
    The auto-tuning logic is extended to the out-stationary pipeline path. This functionality was previously only available for non out-stationary paths.

Additional Information

  • Improved auto-tuning strategy
    The previous strategy maximized mc * nc, effectively reducing the number of DMA calls. While this works well for FP16 matmul, it does not accurately model the cost of quantized matmul.

    In quantized matmul:

    • Weight tensors require both dequantization and shuffling
    • Activation tensors require only shuffling

    Profiling on the 8 Elite Gen 5 indicates that loading quantized weights is approximately 1.5× more expensive than loading activations. Although this is a rough estimate, it produces good enough results.

Benchmark on 8 Elite Gen 5

Master

| model | size | params | backend | ngl | threads | cpu_mask | cpu_strict | poll | n_batch | dev | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | pp512 | 123.54 ± 0.28 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | tg128 | 14.70 ± 0.04 |

Commit a521c91 (HMX Async)

| model | size | params | backend | ngl | threads | cpu_mask | cpu_strict | poll | n_batch | dev | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | pp512 | 130.68 ± 0.12 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | tg128 | 14.67 ± 0.06 |

Commit ef501f8 (HMX async and auto-tuning)

| model | size | params | backend | ngl | threads | cpu_mask | cpu_strict | poll | n_batch | dev | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | pp512 | 138.56 ± 0.75 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | HTP0 | 0 | tg128 | 14.92 ± 0.07 |

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes. Used for adding tests, logs and creating scripts to filter logs.

@njsyw1997 njsyw1997 requested a review from a team as a code owner April 7, 2026 10:36
@github-actions github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and Hexagon labels Apr 7, 2026
@max-krasnyansky
Member

@njsyw1997 thanks for the updates. I'm going to follow up (review/etc.) asap.
Wanted to merge this #21705 and another HMX-specific update on top first.

@njsyw1997 njsyw1997 force-pushed the feat/hmx-optimization branch 2 times, most recently from 0aedac2 to a51d65b Compare April 11, 2026 23:43
@njsyw1997
Contributor Author

njsyw1997 commented Apr 12, 2026

@max-krasnyansky
Rebase finished. Big improvement.

| model | size | params | backend | ngl | threads | cpu_mask | cpu_strict | poll | n_batch | fa | dev | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | 1 | HTP0 | 0 | pp512 | 176.92 ± 0.16 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | HTP | 99 | 6 | 0xfc | 1 | 1000 | 512 | 1 | HTP0 | 0 | tg128 | 20.77 ± 0.89 |

@max-krasnyansky
Member

max-krasnyansky commented Apr 12, 2026

@njsyw1997
Yep. I've been playing with it in parallel (with you) as well :)
Here is my branch with slightly cleaner rebase and some more optimizations from one of my colleagues.
https://github.com/qualcomm/llama.cpp/tree/hexagon-async-hmx

That branch also switches everything to use HMX intrinsics
(see HEXAGON_Tools/19.0.04/Tools/target/hexagon/include/hmx_hexagon_protos.h).
We're going to respin the toolchain soon with the next version of Hexagon SDK that will include HMX docs.

I wanted to iterate on the hmx-worker thing a bit more. I'm thinking we can either make it a lot simpler or maybe use the workpool with slight updates (i.e. we just need a mode where thread-0 does not run jobs), but we can do that as a follow-up.
I tested this branch on X-Elite, S24U, S25+, S26+ and I see very nice gains everywhere.

We also have a version that does weight transpose differently (using shuffles) but it needs a bit more work and ideally should move to model load time during the repack.

Please reset to that branch and push, it includes all your commits, and we can merge.
(btw you might want to clean up your commit ID, it's got a funky email).

@njsyw1997 njsyw1997 force-pushed the feat/hmx-optimization branch from 0895350 to 0d79977 Compare April 12, 2026 01:58
@njsyw1997
Contributor Author

@max-krasnyansky
Oh, that's a proxy email provided by GitHub. I will use my real email as the ID in the future.
Yes, I agree we should iterate on it. It is not an elegant solution. Do you have any roadmap or rough plan for the hexagon backend?

@max-krasnyansky
Member

> @max-krasnyansky Do you have any roadmap or any rough plan for the hexagon backend?

Yep, mostly the usual suspects. Three main areas:

  1. Op and data type coverage improvements: missing ops for Qwen3.5 (GDN, etc.), ...; Q4_1, ...
  2. Performance improvements: better VTCM alloc, better HMX utilization, FlashAttention optimizations, op fusion, ...
  3. Platform support improvements: Linux with IQ-8/9/10, Arduino (Ventuno-Q), ...

@max-krasnyansky
Member

max-krasnyansky commented Apr 13, 2026

@njsyw1997
I got a new working version of the hmx-worker replacement: an hmx-queue that mimics the dma-queue interface.
Removed extra thread wakeup round-trips, etc. It looks really nice and is simpler.
I'll test it some more and start a new PR tomorrow.

njsyw1997 and others added 8 commits April 14, 2026 10:42
Introduce hmx-worker (dedicated thread for HMX compute) to overlap HMX
matmul with HVX dequant/DMA stages in the pipeline path, replacing the
previous synchronous HMX calls that blocked the main thread.
Store the boolean in a local variable to avoid loading the atomic twice
…rface

Simplifies the overall implementation, reduces thread wakeup round-trips.
@njsyw1997
Contributor Author

Can we have another approval for merging?

@max-krasnyansky
Member

> Can we have another approval for merging?

Yes, I was about to send an update.
Can you please merge this and try on your setup https://github.com/qualcomm/llama.cpp/tree/hexagon-async-hmx
(well, not merge but reset to it, I just did a rebase with latest master).
I ended up iterating on the hmx-queue thing and it looks really nice and reduces latencies as I mentioned.

If that works well for you we'll ping @lhez and merge.

@max-krasnyansky
Member

> https://github.com/qualcomm/llama.cpp/tree/hexagon-async-hmx

btw I'm going to re-write the workpool into a workqueue that follows the same pattern.
We can benefit from submitting multiple jobs (say quant + hvx_mm) without having to go through multiple wakeup/sleep/wakeup thread cycles.

@njsyw1997 njsyw1997 force-pushed the feat/hmx-optimization branch from 0d79977 to c2b48b8 Compare April 14, 2026 19:12
@njsyw1997
Contributor Author

Done.
I am also preparing a new PR for flash attention and for supporting more quantization formats on the HMX side. After merging I will rebase everything on these changes.

@max-krasnyansky
Member

@lhez or @ggml-org/maintainers need the second approval here please

@max-krasnyansky
Member

> Done. I am also preparing a new PR for flash attention and supporting more quantization formats on HMX side. After merging I will rebase everything on these changes.

I was just doing a quick model sweep on S26+ with this PR.
Nice bumps across the board. I see similar improvements on X-Elite, S24U, S25+ as well.

qwen3-0.6b-Q4_0
  master
  prompt eval time =     277.78 ms /   204 tokens (    1.36 ms per token,   734.40 tokens per second)
         eval time =     744.84 ms /    63 runs   (   11.82 ms per token,    84.58 tokens per second)

  hmx-async
  prompt eval time =     274.12 ms /   204 tokens (    1.34 ms per token,   744.20 tokens per second)
         eval time =     741.37 ms /    63 runs   (   11.77 ms per token,    84.98 tokens per second)

qwen3-4b-Q4_0
  master
  prompt eval time =    1489.46 ms /   204 tokens (    7.30 ms per token,   136.96 tokens per second)
         eval time =    2915.77 ms /    63 runs   (   46.28 ms per token,    21.61 tokens per second)

  hmx-async
  prompt eval time =    1358.25 ms /   204 tokens (    6.66 ms per token,   150.19 tokens per second)
         eval time =    2921.44 ms /    63 runs   (   46.37 ms per token,    21.56 tokens per second)

  prompt eval time =    5174.48 ms /   732 tokens (    7.07 ms per token,   141.46 tokens per second) << faster on longer prompt
         eval time =    3235.20 ms /    63 runs   (   51.35 ms per token,    19.47 tokens per second)

qwen3.5-08b-Q4_0
  master
  prompt eval time =     795.61 ms /   206 tokens (    3.86 ms per token,   258.92 tokens per second)
         eval time =    4161.33 ms /    63 runs   (   66.05 ms per token,    15.14 tokens per second)

  hmx-async
  prompt eval time =    2539.91 ms /   732 tokens (    3.47 ms per token,   288.20 tokens per second) << faster on longer prompt
         eval time =    4485.94 ms /    63 runs   (   71.21 ms per token,    14.04 tokens per second)

gemma-4-e2b-Q4_0
  master
  prompt eval time =     983.40 ms /   202 tokens (    4.87 ms per token,   205.41 tokens per second)
         eval time =    2267.28 ms /    63 runs   (   35.99 ms per token,    27.79 tokens per second)

  hmx-async
  prompt eval time =     972.69 ms /   202 tokens (    4.84 ms per token,   206.64 tokens per second)
         eval time =    2286.67 ms /    63 runs   (   36.30 ms per token,    27.55 tokens per second)

LFM2.5-1.2B-Q4_0
  hmx-async
  prompt eval time =    1065.26 ms /   767 tokens (    1.39 ms per token,   720.01 tokens per second)
         eval time =     924.45 ms /    63 runs   (   14.67 ms per token,    68.15 tokens per second)

OLMoE-7B-Q4_0
  master
  prompt eval time =    1353.83 ms /   212 tokens (    6.39 ms per token,   156.59 tokens per second)
         eval time =    1244.10 ms /    63 runs   (   19.75 ms per token,    50.64 tokens per second)

  hmx-async
  prompt eval time =    1333.71 ms /   212 tokens (    6.32 ms per token,   158.21 tokens per second)
         eval time =    1293.38 ms /    63 runs   (   20.53 ms per token,    48.71 tokens per second)

  prompt eval time =    4952.02 ms /   758 tokens (    6.53 ms per token,   153.07 tokens per second)
         eval time =    1415.80 ms /    63 runs   (   22.47 ms per token,    44.50 tokens per second)

@max-krasnyansky max-krasnyansky merged commit 5d14e5d into ggml-org:master Apr 14, 2026
47 of 50 checks passed
mengqin pushed a commit to mengqin/llama.cpp that referenced this pull request Apr 20, 2026
* hexagon: add async HMX worker

Introduce hmx-worker (dedicated thread for HMX compute) to overlap HMX
matmul with HVX dequant/DMA stages in the pipeline path, replacing the
previous synchronous HMX calls that blocked the main thread.

* hexagon: cost-based VTCM chunk search for out-stationary matmul

* hexagon: fix futex race in hmx_worker_drain
Store the boolean in a local variable to avoid loading the atomic twice

* hex-mm: hmx optimize scatter/transpose and use HMX intrinsics

* hex-vmem: drop vmem limit a touch under 3GB on v73

* hexagon: add fwd declaration of htp_context

* hex-hmx: replace hmx-worker with hmx-queue that mimics dma-queue interface

Simplifies the overall implementation, reduces thread wakeup round-trips.

* hex-mm: add debug log to hmx work func called from hmx-queue

* Update hmx-queue.h

Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>

---------

Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Apr 21, 2026
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Apr 23, 2026