hexagon: optimize HMX matmul operations #21071
max-krasnyansky merged 25 commits into ggml-org:master
Conversation
```c
TIMER_START(total);
...
HAP_compute_res_hmx_lock(ctx->vtcm_rctx);
hmx_set_output_scales(vtcm_scales);
```
Key change: move the scale and bias initialization out of the loop instead of reinitializing the same scale each iteration, which reduces HMX register setup overhead.
Potential follow-up: reuse this scale for quant-block scaling to avoid the dequantization vmpy, at the cost of extra VTCM cache for scale storage...
From testing, the scales are column-based, so we can postpone scale multiplication until after accumulation.
For each column in the 32x32 output tile, apply the scale once per column, which should remove the multiply currently done during dequantization.
This likely saves dequant vmpy work, at the cost of extra VTCM usage for storing scales.
# Conflicts:
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
@chraac
No problem, let's wait for your PR to be merged first, then redo this one based on that.
# Conflicts:
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
This reverts commit cde679e.
@chraac Do you still see an improvement after rebasing with recent master (i.e. after we added hmx-queue and dynamic chunk sizing)?
Small improvement on qwen3.5-2b-q4 prefill on my 8 Gen 3 devices.
Not sure yet whether this is a real gain or just run-to-run fluctuation, since the difference is fairly small.
```c
}
...
void *va = HAP_mmap(NULL, size, HAP_PROT_READ | HAP_PROT_WRITE, 0, fd, 0);
#endif
```
Fallback to HAP_mmap for older archs; that fixes the crash on my 8 Gen 2 devices.
Sweet!
max-krasnyansky left a comment
Yep. Looks good here. Seems to improve prompt processing by ~1-2 TPS.
@lhez can we please get the second approval.
* optimize hmx_mat_mul functions by calculating row and column tiles upfront
* refactor core_dot_chunk_fp16 to use size_t for tile counts and improve readability
* wip
* set scale outside of loop
* wip
* refactor core_mma_chunk_fp16 and mat_mul_qk_0_d16a32 to use size_t for tile counts
* wip
* wip
* refactor transfer_output_chunk_fp16_to_fp32 to use size_t for dimensions
* refactor core_dot_chunk_fp16 to use size_t for tile row stride calculation
* wip
* refactor hmx_mat_mul functions to use hvx_vec_splat_f16 for column scales initialization
* refactor hmx_mat_mul_permuted_w16a32_batched to streamline scale setting and locking
* refactor core_dot_chunk_fp16 to improve tile stride calculations for output
* refactor hmx_mat_mul functions to use Q6_V_vsplat_R for column scales initialization
* fix compiling error
* wip
* optimize row and column tile indexing in core_mma_chunk_fp16 function
* wip
* Revert "wip" (reverts commit cde679e)
* Add size limit check for HAP_mmap in htp_iface_mmap and drop_mmap functions
* wip
Summary
Type Safety and Code Robustness:
* Replaced `int` with `size_t` for variables representing sizes, indices, and tile counts throughout the codebase to prevent potential integer overflows and improve correctness (e.g., `n_col_tiles`, `n_row_tiles`, loop indices). [1] [2] [3] [4] [5] [6] [7]
* Switched tile arithmetic to `size_t` and clarified index calculations in matrix operations, which improves code clarity and reduces the risk of subtle bugs. [1] [2]

Resource Management and Thread Safety:
* Adjusted `HAP_compute_res_hmx_lock` and `HAP_compute_res_hmx_unlock` calls to ensure locks are held for the correct duration, improving thread safety and resource management. [1] [2] [3]

Architecture Compatibility and Memory Mapping:
* Updated `main.c` to use `HAP_mmap2`/`HAP_munmap2` for HVX architectures greater than 73, falling back to `HAP_mmap`/`HAP_munmap` otherwise. Added a check to prevent mapping more than 2GB with `HAP_mmap` (which is unsupported), improving compatibility and error handling. [1] [2] [3]

Documentation and Comments:
These changes collectively improve the reliability, maintainability, and portability of the Hexagon HTP matrix multiplication code.
Additional information
Tested with Qwen3.5-2b-q4; works well.
Requirements