
hexagon: optimize HMX matmul operations #21071

Merged
max-krasnyansky merged 25 commits into ggml-org:master from chraac:dev-hmx-opt on Apr 16, 2026

Conversation

@chraac (Contributor) commented Mar 27, 2026

Summary

Type Safety and Code Robustness:

  • Replaced int with size_t for variables representing sizes, indices, and tile counts throughout the codebase (e.g. n_col_tiles, n_row_tiles, loop indices) to prevent potential integer overflows and improve correctness; a small sketch of the pattern follows this list.
  • Refactored tile and row/column stride calculations to use size_t and clarified index calculations in the matrix operations, which improves code clarity and reduces the risk of subtle bugs.
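
A minimal sketch of the int → size_t pattern described above, assuming 32x32 HMX output tiles; the helper name and arguments are illustrative, not the exact code from the PR:

```c
#include <stddef.h>

#define HMX_TILE_DIM 32  // assumed 32x32 output tiles, per the review discussion below

static void dot_chunk_fp16_sketch(size_t n_rows, size_t n_cols) {
    // compute tile counts up front with size_t (instead of int) to avoid
    // overflow and sign issues on large matrices
    const size_t n_row_tiles = (n_rows + HMX_TILE_DIM - 1) / HMX_TILE_DIM;
    const size_t n_col_tiles = (n_cols + HMX_TILE_DIM - 1) / HMX_TILE_DIM;

    for (size_t ir = 0; ir < n_row_tiles; ir++) {
        for (size_t ic = 0; ic < n_col_tiles; ic++) {
            // per-tile HMX matmul work goes here
            (void) ir;
            (void) ic;
        }
    }
}
```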

Resource Management and Thread Safety:

  • Moved the HAP_compute_res_hmx_lock and HAP_compute_res_hmx_unlock calls so the lock is held for the correct duration, improving thread safety and resource management (see the sketch below).
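
A hedged sketch of the locking pattern, assuming the HAP_compute_res.h declarations from the Hexagon SDK; the lock argument type, hmx_set_output_scales, and the per-tile helper are placeholders based on the diff context quoted later in the thread:

```c
#include <stddef.h>
#include "HAP_compute_res.h"  // HAP_compute_res_hmx_lock/unlock (Hexagon SDK)

// placeholders standing in for the PR's real helpers
extern void hmx_set_output_scales(const void * scales);
extern void hmx_mma_tile(size_t tile_idx);

static void hmx_batch_sketch(unsigned int vtcm_rctx, const void * vtcm_scales, size_t n_tiles) {
    // acquire the HMX unit before any HMX state (output scales/bias) is programmed...
    HAP_compute_res_hmx_lock(vtcm_rctx);
    hmx_set_output_scales(vtcm_scales);  // one-time setup while the lock is held

    for (size_t i = 0; i < n_tiles; i++) {
        hmx_mma_tile(i);
    }

    // ...and release it only after the last tile has been issued
    HAP_compute_res_hmx_unlock(vtcm_rctx);
}
```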

Architecture Compatibility and Memory Mapping:

  • Updated the memory mapping and unmapping logic in main.c to use HAP_mmap2/HAP_munmap2 on Hexagon architectures above v73, falling back to HAP_mmap/HAP_munmap otherwise. Added a check that rejects mappings larger than 2GB with HAP_mmap (which does not support them), improving compatibility and error handling; a sketch follows.
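
A sketch of the mapping logic described above; the architecture guard macro and the HAP_mmap2 signature (assumed here to mirror HAP_mmap, whose call appears in the diff context further down) are assumptions, not code copied from the PR:

```c
#include <stddef.h>
#include "HAP_mem.h"  // HAP_mmap/HAP_munmap and HAP_PROT_* (Hexagon SDK)

static void * map_buffer_sketch(int fd, size_t size) {
#if __HEXAGON_ARCH__ > 73  // assumed guard; the PR gates this on the target architecture
    // newer architectures: HAP_mmap2 can handle mappings larger than 2GB
    return HAP_mmap2(NULL, size, HAP_PROT_READ | HAP_PROT_WRITE, 0, fd, 0);
#else
    // older architectures: HAP_mmap does not support >2GB mappings, so fail early
    if (size >= (size_t) 2 * 1024 * 1024 * 1024) {
        return NULL;
    }
    return HAP_mmap(NULL, size, HAP_PROT_READ | HAP_PROT_WRITE, 0, fd, 0);
#endif
}
```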

Documentation and Comments:

  • Clarified the comments related to scale initialization, specifying that both scale and bias are set in the FP16 initialization, for improved code documentation (a small HVX sketch of that initialization follows).
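
A hedged sketch of what an fp16 scale/bias initialization via a splat can look like; Q6_V_vsplat_R is a standard HVX intrinsic, but the packing order (scale in the low half-word, bias in the high half-word) and the headers used are assumptions, not taken from the PR:

```c
#include <stdint.h>
#include <hexagon_types.h>   // HVX_Vector (Hexagon toolchain)
#include <hexagon_protos.h>  // Q6_V_vsplat_R

// Pack one fp16 scale and one fp16 bias (given as raw bit patterns) into a
// 32-bit word and splat it across an HVX vector, so every lane carries the
// same {scale, bias} pair.
static HVX_Vector splat_scale_bias_f16(uint16_t scale_bits, uint16_t bias_bits) {
    const uint32_t packed = ((uint32_t) bias_bits << 16) | (uint32_t) scale_bits;
    return Q6_V_vsplat_R(packed);
}
```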

These changes collectively improve the reliability, maintainability, and portability of the Hexagon HTP matrix multiplication code.

Additional information

Tested with Qwen3.5-2b-q4; it works well.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, for commit log and PR descriptions

@chraac requested a review from a team as a code owner, March 27, 2026 15:05
@github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and Hexagon labels, Mar 27, 2026
TIMER_START(total);

HAP_compute_res_hmx_lock(ctx->vtcm_rctx);
hmx_set_output_scales(vtcm_scales);
@chraac (author) commented on this diff:
Key change: move the scale and bias initialization out of the loop instead of reinitializing the same scale on every iteration, to reduce redundant HMX register setup.


@chraac (author) commented:
Potential follow-up: reuse this scale for quant-block scaling to avoid the dequantization vmpy, at the cost of extra VTCM cache for scale storage...

From testing, the scales are column-based, so we can postpone scale multiplication until after accumulation.
For each column of the 32x32 output tile, apply its scale once after accumulation; that should remove the multiply currently done during dequantization.

This likely saves dequant vmpy work, at the cost of extra VTCM usage for storing scales.
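
A minimal sketch of that proposed follow-up (not part of this PR), assuming a 32x32 fp32 output tile and one scale per output column; all names are illustrative:

```c
#include <stddef.h>

#define TILE_DIM 32

// Apply per-column scales once, after the tile has been fully accumulated,
// instead of scaling every value during dequantization.
static void apply_column_scales(float out_tile[TILE_DIM][TILE_DIM],
                                const float col_scales[TILE_DIM]) {
    for (size_t col = 0; col < TILE_DIM; col++) {
        const float s = col_scales[col];      // one scale per output column
        for (size_t row = 0; row < TILE_DIM; row++) {
            out_tile[row][col] *= s;          // replaces the per-element dequant vmpy
        }
    }
}
```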

@max-krasnyansky (Member) commented:
@chraac
Sorry for the delay on this. We have a couple of decent size changes coming in this area (op request batching/buffer management and HMX optimizations).
We'll probably merge that stuff first (hopefully by end of this week), then I'm going to work with you to rebase/update/merge things on top.

@chraac (author) commented Apr 2, 2026:

> We'll probably merge that stuff first (hopefully by end of this week), then I'm going to work with you to rebase/update/merge things on top.

No problem, let's wait for your PR to be merged first, then I'll redo this one on top of it.

@max-krasnyansky (Member) commented:

@chraac here is the PR that I mentioned: #21705
There are more HMX specific updates coming on top shortly.

@max-krasnyansky (Member) commented:

@chraac Do you still see an improvement after rebasing with recent master (i.e. after we added hmx-queue and dyn. chunk sizing)?

@chraac (author) commented Apr 16, 2026:

> @chraac Do you still see an improvement after rebasing with recent master (i.e. after we added hmx-queue and dyn. chunk sizing)?

Small improvement on qwen3.5-2b-q4 prefill on my 8 Gen 3 devices.

| before (5d14e5d) | after (086ccf5) |
| ---------------- | --------------- |
| 67.14 tk/s       | 71.64 tk/s      |

Not sure yet whether this is a real gain or just run-to-run fluctuation, since the difference is fairly small.

}

void *va = HAP_mmap(NULL, size, HAP_PROT_READ | HAP_PROT_WRITE, 0, fd, 0);
#endif
@chraac (author) commented Apr 16, 2026:

Fall back to HAP_mmap on older architectures; that fixes the crash on my 8 Gen 2 devices.

@max-krasnyansky (Member) commented:

> @chraac Do you still see an improvement after rebasing with recent master (i.e. after we added hmx-queue and dyn. chunk sizing)?
>
> Small improvement on qwen3.5-2b-q4 prefill on my 8 Gen 3 devices.
>
> | before (5d14e5d) | after (086ccf5) |
> | ---------------- | --------------- |
> | 67.14 tk/s       | 71.64 tk/s      |
>
> Not sure yet whether this is a real gain or just run-to-run fluctuation, since the difference is fairly small.

Sweet!
Trying it out on my setups...
Other fixes sound good too.

@max-krasnyansky (Member) left a review comment:

Yep. Looks good here. Seems to improve prompt processing by ~1-2 TPS.
@lhez can we please get a second approval?

@max-krasnyansky max-krasnyansky merged commit 85dde8d into ggml-org:master Apr 16, 2026
48 of 50 checks passed
cnsiva pushed a commit to saas-home/llama.cpp that referenced this pull request Apr 17, 2026
* optimize hmx_mat_mul functions by calculating row and column tiles upfront

* refactor core_dot_chunk_fp16 to use size_t for tile counts and improve readability

* wip

* set scale outside of loop

* wip

* refactor core_mma_chunk_fp16 and mat_mul_qk_0_d16a32 to use size_t for tile counts

* wip

* wip

* refactor transfer_output_chunk_fp16_to_fp32 to use size_t for dimensions

* refactor core_dot_chunk_fp16 to use size_t for tile row stride calculation

* wip

* refactor hmx_mat_mul functions to use hvx_vec_splat_f16 for column scales initialization

* refactor hmx_mat_mul_permuted_w16a32_batched to streamline scale setting and locking

* refactor core_dot_chunk_fp16 to improve tile stride calculations for output

* refactor hmx_mat_mul functions to use Q6_V_vsplat_R for column scales initialization

* fix compiling error

* wip

* optimize row and column tile indexing in core_mma_chunk_fp16 function

* wip

* Revert "wip"

This reverts commit cde679e.

* Add size limit check for HAP_mmap in htp_iface_mmap and drop_mmap functions

* wip
samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request Apr 19, 2026
mengqin pushed a commit to mengqin/llama.cpp that referenced this pull request Apr 20, 2026
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Apr 21, 2026
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Apr 23, 2026

Labels: ggml (changes relating to the ggml tensor library for machine learning), Hexagon

3 participants