hexagon: optimize HMX matmul operations #21071
max-krasnyansky merged 25 commits into ggml-org:master
Conversation
```c
TIMER_START(total);
...
HAP_compute_res_hmx_lock(ctx->vtcm_rctx);
hmx_set_output_scales(vtcm_scales);
```
Key change: move the scale and bias initialization out of the loop instead of reinitializing the same scale each iteration, which reduces HMX register setup overhead.
Potential follow-up: reuse this scale for quant-block scaling to avoid the dequantization vmpy, at the cost of extra VTCM cache for scale storage...
From testing, the scales are column-based, so we can postpone scale multiplication until after accumulation.
For each column in the 32x32 output tile, apply the scale once per column, which should remove the multiply currently done during dequantization.
This likely saves dequant vmpy work, at the cost of extra VTCM usage for storing scales.
# Conflicts:
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
@chraac
No problem, let's wait for your PR to be merged first, then redo this one based on that.
# Conflicts:
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
This reverts commit cde679e.
@chraac Do you still see an improvement after rebasing with recent master (i.e. after we added hmx-queue and dynamic chunk sizing)?
Small improvement on qwen3.5-2b-q4 prefill on my 8 Gen 3 devices.
Not sure yet whether this is a real gain or just run-to-run fluctuation, since the difference is fairly small.
```c
}
...
void *va = HAP_mmap(NULL, size, HAP_PROT_READ | HAP_PROT_WRITE, 0, fd, 0);
#endif
```
Fallback to HAP_mmap for older archs; that fixes the crash on my 8 Gen 2 devices.
Sweet!
max-krasnyansky left a comment
Yep. Looks good here. Seems to improve prompt processing by ~1-2 TPS.
@lhez can we please get the second approval.
* optimize hmx_mat_mul functions by calculating row and column tiles upfront
* refactor core_dot_chunk_fp16 to use size_t for tile counts and improve readability
* wip
* set scale outside of loop
* wip
* refactor core_mma_chunk_fp16 and mat_mul_qk_0_d16a32 to use size_t for tile counts
* wip
* wip
* refactor transfer_output_chunk_fp16_to_fp32 to use size_t for dimensions
* refactor core_dot_chunk_fp16 to use size_t for tile row stride calculation
* wip
* refactor hmx_mat_mul functions to use hvx_vec_splat_f16 for column scales initialization
* refactor hmx_mat_mul_permuted_w16a32_batched to streamline scale setting and locking
* refactor core_dot_chunk_fp16 to improve tile stride calculations for output
* refactor hmx_mat_mul functions to use Q6_V_vsplat_R for column scales initialization
* fix compiling error
* wip
* optimize row and column tile indexing in core_mma_chunk_fp16 function
* wip
* Revert "wip" (reverts commit cde679e)
* Add size limit check for HAP_mmap in htp_iface_mmap and drop_mmap functions
* wip
Summary
Type Safety and Code Robustness:
* Replaced `int` with `size_t` for variables representing sizes, indices, and tile counts throughout the codebase to prevent potential integer overflows and improve correctness (e.g., `n_col_tiles`, `n_row_tiles`, loop indices). [1] [2] [3] [4] [5] [6] [7]
* Switched tile arithmetic to `size_t` and clarified index calculations in matrix operations, which improves code clarity and reduces the risk of subtle bugs. [1] [2]

Resource Management and Thread Safety:
* Adjusted `HAP_compute_res_hmx_lock` and `HAP_compute_res_hmx_unlock` calls to ensure locks are held for the correct duration, improving thread safety and resource management. [1] [2] [3]

Architecture Compatibility and Memory Mapping:
* Updated `main.c` to use `HAP_mmap2`/`HAP_munmap2` for HVX architectures greater than 73, falling back to `HAP_mmap`/`HAP_munmap` otherwise. Added a check to prevent mapping more than 2GB with `HAP_mmap` (which is unsupported), improving compatibility and error handling. [1] [2] [3]

Documentation and Comments:
These changes collectively improve the reliability, maintainability, and portability of the Hexagon HTP matrix multiplication code.
Additional information
Tested with Qwen3.5-2b-q4; works well.
Requirements