hexagon: optimization for HMX mat_mul #21554
@njsyw1997 thanks for the updates. I'm going to follow up (review, etc.) asap.
@max-krasnyansky
@njsyw1997 That branch also switches everything to use HMX intrinsics I wanted to iterate on the We also have a version that does weight transpose differently (using shuffles) but it needs a bit more work and ideally should move to model load time during the repack. Please reset to that branch and push, it includes all your commits, and we can merge.
@max-krasnyansky
Yep, mostly the usual suspects. Three main areas:
@njsyw1997
Introduce hmx-worker (dedicated thread for HMX compute) to overlap HMX matmul with HVX dequant/DMA stages in the pipeline path, replacing the previous synchronous HMX calls that blocked the main thread.
Store the boolean in a local variable to avoid loading the atomic twice.
…rface Simplifies the overall implementation, reduces thread wakeup roundtrips.
Can we have another approval for merging?
Yes, I was about to send an update. If that works well for you we'll ping @lhez and merge.
btw I'm going to re-write the workpool into a workqueue that follows the same pattern.
Done.
@lhez or @ggml-org/maintainers need the second approval here please |
I was just doing a quick model sweep on S26+ with this PR.
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
* hexagon: add async HMX worker
  Introduce hmx-worker (dedicated thread for HMX compute) to overlap HMX matmul with HVX dequant/DMA stages in the pipeline path, replacing the previous synchronous HMX calls that blocked the main thread.
* hexagon: cost-based VTCM chunk search for out-stationary matmul
* hexagon: fix futex race in hmx_worker_drain
  Store the boolean in a local variable to avoid loading the atomic twice.
* hex-mm: hmx optimize scatter/transpose and use HMX intrinsics
* hex-vmem: drop vmem limit a touch under 3GB on v73
* hexagon: add fwd declaration of htp_context
* hex-hmx: replace hmx-worker with hmx-queue that mimics dma-queue interface
  Simplifies the overall implementation, reduces thread wakeup roundtrips.
* hex-mm: add debug log to hmx work func called from hmx-queue
* Update hmx-queue.h
  Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
---------
Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Overview
This PR introduces two additional optimizations for the Hexagon HMX backend:
Enable asynchronous HMX execution
HMX computations are now executed asynchronously, allowing them to overlap with HVX dequantization and DMA stages within the pipeline. Previously, synchronous HMX calls blocked the main thread and limited parallelism.
Automatic shape search for mat_mul_qk_0_d16a32_out_stationary()
The auto-tuning logic is extended to the out-stationary pipeline path. This functionality was previously only available for non out-stationary paths.
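To illustrate the asynchronous execution idea in the first item, here is a minimal host-side sketch of a single-worker job queue. It is not the PR's hmx-queue: it uses POSIX threads instead of the Hexagon runtime, and every name in it (`work_queue`, `wq_push`, `wq_flush`, `wq_stop`, `job_square`, `WQ_CAP`) is hypothetical. The point is only the pattern: the main thread submits a compute job and is free to prepare the next chunk while the worker runs it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define WQ_CAP 4  // hypothetical queue depth

typedef void (*wq_fn)(void *arg);
typedef struct { wq_fn fn; void *arg; } wq_item;

typedef struct {
    wq_item         items[WQ_CAP];
    size_t          head;  // index of the oldest unfinished job
    size_t          tail;  // index where the next job will be stored
    bool            stop;
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    pthread_t       thread;
} work_queue;

static void *wq_worker(void *p) {
    work_queue *q = p;
    pthread_mutex_lock(&q->mu);
    for (;;) {
        while (q->head == q->tail && !q->stop) pthread_cond_wait(&q->cv, &q->mu);
        if (q->head == q->tail) break;   // stop requested and nothing left to run
        wq_item it = q->items[q->head % WQ_CAP];
        pthread_mutex_unlock(&q->mu);
        it.fn(it.arg);                   // the compute stage runs without holding the lock
        pthread_mutex_lock(&q->mu);
        q->head++;                       // mark the job as finished
        pthread_cond_broadcast(&q->cv);  // wake anyone waiting for space or for completion
    }
    pthread_mutex_unlock(&q->mu);
    return NULL;
}

static int wq_init(work_queue *q) {
    q->head = q->tail = 0;
    q->stop = false;
    pthread_mutex_init(&q->mu, NULL);
    pthread_cond_init(&q->cv, NULL);
    return pthread_create(&q->thread, NULL, wq_worker, q);
}

// Submit a job; blocks only if the queue is full.
static void wq_push(work_queue *q, wq_fn fn, void *arg) {
    assert(fn != NULL);
    pthread_mutex_lock(&q->mu);
    while (q->tail - q->head == WQ_CAP) pthread_cond_wait(&q->cv, &q->mu);
    q->items[q->tail % WQ_CAP] = (wq_item){ fn, arg };
    q->tail++;
    pthread_cond_broadcast(&q->cv);
    pthread_mutex_unlock(&q->mu);
}

// Wait until every submitted job has finished.
static void wq_flush(work_queue *q) {
    pthread_mutex_lock(&q->mu);
    while (q->head != q->tail) pthread_cond_wait(&q->cv, &q->mu);
    pthread_mutex_unlock(&q->mu);
}

// Drain remaining jobs, then stop and join the worker.
static void wq_stop(work_queue *q) {
    pthread_mutex_lock(&q->mu);
    q->stop = true;
    pthread_cond_broadcast(&q->cv);
    pthread_mutex_unlock(&q->mu);
    pthread_join(q->thread, NULL);
    pthread_mutex_destroy(&q->mu);
    pthread_cond_destroy(&q->cv);
}

// Stand-in for a matmul job: squares the integer it is given.
static void job_square(void *arg) { int *v = arg; *v = (*v) * (*v); }
```

In a pipeline, the main thread would call `wq_push()` for chunk i, run the dequantization for chunk i+1 itself, and call `wq_flush()` only when it needs the result of chunk i.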
Additional Information
Improved auto-tuning strategy
The previous strategy maximized mc * nc, effectively reducing the number of DMA calls. While this works well for FP16 matmul, it does not accurately model the cost of quantized matmul.
In quantized matmul:
Profiling on 8 Elite Gen 5 indicates that loading quantized weights is approximately 1.5× more expensive than loading activations. Although this is a rough estimate, it produces good enough results.
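As a rough illustration of how a 1.5× weight penalty can drive the chunk choice, the sketch below searches (mc, nc) pairs under a memory budget and picks the cheapest one. It is not the PR's implementation. The 32-element tile step, the footprint formula, the traffic model (each output chunk loads one mc × k activation block and one k × nc weight block) and all names (`search_chunks`, `chunk_choice`, `TILE`) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TILE 32  // assumed chunk granularity

typedef struct { size_t mc, nc; uint64_t cost; } chunk_choice;

static uint64_t ceil_div(uint64_t a, uint64_t b) { return (a + b - 1) / b; }

// Pick (mc, nc) in multiples of TILE that fit in budget_elems and minimize
//   cost = 2 * activation_elems_loaded + 3 * weight_elems_loaded,
// which is the 1.0 : 1.5 ratio scaled by 2 to stay in integers.
// Returns mc == 0 when no candidate fits.
static chunk_choice search_chunks(size_t m, size_t n, size_t k, size_t budget_elems) {
    assert(m > 0 && n > 0 && k > 0);
    chunk_choice best = { 0, 0, UINT64_MAX };
    for (size_t mc = TILE; mc < m + TILE; mc += TILE) {
        for (size_t nc = TILE; nc < n + TILE; nc += TILE) {
            // Assumed footprint: activation chunk + weight chunk + output chunk.
            uint64_t need = (uint64_t) mc * k + (uint64_t) k * nc + (uint64_t) mc * nc;
            if (need > budget_elems) break;  // a larger nc only needs more memory
            // Activations are re-read once per column chunk, weights once per row chunk.
            uint64_t act  = ceil_div(n, nc) * m * k;
            uint64_t wgt  = ceil_div(m, mc) * n * k;
            uint64_t cost = 2 * act + 3 * wgt;
            if (cost < best.cost) {
                best.mc = mc; best.nc = nc; best.cost = cost;
            }
        }
    }
    return best;
}
```

Unlike maximizing mc * nc, this objective is asymmetric: with everything else equal, a taller chunk (larger mc) is preferred over a wider one because it reduces how often the more expensive weights are reloaded.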
Benchmark on 8 Elite Gen 5
Master
Commit a521c91 (HMX Async)
Commit ef501f8 (HMX async and auto-tuning)
Requirements