
hexagon: add f32 ssm_conv op#20122

Merged
max-krasnyansky merged 6 commits into ggml-org:master from qualcomm:tb/htp-ssm-conv
Mar 6, 2026
Conversation

@tboinovski1
Contributor

Make sure to read the contributing guidelines before submitting a PR

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Mar 5, 2026
Member

@max-krasnyansky max-krasnyansky left a comment


Nice!
We can improve DMA pipelining and precompute more per-thread state in a local context. Ok to do that in a followup though.

test-backend-ops is passing on S25 and Gen5, but failing on S24 (gen3, hex-arch v75).

[SSM_CONV] ERR = 0.045690326 > 0.000000100   SSM_CONV(type=f32,ne_a=[3,1024,1,1],ne_b=[3,1024,1,1]): FAIL
[SSM_CONV] ERR = 0.027886839 > 0.000000100   SSM_CONV(type=f32,ne_a=[6,1024,1,1],ne_b=[3,1024,1,1]): FAIL
[SSM_CONV] ERR = 0.069031246 > 0.000000100   SSM_CONV(type=f32,ne_a=[3,1024,4,1],ne_b=[3,1024,1,1]): FAIL
[SSM_CONV] ERR = 0.011857102 > 0.000000100   SSM_CONV(type=f32,ne_a=[3,1536,1,1],ne_b=[3,1536,1,1]): FAIL
[SSM_CONV] ERR = 0.021944322 > 0.000000100   SSM_CONV(type=f32,ne_a=[6,1536,1,1],ne_b=[3,1536,1,1]): FAIL
[SSM_CONV] ERR = 0.046434721 > 0.000000100   SSM_CONV(type=f32,ne_a=[3,1536,4,1],ne_b=[3,1536,1,1]): FAIL
[SSM_CONV] ERR = 0.009548537 > 0.000000100   SSM_CONV(type=f32,ne_a=[3,2048,1,1],ne_b=[3,2048,1,1]): FAIL
[SSM_CONV] ERR = 0.023424253 > 0.000000100   SSM_CONV(type=f32,ne_a=[6,2048,1,1],ne_b=[3,2048,1,1]): FAIL
[SSM_CONV] ERR = 0.022690631 > 0.000000100   SSM_CONV(type=f32,ne_a=[3,2048,4,1],ne_b=[3,2048,1,1]): FAIL
[SSM_CONV] ERR = 0.020098070 > 0.000000100   SSM_CONV(type=f32,ne_a=[4,1024,1,1],ne_b=[4,1024,1,1]): FAIL
[SSM_CONV] ERR = 0.033635444 > 0.000000100   SSM_CONV(type=f32,ne_a=[8,1024,1,1],ne_b=[4,1024,1,1]): FAIL
[SSM_CONV] ERR = 0.053863172 > 0.000000100   SSM_CONV(type=f32,ne_a=[4,1024,4,1],ne_b=[4,1024,1,1]): FAIL
[SSM_CONV] ERR = 0.046260286 > 0.000000100   SSM_CONV(type=f32,ne_a=[4,1536,1,1],ne_b=[4,1536,1,1]): FAIL
[SSM_CONV] ERR = 0.019073288 > 0.000000100   SSM_CONV(type=f32,ne_a=[8,1536,1,1],ne_b=[4,1536,1,1]): FAIL
[SSM_CONV] ERR = 0.019367290 > 0.000000100   SSM_CONV(type=f32,ne_a=[4,1536,4,1],ne_b=[4,1536,1,1]): FAIL
[SSM_CONV] ERR = 0.003745381 > 0.000000100   SSM_CONV(type=f32,ne_a=[4,2048,1,1],ne_b=[4,2048,1,1]): FAIL
[SSM_CONV] ERR = 0.017238832 > 0.000000100   SSM_CONV(type=f32,ne_a=[8,2048,1,1],ne_b=[4,2048,1,1]): FAIL
[SSM_CONV] ERR = 0.015438665 > 0.000000100   SSM_CONV(type=f32,ne_a=[4,2048,4,1],ne_b=[4,2048,1,1]): FAIL
[SSM_CONV] ERR = 0.026768994 > 0.000000100   SSM_CONV(type=f32,ne_a=[9,1024,1,1],ne_b=[9,1024,1,1]): FAIL
[SSM_CONV] ERR = 0.013978035 > 0.000000100   SSM_CONV(type=f32,ne_a=[18,1024,1,1],ne_b=[9,1024,1,1]): FAIL
[SSM_CONV] ERR = 0.020436464 > 0.000000100   SSM_CONV(type=f32,ne_a=[9,1024,4,1],ne_b=[9,1024,1,1]): FAIL
[SSM_CONV] ERR = 0.003860245 > 0.000000100   SSM_CONV(type=f32,ne_a=[9,1536,1,1],ne_b=[9,1536,1,1]): FAIL
[SSM_CONV] ERR = 0.006388827 > 0.000000100   SSM_CONV(type=f32,ne_a=[18,1536,1,1],ne_b=[9,1536,1,1]): FAIL
[SSM_CONV] ERR = 0.021140220 > 0.000000100   SSM_CONV(type=f32,ne_a=[9,1536,4,1],ne_b=[9,1536,1,1]): FAIL
[SSM_CONV] ERR = 0.005109369 > 0.000000100   SSM_CONV(type=f32,ne_a=[9,2048,1,1],ne_b=[9,2048,1,1]): FAIL
[SSM_CONV] ERR = 0.006309449 > 0.000000100   SSM_CONV(type=f32,ne_a=[18,2048,1,1],ne_b=[9,2048,1,1]): FAIL
[SSM_CONV] ERR = 0.011532749 > 0.000000100   SSM_CONV(type=f32,ne_a=[9,2048,4,1],ne_b=[9,2048,1,1]): FAIL

I'll dig some more tomorrow to see what's up with that.
The error is quite large. LFM2 output seems OK but it would be good to fix those errors.
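For readers unfamiliar with the op being tested: the failing cases above all have the shape pattern ne_a=[d_conv-1+n_t, d_inner, n_s, 1], ne_b=[d_conv, d_inner, 1, 1]. Below is a minimal NumPy sketch of what I understand the f32 SSM_CONV reference computation to be (a per-channel causal sliding dot product, as in Mamba-style models); this is my reading of the op's semantics, not code taken from this PR, and the function name and array layouts are illustrative.

```python
import numpy as np

def ssm_conv_ref(a, b):
    """Sketch of f32 SSM_CONV: per-channel sliding dot product.

    a: (n_s, d_inner, d_conv - 1 + n_t)  conv state + new tokens
    b: (d_inner, d_conv)                 per-channel conv weights
    returns: (n_s, n_t, d_inner)
    """
    n_s, d_inner, width = a.shape
    d_conv = b.shape[1]
    n_t = width - d_conv + 1
    out = np.empty((n_s, n_t, d_inner), dtype=np.float32)
    for s in range(n_s):
        for t in range(n_t):
            # window of d_conv samples per channel, dotted with that
            # channel's weights
            out[s, t] = np.sum(a[s, :, t:t + d_conv] * b, axis=1)
    return out

# shapes matching the failing case ne_a=[6,1024,1,1], ne_b=[3,1024,1,1]
rng = np.random.default_rng(0)
a = rng.standard_normal((1, 1024, 6)).astype(np.float32)
b = rng.standard_normal((1024, 3)).astype(np.float32)
y = ssm_conv_ref(a, b)
print(y.shape)  # (1, 4, 1024)
```

With d_conv=3 and width 6, four output tokens are produced per sequence, which matches the ne_a=[6,...]/ne_b=[3,...] rows in the log.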

@max-krasnyansky
Member

Latest updates fixed all test-backend-ops failures on Snapdragon Gen3, Gen4, Gen5, and X-Elite.
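For context on the ERR values in the earlier log: to my understanding, test-backend-ops reports a normalized mean squared error between the backend output and the CPU reference, checked against the 1e-7 threshold shown. A hedged sketch of that metric (my reading, not code from this PR):

```python
import numpy as np

def nmse(f, g):
    """Normalized MSE: squared error between test output f and reference g,
    normalized by the reference signal's energy."""
    f = np.asarray(f, dtype=np.float64)
    g = np.asarray(g, dtype=np.float64)
    return float(np.sum((f - g) ** 2) / np.sum(g ** 2))

ref = np.array([1.0, 2.0, 3.0])
print(nmse(ref, ref))                 # 0.0 -- identical outputs pass
print(nmse(ref + 1e-3, ref) > 1e-7)   # True -- even small drift exceeds 1e-7
```

Because the metric is relative to the reference's magnitude, errors in the 1e-2 range like those in the log indicate a real accumulation or indexing problem rather than ordinary f32 rounding noise.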

@max-krasnyansky max-krasnyansky merged commit 34df42f into ggml-org:master Mar 6, 2026
78 checks passed
@max-krasnyansky max-krasnyansky deleted the tb/htp-ssm-conv branch March 6, 2026 18:48
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 10, 2026
* hexagon: add ssm_conv op

* hexagon: hvx kernel is functional

* hexagon: improvements to ssm-conv hvx kernel

* hexagon: added dma to ssm-conv hvx kernel

* hexagon: ssm-conv dynamically compute gather scratchpad

* hex-ssm-conv: add local context and fix various issues (spad indexing, etc)

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026
(same commit list as above)
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
(same commit list as above)
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
(same commit list as above)

2 participants