
cpu: skip redundant ROPE cache updates #20149

Merged
max-krasnyansky merged 1 commit into ggml-org:master from qualcomm:cpu-skip-reduntant-rope-cache-updates on Mar 6, 2026

Conversation

@max-krasnyansky
Member

I was updating ROPE for Hexagon, using the CPU implementation as the reference, and noticed that some threads may end up doing redundant ROPE cache updates for rows that are mapped to other threads.
I added a bit of instrumentation and, sure enough, I see something like this:

Details
thread-0: compute rope cache: ir 18 i3 0 i2 1
thread-0: compute rope cache: ir 19 i3 0 i2 2
thread-0: compute rope cache: ir 20 i3 0 i2 3
thread-0: compute rope cache: ir 21 i3 0 i2 4
thread-0: compute rope cache: ir 22 i3 0 i2 5
thread-0: compute rope cache: ir 23 i3 0 i2 6
thread-0: compute rope cache: ir 24 i3 0 i2 7
thread-0: compute rope cache: ir 25 i3 0 i2 8
thread-0: compute rope cache: ir 26 i3 0 i2 9
thread-0: compute rope cache: ir 27 i3 0 i2 10
thread-0: compute rope cache: ir 0 i3 0 i2 0
thread-0: use rope cache: ir 1 i3 0 i2 0 i1 0
thread-0: use rope cache: ir 2 i3 0 i2 0 i1 1
thread-0: use rope cache: ir 3 i3 0 i2 0 i1 2
thread-0: use rope cache: ir 4 i3 0 i2 0 i1 3
thread-0: use rope cache: ir 5 i3 0 i2 0 i1 4
thread-0: use rope cache: ir 6 i3 0 i2 0 i1 5
thread-0: compute rope cache: ir 7 i3 0 i2 1
thread-0: compute rope cache: ir 8 i3 0 i2 2
thread-0: compute rope cache: ir 9 i3 0 i2 3
thread-0: compute rope cache: ir 10 i3 0 i2 4
thread-0: compute rope cache: ir 11 i3 0 i2 5
thread-0: compute rope cache: ir 12 i3 0 i2 6
thread-0: compute rope cache: ir 13 i3 0 i2 7
thread-0: compute rope cache: ir 14 i3 0 i2 8
thread-0: compute rope cache: ir 15 i3 0 i2 9
thread-0: compute rope cache: ir 16 i3 0 i2 10
thread-0: compute rope cache: ir 0 i3 0 i2 0
thread-0: use rope cache: ir 1 i3 0 i2 0 i1 0
thread-0: use rope cache: ir 2 i3 0 i2 0 i1 1
thread-0: use rope cache: ir 3 i3 0 i2 0 i1 2
thread-0: use rope cache: ir 4 i3 0 i2 0 i1 3
thread-0: use rope cache: ir 5 i3 0 i2 0 i1 4
thread-0: use rope cache: ir 6 i3 0 i2 0 i1 5
thread-0: use rope cache: ir 7 i3 0 i2 0 i1 6
thread-0: use rope cache: ir 8 i3 0 i2 0 i1 7
thread-0: use rope cache: ir 9 i3 0 i2 0 i1 8
...
thread-1: compute rope cache: ir 8 i3 0 i2 1
thread-1: use rope cache: ir 9 i3 0 i2 1 i1 0
thread-1: use rope cache: ir 10 i3 0 i2 1 i1 1
thread-1: use rope cache: ir 11 i3 0 i2 1 i1 2
thread-1: use rope cache: ir 12 i3 0 i2 1 i1 3
thread-1: compute rope cache: ir 13 i3 0 i2 2
thread-1: compute rope cache: ir 14 i3 0 i2 3
thread-1: compute rope cache: ir 15 i3 0 i2 4
thread-1: compute rope cache: ir 16 i3 0 i2 5
thread-1: compute rope cache: ir 17 i3 0 i2 6
thread-1: compute rope cache: ir 18 i3 0 i2 7
thread-1: compute rope cache: ir 19 i3 0 i2 8
thread-1: compute rope cache: ir 20 i3 0 i2 9
thread-1: compute rope cache: ir 21 i3 0 i2 10
thread-1: compute rope cache: ir 0 i3 0 i2 0
thread-1: use rope cache: ir 18 i3 0 i2 0 i1 17
thread-1: use rope cache: ir 19 i3 0 i2 0 i1 18
thread-1: use rope cache: ir 20 i3 0 i2 0 i1 19
thread-1: use rope cache: ir 21 i3 0 i2 0 i1 20
thread-1: use rope cache: ir 22 i3 0 i2 0 i1 21
thread-1: use rope cache: ir 23 i3 0 i2 0 i1 22
thread-1: use rope cache: ir 24 i3 0 i2 0 i1 23

The ROPE cache is recomputed for every i2 (seq-len) iteration, even though some of those iterations are mapped to other threads.
This PR adds a bit of state so that the ROPE cache is computed only for the iterations that the thread actually uses (see the sketch below).
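A minimal sketch of the idea in C, assuming hypothetical names (`rope_rows_for_thread`, `rope_cache_fill`, the `ir0`/`ir1` row range and the `ne*` loop bounds are illustrative, not the actual ggml code): the thread skips rows it does not own before touching the cache, and a single `cached_i2` remembers which position the cache already holds so it is never refilled redundantly.

```c
#include <stdint.h>

// Hypothetical stand-in for the per-position sin/cos computation (sketched
// further below); not the real ggml routine.
void rope_cache_fill(float * cache, int pos, int n_dims, float freq_base, float freq_scale);

// Sketch: each thread walks the flattened row index ir but owns only [ir0, ir1).
// The per-position cache is filled lazily, only once this thread reaches a row
// it owns, and only when the position actually changes.
void rope_rows_for_thread(int64_t ne3, int64_t ne2, int64_t ne1,
                          int64_t ir0, int64_t ir1,
                          float * cache, int n_dims,
                          float freq_base, float freq_scale) {
    int64_t ir        = 0;
    int64_t cached_i2 = -1;                              // position currently held by the cache
    for (int64_t i3 = 0; i3 < ne3; i3++) {               // batch
        for (int64_t i2 = 0; i2 < ne2; i2++) {           // sequence position
            for (int64_t i1 = 0; i1 < ne1; i1++, ir++) { // head/row
                if (ir <  ir0) continue;                 // row owned by an earlier thread
                if (ir >= ir1) return;                   // rest is owned by later threads
                if (cached_i2 != i2) {                   // fill the cache only when needed
                    rope_cache_fill(cache, (int) i2, n_dims, freq_base, freq_scale);
                    cached_i2 = i2;
                }
                // ... apply the rotation to row (i3, i2, i1) using cache ...
            }
        }
    }
}
```

Before the change, the equivalent of the cache fill ran once per i2 regardless of whether the thread owned any row at that position, which is exactly the redundant work visible in the trace above.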

Here are some before/after numbers with llama3.2-3B-Q4_0.

## Ryzen AI Max+ 395
Before:
  common_perf_print: prompt eval time =     353.17 ms /   214 tokens (    1.65 ms per token,   605.93 tokens per second)
  common_perf_print:        eval time =    1631.31 ms /    83 runs   (   19.65 ms per token,    50.88 tokens per second)
After:
  common_perf_print: prompt eval time =     350.50 ms /   214 tokens (    1.64 ms per token,   610.55 tokens per second)
  common_perf_print:        eval time =    1640.84 ms /    83 runs   (   19.77 ms per token,    50.58 tokens per second)

## Snapdragon-Gen5
Before:
  common_perf_print: prompt eval time =    1091.27 ms /   205 tokens (    5.32 ms per token,   187.85 tokens per second)
  common_perf_print:        eval time =    2056.67 ms /    63 runs   (   32.65 ms per token,    30.63 tokens per second)
After:
  common_perf_print: prompt eval time =    1033.26 ms /   205 tokens (    5.04 ms per token,   198.40 tokens per second)
  common_perf_print:        eval time =    2078.37 ms /    63 runs   (   32.99 ms per token,    30.31 tokens per second)

ROPE cache compute is fairly expensive (lots of divs/muls/sin/cos/...), so it makes sense to skip redundant updates even if the token-rate bump is not huge.
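For a sense of where that cost comes from, here is a simplified sketch of a standard RoPE cache fill, again with hypothetical names and signature; the real ggml routine also handles frequency factors and YaRN-style scaling. Each redundant fill wastes on the order of n_dims transcendental calls plus the surrounding multiplies.

```c
#include <math.h>

// Simplified sketch of the per-position cache fill (standard RoPE:
// theta_i = pos * freq_base^(-2i/n_dims)); hypothetical, not the exact ggml
// code. One cos/sin pair is computed per rotated pair of dimensions.
void rope_cache_fill(float * cache, int pos, int n_dims, float freq_base, float freq_scale) {
    const float theta_scale = powf(freq_base, -2.0f/(float) n_dims);
    float theta = (float) pos;
    for (int i = 0; i < n_dims; i += 2) {
        cache[i + 0] = cosf(theta * freq_scale);
        cache[i + 1] = sinf(theta * freq_scale);
        theta *= theta_scale; // move to the next (lower) frequency band
    }
}
```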

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Mar 6, 2026
@max-krasnyansky merged commit ba2fd11 into ggml-org:master on Mar 6, 2026
78 checks passed
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 10, 2026
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026