
cpu: skip redundant ROPE cache updates #20149

Merged
max-krasnyansky merged 1 commit into ggml-org:master from qualcomm:cpu-skip-reduntant-rope-cache-updates on Mar 6, 2026

Conversation

@max-krasnyansky
Member

I was updating ROPE for Hexagon, using the CPU implementation as the reference, and noticed that some threads may end up doing redundant ROPE cache updates for rows that are mapped to other threads.
I added a bit of instrumentation and, sure enough, I see something like this:

Details
thread-0: compute rope cache: ir 18 i3 0 i2 1
thread-0: compute rope cache: ir 19 i3 0 i2 2
thread-0: compute rope cache: ir 20 i3 0 i2 3
thread-0: compute rope cache: ir 21 i3 0 i2 4
thread-0: compute rope cache: ir 22 i3 0 i2 5
thread-0: compute rope cache: ir 23 i3 0 i2 6
thread-0: compute rope cache: ir 24 i3 0 i2 7
thread-0: compute rope cache: ir 25 i3 0 i2 8
thread-0: compute rope cache: ir 26 i3 0 i2 9
thread-0: compute rope cache: ir 27 i3 0 i2 10
thread-0: compute rope cache: ir 0 i3 0 i2 0
thread-0: use rope cache: ir 1 i3 0 i2 0 i1 0
thread-0: use rope cache: ir 2 i3 0 i2 0 i1 1
thread-0: use rope cache: ir 3 i3 0 i2 0 i1 2
thread-0: use rope cache: ir 4 i3 0 i2 0 i1 3
thread-0: use rope cache: ir 5 i3 0 i2 0 i1 4
thread-0: use rope cache: ir 6 i3 0 i2 0 i1 5
thread-0: compute rope cache: ir 7 i3 0 i2 1
thread-0: compute rope cache: ir 8 i3 0 i2 2
thread-0: compute rope cache: ir 9 i3 0 i2 3
thread-0: compute rope cache: ir 10 i3 0 i2 4
thread-0: compute rope cache: ir 11 i3 0 i2 5
thread-0: compute rope cache: ir 12 i3 0 i2 6
thread-0: compute rope cache: ir 13 i3 0 i2 7
thread-0: compute rope cache: ir 14 i3 0 i2 8
thread-0: compute rope cache: ir 15 i3 0 i2 9
thread-0: compute rope cache: ir 16 i3 0 i2 10
thread-0: compute rope cache: ir 0 i3 0 i2 0
thread-0: use rope cache: ir 1 i3 0 i2 0 i1 0
thread-0: use rope cache: ir 2 i3 0 i2 0 i1 1
thread-0: use rope cache: ir 3 i3 0 i2 0 i1 2
thread-0: use rope cache: ir 4 i3 0 i2 0 i1 3
thread-0: use rope cache: ir 5 i3 0 i2 0 i1 4
thread-0: use rope cache: ir 6 i3 0 i2 0 i1 5
thread-0: use rope cache: ir 7 i3 0 i2 0 i1 6
thread-0: use rope cache: ir 8 i3 0 i2 0 i1 7
thread-0: use rope cache: ir 9 i3 0 i2 0 i1 8
...
thread-1: compute rope cache: ir 8 i3 0 i2 1
thread-1: use rope cache: ir 9 i3 0 i2 1 i1 0
thread-1: use rope cache: ir 10 i3 0 i2 1 i1 1
thread-1: use rope cache: ir 11 i3 0 i2 1 i1 2
thread-1: use rope cache: ir 12 i3 0 i2 1 i1 3
thread-1: compute rope cache: ir 13 i3 0 i2 2
thread-1: compute rope cache: ir 14 i3 0 i2 3
thread-1: compute rope cache: ir 15 i3 0 i2 4
thread-1: compute rope cache: ir 16 i3 0 i2 5
thread-1: compute rope cache: ir 17 i3 0 i2 6
thread-1: compute rope cache: ir 18 i3 0 i2 7
thread-1: compute rope cache: ir 19 i3 0 i2 8
thread-1: compute rope cache: ir 20 i3 0 i2 9
thread-1: compute rope cache: ir 21 i3 0 i2 10
thread-1: compute rope cache: ir 0 i3 0 i2 0
thread-1: use rope cache: ir 18 i3 0 i2 0 i1 17
thread-1: use rope cache: ir 19 i3 0 i2 0 i1 18
thread-1: use rope cache: ir 20 i3 0 i2 0 i1 19
thread-1: use rope cache: ir 21 i3 0 i2 0 i1 20
thread-1: use rope cache: ir 22 i3 0 i2 0 i1 21
thread-1: use rope cache: ir 23 i3 0 i2 0 i1 22
thread-1: use rope cache: ir 24 i3 0 i2 0 i1 23

The ROPE cache is recomputed for every i2 (seq-len) iteration, even though some of those iterations are mapped to other threads.
This PR adds a bit of state so that the ROPE cache is computed only for the iterations that the thread actually uses (see the sketch below).
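A minimal sketch of the idea in C, assuming hypothetical names (`rope_rows_for_thread`, `rope_cache_fill`, the `ir0`/`ir1` row range and the `ne*` loop bounds are illustrative, not the actual ggml code): the thread skips rows it does not own before touching the cache, and a single `cached_i2` remembers which position the cache already holds so it is never refilled redundantly.

```c
#include <stdint.h>

// Hypothetical stand-in for the per-position sin/cos computation (sketched
// further below); not the real ggml routine.
void rope_cache_fill(float * cache, int pos, int n_dims, float freq_base, float freq_scale);

// Sketch: each thread walks the flattened row index ir but owns only [ir0, ir1).
// The per-position cache is filled lazily, only once this thread reaches a row
// it owns, and only when the position actually changes.
void rope_rows_for_thread(int64_t ne3, int64_t ne2, int64_t ne1,
                          int64_t ir0, int64_t ir1,
                          float * cache, int n_dims,
                          float freq_base, float freq_scale) {
    int64_t ir        = 0;
    int64_t cached_i2 = -1;                              // position currently held by the cache
    for (int64_t i3 = 0; i3 < ne3; i3++) {               // batch
        for (int64_t i2 = 0; i2 < ne2; i2++) {           // sequence position
            for (int64_t i1 = 0; i1 < ne1; i1++, ir++) { // head/row
                if (ir <  ir0) continue;                 // row owned by an earlier thread
                if (ir >= ir1) return;                   // rest is owned by later threads
                if (cached_i2 != i2) {                   // fill the cache only when needed
                    rope_cache_fill(cache, (int) i2, n_dims, freq_base, freq_scale);
                    cached_i2 = i2;
                }
                // ... apply the rotation to row (i3, i2, i1) using cache ...
            }
        }
    }
}
```

Before the change, the equivalent of the cache fill ran once per i2 regardless of whether the thread owned any row at that position, which is exactly the redundant work visible in the trace above.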

Here are some before/after numbers with llama3.2-3B-Q4_0.

## Ryzen AI Max+ 395
Before:
  common_perf_print: prompt eval time =     353.17 ms /   214 tokens (    1.65 ms per token,   605.93 tokens per second)
  common_perf_print:        eval time =    1631.31 ms /    83 runs   (   19.65 ms per token,    50.88 tokens per second)
After:
  common_perf_print: prompt eval time =     350.50 ms /   214 tokens (    1.64 ms per token,   610.55 tokens per second)
  common_perf_print:        eval time =    1640.84 ms /    83 runs   (   19.77 ms per token,    50.58 tokens per second)

## Snapdragon-Gen5
Before:
  common_perf_print: prompt eval time =    1091.27 ms /   205 tokens (    5.32 ms per token,   187.85 tokens per second)
  common_perf_print:        eval time =    2056.67 ms /    63 runs   (   32.65 ms per token,    30.63 tokens per second)
After:
  common_perf_print: prompt eval time =    1033.26 ms /   205 tokens (    5.04 ms per token,   198.40 tokens per second)
  common_perf_print:        eval time =    2078.37 ms /    63 runs   (   32.99 ms per token,    30.31 tokens per second)

ROPE cache compute is fairly expensive (lots of divs/muls/sin/cos/...), so it makes sense to skip redundant updates even if the token-rate bump is not huge.
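For a sense of where that cost comes from, here is a simplified sketch of a standard RoPE cache fill, again with hypothetical names and signature; the real ggml routine also handles frequency factors and YaRN-style scaling. Each redundant fill wastes on the order of n_dims transcendental calls plus the surrounding multiplies.

```c
#include <math.h>

// Simplified sketch of the per-position cache fill (standard RoPE:
// theta_i = pos * freq_base^(-2i/n_dims)); hypothetical, not the exact ggml
// code. One cos/sin pair is computed per rotated pair of dimensions.
void rope_cache_fill(float * cache, int pos, int n_dims, float freq_base, float freq_scale) {
    const float theta_scale = powf(freq_base, -2.0f/(float) n_dims);
    float theta = (float) pos;
    for (int i = 0; i < n_dims; i += 2) {
        cache[i + 0] = cosf(theta * freq_scale);
        cache[i + 1] = sinf(theta * freq_scale);
        theta *= theta_scale; // move to the next (lower) frequency band
    }
}
```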

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Mar 6, 2026
@max-krasnyansky merged commit ba2fd11 into ggml-org:master on Mar 6, 2026
78 checks passed
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 10, 2026
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026