
Fix prompt cache saving and chat-persistent rollover#1678

Merged
ejones merged 2 commits into ggml-org:master from ejones:fix-persistent
Jun 3, 2023

Conversation

Collaborator

@ejones ejones commented Jun 3, 2023

Fixes #1670 by reworking the original fix for #1585 from #1609.

The original fix examined embd to determine if the prompt had been evaluated, but embd is limited to the batch size. In addition, that fix left session_tokens in its original state (i.e., the longer, cached prompt), while normal session evaluation truncates it at the first eval. This combination meant that any prompts with a cache hit on just the first batch (512 by default) would begin eval-ing ~from the second batch, and all of that eval would get appended to the end of the full, original cached prompt. This had the downstream effect of diverging the cache from the prompt and overrunning the context size in the cache, as seen in #1670.

For the fix, I opted to move the re-eval logic to main's initialization rather than at the eval stage. Here, it transforms session_tokens such that it will only match (prompt - 1) tokens.

Testing:

Contributor

@github-actions github-actions Bot left a comment


clang-tidy made some suggestions

Comment thread on examples/main/main.cpp (Outdated)
Contributor

@DannyDaemonic DannyDaemonic left a comment


This is a clever fix. Feel free to merge after the suggested size() to !empty() fix.

@ejones ejones merged commit 136476e into ggml-org:master Jun 3, 2023
Collaborator Author

ejones commented Jun 3, 2023

Thanks!

wbruna added a commit to wbruna/llama.cpp that referenced this pull request Aug 20, 2025
…gml-org#1678)

* Add separate flash attention config for image generation

* Add config option for Conv2D Direct
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
* Fix prompt cache saving and chat-persistent rollover (fixes ggml-org#1670)

* clang-tidy

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
* Fix prompt cache saving and chat-persistent rollover (fixes ggml-org#1670)

* clang-tidy

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
* Fix prompt cache saving and chat-persistent rollover (fixes ggml-org#1670)

* clang-tidy

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>


Development

Successfully merging this pull request may close these issues.

chat-persistent.sh not rotating cache files correctly

2 participants