
Fix prompt cache saving and chat-persistent rollover#1678

Merged
ejones merged 2 commits into ggml-org:master from ejones:fix-persistent
Jun 3, 2023

Conversation

Collaborator

@ejones ejones commented Jun 3, 2023

Fixes #1670 by reworking the original fix for #1585 from #1609.

The original fix examined embd to determine if the prompt had been evaluated, but embd is limited to the batch size. In addition, that fix left session_tokens in its original state (i.e., the longer, cached prompt), while normal session evaluation truncates it at the first eval. This combination meant that any prompts with a cache hit on just the first batch (512 by default) would begin eval-ing ~from the second batch, and all of that eval would get appended to the end of the full, original cached prompt. This had the downstream effect of diverging the cache from the prompt and overrunning the context size in the cache, as seen in #1670.

For the fix, I opted to move the re-eval logic to main's initialization rather than at the eval stage. Here, it transforms session_tokens such that it will only match (prompt - 1) tokens.

Testing:

Contributor

@github-actions github-actions Bot left a comment


clang-tidy made some suggestions

Comment thread on examples/main/main.cpp (Outdated)
Contributor

@DannyDaemonic DannyDaemonic left a comment


This is a clever fix. Feel free to merge after the suggested size() to !empty() fix.

@ejones ejones merged commit 136476e into ggml-org:master Jun 3, 2023
Collaborator Author

ejones commented Jun 3, 2023

Thanks!

wbruna added a commit to wbruna/llama.cpp that referenced this pull request Aug 20, 2025
…gml-org#1678)

* Add separate flash attention config for image generation

* Add config option for Conv2D Direct
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
* Fix prompt cache saving and chat-persistent rollover (fixes ggml-org#1670)

* clang-tidy

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
* Fix prompt cache saving and chat-persistent rollover (fixes ggml-org#1670)

* clang-tidy

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
* Fix prompt cache saving and chat-persistent rollover (fixes ggml-org#1670)

* clang-tidy

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>


Development

Successfully merging this pull request may close these issues.

chat-persistent.sh not rotating cache files correctly

2 participants