server : add arg for disabling prompt caching #18776
Merged
rgerganov merged 3 commits into ggml-org:master on Jan 12, 2026
Conversation
Disabling prompt caching is useful for clients who are restricted to sending only OpenAI-compat requests and want deterministic responses.
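A minimal sketch of the intended usage, assuming the new flag is spelled --no-cache-prompt (the exact name is defined in this PR's diff, which is not shown here): start the server with prompt caching disabled, then send a strictly OpenAI-compat request with a fixed seed.

```python
# Sketch: strictly OpenAI-compat request against llama-server.
# Assumes the server was started with prompt caching disabled, e.g.:
#   llama-server -m model.gguf --no-cache-prompt   # flag name assumed
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Say hello."}],
        "seed": 42,        # fixed seed: a standard OpenAI-compat field
        "temperature": 0,  # greedy sampling, for determinism
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```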
ngxson approved these changes Jan 12, 2026
Contributor
I think it's ok to add this; I thought we already had this arg, but it turns out we don't.
Nit: it's better to move this arg right before --cache-reuse, and also update the help message of --cache-reuse to say that it depends on --cache-prompt.
Member
Author
@ngxson thanks for the review, let me know if any other docs need updating
ngxson reviewed Jan 12, 2026
gary149 pushed a commit to gary149/llama-agent that referenced this pull request Jan 13, 2026
* server : add arg for disabling prompt caching
  Disabling prompt caching is useful for clients who are restricted to sending only OpenAI-compat requests and want deterministic responses.
* address review comments
* address review comments
dillon-blake pushed a commit to Boxed-Logic/llama.cpp that referenced this pull request Jan 15, 2026
* server : add arg for disabling prompt caching
  Disabling prompt caching is useful for clients who are restricted to sending only OpenAI-compat requests and want deterministic responses.
* address review comments
* address review comments
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
* server : add arg for disabling prompt caching
  Disabling prompt caching is useful for clients who are restricted to sending only OpenAI-compat requests and want deterministic responses.
* address review comments
* address review comments
We have a use case where we run end-to-end tests which include llama-server, and we expect all responses to be strictly deterministic. Of course we are setting a fixed seed in the HTTP request, but it turns out this is not enough if prompt caching is enabled. Unfortunately, we cannot use the cache_prompt request option because it is not OpenAI compatible.

This patch adds another command line arg for disabling prompt caching, but I'd be happy to discard it if there is some other way to accomplish this. Even if there is no such way, I don't insist on merging this if maintainers decide it adds more confusion or that our use case is not a valid one.
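For illustration, a sketch of the per-request alternative mentioned above, assuming a local llama-server on the default port: cache_prompt is a llama.cpp-specific extension to the request body, so a client limited to standard OpenAI-compat fields cannot send it, which is what motivates a server-side flag.

```python
# Sketch of the per-request option that is NOT OpenAI compatible.
# cache_prompt is a llama.cpp-specific field; strict OpenAI clients
# (and SDKs that validate request bodies) cannot send it.
import requests

resp = requests.post(
    "http://localhost:8080/completion",  # llama.cpp native endpoint
    json={
        "prompt": "Say hello.",
        "seed": 42,
        "cache_prompt": False,  # disables prompt caching for this request only
    },
)
print(resp.json()["content"])
```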