
server : add arg for disabling prompt caching #18776

Merged

rgerganov merged 3 commits into ggml-org:master from rgerganov:arg-cache-prompt on Jan 12, 2026

Conversation

@rgerganov
Member

We have a use case where we run end-to-end tests which include llama-server, and we expect all responses to be strictly deterministic. Of course we set a fixed seed in the HTTP request, but it turns out this is not enough when prompt caching is enabled. Unfortunately, we cannot use the cache_prompt request option because it is not OpenAI compatible.

This patch adds another command-line arg for disabling prompt caching, but I'd be happy to discard it if there is some other way to accomplish this. Even if there is no such way, I don't insist on merging this if the maintainers decide it adds more confusion or that our use case is not a valid one.

Disabling prompt caching is useful for clients who are restricted to
sending only OpenAI-compat requests and want deterministic
responses.
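
For illustration, a minimal self-contained sketch of the behavior being asked for, assuming the new arg simply flips the server-side default that requests inherit; the `server_config`/`cache_prompt` names here are placeholders, not the merged patch:

```cpp
#include <cstdio>

// Sketch only: real flag and field names in the patch may differ.
struct server_config {
    bool cache_prompt = true; // assumed default; the new CLI arg would set this to false
};

// Strict OpenAI-compatible requests cannot carry the llama.cpp-specific
// cache_prompt field, so they always inherit the server-side default.
bool effective_cache_prompt(const server_config & cfg,
                            bool request_has_field,
                            bool request_value) {
    return request_has_field ? request_value : cfg.cache_prompt;
}

int main() {
    server_config cfg;
    cfg.cache_prompt = false; // as if the server was launched with the new arg

    // An OpenAI-compat client (no cache_prompt field in the request) now gets
    // cache-free, deterministic prompt processing without asking for it explicitly.
    std::printf("openai-compat request: cache_prompt = %d\n",
                effective_cache_prompt(cfg, /*request_has_field=*/false,
                                            /*request_value=*/true));
    return 0;
}
```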
Contributor

@ngxson left a comment


I think it's ok to add this. I thought we already had this arg, but it turns out we don't.

Nits: it's better to move this arg right before --cache-reuse, while also updating the help message of --cache-reuse to say that it depends on --cache-prompt.
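
For context, a rough self-contained sketch of the registration pattern used in common/arg.cpp with the placement suggested above; the flag name --no-cache-prompt, the params field, and the help strings are assumptions for illustration, not the merged code:

```cpp
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

// Simplified stand-ins for the common_params / common_arg machinery in
// common/arg.cpp; the real registration also carries examples, env vars, etc.
struct common_params { bool cache_prompt = true; int cache_reuse = 0; };

struct common_arg {
    std::string flag;
    std::string help;
    std::function<void(common_params &)> handler;
};

int main() {
    std::vector<common_arg> options;

    // The new arg goes right before --cache-reuse, per the review nit.
    options.push_back({"--no-cache-prompt",
        "disable prompt caching in the server (default: enabled)",
        [](common_params & p) { p.cache_prompt = false; }});

    // --cache-reuse help text updated to note the dependency on prompt caching.
    options.push_back({"--cache-reuse",
        "min chunk size to reuse from the cache "
        "(requires prompt caching, i.e. not --no-cache-prompt)",
        [](common_params & p) { p.cache_reuse = 256; }}); // value parsing elided

    common_params params;
    options[0].handler(params); // simulate passing --no-cache-prompt
    std::printf("cache_prompt = %s\n", params.cache_prompt ? "true" : "false");
    return 0;
}
```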

@rgerganov rgerganov marked this pull request as ready for review January 12, 2026 12:06
@rgerganov rgerganov requested a review from ggerganov as a code owner January 12, 2026 12:06
@rgerganov
Member Author

@ngxson thanks for the review, let me know if any other docs need updating

Comment thread on common/arg.cpp (outdated)
@rgerganov rgerganov merged commit bcf7546 into ggml-org:master Jan 12, 2026
75 of 76 checks passed
gary149 pushed a commit to gary149/llama-agent that referenced this pull request Jan 13, 2026
* server : add arg for disabling prompt caching

Disabling prompt caching is useful for clients who are restricted to
sending only OpenAI-compat requests and want deterministic
responses.

* address review comments

* address review comments
dillon-blake pushed a commit to Boxed-Logic/llama.cpp that referenced this pull request Jan 15, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
