server : add special handling for /health in httplib #20799
rgerganov wants to merge 1 commit into ggml-org:master
When the number of parallel requests to llama-server exceeds the number of http threads, llama-server stops responding to /health. This is very disruptive in k8s deployments because it causes restarts of properly working inference endpoints. Unfortunately, there is no way to fix this outside of httplib, so this patch adds a rather ugly hack that handles GET /health requests before dispatching them to the thread pool. No changes are made in the HTTPS implementation.
closes: ggml-org#20684
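The idea of answering the health probe before the thread pool is involved can be sketched without httplib. The following is a minimal illustration, not the patch itself: `FixedPool` and `dispatch` are hypothetical names, and the JSON body returned for `/health` is an assumption made for the example.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size pool that stands in for the HTTP worker threads.
class FixedPool {
public:
    explicit FixedPool(std::size_t n_threads) {
        assert(n_threads > 0);
        for (std::size_t i = 0; i < n_threads; ++i) {
            workers_.emplace_back([this] { run(); });
        }
    }

    // Drains the remaining jobs, then joins all workers.
    ~FixedPool() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto & t : workers_) {
            t.join();
        }
    }

    void enqueue(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mu_);
                cv_.wait(lock, [this] { return stop_ || !jobs_.empty(); });
                if (jobs_.empty()) {
                    return; // stop requested and nothing left to do
                }
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job(); // a long-running job keeps this worker busy
        }
    }

    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    std::vector<std::thread> workers_;
    bool stop_ = false;
};

// Answers GET /health on the calling thread and sends everything else to the pool.
// Returns true when the response was produced inline (assumed response body).
bool dispatch(FixedPool & pool, const std::string & method, const std::string & path,
              std::function<void()> slow_handler, std::string & inline_response) {
    if (method == "GET" && path == "/health") {
        inline_response = "{\"status\":\"ok\"}";
        return true;
    }
    pool.enqueue(std::move(slow_handler));
    return false;
}
```

Because the health check never enters the queue, it is answered even when every worker is busy with a long request. The cost is that the fast path runs on the accepting thread, so it must stay trivial.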
Since we regularly pull upstream httplib source code, this patch will be overwritten the next time we do so. Would it work if you set the number of http threads to a very large value? Maybe 64 threads?
Unfortunately no, because we are deploying […] Out of curiosity, how do you solve this issue for HF endpoints of […]
I think there is a solution without changing […] Here, we basically execute the handlers on the HTTP threads: llama.cpp/tools/server/server-http.cpp, lines 389 to 419 in 58c81f7. Instead, we need to free the http threads ASAP. To do that, they need to push the incoming requests into a work queue. The […] The handlers for certain endpoints, such as […] This way, the […]
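A rough sketch of the work-queue hand-off proposed above, written against the standard library only. `WorkQueue` and `submit` are hypothetical names, not part of llama-server or httplib, and the sketch assumes the calling thread is allowed to return as soon as the job is queued.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>

// Hypothetical work queue: the caller enqueues a handler and gets a future back
// immediately, while a separate worker thread runs the handler.
class WorkQueue {
public:
    WorkQueue() : worker_([this] { run(); }) {}

    // Drains the remaining tasks, then joins the worker.
    ~WorkQueue() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stop_ = true;
        }
        cv_.notify_all();
        worker_.join();
    }

    // Does not block on the handler itself; only on the short queue lock.
    std::future<std::string> submit(std::function<std::string()> handler) {
        std::packaged_task<std::string()> task(std::move(handler));
        std::future<std::string> result = task.get_future();
        {
            std::lock_guard<std::mutex> lock(mu_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
        return result;
    }

private:
    void run() {
        for (;;) {
            std::packaged_task<std::string()> task;
            {
                std::unique_lock<std::mutex> lock(mu_);
                cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (tasks_.empty()) {
                    return; // stop requested and queue drained
                }
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task(); // fulfils the future handed out by submit()
        }
    }

    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::packaged_task<std::string()>> tasks_;
    bool stop_ = false;
    std::thread worker_; // declared last so the other members exist before it starts
};
```

The hand-off only helps if the thread that called `submit` can return without waiting on the future. If it has to block until the response is ready, the pool is still exhausted under load.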
This may not work because […] llama.cpp/vendor/cpp-httplib/httplib.cpp, lines 7685 to 7690 in 58c81f7
Ah bummer.
@rgerganov on HF endpoints, we simply set the number of http threads to […] Another idea is to allow spawning a dynamic number of threads, which is a newly added feature in httplib (and websocket is also built in now, quite nice!): yhirose/cpp-httplib#2368
Closing this in favor of #20817.
Disclaimer: the implementation was AI assisted