
server : add special handling for /health in httplib #20799

Closed
rgerganov wants to merge 1 commit into ggml-org:master from rgerganov:fix-health

Conversation

@rgerganov
Member

When the number of parallel requests to llama-server exceeds the number of http threads, llama-server stops responding to /health, which is very disruptive in k8s deployments, causing restarts of properly working inference endpoints.

Unfortunately, there is no way to fix this outside of httplib, so this patch adds a rather ugly hack that handles GET /health requests before dispatching them to the thread pool. No changes are made in the HTTPS implementation.

Disclaimer: the implementation was AI assisted

closes: #20684
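The core idea of the patch can be sketched without httplib: answer GET /health synchronously on the accepting thread, and hand every other request to the worker pool. The `Request` struct and `dispatch` function below are illustrative stand-ins, not the actual httplib internals touched by this PR.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical model of the request dispatch step.
struct Request {
    std::string method;
    std::string path;
};

// "enqueue" stands in for handing the request to the http thread pool.
// GET /health is short-circuited, so it is served even when every
// worker thread is busy with a long-running inference request.
std::string dispatch(const Request & req,
                     const std::function<std::string(const Request &)> & enqueue) {
    if (req.method == "GET" && req.path == "/health") {
        return "{\"status\":\"ok\"}"; // answered immediately, no queueing
    }
    return enqueue(req); // normal path: processed by the thread pool
}
```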

@rgerganov rgerganov requested a review from ggerganov as a code owner March 20, 2026 14:15
@rgerganov rgerganov requested a review from ngxson March 20, 2026 14:16
@ngxson
Contributor

ngxson commented Mar 20, 2026

Since we regularly pull upstream httplib source code, this patch will be overwritten the next time we do so.

Would it work if you specify http threads to a very large number? Maybe 64 threads?

@rgerganov
Member Author

> Since we regularly pull upstream httplib source code, this patch will be overwritten the next time we do so.
>
> Would it work if you specify http threads to a very large number? Maybe 64 threads?

Unfortunately no, because we are deploying llama.cpp endpoints in k8s environments where there is no upper bound on the number of parallel requests, and we observe constant restarts.

Out of curiosity, how do you solve this issue for HF endpoints of llama.cpp?

@ggerganov
Member

ggerganov commented Mar 20, 2026

I think there is a solution without changing httplib.

Here, we basically execute the handlers on the HTTP threads:

```cpp
void server_http_context::get(const std::string & path, const server_http_context::handler_t & handler) const {
    pimpl->srv->Get(path_prefix + path, [handler](const httplib::Request & req, httplib::Response & res) {
        server_http_req_ptr request = std::make_unique<server_http_req>(server_http_req{
            get_params(req),
            get_headers(req),
            req.path,
            build_query_string(req),
            req.body,
            req.is_connection_closed
        });
        server_http_res_ptr response = handler(*request);
        process_handler_response(std::move(request), response, res);
    });
}

void server_http_context::post(const std::string & path, const server_http_context::handler_t & handler) const {
    pimpl->srv->Post(path_prefix + path, [handler](const httplib::Request & req, httplib::Response & res) {
        server_http_req_ptr request = std::make_unique<server_http_req>(server_http_req{
            get_params(req),
            get_headers(req),
            req.path,
            build_query_string(req),
            req.body,
            req.is_connection_closed
        });
        server_http_res_ptr response = handler(*request);
        process_handler_response(std::move(request), response, res);
    });
}
```

Instead, we need to free the http threads ASAP. To do that, they need to push the incoming requests into a work queue. The server_http_context will then have a separate "work" thread pool that processes this work queue.

The handlers for certain endpoints, such as /health will not forward the requests to the work queue. They will immediately respond.

This way, the --threads-http argument would now become the size of the "work" thread pool. And the actual HTTP thread pool of the httplib instance is no longer necessary to be large. Even just 1 to 4 threads would be enough to serve it efficiently in all cases.

@rgerganov
Member Author

> Instead, we need to free the http threads ASAP. To do that, they need to push the incoming requests into a work queue. The server_http_context will then have a separate "work" thread pool that processes this work queue.

This may not work because httplib is closing the socket after executing the callback:

```cpp
if (!task_queue->enqueue(
        [this, sock]() { process_and_close_socket(sock); })) {
    output_error_log(Error::ResourceExhaustion, nullptr);
    detail::shutdown_socket(sock);
    detail::close_socket(sock);
}
```

@ggerganov
Member

> > Instead, we need to free the http threads ASAP. To do that, they need to push the incoming requests into a work queue. The server_http_context will then have a separate "work" thread pool that processes this work queue.
>
> This may not work because httplib is closing the socket after executing the callback:
>
> ```cpp
> if (!task_queue->enqueue(
>         [this, sock]() { process_and_close_socket(sock); })) {
>     output_error_log(Error::ResourceExhaustion, nullptr);
>     detail::shutdown_socket(sock);
>     detail::close_socket(sock);
> }
> ```

Ah bummer.

@ngxson
Contributor

ngxson commented Mar 20, 2026

@rgerganov on HF endpoints, we simply set the number of http threads to max(16, n_parallel * 2), so we ensure that there are at least 16 threads

Another idea is to allow spawning a dynamic number of threads, which is a newly added feature in httplib (and WebSocket support is also built-in now, quite nice!) yhirose/cpp-httplib#2368

@rgerganov
Member Author

closing this in favor of #20817

@rgerganov rgerganov closed this Mar 23, 2026