
server : add special handling for /health in httplib #20799

Closed
rgerganov wants to merge 1 commit into ggml-org:master from rgerganov:fix-health

Conversation

@rgerganov
Member

When the number of parallel requests to llama-server exceeds the number of http threads, llama-server stops responding to /health, which is very disruptive in k8s deployments, causing restarts of properly working inference endpoints.

Unfortunately, there is no way to fix this outside of httplib, so this patch adds a rather ugly hack that handles GET /health requests before dispatching them to the thread pool. No changes are made in the HTTPS implementation.

Disclaimer: the implementation was AI assisted

closes: #20684
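The core idea of the patch can be sketched without httplib: answer GET /health synchronously on the accepting thread, and hand every other request to the worker pool. The `Request` struct and `dispatch` function below are illustrative stand-ins, not the actual httplib internals touched by this PR.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical model of the request dispatch step.
struct Request {
    std::string method;
    std::string path;
};

// "enqueue" stands in for handing the request to the http thread pool.
// GET /health is short-circuited, so it is served even when every
// worker thread is busy with a long-running inference request.
std::string dispatch(const Request & req,
                     const std::function<std::string(const Request &)> & enqueue) {
    if (req.method == "GET" && req.path == "/health") {
        return "{\"status\":\"ok\"}"; // answered immediately, no queueing
    }
    return enqueue(req); // normal path: processed by the thread pool
}
```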

@rgerganov rgerganov requested a review from ggerganov as a code owner March 20, 2026 14:15
@rgerganov rgerganov requested a review from ngxson March 20, 2026 14:16
@ngxson
Contributor

ngxson commented Mar 20, 2026

Since we regularly pull upstream httplib source code, this patch will be overwritten the next time we do so.

Would it work if you specify http threads to a very large number? Maybe 64 threads?

@rgerganov
Member Author

> Since we regularly pull upstream httplib source code, this patch will be overwritten the next time we do so.
>
> Would it work if you specify http threads to a very large number? Maybe 64 threads?

Unfortunately no, because we are deploying llama.cpp endpoints in k8s environments where there is no upper bound on the number of parallel requests, and we observe constant restarts.

Out of curiosity, how do you solve this issue for HF endpoints of llama.cpp?

@ggerganov
Member

ggerganov commented Mar 20, 2026

I think there is a solution without changing httplib.

Here, we basically execute the handlers on the HTTP threads:

```cpp
void server_http_context::get(const std::string & path, const server_http_context::handler_t & handler) const {
    pimpl->srv->Get(path_prefix + path, [handler](const httplib::Request & req, httplib::Response & res) {
        server_http_req_ptr request = std::make_unique<server_http_req>(server_http_req{
            get_params(req),
            get_headers(req),
            req.path,
            build_query_string(req),
            req.body,
            req.is_connection_closed
        });
        server_http_res_ptr response = handler(*request);
        process_handler_response(std::move(request), response, res);
    });
}

void server_http_context::post(const std::string & path, const server_http_context::handler_t & handler) const {
    pimpl->srv->Post(path_prefix + path, [handler](const httplib::Request & req, httplib::Response & res) {
        server_http_req_ptr request = std::make_unique<server_http_req>(server_http_req{
            get_params(req),
            get_headers(req),
            req.path,
            build_query_string(req),
            req.body,
            req.is_connection_closed
        });
        server_http_res_ptr response = handler(*request);
        process_handler_response(std::move(request), response, res);
    });
}
```

Instead, we need to free the http threads ASAP. To do that, they need to push the incoming requests into a work queue. The server_http_context will then have a separate "work" thread pool that processes this work queue.

The handlers for certain endpoints, such as /health will not forward the requests to the work queue. They will immediately respond.

This way, the --threads-http argument would now become the size of the "work" thread pool. And the actual HTTP thread pool of the httplib instance is no longer necessary to be large. Even just 1 to 4 threads would be enough to serve it efficiently in all cases.

@rgerganov
Member Author

> Instead, we need to free the http threads ASAP. To do that, they need to push the incoming requests into a work queue. The server_http_context will then have a separate "work" thread pool that processes this work queue.

This may not work because httplib is closing the socket after executing the callback:

```cpp
if (!task_queue->enqueue(
        [this, sock]() { process_and_close_socket(sock); })) {
    output_error_log(Error::ResourceExhaustion, nullptr);
    detail::shutdown_socket(sock);
    detail::close_socket(sock);
}
```

@ggerganov
Member

> > Instead, we need to free the http threads ASAP. To do that, they need to push the incoming requests into a work queue. The server_http_context will then have a separate "work" thread pool that processes this work queue.
>
> This may not work because httplib is closing the socket after executing the callback:
>
> ```cpp
> if (!task_queue->enqueue(
>         [this, sock]() { process_and_close_socket(sock); })) {
>     output_error_log(Error::ResourceExhaustion, nullptr);
>     detail::shutdown_socket(sock);
>     detail::close_socket(sock);
> }
> ```

Ah bummer.

@ngxson
Contributor

ngxson commented Mar 20, 2026

@rgerganov on HF endpoints, we simply set the number of http threads to max(16, n_parallel * 2), so we ensure that there are at least 16 threads

Another idea is to allow spawning a dynamic number of threads, which is a newly added feature in httplib (and WebSocket support is also built-in now, quite nice!) yhirose/cpp-httplib#2368

@rgerganov
Member Author

closing this in favor of #20817

@rgerganov rgerganov closed this Mar 23, 2026