
server: use httplib dynamic threads #20817

Merged
ngxson merged 2 commits into ggml-org:master from ngxson:xsn/server_dynamic_threads
Mar 23, 2026
Conversation

@ngxson
Contributor

@ngxson ngxson commented Mar 20, 2026

Alternative to #20799

Fix #20684

The server can now create up to a maximum of 1024 threads on demand. Dynamic threads are terminated once they finish their task.
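The "fixed plus on-demand" behavior described above can be sketched in plain C++. This is an illustrative toy in the spirit of `httplib::ThreadPool(n_fixed, n_max)`, not httplib's actual implementation; the `DynamicPool` class and all its member names are invented here:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Toy "fixed + dynamic" thread pool: n_fixed workers stay alive for the
// pool's lifetime; when all threads are busy, a one-shot "dynamic" thread
// is spawned (up to n_max total) that terminates after finishing its task.
// NOT httplib's implementation -- names and structure are hypothetical.
class DynamicPool {
public:
    DynamicPool(int n_fixed, int n_max) : n_max_(n_max), n_alive_(n_fixed) {
        for (int i = 0; i < n_fixed; i++) {
            fixed_.emplace_back([this] { worker_loop(); });
        }
    }

    ~DynamicPool() {
        {
            std::lock_guard<std::mutex> lk(mtx_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto & t : fixed_)   t.join(); // fixed workers drain the queue first
        for (auto & t : dynamic_) t.join(); // dynamic threads already ran their one job
    }

    void enqueue(std::function<void()> job) {
        std::unique_lock<std::mutex> lk(mtx_);
        if (n_idle_ == 0 && n_alive_ < n_max_) {
            // every thread is busy: spawn a dynamic thread that runs exactly
            // this one job and then exits (joined in the destructor so
            // cleanup stays deterministic)
            n_alive_++;
            dynamic_.emplace_back([this, j = std::move(job)]() mutable {
                j();
                std::lock_guard<std::mutex> lk(mtx_);
                n_alive_--;
            });
            return;
        }
        jobs_.push(std::move(job));
        lk.unlock();
        cv_.notify_one();
    }

private:
    void worker_loop() {
        std::unique_lock<std::mutex> lk(mtx_);
        for (;;) {
            n_idle_++;
            cv_.wait(lk, [this] { return stop_ || !jobs_.empty(); });
            n_idle_--;
            if (jobs_.empty()) {
                return; // stop_ was set and nothing is left to do
            }
            std::function<void()> job = std::move(jobs_.front());
            jobs_.pop();
            lk.unlock();
            job();
            lk.lock();
        }
    }

    int n_max_;
    int n_alive_;
    int n_idle_ = 0;
    bool stop_  = false;
    std::mutex mtx_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    std::vector<std::thread> fixed_, dynamic_;
};
```

Here dynamic threads run a single job and then exit, matching the "destroyed after processing each request" idea; they are joined rather than detached only to make teardown deterministic in this sketch.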

@ngxson ngxson requested a review from a team as a code owner March 20, 2026 19:22
@ngxson ngxson requested a review from rgerganov March 20, 2026 19:22
Member

@rgerganov rgerganov left a comment


This seems to work well, and while it's not the ideal solution, I think it is good enough for all practical purposes. Thanks!

Comment on tools/server/server-http.cpp (outdated)
// spawn n_threads_http fixed threads (always alive), while allowing up to 1024 threads in total
// when all fixed threads are busy, the server creates new "dynamic" threads that are destroyed after processing each request
// ref: https://github.com/yhirose/cpp-httplib/pull/2368
return new httplib::ThreadPool(n_threads_http, 1024);
Member

@rgerganov rgerganov Mar 23, 2026


just in case, let's use std::max(1024, n_threads_http) to support more than 1024:

return new httplib::ThreadPool(n_threads_http, std::max(1024, n_threads_http));

Contributor Author


Good catch. I changed the logic to n_threads_http + 1024 instead: if n_threads_http is high (for example because n_parallel = 2000), we still always have 1024 threads of headroom for overhead connections.
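The difference between the two formulas from this exchange is easy to check numerically. A quick sketch, with function names invented purely for illustration:

```cpp
#include <algorithm>

// Two candidate formulas from the review discussion (names invented here).
// With std::max, a high n_threads_http (e.g. driven by n_parallel = 2000)
// leaves zero headroom above the fixed threads; adding 1024 always reserves
// 1024 extra threads for overhead connections.
int max_threads_clamped(int n_threads_http) {
    return std::max(1024, n_threads_http);
}

int max_threads_headroom(int n_threads_http) {
    return n_threads_http + 1024;
}
```

For a typical small n_threads_http both formulas allow at least 1024 threads; they only diverge once n_threads_http approaches or exceeds 1024.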

@ngxson ngxson requested a review from rgerganov March 23, 2026 09:39
@ngxson ngxson merged commit 31a5cf4 into ggml-org:master Mar 23, 2026
48 checks passed
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
* server: use httplib dynamic threads

* change to n_threads_http + 1024
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
* server: use httplib dynamic threads

* change to n_threads_http + 1024


Development

Successfully merging this pull request may close these issues.

Misc. bug: llama-server should have special handling for /health
