Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
    self.model_device = model_device
    self.model_dtype = model_dtype
    self.scheduler = scheduler
    self._deliver_output = deliver_output
A bit ugly, maybe there is a cleaner way to do this.
Maybe we can create a dedicated object to handle delivering the output, and pass it to the processor at creation time? It would have its own lock, its own method, and just a reference to the output queue. It would also clean up the manager class.
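The suggestion above could look something like this minimal sketch. The class and method names (`OutputDeliverer`, `deliver`) are hypothetical, not from the PR; only the idea of owning the queue, the lock, and a single delivery method comes from the comment.

```python
import queue
import threading


class OutputDeliverer:
    """Hypothetical helper owning the output queue and its lock.

    Instead of handing the processor a raw queue plus a delivery
    callable, the manager could pass one of these at creation time.
    """

    def __init__(self) -> None:
        self.output_queue: queue.Queue = queue.Queue()
        self._lock = threading.Lock()

    def deliver(self, request_id: str, result: object) -> None:
        # Single place where results leave the generation thread.
        with self._lock:
            self.output_queue.put((request_id, result))
```

The processor would then take a `deliverer` argument and call `deliverer.deliver(...)`, which keeps the locking details out of the manager class.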
@bot /style
Style fix runs successfully without any files modified.
ArthurZucker left a comment
Nice! We could add tests in tests/cli/test_serve.py?
    if self.log_prob_generation:
        raise NotImplementedError("log_prob_generation is not supported yet")

    def _register_handler(self, request_id: str, callback: callable, loop: asyncio.AbstractEventLoop) -> None:
Seems like this function and _unregister_handler could be removed: they are two lines each and called only once, so we might as well inline them.
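Inlining the helpers could look like the sketch below. The `Manager`, `add_request`, and `finish_request` names are assumptions for illustration; only `_request_callbacks` and the register/unregister helpers are from the PR.

```python
import asyncio
from typing import Callable


class Manager:
    """Sketch of inlining two-line register/unregister helpers."""

    def __init__(self) -> None:
        self._request_callbacks: dict[
            str, tuple[Callable, asyncio.AbstractEventLoop]
        ] = {}

    def add_request(
        self,
        request_id: str,
        callback: Callable,
        loop: asyncio.AbstractEventLoop,
    ) -> None:
        # Inlined body of _register_handler: a plain dict assignment.
        self._request_callbacks[request_id] = (callback, loop)

    def finish_request(self, request_id: str) -> None:
        # Inlined body of _unregister_handler: a plain dict pop.
        self._request_callbacks.pop(request_id, None)
```

Since each helper was only a dict assignment or pop used in one place, inlining removes indirection without losing readability.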
    for request in requests_in_batch:
        state = request.state
my test was failing without this fix, feels correct to me
Yes, sorry, this is fixed in an un-merged PR. Good fix!
| """ | ||
|
|
||
| def __init__(self) -> None: | ||
| self.output_queue = queue.Queue() |
moved the output queue here
We already have plenty of these tests in serve, as this is basically the default path there once the PR over there is merged. I will still add a few tests here.
* merge
* update
* fix
* style
* simpler
* style
* review !
* style
* batch output
* style
* type
What does this PR do?
This PR adds some features that make serving more efficient. It shouldn't impact `generate_batch` at all:

Per-request result delivery via callbacks (replaces shared-queue contention). Added a `_request_callbacks` dict and `register_result_handler(request_id, callback)`, a unified API for async result delivery. The generation thread delivers results directly to registered callbacks instead of routing everything through the shared `output_queue`. This eliminates the O(n²) requeue contention that `get_result` with `request_id` filtering had at high concurrency.

The generation loop waits on an Event instead of busy-spinning when there are no requests. `add_request` signals it via `.set()` to wake the loop immediately. Zero CPU when idle, instant wakeup on a new request. In our server, the issue was that the busy-spin was holding the GIL when idle, which slowed down tokenization on the event loop thread.
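A minimal sketch of the two mechanisms described above. The `_request_callbacks`, `register_result_handler`, and `add_request` names come from the description; everything else (class name, `_step`, the internal request list) is an assumption for illustration, not the PR's actual implementation.

```python
import threading
from typing import Callable


class ContinuousBatchingSketch:
    """Per-request callback delivery plus an Event-driven wakeup."""

    def __init__(self) -> None:
        self._request_callbacks: dict[str, Callable[[object], None]] = {}
        self._requests: list[tuple[str, object]] = []
        self._has_work = threading.Event()
        self._lock = threading.Lock()

    def register_result_handler(
        self, request_id: str, callback: Callable[[object], None]
    ) -> None:
        self._request_callbacks[request_id] = callback

    def add_request(self, request_id: str, prompt: object) -> None:
        with self._lock:
            self._requests.append((request_id, prompt))
        self._has_work.set()  # wake the generation loop immediately

    def _step(self) -> None:
        # Block (releasing the GIL) until work arrives, instead of spinning.
        self._has_work.wait()
        with self._lock:
            batch, self._requests = self._requests, []
            if not self._requests:
                self._has_work.clear()
        for request_id, prompt in batch:
            result = f"generated for {prompt}"  # placeholder for model output
            cb = self._request_callbacks.pop(request_id, None)
            if cb is not None:
                cb(result)  # O(1) delivery, no shared-queue filtering
```

Each result goes straight to its request's callback, so no consumer has to scan and requeue other requests' outputs, and an idle loop parked in `Event.wait()` costs no CPU.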