Skip to content

CB improvements for serving #45063

Merged
SunMarc merged 14 commits intomainfrom
output-callback-cb
Mar 30, 2026
Merged

CB improvements for serving #45063
SunMarc merged 14 commits intomainfrom
output-callback-cb

Conversation

@SunMarc
Copy link
Copy Markdown
Member

@SunMarc SunMarc commented Mar 27, 2026

What does this PR do?

This PR adds some features that makes serving more efficient. It shouldn't impact generate_batch at all:

  • Per-request result delivery via callbacks (replaces shared queue contention). Added _request_callbacks dict and register_result_handler(request_id, callback) — a unified API for async result delivery. The generation thread delivers results directly to registered callbacks instead of everything going through the shared output_queue. This eliminates the O(n²) requeue contention that get_result with request_id filtering had at high concurrency.

  • The generation loop waits on an Event instead of busy-spinning when there are no requests. add_request signals it via .set() to wake the loop immediately. Zero CPU when idle, instant wakeup on new request. In our server, the issue was that yhe busy-spin was holding the GIL when idle, which slowed down tokenization on the event loop thread.

@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

self.model_device = model_device
self.model_dtype = model_dtype
self.scheduler = scheduler
self._deliver_output = deliver_output
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a bit ugly, maybe there a cleaner way to do this

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can create a dedicated object to handle delivering the output, and passing it to the processor at creation time? It would have its own lock, method, and just a reference to the output queue. It will also clear up the manager class.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea !

@SunMarc SunMarc requested a review from remi-or March 27, 2026 17:37
@SunMarc
Copy link
Copy Markdown
Member Author

SunMarc commented Mar 27, 2026

@bot /style

@SunMarc SunMarc requested a review from ArthurZucker March 27, 2026 17:38
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 27, 2026

Style fix fix runs successfully without any file modified.

Copy link
Copy Markdown
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! We could add tests in tests/cli/test_serve.py ?

Comment thread src/transformers/generation/continuous_batching/continuous_api.py Outdated
if self.log_prob_generation:
raise NotImplementedError("log_prob_generation is not supported yet")

def _register_handler(self, request_id: str, callback: callable, loop: asyncio.AbstractEventLoop) -> None:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this function and _unregister_handler could be removed: they are 2-lines called only once, might as well inline them

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done !

@SunMarc SunMarc requested a review from remi-or March 30, 2026 12:14
Comment on lines +300 to +301
for request in requests_in_batch:
state = request.state
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my test was failing without this fix, feels correct to me

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, sorry this is fixed in a un-merged PR, good fix

"""

def __init__(self) -> None:
self.output_queue = queue.Queue()
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved the output queue here

@SunMarc
Copy link
Copy Markdown
Member Author

SunMarc commented Mar 30, 2026

Nice! We could add tests in tests/cli/test_serve.py ?

We already have plenty of these tests in serve as this is basically the default path there when the PR over there will be merge. I will still add a few tests here.

Copy link
Copy Markdown
Collaborator

@remi-or remi-or left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@SunMarc SunMarc added this pull request to the merge queue Mar 30, 2026
Merged via the queue into main with commit 8213e0d Mar 30, 2026
30 checks passed
@SunMarc SunMarc deleted the output-callback-cb branch March 30, 2026 18:48
sirzechs66 pushed a commit to sirzechs66/transformers that referenced this pull request Mar 31, 2026
* merge

* update

* fix

* style

* simpler

* style

* review !

* style

* batch output

* style

* type
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Apr 4, 2026
* merge

* update

* fix

* style

* simpler

* style

* review !

* style

* batch output

* style

* type
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants