Skip to content

[refactor] Serving into proper modules#44796

Merged
SunMarc merged 76 commits intomainfrom
refactor-serving
Apr 1, 2026
Merged

[refactor] Serving into proper modules#44796
SunMarc merged 76 commits intomainfrom
refactor-serving

Conversation

@SunMarc
Copy link
Copy Markdown
Member

@SunMarc SunMarc commented Mar 17, 2026

What does this PR do?

This PR refactors transformers serve so that it is not in a single file. We split it into multiple files with clear responsabilities. There were 2,293 lines initially in the serve.py file.

  ┌────────────────────────────┬───────┬──────────────────────────────────────────────┐
  │            File            │ Lines │                   Purpose                    │
  ├────────────────────────────┼───────┼──────────────────────────────────────────────┤
  │ serving/chat_completion.py │ 411   │ /v1/chat/completions handler                 │
  ├────────────────────────────┼───────┼──────────────────────────────────────────────┤
  │ serving/response.py        │ 565   │ /v1/responses handler (with queue draining)  │
  ├────────────────────────────┼───────┼──────────────────────────────────────────────┤
  │ serving/utils.py           │ 941   │ Shared types, streamers, generation managers │
  ├────────────────────────────┼───────┼──────────────────────────────────────────────┤
  │ serving/model_manager.py   │ 449   │ Model loading/caching                        │
  ├────────────────────────────┼───────┼──────────────────────────────────────────────┤
  │ serving/server.py          │ 127   │ FastAPI app + routes                         │
  ├────────────────────────────┼───────┼──────────────────────────────────────────────┤
  │ serving/transcription.py   │ 181   │ Audio transcription handler                  │
  ├────────────────────────────┼───────┼──────────────────────────────────────────────┤
  │ serving/__init__.py        │ 17    │ Package init                                 │
  ├────────────────────────────┼───────┼──────────────────────────────────────────────┤
  │ Total new                  │ 2,691 │                                              │
  └────────────────────────────┴───────┴──────────────────────────────────────────────┘

I've added and fix some features that there missing or broken from the orignal serve. It was easier to fix those in this refactor as it helped me find the right design.

  • transcriptions fixed
  • CB streaming + non streaming tested with concurrency (up to 20k request at the same time - non streaming works nicely, streaming have a 10-20% gap compared to it)
  • generate and CB parity
  • response and chat completion parity
  • generate + compile
  • typing + docstring as much as possible cc @tarekziade
  • adding lots of tests, especially integration tests (1600 + 700 -)

For CB to work correctly, we need to first merge this PR: #45063 -> merged !

I'll add benchmarks in a follow-up as this PR is already big enough !

@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SunMarc SunMarc marked this pull request as ready for review March 24, 2026 09:22
Copy link
Copy Markdown
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did a first pass!

Comment thread src/transformers/cli/serving/chat_completion.py Outdated

messages = body["messages"]

# HACK: tiny-agents sends requests ending with assistant message — skip
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth elaborating this comment a bit more in the future

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i kept it as it was there in the orignal serve.py but i didn't do any testing. Not sure if it is worth keeping

Comment thread src/transformers/cli/serving/chat_completion.py Outdated
Comment thread src/transformers/cli/serving/chat_completion.py
Comment thread src/transformers/cli/serving/response.py Outdated
Comment thread src/transformers/cli/serving/utils.py Outdated
Comment thread src/transformers/cli/serving/utils.py
Comment thread src/transformers/cli/serving/utils.py
Comment thread src/transformers/cli/serving/utils.py Outdated
Comment thread tests/cli/test_serve_refactored.py Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need an additional file for this btw? IMO we should focus on the existing test file, with the addition of other tests there. I would still make sure all existing tests in the existing module pass.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as serve.py. I kept the original file to make sure that all the tests are covered as I iterate through the different features. I will update the name at the end of the refactor !

@github-actions
Copy link
Copy Markdown
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN a4a429cd workflow commit (merge commit)
PR ff02cd79 branch commit (from PR)
main bc576731 base commit (on main)

Model CI Report

3 new failed tests from this PR 😭

  • cli:
    tests/cli/test_serve.py::TestTranscription::test_transcription_openai_client (✅ ⟹ ❌)
    tests/cli/test_serve.py::TestTranscription::test_transcription_returns_text (✅ ⟹ ❌)
    tests/cli/test_serve.py::TestTranscription::test_transcription_streaming (✅ ⟹ ❌)

@SunMarc
Copy link
Copy Markdown
Member Author

SunMarc commented Mar 31, 2026

run-slow: cli

@SunMarc SunMarc requested a review from NathanHB March 31, 2026 17:01
@github-actions
Copy link
Copy Markdown
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["cli"]
quantizations: []

@github-actions
Copy link
Copy Markdown
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 80567983 workflow commit (merge commit)
PR ffe4c64e branch commit (from PR)
main bc576731 base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@SunMarc
Copy link
Copy Markdown
Member Author

SunMarc commented Mar 31, 2026

run-slow: cli

@github-actions
Copy link
Copy Markdown
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["cli"]
quantizations: []

@github-actions
Copy link
Copy Markdown
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 90ca80e2 workflow commit (merge commit)
PR 06a78815 branch commit (from PR)
main bc576731 base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@SunMarc SunMarc force-pushed the refactor-serving branch from 24fb171 to 052cbc7 Compare April 1, 2026 10:26
@SunMarc SunMarc added this pull request to the merge queue Apr 1, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 1, 2026
@SunMarc SunMarc added this pull request to the merge queue Apr 1, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 1, 2026
@SunMarc SunMarc enabled auto-merge April 1, 2026 15:32
@SunMarc SunMarc added this pull request to the merge queue Apr 1, 2026
Merged via the queue into main with commit 38593c2 Apr 1, 2026
30 checks passed
@SunMarc SunMarc deleted the refactor-serving branch April 1, 2026 16:03
marvinzh pushed a commit to marvinzh/transformers that referenced this pull request Apr 3, 2026
* new serve file

* app

* model_manager done

* update serve

* style

* poc done

* renaming

* fix

* new tests

* update metrics and processor

* hardcode n_batch for now

* add response api + compile

* more tests

* add it for now but we will move it

* remove cache impl

* add back load_model

* fix naming

* add transcription

* tool calls better !

* vlm support for both response and chat endpoints

* update bench

* fix vl test

* first iteration of cb

* cb tests

* typing + review

* update test

* better benchmark

* better stream

* update bench

* fix

* serve refactored

* merge

* update

* fix

* style

* simpler

* style

* update warmup

* remove llamacpp integration for now

* styke

* styke

* style again

* remove annoattion

* review !

* style

* much cleaner

* renamed

* remove bench for now

* batch output

* style

* type

* better tests

* update test

* queue draining

* some logs

* readd nathan feature + some minor fixes

* fix

* guard transcription

* better now

* fix

* adding lock to see if this helps

* remove locks

* lock again

* update bench and remove lock for now
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Apr 4, 2026
* new serve file

* app

* model_manager done

* update serve

* style

* poc done

* renaming

* fix

* new tests

* update metrics and processor

* hardcode n_batch for now

* add response api + compile

* more tests

* add it for now but we will move it

* remove cache impl

* add back load_model

* fix naming

* add transcription

* tool calls better !

* vlm support for both response and chat endpoints

* update bench

* fix vl test

* first iteration of cb

* cb tests

* typing + review

* update test

* better benchmark

* better stream

* update bench

* fix

* serve refactored

* merge

* update

* fix

* style

* simpler

* style

* update warmup

* remove llamacpp integration for now

* styke

* styke

* style again

* remove annoattion

* review !

* style

* much cleaner

* renamed

* remove bench for now

* batch output

* style

* type

* better tests

* update test

* queue draining

* some logs

* readd nathan feature + some minor fixes

* fix

* guard transcription

* better now

* fix

* adding lock to see if this helps

* remove locks

* lock again

* update bench and remove lock for now
sirzechs66 pushed a commit to sirzechs66/transformers that referenced this pull request Apr 18, 2026
* new serve file

* app

* model_manager done

* update serve

* style

* poc done

* renaming

* fix

* new tests

* update metrics and processor

* hardcode n_batch for now

* add response api + compile

* more tests

* add it for now but we will move it

* remove cache impl

* add back load_model

* fix naming

* add transcription

* tool calls better !

* vlm support for both response and chat endpoints

* update bench

* fix vl test

* first iteration of cb

* cb tests

* typing + review

* update test

* better benchmark

* better stream

* update bench

* fix

* serve refactored

* merge

* update

* fix

* style

* simpler

* style

* update warmup

* remove llamacpp integration for now

* styke

* styke

* style again

* remove annoattion

* review !

* style

* much cleaner

* renamed

* remove bench for now

* batch output

* style

* type

* better tests

* update test

* queue draining

* some logs

* readd nathan feature + some minor fixes

* fix

* guard transcription

* better now

* fix

* adding lock to see if this helps

* remove locks

* lock again

* update bench and remove lock for now
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants