
transformers serve + llamacpp #44682

Open

SunMarc wants to merge 1 commit into main from llama-cpp-serving

Conversation

@SunMarc (Member) commented on Mar 13, 2026

What does this PR do?

llama.cpp integration in transformers serve.

  • Minor changes to add the llama.cpp integration
  • Mostly changes to serve to fix latency on the streaming and non-streaming paths

@github-actions (Contributor)

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44682&sha=fdf28b

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@gambletan left a comment

Review: transformers serve + llama.cpp integration

This is a substantial PR that adds GGUF/llama.cpp support to transformers serve and rewrites the streaming path. I will focus on the most impactful concerns.

1. Streaming rewrite — async SSE path

_DirectStreamer reuses the _token_texts list unsafely across threads: put() appends to self._token_texts, and _flush() sends that same list via call_soon_threadsafe(self._queue.put_nowait, self._token_texts) before resetting it to []. This works today because self._token_texts = [] rebinds the attribute to a fresh list, so the generate thread's subsequent appends never touch the list the consumer holds, but the handoff relies on subtle reference semantics and is fragile. Sending list(self._token_texts) or self._token_texts.copy() in _flush() would make the ownership transfer explicitly safe, as sketched below.
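A sketch of the explicit handoff (attribute names follow this review's description of _DirectStreamer, not necessarily the PR's exact code):

```python
def _flush(self):
    if not self._token_texts:
        return
    snapshot = list(self._token_texts)  # copy before crossing threads
    self._token_texts = []
    # The event-loop thread only ever sees the snapshot, so later
    # appends from the generate thread cannot race with the consumer.
    self._loop.call_soon_threadsafe(self._queue.put_nowait, snapshot)
```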

Error handling gap in _run_generate: the except Exception branch sends a _StreamError to the queue, but direct_streamer.end() is never called on the error path. The async_sse_gen() consumer can therefore hang on await text_queue.get() if the error sentinel is consumed but the None terminator is never sent. Add a finally block that calls direct_streamer.end(), or restructure so an error always terminates the queue, as sketched below.
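One possible restructuring, assuming the streamer API described above (put_error is a hypothetical helper that enqueues the _StreamError sentinel):

```python
def _run_generate(self, streamer, **generate_kwargs):
    try:
        self.model.generate(streamer=streamer, **generate_kwargs)
    except Exception as exc:
        streamer.put_error(_StreamError(exc))  # hypothetical helper
    finally:
        # Always runs, so the None terminator is always enqueued and
        # the consumer's `await text_queue.get()` cannot hang.
        streamer.end()
```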

self._event_loop set in lifespan: self._event_loop = asyncio.get_event_loop() is called inside the lifespan context manager, but asyncio.get_event_loop() is deprecated in Python 3.12+ as a way to fetch the running loop; use asyncio.get_running_loop() instead. Also, if no loop is running at lifespan entry, get_event_loop() could return a different loop than the one the server ends up using. A sketch follows.
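A minimal sketch of the suggested change, assuming a FastAPI-style lifespan as in serve.py; `app.state` stands in for wherever the server stores the loop (the PR stores it on self):

```python
import asyncio
from contextlib import asynccontextmanager

@asynccontextmanager
async def lifespan(app):
    # lifespan runs inside the server's event loop, so a running loop is
    # guaranteed to exist here; get_running_loop() raises if not, instead
    # of silently creating or returning the wrong loop.
    app.state.event_loop = asyncio.get_running_loop()
    yield
```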

2. KV cache bypass for stateful models

```python
if not getattr(model, "_is_stateful", False):
```

This check relies on a private convention (_is_stateful). If a third-party model doesn't set this flag, it will get the default PyTorch path, which is correct as a fallback. However, this should be documented — where is _is_stateful expected to be set? Is it part of the llama-cpp-transformers interface contract?
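Illustrative only, since the contract isn't documented in the diff: a backend wrapper would opt in by setting the flag, and serve falls back to the default path otherwise. Whether llama-cpp-transformers actually sets this is exactly the open question.

```python
class LlamaCppModel:
    _is_stateful = True  # backend manages its own KV cache internally

# serve.py side: stateful models skip the default cache handling
if not getattr(model, "_is_stateful", False):
    ...  # default PyTorch path: build/continue a transformers Cache here
```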

3. Fast SSE path — manual JSON construction

The fast path builds SSE JSON strings manually:

```python
f'data: {{"id":"{_request_id}","object":"chat.completion.chunk",'
f'"model":"{model_id_and_revision}",'
f'"choices":[{{"delta":{{"content":{json.dumps(result)}}},"index":0}}]}}\n\n'
```

json.dumps(result) handles escaping for the content field, which is good. However, _request_id and model_id_and_revision are interpolated directly without escaping; if either contains quotes or backslashes, the chunk becomes invalid JSON. While that is unlikely for model IDs, running json.dumps() on these fields too would be more robust. This f-string template is also duplicated in two places (stream_chat_completion and async_sse_gen), so extracting a helper would remove the duplication; see the sketch below.
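A possible shape for that helper, escaping every field by serializing the whole payload (names mirror this review, not the PR's code):

```python
import json

def _build_sse_chunk(request_id: str, model: str, content: str) -> str:
    # Serializing the full payload escapes every field, not just content.
    payload = {
        "id": request_id,
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"delta": {"content": content}, "index": 0}],
    }
    return f"data: {json.dumps(payload)}\n\n"
```

If the hand-rolled template exists for speed, a middle ground is to keep the template but pre-escape _request_id and model_id_and_revision once per request with json.dumps(), outside the per-token loop.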

4. Non-streaming path rewrite

The non-streaming else branch was completely rewritten but the diff is cut off. From what's visible, it calls model.generate() directly without a streamer, which is correct and simpler than the previous approach of consuming the streaming generator synchronously.
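For reference, the simpler non-streaming shape being described would look roughly like this (argument names are illustrative, not taken from the diff):

```python
output_ids = model.generate(**inputs, generation_config=generation_config)
# decode only the newly generated tokens, skipping the prompt
new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
text = tokenizer.decode(new_tokens, skip_special_tokens=True)
```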

5. Missing json import check

The code uses json.dumps() in the fast SSE path. Verify json is imported at the top of serve.py.

Minor observations:

  • DecodeStream is used but its import doesn't appear in the diff; verify it's imported
  • The _StreamError sentinel class is clean, but making it an Exception subclass would make isinstance checks and re-raising more natural (see the sketch after this list)
  • The batch_size parameter on _DirectStreamer defaults to 4 in the class, but the CLI passes self.stream_batch_size (default 1), which negates the batching optimization. Consider documenting the tradeoff or choosing a higher default.
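A sketch of the Exception-subclass variant mentioned above (naming is illustrative):

```python
class _StreamError(Exception):
    """Sentinel placed on the token queue when generation fails."""

    def __init__(self, original: BaseException):
        super().__init__(str(original))
        self.original = original

# consumer side: one isinstance check handles the error sentinel,
# and the error can be re-raised directly
item = await text_queue.get()
if isinstance(item, _StreamError):
    raise item
```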

Overall this is a meaningful performance improvement for streaming. The main concerns are the error path hanging risk in the async streamer and thread-safety of the list handoff.
