
transformers serve + llamacpp #44682

Open

SunMarc wants to merge 1 commit into main from llama-cpp-serving

Conversation

@SunMarc (Member) commented on Mar 13, 2026

What does this PR do?

llama.cpp integration in transformers serve.

  • Minor changes to add the llama.cpp integration
  • Mostly changes to serve to fix latency on the streaming and non-streaming paths

@github-actions (Contributor)

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44682&sha=fdf28b

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@gambletan left a comment

Review: transformers serve + llama.cpp integration

This is a substantial PR that adds GGUF/llama.cpp support to transformers serve and rewrites the streaming path. I will focus on the most impactful concerns.

1. Streaming rewrite — async SSE path

_DirectStreamer reuses the _token_texts list unsafely across threads: put() appends to self._token_texts, and _flush() sends that same list via call_soon_threadsafe(self._queue.put_nowait, self._token_texts) before resetting it to []. This works today because self._token_texts = [] rebinds the attribute to a fresh list, so the generate thread's subsequent appends never touch the list the consumer holds, but the handoff relies on subtle reference semantics and is fragile. Sending list(self._token_texts) or self._token_texts.copy() in _flush() would make the ownership transfer explicitly safe, as sketched below.
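A sketch of the explicit handoff (attribute names follow this review's description of _DirectStreamer, not necessarily the PR's exact code):

```python
def _flush(self):
    if not self._token_texts:
        return
    snapshot = list(self._token_texts)  # copy before crossing threads
    self._token_texts = []
    # The event-loop thread only ever sees the snapshot, so later
    # appends from the generate thread cannot race with the consumer.
    self._loop.call_soon_threadsafe(self._queue.put_nowait, snapshot)
```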

Error handling gap in _run_generate: the except Exception branch sends a _StreamError to the queue, but direct_streamer.end() is never called on the error path. The async_sse_gen() consumer can therefore hang on await text_queue.get() if the error sentinel is consumed but the None terminator is never sent. Add a finally block that calls direct_streamer.end(), or restructure so an error always terminates the queue, as sketched below.
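One possible restructuring, assuming the streamer API described above (put_error is a hypothetical helper that enqueues the _StreamError sentinel):

```python
def _run_generate(self, streamer, **generate_kwargs):
    try:
        self.model.generate(streamer=streamer, **generate_kwargs)
    except Exception as exc:
        streamer.put_error(_StreamError(exc))  # hypothetical helper
    finally:
        # Always runs, so the None terminator is always enqueued and
        # the consumer's `await text_queue.get()` cannot hang.
        streamer.end()
```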

self._event_loop set in lifespan: self._event_loop = asyncio.get_event_loop() is called inside the lifespan context manager, but asyncio.get_event_loop() is deprecated in Python 3.12+ as a way to fetch the running loop; use asyncio.get_running_loop() instead. Also, if no loop is running at lifespan entry, get_event_loop() could return a different loop than the one the server ends up using. A sketch follows.
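A minimal sketch of the suggested change, assuming a FastAPI-style lifespan as in serve.py; `app.state` stands in for wherever the server stores the loop (the PR stores it on self):

```python
import asyncio
from contextlib import asynccontextmanager

@asynccontextmanager
async def lifespan(app):
    # lifespan runs inside the server's event loop, so a running loop is
    # guaranteed to exist here; get_running_loop() raises if not, instead
    # of silently creating or returning the wrong loop.
    app.state.event_loop = asyncio.get_running_loop()
    yield
```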

2. KV cache bypass for stateful models

```python
if not getattr(model, "_is_stateful", False):
```

This check relies on a private convention (_is_stateful). If a third-party model doesn't set this flag, it will get the default PyTorch path, which is correct as a fallback. However, this should be documented — where is _is_stateful expected to be set? Is it part of the llama-cpp-transformers interface contract?
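Illustrative only, since the contract isn't documented in the diff: a backend wrapper would opt in by setting the flag, and serve falls back to the default path otherwise. Whether llama-cpp-transformers actually sets this is exactly the open question.

```python
class LlamaCppModel:
    _is_stateful = True  # backend manages its own KV cache internally

# serve.py side: stateful models skip the default cache handling
if not getattr(model, "_is_stateful", False):
    ...  # default PyTorch path: build/continue a transformers Cache here
```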

3. Fast SSE path — manual JSON construction

The fast path builds SSE JSON strings manually:

```python
f'data: {{"id":"{_request_id}","object":"chat.completion.chunk",'
f'"model":"{model_id_and_revision}",'
f'"choices":[{{"delta":{{"content":{json.dumps(result)}}},"index":0}}]}}\n\n'
```

json.dumps(result) handles escaping for the content field, which is good. However, _request_id and model_id_and_revision are interpolated directly without escaping; if either contains quotes or backslashes, the chunk becomes invalid JSON. While that is unlikely for model IDs, running json.dumps() on these fields too would be more robust. This f-string template is also duplicated in two places (stream_chat_completion and async_sse_gen), so extracting a helper would remove the duplication; see the sketch below.
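A possible shape for that helper, escaping every field by serializing the whole payload (names mirror this review, not the PR's code):

```python
import json

def _build_sse_chunk(request_id: str, model: str, content: str) -> str:
    # Serializing the full payload escapes every field, not just content.
    payload = {
        "id": request_id,
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"delta": {"content": content}, "index": 0}],
    }
    return f"data: {json.dumps(payload)}\n\n"
```

If the hand-rolled template exists for speed, a middle ground is to keep the template but pre-escape _request_id and model_id_and_revision once per request with json.dumps(), outside the per-token loop.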

4. Non-streaming path rewrite

The non-streaming else branch was completely rewritten but the diff is cut off. From what's visible, it calls model.generate() directly without a streamer, which is correct and simpler than the previous approach of consuming the streaming generator synchronously.
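For reference, the simpler non-streaming shape being described would look roughly like this (argument names are illustrative, not taken from the diff):

```python
output_ids = model.generate(**inputs, generation_config=generation_config)
# decode only the newly generated tokens, skipping the prompt
new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
text = tokenizer.decode(new_tokens, skip_special_tokens=True)
```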

5. Missing json import check

The code uses json.dumps() in the fast SSE path. Verify json is imported at the top of serve.py.

Minor observations:

  • DecodeStream is used but its import doesn't appear in the diff; verify it's imported
  • The _StreamError sentinel class is clean, but making it an Exception subclass would make isinstance checks and re-raising more natural (see the sketch after this list)
  • The batch_size parameter on _DirectStreamer defaults to 4 in the class, but the CLI passes self.stream_batch_size (default 1), which negates the batching optimization. Consider documenting the tradeoff or choosing a higher default.
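A sketch of the Exception-subclass variant mentioned above (naming is illustrative):

```python
class _StreamError(Exception):
    """Sentinel placed on the token queue when generation fails."""

    def __init__(self, original: BaseException):
        super().__init__(str(original))
        self.original = original

# consumer side: one isinstance check handles the error sentinel,
# and the error can be re-raised directly
item = await text_queue.get()
if isinstance(item, _StreamError):
    raise item
```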

Overall this is a meaningful performance improvement for streaming. The main concerns are the error path hanging risk in the async streamer and thread-safety of the list handoff.
