fix(continuous-batching): apply logits processors in packed batches #43457
floor-licker wants to merge 7 commits into huggingface:main
Conversation
Force-pushed from 203d9d7 to af1fed6
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43457&sha=af1fed
Some of these CI failures seem a bit odd relative to the code changes; it may take me a minute to track this down.
Hey @LysandreJik, just wondering if I can get some feedback on this. As far as I can tell, the failing tests are just flaky; I can only reproduce the failures intermittently locally.
Thanks for the contribution! There are some large points to address here, as this is not a trivial part of the continuous batching code. It would be good if we can support this in the most optimized way possible. Let me know if my comments are clear!
@traced(span_name="logit_processing")
def _process_logit(self, batch_data: dict, logits: torch.Tensor, logit_processor: LogitsProcessor) -> torch.Tensor:
    # Pass continuous batching context to logits processor if it supports it.
    if isinstance(logit_processor, list) and len(logit_processor) == 0:
This should always be a list: you can check `_get_logits_processor`.
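For illustration, a minimal sketch of what the simplified check could look like, assuming `_get_logits_processor` always hands over a (possibly empty) list; the standalone `process_logits` helper here is hypothetical:

```python
import torch
from transformers import LogitsProcessorList


def process_logits(input_ids: torch.LongTensor, logits: torch.Tensor, processors: LogitsProcessorList) -> torch.Tensor:
    # If the processors always arrive as a list, a truthiness check replaces
    # the isinstance/len combination used in the diff above.
    if not processors:
        return logits
    for processor in processors:
        logits = processor(input_ids, logits)
    return logits
```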
)
if isinstance(self.logit_processor, list) and len(self.logit_processor) > 0:
    # Processors need eager.
    if self.use_cuda_graph:
IMO this is not the optimal solution: the `process_logits` phase and what happens afterwards should be outside of the CUDA graph / compile if there are processors. Do you think you could rewrite it that way?
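A hedged sketch of the split the comment describes: only the model forward stays inside the captured/compiled region, while logits processing runs eagerly on the result. The class and attribute names here (`graphed_forward`, `logit_processors`) are illustrative, not the actual continuous batching internals:

```python
import torch


class DecodeStep:
    """Sketch: keep the forward graph-friendly, run processors eagerly afterwards."""

    def __init__(self, graphed_forward, logit_processors):
        self.graphed_forward = graphed_forward    # e.g. a CUDA-graph-captured or compiled callable
        self.logit_processors = logit_processors  # possibly empty list of processors

    def step(self, input_ids: torch.LongTensor, batch_data: dict) -> torch.Tensor:
        # 1) Inside the graph / compiled region: only the model forward.
        logits = self.graphed_forward(input_ids, batch_data)

        # 2) Outside the graph, in eager mode: logits processing, which may
        #    involve data-dependent control flow that graphs cannot capture.
        for processor in self.logit_processors:
            logits = processor(input_ids, logits)
        return logits
```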
# System prompt applied.
# Expect "sports" mention.
self.assertTrue(full_text.strip())
self.assertIn("sports", full_text.lower())
Not sure why we should change this test!
logits = torch.ones((1, 4, vocab_size), device=torch_device, dtype=torch.float)

# Req0 next-token @2.
logits[0, 2, 1] = -2.0
Seems like you can hardcode this whole part as one tensor declaration
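A rough sketch of the single-declaration version being suggested, with a placeholder `vocab_size` and only the one edited value from the snippet above; the remaining entries are just the ones of the original `torch.ones` call:

```python
import torch

vocab_size = 8  # placeholder; the test defines its own value

# One declaration instead of torch.ones(...) followed by scattered element edits.
logits = torch.tensor(
    [[
        [1.0] * vocab_size,                      # position 0
        [1.0] * vocab_size,                      # position 1
        [1.0, -2.0] + [1.0] * (vocab_size - 2),  # position 2: Req0 next-token logit at index 1
        [1.0] * vocab_size,                      # position 3
    ]],
    dtype=torch.float,
)
```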
self.assertTrue(torch.equal(rep_penalty.logits_indices, batch_data["logits_indices"]))

# Req0 token penalties.
self.assertAlmostEqual(processed_logits[0, 2, 1].item(), -4.0)
Why not do batched checks?
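For reference, a batched check could compare all penalized entries in one call; the tensor below is only a stand-in for the `processed_logits` produced by the test:

```python
import torch

# Stand-in for the logits returned by the processor under test.
processed_logits = torch.full((1, 4, 8), 1.0)
processed_logits[0, 2, 1] = -4.0

# Gather every checked (position, token) pair at once and compare against an
# expected tensor, instead of one assertAlmostEqual per element.
positions = torch.tensor([2])
token_ids = torch.tensor([1])
expected = torch.tensor([-4.0])
torch.testing.assert_close(processed_logits[0, positions, token_ids], expected)
```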
architecture = getattr(transformers, config.architectures[0])
model = architecture.from_pretrained(model_id, **model_kwargs)

# Default `max_length` is 20.
These comments are weird: did you use some AI assistant for this?
@remi-or Thanks a lot for your feedback! It's my first contribution to transformers, so I'll take a minute to review your comments and get back to you.
Hello! I just discovered this PR as I'm looking to add per-request logit processors in a production continuous batching setting. @remi-or do you know if this is currently supported?
I am not sure logits processors are supported in the first place, and I am sure per-request is not. I'm also not sure it ever will be in production settings, where you need CUDA graphs, which make per-request treatment hard in (most) cases. What logits processors do you have in mind? I can take a look.
Thanks for your attention to this!
Yes, that's the case currently. Maybe we can complexify this for some logits processors, if PyTorch operators permit it. Would that only be those three parameters?
In an 'ideal world' we could modify per generation:
But top-p/k/temp as a starting point would already be immensely helpful. I would be happy to help contribute to this effort, bearing in mind that it'd be my first contribution to the HF ecosystem :)
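For what it's worth, per-request temperature and top-k can in principle be expressed as fixed-shape batched tensor ops, which keeps them compatible with CUDA graphs; the sketch below assumes per-request parameters have already been gathered into tensors, and every name in it is hypothetical:

```python
import torch


def apply_per_request_params(
    next_token_logits: torch.Tensor,  # [num_requests, vocab]
    temperatures: torch.Tensor,       # [num_requests], one temperature per request
    top_k: torch.Tensor,              # [num_requests], one k per request (int64)
) -> torch.Tensor:
    # Temperature: one broadcasted division covers every request.
    logits = next_token_logits / temperatures.unsqueeze(-1)

    # Top-k: sort once, look up each request's k-th largest value, mask the rest.
    sorted_logits, _ = torch.sort(logits, dim=-1, descending=True)
    kth_values = sorted_logits.gather(-1, (top_k - 1).clamp(min=0).unsqueeze(-1))
    # Ties with the k-th value are kept, which is acceptable for a sketch.
    return logits.masked_fill(logits < kth_values, float("-inf"))
```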
Closed thanks to #45026 |
Summary
It seems like this was a known issue: the continuous batching implementation packs multiple sequences into a single token stream, even though most generation-time logits processors assume `[batch, seq_len]` / `[batch, vocab]` shapes. In practice, processors look either broken or disabled in `transformers serve` and in the tests. This change applies logits processors per-request at the correct next-token logits positions and removes the workarounds.

Logic:

- read each request's next-token position from `logits_indices`, and do so only for requests that are in decode
- rebuild the request's token history (`state.initial_tokens + state.generated_tokens`) and apply the processor to `logits[0, position]` (the next-token scores) in place

(A sketch of this flow follows the test commands below.)

Tests run
- `repetition_penalty_continuous_batching -q`
- `ServeCompletionsContinuousBatchingIntegrationTest -q`
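A hedged sketch of the per-request flow described in the summary; `logits_indices`, `state.initial_tokens`, and `state.generated_tokens` follow the description above, while the decode check, the loop structure, and the function name are illustrative only:

```python
import torch


def apply_processors_per_request(logits, batch_data, request_states, logit_processors):
    # `logits` is the packed [1, total_tokens, vocab] tensor produced for the whole batch.
    positions = batch_data["logits_indices"]  # next-token position of each request in the packed stream

    for state, position in zip(request_states, positions.tolist()):
        if not state.is_decoding:  # hypothetical flag: only decode-phase requests are processed
            continue
        # Rebuild the request's token history for history-dependent processors
        # (e.g. repetition penalty).
        input_ids = torch.tensor([state.initial_tokens + state.generated_tokens])
        for processor in logit_processors:
            # Apply to the next-token scores and write them back in place.
            logits[0, position] = processor(input_ids, logits[0, position].unsqueeze(0)).squeeze(0)
    return logits
```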