
[CB] Support the num_return_sequences argument #42921

Merged
remi-or merged 14 commits into main from cb-fork on Jan 6, 2026

Conversation

@remi-or
Collaborator

@remi-or remi-or commented Dec 17, 2025

Summary

This PR adds the option to fork requests during continuous batching: forking duplicates a request while reusing as much of the existing cache as possible. This is then leveraged to make the num_return_sequences argument available in CB.
This PR also enables parallel decoding, which will be useful for RL workflows.
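
Below is a minimal sketch of how num_return_sequences could be used with continuous batching. The `generate_batch` entry point, model checkpoint, and exact config fields are assumptions for illustration, not taken from this PR.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", dtype=torch.bfloat16, device_map="cuda"
)

generation_config = GenerationConfig(
    max_new_tokens=64,
    do_sample=True,
    num_return_sequences=4,  # each prompt is forked into 4 decoding sequences
)

prompts = ["Explain continuous batching in one sentence."]
batch_inputs = [tokenizer(p).input_ids for p in prompts]

# `generate_batch` is assumed to be the continuous-batching entry point here.
batch_outputs = model.generate_batch(batch_inputs, generation_config=generation_config)
```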

Performance

| Samples | Attention | Add Prefix | Source | Duration (s) | Generated tokens | Throughput (tok/s) |
|---------|-----------|------------|---------|--------------|------------------|--------------------|
| 100 | flash_attention_2 | False | With PR | 6.82 | 17480 | 2563.24 |
| 100 | flash_attention_2 | False | On main | 6.91 | 17698 | 2562.81 |
| 100 | sdpa | False | With PR | 21.77 | 17637 | 810.33 |
| 100 | sdpa | False | On main | 21.81 | 17234 | 790.03 |
| 500 | flash_attention_3 | True | With PR | 16.4 | 113054 | 6895.14 |
| 500 | flash_attention_3 | True | On main | 16.73 | 112333 | 6715.63 |
  • `--compile` is always on; the number of generated tokens therefore varies between runs, since the compiled kernels introduce slight numerical differences.

Tests

Tests pass, including the one added to test the feature.

Sanity check

Looks good

@remi-or remi-or self-assigned this Dec 17, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

prompt_ids=(state.initial_tokens + state.generated_tokens),
)

def copy_cache(self, source_blocks: list[int], forked_blocks: list[int]) -> None:
Collaborator

Very interesting. I would assume we want to delay requests that are getting forked to the next batch, to do this async (I might be wrong).

Collaborator Author

On the contrary: to maximize prefix sharing, you want to schedule those requests ASAP. But there might be something to the idea that much of the CPU-side forking could be done asynchronously. The issue is that there will always be a copy of the cache, so the GPU is involved, but maybe it can be done in a side stream.
I think the best compromise is to add the feature now and later, when we get to CPU asynchronicity, add the FORKING status to let the scheduler know those requests must not be scheduled until the cache has been copied.
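
A minimal sketch of what the side-stream idea could look like, assuming PyTorch CUDA streams; the function names and wiring below are hypothetical and not part of this PR:

```python
import torch

copy_stream = torch.cuda.Stream()

def copy_cache_in_side_stream(key_cache, value_cache, source_blocks, forked_blocks):
    # Make the side stream wait for any pending writes to the source blocks.
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        key_cache[forked_blocks] = key_cache[source_blocks]
        value_cache[forked_blocks] = value_cache[source_blocks]

def wait_for_forks():
    # Called before forked requests are scheduled: this is where a FORKING
    # status would gate scheduling until the copies have completed.
    torch.cuda.current_stream().wait_stream(copy_stream)
```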

Collaborator

yep makes sense!

Comment on lines +391 to +392
source_blocks = torch.tensor(source_blocks, device=self.device, dtype=torch.int32)
forked_blocks = torch.tensor(forked_blocks, device=self.device, dtype=torch.int32)
Collaborator

This is allocating memory and might need a sync (tensor from a Python list -> CPU/GPU sync); we want to avoid that.

Collaborator Author

@remi-or remi-or Dec 18, 2025

I tried playing around with this and was surprised that this is the fastest alternative, which makes no sense to me. I will leave a TODO to deep-dive later.
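
For reference, a hedged sketch of the kind of alternative being discussed: building the index tensors on pinned CPU memory and moving them with `non_blocking=True`, to avoid the implicit sync of `torch.tensor(list, device="cuda")`. The helper name is made up, and the PR found the straightforward version to be faster anyway.

```python
import torch

def to_device_indices(blocks: list[int], device: torch.device) -> torch.Tensor:
    # Pinned host memory lets the host-to-device transfer be enqueued without blocking.
    cpu_idx = torch.tensor(blocks, dtype=torch.int32, pin_memory=True)
    return cpu_idx.to(device, non_blocking=True)
```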

Collaborator

Got it thanks for checking!

key_cache[forked_blocks] = key_cache[source_blocks]
value_cache[forked_blocks] = value_cache[source_blocks]
# FIXME: should be one copy for all CMs with only the changing blocks
# FIXME: even once per fork batch
Collaborator

copy should be async as well (async=True)

Collaborator Author

cf. above

Comment on lines +406 to +407
source_blocks, forked_blocks = cm.fork_blocks(state.request_id, new_state.request_id, self._block_manager)
self.copy_cache(source_blocks, forked_blocks)
Collaborator

Same here, we should "schedule" the copy. Remember we are in Python and the GIL is killing us.

Collaborator Author

Since the copy is device-to-device, I think the best we can do for now is a single copy, as is the case right now. Plus, I think PyTorch works asynchronously in that case, i.e. CPU operations continue after the copy is launched.
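
A quick way to check that claim (a standalone sketch, not code from the PR): a device-to-device copy is only enqueued on the current CUDA stream, so the host-side timer barely moves until we synchronize.

```python
import time
import torch

src = torch.randn(4096, 4096, device="cuda")
dst = torch.empty_like(src)

start = time.perf_counter()
dst.copy_(src)                        # enqueued on the current CUDA stream
launch_ms = (time.perf_counter() - start) * 1e3

torch.cuda.synchronize()              # wait for the copy to actually finish
total_ms = (time.perf_counter() - start) * 1e3
print(f"host-side launch: {launch_ms:.3f} ms, including sync: {total_ms:.3f} ms")
```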

Collaborator

okayyy

Comment on lines +127 to +129
"""Fork a given list of (source_blocks) into a new list of forked_blocks. If the blocks are (shareable), we
reference the existing blocks when they are complete. Otherwise, we allocate new blocks if possible. The
(group_id) of the layer group the blocks belong to is also needed."""
Collaborator

Need a doc that shows in / out.

Collaborator Author

Not sure what you mean by this, sorry!

Collaborator

Sorry, I mean the doc should help us more, showing an example with input and output!

Collaborator Author

Added an ascii table
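
For readers of this thread, a hypothetical illustration of the kind of table that was added (the block ids below are made up; the real table lives in the fork_blocks docstring):

```python
# Parent block table : [ 3 | 7 | 12 ]   # block 12 is still incomplete
# Child block table  : [ 3 | 7 | 15 ]   # 3 and 7 are shared, 15 is newly allocated
#
# fork_blocks would then report which blocks must actually be copied:
source_blocks = [12]   # contents to copy from
forked_blocks = [15]   # freshly allocated destinations
```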

Comment on lines +604 to +611
while self.scheduler._requests_to_fork:
state = self.scheduler._requests_to_fork.pop()
num_children = state.num_children
state.num_children = 0
for i in range(num_children):
# FIXME: if fork can't be done, create a new pending request without forking
new_request = self.cache.fork_request(state, f"{state.request_id}__child#{i}")
self.scheduler.active_requests[new_request.request_id] = new_request
Collaborator

Same here, we should make that async IMO (new status "FORKING" -> wait until forked?). IDK, but we need to bench a tad bit.

Collaborator Author

This has changed to be done in batch, which, without an asynchronous mode, is the best we can do on the CPU side.
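
A hedged sketch of that batched flow, written as a standalone function so it reads on its own; `cache.fork_requests` is a hypothetical batched helper, not the PR's exact API:

```python
def fork_pending_requests(scheduler, cache) -> None:
    # Drain the fork queue first, collecting every (parent state, child id) pair.
    pending = []
    while scheduler._requests_to_fork:
        state = scheduler._requests_to_fork.pop()
        for i in range(state.num_children):
            pending.append((state, f"{state.request_id}__child#{i}"))
        state.num_children = 0

    # One batched cache fork for all children instead of one copy per child.
    for new_request in cache.fork_requests(pending):
        scheduler.active_requests[new_request.request_id] = new_request
```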

Comment on lines +1256 to +1270
manager_cm = self.continuous_batching_context_manager(
generation_config=generation_config,
num_q_cuda_graphs=num_q_padding_intervals,
num_kv_cuda_graphs=num_kv_padding_intervals,
allow_block_sharing=allow_block_sharing,
block=True,
timeout=5,
)
logging_cm = logging_redirect_tqdm([logger])
pbar_cm = tqdm(
total=num_requests,
disable=(not progress_bar),
desc=f"Solving {num_requests} requests",
unit="request",
)
Collaborator

You could create a get-cm function?

Collaborator Author

Not sure I understand this -- what would it return? It seems `self.continuous_batching_context_manager(...)` already is the "get cm".

Collaborator

Yeah, sorry, I don't remember what I wanted here. LGTM
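
For completeness, a hedged sketch of what a single "get cm" helper could have looked like, bundling the three context managers above with contextlib.ExitStack; this is purely illustrative and was not adopted in the PR:

```python
from contextlib import ExitStack, contextmanager

@contextmanager
def batching_run_context(manager_cm, logging_cm, pbar_cm):
    # Enter the three managers together and tear them down in reverse order.
    with ExitStack() as stack:
        manager = stack.enter_context(manager_cm)
        stack.enter_context(logging_cm)
        pbar = stack.enter_context(pbar_cm)
        yield manager, pbar
```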

Base automatically changed from cb-block-sharing to main December 18, 2025 11:28
@remi-or remi-or marked this pull request as ready for review December 24, 2025 13:58
Collaborator

@ArthurZucker ArthurZucker left a comment

LGTM! `def fork(self, new_request_id: str) -> "RequestState":` should be optimized as much as possible.

self, parent_request_id: str, children_request_ids: list[str], block_manager: BlockManager
) -> tuple[list[int], list[int]]:
"""Forks the cache blocks of a (parent_request_id) to a list of (children_request_ids). To manage the blocks,
the (block_manager) is used. When forking, the child's block are either shared with the parent, or they need to
Collaborator

Suggested change
the (block_manager) is used. When forking, the child's block are either shared with the parent, or they need to
the block_manager is used. When forking, the child's block are either shared with the parent, or they need to

Collaborator Author

I use parentheses to denote arguments of the function; it would be weird to change convention midway.

Collaborator

ah ok did not know!

Comment on lines +313 to +314
for children_request_id, forked_blocks in zip(children_request_ids, list_forked_blocks):
self.block_table[children_request_id] = forked_blocks
Collaborator

It feels like we should not need to iterate twice over list_forked_blocks if block_manager.fork_blocks updated the block table on the fly. But encapsulation might be better this way?

Collaborator Author

You are right, moving the check to inside the loop. Thanks!

Collaborator Author

Oh my bad, I mistook your point. We do iterate twice over that list, because block_manager.fork_blocks does not update the block table directly. I tried making it handle that part and it led to a messy function that was not all that readable. IMO it's best to leave it clear for now and revisit it if it's a hotspot for optimization, which it might be? Then again, it's just an additional iteration over a small list.

Collaborator

Yeah yeah, I am just trying to avoid any extra for loop when we can; from reading the code it appeared to be removable, but if it's not, no worries!

@remi-or remi-or merged commit accb698 into main Jan 6, 2026
26 checks passed
@remi-or remi-or deleted the cb-fork branch January 6, 2026 12:40
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
* Reformat to make the code pretty

* Allow for multiple decoding sequences in CB

* Style

* Fix a generation config bug

* Add seed to example

* Batch forking

* Cahnge the fixme (for later PR)

* Copy source is optional

* Added a benchmark script for PR

* Added a test and fixed a bug

* Deepcopy and style

* Review compliance

* Style