
[CB] Changes for long generation #45530

Merged
ArthurZucker merged 17 commits into main from cb-very-long-gen on Apr 23, 2026

Conversation

@remi-or (Collaborator) commented Apr 20, 2026

Summary

This PR fixes several memory-related issues to make long generation (16K+ tokens) easier.

  • Fix KV dedup for decode batches (scheduler.py): Decode-only batches don't consume the read_indices cache budget, so don't reject them on that basis. Also gate the decode fast path on max_blocks_per_request > 0 instead of unconditionally enabling it (see the first sketch after this list).
  • Fix memory estimation (requests.py): Use torch.cuda.mem_get_info instead of device_properties().total_memory, which ignored CUDA context/driver overhead (~0.5 GiB) and caused overcommit/OOM.
  • Raise max_memory_percent default 0.8 → 0.9 (configuration_utils.py): Now safe with the corrected memory accounting above.
  • Write-only fast path (cache.py, input_outputs.py, scheduler.py): When a batch has no past-cache reads (pure prefills), skip the index_select read-back, avoid allocating/transferring read_index, and return the input KV states directly (see the second sketch after this list). Also adjusts the CUDA-graph key to depend on the block-table path rather than max_kv_read > 0.
  • Two-peak memory model (cache.py): Replace the single peak_activation_per_token with two activation peaks: the LM head (hidden + logits, N-independent) and attention (hidden + Q + new K/V + cache K/V reads, which grows with N). Solve the memory polynomial for each peak independently and take the most restrictive (num_blocks, max_batch_tokens); see the third sketch after this list. Bumps _upper_bound_max_batch_tokens 256 → 1024.
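
First, a minimal sketch of the scheduler-side gating, assuming each request exposes a count of prompt tokens still to prefill. Only read_indices and max_blocks_per_request are names from this PR; everything else is illustrative:

```python
# Illustrative only: `batch` is assumed to be an iterable of requests with a
# hypothetical `remaining_prompt_tokens` field; read_indices and
# max_blocks_per_request are the names used in the PR description.

def is_decode_only(batch) -> bool:
    return all(req.remaining_prompt_tokens == 0 for req in batch)

def consumes_read_indices_budget(batch) -> bool:
    # Decode-only batches don't consume the read_indices cache budget,
    # so they must never be rejected on that basis.
    return not is_decode_only(batch)

def use_decode_fast_path(batch, max_blocks_per_request: int) -> bool:
    # Gate the fast path on max_blocks_per_request > 0 instead of
    # enabling it unconditionally.
    return is_decode_only(batch) and max_blocks_per_request > 0
```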
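
Second, a sketch of the write-only fast path, reusing the read_index / max_kv_read names from the description; the read_kv helper and the key_blocks / value_blocks layout are hypothetical stand-ins, not the actual cache.py API:

```python
import torch

def read_kv(cache, key_states, value_states, read_index, max_kv_read):
    # Hypothetical helper: `cache.key_blocks` / `cache.value_blocks` stand in
    # for however the paged cache stores its K/V blocks.
    if max_kv_read == 0:
        # Pure-prefill batch: no past-cache reads, so skip the index_select
        # read-back (and the read_index allocation/transfer that feeds it)
        # and hand the freshly written K/V states straight to attention.
        return key_states, value_states
    # Otherwise gather every K/V position the batch needs, past and new.
    keys = cache.key_blocks.index_select(0, read_index)
    values = cache.value_blocks.index_select(0, read_index)
    return keys, values
```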
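
Third, a sketch of the two-peak solve. The memory polynomial is kept linear here for clarity, and every name and number is an illustrative stand-in for the cache.py logic:

```python
def max_blocks_for_peak(budget, block_bytes, const_bytes, per_block_bytes):
    """Solve budget >= n*block_bytes + const_bytes + n*per_block_bytes for n.

    const_bytes is the part of the activation peak that does not grow with
    the cache (for the LM-head peak: the hidden + logits buffers);
    per_block_bytes is the part that grows with the number of cache blocks
    read (zero for the LM-head peak, nonzero for the attention peak).
    """
    return max((budget - const_bytes) // (block_bytes + per_block_bytes), 0)

# Solve each peak independently and keep the most restrictive answer
# (illustrative values only):
budget = 8 * 1024**3          # bytes available for cache + activations
block_bytes = 2 * 1024**2     # KV bytes per cache block
lm_head_peak = 512 * 1024**2  # hidden + logits, N-independent
attn_const = 256 * 1024**2    # hidden + Q + new K/V
attn_per_block = 64 * 1024    # cache K/V read per block

num_blocks = min(
    max_blocks_for_peak(budget, block_bytes, lm_head_peak, 0),
    max_blocks_for_peak(budget, block_bytes, attn_const, attn_per_block),
)
```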

Performance

Pretty good: many workloads benefit from the 80% → 90% raise in cache space.

| Arguments | Main (tok/s) | Current (tok/s) | Diff (%) |
| --- | --- | --- | --- |
| --samples 10 | 869.0 | 890.37 | +2.5% |
| --samples 20 --num-blocks 20 | 517.95 | 520.16 | +0.4% |
| --samples 50 | 3629.88 | 3638.6 | +0.2% |
| --samples 100 | 5375.41 | 5522.83 | +2.7% |
| --samples 100 --attn flash_attention_2 | 3666.82 | 3743.47 | +2.1% |
| --samples 100 --attn sdpa | 1030.21 | 1053.57 | +2.3% |
| --samples 500 --no-use-async | 6621.78 | 8020.64 | +21.1% |
| --samples 500 --use-async | 7963.66 | 9332.71 | +17.2% |
| --samples 32 --max-new-tokens 2048 --use-async | 2033.87 | 2064.29 | +1.5% |
| --samples 32 --max-new-tokens 2048 --use-async --block-table 32 | 2716.64 | 2734.81 | +0.7% |
| --samples 500 --add-prefix --compile | 7649.48 | 8882.62 | +16.1% |
| --samples 50 --num-return-sequences 8 --do-sample | 869.94 | 980.13 | +12.7% |
| --samples 100 --num-return-sequences 4 --do-sample | 1708.88 | 1925.25 | +12.7% |

Tests

  • make style and make typing pass
  • RUN_SLOW=1 pytest tests/generation/test_continuous_batching.py
  • RUN_SLOW=1 pytest tests/cli/test_serve.py
  • RUN_SLOW=1 pytest tests/generation/test_paged_attention.py

@HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

remi-or marked this pull request as ready for review April 21, 2026 15:40
remi-or requested a review from ArthurZucker April 21, 2026 15:40
@ArthurZucker (Collaborator) left a comment

some of the names are a tad bit unfamiliar to me, but LGTM!

On the memory-estimation change in requests.py:

```diff
-total_memory = torch.cuda.get_device_properties(device).total_memory
+# Use mem_get_info to get actual free memory: device_properties().total_memory returns the physical device
+# total which ignores CUDA context and driver overhead (~0.5 GiB), leading to overcommit.
+free_memory, total_memory = torch.cuda.mem_get_info(device)
```
nice!
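
For context, a minimal sketch of how the value returned by mem_get_info can feed the cache budget. The cache_budget line is an assumption for illustration, not the exact requests.py code, though max_memory_percent and its 0.9 default come from this PR:

```python
import torch

device = torch.device("cuda:0")
max_memory_percent = 0.9  # new default from this PR

# mem_get_info reports (free, total) as seen by the CUDA context, so the
# ~0.5 GiB of context/driver overhead is already excluded from `free`,
# unlike device_properties().total_memory.
free_memory, total_memory = torch.cuda.mem_get_info(device)

# Illustrative budget computation (assumed, not the exact requests.py code):
cache_budget = int(free_memory * max_memory_percent)
```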

ArthurZucker merged commit 07e3831 into main Apr 23, 2026
27 of 29 checks passed
ArthurZucker deleted the cb-very-long-gen branch April 23, 2026 09:34
tarekziade pushed a commit that referenced this pull request Apr 23, 2026
* Fix KV dedup for decode batches
* Fix memory estimation
* Change default
* Added write-only fast path
* Take both peaks into account
* Revert unused config field
* Review 1
* Fix p1s
* Fix p2s and p3s that needed it
* Added a TODO
* Fix test, lower max cached graph, add TODO
* Fix fragmentation with big warmup
* Add more space for logits processors
* Fix