[CB] Tweaks to update and minor fixes#45179

Merged
remi-or merged 13 commits into main from cb-minor-fixes
Apr 3, 2026

Conversation

Collaborator

@remi-or remi-or commented Apr 2, 2026

Summary

This PR adds minor changes to cache.update, updates the memory handler with all new features, and refactors a few parts of the code to make it more readable.

Cache indexing:

  • Replace fancy indexing (cache[idx, :, :]) with explicit torch.index_select / index_copy_, which have cleaner behavior under torch.compile and require non-negative indices.
  • Switch index storage tensors from int32 to int64 to match index_select/index_copy_ requirements, removing hidden .long() casts in the hot path.
  • Introduce sentinel_index and trash_index: dedicated positions in the cache padding zone that were already used implicitly but now have names. This also avoids passing negative values (like -1) to indexing functions.
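The indexing pattern described in these bullets can be sketched as follows; the tensor names (`cache`, `write_indices`) and toy shapes are illustrative, not the PR's actual identifiers:

```python
import torch

# Toy KV-cache slab: 8 slots, hidden size 4. The last slot doubles as a
# padding zone; giving it a name avoids passing -1 to indexing ops
# (a hypothetical stand-in for the PR's sentinel_index / trash_index).
cache = torch.zeros(8, 4)
trash_index = cache.shape[0] - 1

# Indices are stored as int64 (torch.long) up front: index_select /
# index_copy_ require long indices, so this removes hidden .long() casts
# from the hot path.
write_indices = torch.tensor([0, 2, trash_index], dtype=torch.int64)
new_values = torch.ones(3, 4)

# Instead of fancy indexing (cache[write_indices] = new_values), use
# index_copy_, which behaves more cleanly under torch.compile and
# requires non-negative indices.
cache.index_copy_(0, write_indices, new_values)

# Reads use index_select instead of cache[read_indices].
read_indices = torch.tensor([0, 2], dtype=torch.int64)
out = torch.index_select(cache, 0, read_indices)
```

Writes aimed at `trash_index` land in the padding zone and never touch live rows, which is what lets the sentinel replace special-cased negative indices.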

Memory handler (cache.py)

  • Collapse the three separate solving methods (compute_num_blocks_and_max_batch_tokens, compute_max_batch_tokens, compute_num_blocks) and the verbose compute_memory_footprint into a single polynomial coefficient model. Each term maps to a tensor in _setup_static_tensors, making the memory model auditable and preventing drift between solvers.
  • Account for previously unmodeled tensors: block_table, logprobs output rows, and async double-buffering (when use_async_batching is on).
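A minimal sketch of the single-model idea, under assumed names: `memory_footprint`, the coefficient tuple, and `solve_max_batch_tokens` are illustrative stand-ins, with memory modeled as c_n·n + c_m·m + c_nm·n·m + c_mm·m² for num_blocks n and max_batch_tokens m (mirroring the (coeff_n, coeff_m, coeff_nm, coeff_mm) tuple in the diff below):

```python
def memory_footprint(n, m, coeffs):
    """Memory (bytes) as a polynomial in num_blocks n and max_batch_tokens m.
    Each coefficient corresponds to one family of static tensors."""
    c_n, c_m, c_nm, c_mm = coeffs
    return c_n * n + c_m * m + c_nm * n * m + c_mm * m * m

def largest_positive_root(a, b, c):
    """Largest positive root of a*x^2 + b*x + c = 0; linear fallback when a == 0."""
    if a == 0:
        return -c / b
    discriminant = b ** 2 - 4 * a * c
    return (-b + discriminant ** 0.5) / (2 * a)

def solve_max_batch_tokens(n, budget, coeffs):
    """Solve memory_footprint(n, m, coeffs) == budget for m.
    Rearranged: c_mm*m^2 + (c_m + c_nm*n)*m + (c_n*n - budget) = 0."""
    c_n, c_m, c_nm, c_mm = coeffs
    return int(largest_positive_root(c_mm, c_m + c_nm * n, c_n * n - budget))
```

Because every solver inverts the same polynomial, a change to one coefficient propagates everywhere at once, which is what makes the model auditable.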

Benchmark (continuous_batching_overall.py)

  • Store results in a timestamped directory instead of a single file, enabling comparison against any previous baseline.
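The timestamped-directory idea can be sketched like this (directory layout, file names, and metric keys are hypothetical, not the benchmark script's actual code):

```python
import datetime
import json
import pathlib
import tempfile

def make_results_dir(root):
    """Create a timestamped directory (e.g. <root>/2026-04-02_11-14-05) so each
    benchmark run gets its own folder and can be diffed against any baseline."""
    stamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    path = pathlib.Path(root) / stamp
    path.mkdir(parents=True, exist_ok=True)
    return path

# Each run writes its metrics into its own directory instead of
# overwriting a single shared results file.
run_dir = make_results_dir(tempfile.mkdtemp())
(run_dir / "metrics.json").write_text(json.dumps({"tokens_per_s": 0.0}))
```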

Tests

  • Add TestMemoryHandlerPrediction: allocates tensors matching the handler's polynomial model and validates predicted vs actual GPU memory across 5 configurations.
  • Fix test_paged_attention: move cache params to ContinuousBatchingConfig, handle list-type eos_token_id, accept attention-impl-dependent output variants.

Performance

These changes improve performance by 1-3% (when not using the block table), depending on the workload. No regressions.

Testing

All tests pass, except tests/generation/test_continuous_batching.py::ContinuousBatchingWithAcceleratorTest::test_prefix_sharing, which is fixed in #45026.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@ArthurZucker ArthurZucker left a comment


Good job! Nice unbloating!

Comment on lines +548 to +550
def _equation_coefficients(self, cache_dtype: torch.dtype) -> tuple[int, int, int, int]:
"""Returns (coeff_n, coeff_m, coeff_nm, coeff_mm) for the memory polynomial. Each addend is annotated with
the tensor it corresponds to in `ContinuousBatchingIOs._setup_static_tensors`.
Collaborator


very nice!

Collaborator Author


thx!

"""Largest positive root of a·x² + b·x + c = 0. Falls back to linear when a == 0."""
if a == 0:
return -c / b
discriminant = b**2 - 4 * a * c
Collaborator


high school memories

@remi-or remi-or added this pull request to the merge queue Apr 2, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 2, 2026
@remi-or remi-or enabled auto-merge April 2, 2026 11:14
@remi-or remi-or added this pull request to the merge queue Apr 3, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 3, 2026
@remi-or remi-or added this pull request to the merge queue Apr 3, 2026
Merged via the queue into main with commit 138f757 Apr 3, 2026
30 checks passed
@remi-or remi-or deleted the cb-minor-fixes branch April 3, 2026 09:11
marvinzh pushed a commit to marvinzh/transformers that referenced this pull request Apr 3, 2026
* Bette cache update

* alternative cache uodate

* Fix paged tests

* Update cache computation

* Add test

* Memory for CB overall

* int64 for tensors

* Review compliance

* Review compliance 2/2

* Style

* Fix test
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Apr 4, 2026
sirzechs66 pushed a commit to sirzechs66/transformers that referenced this pull request Apr 18, 2026