
[gemma4] Dissociate kv states sharing from the Cache #45312

Merged
Cyrilvallez merged 13 commits into main from no-cache-gemma4 on Apr 9, 2026

Conversation

@Cyrilvallez Cyrilvallez (Member) commented Apr 8, 2026

What does this PR do?

As per the title. It was confirmed that the weight matrices of shared layers are NEVER used, and that kv states should ALWAYS be shared, even during training or inference without a Cache.
I will fully remove them in another PR, as they consume memory for no reason.
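
To make the intent concrete, here is a minimal, self-contained sketch of the idea. It is NOT the actual Gemma modeling code in transformers; the layer structure, names, single-head attention, and the absence of masking, RoPE, and caching are simplifying assumptions. The point it illustrates: the KV-shared layers reuse key/value states produced by an earlier layer directly in the forward pass, so the sharing no longer depends on a Cache object or on `use_cache`.

```python
# Illustrative sketch only -- not the transformers implementation.
import torch
from torch import nn


class TinyLayer(nn.Module):
    """One toy attention layer (no mask, no RoPE, single head) for illustration."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.o_proj = nn.Linear(dim, dim)

    def forward(self, x, shared_kv=None):
        q = self.q_proj(x)
        if shared_kv is None:
            # Regular layer: compute its own key/value states.
            k, v = self.k_proj(x), self.v_proj(x)
        else:
            # KV-shared layer: reuse states from an earlier layer. Its own
            # k_proj/v_proj weights are never used (hence the follow-up PR).
            k, v = shared_kv
        attn = torch.softmax(q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5, dim=-1)
        return x + self.o_proj(attn @ v), (k, v)


class TinySharedKVStack(nn.Module):
    """The last `num_kv_shared_layers` layers reuse kv states unconditionally."""

    def __init__(self, dim: int, num_layers: int, num_kv_shared_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(TinyLayer(dim) for _ in range(num_layers))
        self.first_shared_idx = num_layers - num_kv_shared_layers

    def forward(self, x):
        shared_kv = None
        for idx, layer in enumerate(self.layers):
            reuse = shared_kv if idx >= self.first_shared_idx else None
            x, kv = layer(x, shared_kv=reuse)
            if idx == self.first_shared_idx - 1:
                # The last non-shared layer provides the states for all shared
                # layers, with no Cache involved, so this also holds when
                # use_cache=False (e.g. during training).
                shared_kv = kv
        return x


out = TinySharedKVStack(dim=16, num_layers=4, num_kv_shared_layers=2)(torch.randn(1, 5, 16))
```

As the title history below suggests, the fix is precisely that this reuse is no longer gated on a Cache being present.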

@Cyrilvallez Cyrilvallez changed the title from "Fix Gemma4 cache sharing with no_cache" to "Fix Gemma4 layer cache sharing with no_cache" on Apr 8, 2026
@Cyrilvallez Cyrilvallez changed the title from "Fix Gemma4 layer cache sharing with no_cache" to "Fix Gemma4 layer cache sharing with use_cache=False" on Apr 8, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp zucchini-nlp (Member) left a comment

Thanks, I guess we have to disable GC for the model, and the same fix is needed for Gemma3n.

@github-actions github-actions bot (Contributor) commented Apr 8, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: gemma4

@Cyrilvallez Cyrilvallez changed the title from "Fix Gemma4 layer cache sharing with use_cache=False" to "[gemma4] Dissociate kv states sharing from the Cache" on Apr 8, 2026
@Cyrilvallez Cyrilvallez merged commit d42f8ba into main Apr 9, 2026
21 of 23 checks passed
@Cyrilvallez Cyrilvallez deleted the no-cache-gemma4 branch April 9, 2026 08:08
Cyrilvallez added a commit that referenced this pull request Apr 9, 2026
* force at least cache sharing always

* better

* adjust tests gradients

* fix mask

* fix

* fix

* simplify a lot

* add slow test

* fix

* doc

* improve type hints
CalebisGross added a commit to AppSprout-dev/mnemonic that referenced this pull request Apr 13, 2026
…ments

- transformers 5.5.3 includes the Gemma 4 KV sharing fix
  (huggingface/transformers#45312) for the bug that caused all our training failures
- Also updated: datasets 4.8.4, sentence-transformers 5.4.0, wandb 0.25.1
- Removed unused: outlines, flash-linear-attention, causal-conv1d and deps
- Updated comments to reference the upstream fix while keeping our
  TrainingCache workaround as a safety net

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sirzechs66 pushed a commit to sirzechs66/transformers that referenced this pull request Apr 18, 2026
* force at least cache sharing always

* better

* adjust tests gradients

* fix mask

* fix

* fix

* simplify a lot

* add slow test

* fix

* doc

* improve type hints
sharonyu-115 added a commit to sharonyu-115/RL that referenced this pull request Apr 19, 2026
- Minimize DAPO Gemma4 E2B-it recipe to satisfy configs-minimize-check hook.
- Guard `_needs_kv_cache_for_shared_layers` against non-int
  `num_kv_shared_layers` (fixes `TypeError` when tests pass a bare
  MagicMock as the model). Noted in a TODO that this workaround can be
  removed once transformers>=5.5.2 (huggingface/transformers#45312) lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
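
For context, the guard described in that commit could look roughly like the sketch below. The helper name `_needs_kv_cache_for_shared_layers` and the `num_kv_shared_layers` attribute come from the commit message itself, but the body is an assumption for illustration, not the actual code in that repository.

```python
# Rough sketch of the guard described above -- an assumption, not the repo's actual code.
def _needs_kv_cache_for_shared_layers(model) -> bool:
    config = getattr(model, "config", None)
    num_kv_shared_layers = getattr(config, "num_kv_shared_layers", None)
    # With a bare MagicMock as `model`, this attribute is itself a MagicMock,
    # and comparing it with `> 0` raises a TypeError; so require an actual int.
    if not isinstance(num_kv_shared_layers, int):
        return False
    # TODO: drop this workaround once transformers>=5.5.2 ships the upstream fix
    # (huggingface/transformers#45312).
    return num_kv_shared_layers > 0
```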