[gemma4] Dissociate kv states sharing from the Cache #45312
Merged
Cyrilvallez merged 13 commits into main · Apr 9, 2026
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
zucchini-nlp approved these changes · Apr 8, 2026
zucchini-nlp (Member) left a comment
Thanks, I guess we have to disable GC for the model, and the same fix is needed for Gemma3n
Contributor
[For maintainers] Suggested jobs to run (before merge): run-slow: gemma4
This was referenced Apr 9, 2026
Cyrilvallez added a commit that referenced this pull request · Apr 9, 2026
* force at least cache sharing always
* better
* adjust tests gradients
* fix mask
* fix
* fix
* simplify a lot
* add slow test
* fix
* doc
* improve type hints
CalebisGross added a commit to AppSprout-dev/mnemonic that referenced this pull request · Apr 13, 2026
…ments
- transformers 5.5.3 includes the fix for the Gemma 4 KV sharing issue (huggingface/transformers#45312) that caused all our training failures
- Also updated: datasets 4.8.4, sentence-transformers 5.4.0, wandb 0.25.1
- Removed unused: outlines, flash-linear-attention, causal-conv1d and their deps
- Updated comments to reference the upstream fix while keeping our TrainingCache workaround as a safety net

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This was referenced Apr 13, 2026
sirzechs66 pushed a commit to sirzechs66/transformers that referenced this pull request · Apr 18, 2026
* force at least cache sharing always
* better
* adjust tests gradients
* fix mask
* fix
* fix
* simplify a lot
* add slow test
* fix
* doc
* improve type hints
sharonyu-115 added a commit to sharonyu-115/RL that referenced this pull request · Apr 19, 2026
- Minimize DAPO Gemma4 E2B-it recipe to satisfy configs-minimize-check hook.
- Guard `_needs_kv_cache_for_shared_layers` against non-int `num_kv_shared_layers` (fixes `TypeError` when tests pass a bare MagicMock as the model). Noted in a TODO that this workaround can be removed once transformers>=5.5.2 (huggingface/transformers#45312) lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
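A guard of the kind this commit message describes could look roughly like the sketch below. Only the helper name `_needs_kv_cache_for_shared_layers` and the `num_kv_shared_layers` attribute come from the message; the `getattr` chain and return logic are assumptions about that repository, not its actual code.

```python
def _needs_kv_cache_for_shared_layers(model) -> bool:
    """Return True only when the model really has KV-shared layers.

    TODO (assumed wording): remove once transformers>=5.5.2
    (huggingface/transformers#45312) lands, since KV states are then
    shared independently of the Cache.
    """
    num_shared = getattr(getattr(model, "config", None), "num_kv_shared_layers", None)
    # Tests may pass a bare MagicMock as the model; attribute lookups on a mock
    # return another mock rather than an int, so treat anything non-int as "no sharing".
    if not isinstance(num_shared, int):
        return False
    return num_shared > 0
```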
What does this PR do?
As per the title. It was confirmed that the weight matrices of the shared layers are NEVER used, and that kv states should ALWAYS be shared, even during training or inference without a Cache.
I will fully remove those weights in another PR, as they consume memory for no reason.
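To illustrate what "sharing KV states independently of the Cache" means, here is a minimal, hypothetical sketch: layers at or beyond a `first_kv_shared_layer_idx` reuse the key/value states computed by the last non-shared layer, whether or not a Cache object is involved. The class and parameter names (`TinyAttention`, `first_kv_shared_layer_idx`) and the shapes are illustrative only and do not reproduce the actual Gemma modeling code in this PR.

```python
import torch
from torch import nn


class TinyAttention(nn.Module):
    """Toy attention layer that either computes its own KV or reuses shared KV states."""

    def __init__(self, hidden_size: int, is_kv_shared: bool):
        super().__init__()
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.is_kv_shared = is_kv_shared
        if not is_kv_shared:
            # KV-shared layers never use their own KV weights, so we simply don't create them.
            self.k_proj = nn.Linear(hidden_size, hidden_size)
            self.v_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states, shared_kv=None):
        q = self.q_proj(hidden_states)
        if self.is_kv_shared:
            # Reuse KV states produced by an earlier layer -- no Cache object required.
            k, v = shared_kv
        else:
            k, v = self.k_proj(hidden_states), self.v_proj(hidden_states)
        attn = torch.softmax(q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v, (k, v)


class TinyModel(nn.Module):
    def __init__(self, hidden_size=16, num_layers=4, first_kv_shared_layer_idx=2):
        super().__init__()
        self.layers = nn.ModuleList(
            TinyAttention(hidden_size, is_kv_shared=i >= first_kv_shared_layer_idx)
            for i in range(num_layers)
        )

    def forward(self, hidden_states):
        shared_kv = None
        for layer in self.layers:
            hidden_states, kv = layer(hidden_states, shared_kv=shared_kv)
            # Keep the KV states of the last non-shared layer for the shared ones to reuse.
            if not layer.is_kv_shared:
                shared_kv = kv
        return hidden_states


# KV sharing here is a property of the layers themselves, so it applies in
# training mode and during inference without any Cache.
out = TinyModel()(torch.randn(1, 5, 16))
```

In the sketch the shared layers simply never own KV projection weights; in the real model those weight matrices still exist but go unused, which is why the description says they will be removed in a follow-up PR.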