[GGUF] Reduce peak RAM usage by casting dequantized tensors early during load #45386
Merged
SunMarc merged 5 commits into huggingface:main on Apr 20, 2026
Conversation
Signed-off-by: UsamaKenway <usamakenway@gmail.com>
Member
cc @SunMarc
SunMarc
reviewed
Apr 13, 2026
UsamaKenway
commented
Apr 15, 2026
Contributor
Author
@SunMarc Changes pushed. I refactored the logic as suggested. Let me know if you spot anything else or if this is good to go.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
lvliang-intel
pushed a commit
to lvliang-intel/transformers
that referenced
this pull request
Apr 21, 2026
…ing load (huggingface#45386)

* [GGUF] cast dequantized tensors to target dtype during load
* [GGUF] refac dtype, quantization casting
* [GGUF] refac dtype

Signed-off-by: UsamaKenway <usamakenway@gmail.com>
Signed-off-by: Usama Kenway <usamakenway@gmail.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
artem-spector
pushed a commit
to artem-spector/transformers
that referenced
this pull request
Apr 21, 2026
…ing load (huggingface#45386)

* [GGUF] cast dequantized tensors to target dtype during load
* [GGUF] refac dtype, quantization casting
* [GGUF] refac dtype

Signed-off-by: UsamaKenway <usamakenway@gmail.com>
Signed-off-by: Usama Kenway <usamakenway@gmail.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Optimizes memory usage when loading GGUF models by performing dtype casting immediately after dequantization.

While adding support for Gemma4 in PR #45296, I noticed that GGUF tensors are dequantized to float32 by default during loading, even if the user intends to load the model in float16 or bfloat16. For large models, this creates a significant RAM spike that can lead to out-of-memory errors. By passing the target torch_dtype directly into the loading utility, we can cast the tensors immediately after dequantization, effectively halving the peak RAM required for the state dict.

Benchmark Results (Gemma 4 26B IT q4_k_m)
I tested the peak RAM (Global Peak RSS) with and without this change using a separate branch for tracking:
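Rough arithmetic behind the "halving" claim (illustrative numbers, not the PR's measurements): a float32 state dict needs 4 bytes per parameter, while float16/bfloat16 need 2, so materializing the full dict in float32 before casting roughly doubles the footprint.

```python
def state_dict_bytes(num_params: int, bytes_per_param: int) -> int:
    """Approximate state-dict size for a model with num_params parameters."""
    return num_params * bytes_per_param

n = 26_000_000_000  # ~26B parameters (illustrative)
fp32_bytes = state_dict_bytes(n, 4)  # float32: 104 GB
fp16_bytes = state_dict_bytes(n, 2)  # float16: 52 GB
```

Casting each tensor right after it is dequantized means only one float32 copy exists at a time, so the peak stays near the float16 figure rather than the float32 one.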
Tests
With the changes
Without the changes
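The early-cast idea above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `load_gguf_tensor_early_cast` and `dequantize_fn` are hypothetical names standing in for the loading utility and the GGUF dequantization routine.

```python
import torch

def load_gguf_tensor_early_cast(quantized_block, dequantize_fn, target_dtype=torch.float16):
    """Sketch: dequantize one GGUF tensor, then cast immediately.

    Dequantization produces float32; casting right away lets the float32
    intermediate be freed per-tensor, so peak RAM holds roughly one
    float32 tensor at a time instead of the whole float32 state dict.
    """
    tensor_fp32 = dequantize_fn(quantized_block)  # float32 by default
    return tensor_fp32.to(target_dtype)           # float32 copy freed after this
```

The key point is where the cast happens: inside the per-tensor loading loop rather than once over the fully assembled float32 state dict.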
Code Agent Policy
Before submitting
Pull Request section?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.