
[GGUF] Reduce peak RAM usage by casting dequantized tensors early during load#45386

Merged
SunMarc merged 5 commits into huggingface:main from UsamaKenway:gguf-early-dtype-casting
Apr 20, 2026

Conversation

@UsamaKenway
Contributor

@UsamaKenway UsamaKenway commented Apr 12, 2026

Optimizes memory usage when loading GGUF models by performing dtype casting immediately after dequantization.

While adding support for Gemma4 in PR #45296, I noticed that GGUF tensors are dequantized to float32 by default during loading, even if the user intends to load the model in float16 or bfloat16. For large models, this creates a significant RAM spike that can lead to an out-of-memory (OOM) error.

By passing the target torch_dtype directly into the loading utility, we can cast the tensors immediately after dequantization, effectively halving the peak RAM required for the state dict.
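The mechanism can be sketched as follows. This is a minimal NumPy illustration of the idea, not the actual transformers code; `dequantize_state_dict` and `dequantize_fn` are hypothetical names:

```python
import numpy as np

def dequantize_state_dict(blocks, dequantize_fn, target_dtype=np.float16):
    """Illustrative sketch of early dtype casting during load.

    Each tensor is cast to the target dtype right after dequantization,
    so at most one float32 tensor is alive at a time instead of the
    whole float32 state dict.
    """
    state_dict = {}
    for name, block in blocks.items():
        fp32 = dequantize_fn(block)                   # dequantization yields float32
        state_dict[name] = fp32.astype(target_dtype)  # cast immediately; the fp32 copy can be freed
    return state_dict
```

With early casting, the transient float32 overhead is bounded by the largest single tensor rather than the sum over all tensors, which is where the roughly 2x peak-RAM reduction comes from.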

Benchmark Results (Gemma 4 26B IT q4_k_m)

I tested the peak RAM (Global Peak RSS) with and without this change using a separate branch for tracking:

- Without this PR (Float32 spike): ~118.7 GB Peak RSS
- With this PR (Early casting):    ~59.4  GB Peak RSS
------------------------------------------------------
Saving:                            ~59.3  GB (50% reduction)
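The peak figures above come from a Global Peak RSS probe on a separate tracking branch. As a hedged sketch, a high-water-mark reading like the `[RAM DEBUG]` lines below can be obtained on POSIX systems with `resource.getrusage`; this helper is illustrative, not the PR's actual instrumentation:

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Return the process's peak resident set size (high water mark) in MB.

    Illustrative helper, not the code used in the PR.
    ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 ** 2 if sys.platform == "darwin" else 1024
    return peak / divisor
```

Printing such a reading before and after the dequantization loop brackets the spike; the ~59.3 GB saving reported above is the gap between the float32 and early-cast high-water marks.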
Tests

With the changes

(py312venv) usamakenway@Megatron: RUN_SLOW=1 pytest tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_26b_it_q4_k_m -s

[RAM DEBUG] Global Peak RSS (High Water Mark): 2185.59 MB

[RAM DEBUG] Global Peak RSS (High Water Mark): 2197.88 MB

[RAM DEBUG] Global Peak RSS (High Water Mark): 2391.64 MB
Converting and de-quantizing GGUF tensors...: 100%|███████████████████████████████| 658/658 [03:03<00:00,  3.59it/s]

[RAM DEBUG] Global Peak RSS (High Water Mark): 59428.81 MB
Loading weights: 100%|███████████████████████████████| 657/657 [00:00<00:00, 26841.26it/s]
tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_26b_it_q4_k_m PASSED 287.34s

Without the changes

(py312venv) usamakenway@Megatron: RUN_SLOW=1 pytest tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_26b_it_q4_k_m -s
tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_26b_it_q4_k_m 
[RAM DEBUG] Global Peak RSS (High Water Mark): 2259.14 MB

[RAM DEBUG] Global Peak RSS (High Water Mark): 2270.75 MB

[RAM DEBUG] Global Peak RSS (High Water Mark): 2464.60 MB
Converting and de-quantizing GGUF tensors...: 100%|███████████████████████████████████████████████████████████████████████████████| 658/658 [05:46<00:00,  1.90it/s]

[RAM DEBUG] Global Peak RSS (High Water Mark): 118747.33 MB
Loading weights: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 657/657 [00:03<00:00, 195.35it/s]
tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_26b_it_q4_k_m PASSED 499.38s

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum?
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Rocketknight1
Member

cc @SunMarc

Member

@SunMarc SunMarc left a comment


Thanks, a couple of comments

Comment thread src/transformers/modeling_utils.py Outdated
Comment thread src/transformers/modeling_gguf_pytorch_utils.py Outdated
Signed-off-by: Usama Kenway <usamakenway@gmail.com>
Signed-off-by: Usama Kenway <usamakenway@gmail.com>
Comment thread src/transformers/modeling_utils.py
@UsamaKenway
Contributor Author

@SunMarc Changes pushed; I refactored the logic as suggested. Let me know if you spot anything else or if this is good to go.
Thank you.

@UsamaKenway UsamaKenway requested a review from SunMarc April 15, 2026 08:31
Member

@SunMarc SunMarc left a comment


Much better thanks a lot !

@SunMarc SunMarc enabled auto-merge April 20, 2026 15:35
@SunMarc SunMarc added this pull request to the merge queue Apr 20, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Merged via the queue into huggingface:main with commit 7a0d582 Apr 20, 2026
28 checks passed
lvliang-intel pushed a commit to lvliang-intel/transformers that referenced this pull request Apr 21, 2026
…ing load (huggingface#45386)

* [GGUF] cast dequantized tensors to target dtype during load

Signed-off-by: UsamaKenway <usamakenway@gmail.com>

* [GGUF] refac dtype, quantization casting

Signed-off-by: Usama Kenway <usamakenway@gmail.com>

* [GGUF] refac dtype

Signed-off-by: Usama Kenway <usamakenway@gmail.com>

---------

Signed-off-by: UsamaKenway <usamakenway@gmail.com>
Signed-off-by: Usama Kenway <usamakenway@gmail.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
artem-spector pushed a commit to artem-spector/transformers that referenced this pull request Apr 21, 2026
…ing load (huggingface#45386)

* [GGUF] cast dequantized tensors to target dtype during load

Signed-off-by: UsamaKenway <usamakenway@gmail.com>

* [GGUF] refac dtype, quantization casting

Signed-off-by: Usama Kenway <usamakenway@gmail.com>

* [GGUF] refac dtype

Signed-off-by: Usama Kenway <usamakenway@gmail.com>

---------

Signed-off-by: UsamaKenway <usamakenway@gmail.com>
Signed-off-by: Usama Kenway <usamakenway@gmail.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

4 participants