
Add GGUF support to Gemma4 (31B & 26B-A4B) text #45296

Open
UsamaKenway wants to merge 11 commits into huggingface:main from UsamaKenway:gemma4-gguf

Conversation

@UsamaKenway
Contributor

@UsamaKenway UsamaKenway commented Apr 7, 2026

What does this PR do?

Adds support for Gemma4 GGUF models: the 26B MoE and the 31B dense model.
This also helps me load GGUF models in vLLM, which I'm also working on.
The ViT part is not included.
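For context, GGUF checkpoints load through the standard `from_pretrained(..., gguf_file=...)` pattern in transformers. A minimal sketch of how these models would be loaded (the repo and file names below are hypothetical placeholders, not the actual Gemma4 GGUF repos):

```python
# Sketch of loading a GGUF checkpoint via transformers.
# NOTE: the repo IDs and file names here are placeholders for illustration.
GGUF_VARIANTS = {
    "31b-q4_k_m": ("example-org/gemma4-31b-GGUF", "gemma4-31b-q4_k_m.gguf"),
    "26b-it-q8_0": ("example-org/gemma4-26b-it-GGUF", "gemma4-26b-it-q8_0.gguf"),
}


def load_gguf_model(variant: str):
    """Return (tokenizer, model) for a named GGUF variant."""
    repo_id, gguf_file = GGUF_VARIANTS[variant]
    # Lazy import so the variant table can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
    model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)
    return tokenizer, model
```

transformers dequantizes the GGUF weights on load, so the resulting model behaves like a regular checkpoint for `generate()`.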

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

Tests:

# 1. Gemma4 31B (q4_k_m)
(py312venv) usamakenway@Megatron: RUN_SLOW=1 pytest tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_31b_q4_k_m -s 
tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_31b_q4_k_m . [100%]
================================= 1 passed in 498.43s (0:08:18) =================================

# 2. Gemma4 31B IT (q8_0)
(py312venv) usamakenway@Megatron: RUN_SLOW=1 pytest tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_31b_it_q8_0 -s
tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_31b_it_q8_0 . [100%]
================================= 1 passed in 506.82s (0:08:26) =================================

# 3. Gemma4 31B IT (q4_k_m)
(py312venv) usamakenway@Megatron: RUN_SLOW=1 pytest tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_31b_it_q4_k_m -s
tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_31b_it_q4_k_m . [100%]
================================= 1 passed in 488.88s (0:08:08) =================================

# 4. Gemma4 26B IT (q8_0)
(py312venv) usamakenway@Megatron: RUN_SLOW=1 pytest tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_26b_it_q8_0 -s
tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_26b_it_q8_0 . [100%]
================================= 1 passed in 420.12s (0:07:00) =================================

# 5. Gemma4 26B IT (q4_k_m)
(py312venv) usamakenway@Megatron: RUN_SLOW=1 pytest tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_26b_it_q4_k_m -s
tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_26b_it_q4_k_m . [100%]
================================= 1 passed in 473.56s (0:07:53) =================================

The unit test for the 31B model produced repeated "Hello" tokens, so I ran the following checks to confirm that chat generation works correctly:

# === Test 1: Bare text (unit test style) ===
text = tokenizer(self.example_text, return_tensors="pt")["input_ids"]
out = model.generate(text, max_new_tokens=10)
# Output: "HelloKelloKelloKelloKelloKello"

# === Test 2: Chat completion ===
messages = [{"role": "user", "content": "Hi how are you"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
# Output: "I'm doing well, thank you for asking! How are you doing today? Is there anything I can help you with?"

# === Test 3: Sentence continuation ===
text = tokenizer("The capital of France is", return_tensors="pt")["input_ids"].to(model.device)
print(f"input_ids: {text}")
out = model.generate(text, max_new_tokens=20, do_sample=False)
print(f"Output: {tokenizer.decode(out[0], skip_special_tokens=True)}")

input_ids: tensor([[   2,  818, 5279,  529, 7001,  563]], device='cuda:0')
Output: The capital of France is Paris.<turn|>
<|channel>thought
<channel|>The capital of France is Paris.<turn|>}<turn|>

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

UsamaKenway and others added 8 commits April 7, 2026 00:18
Signed-off-by: UsamaKenway <usamakenway@gmail.com>
@Rocketknight1
Member

cc @SunMarc

Member

@SunMarc SunMarc left a comment


Thanks, left a comment

Comment thread tests/quantization/ggml/test_ggml.py
UsamaKenway and others added 2 commits April 11, 2026 17:06
- Add base model
- rename instruct models

Signed-off-by: UsamaKenway <usamakenway@gmail.com>
@UsamaKenway
Contributor Author

UsamaKenway commented Apr 12, 2026

Addressed the feedback regarding expected values, updated the tests, and added the base model.

@UsamaKenway UsamaKenway requested a review from SunMarc April 12, 2026 10:46
- ruff reformat

Signed-off-by: UsamaKenway <usamakenway@gmail.com>
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: ggml

Member

@SunMarc SunMarc left a comment


Thanks, just a nit


def test_gemma4_26b_it_q8_0(self):
    tokenizer = AutoTokenizer.from_pretrained(
        self.gemma4_26b_it_model_id, gguf_file=self.q8_0_gemma4_26b_it_model_id
    )
Member


our CI won't have enough space to run these models. So let's just skip those for now
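One way to keep these tests in the suite while keeping them off CI (a sketch only; the exact mechanism is up to the maintainers) is a plain `unittest.skip` decorator; the class and test names below are illustrative:

```python
import unittest


class LargeGgufModelTests(unittest.TestCase):
    # Skipped on CI: the 26B/31B GGUF checkpoints exceed runner disk space.
    # Remove the decorator, or gate the skip on an env var, to run locally.
    @unittest.skip("GGUF checkpoint too large for CI disk space")
    def test_gemma4_31b_q4_k_m(self):
        self.fail("should not run on CI")
```

Locally the test can still be exercised by deleting the decorator or gating the skip on an environment variable such as `RUN_SLOW`, matching the pattern already used in the test commands above.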
