
Add GGUF support to Gemma4 (31B & 26B-A4B) text #45296

Open
UsamaKenway wants to merge 11 commits into huggingface:main from UsamaKenway:gemma4-gguf

Conversation

@UsamaKenway
Contributor

@UsamaKenway UsamaKenway commented Apr 7, 2026

What does this PR do?

Adds support for Gemma4 GGUF models: the 26B MoE and the 31B dense model.
This also helps me load GGUF models in vLLM, which I'm also working on.
The ViT part is not included.
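For context, GGUF checkpoints load through the standard `from_pretrained(..., gguf_file=...)` pattern in transformers. A minimal sketch of how these models would be loaded (the repo and file names below are hypothetical placeholders, not the actual Gemma4 GGUF repos):

```python
# Sketch of loading a GGUF checkpoint via transformers.
# NOTE: the repo IDs and file names here are placeholders for illustration.
GGUF_VARIANTS = {
    "31b-q4_k_m": ("example-org/gemma4-31b-GGUF", "gemma4-31b-q4_k_m.gguf"),
    "26b-it-q8_0": ("example-org/gemma4-26b-it-GGUF", "gemma4-26b-it-q8_0.gguf"),
}


def load_gguf_model(variant: str):
    """Return (tokenizer, model) for a named GGUF variant."""
    repo_id, gguf_file = GGUF_VARIANTS[variant]
    # Lazy import so the variant table can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
    model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)
    return tokenizer, model
```

transformers dequantizes the GGUF weights on load, so the resulting model behaves like a regular checkpoint for `generate()`.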

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

Tests:

# 1. Gemma4 31B (q4_k_m)
(py312venv) usamakenway@Megatron: RUN_SLOW=1 pytest tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_31b_q4_k_m -s 
tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_31b_q4_k_m . [100%]
================================= 1 passed in 498.43s (0:08:18) =================================

# 2. Gemma4 31B IT (q8_0)
(py312venv) usamakenway@Megatron: RUN_SLOW=1 pytest tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_31b_it_q8_0 -s
tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_31b_it_q8_0 . [100%]
================================= 1 passed in 506.82s (0:08:26) =================================

# 3. Gemma4 31B IT (q4_k_m)
(py312venv) usamakenway@Megatron: RUN_SLOW=1 pytest tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_31b_it_q4_k_m -s
tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_31b_it_q4_k_m . [100%]
================================= 1 passed in 488.88s (0:08:08) =================================

# 4. Gemma4 26B IT (q8_0)
(py312venv) usamakenway@Megatron: RUN_SLOW=1 pytest tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_26b_it_q8_0 -s
tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_26b_it_q8_0 . [100%]
================================= 1 passed in 420.12s (0:07:00) =================================

# 5. Gemma4 26B IT (q4_k_m)
(py312venv) usamakenway@Megatron: RUN_SLOW=1 pytest tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_26b_it_q4_k_m -s
tests/quantization/ggml/test_ggml.py::GgufModelTests::test_gemma4_26b_it_q4_k_m . [100%]
================================= 1 passed in 473.56s (0:07:53) =================================

The unit test for the 31B model produced repeated "Hello" tokens, so I ran the following checks to confirm that chat generation works correctly:

# === Test 1: Bare text (unit test style) ===
text = tokenizer(self.example_text, return_tensors="pt")["input_ids"]
out = model.generate(text, max_new_tokens=10)
# Output: "HelloKelloKelloKelloKelloKello"

# === Test 2: Chat completion ===
messages = [{"role": "user", "content": "Hi how are you"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
# Output: "I'm doing well, thank you for asking! How are you doing today? Is there anything I can help you with?"

# === Test 3: Sentence continuation ===
text = tokenizer("The capital of France is", return_tensors="pt")["input_ids"].to(model.device)
print(f"input_ids: {text}")
out = model.generate(text, max_new_tokens=20, do_sample=False)
print(f"Output: {tokenizer.decode(out[0], skip_special_tokens=True)}")

input_ids: tensor([[   2,  818, 5279,  529, 7001,  563]], device='cuda:0')
Output: The capital of France is Paris.<turn|>
<|channel>thought
<channel|>The capital of France is Paris.<turn|>}<turn|>

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

UsamaKenway and others added 8 commits April 7, 2026 00:18
Signed-off-by: UsamaKenway <usamakenway@gmail.com>
@Rocketknight1
Member

cc @SunMarc

Member

@SunMarc SunMarc left a comment


Thanks, left a comment

Comment thread tests/quantization/ggml/test_ggml.py
UsamaKenway and others added 2 commits April 11, 2026 17:06
- Add base model
- rename instruct models

Signed-off-by: UsamaKenway <usamakenway@gmail.com>
@UsamaKenway
Contributor Author

UsamaKenway commented Apr 12, 2026

Addressed the feedback regarding expected values, updated the tests, and added the base model.

@UsamaKenway UsamaKenway requested a review from SunMarc April 12, 2026 10:46
- ruff reformat

Signed-off-by: UsamaKenway <usamakenway@gmail.com>
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: ggml

Member

@SunMarc SunMarc left a comment


Thanks, just a nit


def test_gemma4_26b_it_q8_0(self):
    tokenizer = AutoTokenizer.from_pretrained(
        self.gemma4_26b_it_model_id, gguf_file=self.q8_0_gemma4_26b_it_model_id
    )
Member


our CI won't have enough space to run these models. So let's just skip those for now
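One way to keep these tests in the suite while keeping them off CI (a sketch only; the exact mechanism is up to the maintainers) is a plain `unittest.skip` decorator; the class and test names below are illustrative:

```python
import unittest


class LargeGgufModelTests(unittest.TestCase):
    # Skipped on CI: the 26B/31B GGUF checkpoints exceed runner disk space.
    # Remove the decorator, or gate the skip on an env var, to run locally.
    @unittest.skip("GGUF checkpoint too large for CI disk space")
    def test_gemma4_31b_q4_k_m(self):
        self.fail("should not run on CI")
```

Locally the test can still be exercised by deleting the decorator or gating the skip on an environment variable such as `RUN_SLOW`, matching the pattern already used in the test commands above.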
