2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -1182,6 +1182,8 @@
      title: Gemma3
    - local: model_doc/gemma3n
      title: Gemma3n
+    - local: model_doc/gemma4
+      title: Gemma4
    - local: model_doc/git
      title: GIT
    - local: model_doc/glm46v
242 changes: 242 additions & 0 deletions docs/source/en/model_doc/gemma4.md
@@ -0,0 +1,242 @@
<!--Copyright 2026 the HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer.

-->
*This model was released on {release_date} and added to Hugging Face Transformers on 2026-04-01.*


# Gemma4

## Overview

[Gemma 4](INSET_PAPER_LINK) is a multimodal model with pretrained and instruction-tuned variants, available in 1B, 13B, and 27B parameter sizes. The architecture is mostly the same as previous Gemma versions. The key differences are a vision processor that represents images with a fixed token budget and a spatial 2D RoPE that encodes vision-specific positional information across the height and width axes.

You can find all the original Gemma 4 checkpoints under the [Gemma 4](https://huggingface.co/collections/google/gemma-4-release-67c6c6f89c4f76621268bb6d) release.

### Gemma4 Vision Model

The key difference from previous Gemma releases is the new design for processing **images of different sizes** with a **fixed budget of tokens**. Unlike many models that squash every image into a fixed square (like 224×224), Gemma 4 preserves the image's natural aspect ratio while resizing it to satisfy two constraints (see the sketch after this list):
- The total number of pixels must fit within a patch budget
- Both height and width must be divisible by **48** (= patch size 16 × pooling kernel 3)
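
In practice, resizing amounts to scaling the image down until it fits the pixel budget and then snapping both sides to a multiple of 48. The sketch below only illustrates these two constraints; the helper name, the no-upscaling behavior, and the exact rounding strategy are assumptions, not the library implementation.

```py
import math

PATCH_SIZE = 16
POOL_KERNEL = 3
MULTIPLE = PATCH_SIZE * POOL_KERNEL  # 48

def fit_to_budget(height: int, width: int, max_patches: int = 2520) -> tuple[int, int]:
    """Scale (height, width) to fit the patch budget, keep the aspect ratio, and make both sides divisible by 48."""
    budget_pixels = max_patches * PATCH_SIZE**2
    scale = min(1.0, math.sqrt(budget_pixels / (height * width)))  # assumption: small images are not upscaled
    new_h = max(MULTIPLE, int(height * scale) // MULTIPLE * MULTIPLE)
    new_w = max(MULTIPLE, int(width * scale) // MULTIPLE * MULTIPLE)
    return new_h, new_w

print(fit_to_budget(1080, 1920))  # (576, 1056) with the default ~645K pixel budget
```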

> [!IMPORTANT]
> Gemma 4 does **not** apply the standard ImageNet mean/std normalization that many other vision models use. The model's own patch embedding layer handles the final scaling internally (shifting values to the [-1, 1] range).

The number of "soft tokens" (aka vision tokens) an image processor can produce is configurable. The supported options are outlined below and the default is **280 soft tokens** per image.


| Soft Tokens | Patches (before pooling) | Approx. Image Area |
|:-----------:|:------------------------:|:-------------------:|
| 70 | 630 | ~161K pixels |
| 140 | 1,260 | ~323K pixels |
| **280** | **2,520** | **~645K pixels** |
| 560 | 5,040 | ~1.3M pixels |
| 1,120 | 10,080 | ~2.6M pixels |
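
The columns follow directly from the patch size (16) and the pooling kernel (3): each soft token pools a 3×3 block of 16×16-pixel patches. A quick check reproduces the table:

```py
# Each soft token pools a 3x3 block of 16x16-pixel patches.
patch_size, pool_kernel = 16, 3
for soft_tokens in (70, 140, 280, 560, 1120):
    patches = soft_tokens * pool_kernel**2      # patches before pooling
    pixels = patches * patch_size**2            # approximate image area covered
    print(f"{soft_tokens:>5} soft tokens -> {patches:>6} patches -> ~{pixels / 1_000:.0f}K pixels")
```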


To encode positional information for each patch in the image, Gemma 4 uses a learned 2D position embedding table. The table stores up to 10,240 positions per axis, which allows the model to handle very large images. Each position is a learned vector with the same dimension as the patch embedding. The 2D RoPE used by Gemma 4 independently rotates half of the attention head dimensions with the x-coordinate and the other half with the y-coordinate. This allows the model to understand spatial relationships like "above," "below," "left of," and "right of."
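
As a rough illustration of the axial rotation idea only (the frequencies, tensor layout, and function names below are assumptions, not the Gemma 4 implementation):

```py
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (..., dim) with even dim; pos: (...) integer positions along one axis
    dim = x.shape[-1]
    freqs = 1.0 / base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = pos[..., None].float() * freqs             # (..., dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(q: torch.Tensor, x_pos: torch.Tensor, y_pos: torch.Tensor) -> torch.Tensor:
    # Rotate the first half of the head dimensions with x positions, the second half with y positions.
    half = q.shape[-1] // 2
    return torch.cat([rope_1d(q[..., :half], x_pos), rope_1d(q[..., half:], y_pos)], dim=-1)

q = torch.randn(4, 64)                # 4 patches, head_dim = 64
x_pos = torch.tensor([0, 1, 0, 1])    # patch grid coordinates
y_pos = torch.tensor([0, 0, 1, 1])
print(rope_2d(q, x_pos, y_pos).shape)  # torch.Size([4, 64])
```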



## Usage examples

The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(
    task="image-text-to-text",
    model="google/gemma-4-2b-pt",
    device=0,
    dtype=torch.bfloat16
)
pipeline(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
    text="<start_of_image> What is shown in this image?"
)
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "google/gemma-4-2b-it",
    dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa"
)
processor = AutoProcessor.from_pretrained(
    "google/gemma-4-2b-it",
    padding_side="left"
)

messages = [
    {
        "role": "user", "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "What is shown in this image?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

output = model.generate(**inputs, max_new_tokens=50, cache_implementation="static")
print(processor.decode(output[0], skip_special_tokens=True))
```

</hfoption>
</hfoptions>

### Function calling

TODO: add complete function-calling and agent examples.
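
In the meantime, the snippet below is a minimal sketch of passing tool definitions through the chat template. It assumes the Gemma 4 chat template accepts a `tools` list like other tool-enabled chat models in Transformers; the `get_current_weather` tool is purely illustrative.

```py
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

def get_current_weather(city: str) -> str:
    """Gets the current weather in a given city.

    Args:
        city: The city to get the weather for.
    """
    return f"The weather in {city} is sunny."

model = AutoModelForImageTextToText.from_pretrained(
    "google/gemma-4-2b-it",
    dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("google/gemma-4-2b-it")

messages = [
    {"role": "user", "content": "What is the weather like in Paris right now?"},
]
# Text-only tool use goes through the tokenizer's chat template, which accepts a `tools` list.
inputs = processor.tokenizer.apply_chat_template(
    messages,
    tools=[get_current_weather],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```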

### Quantization

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.

```py
# pip install torchao
import torch
from transformers import TorchAoConfig, Gemma4ForConditionalGeneration, AutoProcessor

quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
model = Gemma4ForConditionalGeneration.from_pretrained(
    "google/gemma-4-2b-it",
    dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config
)
processor = AutoProcessor.from_pretrained(
    "google/gemma-4-2b-it",
    padding_side="left"
)

messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."}
        ]
    },
    {
        "role": "user", "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "What is shown in this image?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

output = model.generate(**inputs, max_new_tokens=50, cache_implementation="static")
print(processor.decode(output[0], skip_special_tokens=True))
```

## Gemma4AudioConfig

[[autodoc]] Gemma4AudioConfig

## Gemma4VisionConfig

[[autodoc]] Gemma4VisionConfig

## Gemma4TextConfig

[[autodoc]] Gemma4TextConfig

## Gemma4Config

[[autodoc]] Gemma4Config

## Gemma4AudioFeatureExtractor

[[autodoc]] Gemma4AudioFeatureExtractor
- __call__

## Gemma4ImageProcessorPil

[[autodoc]] Gemma4ImageProcessorPil
- preprocess

## Gemma4ImageProcessor

[[autodoc]] Gemma4ImageProcessor
- preprocess

## Gemma4VideoProcessor

[[autodoc]] Gemma4VideoProcessor
- preprocess

## Gemma4Processor

[[autodoc]] Gemma4Processor
- __call__

## Gemma4PreTrainedModel

[[autodoc]] Gemma4PreTrainedModel
- forward

## Gemma4AudioModel

[[autodoc]] Gemma4AudioModel
- forward

## Gemma4VisionModel

[[autodoc]] Gemma4VisionModel
- forward

## Gemma4TextModel

[[autodoc]] Gemma4TextModel
- forward

## Gemma4ForCausalLM

[[autodoc]] Gemma4ForCausalLM

## Gemma4Model

[[autodoc]] Gemma4Model
- forward

## Gemma4ForConditionalGeneration

[[autodoc]] Gemma4ForConditionalGeneration
- forward
5 changes: 3 additions & 2 deletions src/transformers/generation/candidate_generator.py
@@ -1285,15 +1285,16 @@ def _prepare_position_ids(model_kwargs: dict[str, Any], new_length: int, is_enco

def _prepare_token_type_ids(model_kwargs: dict[str, Any], new_length: int) -> dict[str, Any]:
    """Expands or crops the model's token_type_ids for decoding purposes, to the defined length"""
-    if "token_type_ids" not in model_kwargs or model_kwargs["token_type_ids"] is None:
+    if model_kwargs.get("token_type_ids") is None:
        return model_kwargs

+    # Multimodal models call this arg `mm_token_type_ids`
    token_type_ids = model_kwargs["token_type_ids"]
    final_token_type = token_type_ids[:, -1].unsqueeze(-1)
    type_length_diff = new_length - token_type_ids.shape[1]

    if type_length_diff < 0:
-        token_type_ids = token_type_ids[:, :type_length_diff]
+        model_kwargs["token_type_ids"] = token_type_ids[:, :type_length_diff]
    elif type_length_diff > 0:
        token_type_copies = final_token_type.repeat(1, type_length_diff)
        model_kwargs["token_type_ids"] = torch.cat([model_kwargs["token_type_ids"], token_type_copies], dim=-1)
10 changes: 9 additions & 1 deletion src/transformers/generation/utils.py
@@ -538,7 +538,7 @@ def prepare_inputs_for_generation(
model_inputs["token_type_ids"] = token_type_ids

# 3. Slice model inputs if it's an input that should have the same length as `input_ids`
for model_input_name in [position_ids_key, "token_type_ids"]:
for model_input_name in [position_ids_key, "token_type_ids", "mm_token_type_ids"]:
model_input = model_inputs.get(model_input_name)
if model_input is not None and model_input.shape[-1] != sequence_length:
# Input can be 2D or 3D, and we always slice on `seq-length` (last dim)
@@ -567,7 +567,9 @@
                attention_mask=attention_mask,
                past_key_values=model_inputs.get("past_key_values"),
                position_ids=model_inputs.get(position_ids_key),
+                # The following kwargs are not used in the main function - only on a few models with overloaded `create_masks_for_generate`
                token_type_ids=model_inputs.get("token_type_ids"),
+                mm_token_type_ids=model_inputs.get("mm_token_type_ids"),
                is_first_iteration=is_first_iteration,
            )

@@ -919,6 +921,12 @@ def _update_model_kwargs_for_generation(
        if (token_type_ids := model_kwargs.get("token_type_ids")) is not None:
            model_kwargs["token_type_ids"] = torch.cat([token_type_ids, token_type_ids[:, -num_new_tokens:]], dim=-1)

+        # update mm_token_type_ids with zeros (only-text)
+        if (mm_token_type_ids := model_kwargs.get("mm_token_type_ids")) is not None:
+            model_kwargs["mm_token_type_ids"] = torch.cat(
+                [mm_token_type_ids, mm_token_type_ids.new_zeros((mm_token_type_ids.shape[0], num_new_tokens))], dim=-1
+            )
+
        # Position ids (2D or 3D sometimes)
        position_ids_key = "position_ids" if not is_encoder_decoder else "decoder_position_ids"
        if (position_ids := model_kwargs.get(position_ids_key)) is not None:
9 changes: 3 additions & 6 deletions src/transformers/integrations/finegrained_fp8.py
@@ -596,10 +596,9 @@ def __init__(
        self.block_size = block_size
        self.hidden_dim = config.hidden_size
        self.activation_scheme = activation_scheme
-        self.num_experts = config.num_local_experts if hasattr(config, "num_local_experts") else config.num_experts
-        self.intermediate_dim = (
-            config.moe_intermediate_size if hasattr(config, "moe_intermediate_size") else config.intermediate_size
-        )
+        self.num_experts = getattr(config, "num_local_experts", config.num_experts)
+        self.intermediate_dim = getattr(config, "moe_intermediate_size", config.intermediate_size)
+        self.act_fn = ACT2FN[getattr(config, "hidden_activation", config.hidden_act)]

        if self.has_gate:
            gu_proj_out, gu_proj_in = 2 * self.intermediate_dim, self.hidden_dim
@@ -633,8 +632,6 @@ def __init__(
        self.gate_up_proj_activation_scale = nn.Parameter(torch.ones(self.num_experts, dtype=torch.float32))
        self.down_proj_activation_scale = nn.Parameter(torch.ones(self.num_experts, dtype=torch.float32))

-        self.act_fn = ACT2FN[config.hidden_act]
-
    def _apply_gate(self, gate_up: torch.Tensor) -> torch.Tensor:
        gate, up = gate_up.chunk(2, dim=-1)
        return self.act_fn(gate) * up