From 0c0e6e29dccb7ead051ed6448887ab47fea1700f Mon Sep 17 00:00:00 2001 From: Saswat Meher Date: Sat, 19 Apr 2025 13:30:57 +0900 Subject: [PATCH 01/12] update siglip2 model card --- docs/source/en/model_doc/siglip2.md | 291 +++++++++++----------------- 1 file changed, 110 insertions(+), 181 deletions(-) diff --git a/docs/source/en/model_doc/siglip2.md b/docs/source/en/model_doc/siglip2.md index 0d49d9382361..14dbcd6f2077 100644 --- a/docs/source/en/model_doc/siglip2.md +++ b/docs/source/en/model_doc/siglip2.md @@ -14,225 +14,154 @@ rendered properly in your Markdown viewer. --> -# SigLIP2 - -
-PyTorch -FlashAttention -SDPA +
+
+ PyTorch + FlashAttention + SDPA +
+# SigLIP2 + ## Overview -The SigLIP2 model was proposed in [SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features](https://huggingface.co/papers/2502.14786) by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, -Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, -Andreas Steiner and Xiaohua Zhai. +[SigLIP2](https://huggingface.co/papers/2502.14786) is a family of multilingual vision-language encoders, it is build upon the success of the original [SigLIP](siglip). The original image-text training objective is extended by several independently developed techniques. These include decoder-based pretraining, self-supervised losses such as self-distillation and masked prediction, as well as online data curation. The model comes in two variants +- FixRes - model works with fixed resolution images (backward compatible with SigLIP v1) +- NaFlex - model works with variable image aspect ratios and resolutions (SigLIP2 in `transformers`) - 1) FixRes - model works with fixed resolution images (backward compatible with SigLIP v1) - 2) NaFlex - model works with variable image aspect ratios and resolutions (SigLIP2 in `transformers`) - -The abstract from the paper is the following: - -*We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success -of the original SigLIP. In this second iteration, we extend the original image-text training objective with -several prior, independently developed techniques into a unified recipe—this includes decoder-based -pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With -these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, -including zero-shot classification (best SigLIP 2 ViT-g/16 achieves 85.0% ImageNet zero-shot -accuracy), image-text retrieval, and transfer performance when extracting visual representations for -Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements -on localization and dense prediction tasks. We also train variants which support multiple resolutions -and preserve the input’s native aspect ratio. Finally, we train on a more diverse data-mixture that -includes de-biasing techniques, leading to much better multilingual understanding and improved fair- -ness. To provide users with the ability to trade-off inference cost with performance, we release model -checkpoints at four sizes (ViT-B/86M, L/303M, So400m/400M, and g/1B).* - -## Usage tips - -- Usage of SigLIP2 is similar to [SigLIP](siglip) and [CLIP](clip). The main difference from CLIP is the training loss, which does not require a global view of all the pairwise similarities of images and texts within a batch. One needs to apply the sigmoid activation function to the logits, rather than the softmax. -- Training is supported but does not use `torch.distributed` utilities which may limit the scalability of batch size. However, DDP and FDSP works on single-node multi-gpu setup. -- When using the standalone [`GemmaTokenizerFast`] make sure to pass `padding="max_length"` and `max_length=64` as that's how the model was trained. -- Model was trained with *lowercased* text, make sure you make the same preprocessing for your text labels. -- To get the same results as the pipeline, a prompt template of "this is a photo of {label}" should be used. 
-- The NaFlex variant supports processing images at higher resolutions by adjusting the `max_num_patches` parameter in the `Processor`. The default value is `max_num_patches=256`. Increasing `max_num_patches` to 1024 (4x) will approximately double processed image height and width, while preserving the aspect ratio. - - +You can find all the original SigLIP2 checkpoints under the [SigLIP2](https://huggingface.co/collections/google/siglip2-67b5dcef38c175486e240107) collection. -This model was contributed by [qubvel](https://huggingface.co/qubvel-hf). -The original code can be found [here](https://github.com/google-research/big_vision/tree/main). +> [!TIP] +> Click on the SigLIP2 models in the right sidebar for more examples of how to apply SigLIP2 to different image and text tasks. -## Usage example +The example below demonstrates how to generate similarity scores between texts and image(s) with [`Pipeline`] or the [`AutoModel`] class. -There are 2 main ways to use SigLIP2: either using the pipeline API, which abstracts away all the complexity for you, or by using the `Siglip2Model` class yourself. + + -### FixRes variant +```py +import torch +from transformers import pipeline -**Pipeline API** +image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"] -The pipeline allows to use the model in a few lines of code: - -```python ->>> from transformers import pipeline ->>> from PIL import Image ->>> import requests - ->>> # load pipe ->>> image_classifier = pipeline( -... task="zero-shot-image-classification", -... model="google/siglip2-base-patch16-224", -... ) - ->>> # load image ->>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg' ->>> image = Image.open(requests.get(url, stream=True).raw) - ->>> # inference ->>> candidate_labels = ["2 cats", "a plane", "a remote"] ->>> outputs = image_classifier(image, candidate_labels=candidate_labels) ->>> outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs] ->>> print(outputs) -[{'score': 0.1499, 'label': '2 cats'}, {'score': 0.0008, 'label': 'a remote'}, {'score': 0.0, 'label': 'a plane'}] +pipeline = pipeline(task="zero-shot-image-classification", model="google/siglip2-base-patch16-224", device=0, torch_dtype=torch.bfloat16) +print(pipeline(image, candidate_labels=candidate_labels)) ``` -**Using the model yourself** + + -If you want to do the pre- and postprocessing yourself, here's how to do that: +```py +import torch +import requests +from PIL import Image +from transformers import AutoProcessor, AutoModel -```python ->>> from PIL import Image ->>> import requests ->>> from transformers import AutoProcessor, AutoModel ->>> import torch +model = AutoModel.from_pretrained("google/siglip2-base-patch16-224", torch_dtype=torch.float16, device_map="auto", attn_implementation="sdpa") +processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224") ->>> model = AutoModel.from_pretrained("google/siglip2-base-patch16-224") ->>> processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224") +url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +image = Image.open(requests.get(url, stream=True).raw) +candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"] ->>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" ->>> image = Image.open(requests.get(url, stream=True).raw) - ->>> 
candidate_labels = ["2 cats", "2 dogs"] # follows the pipeline prompt template to get same results ->>> texts = [f"This is a photo of {label}." for label in candidate_labels] +texts = [f'This is a photo of {label}.' for label in candidate_labels] # IMPORTANT: we pass `padding=max_length` and `max_length=64` since the model was trained with this ->>> inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt") +inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt").to("cuda") ->>> with torch.no_grad(): -... outputs = model(**inputs) +with torch.no_grad(): + outputs = model(**inputs) ->>> logits_per_image = outputs.logits_per_image ->>> probs = torch.sigmoid(logits_per_image) # these are the probabilities ->>> print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'") -15.0% that image 0 is '2 cats' +logits_per_image = outputs.logits_per_image +probs = torch.sigmoid(logits_per_image) +print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'") ``` -### NaFlex variant - -NaFlex combines ideas from FlexiViT, i.e. supporting multiple, predefined sequence lengths -with a single ViT model, and NaViT, namely processing images at their native aspect ratio. -This enables processing different types of images at appropriate resolution, e.g. using a -larger resolution to process document images, while at the same time minimizing the impact -of aspect ratio distortion on certain inference tasks, e.g. on OCR. - -Given a patch size and target sequence length, NaFlex preprocesses the data by first resizing -the input image such that the height and width after resizing are multiples of the patch size, -while - - 1. keeping the aspect ratio distortion as small as possible - 2. producing a sequence length of at most the desired target sequence length (`max_num_patches`) - -The resulting distortion in width and height is at most `(patch_size - 1) / width` and -`(patch_size - 1) / height`, respectively, which tends to be small for common resolutions and aspect ratios. -After resizing, the image is split into a sequence of patches, and a mask with padding information is added. - -```python ->>> from PIL import Image ->>> import requests ->>> from transformers import AutoProcessor, AutoModel ->>> import torch - ->>> model = AutoModel.from_pretrained("google/siglip2-base-patch16-naflex") ->>> processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-naflex") - ->>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" ->>> image = Image.open(requests.get(url, stream=True).raw) - ->>> candidate_labels = ["2 cats", "2 dogs"] -# follows the pipeline prompt template to get same results ->>> texts = [f"This is a photo of {label}." for label in candidate_labels] + + -# default value for `max_num_patches` is 256, but you can increase resulted image resolution providing -# higher values e.g. `max_num_patches=512` ->>> inputs = processor(text=texts, images=image, max_num_patches=256, return_tensors="pt") +Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends. ->>> with torch.no_grad(): -... outputs = model(**inputs) +The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to int4. 
->>> logits_per_image = outputs.logits_per_image ->>> probs = torch.sigmoid(logits_per_image) # these are the probabilities ->>> print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'") -21.1% that image 0 is '2 cats' -``` +```py +import torch +import requests +from PIL import Image +from transformers import AutoProcessor, AutoModel, BitsAndBytesConfig -## Resources +bnb_config = BitsAndBytesConfig(load_in_4bit=True) +model = AutoModel.from_pretrained("google/siglip2-base-patch16-224", quantization_config=bnb_config, device_map="auto", attn_implementation="sdpa") +processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224") -A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with SigLIP2. - -- [Zero-shot image classification task guide](../tasks/zero_shot_image_classification) -- Demo notebook for SigLIP2 can be found [here](https://github.com/qubvel/transformers-notebooks/tree/master/notebooks/SigLIP2_inference.ipynb). 🌎 - -If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. +url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +image = Image.open(requests.get(url, stream=True).raw) +candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"] +# follows the pipeline prompt template to get same results +texts = [f'This is a photo of {label}.' for label in candidate_labels] -## Combining SigLIP2 and Flash Attention 2 +# IMPORTANT: we pass `padding=max_length` and `max_length=64` since the model was trained with this +inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt").to("cuda") -First, make sure to install the latest version of Flash Attention 2. +with torch.no_grad(): + outputs = model(**inputs) -```bash -pip install -U flash-attn --no-build-isolation +logits_per_image = outputs.logits_per_image +probs = torch.sigmoid(logits_per_image) +print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'") ``` -Make also sure that you have a hardware that is compatible with Flash-Attention 2. Read more about it in the official documentation of flash-attn repository. Make also sure to load your model in half-precision (e.g. `torch.float16``) - -To load and run a model using Flash Attention 2, refer to the snippet below: - -```python ->>> import torch ->>> import requests ->>> from PIL import Image ->>> from transformers import AutoProcessor, AutoModel ->>> device = "cuda" # the device to load the model onto - ->>> model = AutoModel.from_pretrained( -... "google/siglip2-so400m-patch14-384", -... attn_implementation="flash_attention_2", -... torch_dtype=torch.float16, -... device_map=device, -... ) ->>> processor = AutoProcessor.from_pretrained("google/siglip2-so400m-patch14-384") - ->>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" ->>> image = Image.open(requests.get(url, stream=True).raw) - ->>> candidate_labels = ["2 cats", "2 dogs"] -# follows the pipeline prompt template to get same results ->>> texts = [f'This is a photo of {label}.' for label in candidate_labels] -# important: we pass `padding=max_length` since the model was trained with this ->>> inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt").to(device) - ->>> with torch.no_grad(): -... with torch.autocast(device): -... 
outputs = model(**inputs) - ->>> logits_per_image = outputs.logits_per_image ->>> probs = torch.sigmoid(logits_per_image) # these are the probabilities ->>> print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'") -19.8% that image 0 is '2 cats' -``` +## Notes +- Training is supported for DDP and FSDP on single-node multi-GPU setups. However, it does not use [torch.distributed](https://pytorch.org/tutorials/beginner/dist_overview.html) utilities which may limit the scalability of batch size. +- When using the standalone [`GemmaTokenizerFast`] make sure to pass `padding="max_length"` and `max_length=64` as that's how the model was trained. +- Model was trained with *lowercased* text, make sure you make the same preprocessing for your text labels. +- To get the same results as the [`Pipeline`], a prompt template of `"This is a photo of {label}."` should be passed to the processor. +- The NaFlex variant enables processing different types of images at appropriate resolution, e.g. using a larger resolution to process document images, while at the same time minimizing the impact of aspect ratio distortion on certain inference tasks, e.g. on OCR. + ```py + import torch + import requests + from PIL import Image + from transformers import AutoProcessor, AutoModel + + model = AutoModel.from_pretrained("google/siglip2-base-patch16-naflex", torch_dtype=torch.float16, device_map="auto", attn_implementation="sdpa") + processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-naflex") + + url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" + image = Image.open(requests.get(url, stream=True).raw) + candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"] + texts = [f'This is a photo of {label}.' for label in candidate_labels] + + # default value for `max_num_patches` is 256, but you can increase resulted image resolution providing higher values e.g. `max_num_patches=512` + inputs = processor(text=texts, images=image, padding="max_length", max_num_patches=256, return_tensors="pt").to("cuda") + + with torch.no_grad(): + outputs = model(**inputs) + + logits_per_image = outputs.logits_per_image + probs = torch.sigmoid(logits_per_image) + print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'") + ``` +- Toggle the `attn_implementation` parameter to either `"sdpa"` or `"flash_attention_2"` to use a more memory-efficient attention. 
  ```py
  # pip install -U flash-attn --no-build-isolation

  import torch
  from transformers import Siglip2Model

  # load in half precision; FlashAttention 2 only supports fp16/bf16 weights
  model = Siglip2Model.from_pretrained(
      "google/siglip2-so400m-patch14-384",
      attn_implementation="flash_attention_2",
      torch_dtype=torch.float16,
      device_map="auto",
  )
  ```

## Siglip2Config

[[autodoc]] Siglip2Config

From b16f6cb8285143da10344ccb379c042f316c8a80 Mon Sep 17 00:00:00 2001
From: saswatmeher <35535056+saswatmeher@users.noreply.github.com>
Date: Thu, 24 Apr 2025 21:54:23 +0900
Subject: [PATCH 02/12] Update docs/source/en/model_doc/siglip2.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/siglip2.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/siglip2.md b/docs/source/en/model_doc/siglip2.md
index 14dbcd6f2077..aa14761e1af7 100644
--- a/docs/source/en/model_doc/siglip2.md
+++ b/docs/source/en/model_doc/siglip2.md
@@ -124,7 +124,9 @@ print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
 - When using the standalone [`GemmaTokenizerFast`] make sure to pass `padding="max_length"` and `max_length=64` as that's how the model was trained.
 - Model was trained with *lowercased* text, make sure you make the same preprocessing for your text labels.
 - To get the same results as the [`Pipeline`], a prompt template of `"This is a photo of {label}."` should be passed to the processor.
-- The NaFlex variant enables processing different types of images at appropriate resolution, e.g. using a larger resolution to process document images, while at the same time minimizing the impact of aspect ratio distortion on certain inference tasks, e.g. on OCR.
+- The NaFlex variant processes different types of images at the appropriate resolution (using a larger resolution to process document images for example), while also minimizing the impact of aspect ratio distortion for certain inference tasks like OCR.
+
+  NaFlex resizes the input image so the height and width are multiples of the patch size after resizing. It keeps the aspect ratio distortion as low as possible and produces a sequence length of at most the desired target sequence length (`max_num_patches`). After resizing, the image is split into a sequence of patches and a mask with padding information is added.
 ```py
 import torch
 import requests

From 1e82f8406cebe95cdd902ec0f377aeb7a41f8a21 Mon Sep 17 00:00:00 2001
From: saswatmeher <35535056+saswatmeher@users.noreply.github.com>
Date: Thu, 24 Apr 2025 21:54:39 +0900
Subject: [PATCH 03/12] Update docs/source/en/model_doc/siglip2.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/siglip2.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/siglip2.md b/docs/source/en/model_doc/siglip2.md
index aa14761e1af7..bc026bf42ed6 100644
--- a/docs/source/en/model_doc/siglip2.md
+++ b/docs/source/en/model_doc/siglip2.md
@@ -122,7 +122,7 @@ print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
 - Training is supported for DDP and FSDP on single-node multi-GPU setups. However, it does not use [torch.distributed](https://pytorch.org/tutorials/beginner/dist_overview.html) utilities which may limit the scalability of batch size.
 - When using the standalone [`GemmaTokenizerFast`] make sure to pass `padding="max_length"` and `max_length=64` as that's how the model was trained.
-- Model was trained with *lowercased* text, make sure you make the same preprocessing for your text labels.
+- Model was trained with *lowercased* text, so make sure your text labels are preprocessed the same way. - To get the same results as the [`Pipeline`], a prompt template of `"This is a photo of {label}."` should be passed to the processor. - The NaFlex variant processes different types of images at the appropriate resolution (using a larger resolution to process document images for example), while also minimizing the impact of aspect ratio distortion for certain inference tasks like OCR. From 5ef0169bbcf49c71551daa1aecca8821d3bf4af1 Mon Sep 17 00:00:00 2001 From: saswatmeher <35535056+saswatmeher@users.noreply.github.com> Date: Thu, 24 Apr 2025 21:55:23 +0900 Subject: [PATCH 04/12] Update docs/source/en/model_doc/siglip2.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/siglip2.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/source/en/model_doc/siglip2.md b/docs/source/en/model_doc/siglip2.md index bc026bf42ed6..2c90c6fea834 100644 --- a/docs/source/en/model_doc/siglip2.md +++ b/docs/source/en/model_doc/siglip2.md @@ -29,8 +29,6 @@ rendered properly in your Markdown viewer. [SigLIP2](https://huggingface.co/papers/2502.14786) is a family of multilingual vision-language encoders, it is build upon the success of the original [SigLIP](siglip). The original image-text training objective is extended by several independently developed techniques. These include decoder-based pretraining, self-supervised losses such as self-distillation and masked prediction, as well as online data curation. The model comes in two variants -- FixRes - model works with fixed resolution images (backward compatible with SigLIP v1) -- NaFlex - model works with variable image aspect ratios and resolutions (SigLIP2 in `transformers`) You can find all the original SigLIP2 checkpoints under the [SigLIP2](https://huggingface.co/collections/google/siglip2-67b5dcef38c175486e240107) collection. From 891e8bf7debd2e6729afc255064c34ea8acd1647 Mon Sep 17 00:00:00 2001 From: saswatmeher <35535056+saswatmeher@users.noreply.github.com> Date: Thu, 24 Apr 2025 21:55:47 +0900 Subject: [PATCH 05/12] Update docs/source/en/model_doc/siglip2.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/siglip2.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/siglip2.md b/docs/source/en/model_doc/siglip2.md index 2c90c6fea834..5f71ef04d57b 100644 --- a/docs/source/en/model_doc/siglip2.md +++ b/docs/source/en/model_doc/siglip2.md @@ -26,7 +26,10 @@ rendered properly in your Markdown viewer. ## Overview -[SigLIP2](https://huggingface.co/papers/2502.14786) is a family of multilingual vision-language encoders, it is build upon the success of the original [SigLIP](siglip). The original image-text training objective is extended by several independently developed techniques. These include decoder-based pretraining, self-supervised losses such as self-distillation and masked prediction, as well as online data curation. +[SigLIP2](https://huggingface.co/papers/2502.14786) is a family of multilingual vision-language encoders that builds on the [SigLIP](./siglip) training recipe. It includes decoder-based pretraining, self-distillation, and masked prediction to improve dense prediction tasks (segmentation, depth estimation, etc.). 
This model is available in two variants: + +- NaFlex supports different resolutions and maintains the native image aspect ratio +- FixRes supports fixed resolutions and is backwards compatible with [SigLIP](./siglip) The model comes in two variants From e92653de59530c42967ca9d4bbfef6e83b1b14e8 Mon Sep 17 00:00:00 2001 From: saswatmeher <35535056+saswatmeher@users.noreply.github.com> Date: Thu, 24 Apr 2025 21:56:10 +0900 Subject: [PATCH 06/12] Update docs/source/en/model_doc/siglip2.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/siglip2.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/siglip2.md b/docs/source/en/model_doc/siglip2.md index 5f71ef04d57b..58f68e88f7a7 100644 --- a/docs/source/en/model_doc/siglip2.md +++ b/docs/source/en/model_doc/siglip2.md @@ -38,7 +38,7 @@ You can find all the original SigLIP2 checkpoints under the [SigLIP2](https://hu > [!TIP] > Click on the SigLIP2 models in the right sidebar for more examples of how to apply SigLIP2 to different image and text tasks. -The example below demonstrates how to generate similarity scores between texts and image(s) with [`Pipeline`] or the [`AutoModel`] class. +The example below demonstrates zero-shot classification with [`Pipeline`] or the [`AutoModel`] class. From bc6a5d0d3969ffdfa3c360e2580600c1a47a0c1c Mon Sep 17 00:00:00 2001 From: saswatmeher <35535056+saswatmeher@users.noreply.github.com> Date: Thu, 24 Apr 2025 21:56:21 +0900 Subject: [PATCH 07/12] Update docs/source/en/model_doc/siglip2.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/siglip2.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/siglip2.md b/docs/source/en/model_doc/siglip2.md index 58f68e88f7a7..3fd3a754ddc7 100644 --- a/docs/source/en/model_doc/siglip2.md +++ b/docs/source/en/model_doc/siglip2.md @@ -51,7 +51,7 @@ image = "https://huggingface.co/datasets/huggingface/documentation-images/resolv candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"] pipeline = pipeline(task="zero-shot-image-classification", model="google/siglip2-base-patch16-224", device=0, torch_dtype=torch.bfloat16) -print(pipeline(image, candidate_labels=candidate_labels)) +pipeline(image, candidate_labels=candidate_labels) ``` From 1b31a930b0cbb252fadb0a7b2886db97ede088e3 Mon Sep 17 00:00:00 2001 From: Saswat Meher Date: Thu, 24 Apr 2025 22:06:20 +0900 Subject: [PATCH 08/12] address comments --- docs/source/en/model_doc/siglip2.md | 55 ++++++++++++++++------------- 1 file changed, 30 insertions(+), 25 deletions(-) diff --git a/docs/source/en/model_doc/siglip2.md b/docs/source/en/model_doc/siglip2.md index 3fd3a754ddc7..9600e8a6418c 100644 --- a/docs/source/en/model_doc/siglip2.md +++ b/docs/source/en/model_doc/siglip2.md @@ -57,6 +57,35 @@ pipeline(image, candidate_labels=candidate_labels) +The example below demonstrates a zero-shot classification with NaFlex variant of the model. 
+ +```py +import torch +import requests +from PIL import Image +from transformers import AutoProcessor, AutoModel + +model = AutoModel.from_pretrained("google/siglip2-base-patch16-naflex", torch_dtype=torch.float16, device_map="auto", attn_implementation="sdpa") +processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-naflex") + +url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +image = Image.open(requests.get(url, stream=True).raw) +candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"] +texts = [f'This is a photo of {label}.' for label in candidate_labels] + +# default value for `max_num_patches` is 256, but you can increase resulted image resolution providing higher values e.g. `max_num_patches=512` +inputs = processor(text=texts, images=image, padding="max_length", max_num_patches=256, return_tensors="pt").to("cuda") + +with torch.no_grad(): + outputs = model(**inputs) + +logits_per_image = outputs.logits_per_image +probs = torch.sigmoid(logits_per_image) +print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'") +``` + +The example below demonstrates a zero-shot classification with FixRes variant of the model. + ```py import torch import requests @@ -98,7 +127,7 @@ from PIL import Image from transformers import AutoProcessor, AutoModel, BitsAndBytesConfig bnb_config = BitsAndBytesConfig(load_in_4bit=True) -model = AutoModel.from_pretrained("google/siglip2-base-patch16-224", quantization_config=bnb_config, device_map="auto", attn_implementation="sdpa") +model = AutoModel.from_pretrained("google/siglip2-large-patch16-512", quantization_config=bnb_config, device_map="auto", attn_implementation="sdpa") processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224") url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" @@ -128,30 +157,6 @@ print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'") - The NaFlex variant processes different types of images at the appropriate resolution (using a larger resolution to process document images for example), while also minimizing the impact of aspect ratio distortion for certain inference tasks like OCR. NaFlex resizes the input image so the height and width are multiples of the patch size after resizing. It keeps the aspect ratio distortion as low as possible and produces a sequence length of at most the desired target sequence length (`max_num_patches`). After resizing, the image is split into a sequence of patches and a mask with padding information is added. - ```py - import torch - import requests - from PIL import Image - from transformers import AutoProcessor, AutoModel - - model = AutoModel.from_pretrained("google/siglip2-base-patch16-naflex", torch_dtype=torch.float16, device_map="auto", attn_implementation="sdpa") - processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-naflex") - - url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" - image = Image.open(requests.get(url, stream=True).raw) - candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"] - texts = [f'This is a photo of {label}.' for label in candidate_labels] - - # default value for `max_num_patches` is 256, but you can increase resulted image resolution providing higher values e.g. 
`max_num_patches=512` - inputs = processor(text=texts, images=image, padding="max_length", max_num_patches=256, return_tensors="pt").to("cuda") - - with torch.no_grad(): - outputs = model(**inputs) - - logits_per_image = outputs.logits_per_image - probs = torch.sigmoid(logits_per_image) - print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'") - ``` - Toggle the `attn_implementation` parameter to either `"sdpa"` or `"flash_attention_2"` to use a more memory-efficient attention. ```py # pip install -U flash-attn --no-build-isolation From dac97324252ac28a8c6ac6cc95a1f71437496ed2 Mon Sep 17 00:00:00 2001 From: Saswat Meher Date: Thu, 24 Apr 2025 22:25:16 +0900 Subject: [PATCH 09/12] separate naflex and fixres variant --- docs/source/en/model_doc/siglip2.md | 27 +++++++++++++-------------- 1 file changed, 13 insertions(+), 14 deletions(-) diff --git a/docs/source/en/model_doc/siglip2.md b/docs/source/en/model_doc/siglip2.md index 9600e8a6418c..fa5ae8318d98 100644 --- a/docs/source/en/model_doc/siglip2.md +++ b/docs/source/en/model_doc/siglip2.md @@ -55,9 +55,7 @@ pipeline(image, candidate_labels=candidate_labels) ``` - - -The example below demonstrates a zero-shot classification with NaFlex variant of the model. + ```py import torch @@ -65,16 +63,18 @@ import requests from PIL import Image from transformers import AutoProcessor, AutoModel -model = AutoModel.from_pretrained("google/siglip2-base-patch16-naflex", torch_dtype=torch.float16, device_map="auto", attn_implementation="sdpa") -processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-naflex") +model = AutoModel.from_pretrained("google/siglip2-base-patch16-224", torch_dtype=torch.float16, device_map="auto", attn_implementation="sdpa") +processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224") url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" image = Image.open(requests.get(url, stream=True).raw) candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"] + +# follows the pipeline prompt template to get same results texts = [f'This is a photo of {label}.' for label in candidate_labels] -# default value for `max_num_patches` is 256, but you can increase resulted image resolution providing higher values e.g. `max_num_patches=512` -inputs = processor(text=texts, images=image, padding="max_length", max_num_patches=256, return_tensors="pt").to("cuda") +# IMPORTANT: we pass `padding=max_length` and `max_length=64` since the model was trained with this +inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt").to("cuda") with torch.no_grad(): outputs = model(**inputs) @@ -84,7 +84,8 @@ probs = torch.sigmoid(logits_per_image) print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'") ``` -The example below demonstrates a zero-shot classification with FixRes variant of the model. 
+ + ```py import torch @@ -92,18 +93,16 @@ import requests from PIL import Image from transformers import AutoProcessor, AutoModel -model = AutoModel.from_pretrained("google/siglip2-base-patch16-224", torch_dtype=torch.float16, device_map="auto", attn_implementation="sdpa") -processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224") +model = AutoModel.from_pretrained("google/siglip2-base-patch16-naflex", torch_dtype=torch.float16, device_map="auto", attn_implementation="sdpa") +processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-naflex") url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" image = Image.open(requests.get(url, stream=True).raw) candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"] - -# follows the pipeline prompt template to get same results texts = [f'This is a photo of {label}.' for label in candidate_labels] -# IMPORTANT: we pass `padding=max_length` and `max_length=64` since the model was trained with this -inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt").to("cuda") +# default value for `max_num_patches` is 256, but you can increase resulted image resolution providing higher values e.g. `max_num_patches=512` +inputs = processor(text=texts, images=image, padding="max_length", max_num_patches=256, return_tensors="pt").to("cuda") with torch.no_grad(): outputs = model(**inputs) From 7db72508da6fcbe5e45609575747072437c0ed9c Mon Sep 17 00:00:00 2001 From: saswatmeher <35535056+saswatmeher@users.noreply.github.com> Date: Fri, 25 Apr 2025 21:23:25 +0900 Subject: [PATCH 10/12] Update docs/source/en/model_doc/siglip2.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/siglip2.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/source/en/model_doc/siglip2.md b/docs/source/en/model_doc/siglip2.md index fa5ae8318d98..aa23a280565f 100644 --- a/docs/source/en/model_doc/siglip2.md +++ b/docs/source/en/model_doc/siglip2.md @@ -31,7 +31,6 @@ rendered properly in your Markdown viewer. - NaFlex supports different resolutions and maintains the native image aspect ratio - FixRes supports fixed resolutions and is backwards compatible with [SigLIP](./siglip) -The model comes in two variants You can find all the original SigLIP2 checkpoints under the [SigLIP2](https://huggingface.co/collections/google/siglip2-67b5dcef38c175486e240107) collection. 
From 4a5368382f657c3d2c81cd2d57c6472e4930f22c Mon Sep 17 00:00:00 2001 From: saswatmeher <35535056+saswatmeher@users.noreply.github.com> Date: Fri, 25 Apr 2025 21:23:40 +0900 Subject: [PATCH 11/12] Update docs/source/en/model_doc/siglip2.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/siglip2.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/siglip2.md b/docs/source/en/model_doc/siglip2.md index aa23a280565f..e5fba00ded68 100644 --- a/docs/source/en/model_doc/siglip2.md +++ b/docs/source/en/model_doc/siglip2.md @@ -54,7 +54,7 @@ pipeline(image, candidate_labels=candidate_labels) ``` - + ```py import torch From f728b8c0f279973e4012161d8bd03f8cb652bdc8 Mon Sep 17 00:00:00 2001 From: saswatmeher <35535056+saswatmeher@users.noreply.github.com> Date: Fri, 25 Apr 2025 21:23:53 +0900 Subject: [PATCH 12/12] Update docs/source/en/model_doc/siglip2.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/siglip2.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/siglip2.md b/docs/source/en/model_doc/siglip2.md index e5fba00ded68..830258f2fc5c 100644 --- a/docs/source/en/model_doc/siglip2.md +++ b/docs/source/en/model_doc/siglip2.md @@ -84,7 +84,7 @@ print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'") ``` - + ```py import torch