Merged (36 commits)
4bae67b Fix (NielsRogge, Feb 26, 2026)
2c96661 First draft (NielsRogge, Feb 26, 2026)
490ff1f Add push-to-hub options for SAM3-LiteText conversion (NielsRogge, Feb 26, 2026)
00777a5 Merge pull request #69 from NielsRogge/codex/add-sam3-litetext-model-… (NielsRogge, Feb 26, 2026)
4dd3735 Fix SAM3-LiteText model tests and text encoder init stability (NielsRogge, Feb 26, 2026)
0d96394 Add LiteText ViT auto mappings and use LiteText config (NielsRogge, Feb 27, 2026)
06fbf45 Merge branch 'add_sam_3_lite_text' into codex/add-sam3-litetext-model… (NielsRogge, Feb 27, 2026)
53f7dd4 Merge pull request #70 from NielsRogge/codex/add-sam3-litetext-model-… (NielsRogge, Feb 27, 2026)
4d8008a Improve conversion script (NielsRogge, Feb 27, 2026)
a5ce4ca Do not require triton (NielsRogge, Feb 27, 2026)
db44153 Improve modeling (NielsRogge, Feb 27, 2026)
98cea30 Fix repo (NielsRogge, Feb 27, 2026)
5fdc242 Merge remote-tracking branch 'upstream/main' into add_sam_3_lite_text (NielsRogge, Feb 27, 2026)
dcaceff Merge remote-tracking branch 'upstream/main' into add_sam_3_lite_text (NielsRogge, Feb 27, 2026)
c85571e Fix repo (NielsRogge, Feb 27, 2026)
8ba6455 Merge branch 'main' into add_sam_3_lite_text (NielsRogge, Mar 1, 2026)
5ab59a6 Add vision model to auto mapping (NielsRogge, Mar 1, 2026)
d5728ae Add missing entries to auto mapping (NielsRogge, Mar 1, 2026)
813dd0b Merge remote-tracking branch 'upstream/main' into add_sam_3_lite_text (yonigozlan, Mar 24, 2026)
583df21 reverse serve.py (yonigozlan, Mar 24, 2026)
8f35675 simplify implementation (yonigozlan, Mar 30, 2026)
37c3fcd Merge remote-tracking branch 'upstream/main' into add_sam_3_lite_text (yonigozlan, Mar 30, 2026)
a402d1a fix modular (yonigozlan, Mar 30, 2026)
0ae1024 Merge branch 'main' into add_sam_3_lite_text (NielsRogge, Mar 30, 2026)
672424a Address review comments (yonigozlan, Apr 1, 2026)
0b477d9 fix repo (yonigozlan, Apr 1, 2026)
ac3370a fix after review 2 (yonigozlan, Apr 6, 2026)
7204251 Merge remote-tracking branch 'upstream/main' into add_sam_3_lite_text (yonigozlan, Apr 6, 2026)
9baf14d fix tests + repo (yonigozlan, Apr 6, 2026)
5aaad48 Merge branch 'main' into add_sam_3_lite_text (yonigozlan, Apr 7, 2026)
c5deac2 Merge remote-tracking branch 'upstream/main' into add_sam_3_lite_text (NielsRogge, Apr 11, 2026)
fe153ad Address comments (NielsRogge, Apr 12, 2026)
a1a9c1e Address comments (NielsRogge, Apr 13, 2026)
6ffbc8e Make fix-repo (NielsRogge, Apr 13, 2026)
9570b50 Merge branch 'main' into add_sam_3_lite_text (NielsRogge, Apr 13, 2026)
ccb4902 add to hub cache + fixup base sam3 as well (vasqu, Apr 13, 2026)
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -1351,6 +1351,8 @@
title: SAM3
- local: model_doc/sam3_video
title: SAM3 Video
- local: model_doc/sam3_lite_text
title: SAM3-LiteText
- local: model_doc/shieldgemma2
title: ShieldGemma2
- local: model_doc/siglip
2 changes: 1 addition & 1 deletion docs/source/en/model_doc/nomic_bert.md
@@ -23,7 +23,7 @@ limitations under the License.

## Overview

- NomicBERT was proposed in [Nomic Embed: Training a Reproducible Long Context Text Embedder](https://arxiv.org/abs/2402.01613) by
+ NomicBERT was proposed in [Nomic Embed: Training a Reproducible Long Context Text Embedder](https://huggingface.co/papers/2402.01613) by
Zach Nussbaum, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. It is BERT-inspired with the most notable extension applying
[Rotary Position Embeddings](https://huggingface.co/papers/2104.09864.pdf) to an encoder model.

118 changes: 118 additions & 0 deletions docs/source/en/model_doc/sam3_lite_text.md
@@ -0,0 +1,118 @@
<!--Copyright 2026 the HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not render properly in your Markdown viewer.

-->
*This model was released on 2026-02-12 and added to Hugging Face Transformers on 2026-04-12.*

# SAM3-LiteText

<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>

## Overview

SAM3-LiteText was proposed in [SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation](https://huggingface.co/papers/2602.12173) by Chengxi Zeng, Yuxuan Jiang, Ge Gao, Shuai Wang, Duolikun Danier, Bin Zhu, Stevan Rudinac, David Bull, and Fan Zhang.

SAM3-LiteText is a lightweight variant of [SAM3](sam3) that replaces the heavy SAM3 text encoder (353M parameters) with a compact MobileCLIP-based text encoder optimized through knowledge distillation. The SAM3 ViT-H image encoder is kept intact. This reduces text encoder parameters by up to 88% while maintaining segmentation performance comparable to the original model.

The abstract from the paper is the following:

*Vision-language segmentation models such as SAM3 enable flexible, prompt-driven visual grounding, but inherit large, general-purpose text encoders originally designed for open-ended language understanding. In practice, segmentation prompts are short, structured, and semantically constrained, leading to substantial over-provisioning in text encoder capacity and persistent computational and memory overhead. In this paper, we perform a large-scale anatomical analysis of text prompting in vision-language segmentation, covering 404,796 real prompts across multiple benchmarks. Our analysis reveals severe redundancy: most context windows are underutilized, vocabulary usage is highly sparse, and text embeddings lie on low-dimensional manifold despite high-dimensional representations. Motivated by these findings, we propose SAM3-LiteText, a lightweight text encoding framework that replaces the original SAM3 text encoder with a compact MobileCLIP student that is optimized by knowledge distillation. Extensive experiments on image and video segmentation benchmarks show that SAM3-LiteText reduces text encoder parameters by up to 88%, substantially reducing static memory footprint, while maintaining segmentation performance comparable to the original model.*

The text encoder architecture is based on [MobileCLIP](https://huggingface.co/papers/2311.17049) and comes in three variants:

| Variant | Text Encoder | Text Params | Reduction |
|---|---|---|---|
| SAM3-LiteText-S0-16 | MobileCLIP-S0 | 42.54M | ~88% |
| SAM3-LiteText-S1-16 | MobileCLIP-S1 | 63.53M | ~82% |
| SAM3-LiteText-L-16 | MobileCLIP2-L | 123.80M | ~65% |
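The "Reduction" column follows directly from the parameter counts, taking the original SAM3 text encoder at roughly 353M parameters (the figure quoted in the overview); a quick check:

```python
# Reproduce the "Reduction" column of the table above. Assumption: the
# baseline SAM3 text encoder is ~353M parameters, as stated in the overview.
SAM3_TEXT_PARAMS_M = 353.0  # millions

variants = {
    "SAM3-LiteText-S0-16": 42.54,
    "SAM3-LiteText-S1-16": 63.53,
    "SAM3-LiteText-L-16": 123.80,
}

for name, params_m in variants.items():
    reduction = 100 * (1 - params_m / SAM3_TEXT_PARAMS_M)
    print(f"{name}: ~{reduction:.0f}% fewer text-encoder parameters")
```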

This model was contributed by [nielsr](https://huggingface.co/nielsr) and [yonigozlan](https://huggingface.co/yonigozlan).
The original code can be found [here](https://github.com/SimonZeng7108/efficientsam3/tree/sam3_litetext).

## Usage

SAM3-LiteText is a drop-in replacement for SAM3 with a lightweight text encoder. It uses the same processor ([`Sam3Processor`]) and supports the same prompting interface. Refer to the [SAM3 documentation](sam3) for detailed usage examples including text prompts, box prompts, batched inference, and more.

```python
from io import BytesIO

import httpx
from transformers import AutoModel, AutoProcessor
from PIL import Image

model = AutoModel.from_pretrained("yonigozlan/sam3-litetext-s0", device_map="auto")
processor = AutoProcessor.from_pretrained("yonigozlan/sam3-litetext-s0")
image_url = "http://images.cocodataset.org/val2017/000000077595.jpg"
image = Image.open(BytesIO(httpx.get(image_url).content)).convert("RGB")

inputs = processor(images=image, text="ear", return_tensors="pt").to(model.device)

outputs = model(**inputs)

results = processor.post_process_instance_segmentation(
outputs,
threshold=0.5,
mask_threshold=0.5,
target_sizes=inputs.get("original_sizes").tolist(),
)[0]

print(f"Found {len(results['masks'])} objects")
```
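The post-processed `masks` are binary per-instance masks; a common follow-up is deriving a bounding box for each one. A minimal sketch (assumption: this helper is not part of the `transformers` API, and the mask is given as nested lists of 0/1 values):

```python
def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) for a binary mask given as
    rows of 0/1 values, or None if the mask contains no foreground."""
    # Rows that contain at least one foreground pixel.
    ys = [i for i, row in enumerate(mask) if any(row)]
    if not ys:
        return None
    # Column indices of all foreground pixels.
    xs = [j for row in mask for j, v in enumerate(row) if v]
    return (min(xs), min(ys), max(xs), max(ys))
```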

## Sam3LiteTextConfig

[[autodoc]] Sam3LiteTextConfig

## Sam3LiteTextTextConfig

[[autodoc]] Sam3LiteTextTextConfig

## Sam3LiteTextGeometryEncoderConfig

[[autodoc]] Sam3LiteTextGeometryEncoderConfig

## Sam3LiteTextDETREncoderConfig

[[autodoc]] Sam3LiteTextDETREncoderConfig

## Sam3LiteTextDETRDecoderConfig

[[autodoc]] Sam3LiteTextDETRDecoderConfig

## Sam3LiteTextMaskDecoderConfig

[[autodoc]] Sam3LiteTextMaskDecoderConfig

## Sam3LiteTextTextModel

[[autodoc]] Sam3LiteTextTextModel
- forward

## Sam3LiteTextModel

[[autodoc]] Sam3LiteTextModel
- forward

## Sam3LiteTextPreTrainedModel

[[autodoc]] Sam3LiteTextPreTrainedModel
- forward
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -367,6 +367,7 @@
from .sam2 import *
from .sam2_video import *
from .sam3 import *
from .sam3_lite_text import *
from .sam3_tracker import *
from .sam3_tracker_video import *
from .sam3_video import *
5 changes: 5 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -421,6 +421,8 @@
("sam2_video", "Sam2VideoConfig"),
("sam2_vision_model", "Sam2VisionConfig"),
("sam3", "Sam3Config"),
("sam3_lite_text", "Sam3LiteTextConfig"),
("sam3_lite_text_text_model", "Sam3LiteTextTextConfig"),
("sam3_tracker", "Sam3TrackerConfig"),
("sam3_tracker_video", "Sam3TrackerVideoConfig"),
("sam3_video", "Sam3VideoConfig"),
@@ -954,6 +956,8 @@
("sam2_video", "Sam2VideoModel"),
("sam2_vision_model", "Sam2VisionModel"),
("sam3", "SAM3"),
("sam3_lite_text", "SAM3-LiteText"),
("sam3_lite_text_text_model", "SAM3-LiteText Text Model"),
("sam3_tracker", "Sam3Tracker"),
("sam3_tracker_video", "Sam3TrackerVideo"),
("sam3_video", "Sam3VideoModel"),
@@ -1142,6 +1146,7 @@
("sam_vision_model", "sam"),
("sam2_vision_model", "sam2"),
("sam2_hiera_det_model", "sam2"),
("sam3_lite_text_text_model", "sam3_lite_text"),
("sam3_vit_model", "sam3"),
("sam3_vision_model", "sam3"),
("edgetam_vision_model", "edgetam"),
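The auto mappings being extended above are what let `AutoConfig`/`AutoModel` resolve `"sam3_lite_text"` at load time. In spirit, the dispatch is a dictionary lookup keyed on `model_type` (assumption: heavily simplified; the real `_LazyAutoMapping` defers imports and handles many more cases):

```python
# Hedged sketch of model_type-based dispatch in the auto classes.
# CONFIG_MAPPING_NAMES mirrors a few entries from the diff above.
CONFIG_MAPPING_NAMES = {
    "sam3": "Sam3Config",
    "sam3_lite_text": "Sam3LiteTextConfig",
    "sam3_lite_text_text_model": "Sam3LiteTextTextConfig",
}

def resolve_config_class(model_type: str) -> str:
    """Look up the config class name registered for a model_type."""
    try:
        return CONFIG_MAPPING_NAMES[model_type]
    except KeyError:
        raise ValueError(f"Unrecognized model type: {model_type}")
```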
1 change: 1 addition & 0 deletions src/transformers/models/auto/image_processing_auto.py
@@ -226,6 +226,7 @@
("sam2", {"torchvision": "Sam2ImageProcessor"}),
("sam2_video", {"torchvision": "Sam2ImageProcessor"}),
("sam3", {"torchvision": "Sam3ImageProcessor"}),
("sam3_lite_text", {"torchvision": "Sam3ImageProcessor"}),
("sam3_tracker", {"torchvision": "Sam3ImageProcessor"}),
("sam3_tracker_video", {"torchvision": "Sam3ImageProcessor"}),
("sam3_video", {"torchvision": "Sam3ImageProcessor"}),
2 changes: 2 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -399,6 +399,8 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("sam2_video", "Sam2VideoModel"),
("sam2_vision_model", "Sam2VisionModel"),
("sam3", "Sam3Model"),
("sam3_lite_text", "Sam3LiteTextModel"),
("sam3_lite_text_text_model", "Sam3LiteTextTextModel"),
("sam3_tracker", "Sam3TrackerModel"),
("sam3_tracker_video", "Sam3TrackerVideoModel"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -152,6 +152,7 @@
("sam", "SamProcessor"),
("sam2", "Sam2Processor"),
("sam3", "Sam3Processor"),
("sam3_lite_text", "Sam3Processor"),
("sam_hq", "SamHQProcessor"),
("seamless_m4t", "SeamlessM4TProcessor"),
("sew", "Wav2Vec2Processor"),
4 changes: 2 additions & 2 deletions src/transformers/models/sam3/convert_sam3_to_hf.py
@@ -25,7 +25,7 @@
import regex as re
import torch

- from transformers import CLIPTokenizerFast, Sam3Config, Sam3ImageProcessorFast, Sam3Model, Sam3Processor
+ from transformers import CLIPTokenizerFast, Sam3Config, Sam3ImageProcessor, Sam3Model, Sam3Processor
from transformers.utils import logging


@@ -383,7 +383,7 @@ def convert_sam3_checkpoint(

# Save processor
print("Creating and saving processor...")
- image_processor = Sam3ImageProcessorFast()
+ image_processor = Sam3ImageProcessor()
tokenizer = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch32", max_length=32, model_max_length=32)
processor = Sam3Processor(image_processor=image_processor, tokenizer=tokenizer)
processor.save_pretrained(output_path)
2 changes: 1 addition & 1 deletion src/transformers/models/sam3/modeling_sam3.py
@@ -279,7 +279,7 @@ def box_cxcywh_to_xyxy(x):


class Sam3MLP(nn.Module):
-     def __init__(self, config: Sam3ViTConfig):
+     def __init__(self, config):
super().__init__()
self.config = config
self.activation_fn = ACT2FN[config.hidden_act]
28 changes: 28 additions & 0 deletions src/transformers/models/sam3_lite_text/__init__.py
@@ -0,0 +1,28 @@
# Copyright 2026 the HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
from .configuration_sam3_lite_text import *
from .modeling_sam3_lite_text import *
else:
import sys

_file = globals()["__file__"]
sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
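The `__init__.py` above registers the package as a `_LazyModule`, so the heavy modeling code is only imported when an attribute is first accessed. A minimal sketch of the idea (assumption: greatly simplified; the real `_LazyModule` parses the file's import structure and replaces the module object in `sys.modules`):

```python
import importlib

class LazyModule:
    """Resolve attributes to objects in other modules on first access."""

    def __init__(self, name, submodules):
        self._name = name
        # Maps attribute name -> module it should be imported from.
        self._submodules = submodules
        self._cache = {}

    def __getattr__(self, attr):
        # Called only when normal attribute lookup fails, i.e. on first use.
        if attr not in self._submodules:
            raise AttributeError(f"module {self._name!r} has no attribute {attr!r}")
        if attr not in self._cache:
            module = importlib.import_module(self._submodules[attr])
            self._cache[attr] = getattr(module, attr)
        return self._cache[attr]
```

Until `sqrt` (say) is touched, nothing beyond the mapping is imported; the first access triggers the real import and caches the result.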