Merged (36 commits)
4bae67b Fix (NielsRogge, Feb 26, 2026)
2c96661 First draft (NielsRogge, Feb 26, 2026)
490ff1f Add push-to-hub options for SAM3-LiteText conversion (NielsRogge, Feb 26, 2026)
00777a5 Merge pull request #69 from NielsRogge/codex/add-sam3-litetext-model-… (NielsRogge, Feb 26, 2026)
4dd3735 Fix SAM3-LiteText model tests and text encoder init stability (NielsRogge, Feb 26, 2026)
0d96394 Add LiteText ViT auto mappings and use LiteText config (NielsRogge, Feb 27, 2026)
06fbf45 Merge branch 'add_sam_3_lite_text' into codex/add-sam3-litetext-model… (NielsRogge, Feb 27, 2026)
53f7dd4 Merge pull request #70 from NielsRogge/codex/add-sam3-litetext-model-… (NielsRogge, Feb 27, 2026)
4d8008a Improve conversion script (NielsRogge, Feb 27, 2026)
a5ce4ca Do not require triton (NielsRogge, Feb 27, 2026)
db44153 Improve modeling (NielsRogge, Feb 27, 2026)
98cea30 Fix repo (NielsRogge, Feb 27, 2026)
5fdc242 Merge remote-tracking branch 'upstream/main' into add_sam_3_lite_text (NielsRogge, Feb 27, 2026)
dcaceff Merge remote-tracking branch 'upstream/main' into add_sam_3_lite_text (NielsRogge, Feb 27, 2026)
c85571e Fix repo (NielsRogge, Feb 27, 2026)
8ba6455 Merge branch 'main' into add_sam_3_lite_text (NielsRogge, Mar 1, 2026)
5ab59a6 Add vision model to auto mapping (NielsRogge, Mar 1, 2026)
d5728ae Add missing entries to auto mapping (NielsRogge, Mar 1, 2026)
813dd0b Merge remote-tracking branch 'upstream/main' into add_sam_3_lite_text (yonigozlan, Mar 24, 2026)
583df21 reverse serve.py (yonigozlan, Mar 24, 2026)
8f35675 simplify implementation (yonigozlan, Mar 30, 2026)
37c3fcd Merge remote-tracking branch 'upstream/main' into add_sam_3_lite_text (yonigozlan, Mar 30, 2026)
a402d1a fix modular (yonigozlan, Mar 30, 2026)
0ae1024 Merge branch 'main' into add_sam_3_lite_text (NielsRogge, Mar 30, 2026)
672424a Address review comments (yonigozlan, Apr 1, 2026)
0b477d9 fix repo (yonigozlan, Apr 1, 2026)
ac3370a fix after review 2 (yonigozlan, Apr 6, 2026)
7204251 Merge remote-tracking branch 'upstream/main' into add_sam_3_lite_text (yonigozlan, Apr 6, 2026)
9baf14d fix tests + repo (yonigozlan, Apr 6, 2026)
5aaad48 Merge branch 'main' into add_sam_3_lite_text (yonigozlan, Apr 7, 2026)
c5deac2 Merge remote-tracking branch 'upstream/main' into add_sam_3_lite_text (NielsRogge, Apr 11, 2026)
fe153ad Address comments (NielsRogge, Apr 12, 2026)
a1a9c1e Address comments (NielsRogge, Apr 13, 2026)
6ffbc8e Make fix-repo (NielsRogge, Apr 13, 2026)
9570b50 Merge branch 'main' into add_sam_3_lite_text (NielsRogge, Apr 13, 2026)
ccb4902 add to hub cache + fixup base sam3 as well (vasqu, Apr 13, 2026)
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -1351,6 +1351,8 @@
title: SAM3
- local: model_doc/sam3_video
title: SAM3 Video
- local: model_doc/sam3_lite_text
title: SAM3-LiteText
- local: model_doc/shieldgemma2
title: ShieldGemma2
- local: model_doc/siglip
2 changes: 1 addition & 1 deletion docs/source/en/model_doc/nomic_bert.md
@@ -23,7 +23,7 @@ limitations under the License.

## Overview

- NomicBERT was proposed in [Nomic Embed: Training a Reproducible Long Context Text Embedder](https://arxiv.org/abs/2402.01613) by
+ NomicBERT was proposed in [Nomic Embed: Training a Reproducible Long Context Text Embedder](https://huggingface.co/papers/2402.01613) by
Zach Nussbaum, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. It is BERT-inspired with the most notable extension applying
[Rotary Position Embeddings](https://huggingface.co/papers/2104.09864.pdf) to an encoder model.

118 changes: 118 additions & 0 deletions docs/source/en/model_doc/sam3_lite_text.md
@@ -0,0 +1,118 @@
<!--Copyright 2026 the HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not render properly in your Markdown viewer.

-->
*This model was released on 2026-02-12 and added to Hugging Face Transformers on 2026-04-12.*

# SAM3-LiteText

<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>

## Overview

SAM3-LiteText was proposed in [SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation](https://huggingface.co/papers/2602.12173) by Chengxi Zeng, Yuxuan Jiang, Ge Gao, Shuai Wang, Duolikun Danier, Bin Zhu, Stevan Rudinac, David Bull, and Fan Zhang.

SAM3-LiteText is a lightweight variant of [SAM3](sam3) that replaces the heavy SAM3 text encoder (353M parameters) with a compact MobileCLIP-based text encoder optimized through knowledge distillation. The SAM3 ViT-H image encoder is kept intact. This reduces text encoder parameters by up to 88% while maintaining segmentation performance comparable to the original model.

The abstract from the paper is the following:

*Vision-language segmentation models such as SAM3 enable flexible, prompt-driven visual grounding, but inherit large, general-purpose text encoders originally designed for open-ended language understanding. In practice, segmentation prompts are short, structured, and semantically constrained, leading to substantial over-provisioning in text encoder capacity and persistent computational and memory overhead. In this paper, we perform a large-scale anatomical analysis of text prompting in vision-language segmentation, covering 404,796 real prompts across multiple benchmarks. Our analysis reveals severe redundancy: most context windows are underutilized, vocabulary usage is highly sparse, and text embeddings lie on low-dimensional manifold despite high-dimensional representations. Motivated by these findings, we propose SAM3-LiteText, a lightweight text encoding framework that replaces the original SAM3 text encoder with a compact MobileCLIP student that is optimized by knowledge distillation. Extensive experiments on image and video segmentation benchmarks show that SAM3-LiteText reduces text encoder parameters by up to 88%, substantially reducing static memory footprint, while maintaining segmentation performance comparable to the original model.*

The text encoder architecture is based on [MobileCLIP](https://huggingface.co/papers/2311.17049) and comes in three variants:

| Variant | Text Encoder | Text Params | Reduction |
|---|---|---|---|
| SAM3-LiteText-S0-16 | MobileCLIP-S0 | 42.54M | ~88% |
| SAM3-LiteText-S1-16 | MobileCLIP-S1 | 63.53M | ~82% |
| SAM3-LiteText-L-16 | MobileCLIP2-L | 123.80M | ~65% |
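The "Reduction" column follows directly from the parameter counts, taking the original SAM3 text encoder at roughly 353M parameters (the figure quoted in the overview); a quick check:

```python
# Reproduce the "Reduction" column of the table above. Assumption: the
# baseline SAM3 text encoder is ~353M parameters, as stated in the overview.
SAM3_TEXT_PARAMS_M = 353.0  # millions

variants = {
    "SAM3-LiteText-S0-16": 42.54,
    "SAM3-LiteText-S1-16": 63.53,
    "SAM3-LiteText-L-16": 123.80,
}

for name, params_m in variants.items():
    reduction = 100 * (1 - params_m / SAM3_TEXT_PARAMS_M)
    print(f"{name}: ~{reduction:.0f}% fewer text-encoder parameters")
```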

This model was contributed by [nielsr](https://huggingface.co/nielsr) and [yonigozlan](https://huggingface.co/yonigozlan).
The original code can be found [here](https://github.com/SimonZeng7108/efficientsam3/tree/sam3_litetext).

## Usage

SAM3-LiteText is a drop-in replacement for SAM3 with a lightweight text encoder. It uses the same processor ([`Sam3Processor`]) and supports the same prompting interface. Refer to the [SAM3 documentation](sam3) for detailed usage examples including text prompts, box prompts, batched inference, and more.

```python
from io import BytesIO

import httpx
from transformers import AutoModel, AutoProcessor
from PIL import Image

model = AutoModel.from_pretrained("yonigozlan/sam3-litetext-s0", device_map="auto")
processor = AutoProcessor.from_pretrained("yonigozlan/sam3-litetext-s0")
image_url = "http://images.cocodataset.org/val2017/000000077595.jpg"
image = Image.open(BytesIO(httpx.get(image_url).content)).convert("RGB")

inputs = processor(images=image, text="ear", return_tensors="pt").to(model.device)

outputs = model(**inputs)

results = processor.post_process_instance_segmentation(
outputs,
threshold=0.5,
mask_threshold=0.5,
target_sizes=inputs.get("original_sizes").tolist(),
)[0]

print(f"Found {len(results['masks'])} objects")
```
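The post-processed `masks` are binary per-instance masks; a common follow-up is deriving a bounding box for each one. A minimal sketch (assumption: this helper is not part of the `transformers` API, and the mask is given as nested lists of 0/1 values):

```python
def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) for a binary mask given as
    rows of 0/1 values, or None if the mask contains no foreground."""
    # Rows that contain at least one foreground pixel.
    ys = [i for i, row in enumerate(mask) if any(row)]
    if not ys:
        return None
    # Column indices of all foreground pixels.
    xs = [j for row in mask for j, v in enumerate(row) if v]
    return (min(xs), min(ys), max(xs), max(ys))
```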

## Sam3LiteTextConfig

[[autodoc]] Sam3LiteTextConfig

## Sam3LiteTextTextConfig

[[autodoc]] Sam3LiteTextTextConfig

## Sam3LiteTextGeometryEncoderConfig

[[autodoc]] Sam3LiteTextGeometryEncoderConfig

## Sam3LiteTextDETREncoderConfig

[[autodoc]] Sam3LiteTextDETREncoderConfig

## Sam3LiteTextDETRDecoderConfig

[[autodoc]] Sam3LiteTextDETRDecoderConfig

## Sam3LiteTextMaskDecoderConfig

[[autodoc]] Sam3LiteTextMaskDecoderConfig

## Sam3LiteTextTextModel

[[autodoc]] Sam3LiteTextTextModel
- forward

## Sam3LiteTextModel

[[autodoc]] Sam3LiteTextModel
- forward

## Sam3LiteTextPreTrainedModel

[[autodoc]] Sam3LiteTextPreTrainedModel
- forward
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -367,6 +367,7 @@
from .sam2 import *
from .sam2_video import *
from .sam3 import *
from .sam3_lite_text import *
from .sam3_tracker import *
from .sam3_tracker_video import *
from .sam3_video import *
5 changes: 5 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -421,6 +421,8 @@
("sam2_video", "Sam2VideoConfig"),
("sam2_vision_model", "Sam2VisionConfig"),
("sam3", "Sam3Config"),
("sam3_lite_text", "Sam3LiteTextConfig"),
("sam3_lite_text_text_model", "Sam3LiteTextTextConfig"),
("sam3_tracker", "Sam3TrackerConfig"),
("sam3_tracker_video", "Sam3TrackerVideoConfig"),
("sam3_video", "Sam3VideoConfig"),
@@ -954,6 +956,8 @@
("sam2_video", "Sam2VideoModel"),
("sam2_vision_model", "Sam2VisionModel"),
("sam3", "SAM3"),
("sam3_lite_text", "SAM3-LiteText"),
("sam3_lite_text_text_model", "SAM3-LiteText Text Model"),
("sam3_tracker", "Sam3Tracker"),
("sam3_tracker_video", "Sam3TrackerVideo"),
("sam3_video", "Sam3VideoModel"),
@@ -1142,6 +1146,7 @@
("sam_vision_model", "sam"),
("sam2_vision_model", "sam2"),
("sam2_hiera_det_model", "sam2"),
("sam3_lite_text_text_model", "sam3_lite_text"),
("sam3_vit_model", "sam3"),
("sam3_vision_model", "sam3"),
("edgetam_vision_model", "edgetam"),
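The auto mappings being extended above are what let `AutoConfig`/`AutoModel` resolve `"sam3_lite_text"` at load time. In spirit, the dispatch is a dictionary lookup keyed on `model_type` (assumption: heavily simplified; the real `_LazyAutoMapping` defers imports and handles many more cases):

```python
# Hedged sketch of model_type-based dispatch in the auto classes.
# CONFIG_MAPPING_NAMES mirrors a few entries from the diff above.
CONFIG_MAPPING_NAMES = {
    "sam3": "Sam3Config",
    "sam3_lite_text": "Sam3LiteTextConfig",
    "sam3_lite_text_text_model": "Sam3LiteTextTextConfig",
}

def resolve_config_class(model_type: str) -> str:
    """Look up the config class name registered for a model_type."""
    try:
        return CONFIG_MAPPING_NAMES[model_type]
    except KeyError:
        raise ValueError(f"Unrecognized model type: {model_type}")
```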
1 change: 1 addition & 0 deletions src/transformers/models/auto/image_processing_auto.py
@@ -226,6 +226,7 @@
("sam2", {"torchvision": "Sam2ImageProcessor"}),
("sam2_video", {"torchvision": "Sam2ImageProcessor"}),
("sam3", {"torchvision": "Sam3ImageProcessor"}),
("sam3_lite_text", {"torchvision": "Sam3ImageProcessor"}),
("sam3_tracker", {"torchvision": "Sam3ImageProcessor"}),
("sam3_tracker_video", {"torchvision": "Sam3ImageProcessor"}),
("sam3_video", {"torchvision": "Sam3ImageProcessor"}),
2 changes: 2 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -399,6 +399,8 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("sam2_video", "Sam2VideoModel"),
("sam2_vision_model", "Sam2VisionModel"),
("sam3", "Sam3Model"),
("sam3_lite_text", "Sam3LiteTextModel"),
("sam3_lite_text_text_model", "Sam3LiteTextTextModel"),
("sam3_tracker", "Sam3TrackerModel"),
("sam3_tracker_video", "Sam3TrackerVideoModel"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -152,6 +152,7 @@
("sam", "SamProcessor"),
("sam2", "Sam2Processor"),
("sam3", "Sam3Processor"),
("sam3_lite_text", "Sam3Processor"),
("sam_hq", "SamHQProcessor"),
("seamless_m4t", "SeamlessM4TProcessor"),
("sew", "Wav2Vec2Processor"),
4 changes: 2 additions & 2 deletions src/transformers/models/sam3/convert_sam3_to_hf.py
@@ -25,7 +25,7 @@
import regex as re
import torch

- from transformers import CLIPTokenizerFast, Sam3Config, Sam3ImageProcessorFast, Sam3Model, Sam3Processor
+ from transformers import CLIPTokenizerFast, Sam3Config, Sam3ImageProcessor, Sam3Model, Sam3Processor
from transformers.utils import logging


@@ -383,7 +383,7 @@ def convert_sam3_checkpoint(

# Save processor
print("Creating and saving processor...")
- image_processor = Sam3ImageProcessorFast()
+ image_processor = Sam3ImageProcessor()
tokenizer = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch32", max_length=32, model_max_length=32)
processor = Sam3Processor(image_processor=image_processor, tokenizer=tokenizer)
processor.save_pretrained(output_path)
2 changes: 1 addition & 1 deletion src/transformers/models/sam3/modeling_sam3.py
@@ -279,7 +279,7 @@ def box_cxcywh_to_xyxy(x):


class Sam3MLP(nn.Module):
-     def __init__(self, config: Sam3ViTConfig):
+     def __init__(self, config):
super().__init__()
self.config = config
self.activation_fn = ACT2FN[config.hidden_act]
28 changes: 28 additions & 0 deletions src/transformers/models/sam3_lite_text/__init__.py
@@ -0,0 +1,28 @@
# Copyright 2026 the HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
from .configuration_sam3_lite_text import *
from .modeling_sam3_lite_text import *
else:
import sys

_file = globals()["__file__"]
sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
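The `__init__.py` above registers the package as a `_LazyModule`, so the heavy modeling code is only imported when an attribute is first accessed. A minimal sketch of the idea (assumption: greatly simplified; the real `_LazyModule` parses the file's import structure and replaces the module object in `sys.modules`):

```python
import importlib

class LazyModule:
    """Resolve attributes to objects in other modules on first access."""

    def __init__(self, name, submodules):
        self._name = name
        # Maps attribute name -> module it should be imported from.
        self._submodules = submodules
        self._cache = {}

    def __getattr__(self, attr):
        # Called only when normal attribute lookup fails, i.e. on first use.
        if attr not in self._submodules:
            raise AttributeError(f"module {self._name!r} has no attribute {attr!r}")
        if attr not in self._cache:
            module = importlib.import_module(self._submodules[attr])
            self._cache[attr] = getattr(module, attr)
        return self._cache[attr]
```

Until `sqrt` (say) is touched, nothing beyond the mapping is imported; the first access triggers the real import and caches the result.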