
Integrate MobileCLIP-student LiteText encoder and add conversion tooling for EfficientSAM3 LiteText#69

Merged
NielsRogge merged 1 commit into add_sam_3_lite_text from codex/add-sam3-litetext-model-to-transformers
Feb 26, 2026

Conversation

@NielsRogge
Owner

Motivation

  • Replace SAM3's CLIP text encoder with a compact MobileCLIP student (LiteText) implementation and enable reliable conversion from upstream EfficientSAM3 checkpoints hosted on the HF Hub.
  • Provide a conversion/debugging workflow to load full detector checkpoints, map keys into the HF model, and verify exact parity on dummy inputs between the original implementation and the HF model.
  • Make the conversion robust to merged checkpoints that include extra non-text weights and add CLI convenience for converting multiple checkpoints and optionally pushing converted models to the Hub.

Description

  • Added a custom MobileCLIP-student LiteText text encoder implementation and wired it into the HF model by replacing the CLIP-based text encoder with Sam3LiteTextTextEncoder in modeling_sam3_lite_text.py and modular_sam3_lite_text.py.
  • Implemented a full conversion utility, src/transformers/models/sam3_lite_text/convert_sam3_lite_text_to_hf.py, that downloads checkpoints from the Hub, remaps LiteText (backbone.language_backbone.*) keys via TEXT_KEY_MAPPING, preserves the packed in-proj format for the MobileCLIP text MHA, splits other qkv keys, and normalizes various attention/key naming patterns.
  • Made text-architecture inference dynamic, deriving the hidden size, number of layers, model variant (mct/base), and context length from checkpoint weights, and added position-embedding interpolation, FP32 layer-norm behavior, RepMixer blocks, and projection handling to match upstream internals.
  • Added conversion CLI features: --convert_all, --debug_intermediates (prints embedding/per-layer/final LN max-abs diffs between original and HF implementations), --push_to_hub and --hub_model_id, checkpoint component summarization, and improved filtering of unused keys (e.g. sam2_convs and geometry point-projector weights).
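The key-remapping step described above can be sketched as follows. This is an illustrative example only; the actual TEXT_KEY_MAPPING and key names in convert_sam3_lite_text_to_hf.py may differ, and the `split_qkv` helper and key strings here are hypothetical:

```python
import torch

def split_qkv(state_dict, packed_key, prefix):
    """Split a packed [3*d, d] qkv projection into separate q/k/v weights.

    Hypothetical helper mirroring the "splits other qkv keys" step; the
    packed MobileCLIP text MHA in-proj is instead kept as-is by the converter.
    """
    packed = state_dict.pop(packed_key)
    q, k, v = torch.chunk(packed, 3, dim=0)
    state_dict[f"{prefix}.q_proj.weight"] = q
    state_dict[f"{prefix}.k_proj.weight"] = k
    state_dict[f"{prefix}.v_proj.weight"] = v
    return state_dict

# Example with an illustrative key layout (d = 64):
sd = {"layers.0.attn.qkv.weight": torch.randn(3 * 64, 64)}
sd = split_qkv(sd, "layers.0.attn.qkv.weight", "layers.0.self_attn")
print(sorted(sd.keys()))
```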

Testing

  • Converted sam3_litetext/efficient_sam3_image_encoder_mobileclip_s0_ctx16.pt locally with convert_checkpoint and observed Missing: 0, with only a small set of non-text unexpected keys remaining.
  • Ran parity debugging with --debug_intermediates to compare the original TextStudentEncoder and HF Sam3LiteTextModel on deterministic dummy input_ids, and after iterative fixes the per-layer and final outputs matched exactly (Max abs diff: 0.0).
  • Ran --convert_all against the Hub repo and successfully converted all 3 LiteText checkpoints with Missing: 0 and only the expected small set of unused non-text keys.
  • Verified CLI and non-push conversion flows locally; automated conversion and parity checks completed successfully.
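The parity check behind --debug_intermediates boils down to comparing activations from the original and HF implementations and reporting the maximum absolute difference. A minimal sketch (tensor names and shapes here are illustrative, not taken from the converter):

```python
import torch

def max_abs_diff(ref: torch.Tensor, hf: torch.Tensor) -> float:
    """Max-abs difference between two activation tensors, compared in FP32."""
    return (ref.float() - hf.float()).abs().max().item()

# With identical activations the check reports exact parity, as in the PR:
ref_hidden = torch.zeros(1, 16, 64)  # e.g. per-layer hidden states
hf_hidden = torch.zeros(1, 16, 64)
print(f"Max abs diff: {max_abs_diff(ref_hidden, hf_hidden)}")  # Max abs diff: 0.0
```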

Codex Task

@NielsRogge NielsRogge merged commit 00777a5 into add_sam_3_lite_text Feb 26, 2026
4 checks passed