
Integrate MobileCLIP-student LiteText encoder and add conversion tooling for EfficientSAM3 LiteText#69

Merged
NielsRogge merged 1 commit into add_sam_3_lite_text from codex/add-sam3-litetext-model-to-transformers
Feb 26, 2026

Conversation

@NielsRogge
Owner

Motivation

  • Replace SAM3's CLIP text encoder with a compact MobileCLIP student (LiteText) implementation and enable reliable conversion from upstream EfficientSAM3 checkpoints hosted on the HF Hub.
  • Provide a conversion/debugging workflow to load full detector checkpoints, map keys into the HF model, and verify exact parity on dummy inputs between the original implementation and the HF model.
  • Make the conversion robust to merged checkpoints that include extra non-text weights and add CLI convenience for converting multiple checkpoints and optionally pushing converted models to the Hub.

Description

  • Added a custom MobileCLIP-student LiteText text encoder implementation and wired it into the HF model by replacing the CLIP-based text encoder with Sam3LiteTextTextEncoder in modeling_sam3_lite_text.py and modular_sam3_lite_text.py.
  • Implemented a full conversion utility, src/transformers/models/sam3_lite_text/convert_sam3_lite_text_to_hf.py, that downloads checkpoints from the Hub, remaps LiteText (backbone.language_backbone.*) keys via TEXT_KEY_MAPPING, preserves the packed in-proj format for the MobileCLIP text MHA, splits other qkv keys, and normalizes various attention/key naming patterns.
  • Made text-architecture inference dynamic, deriving the hidden size, number of layers, model variant (mct/base), and context length from checkpoint weights, and added position-embedding interpolation, FP32 layer-norm behavior, RepMixer blocks, and projection handling to match upstream internals.
  • Added conversion CLI features: --convert_all, --debug_intermediates (prints embedding/per-layer/final LN max-abs diffs between original and HF implementations), --push_to_hub and --hub_model_id, checkpoint component summarization, and improved filtering of unused keys (e.g. sam2_convs and geometry point-projector weights).
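The key-remapping step described above can be sketched as follows. This is an illustrative example only; the actual TEXT_KEY_MAPPING and key names in convert_sam3_lite_text_to_hf.py may differ, and the `split_qkv` helper and key strings here are hypothetical:

```python
import torch

def split_qkv(state_dict, packed_key, prefix):
    """Split a packed [3*d, d] qkv projection into separate q/k/v weights.

    Hypothetical helper mirroring the "splits other qkv keys" step; the
    packed MobileCLIP text MHA in-proj is instead kept as-is by the converter.
    """
    packed = state_dict.pop(packed_key)
    q, k, v = torch.chunk(packed, 3, dim=0)
    state_dict[f"{prefix}.q_proj.weight"] = q
    state_dict[f"{prefix}.k_proj.weight"] = k
    state_dict[f"{prefix}.v_proj.weight"] = v
    return state_dict

# Example with an illustrative key layout (d = 64):
sd = {"layers.0.attn.qkv.weight": torch.randn(3 * 64, 64)}
sd = split_qkv(sd, "layers.0.attn.qkv.weight", "layers.0.self_attn")
print(sorted(sd.keys()))
```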

Testing

  • Converted sam3_litetext/efficient_sam3_image_encoder_mobileclip_s0_ctx16.pt locally with convert_checkpoint and observed Missing: 0, with only a small set of non-text unexpected keys remaining.
  • Ran parity debugging with --debug_intermediates to compare the original TextStudentEncoder and HF Sam3LiteTextModel on deterministic dummy input_ids, and after iterative fixes the per-layer and final outputs matched exactly (Max abs diff: 0.0).
  • Ran --convert_all against the Hub repo and successfully converted all 3 LiteText checkpoints with Missing: 0 and only the expected small set of unused non-text keys.
  • Verified CLI and non-push conversion flows locally; automated conversion and parity checks completed successfully.
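The parity check behind --debug_intermediates boils down to comparing activations from the original and HF implementations and reporting the maximum absolute difference. A minimal sketch (tensor names and shapes here are illustrative, not taken from the converter):

```python
import torch

def max_abs_diff(ref: torch.Tensor, hf: torch.Tensor) -> float:
    """Max-abs difference between two activation tensors, compared in FP32."""
    return (ref.float() - hf.float()).abs().max().item()

# With identical activations the check reports exact parity, as in the PR:
ref_hidden = torch.zeros(1, 16, 64)  # e.g. per-layer hidden states
hf_hidden = torch.zeros(1, 16, 64)
print(f"Max abs diff: {max_abs_diff(ref_hidden, hf_hidden)}")  # Max abs diff: 0.0
```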

Codex Task

@NielsRogge NielsRogge merged commit 00777a5 into add_sam_3_lite_text Feb 26, 2026
4 checks passed