Integrate MobileCLIP-student LiteText encoder and add conversion tooling for EfficientSAM3 LiteText #69
Merged
NielsRogge merged 1 commit into add_sam_3_lite_text on Feb 26, 2026
Motivation
Description
- `Sam3LiteTextTextEncoder` in `modeling_sam3_lite_text.py` and `modular_sam3_lite_text.py`.
- `src/transformers/models/sam3_lite_text/convert_sam3_lite_text_to_hf.py` that downloads checkpoints from the Hub, remaps LiteText (`backbone.language_backbone.*`) keys via `TEXT_KEY_MAPPING`, preserves the packed in-proj format for the MobileCLIP text MHA, splits other qkv keys, and normalizes various attention/key naming patterns.
- mct/base, context length) and added position-embedding interpolation, FP32 layer-norm behavior, RepMixer blocks, and projection handling to match upstream internals.
- `--convert_all`, `--debug_intermediates` (prints embedding/per-layer/final LN max-abs diffs between original and HF implementations), `--push_to_hub` and `--hub_model_id`, checkpoint component summarization, and improved filtering of unused keys (e.g. `sam2_convs` and geometry point-projector weights).
Testing
- `sam3_litetext/efficient_sam3_image_encoder_mobileclip_s0_ctx16.pt` with `convert_checkpoint` and observed `Missing: 0` while only a small set of non-text unexpected keys remained, which succeeded locally.
- `--debug_intermediates` to compare the original `TextStudentEncoder` and HF `Sam3LiteTextModel` on deterministic dummy `input_ids`, and after iterative fixes the per-layer and final outputs matched exactly (`Max abs diff: 0.0`).
- `--convert_all` against the Hub repo and successfully converted all 3 LiteText checkpoints with `Missing: 0` and only the expected small set of unused non-text keys.
Codex Task
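
For reviewers who want to reproduce the kind of check `--debug_intermediates` reports, the comparison amounts to a max-abs difference per stage (embeddings, each layer, final LN). The following is a minimal sketch under stated assumptions, not the converter's actual code: the helper names `max_abs_diff` and `compare_intermediates`, the stage names, and the flattened float lists are all hypothetical and stand in for real activations.

```python
from typing import Dict, Sequence


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def compare_intermediates(
    original: Dict[str, Sequence[float]],
    converted: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Return {stage: max-abs diff} for every stage present in both dicts."""
    report = {}
    for stage, reference in original.items():
        if stage not in converted:
            continue  # stage only exists on one side; nothing to compare
        report[stage] = max_abs_diff(reference, converted[stage])
    return report


# Hypothetical flattened activations (illustrative values, not real model outputs).
original = {"embeddings": [0.5, -1.0], "layer_0": [0.25, 2.0], "final_ln": [1.0, 1.5]}
converted = {"embeddings": [0.5, -1.0], "layer_0": [0.25, 2.5], "final_ln": [1.0, 1.5]}

report = compare_intermediates(original, converted)
for stage, diff in report.items():
    print(f"{stage}: Max abs diff: {diff}")
```

Reading the report in stage order points at the first place the two implementations diverge (here the made-up `layer_0` entry). With real PyTorch tensors the same quantity would typically be computed as `(a - b).abs().max()`.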