Add SAM3-LiteText #44320
Conversation
…to-transformers: Integrate MobileCLIP-student LiteText encoder and add conversion tooling for EfficientSAM3 LiteText
…to-transformers-fuvllg: Integrate SAM3-LiteText MobileCLIP student text encoder, conversion tooling, and parity/test fixes
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
# Get tokenizer class
if self.lowercase_name in TOKENIZER_MAPPING_NAMES:
    self.tokenizer_class = None
```
Note: I've opened a separate PR for the CLI fixes here: #44334
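For context on the lookup pattern quoted above, here is a hypothetical minimal reproduction. The dict contents and the function name are stand-ins for illustration, not the real `transformers` internals, and this is not the actual fix from the linked PR:

```python
# Stand-in for transformers' TOKENIZER_MAPPING_NAMES; the real mapping
# associates a model type with its tokenizer class name(s). The entry
# below is an assumed example, not the actual registered tokenizer.
TOKENIZER_MAPPING_NAMES = {"sam3_lite_text": ("CLIPTokenizer", "CLIPTokenizerFast")}

def resolve_tokenizer_class(lowercase_name):
    # Guarded lookup: return the registered slow-tokenizer class name when
    # the model type is known, and None for unknown model types.
    if lowercase_name in TOKENIZER_MAPPING_NAMES:
        return TOKENIZER_MAPPING_NAMES[lowercase_name][0]
    return None
```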
Maybe not the best practice to post this here, but I'm getting an error when running
Force-pushed from 4156a0f to e5a5063 (compare)
Force-pushed from e5a5063 to 8f35675 (compare)
ping me when it's ready for review 🤗 not sure atm :p

Thanks for reviewing @vasqu! It should be ready to merge now ;)
vasqu
left a comment
Sorry, found a few other smaller things. Shouldn't be anything big (and also some repeating stuff)
```python
model = AutoModel.from_pretrained("yonigozlan/sam3-litetext-s0", device_map="auto")
processor = AutoProcessor.from_pretrained("yonigozlan/sam3-litetext-s0")
```
Are there any plans to move these to another repo?
vasqu
left a comment
I'm running slow tests in a second, I really only have the last nits left
Should be mergeable today or tomorrow if we are fast enough :)
```
window_size (`int`, *optional*, defaults to 24):
    Window size for windowed attention.
global_attn_indexes (`list[int]`, *optional*, defaults to `[7, 15, 23, 31]`):
    Indexes of layers with global attention.
```
These are for the underlying auto models ig?
```
rope_theta (`float`, *optional*, defaults to 10000.0):
    Base frequency for RoPE.
```
Should not be used at all; if anything, we should change the default theta (which is already 10_000.0, so there's no need to change it).
It is generated by `make fix-repo`.
I see that it is indeed needed either way for the SAM ViT model, but we should probably refactor this a bit to follow more standard RoPE implementations. cc @yonigozlan
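Since the thread suggests refactoring toward a more standard RoPE implementation, here is a minimal sketch of the usual rotary embedding with the default `theta=10000.0`. This is a pure-Python illustration of the math only; the actual `transformers` implementation operates on batched torch tensors and is not reproduced here:

```python
import math

def rope_inv_freq(dim, theta=10000.0):
    # Standard RoPE inverse frequencies: theta ** (-2i / dim) for i in [0, dim/2).
    return [theta ** (-2 * i / dim) for i in range(dim // 2)]

def apply_rope(x, position, theta=10000.0):
    # Rotate each consecutive (even, odd) pair of x by angle position * inv_freq[i].
    dim = len(x)
    inv_freq = rope_inv_freq(dim, theta)
    out = []
    for i in range(dim // 2):
        angle = position * inv_freq[i]
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x1, x2 = x[2 * i], x[2 * i + 1]
        # 2D rotation of the pair (x1, x2).
        out.extend([x1 * cos_a - x2 * sin_a, x1 * sin_a + x2 * cos_a])
    return out
```

At position 0 the rotation is the identity, and at any position it preserves the vector's norm, which is a quick sanity check for any refactor.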
run-slow: sam3_lite_text

This comment contains models: ["models/sam3_lite_text"]

run-slow: sam3, sam3_lite_text

[For maintainers] Suggested jobs to run (before merge): run-slow: auto, sam3, sam3_lite_text, sam3_video

This comment contains models: ["models/sam3", "models/sam3_lite_text"]
Nice! I'll reach out to the authors to transfer the checkpoints |
* Fix
* First draft
* Add push-to-hub options for SAM3-LiteText conversion
* Fix SAM3-LiteText model tests and text encoder init stability
* Add LiteText ViT auto mappings and use LiteText config
* Improve conversion script
* Do not require triton
* Improve modeling
* Fix repo
* Fix repo
* Add vision model to auto mapping
* Add missing entries to auto mapping
* reverse serve.py
* simplify implementation
* fix modular
* Address review comments
* fix repo
* fix after review 2
* fix tests + repo
* Address comments
* Address comments
* Make fix-repo
* add to hub cache + fixup base sam3 as well

---------

Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
Co-authored-by: vasqu <antonprogamer@gmail.com>
What does this PR do?
This PR adds SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation.
Fixes #44205