
DO NOT MERGE adding SAM3-LiteText with a skill, first pass #45149

Draft

tarekziade wants to merge 2 commits into main from tarekziade-model-skill

Conversation


@tarekziade tarekziade commented Mar 31, 2026

First pass: ~1M tokens in, ~115K tokens out, mostly Opus, $42, 1h30

PR #44320 vs Our Implementation

What we got right

  • Same overall structure: modular file + generated standalone files + conversion script + tests + docs
  • Same model directory name (sam3_lite_text), same general class naming
  • Same regex-based key mapping approach in the conversion script
  • Same RepMixer + Transformer layer architecture for the MobileCLIP encoder
  • Same auto registration locations

Key differences where PR #44320 is better

Standalone config vs. inheriting from parent

The PR creates Sam3LiteTextConfig(PreTrainedConfig) as a standalone config with an explicit sub_configs dict, rather than inheriting from Sam3Config (see the sketch after this list). This single decision avoids:

  • The converter renaming every sub-config (Sam3LiteTextVisionConfig, Sam3LiteTextDETREncoderConfig, etc.)
  • The sub_configs dict mismatch problem during save/load roundtrips
  • Function-level imports for SAM3 classes in __init__ bodies
  • TRF009 allowlist exemption for cross-model imports
  • Multiple CONFIG_MAPPING registrations for renamed sub-config model types

Time cost of our approach: ~60-90 minutes of debugging across Issues #1, #2, and #4 in the session.
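
For reference, a minimal sketch of the standalone-config pattern, with CLIPVisionConfig/CLIPTextConfig standing in for the SAM3 sub-configs the PR actually reuses unchanged (class and attribute names follow the PR description; the body is illustrative, not the PR's code):

```python
# Minimal sketch, not the PR's code: CLIP sub-configs stand in for the
# Sam3* sub-configs that the PR reuses unchanged.
from transformers import CLIPTextConfig, CLIPVisionConfig, PreTrainedConfig


class Sam3LiteTextConfig(PreTrainedConfig):
    model_type = "sam3_lite_text"
    # Explicit sub_configs mapping: no sub-config is renamed, so save/load
    # roundtrips through to_dict()/from_dict() stay consistent.
    sub_configs = {"vision_config": CLIPVisionConfig, "text_config": CLIPTextConfig}

    def __init__(self, vision_config=None, text_config=None, **kwargs):
        super().__init__(**kwargs)
        # Accept either config objects or plain dicts (as loaded from JSON).
        if isinstance(vision_config, dict):
            vision_config = CLIPVisionConfig(**vision_config)
        if isinstance(text_config, dict):
            text_config = CLIPTextConfig(**text_config)
        self.vision_config = vision_config or CLIPVisionConfig()
        self.text_config = text_config or CLIPTextConfig()
```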

Reusing existing attention (Siglip) vs. writing custom attention

The PR inherits from SiglipAttention / SiglipEncoderLayer for the transformer layers inside the MobileCLIP encoder. This gives free SDPA, FlashAttention, and FlexAttention support through the existing infrastructure.

Our implementation writes a custom Sam3LiteTextAttention with manual fused QKV and softmax, which only supports eager mode. This forced us to skip 25+ parameterized SDPA test variants individually.
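
In the modular-transformers style, that reuse is plain subclassing; a sketch (class names assumed, not copied from the PR):

```python
# Sketch of the modular reuse pattern (modular_sam3_lite_text.py style):
# inheriting from Siglip brings the existing eager/SDPA/FlashAttention
# dispatch along for free.
from transformers.models.siglip.modeling_siglip import (
    SiglipAttention,
    SiglipEncoderLayer,
)


class Sam3LiteTextAttention(SiglipAttention):
    pass


class Sam3LiteTextEncoderLayer(SiglipEncoderLayer):
    pass
```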

Conditional RepMixer via config flag

The PR adds a use_repmixer_blocks: bool config attribute. When False, all layers are standard transformer layers. This makes testing trivial — set it to False for tests that need attention outputs or SDPA compatibility.

Our implementation always includes RepMixer blocks, which complicated test setup and required additional test skips.
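
A hypothetical sketch of how such a flag plays out at layer construction (num_repmixer_layers and the block classes are assumptions, reusing the names sketched above):

```python
import torch.nn as nn

# Hypothetical layer construction: with use_repmixer_blocks=False the stack
# degenerates to plain transformer layers, so attention-output and SDPA
# tests run without skips.
def build_layers(config):
    layers = []
    for i in range(config.num_hidden_layers):
        if config.use_repmixer_blocks and i < config.num_repmixer_layers:
            layers.append(Sam3LiteTextRepMixerBlock(config))  # conv token mixer
        else:
            layers.append(Sam3LiteTextEncoderLayer(config))  # standard attention
    return nn.ModuleList(layers)
```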

@capture_outputs decorator for hidden states / attentions

The PR uses the modern @capture_outputs and _can_record_outputs pattern, which automatically collects hidden states and attentions from designated layers. This means test_training, test_hidden_states_output, and test_training_gradient_checkpointing all pass without overrides.

Our implementation doesn't use this pattern, requiring us to skip training tests and write custom test_hidden_states_output.
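
The pattern looks roughly like this (decorator and dict names follow the PR description; the layer classes and forward signature are illustrative):

```python
# Rough shape of the output-recording pattern: map output names to the layer
# classes whose results should be collected, and let the decorator populate
# hidden_states/attentions instead of threading output_* flags by hand.
class Sam3LiteTextModel(Sam3LiteTextPreTrainedModel):
    _can_record_outputs = {
        "hidden_states": Sam3LiteTextEncoderLayer,  # one entry per layer
        "attentions": Sam3LiteTextAttention,  # attention weights per layer
    }

    @capture_outputs
    def forward(self, input_ids, attention_mask=None, **kwargs):
        ...
```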

Simplified MobileOneBlock (no reparameterization)

The PR's Sam3LiteTextMobileOneBlock omits the reparameterize() method and scale branch entirely. Since reparameterization is an inference-only optimization (fusing multi-branch conv to a single conv), it's not needed for the HF integration — the checkpoint already stores the unfused weights.

Our implementation includes the full MobileOne reparameterization machinery (~100 extra lines), which adds complexity without practical benefit in this context.
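
For contrast, a simplified, hypothetical multi-branch block: forward just sums the unfused branches, which is exactly what the stored checkpoint weights expect, so reparameterize() is never required for correctness:

```python
import torch.nn as nn

# Simplified, hypothetical multi-branch block (not the PR's code). The
# checkpoint stores these branches unfused, so forward sums them directly;
# reparameterize() would only fold them into one conv for inference speed.
class MultiBranchConvBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.conv_kxk = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size, padding=pad, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.conv_1x1 = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.identity = nn.BatchNorm2d(channels)

    def forward(self, x):
        # Unfused path: sum of kxk, 1x1, and identity branches.
        return self.conv_kxk(x) + self.conv_1x1(x) + self.identity(x)
```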

EOT pooling vs. full-sequence projection

The PR does CLIP-style EOT (end-of-text) pooling: hidden_states[input_ids.argmax(dim=-1)] @ self.projection, matching the original MobileCLIP behavior.

Our implementation applies self.text_projection (an nn.Linear) to the full last_hidden_state sequence, which changes the text representation semantics.
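
Concretely (a sketch; it assumes, as in CLIP, that the EOT token has the highest id in the vocabulary, so argmax over input_ids finds its position):

```python
import torch

def eot_pool(hidden_states, input_ids, projection):
    # CLIP-style pooling: take each sequence's hidden state at the EOT
    # position, then project to one (projection_dim,) vector per text.
    batch = torch.arange(hidden_states.shape[0], device=hidden_states.device)
    eot_positions = input_ids.argmax(dim=-1)  # EOT has the largest token id
    return hidden_states[batch, eot_positions] @ projection

def full_sequence_project(hidden_states, projection):
    # Our variant: project every token, yielding (batch, seq_len,
    # projection_dim) instead of a single pooled embedding per text.
    return hidden_states @ projection
```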

Integration tests with exact values

The PR includes 7 @slow integration tests covering text prompts, box prompts, combined prompts, batched images, semantic segmentation, and efficient multi-prompt inference — all with exact numerical assertions against the converted checkpoint. Our implementation has no integration tests (deferred to Phase 4 / checkpoint availability).
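
The skeleton of one such test, with the checkpoint id and expected tensor left as placeholders rather than real values:

```python
import unittest

import torch
from transformers import AutoModel, AutoProcessor
from transformers.testing_utils import slow


class Sam3LiteTextIntegrationTest(unittest.TestCase):
    @slow
    def test_text_prompt(self):
        checkpoint = "org/sam3-lite-text"  # placeholder repo id
        model = AutoModel.from_pretrained(checkpoint)
        processor = AutoProcessor.from_pretrained(checkpoint)
        inputs = processor(text=["a cat"], return_tensors="pt")
        outputs = model(**inputs)
        # Exact-value assertion against the converted checkpoint; the
        # expected tensor is a placeholder to fill from a reference run.
        expected = torch.tensor([0.0, 0.0, 0.0])
        torch.testing.assert_close(
            outputs.text_embeds[0, :3], expected, rtol=1e-4, atol=1e-4
        )
```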

The biggest lesson

Don't inherit from a composite parent config.

Creating Sam3LiteTextConfig(PreTrainedConfig) as a standalone config with an explicit sub_configs dict avoids every problem listed above: the converter renaming every sub-config, the sub_configs dict mismatch during save/load roundtrips, the function-level imports for SAM3 classes, the TRF009 exemption, and the extra CONFIG_MAPPING registrations. This single decision would have saved roughly 60-90 minutes of our session.

What we did that the PR didn't

  • Full MobileOne reparameterization support (nice to have for inference optimization)
  • Separate Sam3LiteTextImageProcessor and Sam3LiteTextProcessor classes
  • More config attributes for MobileCLIP specifics (kernel_size, layer_scale_init_value, norm_type)

Skill update needed?

The biggest missing guidance: "For composite models where only one sub-component changes, prefer standalone config over inheriting from the parent config."

@tarekziade tarekziade self-assigned this Mar 31, 2026
@tarekziade tarekziade marked this pull request as draft March 31, 2026 15:56
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


github-actions Bot commented Apr 1, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, sam3_lite_text
