
DO NOT MERGE adding SAM3-LiteText with a skill, first pass #45149

Draft

tarekziade wants to merge 2 commits into main from tarekziade-model-skill

Conversation


@tarekziade tarekziade commented Mar 31, 2026

First pass: ~1M tokens in, ~115K tokens out, mostly Opus, $42, 1h30

PR #44320 vs Our Implementation

What we got right

  • Same overall structure: modular file + generated standalone files + conversion script + tests + docs
  • Same model directory name (sam3_lite_text), same general class naming
  • Same regex-based key mapping approach in the conversion script
  • Same RepMixer + Transformer layer architecture for the MobileCLIP encoder
  • Same auto registration locations

Key differences where PR #44320 is better

Standalone config vs. inheriting from parent

The PR creates Sam3LiteTextConfig(PreTrainedConfig) as a standalone config with an explicit sub_configs dict, rather than inheriting from Sam3Config (see the sketch after this list). This single decision avoids:

  • The converter renaming every sub-config (Sam3LiteTextVisionConfig, Sam3LiteTextDETREncoderConfig, etc.)
  • The sub_configs dict mismatch problem during save/load roundtrips
  • Function-level imports for SAM3 classes in __init__ bodies
  • TRF009 allowlist exemption for cross-model imports
  • Multiple CONFIG_MAPPING registrations for renamed sub-config model types

Time cost of our approach: ~60-90 minutes of debugging across Issues #1, #2, and #4 in the session.
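
For reference, a minimal sketch of the standalone-config pattern, with CLIPVisionConfig/CLIPTextConfig standing in for the SAM3 sub-configs the PR actually reuses unchanged (class and attribute names follow the PR description; the body is illustrative, not the PR's code):

```python
# Minimal sketch, not the PR's code: CLIP sub-configs stand in for the
# Sam3* sub-configs that the PR reuses unchanged.
from transformers import CLIPTextConfig, CLIPVisionConfig, PreTrainedConfig


class Sam3LiteTextConfig(PreTrainedConfig):
    model_type = "sam3_lite_text"
    # Explicit sub_configs mapping: no sub-config is renamed, so save/load
    # roundtrips through to_dict()/from_dict() stay consistent.
    sub_configs = {"vision_config": CLIPVisionConfig, "text_config": CLIPTextConfig}

    def __init__(self, vision_config=None, text_config=None, **kwargs):
        super().__init__(**kwargs)
        # Accept either config objects or plain dicts (as loaded from JSON).
        if isinstance(vision_config, dict):
            vision_config = CLIPVisionConfig(**vision_config)
        if isinstance(text_config, dict):
            text_config = CLIPTextConfig(**text_config)
        self.vision_config = vision_config or CLIPVisionConfig()
        self.text_config = text_config or CLIPTextConfig()
```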

Reusing existing attention (Siglip) vs. writing custom attention

The PR inherits from SiglipAttention / SiglipEncoderLayer for the transformer layers inside the MobileCLIP encoder. This gives free SDPA, FlashAttention, and FlexAttention support through the existing infrastructure.

Our implementation writes a custom Sam3LiteTextAttention with manual fused QKV and softmax, which only supports eager mode. This forced us to skip 25+ parameterized SDPA test variants individually.
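
In the modular-transformers style, that reuse is plain subclassing; a sketch (class names assumed, not copied from the PR):

```python
# Sketch of the modular reuse pattern (modular_sam3_lite_text.py style):
# inheriting from Siglip brings the existing eager/SDPA/FlashAttention
# dispatch along for free.
from transformers.models.siglip.modeling_siglip import (
    SiglipAttention,
    SiglipEncoderLayer,
)


class Sam3LiteTextAttention(SiglipAttention):
    pass


class Sam3LiteTextEncoderLayer(SiglipEncoderLayer):
    pass
```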

Conditional RepMixer via config flag

The PR adds a use_repmixer_blocks: bool config attribute. When False, all layers are standard transformer layers. This makes testing trivial — set it to False for tests that need attention outputs or SDPA compatibility.

Our implementation always includes RepMixer blocks, which complicated test setup and required additional test skips.
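
A hypothetical sketch of how such a flag plays out at layer construction (num_repmixer_layers and the block classes are assumptions, reusing the names sketched above):

```python
import torch.nn as nn

# Hypothetical layer construction: with use_repmixer_blocks=False the stack
# degenerates to plain transformer layers, so attention-output and SDPA
# tests run without skips.
def build_layers(config):
    layers = []
    for i in range(config.num_hidden_layers):
        if config.use_repmixer_blocks and i < config.num_repmixer_layers:
            layers.append(Sam3LiteTextRepMixerBlock(config))  # conv token mixer
        else:
            layers.append(Sam3LiteTextEncoderLayer(config))  # standard attention
    return nn.ModuleList(layers)
```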

@capture_outputs decorator for hidden states / attentions

The PR uses the modern @capture_outputs and _can_record_outputs pattern, which automatically collects hidden states and attentions from designated layers. This means test_training, test_hidden_states_output, and test_training_gradient_checkpointing all pass without overrides.

Our implementation doesn't use this pattern, requiring us to skip training tests and write custom test_hidden_states_output.
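
The pattern looks roughly like this (decorator and dict names follow the PR description; the layer classes and forward signature are illustrative):

```python
# Rough shape of the output-recording pattern: map output names to the layer
# classes whose results should be collected, and let the decorator populate
# hidden_states/attentions instead of threading output_* flags by hand.
class Sam3LiteTextModel(Sam3LiteTextPreTrainedModel):
    _can_record_outputs = {
        "hidden_states": Sam3LiteTextEncoderLayer,  # one entry per layer
        "attentions": Sam3LiteTextAttention,  # attention weights per layer
    }

    @capture_outputs
    def forward(self, input_ids, attention_mask=None, **kwargs):
        ...
```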

Simplified MobileOneBlock (no reparameterization)

The PR's Sam3LiteTextMobileOneBlock omits the reparameterize() method and scale branch entirely. Since reparameterization is an inference-only optimization (fusing multi-branch conv to a single conv), it's not needed for the HF integration — the checkpoint already stores the unfused weights.

Our implementation includes the full MobileOne reparameterization machinery (~100 extra lines), which adds complexity without practical benefit in this context.
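
For contrast, a simplified, hypothetical multi-branch block: forward just sums the unfused branches, which is exactly what the stored checkpoint weights expect, so reparameterize() is never required for correctness:

```python
import torch.nn as nn

# Simplified, hypothetical multi-branch block (not the PR's code). The
# checkpoint stores these branches unfused, so forward sums them directly;
# reparameterize() would only fold them into one conv for inference speed.
class MultiBranchConvBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.conv_kxk = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size, padding=pad, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.conv_1x1 = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.identity = nn.BatchNorm2d(channels)

    def forward(self, x):
        # Unfused path: sum of kxk, 1x1, and identity branches.
        return self.conv_kxk(x) + self.conv_1x1(x) + self.identity(x)
```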

EOT pooling vs. full-sequence projection

The PR does CLIP-style EOT (end-of-text) pooling: hidden_states[input_ids.argmax(dim=-1)] @ self.projection, matching the original MobileCLIP behavior.

Our implementation applies self.text_projection (an nn.Linear) to the full last_hidden_state sequence, which changes the text representation semantics.
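
Concretely (a sketch; it assumes, as in CLIP, that the EOT token has the highest id in the vocabulary, so argmax over input_ids finds its position):

```python
import torch

def eot_pool(hidden_states, input_ids, projection):
    # CLIP-style pooling: take each sequence's hidden state at the EOT
    # position, then project to one (projection_dim,) vector per text.
    batch = torch.arange(hidden_states.shape[0], device=hidden_states.device)
    eot_positions = input_ids.argmax(dim=-1)  # EOT has the largest token id
    return hidden_states[batch, eot_positions] @ projection

def full_sequence_project(hidden_states, projection):
    # Our variant: project every token, yielding (batch, seq_len,
    # projection_dim) instead of a single pooled embedding per text.
    return hidden_states @ projection
```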

Integration tests with exact values

The PR includes 7 @slow integration tests covering text prompts, box prompts, combined prompts, batched images, semantic segmentation, and efficient multi-prompt inference — all with exact numerical assertions against the converted checkpoint. Our implementation has no integration tests (deferred to Phase 4 / checkpoint availability).
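
The skeleton of one such test, with the checkpoint id and expected tensor left as placeholders rather than real values:

```python
import unittest

import torch
from transformers import AutoModel, AutoProcessor
from transformers.testing_utils import slow


class Sam3LiteTextIntegrationTest(unittest.TestCase):
    @slow
    def test_text_prompt(self):
        checkpoint = "org/sam3-lite-text"  # placeholder repo id
        model = AutoModel.from_pretrained(checkpoint)
        processor = AutoProcessor.from_pretrained(checkpoint)
        inputs = processor(text=["a cat"], return_tensors="pt")
        outputs = model(**inputs)
        # Exact-value assertion against the converted checkpoint; the
        # expected tensor is a placeholder to fill from a reference run.
        expected = torch.tensor([0.0, 0.0, 0.0])
        torch.testing.assert_close(
            outputs.text_embeds[0, :3], expected, rtol=1e-4, atol=1e-4
        )
```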

The biggest lesson

Don't inherit from a composite parent config.

Creating Sam3LiteTextConfig(PreTrainedConfig) as a standalone config with an explicit sub_configs dict avoids every problem listed above: the converter renaming every sub-config, the sub_configs dict mismatch during save/load roundtrips, the function-level imports for SAM3 classes, the TRF009 exemption, and the extra CONFIG_MAPPING registrations. This single decision would have saved roughly 60-90 minutes of our session.

What we did that the PR didn't

  • Full MobileOne reparameterization support (nice to have for inference optimization)
  • Separate Sam3LiteTextImageProcessor and Sam3LiteTextProcessor classes
  • More config attributes for MobileCLIP specifics (kernel_size, layer_scale_init_value, norm_type)

Skill update needed?

The biggest missing guidance: "For composite models where only one sub-component changes, prefer standalone config over inheriting from the parent config."

@tarekziade tarekziade self-assigned this Mar 31, 2026
@tarekziade tarekziade marked this pull request as draft March 31, 2026 15:56
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


github-actions Bot commented Apr 1, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, sam3_lite_text
