DO NOT MERGE adding SAM3-LiteText with a skill, first pass #45149
Draft
tarekziade wants to merge 2 commits into main from
Conversation
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, sam3_lite_text
First pass: ~1M tokens in, ~115K out, mostly Opus, $42, 1h30
PR #44320 vs Our Implementation
What we got right
Key differences where PR #44320 is better
Standalone config vs. inheriting from parent
The PR creates `Sam3LiteTextConfig(PreTrainedConfig)` as a standalone config with an explicit `sub_configs` dict, rather than inheriting from `Sam3Config`. This single decision avoids:

- renamed sub-config classes (`Sam3LiteTextVisionConfig`, `Sam3LiteTextDETREncoderConfig`, etc.)
- the `sub_configs` dict mismatch problem during save/load roundtrips
- duplicated `__init__` bodies
- `CONFIG_MAPPING` registrations for renamed sub-config model types

Time cost of our approach: ~60-90 minutes of debugging across Issues #1, #2, and #4 in the session.
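To make the pattern concrete, here is a minimal sketch of the standalone-config shape, not the PR's actual code: the `Sam3VisionConfig` import path and every field name below are assumptions for illustration.

```python
# Illustrative sketch only; the sub-config import and all field names are assumed.
from transformers.configuration_utils import PreTrainedConfig
from transformers.models.sam3 import Sam3VisionConfig  # assumed existing sub-config


class Sam3LiteTextConfig(PreTrainedConfig):
    model_type = "sam3_lite_text"

    # Declare only the sub-configs this model actually owns, reusing the existing
    # classes instead of defining renamed Sam3LiteText* copies of each one.
    sub_configs = {"vision_config": Sam3VisionConfig}

    def __init__(self, vision_config=None, text_hidden_size=512, **kwargs):
        if vision_config is None:
            vision_config = {}
        self.vision_config = (
            vision_config
            if isinstance(vision_config, Sam3VisionConfig)
            else Sam3VisionConfig(**vision_config)
        )
        self.text_hidden_size = text_hidden_size
        super().__init__(**kwargs)
```

Because `sub_configs` lists exactly what this model owns, save/load round-trips don't run into the dict-mismatch problem described above, and no extra `CONFIG_MAPPING` entries are needed.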
Reusing existing attention (Siglip) vs. writing custom attention
The PR inherits from `SiglipAttention`/`SiglipEncoderLayer` for the transformer layers inside the MobileCLIP encoder. This gives free SDPA, FlashAttention, and FlexAttention support through the existing infrastructure.

Our implementation writes a custom `Sam3LiteTextAttention` with manually fused QKV and softmax, which only supports eager mode. This forced us to skip 25+ parameterized SDPA test variants individually.
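For reference, a minimal modular-style sketch of this kind of reuse (only the Siglip imports are real; the subclass names are illustrative, not the PR's exact code):

```python
# Reuse Siglip's attention stack so SDPA/FlashAttention support comes from the
# shared implementation rather than a hand-written eager-only attention.
from transformers.models.siglip.modeling_siglip import SiglipAttention, SiglipEncoderLayer


class Sam3LiteTextAttention(SiglipAttention):
    pass


class Sam3LiteTextEncoderLayer(SiglipEncoderLayer):
    pass
```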
Conditional RepMixer via config flag

The PR adds a `use_repmixer_blocks: bool` config attribute. When `False`, all layers are standard transformer layers. This makes testing trivial: set it to `False` for tests that need attention outputs or SDPA compatibility.

Our implementation always includes RepMixer blocks, which complicated test setup and required additional test skips.
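A simplified sketch of how such a flag could gate the block type when the encoder is built; the block classes are assumed to be defined elsewhere, and the real layout likely mixes block types per stage rather than choosing one for the whole encoder.

```python
import torch.nn as nn


class Sam3LiteTextEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        # With use_repmixer_blocks=False every layer is a plain transformer layer,
        # which keeps attention outputs and SDPA-compatibility tests straightforward.
        self.layers = nn.ModuleList(
            [
                Sam3LiteTextRepMixerBlock(config)
                if config.use_repmixer_blocks
                else Sam3LiteTextEncoderLayer(config)
                for _ in range(config.num_hidden_layers)
            ]
        )
```

The common-test config can then simply set `use_repmixer_blocks=False` instead of skipping the affected tests.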
`@capture_outputs` decorator for hidden states / attentions

The PR uses the modern `@capture_outputs` and `_can_record_outputs` pattern, which automatically collects hidden states and attentions from designated layers. This means `test_training`, `test_hidden_states_output`, and `test_training_gradient_checkpointing` all pass without overrides.

Our implementation doesn't use this pattern, requiring us to skip training tests and write a custom `test_hidden_states_output`.

Simplified MobileOneBlock (no reparameterization)

The PR's `Sam3LiteTextMobileOneBlock` omits the `reparameterize()` method and scale branch entirely. Since reparameterization is an inference-only optimization (fusing the multi-branch conv into a single conv), it's not needed for the HF integration: the checkpoint already stores the unfused weights.

Our implementation includes the full MobileOne reparameterization machinery (~100 extra lines), which adds complexity without practical benefit in this context.
EOT pooling vs. full-sequence projection
The PR does CLIP-style EOT (end-of-text) pooling: `hidden_states[input_ids.argmax(dim=-1)] @ self.projection`, matching the original MobileCLIP behavior.

Our implementation applies `self.text_projection` (an `nn.Linear`) to the full `last_hidden_state` sequence, which changes the text representation semantics.
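A self-contained toy comparison of the two strategies; the shapes and weights are dummies, only the indexing and projection logic matters.

```python
import torch
import torch.nn as nn

batch, seq_len, hidden, proj_dim = 2, 16, 64, 32
last_hidden_state = torch.randn(batch, seq_len, hidden)
input_ids = torch.randint(0, 100, (batch, seq_len))
projection = torch.randn(hidden, proj_dim)
text_projection = nn.Linear(hidden, proj_dim)

# CLIP/MobileCLIP-style EOT pooling (the PR's behavior): take the hidden state at
# the end-of-text position (the highest token id in CLIP-style vocabularies),
# then project that single vector.
eot_pos = input_ids.argmax(dim=-1)                           # (batch,)
pooled = last_hidden_state[torch.arange(batch), eot_pos]     # (batch, hidden)
text_embeds = pooled @ projection                            # (batch, proj_dim)

# Full-sequence projection (our implementation): project every position instead.
text_embeds_full = text_projection(last_hidden_state)        # (batch, seq_len, proj_dim)
```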
Integration tests with exact values

The PR includes 7 `@slow` integration tests covering text prompts, box prompts, combined prompts, batched images, semantic segmentation, and efficient multi-prompt inference, all with exact numerical assertions against the converted checkpoint. Our implementation has no integration tests (deferred to Phase 4 / checkpoint availability).
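For illustration, here is a sketch of what one such exact-value test typically looks like in transformers; the checkpoint id, output attribute, and expected values are placeholders, not taken from the PR.

```python
import torch

from transformers import AutoModel, AutoProcessor
from transformers.testing_utils import require_torch, slow


@slow
@require_torch
def test_text_prompt_exact_values():
    repo_id = "org/sam3-lite-text"  # placeholder checkpoint id
    model = AutoModel.from_pretrained(repo_id)
    processor = AutoProcessor.from_pretrained(repo_id)

    inputs = processor(text=["a photo of a cat"], return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Exact values recorded once from the converted checkpoint (placeholders here).
    expected = torch.tensor([0.1234, -0.5678, 0.9012])
    torch.testing.assert_close(outputs.text_embeds[0, :3], expected, atol=1e-4, rtol=1e-4)
```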
The biggest lesson

Don't inherit from a composite parent config.
The PR creates `Sam3LiteTextConfig(PreTrainedConfig)` as a standalone config with an explicit `sub_configs` dict. This completely avoids the renamed sub-config classes, the save/load `sub_configs` mismatch, the duplicated `__init__` bodies, and the extra `CONFIG_MAPPING` registrations listed above.

This single decision would have saved roughly 60-90 minutes of our session.
What we did that the PR didn't
Skill update needed?
The biggest missing guidance: "For composite models where only one sub-component changes, prefer standalone config over inheriting from the parent config."