Add SAM 3.1 #45110

Open
NielsRogge wants to merge 4 commits into huggingface:main from NielsRogge:add_sam_3_1

@NielsRogge
Collaborator

@NielsRogge NielsRogge commented Mar 30, 2026

What does this PR do?

[Disclaimer: this PR was written entirely by Codex, with me only nudging it in the right direction, similar to #44285.]

Feature request

I'd like to add support for Meta's SAM 3.1 release to transformers.

For video, SAM 3.1 is not a simple checkpoint refresh. The upstream release introduces the new Object Multiplex tracking architecture, so it is not a drop-in replacement for the existing SAM 3 / sam3_video implementation.

Proposed scope

I have a local implementation working for the following scope:

  1. Image support for SAM 3.1 via the existing sam3 image family

    • Reuse Sam3Model for image inference.
    • Extend the SAM 3 conversion script to accept the merged facebook/sam3.1 checkpoint and extract the detector weights from it.
    • Verify conversion with a save/load/forward check, plus preprocessing parity against the upstream SAM 3 image preprocessing pipeline.
  2. New sam3_1_video model family for SAM 3.1 video

    • Add a dedicated sam3_1_video implementation based on the SAM 3.1 multiplex tracker architecture.
    • Build it from a modular source file (modular_sam3_1_video.py), with generated config/modeling files.
    • Add a conversion script that loads the public sam3.1_multiplex.pt checkpoint and verifies parity against the upstream implementation.
  3. Docs and tests

    • Add model docs for sam3_1 and sam3_1_video.
    • Add focused unit tests for the new sam3_1_video model family.
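
As a rough illustration of the "extract the detector weights" step in item 1, a conversion script could filter the merged state dict by key prefix. This is only a sketch: the `detector.` prefix, the example key names, and the `extract_detector_weights` helper are hypothetical and are not taken from the actual facebook/sam3.1 checkpoint or from the conversion script in this PR.

```python
from typing import Any, Dict

# Hypothetical prefix: the real key layout of the merged checkpoint may differ.
DETECTOR_PREFIX = "detector."


def extract_detector_weights(
    merged_state: Dict[str, Any], prefix: str = DETECTOR_PREFIX
) -> Dict[str, Any]:
    """Keep only the entries under `prefix` and strip the prefix from their keys."""
    return {
        key[len(prefix):]: value
        for key, value in merged_state.items()
        if key.startswith(prefix)
    }


# Toy stand-in for a merged checkpoint (plain lists instead of tensors).
merged = {
    "detector.backbone.weight": [1.0, 2.0],
    "detector.head.bias": [0.5],
    "tracker.memory.weight": [3.0],
}

detector_only = extract_detector_weights(merged)
print(sorted(detector_only))
```

With a real checkpoint the values would be tensors, and the filtered dict would then be renamed to the key names Sam3Model expects before loading.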

Local status

This is already working locally against the upstream SAM 3.1 codebase and checkpoint:

  • real image conversion from facebook/sam3.1 succeeds
  • real video conversion from facebook/sam3.1 succeeds
  • video parity passes against the upstream implementation
  • image preprocessing parity passes against the upstream preprocessing pipeline
  • targeted SAM 3 / SAM 3.1 tests pass
  • make check-repo passes locally
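
For context, a parity check of the kind listed above usually boils down to comparing the converted model's outputs with the upstream outputs under a tolerance. The sketch below uses plain Python lists as stand-ins for flattened output tensors. The `outputs_match` helper, the sample values, and the `1e-4` tolerance are assumptions for illustration, not the thresholds used in this PR.

```python
from typing import Sequence


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def outputs_match(
    reference: Sequence[float], candidate: Sequence[float], atol: float = 1e-4
) -> bool:
    """True when every element of `candidate` is within `atol` of `reference`."""
    return max_abs_diff(reference, candidate) <= atol


# Toy values standing in for upstream vs. converted model outputs.
upstream_logits = [0.1, 0.25, -1.0]
converted_logits = [0.10001, 0.25, -1.00002]

parity_ok = outputs_match(upstream_logits, converted_logits)
print("parity:", parity_ok)
```

With real tensors, the same comparison is typically done with `torch.testing.assert_close` or `torch.allclose`.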

Why a new sam3_1_video family?

My current recommendation is:

  • keep SAM 3.1 image support inside the existing sam3 image family
  • add a separate sam3_1_video family for video, since the multiplex tracker architecture and checkpoint layout differ from the current SAM 3 video implementation

This keeps the image path minimal and avoids forcing the existing sam3_video code through an architecture change that would make it incompatible with its current checkpoints.
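
One way to make a checkpoint layout mismatch concrete is to diff the parameter names a model expects against the names a checkpoint provides. The following is a minimal sketch with made-up key names. `diff_checkpoint_keys` is a hypothetical helper, and the keys do not reflect the real sam3_video or SAM 3.1 layouts.

```python
from typing import Dict, Iterable, List


def diff_checkpoint_keys(
    expected: Iterable[str], found: Iterable[str]
) -> Dict[str, List[str]]:
    """Report keys the model expects but the checkpoint lacks, and vice versa."""
    expected_set, found_set = set(expected), set(found)
    return {
        "missing": sorted(expected_set - found_set),
        "unexpected": sorted(found_set - expected_set),
    }


# Hypothetical key names, for illustration only.
existing_video_keys = ["tracker.memory_encoder.weight", "tracker.mask_decoder.weight"]
multiplex_checkpoint_keys = ["tracker.mask_decoder.weight", "tracker.multiplex.weight"]

report = diff_checkpoint_keys(existing_video_keys, multiplex_checkpoint_keys)
print(report)
```

A large "missing" or "unexpected" list is a sign that the checkpoint targets a different architecture rather than a refreshed set of weights.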

Open questions for maintainers

Codex also had some questions to confirm the expected scope and structure:

  1. Is a new sam3_1_video model family the preferred approach for SAM 3.1 video support?
  2. Is it acceptable to keep SAM 3.1 image support in the existing sam3 family rather than adding a separate sam3_1 image model class?
  3. Should the first PR focus on:
    • image conversion + sam3_1_video core model only
    • or also include a higher-level processor / session API for SAM 3.1 video?
  4. If this direction looks good, is there any preferred PR split for reviewability?

To do:

  • remove plan.md and progress.md
  • convert and push checkpoint

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, sam3, sam3_1_video

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread plan.md
Comment thread progress.md
@jjabo

jjabo commented Apr 23, 2026

Is this going to be merged?


4 participants