
[WIP] Add MM Grounding DINO#37925

Merged
qubvel merged 37 commits into huggingface:main from rziga:add_mm_grounding_dino
Aug 1, 2025

Conversation

@rziga
Contributor

@rziga rziga commented May 2, 2025

What does this PR do?

Fixes #37744.

It adds support for MM Grounding DINO and LLMDet (inference only).

  • I've added a weight conversion script and a modular transformers implementation.
  • The forward pass matches the original implementation within a 1e-3 tolerance (verified on the tiny and large checkpoints).
  • The tests are copied and adapted from Grounding DINO, but two integration tests currently fail (batched consistency and CPU-GPU consistency).
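For reference, the tolerance comparison above can be sketched in plain Python. This is only an illustration: the actual integration tests presumably compare torch tensors (e.g. with torch.allclose), and the logits values below are made up.

```python
# Illustrative sketch of the 1e-3 tolerance check described above.
# The real tests compare tensors; this pure-Python version just shows
# the same absolute + relative tolerance rule.

def allclose(a, b, rtol=1e-5, atol=1e-3):
    """True if every pair of values agrees within atol + rtol * |y|."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

hf_logits = [0.2513, -1.0472, 3.1401]        # hypothetical HF outputs
original_logits = [0.2509, -1.0468, 3.1408]  # hypothetical mmdetection outputs
print(allclose(hf_logits, original_logits))  # True: all gaps are under 1e-3
```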

TODO

[ ] fix failing tests

Tagging @qubvel

rziga added 7 commits May 2, 2025 13:56
Added modular implementation for MM Grounding DINO from starting point created by add-new-model-like. Added conversion script from mmdetection to huggingface.

TODO: Some tests are failing so that needs to be fixed.
fixed a bug with modular definition of MMGroundingDinoForObjectDetection where box and class heads were not correctly assigned to inner model
Cross-attention masking and CPU-GPU consistency tests are still failing, however.
@yonigozlan
Member

Hi @rziga ! Feel free to ping me when the PR is ready for review!

@rziga
Contributor Author

rziga commented May 21, 2025

Sorry, I got busy with life stuff for the last couple of weeks, so I didn't have time to finish it up.

The model works, but I have to upload all the checkpoints to the hub and update the model doc before I mark it ready for review.

@rziga rziga marked this pull request as ready for review May 30, 2025 11:09
@rziga
Contributor Author

rziga commented May 30, 2025

Pinging @yonigozlan and @qubvel as requested.


# MM Grounding DINO

Contributor


Model documentation now follows a different format, as described in this issue, to standardize model cards and make them easier for users.

@1himan
Contributor

1himan commented Jun 27, 2025

Hey @rziga, I wanted to ask: how do you add a new model? Hugging Face provides utilities to help you get started with the boilerplate. AFAIK this is where you start: the transformers add-new-model-like CLI command asks around 7-8 questions about the model. The first one goes like this: "What is the model you would like to duplicate? Please provide the lowercase model_type (e.g. roberta):"
That model is used as scaffolding for the new one; we then change the main files, which include the model architecture, any pre/post-processing utilities, and a few more things. We don't just dump the research code (the original implementation of the model) in here. We first need to do a sort of code archaeology: remove the research-specific code and keep only the code useful for inference.

Correct me if I'm wrong. And could you tell me your approach? 😄

@rziga
Contributor Author

rziga commented Jun 27, 2025

Hi,

I don't really know if I'm the right person to ask, but yes, I started with transformers add-new-model-like. From then on I don't remember exactly, but it was a combination of following these two guides:

Since this modular approach is quite new, I also looked at other models that used modular transformers, like this one:

Hope this helps somewhat. But again I don't think I'm the right person to ask here. It is probably better to open an issue to add a model and ask the actual contributors.

@qubvel
Contributor

qubvel commented Jun 27, 2025

Hey @rziga, I'm really sorry for the long delay. I'll review it next week. The modular code looks very minimal, so I expect we can merge it quickly! Thanks a lot for working on the model.

@qubvel qubvel requested review from qubvel and removed request for ArthurZucker and Rocketknight1 July 16, 2025 09:38
@qubvel
Contributor

qubvel commented Jul 30, 2025

Hi @rziga, I created an org for the MM Grounding DINO checkpoints and invited you.

Comment on lines -191 to -193
"""
Preprocess an image or batch of images.
"""
Contributor


hmm, that's a bit strange, we should probably merge main and make repo-consistency to avoid it

Collaborator


yes, let's remove this unrelated change

Collaborator

@ArthurZucker ArthurZucker left a comment


Very nice! Thanks for making it so easy to review with modular!

Comment on lines +220 to +242
if config.decoder_cls_embed_share:
    _class_embed = MMGroundingDinoContrastiveEmbedding(config)
    self.class_embed = nn.ModuleList([_class_embed for _ in range(config.decoder_layers)])
else:
    module_list = []
    for _ in range(config.decoder_layers):
        _class_embed = MMGroundingDinoContrastiveEmbedding(config)
        module_list.append(_class_embed)
    self.class_embed = nn.ModuleList(module_list)

if config.decoder_bbox_embed_share:
    _bbox_embed = MMGroundingDinoMLPPredictionHead(
        input_dim=config.d_model, hidden_dim=config.d_model, output_dim=4, num_layers=3
    )
    self.bbox_embed = nn.ModuleList([_bbox_embed for _ in range(config.decoder_layers)])
else:
    module_list = []
    for _ in range(config.decoder_layers):
        _bbox_embed = MMGroundingDinoMLPPredictionHead(
            input_dim=config.d_model, hidden_dim=config.d_model, output_dim=4, num_layers=3
        )
        module_list.append(_bbox_embed)
    self.bbox_embed = nn.ModuleList(module_list)
Collaborator


In general we should not have so many code paths! Each case would lead to a different embedding... Is there a model for each case?

Contributor Author


Both of these are always false, yes. I remember debating what to do. The cleanest thing would be to delete both decoder_bbox_embed_share and decoder_cls_embed_share and have no branching here. The issue is that the original GroundingDino requires the decoder_bbox_embed_share parameter, so I was kind of stuck with using it here as well.
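To make the sharing semantics concrete, here is a minimal plain-Python sketch (no torch; Head is a hypothetical stand-in for the real prediction heads) of what the share flag toggles: a list comprehension over one instance puts the same object in every slot, while constructing inside the loop gives each layer its own head.

```python
# Plain-Python stand-in for the share flag discussed above (Head is
# hypothetical; the real code stores nn.Modules in an nn.ModuleList).

class Head:
    def __init__(self):
        self.weight = 0.0

num_layers = 3

shared = Head()
shared_heads = [shared for _ in range(num_layers)]    # share=True: one object
separate_heads = [Head() for _ in range(num_layers)]  # share=False: fresh objects

shared_heads[0].weight = 1.0
separate_heads[0].weight = 1.0

print(shared_heads[1].weight)    # 1.0 -- mutating one entry affects all
print(separate_heads[1].weight)  # 0.0 -- entries are independent
```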

Contributor


@rziga if we are overriding the entire method, I suppose we can remove if/else code path here and remove these args from config (del decoder_bbox_embed_share in modular)

Contributor Author


I tried to do that just now. The issue is that decoder_bbox_embed_share is used in a check inside config, so it doesn't get removed during modular conversion. Do I just copy the whole config class to remove the check?

Contributor


ahh, I see! Yes, let's copy and modify the whole config class then

Contributor Author


Do I keep the modified config in modular file or just move it to configuration file?

Contributor Author


Never mind, I need to keep it in modular otherwise the conversion deletes it.


self.backbone = MMGroundingDinoConvModel(backbone, position_embeddings)

# Create input projection layers
if config.num_feature_levels > 1:
Contributor


Same here, can we remove this if/else? I suppose num_feature_levels > 1 for all checkpoints

Contributor Author


Good catch. I'm not really sure why this if/else is even here. The first branch seems to produce the same thing as the second one.

@qubvel
Contributor

qubvel commented Jul 31, 2025

run-slow: mm_grounding_dino

@github-actions
Contributor

This comment contains run-slow, running the specified jobs:

models: ['models/mm_grounding_dino']
quantizations: [] ...

@qubvel qubvel self-requested a review August 1, 2025 10:04
@github-actions
Contributor

github-actions Bot commented Aug 1, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, grounding_dino, mm_grounding_dino

@qubvel
Contributor

qubvel commented Aug 1, 2025

run-slow: grounding_dino, mm_grounding_dino

@github-actions
Contributor

github-actions Bot commented Aug 1, 2025

This comment contains run-slow, running the specified jobs:

models: ['models/grounding_dino', 'models/mm_grounding_dino']
quantizations: [] ...

@qubvel qubvel merged commit 3951d4a into huggingface:main Aug 1, 2025
25 of 26 checks passed
@qubvel
Contributor

qubvel commented Aug 1, 2025

Huge thanks for the contribution @rziga! Great work 🤗

@sushmanthreddy
Contributor

@qubvel here support for LLMDet is only added for inference. Is there any plan to support training as well? Can I please take up the issue if it's planned or open for the community?

zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
* first commit

Added modular implementation for MM Grounding DINO from starting point created by add-new-model-like. Added conversion script from mmdetection to huggingface.

TODO: Some tests are failing so that needs to be fixed.

* fixed a bug with modular definition of MMGroundingDinoForObjectDetection where box and class heads were not correctly assigned to inner model

* cleaned up a hack in the conversion script

* Fixed the expected values in integration tests

Cross att masking and cpu-gpu consistency tests are still failing however.

* changes for make style and quality

* add documentation

* clean up contrastive embedding

* add mm grounding dino to loss mapping

* add model link to config docstring

* hack fix for mm grounding dino consistency tests

* add special cases for unused config attr check

* add all models and update docs

* update model doc to the new style

* Use super_kwargs for modular config

* Move init to the _init_weights function

* Add copied from for tests

* fixup

* update typehints

* Fix-copies for tests

* fix-copies

* Fix init test

* fix snippets in docs

* fix consistency

* fix consistency

* update conversion script

* fix nits in readme and remove old comments from conversion script

* add license

* remove unused config args

* remove unnecessary if/else in model init

* fix quality

* Update references

* fix test

* fixup

---------

Co-authored-by: qubvel <qubvel@gmail.com>