model: Add DEIMv2 to Transformers #44339

Merged
vasqu merged 30 commits into huggingface:main from harshaljanjani:add-deimv2
Apr 27, 2026

Conversation

@harshaljanjani (Contributor) commented Feb 27, 2026

What does this PR do?

→ This PR adds DEIMv2 to Transformers!
Important: I've linked two notebooks: a Colab notebook here demonstrating the functionality, predictions, checkpoint conversion, and the passing test suite for all preset types, and another notebook here that demos detailed fine-tuning and predictions on this preset: https://huggingface.co/Intellindust/DEIMv2_HGNetv2_N_COCO.
References:
  • Model Checkpoints
  • GitHub Pages
  • GitHub Repository
  • Paper
→ All variants, including the base variant as well as those requiring LiteEncoder and DINOv3+STA support, are now complete and working, as demonstrated in the Colab notebook (I've used one preset from each of the three types for the demo) 🥳

Closes #41211 and completes #41327.

Before submitting

  • This PR adds a new model to Transformers.
  • Did you read the contributor guidelines, specifically the Pull Request section?
  • Was this discussed or approved via a GitHub issue or the forum? If so, please add a link.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you add any necessary tests?

@harshaljanjani (Contributor, Author) commented Mar 1, 2026

Happy to make changes until it's production-ready; the test suite is passing locally, and all the functionality has been tested (I've left notebooks to demo this, as mentioned in the PR description). The way I structured the PR is as follows: the changes for the base preset are in eaef822, and the changes for the ornamentations on top of it (HGNetV2+HybridEncoder (N), HGNetV2+LiteEncoder (Pico/Femto/Atto), and DINOv3+HybridEncoder (S/M/L/X)) are in 85c7356 :)

@harshaljanjani marked this pull request as ready for review March 1, 2026 08:26
@harshaljanjani (Contributor, Author) commented Mar 4, 2026

Just a gentle ping on this :)

@Rocketknight1 (Member)

Yes, sorry! cc @NielsRogge for review, maybe @molbap

@yonigozlan (Member)

Thanks for working on this @harshaljanjani! I'll have a look in the coming days

@harshaljanjani (Contributor, Author)

> Thanks for working on this @harshaljanjani! I'll have a look in the coming days

Thanks @yonigozlan! I'll await your feedback 🤗

@harshaljanjani (Contributor, Author)

@yonigozlan @NielsRogge Just a gentle ping :)

@yonigozlan (Member)

@harshaljanjani Having a look now :), thank you for your patience!

@yonigozlan (Member) left a comment

Hey @harshaljanjani! Thanks a lot for the huge work on this PR! I have only looked at the modular file for now, but there are a few things to fix there before I get to the rest. The main point is that we shouldn't need to redefine a new DINOv3-based backbone; we should be able to use the existing one through load_backbone.
Also, if you could add an integration test for each model variant in the modeling test file, that would be great to make sure that we don't break anything during the review process.
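
A minimal sketch of the load_backbone route being suggested here; the wrapper class and wiring are illustrative assumptions, not the PR's final code:

import torch
from torch import nn
from transformers.utils.backbone_utils import load_backbone

class Deimv2ConvEncoder(nn.Module):
    # Hypothetical wrapper: load_backbone resolves config.backbone_config
    # (e.g. a DINOv3ViTConfig) to the library's existing backbone instead of
    # redefining a DINOv3-based backbone inside this model.
    def __init__(self, config):
        super().__init__()
        self.backbone = load_backbone(config)

    def forward(self, pixel_values: torch.Tensor):
        # BackboneOutput.feature_maps holds the multi-scale features.
        return self.backbone(pixel_values).feature_maps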

Comment on lines +205 to +206
use_spatial_tuning_adapter (`bool`, *optional*, defaults to `False`):
    Whether to use the Spatial Tuning Adapter (STA) for DINOv2 backbone variants.
Member:

Is this always True? Let's remove it if so.

Contributor Author:

It's True for DINOv3 and False for HGNetV2, but this gave me a good direction to place it in Deimv2DINOv3ConvEncoder (created in the refactor) unconditionally instead; we don't need the flag :)
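
A rough sketch of the placement described above, with the STA applied unconditionally inside the DINOv3-specific encoder; the forward interface and load_backbone wiring are assumptions for illustration:

from transformers.utils.backbone_utils import load_backbone

class Deimv2DINOv3ConvEncoder(nn.Module):
    # Hypothetical shape: the STA always wraps the DINOv3 features here,
    # so the use_spatial_tuning_adapter flag becomes unnecessary.
    def __init__(self, config):
        super().__init__()
        self.backbone = load_backbone(config)  # DINOv3 ViT via backbone_config
        self.spatial_tuning_adapter = Deimv2SpatialTuningAdapter(config)  # defined elsewhere in the PR

    def forward(self, pixel_values):
        features = self.backbone(pixel_values).feature_maps
        return self.spatial_tuning_adapter(features)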

Comment on lines +219 to +227
dinov3_backbone_config (`dict`, *optional*):
    Configuration dictionary for the DINOv3 ViT backbone. Passed as kwargs to `DINOv3ViTConfig`.
dinov3_interaction_indexes (`list[int]`, *optional*):
    Layer indices in the DINOv3 ViT backbone from which to extract intermediate features.
dinov3_hidden_dim (`int`, *optional*):
    Hidden dimension for the DINOv3 backbone projection convolutions. If `None`, uses `hidden_size` from
    the DINOv3 ViT config.
dinov3_apply_layernorm (`bool`, *optional*, defaults to `False`):
    Whether to apply LayerNorm to intermediate features extracted from the DINOv3 ViT backbone.
Member:

This should be in `backbone_config`; we should have one way to load the different possible backbones, not separate paths for each.

Contributor Author:

Post-refactor, backbone_config is now a DINOv3ViTConfig object directly.

Comment on lines +217 to +218
backbone_type (`str`, *optional*, defaults to `"hgnetv2"`):
    Type of backbone to use. `"hgnetv2"` uses HGNetV2, `"dinov3"` uses DINOv3 ViT backbone with STA.
Member:

This is redundant with `backbone_config`.

Comment thread src/transformers/models/deimv2/modular_deimv2.py
Comment on lines +415 to +425
class Deimv2RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.float()
        hidden_states = hidden_states * torch.rsqrt(hidden_states.pow(2).mean(-1, keepdim=True) + self.eps)
        return (hidden_states * self.scale).to(input_dtype)
Member:

Let's use an existing one from the library like LlamaRMSNorm, and make the necessary weight conversion. Also add the @use_kernel_forward_from_hub("RMSNorm") decorator

Contributor Author:

Post-refactor, Deimv2RMSNorm inherits from LlamaRMSNorm directly, and LlamaRMSNorm already has the decorator, which should be inherited as well :)
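
For reference, a sketch of what that inheritance could look like in the modular file, assuming the usual cross-model import path; the scale-to-weight rename would live in the conversion script:

from ..llama.modeling_llama import LlamaRMSNorm

class Deimv2RMSNorm(LlamaRMSNorm):
    # Inherits forward and (per the discussion above) the
    # @use_kernel_forward_from_hub("RMSNorm") behavior from LlamaRMSNorm.
    # Note LlamaRMSNorm names its parameter `weight`, so checkpoints using
    # `scale` need a key rename during conversion.
    pass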

Comment on lines +1049 to +1050
if config.use_spatial_tuning_adapter and config.backbone_type != "dinov3":
    self.spatial_tuning_adapter = Deimv2SpatialTuningAdapter(config)
Member:

This is weird; let's move this to the Deimv2ConvEncoder as well (if I understood correctly).

Contributor Author:

As described in the first reply regarding the flag, the STA is now inside Deimv2DINOv3ConvEncoder.

Comment on lines +1072 to +1079
if self.config.backbone_type == "dinov3":
    proj_feats = self.dinov3_backbone(pixel_values)
elif self.config.encoder_type == "lite":
    features = self.backbone(pixel_values, pixel_mask)
    proj_feats = [source for source, mask in features]
else:
    features = self.backbone(pixel_values, pixel_mask)
    proj_feats = [self.encoder_input_proj[level](source) for level, (source, mask) in enumerate(features)]
Member:

Here we should be able to have a single backbone call:

Suggested change
-if self.config.backbone_type == "dinov3":
-    proj_feats = self.dinov3_backbone(pixel_values)
-elif self.config.encoder_type == "lite":
-    features = self.backbone(pixel_values, pixel_mask)
-    proj_feats = [source for source, mask in features]
-else:
-    features = self.backbone(pixel_values, pixel_mask)
-    proj_feats = [self.encoder_input_proj[level](source) for level, (source, mask) in enumerate(features)]
+proj_feats = self.backbone(pixel_values)

level_start_index = torch.cat((spatial_shapes.new_zeros((1,)), spatial_shapes.prod(1).cumsum(0)[:-1]))

if self.training and self.config.num_denoising > 0 and labels is not None:
    from ..d_fine.modeling_d_fine import get_contrastive_denoising_training_group
Member:

Let's import it at the top; it will be copied to the modeling file thanks to modular.
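
Concretely, the suggestion is a top-of-file import in modular_deimv2.py:

# Top of modular_deimv2.py; the modular converter inlines the helper into
# the generated modeling file.
from ..d_fine.modeling_d_fine import get_contrastive_denoising_training_group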


init_reference_points = reference_points_unact.detach()

from ..d_fine.modeling_d_fine import DFineModelOutput
Member:

Let's define a Deimv2ModelOutput inheriting from DFineModelOutput above
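
A minimal sketch of the suggested class; the fields all come from DFineModelOutput:

from ..d_fine.modeling_d_fine import DFineModelOutput

class Deimv2ModelOutput(DFineModelOutput):
    # Inherits all fields; exists so DEIMv2 exposes its own output type.
    pass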

Comment on lines +1221 to +1231
@property
def _tied_weights_keys(self):
    keys = {
        r"class_embed.(?![0])\d+": r"^class_embed.0",
        "class_embed": "model.decoder.class_embed",
        "bbox_embed": "model.decoder.bbox_embed",
    }
    if getattr(self.config, "share_bbox_head", False):
        keys[r"model\.decoder\.bbox_embed\.(?![0])\d+"] = r"model.decoder.bbox_embed.0"
        keys[r"bbox_embed.(?![0])\d+"] = r"bbox_embed.0"
    return keys
Member:

Not sure why we would have this, but let's replace it with a standard:

Suggested change
-@property
-def _tied_weights_keys(self):
-    keys = {
-        r"class_embed.(?![0])\d+": r"^class_embed.0",
-        "class_embed": "model.decoder.class_embed",
-        "bbox_embed": "model.decoder.bbox_embed",
-    }
-    if getattr(self.config, "share_bbox_head", False):
-        keys[r"model\.decoder\.bbox_embed\.(?![0])\d+"] = r"model.decoder.bbox_embed.0"
-        keys[r"bbox_embed.(?![0])\d+"] = r"bbox_embed.0"
-    return keys
+_tied_weights_keys = {
+    r"class_embed.(?![0])\d+": r"^class_embed.0",
+    "class_embed": "model.decoder.class_embed",
+    "bbox_embed": "model.decoder.bbox_embed",
+}

Contributor Author:

Actually, I didn't make this change because share_bbox_head varies across presets; "Pico"/"Femto"/"Atto" share bbox heads, but the rest don't. Happy to update this if you think the above approach would be better and I'm missing something!

Member:

Oh I see. Not sure if this would cause an issue tbh, but it might be OK. I'll let a core maintainer see if that can work. Otherwise, the alternative would be to just copy the shared bbox embeds when converting the model, but that's not ideal.

Contributor:

Should be fine tbh; we rarely have these cases, but when they appear it's usually modified in the init. I actually like this version more as a property; the only nit is that I'm not sure why we have to use getattr.

cc @Cyrilvallez in any case if you disagree

@harshaljanjani (Contributor, Author) commented Mar 17, 2026

Thanks for your time @yonigozlan, and for the amazing review round; I've addressed the comments and ensured the test suite passes after the update 🤗❤️
I've pushed the changes in a single commit for ease of review; it includes everything updated in response to the feedback. I've added replies to your comments where I'd love for you to see how I thought through things, so you can provide feedback if needed :)
Hope you find this a solid improvement over the last round!


@harshaljanjani (Contributor, Author)

> Also, if you could add an integration test for each model variant in the modeling test file, that would be great to make sure that we don't break anything during the review process.

Just a quick note, the test harness should already cover the different variants :)

@yonigozlan (Member) left a comment

Thanks a lot for iterating on this so quickly @harshaljanjani! This looks much better; it's almost good to merge on my side.
Once you have made the last changes I suggested, you can directly ping a core maintainer for a final review.
Also, don't forget to merge/rebase with your next modifications!

Comment on lines +428 to +438
class Deimv2SwiGLUFFN(nn.Module):
    def __init__(self, in_features: int, hidden_features: int, out_features: int):
        super().__init__()
        self.w12 = nn.Linear(in_features, 2 * hidden_features)
        self.w3 = nn.Linear(hidden_features, out_features)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        x12 = self.w12(hidden_states)
        x1, x2 = x12.chunk(2, dim=-1)
        hidden = F.silu(x1) * x2
        return self.w3(hidden)
Member:

Sorry for the back-and-forth on this, but since we can't directly inherit from DINOv3ViTGatedMLP here anyway, let's use the format/naming of the DINOv2 SwiGLUFFN. Not splitting the linear layers might result in slightly better performance too.
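
A sketch of the DINOv2-style layout being suggested, assuming Dinov2SwiGLUFFN's weights_in/weights_out naming carries over; illustrative, not the merged code:

import torch
from torch import nn

class Deimv2SwiGLUFFN(nn.Module):
    # Mirrors Dinov2SwiGLUFFN: one fused 2*hidden projection in, one out,
    # with the gate/up split happening only in forward.
    def __init__(self, in_features: int, hidden_features: int, out_features: int):
        super().__init__()
        self.weights_in = nn.Linear(in_features, 2 * hidden_features)
        self.weights_out = nn.Linear(hidden_features, out_features)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states = self.weights_in(hidden_states)
        gate, up = hidden_states.chunk(2, dim=-1)
        return self.weights_out(nn.functional.silu(gate) * up)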

class Deimv2GAPFusion(nn.Module):
    def __init__(self, config: Deimv2Config, channels: int):
        super().__init__()
        self.cv = Deimv2ConvNormLayer(config, channels, channels, 1, 1, activation=config.activation_function)
Member:

Can we find a better/more descriptive name than `cv`?

Comment on lines +912 to +914
if pixel_mask is None:
    pixel_mask = torch.ones((batch_size, height, width), device=device)

Member:

Just a heads up that pixel_mask is not used here. This is not on you, as pixel_mask is not supported in DINOv3 for now. I'll add it later, but it's out of scope for this PR. Would you mind adding a TODO comment to pass the mask to the backbone once DINOv3 supports it? Thanks!
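
The requested comment could look like this (wording is illustrative):

if pixel_mask is None:
    pixel_mask = torch.ones((batch_size, height, width), device=device)
# TODO: pass pixel_mask to the DINOv3 backbone once it supports masking.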

Comment on lines +719 to +720
class Deimv2MLPPredictionHead(DFineMLP):
    pass
Member:

This is not used at all

Contributor Author:

My bad, removed!

Comment on lines +847 to +850
if is_dinov3:
self.backbone = Deimv2DINOv3ConvEncoder(config)
else:
self.backbone = Deimv2ConvEncoder(config)
Member:

Here I think it would make more sense to name these self.conv_encoder. I know this is named backbone in DFine as well, but here we include the feature projections, STA, etc., so this is more than a backbone. Plus, it avoids having "backbone.backbone", which I don't like in DFine.

Contributor Author:

Agreed; I thought it was a bit weird too. Made the broader self.conv_encoder change as and where needed :)

Comment thread docs/source/en/model_doc/deimv2.md Outdated
@@ -0,0 +1,68 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Member:

Suggested change
-<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+<!--Copyright 2026 The HuggingFace Team. All rights reserved.

@@ -0,0 +1,29 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
Member:

Suggested change
-# Copyright 2025 The HuggingFace Team. All rights reserved.
+# Copyright 2026 The HuggingFace Team. All rights reserved.

@@ -0,0 +1,784 @@
# Copyright 2025 The HuggingFace Inc. team.
Member:

Suggested change
-# Copyright 2025 The HuggingFace Inc. team.
+# Copyright 2026 The HuggingFace Inc. team.

@@ -0,0 +1,1142 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
Member:

Suggested change
-# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.

Comment on lines +1 to +2
# coding = utf-8
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
Member:

Suggested change
-# coding = utf-8
-# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.

@harshaljanjani (Contributor, Author) commented Mar 19, 2026

Thanks a lot for the review cycle! I've addressed all the newer comments in commit 4ad0dc5 and verified that there are no regressions in the test suite 🤗
Pinging @ArthurZucker for the final review as suggested above; please let me know if I've missed any additional pings :)


@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@harshaljanjani (Contributor, Author)

Good day @ArthurZucker; just a gentle ping to check if there are any updates regarding the final review, thank you!
cc: @yonigozlan

@yonigozlan (Member)

Also pinging @vasqu @Cyrilvallez for core maintainer review :)

@harshaljanjani (Contributor, Author) commented Apr 23, 2026

> Hi, could you please check PR #45601 regarding a bug in the D-FINE loss; I think it is also relevant for DEIMv2.

@m-matthias Thanks for flagging this in the PR; it turns out DEIMv2 suffered from the same coupling in the copy-over. I've pushed the fix in fb1f387; it's independent of the D-FINE fix (also cc: @Abineshabee!). You're both awesome 🤗
@vasqu Would be grateful if you could cross-check this as well whenever you have a moment, thank you :)

@vasqu (Contributor) commented Apr 23, 2026

SGTM, I reviewed the other PR but would like to wait on Yoni before merging. Can you sync with main here btw? We updated the linter, so the false positive shouldn't appear anymore.

@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, d_fine, deimv2

@harshaljanjani (Contributor, Author)

> Can you sync with main here btw? We updated the linter, so the false positive shouldn't appear anymore.

Yepp; the false positive's gone, awesome!

@harshaljanjani (Contributor, Author)

Good day; just checking in to see if there are any updates!

@vasqu (Contributor) commented Apr 27, 2026

run-slow: d_fine, deimv2

@vasqu (Contributor) commented Apr 27, 2026

@harshaljanjani sorry, no, everything should be ready; one last sanity check, then merging 🤗 Thanks a lot for the contribution; it definitely wasn't an easy model.

@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, d_fine, deimv2

@github-actions (Contributor)

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/d_fine", "models/deimv2"]
quantizations: []

@github-actions (Contributor)

CI Results

Workflow Run ⚙️

Commit Info

Context | Commit   | Description
--------|----------|-------------------------------
RUN     | f7039bb7 | workflow commit (merge commit)
PR      | a3d9ac56 | branch commit (from PR)
main    | e651c68e | base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@vasqu enabled auto-merge April 27, 2026 13:20
@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, d_fine, deimv2

@vasqu added this pull request to the merge queue Apr 27, 2026
@vasqu removed this pull request from the merge queue due to a manual request Apr 27, 2026
@vasqu added this pull request to the merge queue Apr 27, 2026
Merged via the queue into huggingface:main with commit 739b46c Apr 27, 2026
28 checks passed
@harshaljanjani deleted the add-deimv2 branch April 27, 2026 14:23
@harshaljanjani (Contributor, Author)

> @harshaljanjani sorry, no, everything should be ready; one last sanity check, then merging 🤗 Thanks a lot for the contribution; it definitely wasn't an easy model.

Had tons of fun; thank you as well for the timely review cycles on the PR, @vasqu and @yonigozlan! 😊

ArthurZucker pushed a commit that referenced this pull request Apr 28, 2026
* init: Add files (v1)

* fix: Fix ci/circleci: check_repository_consistency

* feat: Add support and test harness for all variants

* fix: Fix ci/circleci: check_repository_consistency

* refactor: Resolve review comments

* refactor: Resolve second review round

* nit: Fix copyright year

* refactor: Resolve third review round

* revert: Adhere to the pattern from yonigozlan

* nit: Clarify the docstring

* refactor: Resolve fourth review round

* refactor: Closing in on the final set of nits

* fix: Resolve merge conflicts

* fix: Add loss override and address nits

* nits: Fix minor issues

* fixup their init weights

* fix: Fix loss coupling issue

* fix date

---------

Co-authored-by: vasqu <antonprogamer@gmail.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add DEIMv2

6 participants