model: Add DEIMv2 to Transformers #44339
Conversation
Happy to make changes until it's production-ready; the test suite is passing locally, and all the functionality has been tested (I've left notebooks demonstrating this, as mentioned in the PR description). The PR is structured as follows: the changes for the base preset are in eaef822, and the changes for the variants built on top of it (HGNetV2+HybridEncoder (N), HGNetV2+LiteEncoder (Pico/Femto/Atto), and DINOv3+HybridEncoder (S/M/L/X)) are in 85c7356 :)

Just a gentle ping on this :)

Yes, sorry! cc @NielsRogge for review, maybe @molbap

Thanks for working on this @harshaljanjani! I'll have a look in the coming days.

Thanks @yonigozlan! I'll await your feedback 🤗

@yonigozlan @NielsRogge Just a gentle ping :)

@harshaljanjani Having a look now :), thank you for your patience!
yonigozlan left a comment
Hey @harshaljanjani! Thanks a lot for the huge work on this PR! I have only looked at the modular file for now, but there are a few things to fix there before I get to the rest. The main point is that we shouldn't need to redefine a new DINOv3-based backbone; we should be able to use the existing one through load_backbone.
Also, if you could add an integration test for each model variant in the modeling test file, that would be great, so we can make sure nothing breaks during the review process.
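For context, here is a minimal sketch of the load_backbone path being suggested; the class and attribute names are illustrative assumptions, not the PR's actual code.

```python
# Illustrative sketch only: reuse the shared backbone-loading utility instead of a
# model-specific DINOv3 wrapper. Names below are assumptions for this example.
import torch.nn as nn
from transformers.utils.backbone_utils import load_backbone


class Deimv2ConvEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        # load_backbone instantiates whatever `config.backbone_config` describes
        # (e.g. HGNetV2 or the DINOv3 ViT), so no DEIMv2-specific backbone copy is needed.
        self.backbone = load_backbone(config)

    def forward(self, pixel_values):
        # AutoBackbone-style models return a BackboneOutput with `feature_maps`.
        return self.backbone(pixel_values).feature_maps
```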
```
use_spatial_tuning_adapter (`bool`, *optional*, defaults to `False`):
    Whether to use the Spatial Tuning Adapter (STA) for DINOv2 backbone variants.
```
Is this always True? let's remove it if so
It's True for DINOv3 and False for HGNetV2, but this gave me a good direction: the adapter is now placed in Deimv2DINOv3ConvEncoder (created in the refactor) unconditionally instead, so we don't need the flag :)
```
dinov3_backbone_config (`dict`, *optional*):
    Configuration dictionary for the DINOv3 ViT backbone. Passed as kwargs to `DINOv3ViTConfig`.
dinov3_interaction_indexes (`list[int]`, *optional*):
    Layer indices in the DINOv3 ViT backbone from which to extract intermediate features.
dinov3_hidden_dim (`int`, *optional*):
    Hidden dimension for the DINOv3 backbone projection convolutions. If `None`, uses `hidden_size` from
    the DINOv3 ViT config.
dinov3_apply_layernorm (`bool`, *optional*, defaults to `False`):
    Whether to apply LayerNorm to intermediate features extracted from the DINOv3 ViT backbone.
```
This should be in backbone_config; we should have one way to load the different possible backbones, not separate paths for each.
Post-refactor, backbone_config is now a DINOv3ViTConfig object directly.
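To illustrate the refactored path, a hedged configuration example (the import path for Deimv2Config assumes the module name this PR adds, and the DINOv3 config is left at its defaults):

```python
# Hedged example: a nested DINOv3 ViT config flowing through the single
# backbone_config path described above. Deimv2Config's import location is an
# assumption based on this PR; adjust once the model is merged.
from transformers import DINOv3ViTConfig
from transformers.models.deimv2 import Deimv2Config

backbone_config = DINOv3ViTConfig()  # defaults shown; tune per DEIMv2 variant
config = Deimv2Config(backbone_config=backbone_config)
```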
```
backbone_type (`str`, *optional*, defaults to `"hgnetv2"`):
    Type of backbone to use. `"hgnetv2"` uses HGNetV2, `"dinov3"` uses DINOv3 ViT backbone with STA.
```
This is redundant with backbone config
```python
class Deimv2RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.float()
        hidden_states = hidden_states * torch.rsqrt(hidden_states.pow(2).mean(-1, keepdim=True) + self.eps)
        return (hidden_states * self.scale).to(input_dtype)
```
Let's use an existing one from the library like LlamaRMSNorm, and make the necessary weight conversion. Also add the @use_kernel_forward_from_hub("RMSNorm") decorator
Post-refactor, Deimv2RMSNorm inherits from LlamaRMSNorm directly, and LlamaRMSNorm already has the decorator, which is inherited as well :)
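In modular-transformers terms, that refactor amounts to something like the sketch below (a sketch of the pattern, not the exact PR diff):

```python
# Sketch of the modular-file pattern described above: inherit the library's RMSNorm
# instead of re-implementing it. The @use_kernel_forward_from_hub("RMSNorm") decorator
# already sits on LlamaRMSNorm, so the kernel dispatch comes along with the inheritance.
from transformers.models.llama.modeling_llama import LlamaRMSNorm


class Deimv2RMSNorm(LlamaRMSNorm):
    # Note: LlamaRMSNorm names its parameter `weight`, while the original code above
    # used `scale`, hence the weight-conversion step mentioned in the review.
    pass
```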
```python
if config.use_spatial_tuning_adapter and config.backbone_type != "dinov3":
    self.spatial_tuning_adapter = Deimv2SpatialTuningAdapter(config)
```
This is weird, let's move this to the Deimv2ConvEncoder as well (if I understood correctly)
As described in the first reply regarding the flag, the STA is now inside Deimv2DINOv3ConvEncoder.
```python
if self.config.backbone_type == "dinov3":
    proj_feats = self.dinov3_backbone(pixel_values)
elif self.config.encoder_type == "lite":
    features = self.backbone(pixel_values, pixel_mask)
    proj_feats = [source for source, mask in features]
else:
    features = self.backbone(pixel_values, pixel_mask)
    proj_feats = [self.encoder_input_proj[level](source) for level, (source, mask) in enumerate(features)]
```
Here we should be able to have a single backbone call:
```diff
-if self.config.backbone_type == "dinov3":
-    proj_feats = self.dinov3_backbone(pixel_values)
-elif self.config.encoder_type == "lite":
-    features = self.backbone(pixel_values, pixel_mask)
-    proj_feats = [source for source, mask in features]
-else:
-    features = self.backbone(pixel_values, pixel_mask)
-    proj_feats = [self.encoder_input_proj[level](source) for level, (source, mask) in enumerate(features)]
+proj_feats = self.backbone(pixel_values)
```
```python
level_start_index = torch.cat((spatial_shapes.new_zeros((1,)), spatial_shapes.prod(1).cumsum(0)[:-1]))

if self.training and self.config.num_denoising > 0 and labels is not None:
    from ..d_fine.modeling_d_fine import get_contrastive_denoising_training_group
```
Let's import it at the top, and it will be copied to the modeling file thanks to modular
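Concretely, the in-function import above would move to the top of the modular file (file name assumed from the usual modular layout), for example:

```python
# Top of modular_deimv2.py (sketch): with the import hoisted here, the modular
# converter copies the referenced helper into the generated modeling_deimv2.py,
# so no local import is needed inside forward().
from transformers.models.d_fine.modeling_d_fine import get_contrastive_denoising_training_group
```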
```python
init_reference_points = reference_points_unact.detach()

from ..d_fine.modeling_d_fine import DFineModelOutput
```
Let's define a Deimv2ModelOutput inheriting from DFineModelOutput above
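That is, something along these lines in the modular file (a minimal sketch; any extra fields are left out):

```python
# Sketch of the suggested output class: inheriting the D-FINE output dataclass lets the
# modular converter materialize a Deimv2ModelOutput in the generated modeling file.
from transformers.models.d_fine.modeling_d_fine import DFineModelOutput


class Deimv2ModelOutput(DFineModelOutput):
    pass
```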
```python
@property
def _tied_weights_keys(self):
    keys = {
        r"class_embed.(?![0])\d+": r"^class_embed.0",
        "class_embed": "model.decoder.class_embed",
        "bbox_embed": "model.decoder.bbox_embed",
    }
    if getattr(self.config, "share_bbox_head", False):
        keys[r"model\.decoder\.bbox_embed\.(?![0])\d+"] = r"model.decoder.bbox_embed.0"
        keys[r"bbox_embed.(?![0])\d+"] = r"bbox_embed.0"
    return keys
```
Not sure why we would have this, but let's replace it with the standard form:
```diff
-@property
-def _tied_weights_keys(self):
-    keys = {
-        r"class_embed.(?![0])\d+": r"^class_embed.0",
-        "class_embed": "model.decoder.class_embed",
-        "bbox_embed": "model.decoder.bbox_embed",
-    }
-    if getattr(self.config, "share_bbox_head", False):
-        keys[r"model\.decoder\.bbox_embed\.(?![0])\d+"] = r"model.decoder.bbox_embed.0"
-        keys[r"bbox_embed.(?![0])\d+"] = r"bbox_embed.0"
-    return keys
+_tied_weights_keys = {
+    r"class_embed.(?![0])\d+": r"^class_embed.0",
+    "class_embed": "model.decoder.class_embed",
+    "bbox_embed": "model.decoder.bbox_embed",
+}
```
Actually I didn't make this change because share_bbox_head varies across presets; "Pico" / "Femto" / "Atto" share bbox heads but the rest don't; happy to update this if you think the above approach would be better and I'm missing something!
Oh I see. Not sure if this would cause an issue tbh, but it might be ok. I'll let a core maintainer see if that can work. Otherwise the alternative would be to just copy the shared bbox embeds when converting the model, but that's not ideal.
Should be fine tbh, we rarely have these cases, but when they appear it's usually modified in the init. I actually like this version more as a property - although I'm not sure why we have to use getattr, that's my only nit.
cc @Cyrilvallez in any case if you disagree
Thanks for your time and the amazing review round, @yonigozlan! I've addressed the comments and made sure the test suite passes after the update 🤗❤️

Just a quick note: the test harness should already cover the different variants :)
Thanks a lot for iterating on this so quickly @harshaljanjani! This looks much better, almost good to merge on my side.
Once you have made the last changes I suggested, you can directly ping a core maintainer for a final review.
Also don't forget to merge/rebase with your next modifications!
```python
class Deimv2SwiGLUFFN(nn.Module):
    def __init__(self, in_features: int, hidden_features: int, out_features: int):
        super().__init__()
        self.w12 = nn.Linear(in_features, 2 * hidden_features)
        self.w3 = nn.Linear(hidden_features, out_features)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        x12 = self.w12(hidden_states)
        x1, x2 = x12.chunk(2, dim=-1)
        hidden = F.silu(x1) * x2
        return self.w3(hidden)
```
Sorry for the back and forth on this, but if we can't directly inherit from DINOv3ViTGatedMLP here anyway, let's use the format/naming of the DINOv2 SwiGLUFFN. Not splitting the linear layers might result in slightly better performance too.
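For reference, a hedged sketch of what adopting the DINOv2-style naming could look like here, keeping the fused input projection as suggested (the weights_in/weights_out names are taken from the DINOv2 implementation; the constructor signature is kept from the PR code above):

```python
# Rough sketch, assuming the weights_in / weights_out naming of the DINOv2 SwiGLUFFN;
# the fused 2 * hidden_features projection is kept rather than split into two layers.
import torch
import torch.nn as nn


class Deimv2SwiGLUFFN(nn.Module):
    def __init__(self, in_features: int, hidden_features: int, out_features: int):
        super().__init__()
        self.weights_in = nn.Linear(in_features, 2 * hidden_features)
        self.weights_out = nn.Linear(hidden_features, out_features)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states = self.weights_in(hidden_states)
        gate, value = hidden_states.chunk(2, dim=-1)
        return self.weights_out(nn.functional.silu(gate) * value)
```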
```python
class Deimv2GAPFusion(nn.Module):
    def __init__(self, config: Deimv2Config, channels: int):
        super().__init__()
        self.cv = Deimv2ConvNormLayer(config, channels, channels, 1, 1, activation=config.activation_function)
```
Can we find a better/more descriptive name than cv?
```python
if pixel_mask is None:
    pixel_mask = torch.ones(((batch_size, height, width)), device=device)
```
Just a heads up that pixel_mask is not used here. This is not on you, as pixel_mask is not supported in dinov3 for now. I'll add it later, but this is out of scope for this PR. Would you mind adding a TODO comment to pass the mask to the backbone once dinov3 supports it? Thanks!
```python
class Deimv2MLPPredictionHead(DFineMLP):
    pass
```
My bad, removed!
```python
if is_dinov3:
    self.backbone = Deimv2DINOv3ConvEncoder(config)
else:
    self.backbone = Deimv2ConvEncoder(config)
```
Here I think it would make more sense to name these self.conv_encoder. I know in DFine this is named backbone as well, but here we include the feature projections, STA, etc., so this is more than a backbone. Plus, it will avoid having "backbone.backbone", which I don't like in DFine.
Agreed; I thought it was a bit weird too. Made the broader self.conv_encoder change wherever needed :)
```
@@ -0,0 +1,68 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
```
```diff
-<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+<!--Copyright 2026 The HuggingFace Team. All rights reserved.
```
```
@@ -0,0 +1,29 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
```
```diff
-# Copyright 2025 The HuggingFace Team. All rights reserved.
+# Copyright 2026 The HuggingFace Team. All rights reserved.
```
```
@@ -0,0 +1,784 @@
# Copyright 2025 The HuggingFace Inc. team.
```
```diff
-# Copyright 2025 The HuggingFace Inc. team.
+# Copyright 2026 The HuggingFace Inc. team.
```
```
@@ -0,0 +1,1142 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
```
```diff
-# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
```
```
# coding = utf-8
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
```
```diff
-# coding = utf-8
-# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
```
Thanks a lot for the review cycle! I've addressed all the newer comments in commit 4ad0dc5 and verified that there are no regressions in the test suite 🤗

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Good day @ArthurZucker; just a gentle ping to check whether there are any updates regarding the final review, thank you!

Also pinging @vasqu @Cyrilvallez for core maintainer review :)

@m-matthias Thanks for flagging this in the PR; it turns out DEIMv2 suffered from the same coupling in the copy-over. I've pushed the fix in fb1f387; it's independent of the D-FINE fix (also cc: @Abineshabee!). You both are awesome 🤗

SGTM, I reviewed the other PR but would like to wait on Yoni before merging. Can you sync with main here btw? We updated the linter, so the false positive shouldn't appear anymore.
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, d_fine, deimv2

Yepp; the false positive's gone, awesome!

Good day; just checking in to see if there are any updates!

run-slow: d_fine, deimv2

@harshaljanjani Sorry, no, everything should be ready; one last sanity check, then merging 🤗 Thanks a lot for the contribution, it definitely wasn't an easy model!
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, d_fine, deimv2

This comment contains models: ["models/d_fine", "models/deimv2"]

[For maintainers] Suggested jobs to run (before merge): run-slow: auto, d_fine, deimv2

Had tons of fun; thank you as well for the timely review cycles on the PR, @vasqu and @yonigozlan! 😊
* init: Add files (v1)
* fix: Fix ci/circleci: check_repository_consistency
* feat: Add support and test harness for all variants
* fix: Fix ci/circleci: check_repository_consistency
* refactor: Resolve review comments
* refactor: Resolve second review round
* nit: Fix copyright year
* refactor: Resolve third review round
* revert: Adhere to the pattern from yonigozlan
* nit: Clarify the docstring
* refactor: Resolve fourth review round
* refactor: Closing in on the final set of nits
* fix: Resolve merge conflicts
* fix: Add loss override and address nits
* nits: Fix minor issues
* fixup their init weights
* fix: Fix loss coupling issue
* fix date

Co-authored-by: vasqu <antonprogamer@gmail.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>


What does this PR do?
→ This PR adds DEIMv2 to Transformers!
→ Important: I've linked two notebooks: a Colab notebook here demonstrating the functionality, predictions, checkpoint conversion, and the passing test suite for all preset types, and another notebook here that demos the detailed fine-tuning and predictions on this preset: https://huggingface.co/Intellindust/DEIMv2_HGNetv2_N_COCO.
References:
→ Model Checkpoints
→ GitHub Pages
→ GitHub Repository
→ Paper
→ All variants, including the base variant as well as those requiring LiteEncoder and DINOv3+STA support, are now complete and working, as demonstrated in the Colab notebook (I've used one preset from each of the three types for the demo) 🥳
Closes #41211 and completes #41327.
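For convenience, a hedged inference sketch follows; it assumes the checkpoint above hosts Transformers-format weights and that DEIMv2 is registered with the Auto classes, so adjust the checkpoint id and threshold as needed.

```python
# Hedged usage sketch, not the PR's documented example: the checkpoint id is taken from
# the description above, and the Auto-class registration is an assumption.
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForObjectDetection

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("Intellindust/DEIMv2_HGNetv2_N_COCO")
model = AutoModelForObjectDetection.from_pretrained("Intellindust/DEIMv2_HGNetv2_N_COCO")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes into (score, label, box) triples in image coordinates.
results = processor.post_process_object_detection(
    outputs, target_sizes=[image.size[::-1]], threshold=0.5
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{model.config.id2label[label.item()]}: {score:.2f} {box.tolist()}")
```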
Before submitting