37 commits
eaef822
init: Add files (v1)
harshaljanjani Feb 27, 2026
ddc1bd7
fix: Fix ci/circleci: check_repository_consistency
harshaljanjani Feb 27, 2026
85c7356
feat: Add support and test harness for all variants
harshaljanjani Mar 1, 2026
adc4079
fix: Fix ci/circleci: check_repository_consistency
harshaljanjani Mar 1, 2026
81a3d06
Merge branch 'main' into add-deimv2
harshaljanjani Mar 1, 2026
39d300e
refactor: Resolve review comments
harshaljanjani Mar 17, 2026
476d69f
Merge branch 'main' into add-deimv2
harshaljanjani Mar 19, 2026
4ad0dc5
refactor: Resolve second review round
harshaljanjani Mar 19, 2026
16f2d07
nit: Fix copyright year
harshaljanjani Mar 19, 2026
78eaf93
Merge branch 'main' into add-deimv2
harshaljanjani Mar 19, 2026
dbe577b
Merge branch 'main' into add-deimv2
harshaljanjani Mar 21, 2026
1259628
Merge branch 'main' into add-deimv2
harshaljanjani Mar 28, 2026
31ee908
refactor: Resolve third review round
harshaljanjani Mar 28, 2026
4a3a877
revert: Adhere to the pattern from yonigozlan
harshaljanjani Mar 29, 2026
558c2af
Merge branch 'main' into add-deimv2
harshaljanjani Mar 30, 2026
ada78bf
nit: Clarify the docstring
harshaljanjani Mar 30, 2026
496ce9c
refactor: Resolve fourth review round
harshaljanjani Mar 31, 2026
5a12a56
Merge branch 'main' into add-deimv2
harshaljanjani Mar 31, 2026
85b4079
Merge branch 'main' into add-deimv2
harshaljanjani Apr 16, 2026
422a440
refactor: Closing in on the final set of nits
harshaljanjani Apr 16, 2026
f932158
Merge branch 'main' into add-deimv2
harshaljanjani Apr 20, 2026
b833ee3
fix: Resolve merge conflicts
harshaljanjani Apr 20, 2026
58a6424
fix: Add loss override and address nits
harshaljanjani Apr 21, 2026
7dd0fb1
nits: Fix minor issues
harshaljanjani Apr 22, 2026
943f4bb
fixup their init weights
vasqu Apr 22, 2026
6213518
Merge branch 'main' into add-deimv2
vasqu Apr 22, 2026
07e3831
[CB] Changes for long generation (#45530)
remi-or Apr 23, 2026
706acf5
Allow for registered experts from kernels hub (#45577)
winglian Apr 23, 2026
bd69ed2
[docs] multi-turn tool calling (#45554)
stevhliu Apr 23, 2026
8e64e53
[AMD CI] Fix expectations for Gemma3n (#45602)
Abdennacer-Badaoui Apr 23, 2026
0323898
fix transformers + torchao nvfp4 serialization (#45573)
vkuzo Apr 23, 2026
533c4e1
SonicMoe (#45433)
IlyasMoutawwakil Apr 23, 2026
1e071b2
Processing Utils: continue when content is a string (#45605)
RyanMullins Apr 23, 2026
57f9936
qa: bumped mlinter and allow local override (#45585)
tarekziade Apr 23, 2026
fb1f387
fix: Fix loss coupling issue
harshaljanjani Apr 23, 2026
3629f13
Merge branch 'main' into add-deimv2
harshaljanjani Apr 23, 2026
967335e
Merge remote-tracking branch 'pr/44339' into merge-cluster-cluster-41…
evalstate Apr 24, 2026
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -899,6 +899,8 @@
title: DAB-DETR
- local: model_doc/deformable_detr
title: Deformable DETR
- local: model_doc/deimv2
title: DEIMv2
- local: model_doc/deit
title: DeiT
- local: model_doc/depth_anything
65 changes: 65 additions & 0 deletions docs/source/en/model_doc/deimv2.md
@@ -0,0 +1,65 @@
<!--Copyright 2026 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
*This model was released on 2025-09-25 and added to Hugging Face Transformers on 2026-04-22.*

# DEIMv2

## Overview

DEIMv2 (DETR with Improved Matching v2) was proposed in [DEIMv2: Real-Time Object Detection Meets DINOv3](https://huggingface.co/papers/2509.20787) by Shihua Huang, Yongjie Hou, Longfei Liu, Xuanlong Yu, and Xi Shen.

The abstract from the paper is the following:

*Driven by the simple and effective Dense O2O, DEIM demonstrates faster convergence and enhanced performance. In this work, we extend it with DINOv3 features, resulting in DEIMv2. DEIMv2 spans eight model sizes from X to Atto, covering GPU, edge, and mobile deployment. For the X, L, M, and S variants, we adopt DINOv3-pretrained / distilled backbones and introduce a Spatial Tuning Adapter (STA), which efficiently converts DINOv3's single-scale output into multi-scale features and complements strong semantics with fine-grained details to enhance detection. For ultra-lightweight models (Nano, Pico, Femto, and Atto), we employ HGNetv2 with depth and width pruning to meet strict resource budgets. Together with a simplified decoder and an upgraded Dense O2O, this unified design enables DEIMv2 to achieve a superior performance-cost trade-off across diverse scenarios, establishing new state-of-the-art results. Notably, our largest model, DEIMv2-X, achieves 57.8 AP with only 50.3M parameters, surpassing prior X-scale models that require over 60M parameters for just 56.5 AP. On the compact side, DEIMv2-S is the first sub-10M model (9.71M) to exceed the 50 AP milestone on COCO, reaching 50.9 AP. Even the ultra-lightweight DEIMv2-Pico, with just 1.5M parameters, delivers 38.5 AP, matching YOLOv10-Nano (2.3M) with ~50% fewer parameters.*

## Usage

```python
from transformers import AutoImageProcessor, AutoModelForObjectDetection
from transformers.image_utils import load_image

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(url)

image_processor = AutoImageProcessor.from_pretrained("harshaljanjani/DEIMv2_HGNetv2_N_COCO_Transformers")
model = AutoModelForObjectDetection.from_pretrained("harshaljanjani/DEIMv2_HGNetv2_N_COCO_Transformers", device_map="auto")

inputs = image_processor(images=image, return_tensors="pt").to(model.device)
outputs = model(**inputs)

results = image_processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=[image.size[::-1]]
)

for result in results:
    for score, label, box in zip(result["scores"], result["labels"], result["boxes"]):
        box = [round(i, 2) for i in box.tolist()]
        print(f"Detected {model.config.id2label[label.item()]} with confidence {round(score.item(), 3)} at location {box}")
```

## Deimv2Config

[[autodoc]] Deimv2Config

## Deimv2Model

[[autodoc]] Deimv2Model
- forward

## Deimv2ForObjectDetection

[[autodoc]] Deimv2ForObjectDetection
- forward
16 changes: 8 additions & 8 deletions docs/source/en/modeling_rules.md
@@ -13,22 +13,22 @@ specific language governing permissions and limitations under the License.

# Model structure rules

Transformers enforces a set of static rules on every `modeling_*.py`, `modular_*.py`, and `configuration_*.py` file. The [mlinter](https://github.com/huggingface/transformers-mlinter) tool checks them as part of `make typing` and errors out if violations are found.
Transformers enforces a set of static rules on every `modeling_*.py`, `modular_*.py`, and `configuration_*.py` file. The [mlinter](https://github.com/huggingface/transformers-mlinter) package provides the checker engine, and the repository keeps its active rule set in `utils/rules.toml`. That local TOML lets us enable, disable, or tweak rules quickly without waiting for a new `transformers-mlinter` release.

These are the expected model conventions for adding or changing modeling code. They keep the codebase consistent and ensure compatibility with features like pipeline parallelism, device maps, and weight tying.

## Running the checker

`make typing` runs `mlinter` alongside the `ty` type checker. Run `mlinter` on its own with the following commands.
`make typing` runs `mlinter` alongside the `ty` type checker through the repo wrapper, so it picks up `utils/rules.toml`. Run the same wrapper directly with the following commands.

```bash
mlinter # check all modeling files
mlinter --changed-only # check only files changed vs origin/main
mlinter --list-rules # list all rules and their enabled status
mlinter --rule TRF001 # show built-in docs for a specific rule
python utils/check_modeling_structure.py # check all modeling files
python utils/check_modeling_structure.py --changed-only # check only files changed vs origin/main
python utils/check_modeling_structure.py --list-rules # list all rules and their enabled status
python utils/check_modeling_structure.py --rule TRF001 # show built-in docs for a specific rule
```

The `--changed-only` flag is the fastest option during development. It only checks the files you've modified relative to the main branch.
The `--changed-only` flag is the fastest option during development. It only checks the files you've modified relative to the main branch. If you invoke `mlinter` directly instead of the wrapper, pass `--rules-toml utils/rules.toml` so local overrides are applied.

## Fixing a violation

@@ -52,7 +52,7 @@ Use the rule ID to look up the fix in the [rules reference](#rules-reference). T

## Rules reference

Each rule below lists what it enforces and a diff showing the fix. Run `mlinter --rule TRF001` to see the built-in docs for any rule.
Each rule below lists what it enforces and a diff showing the fix. Run `python utils/check_modeling_structure.py --rule TRF001` to see the built-in docs for any rule with the repo's current rule set.

<!-- BEGIN RULES REFERENCE -->

95 changes: 84 additions & 11 deletions docs/source/en/serve-cli/serving.md
@@ -456,7 +456,7 @@ data: {"id":"f47ac10b-58cc-4372-a567-0e02b2c3d479","choices":[{"delta":{"content

### Audio-based completions

Multimodal models like [Gemma 4](https://huggingface.co/google/gemma-4-E2B-it) and [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-3B) accept audio input using the OpenAI `input_audio` content type. The audio must be base64-encoded and the format (`mp3` or `wav`) must be specified.
Multimodal models like [Gemma 4](https://huggingface.co/google/gemma-4-E2B-it) and [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-3B) accept audio input through the OpenAI `input_audio` content type. Base64-encode the audio and specify the format (`mp3` or `wav`).
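
A minimal sketch of the request shape, assuming an OpenAI client `client` already configured for the server and a local `sample.wav` (both are assumptions, not shown here):

```py
import base64

# Base64-encode a local audio file.
with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-3B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is said in this audio?"},
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        }
    ],
)
print(completion.choices[0].message.content)
```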

<hfoptions id="audio-completions">
<hfoption id="huggingface_hub">
@@ -695,7 +695,7 @@ data: {"id":"cb997e1d-98b9-414a-be89-1880288610ef","choices":[{"delta":{"content
> [!WARNING]
> The `audio_url` content type is an extension not part of the OpenAI standard and may change in future versions.

As a convenience, audio can also be passed by URL using the `audio_url` content type, avoiding the need for base64 encoding.
You can also pass audio by URL with the `audio_url` content type to skip base64 encoding.

```python
completion = client.chat.completions.create(
@@ -717,7 +717,7 @@
> [!WARNING]
> The `video_url` content type is an extension not part of the OpenAI standard and may change in future versions.

Video input is supported using the `video_url` content type. If the model supports audio (e.g. Gemma 4, Qwen2.5-Omni), the audio track is automatically extracted from the video and processed alongside the visual frames.
Use the `video_url` content type for video input. If the model supports audio (e.g. Gemma 4, Qwen2.5-Omni), the server extracts the audio track from the video and processes it with the visual frames.

> [!TIP]
> Video processing requires [torchcodec](https://github.com/pytorch/torchcodec). Install it with `pip install torchcodec`.
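
A minimal sketch, assuming the `video_url` payload mirrors the `audio_url` shape shown above and that `client` is already configured; the URL below is a placeholder:

```py
completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-3B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what happens in this video."},
                {"type": "video_url", "video_url": {"url": "https://example.com/sample.mp4"}},
            ],
        }
    ],
)
print(completion.choices[0].message.content)
```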
@@ -934,7 +934,7 @@ data: {"id":"cb997e1d-98b9-414a-be89-1880288610ef","choices":[{"delta":{"content
</hfoption>
</hfoptions>

### Multi-turn conversations
### Multi-turn conversations[[completions]]

To have a multi-turn conversation, include the full conversation history in the `messages` list with alternating `user` and `assistant` roles. Like all OpenAI-compatible servers, the API is stateless, so every request must contain the complete conversation history.

@@ -954,7 +954,7 @@
print(completion.choices[0].message.content)
```

The follow-up question "How many people live there?" relies on the prior context, and the model answers about Paris accordingly.
The follow-up question "How many people live there?" relies on the prior context, so the model answers about Paris.

```
As of 2021, the population of Paris is approximately 2.2 million people.
@@ -1466,7 +1466,7 @@ data: {"content_index":0,"delta":"This ","item_id":"msg_a1b2c3d4","output_index"
> [!WARNING]
> The `audio_url` content type is an extension not part of the OpenAI standard and may change in future versions.

As a convenience, audio can also be passed by URL using the `audio_url` content type, avoiding the need for base64 encoding.
You can also pass audio by URL with the `audio_url` content type to skip base64 encoding.

```python
response = client.responses.create(
@@ -1621,7 +1621,7 @@ data: {"content_index":0,"delta":"Based ","item_id":"msg_b2c3d4e5","output_index"
</hfoption>
</hfoptions>

### Multi-turn conversations
### Multi-turn conversations[[responses]]

For multi-turn conversations, pass a list of messages with `role` keys in the `input` field. Like all OpenAI-compatible servers, the API is stateless, so every request must contain the complete conversation history.

@@ -1643,7 +1643,7 @@
print(response.output[0].content[0].text)
```

The follow-up question "How many people live there?" relies on the prior context, and the model answers about Paris accordingly.
The follow-up question "How many people live there?" relies on the prior context, so the model answers about Paris.

```
As of 2021, Paris has a population of approximately 2.8 million people.
@@ -1734,15 +1734,15 @@ The stream ends with exactly one terminal event, `ready` (success) or `error` (f

## Timeout

`transformers serve` supports different requests by different models. Each model loads on demand and stays in GPU memory. Models unload automatically after 300 seconds of inactivity to free up GPU memory. Set `--model-timeout` to a different value in seconds, or `-1` to disable unloading entirely.
`transformers serve` handles requests for any model. Each model loads on demand and stays in GPU memory. Models unload automatically after 300 seconds of inactivity to free GPU memory. Set `--model-timeout` to a different value in seconds, or `-1` to disable unloading.

```shell
transformers serve --model-timeout 400
```

### Loading examples

See the example responses below for a freshly downloaded model, a model loaded from your local cache (skips the download stage), and a model that already exists in memory.
The examples below show responses for a freshly downloaded model, a model loaded from your local cache (skips the download stage), and a model already in memory.

<hfoptions id="load-model-examples">
<hfoption id="fresh load">
@@ -1784,7 +1784,7 @@ data: {"status": "ready", "model": "org/model@main", "cached": true}
The `transformers serve` server supports OpenAI-style function calling. Models trained for tool-use generate structured function calls that your application executes.

> [!NOTE]
> Tool calling is currently limited to the Qwen model family.
> Tool calling works with any model whose tokenizer declares tool call tokens. Qwen and Gemma 4 work out of the box. Open an [issue](https://github.com/huggingface/transformers/issues/new/choose) to request support for a specific model.

Define tools as a list of function specifications following the OpenAI format.
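
A minimal sketch of such a list, with a single hypothetical `get_weather` function (the multi-turn examples below assume a `tools` list along these lines):

```py
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name, e.g. San Francisco"},
                },
                "required": ["city"],
            },
        },
    }
]
```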

@@ -1846,6 +1846,79 @@ for event in response:
print(event)
```

### Multi-turn tool calling

After the model returns a tool call, execute the function locally, then send the result back in a follow-up request to get the model's final answer. The pattern differs slightly between the two APIs. See the [OpenAI function calling guide](https://developers.openai.com/api/docs/guides/function-calling?api-mode=chat) for the full spec.

The examples below reuse the `tools` list defined above.

<hfoptions id="multi-turn-tool-calling">
<hfoption id="v1/chat/completions">

Pass the tool result as a `role: "tool"` message with the matching `tool_call_id`.

```py
import json

# Model returns a tool call
messages = [{"role": "user", "content": "What's the weather like in San Francisco?"}]
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=messages,
    tools=tools,
)
assistant_message = response.choices[0].message

# Execute the tool locally
tool_call = assistant_message.tool_calls[0]
result = {"temperature": 22, "condition": "sunny"}  # your actual function call here

# Send the tool result back
messages.append(assistant_message)
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": json.dumps(result),
})
final_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=messages,
    tools=tools,
)
print(final_response.choices[0].message.content)
```

</hfoption>
<hfoption id="v1/responses">

Pass the tool result as a `function_call_output` item in the `input` list of the follow-up request.

```py
import json

user_message = {"role": "user", "content": "What's the weather like in San Francisco?"}
response = client.responses.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    input=[user_message],
    tools=tools,
    stream=False,
)
tool_call = next(item for item in response.output if item.type == "function_call")

result = {"temperature": 22, "condition": "sunny"}

final_response = client.responses.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    input=[
        user_message,
        tool_call.model_dump(exclude_none=True),
        {"type": "function_call_output", "call_id": tool_call.call_id, "output": json.dumps(result)},
    ],
    tools=tools,
    stream=False,
)
print(final_response.output_text)
```

</hfoption>
</hfoptions>

## Port forwarding

Port forwarding lets you serve models from a remote server. Make sure you have SSH access to the server, then run this command on your local machine.
4 changes: 3 additions & 1 deletion setup.py
@@ -124,7 +124,9 @@
"rjieba",
"rouge-score!=0.0.7,!=0.0.8,!=0.1,!=0.1.1",
"ruff==0.14.10",
"transformers-mlinter==0.1.0",
# When bumping `transformers-mlinter`, sync repo-local rule overrides from
# `utils/rules.toml` back into the released package.
"transformers-mlinter==0.1.1",
"ty==0.0.20",
# `sacrebleu` not used in `transformers`. However, it is needed in several tests, when a test calls
# `evaluate.load("sacrebleu")`. This metric is used in the examples that we use to test the `Trainer` with, in the
2 changes: 1 addition & 1 deletion src/transformers/dependency_versions_table.py
@@ -56,7 +56,7 @@
"rjieba": "rjieba",
"rouge-score": "rouge-score!=0.0.7,!=0.0.8,!=0.1,!=0.1.1",
"ruff": "ruff==0.14.10",
"transformers-mlinter": "transformers-mlinter==0.1.0",
"transformers-mlinter": "transformers-mlinter==0.1.1",
"ty": "ty==0.0.20",
"sacrebleu": "sacrebleu>=1.4.12,<2.0.0",
"sacremoses": "sacremoses",
18 changes: 14 additions & 4 deletions src/transformers/generation/configuration_utils.py
@@ -1556,8 +1556,10 @@ class ContinuousBatchingConfig:
            Number of blocks in the KV cache. Auto-inferred from GPU memory when `None`.
        max_batch_tokens (`int`, *optional*):
            Maximum number of tokens in a batch. Auto-inferred from GPU memory when `None`.
        max_memory_percent (`float`, *optional*, defaults to 0.8):
            Maximum percentage of free GPU memory (after the model is loaded) to use for the KV cache.
        max_memory_percent (`float`, *optional*):
            Maximum percentage of free GPU memory (after the model is loaded) to use for the KV cache. When `None`,
            resolved at runtime to 0.9 if there is no logit processing and 0.8 if there is, to leave headroom for
            vocabulary-sized temporary tensors.
        max_blocks_per_request (`int`, *optional*, defaults to 0):
            Maximum blocks per request, used in the `flash_attn_with_kvcache` fast decode path to dimension
            the block table. Setting this to 0 disables the fast decode path.
@@ -1607,8 +1609,9 @@ class ContinuousBatchingConfig:
    num_blocks: int | None = None
    max_batch_tokens: int | None = None

    # The max percentage of free GPU memory (after the model is loaded) to use for the KV cache.
    max_memory_percent: float = 0.8
    # The max percentage of free GPU memory (after the model is loaded) to use for the KV cache. If None, auto resolved
    # to 0.9 (no logit processing) or 0.8 (logit processing) to leave headroom for temporary tensors.
    max_memory_percent: float | None = None

    # This is only used in the flash_attn_with_kvcache fast decode path to dimension the block table. If it is set to 0,
    # the fast decode path will not be used. Currently turned off by default.
@@ -1773,6 +1776,13 @@ def decide_use_async_batching(self, is_attn_mask_needed: bool) -> bool:
        )
        return self.use_async_batching

    def resolve_max_memory_percent(self, has_logit_processors: bool) -> None:
        """Resolves `max_memory_percent` when unset: 0.9 without logit processors, 0.8 with them. Active processors
        materialize `[N, V]` intermediates (e.g. top-p sort, softmax) that get captured into the CUDA graph pool, so
        the cache has to cede some budget to that pool."""
        if self.max_memory_percent is None:
            self.max_memory_percent = 0.8 if has_logit_processors else 0.9

    def resolve_sentinel_values(self) -> None:
        """For some parameters (padding intervals and max cached graphs), the default is a sentinel value of 0: that
        way, if the user specifies a value for those parameters, we know they want it used, ie. we turn on cuda graphs.
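
A hedged illustration of the resolution rule above (a sketch; it assumes `ContinuousBatchingConfig` can be constructed directly with defaults, which this diff does not show):

```py
from transformers.generation.configuration_utils import ContinuousBatchingConfig

cfg = ContinuousBatchingConfig()
cfg.resolve_max_memory_percent(has_logit_processors=False)
assert cfg.max_memory_percent == 0.9  # unset + no logit processors: larger cache budget

cfg = ContinuousBatchingConfig(max_memory_percent=0.7)
cfg.resolve_max_memory_percent(has_logit_processors=True)
assert cfg.max_memory_percent == 0.7  # an explicit user value is never overridden
```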