[VLMs] support attention backends#37576
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the "Ready for review" button.

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
# Unset attn implementation so it can be set to another one when loading back
model_to_save.config._attn_implementation_autoset = False
This was moved to configuration_utils.py, where the flag is deleted from all sub-configs. Otherwise we would be unsetting it only on the base config.
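For context, a minimal sketch of what "deleting it from all sub-configs" amounts to (illustrative only — the helper name is made up and this is not the actual `configuration_utils.py` code):

```python
from transformers import PretrainedConfig


def unset_attn_autoset(config: PretrainedConfig) -> None:
    # Illustrative helper (not the real transformers code): clear the flag on the
    # top-level config and recurse into nested configs, so composite VLM configs
    # (text_config, vision_config, ...) don't keep a stale value when reloaded.
    if hasattr(config, "_attn_implementation_autoset"):
        config._attn_implementation_autoset = False
    for value in vars(config).values():
        if isinstance(value, PretrainedConfig):
            unset_attn_autoset(value)
```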
fx_compatible = True
test_pruning = False
test_missing_keys = False
test_head_masking = False  # new attn API doesn't support head mask
We'll be removing head_mask in v5, and it was discussed in this PR that we can deprecate it for now. So there is no need to fix the test or to support head masking with the new interface.
@qubvel could you give this an initial review if you have time, while Arthur is off? :)
qubvel
left a comment
Thanks! Just a few review questions
@qubvel comments addressed. The skipped Kosmos test can be run now. EDIT: Kosmos apparently doesn't support padding; this test is skipped in the new Kosmos as well.
@qubvel @ArthurZucker I updated the PR after the latest big refactor on VLMs. Can you review when you have time?
ArthurZucker
left a comment
Nice work 🧼
A lot of this refactoring would be easier if we also applied modular to idefix etc.!
ArthurZucker
left a comment
Careful with the 2-3 places where the new attention softmax is in float32.
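For reference, the pattern being flagged looks roughly like this in eager attention paths (a simplified sketch, not the exact modeling code of any model touched in this PR):

```python
import torch
from torch import nn


def eager_attention(query, key, value, attention_mask=None, scaling=None):
    # Simplified eager attention: the softmax is computed in float32 for numerical
    # stability and cast back to the query dtype so downstream matmuls stay in
    # half precision. The review point is to double-check where this upcast
    # should (or should not) survive the refactor.
    if scaling is None:
        scaling = query.shape[-1] ** -0.5
    attn_weights = torch.matmul(query, key.transpose(-1, -2)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    return torch.matmul(attn_weights, value), attn_weights
```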
Yeah, agreed. We can do modular in a separate PR so we don't bloat this one up.
Completely!
# Since we use packing, if Flash-Attn 2 is selected we rely on position_ids
if self.config._attn_implementation == "flash_attention_2":
    kwargs["position_ids"] = kwargs["position_ids"].to(hidden_states.device, non_blocking=True)
    attention_mask = None
Hi @zucchini-nlp, why did you delete this line? Now the attention mask is used by default and there is an error when using Flash Attention 2 in my tests. Maybe I missed something, but I think this was important. At least I don't get the `RuntimeError: cu_seqlens_q must have shape (batch_size + 1)` error when setting `attention_mask = None` if we use FA2.
Looks like a faulty rebase; I don't remember any tests failing because of this line. We probably need a new test, or to expand an existing one, to test FA2 packing correctly. Thanks for flagging!
Np! And yes, I don't see how the tests could pass on this; I'm not familiar enough with the tests yet.
Should I open a PR to add this line back, or will you take care of it?
Feel free to open a PR, I am a bit stuck on other tasks :)
And for the test, I realized that the one we have is for generative models only. It would be nice to add something for all models if possible:
transformers/tests/test_modeling_common.py, line 4106 at 1b00966
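A rough sketch of the kind of FA2 packing check discussed here (hypothetical, not a test from the repository; `model` is assumed to be a generative model loaded with `attn_implementation="flash_attention_2"` on GPU, and `seq_a` / `seq_b` are 1D tensors of token ids):

```python
import torch


def check_fa2_packing(model, seq_a, seq_b):
    # Pack two sequences into one row; position_ids restart at the boundary so
    # Flash Attention 2 can infer the split. attention_mask stays None, otherwise
    # cu_seqlens get built from the 2D mask and the shapes no longer match.
    device = next(model.parameters()).device
    packed_ids = torch.cat([seq_a, seq_b], dim=-1).unsqueeze(0).to(device)
    position_ids = torch.cat(
        [torch.arange(seq_a.shape[-1]), torch.arange(seq_b.shape[-1])]
    ).unsqueeze(0).to(device)

    with torch.no_grad():
        packed = model(input_ids=packed_ids, position_ids=position_ids, attention_mask=None)
        out_a = model(input_ids=seq_a.unsqueeze(0).to(device))
        out_b = model(input_ids=seq_b.unsqueeze(0).to(device))

    # Each packed segment should match the corresponding unpacked forward pass.
    torch.testing.assert_close(packed.logits[:, : seq_a.shape[-1]], out_a.logits, atol=1e-3, rtol=1e-3)
    torch.testing.assert_close(packed.logits[:, seq_a.shape[-1] :], out_b.logits, atol=1e-3, rtol=1e-3)
```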
@zucchini-nlp I opened a PR for a hotfix; I don't have time right now to look at the tests, sorry. But I'll try to get familiar with them, I just can't guarantee I'll have time soon for this.
* update models
* why rename
* return attn weights when sdpa
* fixes
* fix attn implementation composite
* fix moshi
* add message
* add typings
* use explicitly all flags for each attn type
* fix some tests
* import what is needed
* kosmos on main has new attention already, yay
* new models in main, run fixup
* won't fix kosmos yet
* fix-copies
* clean up after rebasing
* fix tests
* style
* dont cast attns to fp32
* did we update ruff? oke, let's just do what it asks
* fix pixtral after rebase
What does this PR do?
As per title, another step closer to vLLM + transformers
What was done:
- `kwargs` so vLLM can forward its attention instances
- `self.loss_fn` (see Paligemma: fix generation with Gemma2 #36044 (comment))
- `can_return_tuple`

Fixes #36557, #35634, #36904 and fixes #33963
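For illustration, the user-facing side of attention-backend support in a VLM looks roughly like this (the checkpoint is only an example, and the per-sub-model dict form follows the composite-config behaviour described in the transformers docs):

```python
from transformers import AutoModelForImageTextToText

# Pick one backend for the whole composite model.
model = AutoModelForImageTextToText.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    attn_implementation="sdpa",  # or "eager" / "flash_attention_2"
)

# Or choose a backend per sub-model (vision tower vs. language model).
model = AutoModelForImageTextToText.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    attn_implementation={"vision_config": "sdpa", "text_config": "flash_attention_2"},
)
```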