Add ColQwen2 to 🤗 transformers #35778
Conversation
Feel free to ping us once this is ready for review!
(force-pushed from 025ca25 to 48e0aa5)
Feel free to ping @Cyrilvallez once this is ready for review! 🤗
(force-pushed from 5ec9758 to 7cfd9dc)
yonigozlan left a comment
Hey @tonywu71! Thanks for contributing 🤗. Looks almost ready to go to me; I just pointed out a few nits to change.
```python
)
self.query_prefix = query_prefix or "Query: "

self.tokenizer.padding_side = "left"
```
This should be set when saving the tokenizer/processor
Fixed! The Hugging Face Hub commit with the new processor_config.json can be found here for reference.
Update: after discussion with @yonigozlan, I have realized it makes much more sense to let tokenizer_config.json handle padding_side. I've just applied the necessary changes!
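For reference, a minimal sketch of what this resolution means in practice (repo id from this PR; the local save path is hypothetical): `padding_side` set before `save_pretrained` is serialized into `tokenizer_config.json`, so the processor no longer needs to force it at runtime.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vidore/colqwen2-v1.0-hf")

# ColQwen2 pads query batches on the left; persisting the setting here means
# tokenizer_config.json carries it instead of the processor hardcoding it.
tokenizer.padding_side = "left"
tokenizer.save_pretrained("./colqwen2-tokenizer")  # hypothetical local path
```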
yonigozlan left a comment
Nice, thanks for iterating! I see two small things left to change, then LGTM!
Cyrilvallez left a comment
Hey! Sorry for the delay! This is pretty clean, great work! 🤗 I just left a few last comments!
```python
loss = None
if not return_dict:
    output = (embeddings,) + outputs[2:]
    output[2] = output[2] if output_hidden_states is not None else None
    output[-1] = (outputs.image_hidden_states if pixel_values is not None else None,)
    return (loss,) + output if loss is not None else output

return ColPaliForRetrievalOutput(
    loss=loss,
    embeddings=embeddings,
```
Why are we removing the loss here? 👀
The loss wasn't strictly speaking removed:

- it used to be set to `None`;
- the default value for `loss` in `ColPaliForRetrievalOutput` is `None`.

So I have removed the unneeded lines to make the code clearer.
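A sketch of the equivalence being described (import path assumed from the ColPali modeling file; the dummy tensor is only for illustration):

```python
import torch
from transformers.models.colpali.modeling_colpali import ColPaliForRetrievalOutput

embeddings = torch.zeros(1, 8, 128)  # dummy multi-vector embeddings

# Before: loss was assigned and passed explicitly, but it was always None here.
out_explicit = ColPaliForRetrievalOutput(loss=None, embeddings=embeddings)

# After: ColPaliForRetrievalOutput already defaults loss to None, so omitting
# the argument builds an identical output object with less code.
out_implicit = ColPaliForRetrievalOutput(embeddings=embeddings)

assert out_explicit.loss is None and out_implicit.loss is None
```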
```python
    visual_prompt_prefix: str = "Describe the image.",
    query_prefix: str = "Question: ",
):
    super().__init__(image_processor=image_processor, tokenizer=tokenizer, chat_template=chat_template)
    self.visual_prompt_prefix = visual_prompt_prefix
    self.query_prefix = query_prefix
```
This kind of prefix should be part of the chat_template directly, not hardcoded here 🤗
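For illustration only, a hypothetical sketch of this suggestion, moving the query prefix into a Jinja chat template (this template string is not the one shipped with the PR, and the thread below ultimately keeps the plain attributes to align with ColPali):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("vidore/colqwen2-v1.0-hf")

# Hypothetical template: the query prefix lives in the template instead of a
# hardcoded processor attribute, so it travels with the saved tokenizer.
tok.chat_template = (
    "{% for message in messages %}{{ 'Question: ' + message['content'] }}{% endfor %}"
)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is shown in the image?"}], tokenize=False
)
print(prompt)  # -> "Question: What is shown in the image?"
```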
```python
if is_torch_available():
    import torch
    from torch import nn
```
No need to protect the torch import here!
```python
raise AttributeError(
    "The `initializer_range` attribute is not set in the configuration. Please set it before using the model."
)
```
Let's make sure it is correctly defined in the Config with some default value instead of raising here
The ColQwen2Config already has a default value for initializer_range, so I'll just remove the raise 👍🏼
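A minimal sketch of this resolution, a config that always carries a default so the model never has to raise (class name marked as a sketch; the 0.02 value is an assumption matching the usual transformers default):

```python
from transformers import PretrainedConfig


class ColQwen2ConfigSketch(PretrainedConfig):
    model_type = "colqwen2"

    def __init__(self, initializer_range: float = 0.02, **kwargs):
        # A default here guarantees the attribute exists on every instance,
        # so _init_weights can read it without an AttributeError guard.
        self.initializer_range = initializer_range
        super().__init__(**kwargs)
```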
```python
if inputs_embeds is None:
    inputs_embeds = self.vlm.model.embed_tokens(input_ids)

    if pixel_values is not None:
        pixel_values = pixel_values.type(self.vlm.visual.get_dtype())
        image_embeds = self.vlm.visual(pixel_values, grid_thw=image_grid_thw)
        image_mask = (
            (input_ids == self.config.vlm_config.image_token_id).unsqueeze(-1).expand_as(inputs_embeds)
        )
        image_embeds = image_embeds.to(inputs_embeds.device, inputs_embeds.dtype)
        inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)

    if attention_mask is not None:
        attention_mask = attention_mask.to(inputs_embeds.device)

outputs = self.vlm.model(
    input_ids=None,
    position_ids=position_ids,
    attention_mask=attention_mask,
    past_key_values=past_key_values,
    inputs_embeds=inputs_embeds,
    use_cache=use_cache,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
    cache_position=cache_position,
)
return outputs
```
If you don't mind, I think it would help readability to have this block directly in the main forward instead of splitting it across two functions (due to the large signatures, we have to go back and forth).
No I don't mind, I think it's actually a good idea! 🤗
```python
if visual_prompt_prefix is None:
    visual_prompt_prefix = "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe the image.<|im_end|><|endoftext|>"
self.visual_prompt_prefix = visual_prompt_prefix

if query_prefix is None:
    query_prefix = "Query: "
self.query_prefix = query_prefix
```
These should be incorporated into the chat_template 🤗
Not sure if we should have a chat template here since this is not really a chat model. We had the same issue with GOT-OCR and ended up not using a chat template. wdyt?
Hmm, indeed I was a bit fast here; let's keep it as is, especially as it aligns with ColPali!
```python
if text is not None and images is not None:
    raise ValueError("Only one of text or images can be processed at a time")
```
Alright, let's keep it then!
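A usage sketch of that contract (repo id from this PR; assumes a transformers version with this PR merged, and the dummy image is only for illustration):

```python
from PIL import Image
from transformers import ColQwen2Processor

processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0-hf")
image = Image.new("RGB", (448, 448), color="white")

image_batch = processor(images=image, return_tensors="pt")  # fine
query_batch = processor(text="organizational chart", return_tensors="pt")  # fine

try:
    processor(text="organizational chart", images=image)
except ValueError as err:
    print(err)  # Only one of text or images can be processed at a time
```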
```python
def process_images(
    self,
    images: ImageInput = None,
    **kwargs: Unpack[ColQwen2ProcessorKwargs],
) -> BatchFeature:
    """
    Prepare for the model one or several image(s). This method is a wrapper around the `__call__` method of the ColQwen2Processor's
    [`ColQwen2Processor.__call__`].

    This method forwards the `images` and `kwargs` arguments to Qwen2VLImageProcessor's [`~Qwen2VLImageProcessor.__call__`].

    Args:
        images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
            The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
            tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
            number of channels, H and W are image height and width.
        return_tensors (`str` or [`~utils.TensorType`], *optional*):
            If set, will return tensors of a particular framework. Acceptable values are:

            - `'tf'`: Return TensorFlow `tf.constant` objects.
            - `'pt'`: Return PyTorch `torch.Tensor` objects.
            - `'np'`: Return NumPy `np.ndarray` objects.
            - `'jax'`: Return JAX `jnp.ndarray` objects.

    Returns:
        [`BatchFeature`]: A [`BatchFeature`] with the following fields:

        - **input_ids** -- List of token ids to be fed to a model.
        - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
          `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
          `None`).
        - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
    """
    return self.__call__(images=images, **kwargs)

def process_queries(
    self,
    text: Union[TextInput, List[TextInput]],
    **kwargs: Unpack[ColQwen2ProcessorKwargs],
) -> BatchFeature:
    """
    Prepare for the model one or several texts. This method is a wrapper around the `__call__` method of the ColQwen2Processor's
    [`ColQwen2Processor.__call__`].
```
Why do we redefine them here? They will be inherited directly!
Oh, you're right! However, I think the docstring will be inherited from ColPaliProcessor's and will thus reference ColPali. Is there a way to simply override the docstring here? If not, should we keep the code as it is?
I don't see a clean way to do this, but we can just remove specific references to the tokenizer and image processor in the docstring imo
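For completeness, a usage sketch of the two inherited wrappers end to end (repo id from this PR; `score_retrieval` is assumed to be inherited from the ColPali processor, and the dummy image and query are only for illustration):

```python
import torch
from PIL import Image
from transformers import ColQwen2ForRetrieval, ColQwen2Processor

model_name = "vidore/colqwen2-v1.0-hf"
processor = ColQwen2Processor.from_pretrained(model_name)
model = ColQwen2ForRetrieval.from_pretrained(model_name, torch_dtype=torch.bfloat16)

batch_images = processor.process_images(images=[Image.new("RGB", (448, 448))])
batch_queries = processor.process_queries(text=["What does the figure show?"])

with torch.no_grad():
    image_embeddings = model(**batch_images).embeddings
    query_embeddings = model(**batch_queries).embeddings

# MaxSim late-interaction scoring between each query and each image.
scores = processor.score_retrieval(query_embeddings, image_embeddings)
```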
@Cyrilvallez taking over for the final push on this PR as Tony is quite busy. I pushed some necessary updates after the refactoring of Qwen2VL (so nice to have btw), all should be good now and we use modular much more, including for the modeling code 🤗. @tonywu71 you'll still have to run the updated convert_weights script and push to your repo :), but apart from that we should be ready to merge!
@yonigozlan Done, the model repo is updated! 🤗 I've also pushed a commit to fix the HF model path for ColQwen2 integration tests. Lmk if there's anything left to do before merging!
Cyrilvallez left a comment
All right! Amazing work, congrats to you both @tonywu71 @yonigozlan! Super clean 🤗 I left 2 ultra small comments as my job here is to be very picky 🙃, but that's it! Feel free to merge @yonigozlan!
Thanks for the great addition 🤗
```python
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
)
```
This is a super nit, feel free to disregard if you're too annoyed by the review process 😆 But passing None is a bit misleading for an example IMO, even if it's equivalent
```diff
- attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
+ attn_implementation="flash_attention_2" if is_flash_attn_2_available() else "sdpa",
  )
```
Agreed! It's been addressed 👌🏼
Actually it seems sdpa doesn't work out-of-the-box for ColQwen2 as I get this error when loading the model on MPS.
❌ Code:

```python
model_name = "vidore/colqwen2-v1.0-hf"

# Load model
model = ColQwen2ForRetrieval.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # "cpu", "cuda", or "mps" for Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else "sdpa",
)
```

Note: Leaving attn_implementation=None works.
The error:

```
ValueError: ColQwen2ForRetrieval does not support an attention implementation through torch.nn.functional.scaled_dot_product_attention yet. Please request the support for this architecture: https://github.com/huggingface/transformers/issues/28005. If you believe this error is a bug, please open an issue in Transformers GitHub repository and load your model with the argument `attn_implementation="eager"` meanwhile. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="eager")`
```
✅ However, I managed to load Qwen2VL with SDPA:

```python
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",  # "cpu", "cuda", or "mps" for Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else "sdpa",
)
```

@Cyrilvallez @yonigozlan I read the instructions for enabling SDPA on ColQwen2, but the next steps are a bit unclear as ColQwen2 essentially piggybacks on Qwen2VL thanks to modular. Any ideas about the right fix? 🤗
I believe it's only because the flags are not set in the PreTrainedModel. Adding

```python
_supports_flash_attn_2 = True
_supports_sdpa = True
_supports_flex_attn = True
_supports_cache_class = True
```

should solve it.
Thanks so much, the fix is working like a charm! And as you expected, ColQwen2 works with attn_implementation="flex_attention" too 👌🏼
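For reference, a sketch of where those flags live (class name marked as a sketch and deliberately abridged; the real base class carries more than shown):

```python
from transformers import PreTrainedModel


class ColQwen2PreTrainedModelSketch(PreTrainedModel):
    base_model_prefix = "model"

    # from_pretrained consults these class attributes before honoring
    # attn_implementation="sdpa", "flash_attention_2", or "flex_attention",
    # which is why the ValueError above went away once they were set.
    _supports_flash_attn_2 = True
    _supports_sdpa = True
    _supports_flex_attn = True
    _supports_cache_class = True
```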
```python
if is_torch_available():
    import torch
```
Let's not protect, simply import it 🤗
Problem is we need to protect the import for the processor :(
Oh I see. Not a big issue anyway, you can disregard (it's just that torch.nn is imported without protection anyway, so it's a bit weird), but really not a big concern.
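For context, the guard pattern under discussion, a minimal sketch of how processing files typically keep torch optional (the helper function is hypothetical):

```python
from transformers.utils import is_torch_available

# Processing files must import cleanly in torch-free environments, hence the
# guard; modeling files, by contrast, can import torch and torch.nn directly.
if is_torch_available():
    import torch


def maybe_to_float(scores):
    # Hypothetical helper: only touch torch symbols on paths that need them.
    if is_torch_available() and isinstance(scores, torch.Tensor):
        return scores.float()
    return scores
```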
What does this PR do?
Add ColQwen2 in 🤗 transformers. ColQwen2 is a model that uses the ColPali architecture with a Qwen2-VL backbone.

Who can review?

Additional details
- colpali-engine==v0.3.6
- vidore/colqwen2-v1.0-hf

Progress checklist