Fix Idefics vision embedding mismatched devices#40117

Closed
sayandipdutta wants to merge 3 commits into huggingface:main from sayandipdutta:idefics3-different-devices


@sayandipdutta commented Aug 12, 2025

What does this PR do?

In the following section:

boundaries = torch.arange(1 / self.num_patches_per_side, 1.0, 1 / self.num_patches_per_side)
position_ids = torch.full(size=(batch_size, max_nb_patches_h * max_nb_patches_w), fill_value=0)

for batch_idx, p_attn_mask in enumerate(patch_attention_mask):
    nb_patches_h = p_attn_mask[:, 0].sum()
    nb_patches_w = p_attn_mask[0].sum()

    h_indices = torch.arange(nb_patches_h, device=position_ids.device, dtype=position_ids.dtype)
    w_indices = torch.arange(nb_patches_w, device=position_ids.device, dtype=position_ids.dtype)

    fractional_coords_h = h_indices / nb_patches_h * (1 - 1e-6)
    fractional_coords_w = w_indices / nb_patches_w * (1 - 1e-6)

    bucket_coords_h = torch.bucketize(fractional_coords_h, boundaries, right=True)
    bucket_coords_w = torch.bucketize(fractional_coords_w, boundaries, right=True)

    pos_ids = (bucket_coords_h[:, None] * self.num_patches_per_side + bucket_coords_w).flatten()
    position_ids[batch_idx][p_attn_mask.view(-1).cpu()] = pos_ids

boundaries and position_ids are created on the CPU, because the device of the input is not passed to them. However, patch_attention_mask (an input to the forward method) carries the device chosen at the call site. When a device other than the CPU is used, p_attn_mask, and consequently nb_patches_h and nb_patches_w, are on the input device inside the for loop. Setting device=position_ids.device on lines 150-151 does not solve the issue, since position_ids is itself already on the CPU.
This PR moves boundaries and position_ids to patch_attention_mask.device.
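
The change described above can be sketched as a standalone function. This is only an illustration under stated assumptions, not the actual diff of this PR or of #39975: the name compute_position_ids and its signature are hypothetical (in the library the loop lives inside a forward method), and dropping the .cpu() call on the mask is an assumption that follows from keeping every tensor on one device.

```python
import torch


def compute_position_ids(patch_attention_mask: torch.Tensor, num_patches_per_side: int) -> torch.Tensor:
    """Hypothetical standalone version of the loop quoted above.

    Every tensor is created on the device of `patch_attention_mask`
    instead of defaulting to the CPU.
    """
    device = patch_attention_mask.device
    batch_size, max_nb_patches_h, max_nb_patches_w = patch_attention_mask.shape

    # Both tensors now follow the input device.
    boundaries = torch.arange(
        1 / num_patches_per_side, 1.0, 1 / num_patches_per_side, device=device
    )
    position_ids = torch.full(
        size=(batch_size, max_nb_patches_h * max_nb_patches_w), fill_value=0, device=device
    )

    for batch_idx, p_attn_mask in enumerate(patch_attention_mask):
        nb_patches_h = p_attn_mask[:, 0].sum()
        nb_patches_w = p_attn_mask[0].sum()

        h_indices = torch.arange(nb_patches_h, device=device, dtype=position_ids.dtype)
        w_indices = torch.arange(nb_patches_w, device=device, dtype=position_ids.dtype)

        fractional_coords_h = h_indices / nb_patches_h * (1 - 1e-6)
        fractional_coords_w = w_indices / nb_patches_w * (1 - 1e-6)

        bucket_coords_h = torch.bucketize(fractional_coords_h, boundaries, right=True)
        bucket_coords_w = torch.bucketize(fractional_coords_w, boundaries, right=True)

        pos_ids = (bucket_coords_h[:, None] * num_patches_per_side + bucket_coords_w).flatten()
        # Assumption: no `.cpu()` on the mask, so the index and the
        # indexed tensor share a device.
        position_ids[batch_idx][p_attn_mask.view(-1)] = pos_ids

    return position_ids


# Usage: the result lives on whatever device the mask was created on.
mask = torch.ones((1, 2, 2), dtype=torch.bool)
position_ids = compute_position_ids(mask, num_patches_per_side=2)
assert position_ids.device == mask.device
```

With this shape, a caller that builds patch_attention_mask on an accelerator gets position_ids on the same accelerator, so no tensor inside the loop is left behind on the CPU.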

Fixes #40116

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: idefics2, idefics3

@sayandipdutta changed the title from "Fix Idefics vision model mismatched devices" to "Fix Idefics vision embedding mismatched devices" on Aug 12, 2025
@qgallouedec
Member

Thanks, this is already addressed in #39975

@sayandipdutta
Author

@qgallouedec thanks! I will close it. For posterity, can you tell me what the process is for changing modeling_smolvlm.py? The file itself says not to edit it manually.

@Rocketknight1
Member

@sayandipdutta you modify the underlying modular_smolvlm.py file, and run make fix-copies to propagate changes to modeling files

@sayandipdutta
Author

Addressed in #39975

@sayandipdutta deleted the idefics3-different-devices branch on August 14, 2025 19:36

Development

Successfully merging this pull request may close these issues.

SmolVLM RuntimeError Expected all tensors to be on the same device, but found at least two devices

3 participants