Skip to content

fix(models): Fix Perceiver interpolate_pos_encoding interpolating to the source size#44899

Merged
zucchini-nlp merged 2 commits intohuggingface:mainfrom
harshaljanjani:fix/perceiver-interpolate-pos-encoding
Mar 25, 2026
Merged

fix(models): Fix Perceiver interpolate_pos_encoding interpolating to the source size#44899
zucchini-nlp merged 2 commits intohuggingface:mainfrom
harshaljanjani:fix/perceiver-interpolate-pos-encoding

Conversation

@harshaljanjani
Copy link
Copy Markdown
Contributor

@harshaljanjani harshaljanjani commented Mar 20, 2026

What does this PR do?

The following failing Perceiver use case was identified and fixed in this PR:

c6d2848 (🚨 Fix torch.jit.trace for interpolate_pos_encoding in all vision models) refactored all vision models' interpolate_pos_encoding methods for torch.jit.trace; the canonical pattern used across other vision models (e.g. modeling_vit.py, modeling_deit.py) is that they passes the target (height, width) to nn.functional.interpolate; but the Perceiver diff passed the source grid dims practically making the interpolation a no-op; this should fix that!
→ I also checked if other models have the exact same issue; and they don't, they compute new_height = height // self.patch_size (target patch grid) and pass that.

Fixes #44898

Before the fix (feel free to cross-check; these errors are reproducible):

1

After the fix (feel free to cross-check):

2

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you fix any necessary existing tests?

@harshaljanjani harshaljanjani marked this pull request as ready for review March 20, 2026 20:10
@github-actions github-actions Bot requested a review from zucchini-nlp March 20, 2026 20:10
Copy link
Copy Markdown
Member

@zucchini-nlp zucchini-nlp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a test if we don't yet have

position_embeddings = nn.functional.interpolate(
position_embeddings,
size=(new_height, new_width),
size=(height, width),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds reasonable. I am not super familiar with perceiver, do we not need to divide the height/width by patch size, so it matches with patched image features?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I think I could've made this clearer as well in the PR description! I think this traces back to the reason it was missed as well, which was the prevalent ViT-family fix which worked for most models was applied here, only here there's no patch_size to divide by so it's incorrect (Perceiver uses conv1x1 with spatial_downsample=1).
→ ViT/DeiT: new_height = height // self.patch_size makes the value in new_height the target.
→ Perceiver: new_height = torch_int(num_positions**0.5) makes it the source grid size, so the same size=(new_height, new_width) pattern that's correct everywhere else becomes a no-op here.

@harshaljanjani
Copy link
Copy Markdown
Contributor Author

harshaljanjani commented Mar 23, 2026

Let's add a test if we don't yet have

Thank you for your time @zucchini-nlp! Checked the test coverage, this behavior is covered by test_modeling_perceiver.py::PerceiverModelIntegrationTest::test_inference_interpolate_pos_encoding which currently fails. I just verified that this change fixes that as well!
If I may, I was wondering if I might ask if there's an update on this PR https://github.com/huggingface/transformers/pull/44695 as well at your convenience :))

image

@zucchini-nlp
Copy link
Copy Markdown
Member

@harshaljanjani I see, thaks a lot for explaining! Do you mind adding a fast test, PerceiverModelTest with dummy weights?

@harshaljanjani
Copy link
Copy Markdown
Contributor Author

harshaljanjani commented Mar 23, 2026

Added the test!

Before the fix:

image

After the fix:

image

@github-actions
Copy link
Copy Markdown
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: perceiver

@harshaljanjani
Copy link
Copy Markdown
Contributor Author

Good day @zucchini-nlp; just checking in to see if there are any updates :)

Copy link
Copy Markdown
Member

@zucchini-nlp zucchini-nlp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@zucchini-nlp zucchini-nlp enabled auto-merge March 25, 2026 10:31
@zucchini-nlp zucchini-nlp added this pull request to the merge queue Mar 25, 2026
@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to no response for status checks Mar 25, 2026
@zucchini-nlp zucchini-nlp added this pull request to the merge queue Mar 25, 2026
Merged via the queue into huggingface:main with commit 2620c4d Mar 25, 2026
22 checks passed
@harshaljanjani harshaljanjani deleted the fix/perceiver-interpolate-pos-encoding branch March 25, 2026 11:54
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Mar 27, 2026
…the source size (huggingface#44899)

* fix: Correct interpolation target size

* test: Add fast test coverage
NielsRogge pushed a commit to NielsRogge/transformers that referenced this pull request Mar 30, 2026
…the source size (huggingface#44899)

* fix: Correct interpolation target size

* test: Add fast test coverage
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Perceiver image classification (non-default res) fails even with interpolate_pos_encoding=True

3 participants