
Fix tf32 issue: set torch.backends.cudnn.conv.fp32_precision explicitly.#45248

Merged
ydshieh merged 4 commits into main from fix_tf32_issue on Apr 5, 2026

Conversation

@ydshieh (Collaborator) commented Apr 5, 2026

What does this PR do?

PR #42428 changed the way we enable/disable torch's TF32, using torch's new API. It turns out that setting

torch.backends.fp32_precision = "ieee"

would still have

torch.backends.cudnn.conv.fp32_precision = "tf32"
torch.backends.cudnn.rnn.fp32_precision = "tf32"
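For reference, a minimal way to observe this (a sketch, assuming a torch build that exposes the new string-based fp32_precision API; the printed values are what this PR reports, not guaranteed behavior):

```python
import torch

# Sketch of the reported inconsistency (assumes the new string-based API,
# available in recent torch releases).
torch.backends.fp32_precision = "ieee"  # request full-precision fp32 globally

print(torch.backends.fp32_precision)             # "ieee"
print(torch.backends.cudnn.conv.fp32_precision)  # still "tf32" per this report
print(torch.backends.cudnn.rnn.fp32_precision)   # still "tf32" per this report
```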

It's not clear whether this is a bug or by design in torch; I will talk to people at the torch conference next week.

For now, this issue causes ~60 test_batching_equivalence failures. Setting torch.backends.cudnn.conv.fp32_precision = "ieee" explicitly leaves no such failing tests (on the commit of the linked PR).

I will merge this PR directly to move fast. If the torch team says it's by design rather than a bug, we could move the logic into our enable_tf32.

Keep in mind that even with this fix there are still 37 failing test_batching_equivalence tests, caused by other issues introduced after #42428; those should be fixed in separate PR(s).

Note: this PR brings the vit and clip CI back to ✅

ydshieh changed the title from "Fix tf32 issue" to "Fix tf32 issue: set torch.backends.cudnn.conv.fp32_precision explicitly." Apr 5, 2026
@ydshieh (Collaborator Author) commented Apr 5, 2026

run-slow: vit

@github-actions (Bot) commented Apr 5, 2026

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/vit"]
quantizations: []

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-actions (Bot) commented Apr 5, 2026

CI Results

Workflow Run ⚙️

Commit Info

| Context | Commit | Description |
| --- | --- | --- |
| RUN | 1c42ae2e | workflow commit (merge commit) |
| PR | 8fd7c7f7 | branch commit (from PR) |
| main | 499ef1d7 | base commit (on main) |

Model CI Report

1 new failed test from this PR 😭

  • vit:
    tests/models/vit/test_modeling_vit.py::ViTModelTest::test_torch_export (✅ ⟹ ❌)

@ydshieh (Collaborator Author) commented Apr 5, 2026

run-slow: vit, clip

@github-actions (Bot) commented Apr 5, 2026

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/clip", "models/vit"]
quantizations: []

@github-actions (Bot) commented Apr 5, 2026

CI Results

Workflow Run ⚙️

Commit Info

| Context | Commit | Description |
| --- | --- | --- |
| RUN | d5cafdce | workflow commit (merge commit) |
| PR | e70c3db5 | branch commit (from PR) |
| main | 499ef1d7 | base commit (on main) |

✅ No failing test specific to this PR 🎉 👏 !

ydshieh merged commit 794d65f into main Apr 5, 2026
17 of 18 checks passed
ydshieh deleted the fix_tf32_issue branch April 5, 2026 09:42
@ydshieh (Collaborator Author) commented Apr 5, 2026

Well, the remaining 37 failing tests only fail when we run the whole set of tests (from all models), like

python3 -m pytest -v tests/models/ -k "test_batching_equivalence"

If we run only the set of those 37 tests, like

python3 -m pytest -v @failed.txt

all of them pass. So some tests might change the TF32 settings and never reset them.
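If that's confirmed, one possible remedy would be an autouse fixture in conftest.py that snapshots and restores the precision flags around every test. This is a hypothetical sketch, not part of this PR:

```python
import pytest
import torch

@pytest.fixture(autouse=True)
def _restore_tf32_settings():
    # Snapshot the cuDNN conv precision before each test and restore it after,
    # so a test that flips TF32 settings cannot leak into later tests.
    # Assumes the new string-based API; falls back to the legacy boolean flag.
    if hasattr(torch.backends.cudnn.conv, "fp32_precision"):
        saved = torch.backends.cudnn.conv.fp32_precision
        yield
        torch.backends.cudnn.conv.fp32_precision = saved
    else:
        saved = torch.backends.cudnn.allow_tf32
        yield
        torch.backends.cudnn.allow_tf32 = saved
```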

For the record, the list of those 37 tests is:

tests/models/aimv2/test_modeling_aimv2.py::Aimv2ModelTest::test_batching_equivalence
tests/models/altclip/test_modeling_altclip.py::AltCLIPVisionModelTest::test_batching_equivalence
tests/models/altclip/test_modeling_altclip.py::AltCLIPModelTest::test_batching_equivalence
tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_batching_equivalence
tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_batching_equivalence
tests/models/chinese_clip/test_modeling_chinese_clip.py::ChineseCLIPVisionModelTest::test_batching_equivalence
tests/models/chinese_clip/test_modeling_chinese_clip.py::ChineseCLIPModelTest::test_batching_equivalence
tests/models/clap/test_modeling_clap.py::ClapAudioModelTest::test_batching_equivalence
tests/models/clap/test_modeling_clap.py::ClapModelTest::test_batching_equivalence
tests/models/clip/test_modeling_clip.py::CLIPVisionModelTest::test_batching_equivalence
tests/models/clip/test_modeling_clip.py::CLIPModelTest::test_batching_equivalence
tests/models/clipseg/test_modeling_clipseg.py::CLIPSegVisionModelTest::test_batching_equivalence
tests/models/clipseg/test_modeling_clipseg.py::CLIPSegModelTest::test_batching_equivalence
tests/models/convbert/test_modeling_convbert.py::ConvBertModelTest::test_batching_equivalence
tests/models/deformable_detr/test_modeling_deformable_detr.py::DeformableDetrModelTest::test_batching_equivalence
tests/models/flava/test_modeling_flava.py::FlavaForPreTrainingTest::test_batching_equivalence
tests/models/fuyu/test_modeling_fuyu.py::FuyuModelTest::test_batching_equivalence
tests/models/groupvit/test_modeling_groupvit.py::GroupViTModelTest::test_batching_equivalence
tests/models/lasr/test_modeling_lasr.py::LasrEncoderModelTest::test_batching_equivalence
tests/models/metaclip_2/test_modeling_metaclip_2.py::MetaClip2VisionModelTest::test_batching_equivalence
tests/models/metaclip_2/test_modeling_metaclip_2.py::MetaClip2ModelTest::test_batching_equivalence
tests/models/metaclip_2/test_modeling_metaclip_2.py::MetaClip2ForImageClassificationModelTest::test_batching_equivalence
tests/models/mlcd/test_modeling_mlcd.py::MLCDVisionModelTest::test_batching_equivalence
tests/models/musicgen_melody/test_modeling_musicgen_melody.py::MusicgenMelodyDecoderTest::test_batching_equivalence
tests/models/omdet_turbo/test_modeling_omdet_turbo.py::OmDetTurboModelTest::test_batching_equivalence
tests/models/owlv2/test_modeling_owlv2.py::Owlv2VisionModelTest::test_batching_equivalence
tests/models/owlv2/test_modeling_owlv2.py::Owlv2TextModelTest::test_batching_equivalence
tests/models/owlv2/test_modeling_owlv2.py::Owlv2ModelTest::test_batching_equivalence
tests/models/owlv2/test_modeling_owlv2.py::Owlv2ForObjectDetectionTest::test_batching_equivalence
tests/models/owlvit/test_modeling_owlvit.py::OwlViTVisionModelTest::test_batching_equivalence
tests/models/owlvit/test_modeling_owlvit.py::OwlViTTextModelTest::test_batching_equivalence
tests/models/owlvit/test_modeling_owlvit.py::OwlViTModelTest::test_batching_equivalence
tests/models/owlvit/test_modeling_owlvit.py::OwlViTForObjectDetectionTest::test_batching_equivalence
tests/models/wav2vec2/test_modeling_wav2vec2.py::Wav2Vec2ModelTest::test_batching_equivalence
tests/models/wav2vec2_conformer/test_modeling_wav2vec2_conformer.py::Wav2Vec2ConformerModelTest::test_batching_equivalence
tests/models/x_clip/test_modeling_x_clip.py::XCLIPModelTest::test_batching_equivalence

louzongzhi pushed a commit to louzongzhi/transformers that referenced this pull request Apr 6, 2026
Fix tf32 issue: set torch.backends.cudnn.conv.fp32_precision explicitly. (huggingface#45248)

* empty

* fix

* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Comment thread: conftest.py
# …note that cuDNN conv and cuDNN RNN have different TF32 flags. This combination indicates that you have used a mix of the legacy and new APIs
# to set the TF32 flags. We suggest only using the new API to set the TF32 flag(s).`
# TODO: report a bug to `torch`
if hasattr(torch.backends.cudnn, "allow_tf32"):
Contributor

Basically I feel like we are setting fp32_precision in enable_tf32 and then we are checking for allow_tf32, and that is giving a runtime error.

This whole PR change is what happens inside enable_tf32(False). I don't think the allow_tf32 block is required. I do understand the need for the torch.backends.cudnn.conv block given issue-1 being reported, for which I have a PR open; I need to get traction on it.

but if you still want to keep it we can do if-elif:

if hasattr(torch.backends.cudnn.conv, "fp32_precision"):
    torch.backends.cudnn.conv.fp32_precision = "ieee"
elif hasattr(torch.backends.cudnn, "allow_tf32"):
    torch.backends.cudnn.allow_tf32 = False

For issue-2 I have an idea that I will add to the same PR.

These are just suggestions before the PR on pytorch is merged.

Collaborator Author

Hi @khushali9

I don't want to keep this allow_tf32 block, but currently without it some (torch export) tests fail.
I prefer to wait for your PR to be merged, and then I will remove this block.

Thank you!

Comment thread: conftest.py

# This is necessary to make several `test_batching_equivalence` pass (within the tolerance `1e-5`)
if hasattr(torch.backends.cudnn.conv, "fp32_precision"):
    torch.backends.cudnn.conv.fp32_precision = "ieee"
Contributor

Another easy option is to just move this line, torch.backends.cudnn.conv.fp32_precision = "ieee", into enable_tf32.
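For illustration, a hypothetical sketch of what that could look like. The enable_tf32 name follows the discussion above; the body is illustrative, not the actual transformers implementation:

```python
import torch

def enable_tf32(enable: bool) -> None:
    # Hypothetical helper following the suggestion above; not the actual
    # transformers implementation.
    precision = "tf32" if enable else "ieee"
    if hasattr(torch.backends, "fp32_precision"):
        # New string-based API.
        torch.backends.fp32_precision = precision
        # Set the cuDNN conv flag explicitly too, since (per this PR) it does
        # not always follow the global setting.
        if hasattr(torch.backends.cudnn.conv, "fp32_precision"):
            torch.backends.cudnn.conv.fp32_precision = precision
    else:
        # Legacy boolean API for older torch versions.
        torch.backends.cuda.matmul.allow_tf32 = enable
        torch.backends.cudnn.allow_tf32 = enable
```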

sirzechs66 pushed a commit to sirzechs66/transformers that referenced this pull request Apr 18, 2026
Fix tf32 issue: set torch.backends.cudnn.conv.fp32_precision explicitly. (huggingface#45248)

* empty

* fix

* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>