Fix tf32 issue: set torch.backends.cudnn.conv.fp32_precision explicitly. #45248
run-slow: vit

This comment contains models: ["models/vit"]
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
CI Results

Model CI Report: ❌ 1 new failed test from this PR 😭
run-slow: vit, clip

This comment contains models: ["models/clip", "models/vit"]
Well, the remaining 37 failing tests only fail when we run the whole set of tests (from all models). If we run just those 37 tests on their own, all of them pass. So some tests might change the TF32 settings and never reset them (see the restore-fixture sketch after the list). For the record, the list of those 37 tests is:

tests/models/aimv2/test_modeling_aimv2.py::Aimv2ModelTest::test_batching_equivalence
tests/models/altclip/test_modeling_altclip.py::AltCLIPVisionModelTest::test_batching_equivalence
tests/models/altclip/test_modeling_altclip.py::AltCLIPModelTest::test_batching_equivalence
tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_batching_equivalence
tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_batching_equivalence
tests/models/chinese_clip/test_modeling_chinese_clip.py::ChineseCLIPVisionModelTest::test_batching_equivalence
tests/models/chinese_clip/test_modeling_chinese_clip.py::ChineseCLIPModelTest::test_batching_equivalence
tests/models/clap/test_modeling_clap.py::ClapAudioModelTest::test_batching_equivalence
tests/models/clap/test_modeling_clap.py::ClapModelTest::test_batching_equivalence
tests/models/clip/test_modeling_clip.py::CLIPVisionModelTest::test_batching_equivalence
tests/models/clip/test_modeling_clip.py::CLIPModelTest::test_batching_equivalence
tests/models/clipseg/test_modeling_clipseg.py::CLIPSegVisionModelTest::test_batching_equivalence
tests/models/clipseg/test_modeling_clipseg.py::CLIPSegModelTest::test_batching_equivalence
tests/models/convbert/test_modeling_convbert.py::ConvBertModelTest::test_batching_equivalence
tests/models/deformable_detr/test_modeling_deformable_detr.py::DeformableDetrModelTest::test_batching_equivalence
tests/models/flava/test_modeling_flava.py::FlavaForPreTrainingTest::test_batching_equivalence
tests/models/fuyu/test_modeling_fuyu.py::FuyuModelTest::test_batching_equivalence
tests/models/groupvit/test_modeling_groupvit.py::GroupViTModelTest::test_batching_equivalence
tests/models/lasr/test_modeling_lasr.py::LasrEncoderModelTest::test_batching_equivalence
tests/models/metaclip_2/test_modeling_metaclip_2.py::MetaClip2VisionModelTest::test_batching_equivalence
tests/models/metaclip_2/test_modeling_metaclip_2.py::MetaClip2ModelTest::test_batching_equivalence
tests/models/metaclip_2/test_modeling_metaclip_2.py::MetaClip2ForImageClassificationModelTest::test_batching_equivalence
tests/models/mlcd/test_modeling_mlcd.py::MLCDVisionModelTest::test_batching_equivalence
tests/models/musicgen_melody/test_modeling_musicgen_melody.py::MusicgenMelodyDecoderTest::test_batching_equivalence
tests/models/omdet_turbo/test_modeling_omdet_turbo.py::OmDetTurboModelTest::test_batching_equivalence
tests/models/owlv2/test_modeling_owlv2.py::Owlv2VisionModelTest::test_batching_equivalence
tests/models/owlv2/test_modeling_owlv2.py::Owlv2TextModelTest::test_batching_equivalence
tests/models/owlv2/test_modeling_owlv2.py::Owlv2ModelTest::test_batching_equivalence
tests/models/owlv2/test_modeling_owlv2.py::Owlv2ForObjectDetectionTest::test_batching_equivalence
tests/models/owlvit/test_modeling_owlvit.py::OwlViTVisionModelTest::test_batching_equivalence
tests/models/owlvit/test_modeling_owlvit.py::OwlViTTextModelTest::test_batching_equivalence
tests/models/owlvit/test_modeling_owlvit.py::OwlViTModelTest::test_batching_equivalence
tests/models/owlvit/test_modeling_owlvit.py::OwlViTForObjectDetectionTest::test_batching_equivalence
tests/models/wav2vec2/test_modeling_wav2vec2.py::Wav2Vec2ModelTest::test_batching_equivalence
tests/models/wav2vec2_conformer/test_modeling_wav2vec2_conformer.py::Wav2Vec2ConformerModelTest::test_batching_equivalence
tests/models/x_clip/test_modeling_x_clip.py::XCLIPModelTest::test_batching_equivalence
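
One way to stop that kind of leakage, as a hedged sketch (a hypothetical `conftest.py` fixture, not something this PR adds): snapshot the TF32 flag before each test and restore it afterwards.

```python
# conftest.py (hypothetical sketch): restore the TF32-related flag around
# every test, so one test's settings cannot leak into another test run in
# the same session.
import pytest
import torch


@pytest.fixture(autouse=True)
def restore_tf32_flags():
    conv = getattr(torch.backends.cudnn, "conv", None)
    if conv is not None and hasattr(conv, "fp32_precision"):
        # New-style per-backend API (the knob this PR sets explicitly).
        saved = conv.fp32_precision
        yield
        conv.fp32_precision = saved
    else:
        # Legacy flag for older torch versions; avoid touching both APIs
        # in the same process, per the RuntimeError discussed below.
        saved = torch.backends.cudnn.allow_tf32
        yield
        torch.backends.cudnn.allow_tf32 = saved
```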
…itly. (huggingface#45248) * empty * fix * fix * fix --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
```python
# Note that cuDNN conv and cuDNN RNN have different TF32 flags. This combination indicates that you
# have used a mix of the legacy and new APIs to set the TF32 flags. We suggest only using the new API
# to set the TF32 flag(s).`.
# TODO: report a bug to `torch`
if hasattr(torch.backends.cudnn, "allow_tf32"):
```
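
For context, a minimal sketch of the mixed-API situation the quoted comment describes (behavior depends on the torch build; per this thread, touching the legacy flag after the new one is what triggers the quoted RuntimeError):

```python
import torch

# Set the flag through the new per-backend API...
torch.backends.cudnn.conv.fp32_precision = "ieee"

# ...then touch the legacy flag: this is the "mix of the legacy and new APIs"
# that the error message above complains about. On torch builds with the new
# API, this reportedly raises a RuntimeError.
torch.backends.cudnn.allow_tf32 = False
```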
Basically, I feel like we are setting fp32_precision in enable_tf32 and then checking for allow_tf32, and that is what gives the runtime error.

This whole PR change is what happens inside enable_tf32(False), so I don't think the allow_tf32 block is required. I do understand the need for the torch.backends.cudnn.conv block given issue-1 being reported, for which I have kept a PR open; I need to get traction on it.

But if you still want to keep it, we can do an if/elif:

```python
if hasattr(torch.backends.cudnn.conv, "fp32_precision"):
    torch.backends.cudnn.conv.fp32_precision = "ieee"
elif hasattr(torch.backends.cudnn, "allow_tf32"):
    torch.backends.cudnn.allow_tf32 = False
```

For issue-2, I have an idea that I will add to the same PR.

These are just suggestions until the PR on pytorch is merged.
Hi @khushali9

I don't want to keep this allow_tf32 block either, but currently, without it, some (torch export) tests fail.

I'd prefer to wait for your PR to be merged, and then I will remove this block.

Thank you!
```python
# This is necessary to make several `test_batching_equivalence` pass (within the tolerance `1e-5`)
if hasattr(torch.backends.cudnn.conv, "fp32_precision"):
    torch.backends.cudnn.conv.fp32_precision = "ieee"
```
Another easy option is to just move the line `torch.backends.cudnn.conv.fp32_precision = "ieee"` into `enable_tf32`, as sketched below.
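
A rough sketch of that suggestion, assuming an `enable_tf32(enabled: bool)` helper shaped like the one discussed in this thread (the real helper in the codebase may differ):

```python
import torch


def enable_tf32(enabled: bool) -> None:
    # Route both APIs through one entry point so callers never mix them.
    conv = getattr(torch.backends.cudnn, "conv", None)
    if conv is not None and hasattr(conv, "fp32_precision"):
        # New-style per-backend knob (the one this PR sets explicitly).
        conv.fp32_precision = "tf32" if enabled else "ieee"
    elif hasattr(torch.backends.cudnn, "allow_tf32"):
        # Legacy flag for older torch versions.
        torch.backends.cudnn.allow_tf32 = enabled
```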
What does this PR do?

PR #42428 changed the way we enable/disable torch's TF32, using torch's new API. It turns out that disabling TF32 this way would still leave the cuDNN conv flag (`torch.backends.cudnn.conv.fp32_precision`) on TF32.

It's not clear if this is a bug or a design decision in `torch`; I will talk to people at the torch conference next week. For now, this issue causes ~60 `test_batching_equivalence` failures. Setting `torch.backends.cudnn.conv.fp32_precision = "ieee"` explicitly leaves no such failing tests (on the commit of the linked PR). I will merge this PR directly to move fast. If the `torch` team says it's a design decision instead of a bug, we could move the logic into our `enable_tf32`.

Keep in mind that even with this fix, there are still 37 failing `test_batching_equivalence` tests, caused by other issues introduced after #42428; those should be fixed in separate PR(s).

Note: this PR brings the `vit` and `clip` CI back to ✅
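
For context, a minimal sketch of the kind of check `test_batching_equivalence` performs (hypothetical layer and sizes, not taken from the test suite), and why the conv TF32 precision matters for its `1e-5` tolerance:

```python
import torch

torch.manual_seed(0)
conv = torch.nn.Conv2d(3, 8, kernel_size=3).cuda()
x = torch.randn(4, 3, 32, 32, device="cuda")

# With IEEE fp32 conv precision, a sample computed alone should match the
# same sample computed inside a batch to well within 1e-5; under TF32 the
# reduced mantissa can push the difference past that tolerance.
torch.backends.cudnn.conv.fp32_precision = "ieee"
single = conv(x[:1])
batched = conv(x)[:1]
print((single - batched).abs().max())  # expected well below 1e-5 with "ieee"
```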