XGLM - Fix Softmax NaNs when using FP16 (#18057)
Conversation
return position_ids.unsqueeze(0).expand(input_shape).contiguous() + past_key_values_length
# Copied from transformers.models.bart.modeling_bart.BartAttention with Bart->XGLM
We (HF team) have to remember to add this back once Bart takes the same fix.
younesbelkada left a comment:
LGTM, thanks a lot for the fix 🚀!
    f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is {attention_mask.size()}"
)
attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attention_mask
attn_weights = torch.max(attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min))
@stas00 is this operation costly? I'm wondering how expensive such a max operation is.
What are the contenders? At least max, clamp, and where, but probably others as well.
In [13]: a = torch.tensor([5,-1e20])
In [14]: b = torch.tensor(torch.finfo(torch.float16).min)
In [16]: torch.clamp(a, min=b)
Out[16]: tensor([ 5.0000e+00, -6.5504e+04])
In [21]: torch.where(a > b, a, b)
Out[21]: tensor([ 5.0000e+00, -6.5504e+04])
In [22]: torch.max(a, b)
Out[22]: tensor([ 5.0000e+00, -6.5504e+04])
Benchmark:
$ cat clamp-where-max.py
import torch.utils.benchmark as benchmark
import torch
a = torch.empty(512)
b = torch.tensor(torch.finfo(torch.float16).min)
t0 = benchmark.Timer(
stmt='torch.clamp(a, b)',
setup='',
globals=dict(a=a, b=b),
)
t1 = benchmark.Timer(
stmt='torch.max(a, b)',
setup='',
globals=dict(a=a, b=b),
)
t2 = benchmark.Timer(
stmt='torch.where(a > b, a, b)',
setup='',
globals=dict(a=a, b=b),
)
print(t0.timeit(1000))
print(t1.timeit(1000))
print(t2.timeit(1000))
$ python clamp-where-max.py
<torch.utils.benchmark.utils.common.Measurement object at 0x7f2c77739040>
torch.clamp(a, b)
1.60 us
1 measurement, 1000 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7f2c77739d60>
torch.max(a, b)
1.60 us
1 measurement, 1000 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7f2c77739040>
torch.where(a > b, a, b)
4.36 us
1 measurement, 1000 runs , 1 thread
So max is tied with clamp, and where is slow.
But make sure to benchmark with the actual dimensions, though it shouldn't make much of a difference, I think.
(edited: I got the a wrong initially)
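Following up on that caveat, here is a sketch of the same comparison rerun on attention-shaped tensors (the batch/head/sequence sizes below are assumed for illustration, not taken from the thread):

```python
import torch
import torch.utils.benchmark as benchmark

# Assumed attention-weight shape: (batch, num_heads, tgt_len, src_len).
a = torch.randn(2, 8, 128, 128)
b = torch.tensor(torch.finfo(torch.float16).min)

# Note: torch is imported automatically inside benchmark.Timer's namespace,
# so only a and b need to be passed via globals.
for stmt in ("torch.clamp(a, b)", "torch.max(a, b)", "torch.where(a > b, a, b)"):
    t = benchmark.Timer(stmt=stmt, globals=dict(a=a, b=b))
    print(t.timeit(100))
```

The three expressions are elementwise-equivalent here (clamp's second positional argument is the min bound), so only speed differs; that can be sanity-checked with `torch.equal(torch.max(a, b), torch.clamp(a, b))`.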
@patil-suraj I think only your check is missing!
Sorry for being so late here @gsarti! Merged master into it to ping CircleCI here.
Hey @gsarti - it seems like a test is failing now: with
I noticed this when running the code. My understanding is that setting
I think we don't need this line.
Instead, we can change this part to:

if attn_weights.dtype == torch.float16:
    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(attn_weights.dtype)
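For context, the suggested pattern runs the softmax in fp32 (so the exponentials and the row sum are accumulated at full precision) and rounds back to fp16 only once at the end. A standalone sketch of the pattern, with random weights and an assumed shape for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Assumed shape (bsz, num_heads, tgt_len, src_len); values are random for illustration.
attn_weights = torch.randn(1, 4, 8, 8).half()

# Suggested pattern: compute softmax in fp32, then cast back to fp16 a single time.
if attn_weights.dtype == torch.float16:
    attn_probs = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(attn_weights.dtype)
else:
    attn_probs = nn.functional.softmax(attn_weights, dim=-1)

print(attn_probs.dtype)  # torch.float16
```

The `dtype=torch.float32` argument makes `softmax` upcast its input before exponentiating, which avoids intermediate fp16 rounding; the final `.to(...)` restores the original dtype for the rest of the forward pass.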
I think that we should fix the line on OPT too:
in case attention_mask is set to None, the forward pass will fail as described in #18057 (comment)
But I think that the issue has never been reported since attention_mask is never None:
Good catch! Surprisingly, we don't have test failure for OPT due to this.
You answered my question before I asked it 😆
Hi @gsarti, sorry for being late on this PR. I re-opened it and gave some suggestions for a fix to the failing test. Would you like to update this PR after rebasing your working branch on an updated
Force-pushed 84af0ee to 87ef76e
Hi @gsarti, I made the necessary change to pass the tests and pushed to your branch directly. The remaining failing test is unrelated to this PR, but I will wait until tomorrow to check again, then I will merge. cc @patrickvonplaten and @younesbelkada
Thanks a lot for the fix @ydshieh!!
Force-pushed 87ef76e to 1eb0953
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
What does this PR do?
Fixes #18049 following the exact same procedure used in #17437. Besides the added test, I also evaluated the fix on my personal use case and found the behavior of the fixed model to be consistent when performing single or batched generation.
Who can review?
@patil-suraj @ydshieh @patrickvonplaten