Conversation
Just to be on the safe side: setting it to the exact max value might again lead to inf values in subsequent layers.
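For illustration, a minimal sketch of what clamping a margin below the fp16 max (rather than at the exact max) could look like; the helper name and the margin of 1000 are assumptions for this sketch, not necessarily the exact values used in the PR:

```python
import torch

def clamp_hidden_states(hidden_states: torch.Tensor) -> torch.Tensor:
    # Clamp slightly below the dtype max so that subsequent additions or
    # layer norms don't push the values back to inf (the margin is an assumption).
    if hidden_states.dtype == torch.float16:
        clamp_value = torch.finfo(hidden_states.dtype).max - 1000
        hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
    return hidden_states
```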
Okay, just noticed that we do the same in Bart as well.
maybe improve the comment slightly:

`# clamp inf values` → `# clamp inf values to enable fp16 training`
This is great!
Dear @patil-suraj, can you tell me, should your code fix fp16 on the google/t5-v1_1-xl model? Update: I ran my code on a Transformers branch with your current PR #9487 merged with PR #9211, which is needed for DeepSpeed integration.
Hey @exelents, can you include a code snippet to reproduce your error, as well as the full stack trace?
As stated in #9432, this fix works for the following models and versions, with apex.
Just did a small experiment with it as well. @exelents, by overflow error do you mean the gradient overflow warning thrown by …?
Ah ok, we still see …
Here is the error stack:
I'm again trying to locate where exactly in the model this happens. In case it's the same as above (first …
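A hedged sketch of one way to locate where activations first become inf or nan, using forward hooks on every submodule; the function name and approach are illustrative assumptions, not the debugging code actually used here:

```python
import torch

def find_first_inf(model):
    # Register a forward hook on every submodule and stop at the first one
    # whose output contains inf or nan.
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and (torch.isinf(t).any() or torch.isnan(t).any()):
                    raise RuntimeError(f"inf/nan first produced by module: {name}")
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    # Run a forward pass, then call h.remove() on each returned handle.
    return handles
```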
I have checked the loss value, and it seems it is not NaN. It gets values like "48.7500" or "40.9688", which are valid values. Despite that, I see messages like "OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024.0, reducing to 512.0", which seems to mean that something bad happened with the model's loss.
Those warnings don't mean anything went wrong; with dynamic loss scaling, it's expected that some loss scale values are too big at the beginning of training.
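To illustrate why those messages are expected, here is a simplified sketch of dynamic loss scaling; this is an assumed, stripped-down version of the idea, not the actual apex/DeepSpeed implementation:

```python
class DynamicLossScaler:
    # Simplified illustration of dynamic loss scaling (assumed behaviour).
    def __init__(self, init_scale=2.0 ** 16, growth_interval=1000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, found_overflow: bool):
        if found_overflow:
            # Skip the optimizer step and halve the scale, e.g. 1024.0 -> 512.0,
            # which is exactly what the "OVERFLOW! Skipping step" message reports.
            self.scale /= 2
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps % self.growth_interval == 0:
                # After enough stable steps, try a larger scale again.
                self.scale *= 2
```

The initial scale is deliberately large, so a few skipped steps while it shrinks to a workable value are normal and do not indicate a broken loss.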
sgugger
left a comment
LGTM, thanks for fixing this!
LysandreJik
left a comment
Very cool! Thanks for working on this @patil-suraj!
What does this PR do?
This PR enables fp16 for T5 models, by clamping hidden states to the max value of the current data type.
As detailed in #9295, T5 produces large (inf) activations at 3 places: `T5LayerFF`, `T5LayerSelfAttention`, and `T5LayerCrossAttention`. To avoid these inf activations, this PR clamps the `hidden_states` after the above 3 outputs.
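As a quick sanity check of the effect described above, something like the following can be run; the checkpoint, prompt, and check are assumptions for illustration (the overflow was mainly reported on larger t5-v1_1 checkpoints, and this requires a CUDA device and sentencepiece installed):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Run a half-precision forward pass and confirm the loss stays finite.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").half().cuda().eval()

inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").to("cuda")
labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids.to("cuda")

with torch.no_grad():
    outputs = model(**inputs, labels=labels)

print(torch.isfinite(outputs.loss).item())  # expect True once the inf activations are clamped
```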