Fix BLOOM's softmax for half precisions #18185
NouamaneTazi wants to merge 4 commits into huggingface:main
Conversation
- avoid having both `-inf` and `dtype.min` in causal mask due to addition
- clip values between dtype max and min to avoid infs (not liked by softmax) Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
The documentation is not available anymore as the PR was closed or merged.
- it's okay to use addition since we're using `-inf` again
The two situations you described indeed exist. However, I think there is no real necessity to deal with them. As long as there is at least one position to attend to, it doesn't matter if we have mixed `-inf` and `torch.finfo(dtype).min` values. And for a sequence without any position to attend to, there is nothing we can do. If we want to be really rigorous, we should multiply the softmaxed scores by zeros for the unattended places.
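A minimal sketch of that "really rigorous" option (toy shapes and a hypothetical padding mask, not code from the PR): after softmax, multiply by the 0/1 attention mask so unattended positions contribute exactly zero.

```python
import torch

# Hypothetical illustration of the suggestion above: zero out the softmaxed
# scores at positions that should not be attended to.
scores = torch.zeros(1, 1, 3, 3)           # dummy attention scores
attend = torch.tensor([[1.0, 1.0, 0.0]])   # last position is padding (made up)
mask = attend[:, None, None, :]            # broadcast to the score shape

probs = torch.softmax(scores.masked_fill(mask == 0, float("-inf")), dim=-1)
probs = probs * mask  # multiply by zeros for the unattended places
```

With all-zero dummy scores, each row becomes [0.5, 0.5, 0.0]: the padded position gets exactly zero weight, both from the `-inf` fill and from the final multiplication.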
@ydshieh Are we sure
@NouamaneTazi I don't think there is such a guarantee, and what you mentioned is possible. However, it would be great if you could provide some examples for which this PR helps to get better results or solves some issues. Thank you!
So stupid question: instead of running
```diff
@@ -599,7 +601,6 @@ def _prepare_attn_mask(self, attention_mask, input_shape, inputs_embeds, past_ke
 combined_attention_mask = (
     expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
```
Stupid question, but when you sum `torch.finfo(dtype).min` with `torch.finfo(dtype).min`, it's not the masked value anymore?
`torch.finfo(dtype).min + torch.finfo(dtype).min = -inf`
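This is easy to check directly: in float16, the sum of two dtype-minimum values overflows the representable range and becomes `-inf`, while adding zero leaves it finite. That is exactly how mixed `-inf` and `torch.finfo(dtype).min` entries appear after summing two masks built from `dtype.min`.

```python
import torch

m = torch.tensor(torch.finfo(torch.float16).min, dtype=torch.float16)  # -65504.0

print(m + m)  # -131008 is outside the fp16 range, so this overflows to -inf
print(m + torch.tensor(0.0, dtype=torch.float16))  # stays finite: -65504.0
```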
We would get something like this:
```python
>>> print(attention_mask)
tensor([[[[     0., -65504., -65504., -65504., -65504., -65504., -65504.],
          [     0.,      0., -65504., -65504., -65504., -65504., -65504.],
          [     0.,      0.,      0., -65504., -65504., -65504., -65504.],
          [     0.,      0.,      0.,      0., -65504., -65504., -65504.],
          [     0.,      0.,      0.,      0.,      0., -65504., -65504.],
          [     0.,      0.,      0.,      0.,      0.,      0., -65504.],
          [     0.,      0.,      0.,      0.,      0.,      0.,      0.]]],
        [[[-65504.,    -inf,    -inf, -65504., -65504., -65504., -65504.],
          [-65504., -65504.,    -inf, -65504., -65504., -65504., -65504.],
          [-65504., -65504., -65504., -65504., -65504., -65504., -65504.],
          [-65504., -65504., -65504.,      0., -65504., -65504., -65504.],
          [-65504., -65504., -65504.,      0.,      0., -65504., -65504.],
          [-65504., -65504., -65504.,      0.,      0.,      0., -65504.],
          [-65504., -65504., -65504.,      0.,      0.,      0.,      0.]]]],
       device='cuda:0', dtype=torch.float16)
```
I'm not sure what
I think @thomasw21 is talking about the place where an attn. score (which, as you say, could be positive) is added to the mask.
Should be fixed in this PR: #18344
This PR aims at fixing the following issues:
- [mixed `-inf` and `torch.finfo(dtype).min` in the attention mask] In this line, if we use the minimum dtype values, after performing the addition we get mixed `-inf` and `torch.finfo(dtype).min` in the attention mask. We use `-inf` in the attention mask instead, and only after the addition we replace the inf values by the respective max/min dtype values.
- We use `torch.clip` instead of `torch.max` to ensure we avoid both `-inf` and `+inf` for softmax.

All tests (including slow ones) are passing. ✅
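A runnable sketch of the approach described above, with toy shapes and a made-up padding pattern rather than the actual BLOOM modeling code: build every mask with `-inf` (so summing masks cannot produce mixed values), and only after the addition clip to the finite dtype range so softmax never sees an infinity.

```python
import torch

dtype = torch.float16
finfo = torch.finfo(dtype)

# Both masks use -inf for masked positions: -inf + -inf is still -inf,
# and 0 + -inf is -inf, so there is no mixing after the sum.
causal = torch.triu(torch.full((4, 4), float("-inf")), diagonal=1)
padding = torch.zeros(4, 4)
padding[:, 1] = float("-inf")  # pretend the second token is padding (made up)
combined = causal + padding

# Add the mask to the attention scores, then, only after the addition,
# clip to the finite dtype range so softmax sees neither -inf nor +inf.
scores = torch.randn(4, 4) + combined
scores = torch.clip(scores, min=finfo.min, max=finfo.max).to(dtype)
probs = torch.softmax(scores.float(), dim=-1)
```

After clipping, every masked position holds `finfo.min` (-65504 in fp16), which softmax drives to zero weight without ever producing NaNs.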
Related to: #17437
Co-authored-by: @younesbelkada
cc @ydshieh @stas00