🐛 Bug
As spotted in huggingface/transformers#9156 (comment), some models expose an unusual structure: one very big tensor (which is not optimized) that is larger than the sum of all the other tensors. In that case one rank gets no grads at all, and a couple of dependent functions do not handle that gracefully.
Command
huggingface/transformers#9156 (comment)
A basic seq2seq model run on 2 ranks: one rank gets the single static tensor, the other one gets all the grads. Gradient clipping breaks and the computation is poorly balanced.
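The imbalance above can be illustrated with a minimal sketch of a greedy, size-based parameter partitioner (the function, names, and sizes below are assumptions for illustration only, not fairscale's actual sharding code):

```python
# Hypothetical greedy partitioner: sort tensors by size, assign each one
# to the currently lightest rank (a common OSS-style heuristic, assumed here).
def partition(tensors, world_size=2):
    shards = [[] for _ in range(world_size)]
    loads = [0] * world_size
    for name, numel, requires_grad in sorted(tensors, key=lambda t: -t[1]):
        rank = loads.index(min(loads))  # lightest rank so far
        shards[rank].append((name, requires_grad))
        loads[rank] += numel
    return shards

# One huge static (non-trainable) tensor dominates everything else:
tensors = [
    ("big_static", 50_000_000, False),  # hypothetical frozen tensor
    ("w1", 1_000_000, True),
    ("w2", 1_000_000, True),
    ("w3", 1_000_000, True),
]
shards = partition(tensors)
# shards[0] holds only the static tensor, so rank 0 never sees a gradient;
# shards[1] holds every trainable parameter.
```

With this structure, rank 0's shard produces no gradients, which is exactly the state that gradient clipping (and other functions that assume every rank owns at least one grad) does not expect.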