### 🐛 Describe the bug
**Problem**
Running `ChatGPT/examples/train_prompts.py`, I found that training sometimes hangs when using Gemini.
This occurs intermittently when changing the batch size.
**Possible reason**
I found that the padding policy is to pad to the longest sequence in the batch.
In the DDP scheme, different processes may receive inputs of different lengths due to random sampling, so they may run different numbers of generation steps.
Gemini requires communication during the forward pass, so differing numbers of forward steps lead to differing numbers of communication calls, and this asymmetric communication causes the hang.
**Possible solution**
Change the padding policy to `'max_length'`; see the Hugging Face tokenizer docs for details.
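For illustration, here is a minimal sketch of the proposed change, assuming a Hugging Face tokenizer (the model name, `max_length` value, and prompts are placeholders, not taken from the repo):

```python
# Minimal sketch of the proposed fix, assuming a Hugging Face tokenizer.
# "gpt2", max_length=96, and the prompts are illustrative placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token by default

prompts = ["What is the capital of France?", "Explain DDP in one sentence."]

# padding='max_length' pads every sample to the same fixed length, so all
# DDP ranks see equal-length inputs and run the same number of generation
# steps, instead of padding=True (pad to the longest sequence in the batch).
batch = tokenizer(
    prompts,
    padding="max_length",
    max_length=96,
    truncation=True,
    return_tensors="pt",
)
```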
In addition, when early stopping is enabled, we should also account for DDP and ensure that the number of generation steps is the same on every process, for example as sketched below.
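One possible way to do this (a sketch, not the repo's implementation) is to reduce the per-rank stop flag across processes so that no rank stops early alone; `should_stop` and `locally_finished` are hypothetical names:

```python
# Hedged sketch of keeping early stopping consistent across DDP ranks.
# Assumes an initialized torch.distributed process group with a CUDA backend;
# `locally_finished` is a hypothetical per-rank flag, not an API of the repo.
import torch
import torch.distributed as dist

def should_stop(locally_finished: bool) -> bool:
    flag = torch.tensor([1 if locally_finished else 0], device="cuda")
    # Reduce with MIN: the loop stops only once *every* rank has finished,
    # so all processes execute the same number of generation steps and
    # issue the same number of communication calls.
    dist.all_reduce(flag, op=dist.ReduceOp.MIN)
    return bool(flag.item())
```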
### Environment
No response