
[BUG]: chatgpt ppo training hangs when using gemini #3161

@ver217


🐛 Describe the bug

Problem

Running ChatGPT/examples/train_prompts.py, I found that training sometimes hangs when using Gemini.

This occurs randomly when changing the batch size.

Possible reason

I found that the padding policy is to pad to the longest sequence in the batch.
Under DDP, different processes may receive inputs of different lengths due to random sampling. That is to say, they may run different numbers of generation steps.

Gemini requires communication during the forward pass, so a different number of forward steps leads to a different number of communication calls on each rank. This asymmetric communication causes the hang.
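The mismatch can be illustrated with a minimal sketch (plain Python, hypothetical lengths and batch contents): each rank pads only to the longest sequence it happened to sample, so when generation runs up to a fixed total length, the ranks execute different numbers of decode steps, and each step involves Gemini communication.

```python
def pad_longest(batch, pad_id=0):
    """Pad every sequence to the longest one in this rank's batch."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

# Two DDP ranks draw different samples, so their padded lengths differ.
rank0 = pad_longest([[1, 2, 3], [4, 5]])        # every row has length 3
rank1 = pad_longest([[1, 2, 3, 4, 5], [6, 7]])  # every row has length 5

# Generating up to a fixed total length means a different number of
# decode steps per rank -- and each decode step communicates via Gemini.
max_length = 10
steps_rank0 = max_length - len(rank0[0])  # 7 decode steps
steps_rank1 = max_length - len(rank1[0])  # 5 decode steps
```

Rank 1 finishes after 5 steps while rank 0 is still issuing collective calls, so rank 0 blocks forever waiting for a peer that never joins.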

Possible solution

Change the padding policy to 'max_length'; see the Hugging Face tokenizer documentation for more details.
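A minimal sketch of the fixed-length policy (pure Python, illustrative values): padding every batch to one global max_length makes the input shape, and therefore the number of decode steps, identical on all ranks. With a Hugging Face tokenizer this corresponds to passing padding='max_length' together with max_length and truncation=True.

```python
def pad_to_max_length(batch, max_length, pad_id=0):
    """Pad (and truncate) every sequence to one fixed max_length,
    so every DDP rank sees the same input shape."""
    out = []
    for seq in batch:
        seq = seq[:max_length]  # truncate overlong inputs
        out.append(seq + [pad_id] * (max_length - len(seq)))
    return out

# Different ranks, different raw lengths -- identical padded shape.
rank0 = pad_to_max_length([[1, 2, 3], [4, 5]], max_length=8)
rank1 = pad_to_max_length([[1, 2, 3, 4, 5], [6, 7]], max_length=8)
```

The equivalent tokenizer call would look like tokenizer(texts, padding='max_length', max_length=96, truncation=True), where 96 is a placeholder for whatever limit the training script chooses.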

In addition, when early stopping is enabled, we should also account for DDP and ensure that every process runs the same number of generation steps.
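One common pattern for this, sketched below with a stand-in for torch.distributed.all_reduce with ReduceOp.MAX (the source does not show an implementation, so the function names are hypothetical), is to all-reduce an "unfinished" flag each step: every rank keeps stepping until all ranks are done, so the collective-call counts stay aligned.

```python
def allreduce_max(flags):
    """Stand-in for dist.all_reduce(flag, op=dist.ReduceOp.MAX):
    every rank receives the maximum of all ranks' flags."""
    m = max(flags)
    return [m] * len(flags)

def synced_steps(steps_per_rank):
    """Simulate generation loops on several ranks that would like to
    early-stop after steps_per_rank[i] steps.  All ranks keep stepping
    while ANY rank is unfinished, so the step counts stay identical."""
    step = 0
    unfinished = [1] * len(steps_per_rank)
    while max(unfinished) == 1:
        step += 1
        # 1 = this rank still wants to generate, 0 = it would stop here.
        local = [1 if step < s else 0 for s in steps_per_rank]
        unfinished = allreduce_max(local)
    return step
```

Ranks that finished early simply run no-op steps (e.g. keep appending pad tokens) until the global flag clears, so no rank is left waiting on an absent peer.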

Environment

No response

Labels

bug (Something isn't working), chatgpt (ChatGPT Application)
