[BUG]: PPO training may be incorrect #3374

@jasper881108

Description

🐛 Describe the bug

As in the title, I'm reproducing OpenAI's paper *Learning to Summarize from Human Feedback*. After finishing stage 1 and stage 2, I obtained the models below:

  1. SFT OPT-1.3b with a ROUGE-1 score of 0.29
  2. RM OPT-350m with an accuracy of 66.3%

However, after PPO training, the ROUGE-1 score drops to 0.02, which is unusual. I currently suspect Colossal-AI's PPO process, because its reward is not calculated token by token but over the whole completion. trlx's approach may be closer to OpenAI's ChatGPT:
trlx (https://github.com/CarperAI/trlx/tree/main/examples/summarize_rlhf)
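For reference, the reward shaping described in the paper can be sketched as follows. This is a minimal illustration, not the Colossal-AI or trlx implementation: the scalar reward-model score is credited only at the final generated token, while every token receives a per-token KL penalty against the frozen SFT reference policy. Function and argument names here are my own assumptions.

```python
from typing import List

def shape_rewards(logprobs: List[float],
                  ref_logprobs: List[float],
                  rm_score: float,
                  kl_coef: float = 0.1) -> List[float]:
    """Return one reward per generated token (illustrative sketch)."""
    # Per-token KL penalty: -kl_coef * (log pi(a|s) - log pi_ref(a|s)),
    # which discourages drifting away from the SFT reference policy.
    rewards = [-kl_coef * (lp - rlp) for lp, rlp in zip(logprobs, ref_logprobs)]
    # The whole-completion RM score is added only at the final token;
    # PPO's advantage estimation then propagates it backwards through time.
    rewards[-1] += rm_score
    return rewards

# Example with two generated tokens and an RM score of 2.0
print(shape_rewards([-1.0, -2.0], [-1.1, -1.5], rm_score=2.0))
```

Even with this shaping, the reward for the summary's content is still a single scalar per completion; the "word by word" part is only the KL term. So a large ROUGE drop may also point at reward hacking or a KL coefficient that is too small, rather than the reward placement alone.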

Environment

No response

Labels

bug: Something isn't working