🐛 Describe the bug
As the title says, I'm working through OpenAI's paper "Learning to Summarize from Human Feedback". After finishing stage 1 and stage 2, I got the models below:
- SFT OPT-1.3b with a ROUGE-1 score of 0.29
- RM OPT-350m with an accuracy of 66.3%

However, after training with the PPO process, ROUGE-1 drops to 0.02, which is unusual. I currently suspect Colossal AI's PPO process, because its reward is not assigned token by token but only once for the whole completion. Maybe trlx's approach is closer to OpenAI's ChatGPT:
trlx (https://github.com/CarperAI/trlx/tree/main/examples/summarize_rlhf)
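For context, here is a minimal sketch of the per-token reward shaping used in trlx-style (and OpenAI-style) PPO: the reward model's single scalar score is credited only at the last generated token, while a per-token KL penalty against the SFT reference model provides a dense signal on every other token. The function name `shape_rewards`, the `kl_coef` value, and the tensor shapes are assumptions for illustration, not Colossal AI's or trlx's actual API:

```python
import torch

def shape_rewards(rm_score, logprobs, ref_logprobs, kl_coef=0.1):
    """Build a per-token reward sequence from one scalar RM score.

    rm_score:     scalar reward-model score for the whole completion
    logprobs:     (T,) log-probs of generated tokens under the policy
    ref_logprobs: (T,) log-probs of the same tokens under the SFT reference
    """
    # Per-token KL penalty keeps the policy close to the SFT model.
    rewards = -kl_coef * (logprobs - ref_logprobs)
    # The sequence-level RM score lands on the final token only;
    # PPO's discounted returns propagate it back through the sequence.
    rewards[-1] += rm_score
    return rewards

# Hypothetical usage with dummy values:
T = 8
rewards = shape_rewards(
    rm_score=torch.tensor(1.5),
    logprobs=torch.randn(T),
    ref_logprobs=torch.randn(T),
)
print(rewards.shape)  # torch.Size([8])
```

Under this scheme the whole-completion score still enters as a single scalar, but the KL term gives the policy per-token feedback; if Colossal AI's PPO only applies a sequence-level reward without this shaping, that might explain the collapse in ROUGE-1.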
Environment
No response