🐛 Describe the bug
As the title says, I'm working through OpenAI's paper "Learning to Summarize from Human Feedback". After finishing stage 1 and stage 2, I got the models below:
- SFT OPT-1.3b with a ROUGE-1 score of 0.29
- RM OPT-350m with an accuracy of 66.3%

However, after training with the PPO process, ROUGE-1 drops to 0.02, which is unusual. I currently suspect Colossal AI's PPO process, because its reward is not assigned token by token but only once for the whole completion. Maybe trlx's approach is closer to OpenAI's ChatGPT:
trlx (https://github.com/CarperAI/trlx/tree/main/examples/summarize_rlhf)
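For context, here is a minimal sketch of the per-token reward shaping used in trlx-style (and OpenAI-style) PPO: the reward model's single scalar score is credited only at the last generated token, while a per-token KL penalty against the SFT reference model provides a dense signal on every other token. The function name `shape_rewards`, the `kl_coef` value, and the tensor shapes are assumptions for illustration, not Colossal AI's or trlx's actual API:

```python
import torch

def shape_rewards(rm_score, logprobs, ref_logprobs, kl_coef=0.1):
    """Build a per-token reward sequence from one scalar RM score.

    rm_score:     scalar reward-model score for the whole completion
    logprobs:     (T,) log-probs of generated tokens under the policy
    ref_logprobs: (T,) log-probs of the same tokens under the SFT reference
    """
    # Per-token KL penalty keeps the policy close to the SFT model.
    rewards = -kl_coef * (logprobs - ref_logprobs)
    # The sequence-level RM score lands on the final token only;
    # PPO's discounted returns propagate it back through the sequence.
    rewards[-1] += rm_score
    return rewards

# Hypothetical usage with dummy values:
T = 8
rewards = shape_rewards(
    rm_score=torch.tensor(1.5),
    logprobs=torch.randn(T),
    ref_logprobs=torch.randn(T),
)
print(rewards.shape)  # torch.Size([8])
```

Under this scheme the whole-completion score still enters as a single scalar, but the KL term gives the policy per-token feedback; if Colossal AI's PPO only applies a sequence-level reward without this shaping, that might explain the collapse in ROUGE-1.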
Environment
No response