
[FEATURE]: Graphic card ram friendly PPO training for big model(larger than 2B) #3566

@yynil

Description


Describe the feature

PPO training needs to keep four models in memory at the same time. The original implementation keeps the reward, actor, critic, and initial models in video RAM simultaneously.
The actor/initial models output token ids, which serve as actions for the reward/critic models. If the reward model and the actor model don't share the same tokenizer, those ids are meaningless to the reward model.

Even within the same model family, like bloom, developers can't rely on the strong assumption that models at different scales share the same tokenizer. For example, bloom7b-mt is not guaranteed to share a tokenizer with bloom-560m.

Things get even worse if we only have one LLM, like ChatGLM-6B. Then we don't even have the chance to find a smaller model with the same tokenizer.
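The tokenizer mismatch can be bridged by decoding the actor's action ids back to plain text and re-encoding that text with the reward model's own tokenizer. A minimal sketch, assuming Hugging Face-style tokenizers (`batch_decode` plus a callable encode interface); `bridge_actions` is a hypothetical helper name, not code from the fork:

```python
def bridge_actions(actor_tokenizer, reward_tokenizer, action_ids):
    """Convert actor-side token ids into reward-side inputs.

    Decode the actor's generated ids to text, then re-encode the text
    with the reward model's tokenizer, so the two models never need to
    share a vocabulary.
    """
    texts = actor_tokenizer.batch_decode(action_ids, skip_special_tokens=True)
    return reward_tokenizer(texts, return_tensors="pt", padding=True)
```

This costs an extra decode/encode round trip per rollout, but removes the shared-tokenizer assumption entirely.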

So a video-RAM-friendly PPO trainer is needed, where we only need to keep one model in video RAM at a time during training.
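One way to keep only a single model resident is to wrap each forward pass in a context manager that loads the model onto the GPU and offloads it back to CPU afterwards. A sketch of the idea, not the fork's actual implementation; `on_device` is a hypothetical name and works with any object exposing a PyTorch-style `.to()` method:

```python
import contextlib

@contextlib.contextmanager
def on_device(model, device, offload_device="cpu"):
    """Temporarily place `model` on `device`, then offload it again.

    The `finally` block guarantees the model is moved back to
    `offload_device` even if the forward pass raises, so at most one
    model occupies video RAM at any point.
    """
    model.to(device)
    try:
        yield model
    finally:
        model.to(offload_device)
```

In a PPO step you would enter this context once per model (actor, critic, reward, initial) in sequence; in practice you would also call `torch.cuda.empty_cache()` after each offload to release the freed blocks, at the cost of repeated host-device transfers.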

I have finished the code and the README doc in my fork. I'll submit a PR for this feature later.
