[chat] fix train_prompts.py gemini strategy bug#3666

Merged
ver217 merged 4 commits intohpcaitech:mainfrom
zhang-yi-chi:fix/chat-train-prompts-gemini
May 6, 2023
Conversation

@zhang-yi-chi
Contributor

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs

🚨 Issue number

📝 What does this PR do?

Initial model and reward model are not wrapped by the ZeroDDP wrapper, so they cannot accept a ColoTensor as model input. Here we use .data as the model input.
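As a rough illustration of the tensor semantics behind this workaround (plain PyTorch, no ColossalAI required): `.data` yields a shallow view of the same storage, detached from autograd and from any wrapper bookkeeping, which is why passing it to an unwrapped model sidesteps the problem.

```python
import torch

# Minimal sketch: `.data` returns a shallow, autograd-detached view
# of the same underlying storage.
x = torch.randn(3, requires_grad=True)
y = x.data

assert y.requires_grad is False       # detached from autograd
assert y.data_ptr() == x.data_ptr()   # same underlying storage
```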

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

@ver217
Contributor

ver217 commented Apr 28, 2023

What is the problem? I test https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/benchmarks/benchmark_opt_lora_dummy.py with gemini strategy and no error occurs.

A naive torch module should be able to receive ColoTensor as well.

@zhang-yi-chi
Contributor Author

> What is the problem? I test https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/benchmarks/benchmark_opt_lora_dummy.py with gemini strategy and no error occurs.
>
> A naive torch module should be able to receive ColoTensor as well.

Running train_prompts.sh with the colossalai_gemini strategy causes the following error:

File "ColossalAI/applications/Chat/coati/experience_maker/naive.py", line 25, in make_experience
    base_action_log_probs = self.initial_model(sequences, num_actions, attention_mask)
...
File "/lib/python3.8/site-packages/colossalai/nn/_ops/embedding.py", line 111, in colo_embedding
    assert isinstance(weight, ColoTensor)

initial_model and reward_model are not ZeroDDP modules, so their weights are not ColoTensors.
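A hypothetical sketch of this failure mode (FakeColoTensor stands in for ColoTensor; none of this is ColossalAI code): a Tensor subclass intercepts torch functions via `__torch_function__`, so even a plain `nn.Embedding` dispatches into the subclass handler, which then rejects the module's plain weight, mirroring the assert in colossalai/nn/_ops/embedding.py.

```python
import torch
import torch.nn.functional as F

# FakeColoTensor is an illustrative stand-in for ColoTensor.
class FakeColoTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        if func is F.embedding:
            weight = args[1]
            # analogue of: assert isinstance(weight, ColoTensor)
            assert isinstance(weight, FakeColoTensor), "plain weight rejected"
        return super().__torch_function__(func, types, args, kwargs or {})

emb = torch.nn.Embedding(10, 4)                      # unwrapped module, plain weight
ids = torch.tensor([1, 2]).as_subclass(FakeColoTensor)

failed = False
try:
    emb(ids)   # the subclass input routes F.embedding to the handler above
except AssertionError:
    failed = True
```

The unwrapped module's weight is an ordinary `nn.Parameter`, so the handler's isinstance check fails, just as `colo_embedding` does when `initial_model` is built outside the gemini context.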

@ver217
Contributor

ver217 commented Apr 28, 2023

> > What is the problem? I test https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/benchmarks/benchmark_opt_lora_dummy.py with gemini strategy and no error occurs.
> > A naive torch module should be able to receive ColoTensor as well.
>
> Running train_prompts.sh with the colossalai_gemini strategy causes the following error:
>
>     File "ColossalAI/applications/Chat/coati/experience_maker/naive.py", line 25, in make_experience
>         base_action_log_probs = self.initial_model(sequences, num_actions, attention_mask)
>     ...
>     File "/lib/python3.8/site-packages/colossalai/nn/_ops/embedding.py", line 111, in colo_embedding
>         assert isinstance(weight, ColoTensor)
>
> initial_model and reward_model are not ZeroDDP modules, so their weights are not ColoTensors.

How do you run this script?

@zhang-yi-chi
Contributor Author

> > > What is the problem? I test https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/benchmarks/benchmark_opt_lora_dummy.py with gemini strategy and no error occurs.
> > > A naive torch module should be able to receive ColoTensor as well.
> >
> > Running train_prompts.sh with the colossalai_gemini strategy causes the following error:
> >
> >     File "ColossalAI/applications/Chat/coati/experience_maker/naive.py", line 25, in make_experience
> >         base_action_log_probs = self.initial_model(sequences, num_actions, attention_mask)
> >     ...
> >     File "/lib/python3.8/site-packages/colossalai/nn/_ops/embedding.py", line 111, in colo_embedding
> >         assert isinstance(weight, ColoTensor)
> >
> > initial_model and reward_model are not ZeroDDP modules, so their weights are not ColoTensors.
>
> How do you run this script?

I installed ColossalAI with CUDA_EXT=1 pip3 install -v .

Then I ran something like this under the applications/Chat folder:

cp examples/train_prompts.py .
torchrun --standalone --nproc_per_node=1 train_prompts.py \
   --strategy colossalai_gemini \
   --pretrain_dataset /projects/llm/data/coati/instinwild_en.json \
   --prompt_dataset /projects/llm/data/coati/instinwild_en.json \
   --model 'bloom' \
   --pretrain /projects/llm/Coati-BLOOM-560M \
   --rm_model 'bloom' \
   --rm_pretrain /projects/llm/hf/bloom-560m \
   --rm_path /projects/llm/Coati-RM/hh-rlhf.pt \
   --save_path /projects/llm/Coati-PROMPTS/ppo.pt

@ver217
Contributor

ver217 commented Apr 28, 2023

This issue can be simply resolved by moving with strategy.model_init_context(): to here
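A hypothetical sketch of that placement (ToyStrategy and build() are illustrative stand-ins, not ColossalAI APIs): only the models the strategy will wrap (actor, critic) are constructed inside model_init_context(), while initial_model and reward_model are constructed outside it, so their weights stay plain torch parameters rather than becoming ColoTensors.

```python
from contextlib import contextmanager

# ToyStrategy is an illustrative stand-in for the colossalai_gemini strategy.
class ToyStrategy:
    def __init__(self):
        self.in_init_context = False

    @contextmanager
    def model_init_context(self):
        # In ColossalAI's gemini strategy this context makes newly created
        # parameters ColoTensors; here we only record that it is active.
        self.in_init_context = True
        try:
            yield
        finally:
            self.in_init_context = False

strategy = ToyStrategy()
built_inside = {}

def build(name):
    # Records whether a model was created inside the init context.
    built_inside[name] = strategy.in_init_context
    return object()  # placeholder for a real nn.Module

# Accepted fix: only the models the strategy wraps go inside the context.
with strategy.model_init_context():
    actor = build("actor")
    critic = build("critic")

initial_model = build("initial_model")  # never wrapped -> plain weights
reward_model = build("reward_model")
```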

@ver217
Contributor

ver217 commented Apr 28, 2023

ColoTensor will be removed in the future, so we'd better reduce the dependency on it.

@zhang-yi-chi
Contributor Author

> This issue can be simply resolved by moving with strategy.model_init_context(): to here

That's a better solution. I submitted what you proposed.

@ver217 ver217 merged commit 2da5d81 into hpcaitech:main May 6, 2023
