[chat] Fix coati-sft-Instruction Tuning#3568

Merged
Fazziekey merged 1 commit into hpcaitech:main from NicholasCao:main
Apr 17, 2023

Conversation

@NicholasCao
Contributor

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

📝 What does this PR do?

  1. Fix coati-sft-Instruction Tuning (applications/Chat/coati/dataset/sft_dataset.py)
  • Remove multiple unnecessary copy.deepcopy calls that slowed down tokenization
  • Use tokenizer.eos_token instead of the hard-coded <|endoftext|>
  • Fix input_ids
  • Fix the wrong attribute name self.prompts
  • Fix NameError: name 'data_collator' is not defined (applications/Chat/examples/train_sft.py line 114)
  2. Fix: log only when is_rank_0 (applications/Chat/coati/trainer/sft.py)
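The eos_token fix above can be illustrated with a minimal sketch. The names below (`StubTokenizer`, `build_targets`) are illustrative stand-ins, not the actual code from sft_dataset.py; the point is only that the EOS string should come from the tokenizer rather than be hard-coded:

```python
# Hedged sketch: append the tokenizer's own EOS token instead of a
# hard-coded '<|endoftext|>' string. StubTokenizer stands in for a
# Hugging Face tokenizer; build_targets is an illustrative helper.

class StubTokenizer:
    """Minimal stand-in for a Hugging Face tokenizer."""
    eos_token = "</s>"  # e.g. a LLaMA-style EOS, not '<|endoftext|>'

def build_targets(tokenizer, prompts, completions):
    # Hard-coding '<|endoftext|>' would be wrong for any tokenizer whose
    # EOS token differs; read it from the tokenizer instead.
    return [p + c + tokenizer.eos_token for p, c in zip(prompts, completions)]

tok = StubTokenizer()
texts = build_targets(tok, ["Q: 1+1?\nA: "], ["2"])
print(texts[0])  # the sample now ends with the tokenizer's own EOS token
```

With the fixed '<|endoftext|>' string, samples built for a tokenizer like the stub above would end with a token its vocabulary does not treat as EOS, which is exactly what the PR avoids.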

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

@binmakeswell binmakeswell requested a review from Fazziekey April 17, 2023 08:26
@Fazziekey
Contributor

@NicholasCao good work, thanks a lot for your contribution

@Fazziekey Fazziekey merged commit 7788e0b into hpcaitech:main Apr 17, 2023
@allaccs

allaccs commented Apr 17, 2023

Hello, I trained a model yesterday without these changes. What are the consequences of skipping them?
Was the tokenizer wrong in the old code? Was self.input_ids wrong?

@NicholasCao
Contributor Author

NicholasCao commented Apr 17, 2023

@allaccs If your args.dataset != 'yizhongw/self_instruct', there are no errors.
Otherwise, make sure the tokenizer you use satisfies tokenizer.eos_token == '<|endoftext|>'.
The rest are some efficiency or redundancy issues that do not cause any errors.
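The check suggested above can be sketched as follows. `check_eos` and the stub tokenizer are illustrative, not part of the Coati codebase; the sketch just verifies that a tokenizer's EOS token matches the string the old code hard-coded:

```python
# Hedged sketch of the suggested compatibility check: before training on
# 'yizhongw/self_instruct' with the old code, confirm that the
# tokenizer's EOS token matches the hard-coded '<|endoftext|>'.

class Gpt2LikeTokenizer:
    """Stand-in for a tokenizer whose EOS token matches GPT-2's."""
    eos_token = "<|endoftext|>"

def check_eos(tokenizer, expected="<|endoftext|>"):
    # The old sft_dataset.py appended this fixed string, so any tokenizer
    # with a different eos_token would produce inconsistent samples.
    if tokenizer.eos_token != expected:
        raise ValueError(
            f"tokenizer.eos_token is {tokenizer.eos_token!r}, "
            f"expected {expected!r}"
        )
    return True

print(check_eos(Gpt2LikeTokenizer()))  # prints True when the tokens match
```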

@Fazziekey
Contributor


Actually, 'yizhongw/self_instruct' is a community contribution; we put it here temporarily, and we will move it to a community example.

