[chat] Fix coati-sft-Instruction Tuning#3568

Merged
Fazziekey merged 1 commit into hpcaitech:main from NicholasCao:main
Apr 17, 2023

Conversation

@NicholasCao
Contributor

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

📝 What does this PR do?

  1. Fix coati-sft-Instruction Tuning (applications/Chat/coati/dataset/sft_dataset.py)
  • Remove multiple unnecessary copy.deepcopy calls that slowed down tokenization
  • Use tokenizer.eos_token instead of the hard-coded <|endoftext|>
  • Fix input_ids
  • Fix the wrong attribute name self.prompts
  • Fix NameError: name 'data_collator' is not defined (applications/Chat/examples/train_sft.py line 114)
  2. Fix: log only when is_rank_0 (applications/Chat/coati/trainer/sft.py)
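The eos_token fix above can be illustrated with a minimal sketch. The names below (`StubTokenizer`, `build_targets`) are illustrative stand-ins, not the actual code from sft_dataset.py; the point is only that the EOS string should come from the tokenizer rather than be hard-coded:

```python
# Hedged sketch: append the tokenizer's own EOS token instead of a
# hard-coded '<|endoftext|>' string. StubTokenizer stands in for a
# Hugging Face tokenizer; build_targets is an illustrative helper.

class StubTokenizer:
    """Minimal stand-in for a Hugging Face tokenizer."""
    eos_token = "</s>"  # e.g. a LLaMA-style EOS, not '<|endoftext|>'

def build_targets(tokenizer, prompts, completions):
    # Hard-coding '<|endoftext|>' would be wrong for any tokenizer whose
    # EOS token differs; read it from the tokenizer instead.
    return [p + c + tokenizer.eos_token for p, c in zip(prompts, completions)]

tok = StubTokenizer()
texts = build_targets(tok, ["Q: 1+1?\nA: "], ["2"])
print(texts[0])  # the sample now ends with the tokenizer's own EOS token
```

With the fixed '<|endoftext|>' string, samples built for a tokenizer like the stub above would end with a token its vocabulary does not treat as EOS, which is exactly what the PR avoids.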

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

@binmakeswell binmakeswell requested a review from Fazziekey April 17, 2023 08:26
@Fazziekey
Contributor

@NicholasCao good work, thanks a lot for your contribution

@Fazziekey Fazziekey merged commit 7788e0b into hpcaitech:main Apr 17, 2023
@allaccs

allaccs commented Apr 17, 2023

Hello, I trained a model yesterday without these changes. What are the consequences of skipping them?
Was the tokenizer wrong in the old code? Was self.input_ids wrong?

@NicholasCao
Contributor Author

NicholasCao commented Apr 17, 2023

@allaccs If your args.dataset != 'yizhongw/self_instruct', there are no errors.
Otherwise, make sure the tokenizer you use satisfies tokenizer.eos_token == '<|endoftext|>'.
The rest are some efficiency or redundancy issues that do not cause any errors.
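The check suggested above can be sketched as follows. `check_eos` and the stub tokenizer are illustrative, not part of the Coati codebase; the sketch just verifies that a tokenizer's EOS token matches the string the old code hard-coded:

```python
# Hedged sketch of the suggested compatibility check: before training on
# 'yizhongw/self_instruct' with the old code, confirm that the
# tokenizer's EOS token matches the hard-coded '<|endoftext|>'.

class Gpt2LikeTokenizer:
    """Stand-in for a tokenizer whose EOS token matches GPT-2's."""
    eos_token = "<|endoftext|>"

def check_eos(tokenizer, expected="<|endoftext|>"):
    # The old sft_dataset.py appended this fixed string, so any tokenizer
    # with a different eos_token would produce inconsistent samples.
    if tokenizer.eos_token != expected:
        raise ValueError(
            f"tokenizer.eos_token is {tokenizer.eos_token!r}, "
            f"expected {expected!r}"
        )
    return True

print(check_eos(Gpt2LikeTokenizer()))  # prints True when the tokens match
```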

@Fazziekey
Contributor


Actually, 'yizhongw/self_instruct' is a community contribution; we put it here temporarily, and we will move it to a community example.

