Add support for ChatML dataset format in SFTTrainer #1208
younesbelkada merged 14 commits into huggingface:main

Conversation
younesbelkada left a comment:
Amazing addition, thanks! I went through the PR and it looks great to me!
Can you add a few lines in the documentation to explain what it does under the hood? I think the doc section should live inside the SFTTrainer docs - wdyt?

I haven't worked on the documentation yet, since I wanted to see what you think. Will work on the docs next.
younesbelkada left a comment:
Looking great to me, thanks! I just left one question on the documentation.
| {"prompt": "<prompt text>", "completion": "<ideal generated text>"} | ||
| ``` | ||
|
|
||
| If your dataset uses one of the above formats, you can directly pass it to the trainer without pre-processing. The [`SFTTrainer`] will then format the dataset for you using the defined format from the model's tokenizer with the [apply_chat_template](https://huggingface.co/docs/transformers/main/en/chat_templating#templates-for-chat-models) method. |
younesbelkada: Maybe worth adding a line saying that the user needs to make sure that the tokenizer supports apply_chat_template - otherwise it'll fail, I think, no?

philschmid: It is always supported. If there is no template defined, it falls back to a default template, which is the ChatML format from OpenAI. cc @Rocketknight1 to confirm

Rocketknight1: Yep, the default format for tokenizers with no chat_template or class-level default_chat_template is ChatML.
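For context, a minimal sketch of the fallback behavior described above. The model name is purely illustrative; the rendered output assumes the ChatML default that applied at the time of this PR (newer transformers versions may require an explicit template):

```python
from transformers import AutoTokenizer

# gpt2 ships no chat_template, so at the time of this PR transformers
# falls back to the default ChatML template (illustrative example).
tokenizer = AutoTokenizer.from_pretrained("gpt2")

messages = [
    {"role": "user", "content": "What is TRL?"},
    {"role": "assistant", "content": "A library for post-training transformers."},
]
print(tokenizer.apply_chat_template(messages, tokenize=False))
# Expected ChatML-style rendering:
# <|im_start|>user
# What is TRL?<|im_end|>
# <|im_start|>assistant
# A library for post-training transformers.<|im_end|>
```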
lvwerra left a comment:
Thanks a lot @philschmid! Overall very clean, just left a few small nits.
younesbelkada left a comment:
Still looks really good to me! Is it OK to merge, @philschmid?
Commits:

* Add support for ChatML dataset format in SFTTrainer
* fix formatting
* fix tests
* more comment
* fix intent
* fix doc string
* Update dataset_formatting.py
* Update dataset_formatting.py
* add documentation
* Update sft_trainer.mdx
* add leonardos comment and more tests
* added more tests and fixed batching
* style
* comment in
What does this PR do?
This PR adds support for standardized dataset formats to be automatically formatted for training in the `SFTTrainer`, using `apply_chat_template` from `transformers`. This allows users to pass a dataset to the `SFTTrainer` without the need for a `formatting_func`. Example below.

In the init method the `SFTTrainer` tries to find the correct formatting function based on the dataset structure. Currently supported dataset formats are:

- `ChatML` with [{"role": str, "content": str}]
- `instruction` with [{"prompt": str, "completion": str}]

Based on the dataset it returns a callable which uses the `tokenizer` of the model and the corresponding `apply_chat_template` method. This allows continued fine-tuning of, e.g., Llama2-chat or other models which already have a defined format.
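A minimal sketch of that usage, assuming a ChatML-style dataset stored in a "messages" column; the column name, model, and settings here are illustrative assumptions, not taken from the PR:

```python
from datasets import Dataset
from transformers import AutoTokenizer
from trl import SFTTrainer

# Illustrative ChatML-style dataset: one conversation per row, assuming a
# "messages" column (column name is an assumption for this sketch).
dataset = Dataset.from_dict({
    "messages": [
        [
            {"role": "user", "content": "What is the capital of France?"},
            {"role": "assistant", "content": "Paris."},
        ],
    ]
})

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")  # any causal LM

# No formatting_func passed: the trainer detects the dataset structure and
# formats each example with tokenizer.apply_chat_template.
trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_seq_length=512,
)
trainer.train()
```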
tokenizerof the model and the correspondingapply_chat_templatemethod. This allows continues fine-tuning for, e.g. Llama2-chat or other models which already have a defined format.The nice part about is that you can use the "extras" outside of the SFFTrainer, e.g. if you want to format DPO datasets with the methods.