Feature request
Extend `tokenizer.apply_chat_template` with functionality for training/finetuning, returning `attention_masks` and (optional) `labels` (for ignoring "System" and "User" messages during loss computation).
I think this requires the following steps:

- Adding support for taking in a batch of conversations (e.g., `List[Conversation := List[Dict[str, str]]]`).
- Invoking the native `tokenizer.__call__()` after applying the template to each example (passing through `padding`, `truncation`, and any other parameters).
- Important: Adding an optional output for `labels` -- a "masked" version of the returned `input_ids` with tokens corresponding to the System/User roles set to be ignored for loss computation (e.g., set to `IGNORE_INDEX = -100`).
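The masking step above could be sketched roughly as follows. Note this is a minimal illustration, not an existing `transformers` API: `toy_tokenize` and `build_inputs_with_labels` are hypothetical stand-ins, and a real implementation would render the chat template and call `tokenizer.__call__()` instead.

```python
# Sketch of the proposed labels-masking behavior (illustrative only).

IGNORE_INDEX = -100  # the default ignore_index of PyTorch's CrossEntropyLoss

_VOCAB = {}

def toy_tokenize(text):
    # Stand-in for a real tokenizer: one "token id" per whitespace-separated
    # word, assigned on first sight.
    return [_VOCAB.setdefault(word, len(_VOCAB)) for word in text.split()]

def build_inputs_with_labels(conversation):
    """Tokenize one conversation, masking non-assistant tokens in labels."""
    input_ids, labels = [], []
    for message in conversation:
        ids = toy_tokenize(message["content"])
        input_ids.extend(ids)
        # Only assistant tokens contribute to the loss; System/User tokens
        # are set to IGNORE_INDEX so the loss function skips them.
        if message["role"] == "assistant":
            labels.extend(ids)
        else:
            labels.extend([IGNORE_INDEX] * len(ids))
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }

conversation = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi there"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
batch = build_inputs_with_labels(conversation)
```

Batch padding would presumably extend `attention_mask` with zeros and `labels` with `IGNORE_INDEX`, in line with how existing label-aware collators handle padding.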
Motivation
The new `tokenizer.apply_chat_template` feature is great, and resolves a lot of ambiguity when it comes to formatting inputs for chat-based LLMs.
However, right now it's geared toward inference-time usage, only taking a single "conversation" and outputting the `input_ids` (tokens) after applying the chat template.
When finetuning models on chat-based data, it would be really nice to unify the `apply_chat_template` API with the `tokenizer.__call__()` API, returning `attention_masks` and (optionally) `labels` (with "System" and "User" role text automatically ignored for loss computation).
Your contribution
I can try building a proof-of-concept for a "standard" workflow and a Draft PR; I think there would need to be a few discussions about the actual implementation details, though!