
Extend Chat Template Tokenization for Training/Finetuning #27609

@siddk

Feature request

Extend tokenizer.apply_chat_template with functionality for training/finetuning, returning attention_masks and (optional) labels (for ignoring "System" and "User" messages during loss computation).
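
For concreteness, a call under the proposed behavior might look something like the sketch below. The batched input and the return_labels flag are illustrative assumptions for this issue, not an existing transformers API:

```python
# Hypothetical usage -- flag names and return behavior are illustrative only.
conversations = [
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ],
    # ... more conversations ...
]

batch = tokenizer.apply_chat_template(
    conversations,        # a batch of conversations, not a single one
    padding=True,
    truncation=True,
    max_length=2048,
    return_labels=True,   # hypothetical flag: mask System/User tokens with -100
    return_tensors="pt",
)
# `batch` would then contain input_ids, attention_mask, and labels
```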

I think this requires the following steps (a rough proof-of-concept sketch follows the list):

  • Adding support for taking in a batch of conversations (e.g., List[Conversation], where Conversation := List[Dict[str, str]]).
  • Invoking the native tokenizer.__call__() after applying the template to each example (passing through padding, truncation, and any other parameters).
  • Important: Adding an optional output for labels -- a "masked" version of the returned input_ids, with tokens corresponding to the System/User roles ignored for loss computation (e.g., set to IGNORE_INDEX = -100).
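
As a rough sketch of what that workflow could look like outside the tokenizer today (assuming right padding, a set pad token, and a prefix-tokenization strategy for locating the assistant spans; the helper name and IGNORE_INDEX constant are mine, not part of transformers):

```python
from typing import Dict, List

import torch
from transformers import PreTrainedTokenizerBase

IGNORE_INDEX = -100  # the index torch.nn.CrossEntropyLoss ignores by default


def apply_chat_template_for_training(
    tokenizer: PreTrainedTokenizerBase,
    conversations: List[List[Dict[str, str]]],
    max_length: int = 2048,
) -> Dict[str, torch.Tensor]:
    """Render, tokenize, and label a batch of conversations for finetuning."""
    texts, per_example_labels = [], []
    for conv in conversations:
        text = tokenizer.apply_chat_template(conv, tokenize=False)
        ids = tokenizer(text, truncation=True, max_length=max_length)["input_ids"]
        labels = list(ids)

        # Approximate each message's token span by tokenizing successive prefixes
        # of the rendered conversation; mask every non-assistant span with -100.
        # (Tokens can merge across message boundaries, so spans are approximate.)
        prev_len = 0
        for i, message in enumerate(conv):
            prefix = tokenizer.apply_chat_template(conv[: i + 1], tokenize=False)
            cur_len = len(
                tokenizer(prefix, truncation=True, max_length=max_length)["input_ids"]
            )
            if message["role"] != "assistant":
                for j in range(prev_len, min(cur_len, len(labels))):
                    labels[j] = IGNORE_INDEX
            prev_len = cur_len

        texts.append(text)
        per_example_labels.append(labels)

    # Batch-tokenize with the normal __call__ so padding/truncation behave as usual
    # (assumes tokenizer.padding_side == "right" and a defined pad token).
    batch = tokenizer(
        texts, padding=True, truncation=True, max_length=max_length, return_tensors="pt"
    )
    label_tensor = torch.full_like(batch["input_ids"], IGNORE_INDEX)
    for row, labels in enumerate(per_example_labels):
        label_tensor[row, : len(labels)] = torch.tensor(labels)
    batch["labels"] = label_tensor
    return batch
```

Doing this inside apply_chat_template itself would be cleaner, since the template already knows where each assistant turn begins and ends, whereas the prefix-tokenization trick above can be off by a token at message boundaries.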

Motivation

The new tokenizer.apply_chat_template feature is great, and resolves a lot of ambiguity when it comes to formatting inputs for chat-based LLMs.

However, right now it's geared toward inference-time usage: it takes a single "conversation" and outputs only the input_ids (tokens) after applying the chat template.

When finetuning models on chat-based data, it would be really nice to unify the apply_chat_template API with the tokenizer.__call__() API, returning attention_masks and (optionally) labels (with "System" and "User" role text automatically ignored for loss computation).

Your contribution

I can try building a proof-of-concept for a "standard" workflow and opening a draft PR; I think there would need to be a few discussions about the actual implementation details, though!
