System Info
- transformers version: 4.44.1
- Platform: Linux-4.18.0-553.8.1.el8_10.x86_64-x86_64-with-glibc2.28
- Python version: 3.10.14
- Huggingface_hub version: 0.24.5
- Safetensors version: 0.4.4
- Accelerate version: 0.33.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: NVIDIA A100-SXM4-80GB
Who can help?
@ArthurZucker
I noticed that apply_chat_template on the PreTrainedTokenizerBase class does not work correctly when return_assistant_tokens_mask=True. We would expect to get back, for each example, a mask in which 1 indicates that the token is part of an assistant message and 0 otherwise. This is the case for the Llama 2 tokenizer, for example. I am sharing a minimal example to reproduce this issue.
Looking deeper into the apply_chat_template method, the issue seems related to the char_to_token method of the tokenizers.Encoding class, and could be connected to the fact that the Llama 3 tokenizer was trained with tiktoken as opposed to sentencepiece.
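For context, apply_chat_template maps each {% generation %} character span onto token indices via char_to_token. Below is a simplified, paraphrased sketch of that logic as of transformers 4.44 (names are illustrative, not the verbatim source):
def build_assistant_mask(encoding, generation_indices, num_tokens):
    # encoding: the BatchEncoding for one rendered chat
    # generation_indices: (start_char, end_char) spans of the assistant messages
    mask = [0] * num_tokens
    for start_char, end_char in generation_indices:
        start_token = encoding.char_to_token(0, start_char)
        end_token = encoding.char_to_token(0, end_char - 1)
        if start_token is None:
            # char_to_token could not resolve the span; nothing gets marked
            break
        end = end_token if end_token is not None else num_tokens - 1
        for token_idx in range(start_token, end + 1):
            mask[token_idx] = 1
    return mask
With the Llama 3 tokenizer the very first char_to_token call already returns None (demonstrated under Reproduction below), so the mask is left all zeros.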
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
from transformers import AutoTokenizer
from datasets import load_dataset
dataset_name = "m-a-p/Code-Feedback"
model_name = "meta-llama/Meta-Llama-3.1-8B" # apply_chat_template does not work correctly
#model_name = "meta-llama/Llama-2-7b-hf" # apply_chat_template works correctly
chat_template = """{% if messages[0]['role'] == 'system' %}
{% set offset = 1 %}
{% else %}
{% set offset = 0 %}
{% endif %}
{% for message in messages %}
{% if (message['role'] == 'user') != (loop.index0 % 2 == offset) %}
{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
{% endif %}
{{ '### ' + message['role'] + ':\n'}}
{% if (message['role'] == 'assistant') %}
{% generation %} {{ message['content'] | trim + eos_token }} {% endgeneration %}
{% else %}
{{ message['content'] | trim + eos_token }}
{% endif %}
{% endfor %}
{% if add_generation_prompt %}
{{ '### ' + 'assistant' + ':\n' }}
{% endif %}"""
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.chat_template = chat_template
datasets = load_dataset(dataset_name, trust_remote_code=True)
# assistant_mask is all zeros for llama3 tokenizer
chat = tokenizer.apply_chat_template(
    datasets["train"][0]["messages"],
    add_generation_prompt=False,
    return_dict=True,
    tokenize=True,
    return_assistant_tokens_mask=True,
)
print("assistant_masks", chat["assistant_masks"])
Executing the steps that apply_chat_template uses internally to build the assistant mask shows that the char_to_token method of the tokenizers.Encoding class does not seem to work correctly:
compiled_template = tokenizer._compile_jinja_template(chat_template)
template_kwargs = {**tokenizer.special_tokens_map}
rendered_chat, generation_indices = tokenizer._render_with_assistant_indices(
    compiled_template=compiled_template,
    messages=datasets["train"][0]["messages"],
    tools=[],
    documents=None,
    add_generation_prompt=False,
    **template_kwargs,
)
out = tokenizer(
    rendered_chat,
    padding=False,
    truncation=False,
    max_length=None,
    add_special_tokens=False,
    return_tensors=None,
)
first_assistant_start_char, first_assistant_end_char = generation_indices[0]
# char_to_token on the BatchEncoding (batch index 0, char index) mirrors the
# internal call in apply_chat_template; it returns None for llama3
print("char_to_token", out.char_to_token(0, first_assistant_start_char))
Expected behavior
If we assume that the tokenized chat is 10 tokens long and the assistant tokens occupy positions 4-6 and 8-9 (1-indexed), the expected output would look like this:
[0, 0, 0, 1, 1, 1, 0, 1, 1, 0]
The actual output for the Llama 3 tokenizer is always all zeros:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]