added mllama doc #37647
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the **Ready for review** button.
> Mllama has an extra token used as a placeholder for image positions in the text. This means the input ids and the input embedding layer have one extra token. But since the weights for the input and output embeddings are not tied, the `lm_head` layer has one token less and will fail if you calculate loss on image tokens or apply logit processors to them. If you are training, make sure to mask out the special `"<|image|>"` tokens in the `labels`, as the model should not be trained on predicting them.
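The label masking described in the note can be sketched with plain tensors. The image token id below is a placeholder assumption; in practice, look it up with `processor.tokenizer.convert_tokens_to_ids("<|image|>")`:

```python
import torch

# Hypothetical id for "<|image|>"; in practice use
# processor.tokenizer.convert_tokens_to_ids("<|image|>")
IMAGE_TOKEN_ID = 128256

input_ids = torch.tensor([[128000, 128256, 3923, 1587, 279, 2217, 1501, 30]])
labels = input_ids.clone()
# -100 is the ignore_index of PyTorch's cross-entropy loss,
# so the model is never trained to predict the image placeholder
labels[labels == IMAGE_TOKEN_ID] = -100
```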
```python
from transformers import pipeline
```
Let's use a real image for the example here:

```python
import torch
from transformers import pipeline

pipeline = pipeline(
    task="image-text-to-text",
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    device=0,
    torch_dtype=torch.bfloat16
)
messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
                {"type": "text", "text": "What does the image show?"}
            ]
        }
    ],
]
pipeline(text=messages, return_full_text=False)
```

```python
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor
from transformers import AutoModelForCausalLM
```
Use the `BitsAndBytesConfig`:

```python
import torch
from transformers import BitsAndBytesConfig, MllamaForConditionalGeneration, AutoProcessor

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = MllamaForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    quantization_config=bnb_config
)
processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
                {"type": "text", "text": "What does the image show?"}
            ]
        }
    ],
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to("cuda")

output = model.generate(**inputs, max_new_tokens=25)
print(processor.decode(output[0]))
```

- When training, mask out the `<|image|>` tokens in labels
- For CUDA index errors during generation, expand the `lm_head`:
Indent this code block so it falls under the last list item
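A minimal sketch of what "expand the `lm_head`" means, using a toy linear layer rather than the real checkpoint; the sizes here are made up, and the real model would use its own `hidden_size` and vocabulary size:

```python
import torch
import torch.nn as nn

hidden_size, vocab_size = 4, 8  # toy sizes, not the real Mllama dims

# the untied lm_head is one row short of the full vocab (no "<|image|>" row)
lm_head = nn.Linear(hidden_size, vocab_size - 1, bias=False)

# build a head with one extra row and copy the old weights over
new_head = nn.Linear(hidden_size, vocab_size, bias=False)
with torch.no_grad():
    new_head.weight[: vocab_size - 1] = lm_head.weight
    new_head.weight[vocab_size - 1].zero_()  # init the new "<|image|>" row
```

On a real `PreTrainedModel` the same swap would go through `model.get_output_embeddings()` and `model.set_output_embeddings(...)` rather than touching the weight tensor directly.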
## MllamaForCausalLM

[[autodoc]] MllamaForCausalLM
    - forward

## MllamaVisionModel

[[autodoc]] MllamaVisionModel
    - forward
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
@stevhliu Thanks a lot for the help, mate. I used AI a bit here and there thinking it could do a better job; looks like it just gave you more work. I'm new to all this and will keep it organic from now on. Thanks! Let me know if I can make any more edits, though.
## Usage Example

For quantized inference, use `BitsAndBytesConfig`:
Suggested change:

```diff
- For quantized inference, use `BitsAndBytesConfig`:
+ Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
+ The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to 4-bits.
```
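As a rough back-of-envelope for why 4-bit quantization helps here (the numbers are approximate and ignore activations, the KV cache, and quantization overhead):

```python
params = 11e9  # ~11B parameters in Llama-3.2-11B-Vision

bf16_gb = params * 2 / 1e9    # bf16: 2 bytes per weight -> ~22 GB
nf4_gb = params * 0.5 / 1e9   # nf4: 4 bits per weight  -> ~5.5 GB
print(bf16_gb, nf4_gb)
```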
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
```python
model_id = "meta-llama/Llama-3.2-11B-Vision"
model = MllamaForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)
```

```html
<div class="flex justify-center">
```

We can remove this image and replace it with the quantization example in this comment:
```python
import torch
from transformers import BitsAndBytesConfig, MllamaForConditionalGeneration, AutoProcessor

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = MllamaForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    quantization_config=bnb_config
)
processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
                {"type": "text", "text": "What does the image show?"}
            ]
        }
    ],
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to("cuda")

output = model.generate(**inputs, max_new_tokens=25)
print(processor.decode(output[0]))
```
Separate from the AutoModel example and outside of the `<hfoption>` block, you should have a separate code example for quantization as shown in the code snippet above.

The image hasn't been removed yet.
```python
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
from transformers import BitsAndBytesConfig, MllamaForConditionalGeneration, AutoProcessor
```
Suggested change:

```diff
- from transformers import BitsAndBytesConfig, MllamaForConditionalGeneration, AutoProcessor
+ from transformers import MllamaForConditionalGeneration, AutoProcessor
```
The AutoModel example shouldn't show quantization usage, so it was fine the way it was before. I was just removing `BitsAndBytesConfig` from the import.
is it fine now?
No need to add all these extra lines at the end either
What does this PR do?

As suggested in this issue #issue-2947704577, this PR updates the documentation of the Mllama model to align it with the standardized format for all the docs.

Worked on mllama and used AI, so please let me know even if you need a complete rewrite.
Please let me know if there are any changes to be done, and do share references, if any, for those changes.

Documentation: @stevhliu