From 8325a43018a0265a091ca74555aa23f3f5a41b23 Mon Sep 17 00:00:00 2001
From: Parag Ekbote
Date: Mon, 31 Mar 2025 17:25:43 +0000
Subject: [PATCH 01/10] Update model card for jamba

---
 docs/source/en/model_doc/jamba.md | 129 +++++++++++++++++-------------
 1 file changed, 75 insertions(+), 54 deletions(-)

diff --git a/docs/source/en/model_doc/jamba.md b/docs/source/en/model_doc/jamba.md
index c8d66b163b5a..cf79321f70eb 100644
--- a/docs/source/en/model_doc/jamba.md
+++ b/docs/source/en/model_doc/jamba.md
@@ -22,88 +22,109 @@ rendered properly in your Markdown viewer.
 SDPA
-## Overview
+[Jamba](https://huggingface.co/papers/2403.19887) is a family of state-of-the-art, hybrid SSM-Transformer large language models ranging from 51B to 399B parameters. [Jamba-v0.1](https://huggingface.co/ai21labs/Jamba-v0.1) is a pretrained, mixture-of-experts (MoE) generative text model, with 12B active parameters and a total of 52B parameters across all experts. It supports a 256K context length, and can fit up to 140K tokens on a single 80GB GPU. [Jamba-Mini-1.6](https://huggingface.co/ai21labs/AI21-Jamba-Mini-1.6) comprises 52 billion parameters in total, with 12 billion active during inference, and [Jamba-Large-1.6](https://huggingface.co/ai21labs/AI21-Jamba-Large-1.6) has a total of 398 billion parameters, with 94 billion active during inference.
-Jamba is a state-of-the-art, hybrid SSM-Transformer LLM. It is the first production-scale Mamba implementation, which opens up interesting research and application opportunities. While this initial experimentation shows encouraging gains, we expect these to be further enhanced with future optimizations and explorations.
+Jamba's architecture features a blocks-and-layers approach that allows Jamba to successfully integrate Transformer and Mamba architectures altogether. Each Jamba block contains either an attention or a Mamba layer, followed by a multi-layer perceptron (MLP), producing an overall ratio of one Transformer layer out of every eight total layers.
-For full details of this model please read the [release blog post](https://www.ai21.com/blog/announcing-jamba).
+
-### Model Details
+> [!TIP]
+> Click on the Jamba models in the right sidebar for more examples of how to apply Jamba to different language tasks.
-Jamba is a pretrained, mixture-of-experts (MoE) generative text model, with 12B active parameters and an overall of 52B parameters across all experts. It supports a 256K context length, and can fit up to 140K tokens on a single 80GB GPU.
+The example below demonstrates how to generate text with [`Pipeline`] or the [`AutoModel`], and from the command line.
-As depicted in the diagram below, Jamba's architecture features a blocks-and-layers approach that allows Jamba to successfully integrate Transformer and Mamba architectures altogether. Each Jamba block contains either an attention or a Mamba layer, followed by a multi-layer perceptron (MLP), producing an overall ratio of one Transformer layer out of every eight total layers.
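The one-in-eight attention ratio described above is not hard-coded; it comes from configuration fields. The sketch below assumes the `attn_layer_period`, `attn_layer_offset`, `expert_layer_period`, and `expert_layer_offset` fields of [`JambaConfig`] and their default values (verify these names against your installed version), and prints which layers are attention, Mamba, or MoE:

```py
from transformers import JambaConfig

# Default config: attention every 8th layer (offset 4), and a MoE block
# replacing the MLP on every other layer (offset 1), per the fields above.
config = JambaConfig()

for i in range(config.num_hidden_layers):
    block = "attention" if i % config.attn_layer_period == config.attn_layer_offset else "mamba"
    moe = " + MoE" if i % config.expert_layer_period == config.expert_layer_offset else ""
    print(f"layer {i:2d}: {block}{moe}")
```

With the defaults this prints four attention layers out of 32, matching the one-in-eight ratio.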
+ + - +```py +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer -## Usage +model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-Mini-1.6", + torch_dtype=torch.bfloat16, + attn_implementation="flash_attention_2", + device_map="auto") -### Prerequisites +tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-Mini-1.6") -Jamba requires you use `transformers` version 4.39.0 or higher: -```bash -pip install transformers>=4.39.0 -``` +messages = [ + {"role": "system", "content": "You are an ancient oracle who speaks in cryptic but wise phrases, always hinting at deeper meanings."}, + {"role": "user", "content": "Hello!"}, +] -In order to run optimized Mamba implementations, you first need to install `mamba-ssm` and `causal-conv1d`: -```bash -pip install mamba-ssm causal-conv1d>=1.2.0 +input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to(model.device) + +outputs = model.generate(input_ids, max_new_tokens=216) + +# Decode the output +conversation = tokenizer.decode(outputs[0], skip_special_tokens=True) + +# Split the conversation to get only the assistant's response +assistant_response = conversation.split(messages[-1]['content'])[1].strip() +print(assistant_response) +# Output: Seek and you shall find. The path is winding, but the journey is enlightening. What wisdom do you seek from the ancient echoes? ``` -You also have to have the model on a CUDA device. -You can run the model not using the optimized Mamba kernels, but it is **not** recommended as it will result in significantly lower latencies. In order to do that, you'll need to specify `use_mamba_kernels=False` when loading the model. + + -### Run the model -```python +```py +import torch from transformers import AutoModelForCausalLM, AutoTokenizer -model = AutoModelForCausalLM.from_pretrained("ai21labs/Jamba-v0.1") -tokenizer = AutoTokenizer.from_pretrained("ai21labs/Jamba-v0.1") +from transformers import AutoModelForCausalLM, BitsAndBytesConfig +quantization_config = BitsAndBytesConfig(load_in_8bit=True, + llm_int8_skip_modules=["mamba"]) -input_ids = tokenizer("In the recent Super Bowl LVIII,", return_tensors='pt').to(model.device)["input_ids"] +# a device map to distribute the model evenly across 8 GPUs +device_map = {'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 1, 'model.layers.10': 1, 'model.layers.11': 1, 'model.layers.12': 1, 'model.layers.13': 1, 'model.layers.14': 1, 'model.layers.15': 1, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 2, 'model.layers.19': 2, 'model.layers.20': 2, 'model.layers.21': 2, 'model.layers.22': 2, 'model.layers.23': 2, 'model.layers.24': 2, 'model.layers.25': 2, 'model.layers.26': 2, 'model.layers.27': 3, 'model.layers.28': 3, 'model.layers.29': 3, 'model.layers.30': 3, 'model.layers.31': 3, 'model.layers.32': 3, 'model.layers.33': 3, 'model.layers.34': 3, 'model.layers.35': 3, 'model.layers.36': 4, 'model.layers.37': 4, 'model.layers.38': 4, 'model.layers.39': 4, 'model.layers.40': 4, 'model.layers.41': 4, 'model.layers.42': 4, 'model.layers.43': 4, 'model.layers.44': 4, 'model.layers.45': 5, 'model.layers.46': 5, 'model.layers.47': 5, 'model.layers.48': 5, 'model.layers.49': 5, 'model.layers.50': 5, 'model.layers.51': 5, 'model.layers.52': 5, 'model.layers.53': 5, 'model.layers.54': 6, 'model.layers.55': 6, 
'model.layers.56': 6, 'model.layers.57': 6, 'model.layers.58': 6, 'model.layers.59': 6, 'model.layers.60': 6, 'model.layers.61': 6, 'model.layers.62': 6, 'model.layers.63': 7, 'model.layers.64': 7, 'model.layers.65': 7, 'model.layers.66': 7, 'model.layers.67': 7, 'model.layers.68': 7, 'model.layers.69': 7, 'model.layers.70': 7, 'model.layers.71': 7, 'model.final_layernorm': 7, 'lm_head': 7} +model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-Large-1.6", + torch_dtype=torch.bfloat16, + attn_implementation="flash_attention_2", + quantization_config=quantization_config, + device_map=device_map) -outputs = model.generate(input_ids, max_new_tokens=216) +tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-Large-1.6") -print(tokenizer.batch_decode(outputs)) -# ["<|startoftext|>In the recent Super Bowl LVIII, the Kansas City Chiefs emerged victorious, defeating the San Francisco 49ers in a thrilling overtime showdown. The game was a nail-biter, with both teams showcasing their skills and determination.\n\nThe Chiefs, led by their star quarterback Patrick Mahomes, displayed their offensive prowess, while the 49ers, led by their strong defense, put up a tough fight. The game went into overtime, with the Chiefs ultimately securing the win with a touchdown.\n\nThe victory marked the Chiefs' second Super Bowl win in four years, solidifying their status as one of the top teams in the NFL. The game was a testament to the skill and talent of both teams, and a thrilling end to the NFL season.\n\nThe Super Bowl is not just about the game itself, but also about the halftime show and the commercials. This year's halftime show featured a star-studded lineup, including Usher, Alicia Keys, and Lil Jon. The show was a spectacle of music and dance, with the performers delivering an energetic and entertaining performance.\n"] -``` +messages = [ + {"role": "system", "content": "You are an ancient oracle who speaks in cryptic but wise phrases, always hinting at deeper meanings."}, + {"role": "user", "content": "Hello!"}, +] + +input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to(model.device) -
-Loading the model in half precision +outputs = model.generate(input_ids, max_new_tokens=216) -The published checkpoint is saved in BF16. In order to load it into RAM in BF16/FP16, you need to specify `torch_dtype`: +# Decode the output +conversation = tokenizer.decode(outputs[0], skip_special_tokens=True) -```python -from transformers import AutoModelForCausalLM -import torch -model = AutoModelForCausalLM.from_pretrained("ai21labs/Jamba-v0.1", torch_dtype=torch.bfloat16) -# you can also use torch_dtype=torch.float16 +# Split the conversation to get only the assistant's response +assistant_response = conversation.split(messages[-1]['content'])[1].strip() +print(assistant_response) +# Output: Seek and you shall find. The path is winding, but the journey is enlightening. What wisdom do you seek from the ancient echoes? ``` -When using half precision, you can enable the [FlashAttention2](https://github.com/Dao-AILab/flash-attention) implementation of the Attention blocks. In order to use it, you also need the model on a CUDA device. Since in this precision the model is to big to fit on a single 80GB GPU, you'll also need to parallelize it using [accelerate](https://huggingface.co/docs/accelerate/index): -```python -from transformers import AutoModelForCausalLM -import torch -model = AutoModelForCausalLM.from_pretrained("ai21labs/Jamba-v0.1", - torch_dtype=torch.bfloat16, - attn_implementation="flash_attention_2", - device_map="auto") + + + +```bash +echo -e "Plants create energy through a process known as" | transformers-cli run --task text-generation --model ai21labs/AI21-Jamba-Mini-1.6 --device 0 ``` -
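The hand-written 8-GPU `device_map` in the quantized example above is tedious and easy to get wrong. Here is a minimal sketch that rebuilds the same mapping programmatically, assuming the 72-decoder-layer layout shown above with nine layers per GPU (verify the layer count for your checkpoint):

```py
# Rebuild the 8-GPU device map from the example above.
# Assumes 72 decoder layers split evenly across 8 GPUs (9 layers each).
num_layers, num_gpus = 72, 8
layers_per_gpu = num_layers // num_gpus

device_map = {"model.embed_tokens": 0}
for i in range(num_layers):
    device_map[f"model.layers.{i}"] = i // layers_per_gpu
device_map["model.final_layernorm"] = num_gpus - 1
device_map["lm_head"] = num_gpus - 1
```

Passing this dict as `device_map=device_map` in [`~AutoModel.from_pretrained`] gives the same placement as the hand-written version.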
-
Load the model in 8-bit + + -**Using 8-bit precision, it is possible to fit up to 140K sequence lengths on a single 80GB GPU.** You can easily quantize the model to 8-bit using [bitsandbytes](https://huggingface.co/docs/bitsandbytes/index). In order to not degrade model quality, we recommend to exclude the Mamba blocks from the quantization: +## Notes -```python -from transformers import AutoModelForCausalLM, BitsAndBytesConfig -quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_skip_modules=["mamba"]) -model = AutoModelForCausalLM.from_pretrained( - "ai21labs/Jamba-v0.1", torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2", quantization_config=quantization_config -) +In order to run optimized Mamba implementations, you first need to install `mamba-ssm` and `causal-conv1d`: +```bash +pip install mamba-ssm causal-conv1d>=1.2.0 ``` -
+
+You can run the model not using the optimized Mamba kernels, but it is **not** recommended as it will result in significantly lower latencies. In order to do that, you'll need to specify `use_mamba_kernels=False` when loading the model. It is also recommended to read this [article](https://huggingface.co/docs/accelerate/usage_guides/big_modeling) on how to perform inference
+for big models.
+
 ## JambaConfig

From efa7dd05ce743f2e4d4fb0209a95fee6c4335bec Mon Sep 17 00:00:00 2001
From: Parag Ekbote
Date: Wed, 2 Apr 2025 19:23:01 +0530
Subject: [PATCH 02/10] Apply the suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/jamba.md | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/docs/source/en/model_doc/jamba.md b/docs/source/en/model_doc/jamba.md
index cf79321f70eb..1e384dcae395 100644
--- a/docs/source/en/model_doc/jamba.md
+++ b/docs/source/en/model_doc/jamba.md
@@ -22,22 +22,25 @@ rendered properly in your Markdown viewer.
 SDPA
-[Jamba](https://huggingface.co/papers/2403.19887) is a family of state-of-the-art, hybrid SSM-Transformer large language models ranging from 51B to 399B parameters. [Jamba-v0.1](https://huggingface.co/ai21labs/Jamba-v0.1) is a pretrained, mixture-of-experts (MoE) generative text model, with 12B active parameters and a total of 52B parameters across all experts. It supports a 256K context length, and can fit up to 140K tokens on a single 80GB GPU. [Jamba-Mini-1.6](https://huggingface.co/ai21labs/AI21-Jamba-Mini-1.6) comprises 52 billion parameters in total, with 12 billion active during inference, and [Jamba-Large-1.6](https://huggingface.co/ai21labs/AI21-Jamba-Large-1.6) has a total of 398 billion parameters, with 94 billion active during inference.
+[Jamba](https://huggingface.co/papers/2403.19887) is a hybrid Transformer-Mamba mixture-of-experts (MoE) language model ranging from 52B to 398B total parameters. This model aims to combine the advantages of both model families, the performance of transformer models and the efficiency and longer context (256K tokens) of state space models (SSMs) like Mamba.
-Jamba's architecture features a blocks-and-layers approach that allows Jamba to successfully integrate Transformer and Mamba architectures altogether. Each Jamba block contains either an attention or a Mamba layer, followed by a multi-layer perceptron (MLP), producing an overall ratio of one Transformer layer out of every eight total layers.
+Jamba's architecture features a blocks-and-layers approach that allows Jamba to successfully integrate Transformer and Mamba architectures altogether. Each Jamba block contains either an attention or a Mamba layer, followed by a multi-layer perceptron (MLP), producing an overall ratio of one Transformer layer out of every eight total layers. MoE layers are mixed in to increase model capacity.
-drawing
 > [!TIP]
 > Click on the Jamba models in the right sidebar for more examples of how to apply Jamba to different language tasks.
-The example below demonstrates how to generate text with [`Pipeline`] or the [`AutoModel`], and from the command line.
+The example below demonstrates how to generate text with [`Pipeline`], [`AutoModel`], and from the command line.
```py +# !pip install -U flash-attn --no-build-isolation +# install optimized Mamba implementations +# !pip install mamba-ssm causal-conv1d>=1.2.0 import torch from transformers import AutoModelForCausalLM, AutoTokenizer @@ -53,7 +56,7 @@ messages = [ {"role": "user", "content": "Hello!"}, ] -input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to(model.device) +input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to("cuda") outputs = model.generate(input_ids, max_new_tokens=216) @@ -71,9 +74,8 @@ print(assistant_response) ```py import torch -from transformers import AutoModelForCausalLM, AutoTokenizer +from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer -from transformers import AutoModelForCausalLM, BitsAndBytesConfig quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_skip_modules=["mamba"]) @@ -92,7 +94,7 @@ messages = [ {"role": "user", "content": "Hello!"}, ] -input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to(model.device) +input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to("cuda") outputs = model.generate(input_ids, max_new_tokens=216) @@ -109,7 +111,7 @@ print(assistant_response) ```bash -echo -e "Plants create energy through a process known as" | transformers-cli run --task text-generation --model ai21labs/AI21-Jamba-Mini-1.6 --device 0 +echo -e "Plants create energy through a process known as" | transformers-cli run --task text-generation --model ai21labs/AI21-Jamba-Mini-1.6 --torch_dtype auto --attn_implementation flash_attention_2 --device 0 ``` From 5b8070405d7048f0d288283cceb30ef02b64bd97 Mon Sep 17 00:00:00 2001 From: Parag Ekbote Date: Wed, 2 Apr 2025 19:32:55 +0530 Subject: [PATCH 03/10] Apply suggestions from code review-2 Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/jamba.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/docs/source/en/model_doc/jamba.md b/docs/source/en/model_doc/jamba.md index 1e384dcae395..727ccd3d29b2 100644 --- a/docs/source/en/model_doc/jamba.md +++ b/docs/source/en/model_doc/jamba.md @@ -119,12 +119,10 @@ echo -e "Plants create energy through a process known as" | transformers-cli run ## Notes -In order to run optimized Mamba implementations, you first need to install `mamba-ssm` and `causal-conv1d`: -```bash -pip install mamba-ssm causal-conv1d>=1.2.0 ``` -You can run the model not using the optimized Mamba kernels, but it is **not** recommended as it will result in significantly lower latencies. In order to do that, you'll need to specify `use_mamba_kernels=False` when loading the model. It is also recommended to read this [article](https://huggingface.co/docs/accelerate/usage_guides/big_modeling) on how to perform inference +- It is not recommended to use Mamba without the optimized Mamba kernels as it results in significantly lower latencies. If you still want to use Mamba without the kernels, then set `use_mamba_kernels=False` in [`~AutoModel.from_pretrained`]. +- Don't quantize the Mamba blocks to prevent model performance degradation. for big models. From 5084d1fb6f1a49f11f10e9723eadc53f7d8284d8 Mon Sep 17 00:00:00 2001 From: Parag Ekbote Date: Wed, 2 Apr 2025 14:20:26 +0000 Subject: [PATCH 04/10] update model page. 
--- docs/source/en/model_doc/jamba.md | 34 ++++++++----------------------- 1 file changed, 9 insertions(+), 25 deletions(-) diff --git a/docs/source/en/model_doc/jamba.md b/docs/source/en/model_doc/jamba.md index 727ccd3d29b2..f9b7b1d43a70 100644 --- a/docs/source/en/model_doc/jamba.md +++ b/docs/source/en/model_doc/jamba.md @@ -42,31 +42,15 @@ The example below demonstrates how to generate text with [`Pipeline`], [`AutoMod # install optimized Mamba implementations # !pip install mamba-ssm causal-conv1d>=1.2.0 import torch -from transformers import AutoModelForCausalLM, AutoTokenizer - -model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-Mini-1.6", - torch_dtype=torch.bfloat16, - attn_implementation="flash_attention_2", - device_map="auto") - -tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-Mini-1.6") - -messages = [ - {"role": "system", "content": "You are an ancient oracle who speaks in cryptic but wise phrases, always hinting at deeper meanings."}, - {"role": "user", "content": "Hello!"}, -] - -input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to("cuda") - -outputs = model.generate(input_ids, max_new_tokens=216) - -# Decode the output -conversation = tokenizer.decode(outputs[0], skip_special_tokens=True) - -# Split the conversation to get only the assistant's response -assistant_response = conversation.split(messages[-1]['content'])[1].strip() -print(assistant_response) -# Output: Seek and you shall find. The path is winding, but the journey is enlightening. What wisdom do you seek from the ancient echoes? +from transformers import pipeline + +pipeline = pipeline( + task="text-generation", + model="ai21labs/AI21-Jamba-Mini-1.6", + torch_dtype=torch.float16, + device=0 +) +pipeline("Plants create energy through a process known as") ``` From ae7585c49f6b8336c8d04892a6b5499b2deb0d90 Mon Sep 17 00:00:00 2001 From: Parag Ekbote Date: Thu, 3 Apr 2025 21:13:22 +0530 Subject: [PATCH 05/10] Apply suggestions from code review Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/jamba.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/source/en/model_doc/jamba.md b/docs/source/en/model_doc/jamba.md index f9b7b1d43a70..99ef7482df12 100644 --- a/docs/source/en/model_doc/jamba.md +++ b/docs/source/en/model_doc/jamba.md @@ -27,7 +27,6 @@ rendered properly in your Markdown viewer. Jamba's architecture features a blocks-and-layers approach that allows Jamba to successfully integrate Transformer and Mamba architectures altogether. Each Jamba block contains either an attention or a Mamba layer, followed by a multi-layer perceptron (MLP), producing an overall ratio of one Transformer layer out of every eight total layers. MoE layers are mixed in to increase model capacity. You can find all the original Jamba checkpoints under the [AI21](https://huggingface.co/ai21labs) organization. -alt="drawing" width="600"/> > [!TIP] > Click on the Jamba models in the right sidebar for more examples of how to apply Jamba to different language tasks. 
@@ -38,7 +37,6 @@ The example below demonstrates how to generate text with [`Pipeline`], [`AutoMod ```py -# !pip install -U flash-attn --no-build-isolation # install optimized Mamba implementations # !pip install mamba-ssm causal-conv1d>=1.2.0 import torch @@ -57,6 +55,7 @@ pipeline("Plants create energy through a process known as") ```py +# !pip install -U flash-attn --no-build-isolation import torch from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer @@ -95,7 +94,7 @@ print(assistant_response) ```bash -echo -e "Plants create energy through a process known as" | transformers-cli run --task text-generation --model ai21labs/AI21-Jamba-Mini-1.6 --torch_dtype auto --attn_implementation flash_attention_2 --device 0 +echo -e "Plants create energy through a process known as" | transformers-cli run --task text-generation --model ai21labs/AI21-Jamba-Mini-1.6 --device 0 ``` From 59ae5b8f4d1930703bded39baee31e48d59f6b33 Mon Sep 17 00:00:00 2001 From: Parag Ekbote Date: Fri, 4 Apr 2025 17:02:29 +0000 Subject: [PATCH 06/10] Update as per code review. --- docs/source/en/model_doc/jamba.md | 54 +++++++++++++++++++++++-------- 1 file changed, 40 insertions(+), 14 deletions(-) diff --git a/docs/source/en/model_doc/jamba.md b/docs/source/en/model_doc/jamba.md index 99ef7482df12..164a08ef48ee 100644 --- a/docs/source/en/model_doc/jamba.md +++ b/docs/source/en/model_doc/jamba.md @@ -14,14 +14,16 @@ rendered properly in your Markdown viewer. --> -# Jamba - -
-PyTorch -FlashAttention -SDPA +
+
+ PyTorch + FlashAttention + SDPA +
+# Jamba + [Jamba](https://huggingface.co/papers/2403.19887) is a hybrid Transformer-Mamba mixture-of-experts (MoE) language model ranging from 52B to 398B total parameters. This model aims to combine the advantages of both model families, the performance of transformer models and the efficiency and longer context (256K tokens) of state space models (SSMs) like Mamba. Jamba's architecture features a blocks-and-layers approach that allows Jamba to successfully integrate Transformer and Mamba architectures altogether. Each Jamba block contains either an attention or a Mamba layer, followed by a multi-layer perceptron (MLP), producing an overall ratio of one Transformer layer out of every eight total layers. MoE layers are mixed in to increase model capacity. @@ -55,10 +57,32 @@ pipeline("Plants create energy through a process known as") ```py -# !pip install -U flash-attn --no-build-isolation import torch -from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer +from transformers import AutoModelForCausalLM, AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained( + "ai21labs/AI21-Jamba-Large-1.6", +) +model = AutoModelForCausalLM.from_pretrained( + "ai21labs/AI21-Jamba-Large-1.6", + torch_dtype=torch.float16, + device_map="auto", + attn_implementation="sdpa" +) +input_ids = tokenizer("Plants create energy through a process known as", return_tensors="pt").to("cuda") + +output = model.generate(**input_ids, cache_implementation="static") +print(tokenizer.decode(output[0], skip_special_tokens=True)) +``` + +We can generate text with the model in quantized form as follows: + +```py +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer + +from transformers import AutoModelForCausalLM, BitsAndBytesConfig quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_skip_modules=["mamba"]) @@ -77,7 +101,7 @@ messages = [ {"role": "user", "content": "Hello!"}, ] -input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to("cuda") +input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to(model.device) outputs = model.generate(input_ids, max_new_tokens=216) @@ -89,7 +113,6 @@ assistant_response = conversation.split(messages[-1]['content'])[1].strip() print(assistant_response) # Output: Seek and you shall find. The path is winding, but the journey is enlightening. What wisdom do you seek from the ancient echoes? ``` - @@ -102,12 +125,15 @@ echo -e "Plants create energy through a process known as" | transformers-cli run ## Notes -``` - -- It is not recommended to use Mamba without the optimized Mamba kernels as it results in significantly lower latencies. If you still want to use Mamba without the kernels, then set `use_mamba_kernels=False` in [`~AutoModel.from_pretrained`]. - Don't quantize the Mamba blocks to prevent model performance degradation. -for big models. +- It is not recommended to use Mamba without the optimized Mamba kernels as it results in significantly lower latencies. 
If you still want to use Mamba without the kernels, then set `use_mamba_kernels=False` in [`~AutoModel.from_pretrained`] as follows: +```py +import torch +from transformers import AutoModelForCausalLM +model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.5-Large", + use_mamba_kernels=False) +``` ## JambaConfig From db2c14afb842382099e69265d1ef79199276e703 Mon Sep 17 00:00:00 2001 From: Parag Ekbote Date: Sat, 5 Apr 2025 20:15:24 +0530 Subject: [PATCH 07/10] Update docs/source/en/model_doc/jamba.md as per code review Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/jamba.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/jamba.md b/docs/source/en/model_doc/jamba.md index 164a08ef48ee..4d1eae1d965e 100644 --- a/docs/source/en/model_doc/jamba.md +++ b/docs/source/en/model_doc/jamba.md @@ -126,7 +126,7 @@ echo -e "Plants create energy through a process known as" | transformers-cli run ## Notes - Don't quantize the Mamba blocks to prevent model performance degradation. -- It is not recommended to use Mamba without the optimized Mamba kernels as it results in significantly lower latencies. If you still want to use Mamba without the kernels, then set `use_mamba_kernels=False` in [`~AutoModel.from_pretrained`] as follows: +- It is not recommended to use Mamba without the optimized Mamba kernels as it results in significantly lower latencies. If you still want to use Mamba without the kernels, then set `use_mamba_kernels=False` in [`~AutoModel.from_pretrained`]. ```py import torch From e74f3e2b49d48740c9c627ca768578c65262e642 Mon Sep 17 00:00:00 2001 From: Parag Ekbote Date: Sat, 5 Apr 2025 20:52:15 +0530 Subject: [PATCH 08/10] Update docs/source/en/model_doc/jamba.md as per code review ` Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/model_doc/jamba.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/source/en/model_doc/jamba.md b/docs/source/en/model_doc/jamba.md index 4d1eae1d965e..5b9f01e0ab7c 100644 --- a/docs/source/en/model_doc/jamba.md +++ b/docs/source/en/model_doc/jamba.md @@ -76,7 +76,9 @@ print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` -We can generate text with the model in quantized form as follows: +Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends. + +The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to 8-bits. ```py import torch From d66719258285b61572eeeb5e85169264a548d1cc Mon Sep 17 00:00:00 2001 From: Parag Ekbote Date: Sat, 5 Apr 2025 16:00:38 +0000 Subject: [PATCH 09/10] update as per code review. 
--- docs/source/en/model_doc/jamba.md | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/docs/source/en/model_doc/jamba.md b/docs/source/en/model_doc/jamba.md index 5b9f01e0ab7c..991c14f03484 100644 --- a/docs/source/en/model_doc/jamba.md +++ b/docs/source/en/model_doc/jamba.md @@ -75,6 +75,14 @@ output = model.generate(**input_ids, cache_implementation="static") print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` + + +```bash +echo -e "Plants create energy through a process known as" | transformers-cli run --task text-generation --model ai21labs/AI21-Jamba-Mini-1.6 --device 0 +``` + + + Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends. @@ -115,15 +123,6 @@ assistant_response = conversation.split(messages[-1]['content'])[1].strip() print(assistant_response) # Output: Seek and you shall find. The path is winding, but the journey is enlightening. What wisdom do you seek from the ancient echoes? ``` - - - -```bash -echo -e "Plants create energy through a process known as" | transformers-cli run --task text-generation --model ai21labs/AI21-Jamba-Mini-1.6 --device 0 -``` - - - ## Notes From 77d3c9e593d68b46619a3f4135e0ba26c46d25aa Mon Sep 17 00:00:00 2001 From: Steven Liu <59462357+stevhliu@users.noreply.github.com> Date: Mon, 7 Apr 2025 10:24:00 -0700 Subject: [PATCH 10/10] fixes --- docs/source/en/model_doc/jamba.md | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/docs/source/en/model_doc/jamba.md b/docs/source/en/model_doc/jamba.md index 991c14f03484..8c2c147c0c7e 100644 --- a/docs/source/en/model_doc/jamba.md +++ b/docs/source/en/model_doc/jamba.md @@ -90,9 +90,8 @@ The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quan ```py import torch -from transformers import AutoModelForCausalLM, AutoTokenizer +from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig -from transformers import AutoModelForCausalLM, BitsAndBytesConfig quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_skip_modules=["mamba"]) @@ -129,12 +128,12 @@ print(assistant_response) - Don't quantize the Mamba blocks to prevent model performance degradation. - It is not recommended to use Mamba without the optimized Mamba kernels as it results in significantly lower latencies. If you still want to use Mamba without the kernels, then set `use_mamba_kernels=False` in [`~AutoModel.from_pretrained`]. -```py -import torch -from transformers import AutoModelForCausalLM -model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.5-Large", - use_mamba_kernels=False) -``` + ```py + import torch + from transformers import AutoModelForCausalLM + model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.5-Large", + use_mamba_kernels=False) + ``` ## JambaConfig
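A quick smoke-test sketch for working with the config class directly: the tiny hyperparameter values below are illustrative only (they do not correspond to any released checkpoint), and the field names assume the current `JambaConfig` signature:

```py
from transformers import JambaConfig, JambaForCausalLM

# Deliberately tiny, illustrative values, not a released checkpoint.
config = JambaConfig(
    vocab_size=1024,
    hidden_size=128,
    intermediate_size=256,
    num_hidden_layers=8,
    num_attention_heads=4,
    num_key_value_heads=2,
    attn_layer_period=8,   # one attention layer out of every eight
    attn_layer_offset=4,
)
model = JambaForCausalLM(config)  # randomly initialized weights
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```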