From e48e5d9742add7468ad0cc1a73b98ba48654ed8c Mon Sep 17 00:00:00 2001
From: Arpon Kapuria
Date: Wed, 23 Jul 2025 18:01:54 +0600
Subject: [PATCH 1/2] Update model card for Cohere2 (Command R7B)

---
 docs/source/en/model_doc/cohere2.md | 81 +++++++++++++++++++++--------
 1 file changed, 60 insertions(+), 21 deletions(-)

diff --git a/docs/source/en/model_doc/cohere2.md b/docs/source/en/model_doc/cohere2.md
index 24f649666395..3e6ec98df65d 100644
--- a/docs/source/en/model_doc/cohere2.md
+++ b/docs/source/en/model_doc/cohere2.md
@@ -1,45 +1,84 @@
-# Cohere
-
-<div class="flex flex-wrap space-x-1">
-<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
-<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
-<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
-<img alt="Tensor parallelism" src="https://img.shields.io/badge/Tensor%20parallelism-06b6d4?style=flat&logoColor=white">
+<div style="float: right;">
+    <div class="flex flex-wrap space-x-1">
+        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+        <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
+        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
+        <img alt="Tensor parallelism" src="https://img.shields.io/badge/Tensor%20parallelism-06b6d4?style=flat&logoColor=white">
+    </div>
-## Overview -[C4AI Command R7B](https://cohere.com/blog/command-r7b) is an open weights research release of a 7B billion parameter model developed by Cohere and Cohere For AI. It has advanced capabilities optimized for various use cases, including reasoning, summarization, question answering, and code. The model is trained to perform sophisticated tasks including Retrieval Augmented Generation (RAG) and tool use. The model also has powerful agentic capabilities that can use and combine multiple tools over multiple steps to accomplish more difficult tasks. It obtains top performance on enterprise-relevant code use cases. C4AI Command R7B is a multilingual model trained on 23 languages. -The model features three layers with sliding window attention (window size 4096) and ROPE for efficient local context modeling and relative positional encoding. A fourth layer uses global attention without positional embeddings, enabling unrestricted token interactions across the entire sequence. +# Cohere2 + +[Cohere Command R7B](https://cohere.com/blog/command-r7b) is an open weights research release of a 7B billion parameter model developed by Cohere and Cohere For AI. It has advanced capabilities optimized for various use cases, including RAG, tool use, agentic capabilities and tasks requiring complex reasoning and multiple steps,. C4AI Command R7B is a multilingual model trained on 23 languages and has a context window of 128k. + +You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection. + + +> [!TIP] +> Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks. + +The example below demonstrates how to generate text with [`Pipeline`] or the [`AutoModel`] class. 
+
+<hfoptions id="usage">
+<hfoption id="Pipeline">
+
+```python
+import torch
+from transformers import pipeline
+
+pipeline = pipeline(
+    task="text-generation",
+    model="CohereLabs/c4ai-command-r7b-12-2024",
+    torch_dtype=torch.float16,
+    device_map=0
+)
-The model has been trained on 23 languages: English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Arabic, Chinese, Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian.
+messages = [
+    {"role": "user", "content": "Who are you?"},
+]
+pipeline(messages)
+```
-## Usage tips
-The model and tokenizer can be loaded via:
+
+</hfoption>
+<hfoption id="AutoModel">
+
 ```python
-# pip install transformers
+import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
 
-model_id = "CohereForAI/c4ai-command-r7b-12-2024"
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(model_id)
+tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r7b-12-2024")
+model = AutoModelForCausalLM.from_pretrained(
+    "CohereForAI/c4ai-command-r7b-12-2024", torch_dtype=torch.float16,
+    device_map="auto"
+)
 
 # Format message with the command-r chat template
 messages = [{"role": "user", "content": "Hello, how are you?"}]
-input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
+input_ids = tokenizer.apply_chat_template(
+    messages,
+    tokenize=True,
+    add_generation_prompt=True,
+    return_tensors="pt"
+)
 
-gen_tokens = model.generate(
+output = model.generate(
     input_ids,
     max_new_tokens=100,
     do_sample=True,
     temperature=0.3,
 )
 
-gen_text = tokenizer.decode(gen_tokens[0])
-print(gen_text)
+print(tokenizer.decode(output[0],skip_special_tokens=True))
 ```
+
+</hfoption>
+</hfoptions>
+
+## Notes
+- For quantized version of Cohere R7B, you can refer to this [collection](https://huggingface.co/models?other=base_model:quantized:CohereLabs/c4ai-command-r7b-12-2024).
+
 
 ## Cohere2Config
 
 [[autodoc]] Cohere2Config

From 843dafd056ca5a9ef75cfe82ee0bec8030feefd5 Mon Sep 17 00:00:00 2001
From: Arpon Kapuria
Date: Wed, 30 Jul 2025 04:22:24 +0600
Subject: [PATCH 2/2] fix: applied suggested changes

---
 docs/source/en/model_doc/cohere2.md | 69 +++++++++++++++++++++--------
 1 file changed, 51 insertions(+), 18 deletions(-)

diff --git a/docs/source/en/model_doc/cohere2.md b/docs/source/en/model_doc/cohere2.md
index 3e6ec98df65d..a4836e7790cf 100644
--- a/docs/source/en/model_doc/cohere2.md
+++ b/docs/source/en/model_doc/cohere2.md
@@ -10,7 +10,9 @@
 
 # Cohere2
 
-[Cohere Command R7B](https://cohere.com/blog/command-r7b) is an open weights research release of a 7B billion parameter model developed by Cohere and Cohere For AI. It has advanced capabilities optimized for various use cases, including RAG, tool use, agentic capabilities and tasks requiring complex reasoning and multiple steps,. C4AI Command R7B is a multilingual model trained on 23 languages and has a context window of 128k.
+[Cohere Command R7B](https://cohere.com/blog/command-r7b) is an open weights research release of a 7-billion parameter model. It is a multilingual model trained on 23 languages and has a context window of 128k. The model features three layers with sliding window attention and RoPE for efficient local context modeling and relative positional encoding. A fourth layer uses global attention without positional embeddings, enabling unrestricted token interactions across the entire sequence.
+
+This model is optimized for speed, cost-performance, and efficient use of compute resources.
 
 You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection.
 
@@ -18,7 +20,7 @@ You can find all the original Command-R checkpoints under the [Command Models](https://huggingface.co/collections/CohereForAI/command-models-67652b401665205e17b192ad) collection.
 
 > [!TIP]
 > Click on the Cohere models in the right sidebar for more examples of how to apply Cohere to different language tasks.
 
-The example below demonstrates how to generate text with [`Pipeline`] or the [`AutoModel`] class.
+The example below demonstrates how to generate text with [`Pipeline`] or the [`AutoModel`] class, and from the command line.
 
@@ -35,7 +37,7 @@ pipeline = pipeline(
 )
 
 messages = [
-    {"role": "user", "content": "Who are you?"},
+    {"role": "user", "content": "Hello, can you please help me book a hotel in Japan?"},
 ]
 pipeline(messages)
 ```
@@ -47,37 +49,68 @@ pipeline(messages)
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
 
-tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r7b-12-2024")
+tokenizer = AutoTokenizer.from_pretrained("CohereLabs/c4ai-command-r7b-12-2024")
 model = AutoModelForCausalLM.from_pretrained(
-    "CohereForAI/c4ai-command-r7b-12-2024", torch_dtype=torch.float16,
-    device_map="auto"
-)
-
-# Format message with the command-r chat template
-messages = [{"role": "user", "content": "Hello, how are you?"}]
-input_ids = tokenizer.apply_chat_template(
-    messages,
-    tokenize=True,
-    add_generation_prompt=True,
-    return_tensors="pt"
+    "CohereLabs/c4ai-command-r7b-12-2024",
+    torch_dtype=torch.float16,
+    device_map="auto",
+    attn_implementation="sdpa"
 )
+
+# format message with the Command-R chat template
+messages = [{"role": "user", "content": "Hello, can you please help me book a hotel in Japan?"}]
+input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
 output = model.generate(
     input_ids,
     max_new_tokens=100,
     do_sample=True,
     temperature=0.3,
+    cache_implementation="static",
 )
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
+
+</hfoption>
+<hfoption id="transformers CLI">
 
-print(tokenizer.decode(output[0],skip_special_tokens=True))
+```bash
+# pip install -U flash-attn --no-build-isolation
+transformers-cli chat CohereLabs/c4ai-command-r7b-12-2024 --torch_dtype auto --attn_implementation flash_attention_2
 ```
 
+</hfoption>
+</hfoptions>
+
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview.md) overview for more available quantization backends.
+
+The example below uses [bitsandbytes](../quantization/bitsandbytes.md) to quantize the weights to 4-bits.
+
+```python
+import torch
+from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM
+
+bnb_config = BitsAndBytesConfig(load_in_4bit=True)
+tokenizer = AutoTokenizer.from_pretrained("CohereLabs/c4ai-command-r7b-12-2024")
+model = AutoModelForCausalLM.from_pretrained(
+    "CohereLabs/c4ai-command-r7b-12-2024",
+    torch_dtype=torch.float16,
+    device_map="auto",
+    quantization_config=bnb_config,
+    attn_implementation="sdpa"
+)
 
-## Notes
-- For quantized version of Cohere R7B, you can refer to this [collection](https://huggingface.co/models?other=base_model:quantized:CohereLabs/c4ai-command-r7b-12-2024).
+# format message with the Command-R chat template
+messages = [{"role": "user", "content": "Hello, can you please help me book a hotel in Japan?"}]
+input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
+output = model.generate(
+    input_ids,
+    max_new_tokens=100,
+    do_sample=True,
+    temperature=0.3,
+    cache_implementation="static",
+)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
 
 ## Cohere2Config
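A series like the two commits above is normally exchanged with `git format-patch` and applied with `git am`. The sketch below demonstrates that round trip on a throwaway repository; every path, file name, and identity in it is invented for the demonstration and is not taken from this PR.

```shell
# Round trip of a mailbox-patch workflow: a contributor exports a commit as a
# patch file, and a maintainer checks and applies it with authorship intact.
# All repos, files, and identities below are made up for the demo.
set -e
work=$(mktemp -d)

# "Upstream" repository holding a one-line model card.
git init -q "$work/upstream"
printf '# Cohere\n' > "$work/upstream/cohere2.md"
git -C "$work/upstream" add cohere2.md
git -C "$work/upstream" -c user.name=maintainer -c user.email=m@example.com \
    commit -q -m "add model card"

# Contributor: clone, edit the card, export the commit with format-patch.
git clone -q "$work/upstream" "$work/contrib"
printf '# Cohere2\n' > "$work/contrib/cohere2.md"
git -C "$work/contrib" -c user.name=contributor -c user.email=c@example.com \
    commit -aq -m "Update model card for Cohere2"
git -C "$work/contrib" format-patch -1 -o "$work/patches" >/dev/null

# Maintainer: dry-run the patch, then apply it preserving the author field.
git -C "$work/upstream" apply --check "$work/patches"/0001-*.patch
git -C "$work/upstream" -c user.name=maintainer -c user.email=m@example.com \
    am -q "$work/patches"/0001-*.patch
cat "$work/upstream/cohere2.md"
```

`git apply --check` validates the diff without touching the tree, while `git am` replays the mailbox as a real commit, which is why the `From:` and `Subject:` headers of the patches above survive as author and commit message.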