From 5f2bfd38f30f4192e07dd5637d20e744be0595bc Mon Sep 17 00:00:00 2001
From: Jihad
Date: Thu, 17 Apr 2025 16:13:53 +0200
Subject: [PATCH 01/13] Added documentation for phi model

---
 docs/source/en/model_doc/phi.md | 190 ++++++++------------------------
 1 file changed, 46 insertions(+), 144 deletions(-)

diff --git a/docs/source/en/model_doc/phi.md b/docs/source/en/model_doc/phi.md
index 097d7fdd39ee..f5f8c1e1c249 100644
--- a/docs/source/en/model_doc/phi.md
+++ b/docs/source/en/model_doc/phi.md
@@ -13,186 +13,88 @@ specific language governing permissions and limitations under the License.
 rendered properly in your Markdown viewer.
 
 -->
-
-# Phi
-
-<div class="flex flex-wrap space-x-1">
-<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
-<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%AD%90%EF%B8%8F-FlashAttention-eae0c8?style=flat">
-<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
+<div style="float: right;">
+    <div class="flex flex-wrap space-x-1">
+        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+        <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%AD%90%EF%B8%8F-FlashAttention-eae0c8?style=flat">
+        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
+    </div>
 </div>
 
-## Overview
-
-The Phi-1 model was proposed in [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li.
-
-The Phi-1.5 model was proposed in [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee.
-
-### Summary
-
-In Phi-1 and Phi-1.5 papers, the authors showed how important the quality of the data is in training relative to the model size.
-They selected high quality "textbook" data alongside with synthetically generated data for training their small sized Transformer
-based model Phi-1 with 1.3B parameters. Despite this small scale, phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP.
-They follow the same strategy for Phi-1.5 and created another 1.3B parameter model with performance on natural language tasks comparable
-to models 5x larger, and surpassing most non-frontier LLMs. Phi-1.5 exhibits many of the traits of much larger LLMs such as the ability
-to “think step by step” or perform some rudimentary in-context learning.
-With these two experiments the authors successfully showed the huge impact of quality of training data when training machine learning models.
-
-The abstract from the Phi-1 paper is the following:
-
-*We introduce phi-1, a new large language model for code, with significantly smaller size than
-competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on
-8 A100s, using a selection of “textbook quality” data from the web (6B tokens) and synthetically
-generated textbooks and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains
-pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. It also displays surprising emergent
-properties compared to phi-1-base, our model before our finetuning stage on a dataset of coding
-exercises, and phi-1-small, a smaller model with 350M parameters trained with the same pipeline as
-phi-1 that still achieves 45% on HumanEval.*
-
-The abstract from the Phi-1.5 paper is the following:
-
-*We continue the investigation into the power of smaller Transformer-based language models as
-initiated by TinyStories – a 10 million parameter model that can produce coherent English – and
-the follow-up work on phi-1, a 1.3 billion parameter model with Python coding performance close
-to the state-of-the-art. The latter work proposed to use existing Large Language Models (LLMs) to
-generate “textbook quality” data as a way to enhance the learning process compared to traditional
-web data. We follow the “Textbooks Are All You Need” approach, focusing this time on common
-sense reasoning in natural language, and create a new 1.3 billion parameter model named phi-1.5,
-with performance on natural language tasks comparable to models 5x larger, and surpassing most
-non-frontier LLMs on more complex reasoning tasks such as grade-school mathematics and basic
-coding. More generally, phi-1.5 exhibits many of the traits of much larger LLMs, both good –such
-as the ability to “think step by step” or perform some rudimentary in-context learning– and bad,
-including hallucinations and the potential for toxic and biased generations –encouragingly though, we
-are seeing improvement on that front thanks to the absence of web data. We open-source phi-1.5 to
-promote further research on these urgent topics.*
-
-This model was contributed by [Susnato Dhar](https://huggingface.co/susnato).
-
-The original code for Phi-1, Phi-1.5 and Phi-2 can be found [here](https://huggingface.co/microsoft/phi-1), [here](https://huggingface.co/microsoft/phi-1_5) and [here](https://huggingface.co/microsoft/phi-2), respectively.
-
-## Usage tips
-
-- This model is quite similar to `Llama` with the main difference in [`PhiDecoderLayer`], where they used [`PhiAttention`] and [`PhiMLP`] layers in parallel configuration.
-- The tokenizer used for this model is identical to the [`CodeGenTokenizer`].
-
-## How to use Phi-2
-
-<Tip warning={true}>
-
+# Phi
 
-Phi-2 has been integrated in the development version (4.37.0.dev) of `transformers`. Until the official version is released through `pip`, ensure that you are doing one of the following:
+[Phi](https://huggingface.co/papers/2306.11644) (phi-1) is a lightweight transformer model developed by Microsoft, optimized for Python code generation. With just 1.3 billion parameters, phi-1 demonstrates strong performance on standard coding benchmarks like HumanEval and MBPP, thanks to its focus on high-quality training data rather than scale. What makes phi-1 unique is its training strategy: instead of using large volumes of generic web data, it was trained on a carefully curated dataset of "textbook-quality" code examples and exercises, including synthetic Python problems generated by GPT-3.5. This data-centric approach allows phi-1 to compete with much larger models while remaining efficient and accessible for research and deployment.
 
-* When loading the model, ensure that `trust_remote_code=True` is passed as an argument of the `from_pretrained()` function.
+You can find all the original Phi checkpoints under the [Phi](https://huggingface.co/collections/microsoft/phi-1-6626e29134744e94e222d572) collection.
 
-* Update your local `transformers` to the development version: `pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers`. The previous command is an alternative to cloning and installing from the source.
+> [!TIP]
+> Click on the Phi models in the right sidebar for more examples of how to apply Phi to different language tasks.
 
-</Tip>
+The example below demonstrates how to generate and classify text with [`Pipeline`], [`AutoModel`] and from the command line.
 
-```python
->>> from transformers import AutoModelForCausalLM, AutoTokenizer
+<hfoptions id="usage">
+<hfoption id="Pipeline">
 
->>> model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
->>> tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
+```py
+import torch
+from transformers import pipeline
 
->>> inputs = tokenizer('Can you help me write a formal email to a potential business partner proposing a joint venture?', return_tensors="pt", return_attention_mask=False)
+pipe = pipeline(
+task="text-generation",
+model="microsoft/phi-1.5"
+)
+message = "Why Lebanon is a special country?"
+output = pipe(message)
 
->>> outputs = model.generate(**inputs, max_length=30)
->>> text = tokenizer.batch_decode(outputs)[0]
->>> print(text)
-Can you help me write a formal email to a potential business partner proposing a joint venture?
-Input: Company A: ABC Inc.
-Company B
 ```
-### Example :
 
-```python
->>> from transformers import PhiForCausalLM, AutoTokenizer
+</hfoption>
+<hfoption id="AutoModel">
 
->>> # define the model and tokenizer.
->>> model = PhiForCausalLM.from_pretrained("microsoft/phi-1_5")
->>> tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5")
+```py
+from transformers import AutoTokenizer, PhiForCausalLM
 
->>> # feel free to change the prompt to your liking.
->>> prompt = "If I were an AI that had just achieved"
+model = PhiForCausalLM.from_pretrained("microsoft/phi-1")
+tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1")
 
->>> # apply the tokenizer.
->>> tokens = tokenizer(prompt, return_tensors="pt")
+prompt = "Hey, are you conscious? Can you talk to me?"
+inputs = tokenizer(prompt, return_tensors="pt")
+generate_ids = model.generate(inputs.input_ids, max_length=30)
 
->>> # use the model to generate new tokens.
->>> generated_output = model.generate(**tokens, use_cache=True, max_new_tokens=10)
-
->>> tokenizer.batch_decode(generated_output)[0]
-'If I were an AI that had just achieved a breakthrough in machine learning, I would be thrilled'
+tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
 ```
 
-## Combining Phi and Flash Attention 2
-
-First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.
+</hfoption>
+<hfoption id="transformers CLI">
 
 ```bash
-pip install -U flash-attn --no-build-isolation
-```
-
-Make also sure that you have a hardware that is compatible with Flash-Attention 2. Read more about it in the official documentation of flash-attn repository. Make also sure to load your model in half-precision (e.g. `torch.float16``)
-
-To load and run a model using Flash Attention 2, refer to the snippet below:
-
-```python
->>> import torch
->>> from transformers import PhiForCausalLM, AutoTokenizer
-
->>> # define the model and tokenizer and push the model and tokens to the GPU.
->>> model = PhiForCausalLM.from_pretrained("microsoft/phi-1_5", torch_dtype=torch.float16, attn_implementation="flash_attention_2").to("cuda")  # doctest: +SKIP
->>> tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5")
-
->>> # feel free to change the prompt to your liking.
->>> prompt = "If I were an AI that had just achieved"
-
->>> # apply the tokenizer.
->>> tokens = tokenizer(prompt, return_tensors="pt").to("cuda")
-
->>> # use the model to generate new tokens.
->>> generated_output = model.generate(**tokens, use_cache=True, max_new_tokens=10)  # doctest: +SKIP
-
->>> tokenizer.batch_decode(generated_output)[0]  # doctest: +SKIP
-'If I were an AI that had just achieved a breakthrough in machine learning, I would be thrilled'
+echo -e "The weather is so nice here" | transformers-cli run --task text-classification --model microsoft/phi-1.5 --device 0
 ```
 
-### Expected speedups
-
-Below is an expected speedup diagram that compares pure inference time between the native implementation in transformers using `microsoft/phi-1` checkpoint and the Flash Attention 2 version of the model using a sequence length of 2048.
-
-<div style="text-align: center">
-<img src="https://huggingface.co/datasets/ybelkada/documentation-images/resolve/main/phi_1_speedup_plot.jpg">
-</div>
+</hfoption>
+</hfoptions>
 
+## Notes
 
-## PhiConfig
+## Notes
 
-[[autodoc]] PhiConfig
+- This model is quite similar to `Llama` with the main difference in `PhiDecoderLayer`, where they used `PhiAttention` and `PhiMLP` layers in parallel configuration.
 
-<frameworkcontent>
-<pt>
+- The tokenizer used for this model is identical to the [CodeGenTokenizer](https://huggingface.co/docs/transformers/v4.51.3/en/model_doc/codegen#transformers.CodeGenTokenizer).
 
 ## PhiModel
 
-[[autodoc]] PhiModel
-    - forward
+[[autodoc]] PhiModel - forward
 
 ## PhiForCausalLM
 
-[[autodoc]] PhiForCausalLM
-    - forward
-    - generate
+[[autodoc]] PhiForCausalLM - forward - generate
 
 ## PhiForSequenceClassification
 
-[[autodoc]] PhiForSequenceClassification
-    - forward
+[[autodoc]] PhiForSequenceClassification - forward
 
 ## PhiForTokenClassification
 
-[[autodoc]] PhiForTokenClassification
-    - forward
-
-</pt>
-</frameworkcontent>
+[[autodoc]] PhiForTokenClassification - forward

From 24a09eff099e59a17a6e7b35bea56c8c148d30f1 Mon Sep 17 00:00:00 2001
From: JihadHammoud02 <94748033+JihadHammoud02@users.noreply.github.com>
Date: Thu, 17 Apr 2025 16:16:09 +0200
Subject: [PATCH 02/13] Update phi.md

---
 docs/source/en/model_doc/phi.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/docs/source/en/model_doc/phi.md b/docs/source/en/model_doc/phi.md
index f5f8c1e1c249..15601e83fe36 100644
--- a/docs/source/en/model_doc/phi.md
+++ b/docs/source/en/model_doc/phi.md
@@ -23,7 +23,7 @@ rendered properly in your Markdown viewer.
 
 # Phi
 
-[Phi](https://huggingface.co/papers/2306.11644) (phi-1) is a lightweight transformer model developed by Microsoft, optimized for Python code generation. With just 1.3 billion parameters, phi-1 demonstrates strong performance on standard coding benchmarks like HumanEval and MBPP, thanks to its focus on high-quality training data rather than scale. What makes phi-1 unique is its training strategy: instead of using large volumes of generic web data, it was trained on a carefully curated dataset of "textbook-quality" code examples and exercises, including synthetic Python problems generated by GPT-3.5. This data-centric approach allows phi-1 to compete with much larger models while remaining efficient and accessible for research and deployment. 
+[Phi](https://huggingface.co/papers/2306.11644) (phi-1) is a lightweight transformer model developed by Microsoft, optimized for Python code generation. With just 1.3 billion parameters, phi-1 demonstrates strong performance on standard coding benchmarks like HumanEval and MBPP, thanks to its focus on high-quality training data rather than scale. What makes phi-1 unique is its training strategy: instead of using large volumes of generic web data, it was trained on a carefully curated dataset of "textbook-quality" code examples and exercises, including synthetic Python problems generated by GPT-3.5. This data-centric approach allows phi-1 to compete with much larger models while remaining efficient and accessible for research and deployment.
 
 You can find all the original Phi checkpoints under the [Phi](https://huggingface.co/collections/microsoft/phi-1-6626e29134744e94e222d572) collection.
 
@@ -77,8 +77,6 @@ echo -e "The weather is so nice here" | transformers-cli run --task text-classif
 
 ## Notes
 
-## Notes
-
 - This model is quite similar to `Llama` with the main difference in `PhiDecoderLayer`, where they used `PhiAttention` and `PhiMLP` layers in parallel configuration.
 
 - The tokenizer used for this model is identical to the [CodeGenTokenizer](https://huggingface.co/docs/transformers/v4.51.3/en/model_doc/codegen#transformers.CodeGenTokenizer).

From af299c52781604fe6aebede6a24fa7bae51e0d42 Mon Sep 17 00:00:00 2001
From: JihadHammoud02 <94748033+JihadHammoud02@users.noreply.github.com>
Date: Thu, 17 Apr 2025 16:32:26 +0200
Subject: [PATCH 03/13] Update phi.md

---
 docs/source/en/model_doc/phi.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/en/model_doc/phi.md b/docs/source/en/model_doc/phi.md
index 15601e83fe36..ccb7d6919b8c 100644
--- a/docs/source/en/model_doc/phi.md
+++ b/docs/source/en/model_doc/phi.md
@@ -53,9 +53,9 @@
 
 ```py
-from transformers import AutoTokenizer, PhiForCausalLM
+from transformers import AutoTokenizer, AutoModelForCausalLM
 
-model = PhiForCausalLM.from_pretrained("microsoft/phi-1")
+model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1")
 tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1")
 
 prompt = "Hey, are you conscious? Can you talk to me?"
 inputs = tokenizer(prompt, return_tensors="pt")

From 327de833f0d8a5d58d929abec729edbeac75dd08 Mon Sep 17 00:00:00 2001
From: JihadHammoud02 <94748033+JihadHammoud02@users.noreply.github.com>
Date: Thu, 17 Apr 2025 18:09:39 +0200
Subject: [PATCH 04/13] Update phi.md

---
 docs/source/en/model_doc/phi.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/docs/source/en/model_doc/phi.md b/docs/source/en/model_doc/phi.md
index ccb7d6919b8c..73f5d3c738a5 100644
--- a/docs/source/en/model_doc/phi.md
+++ b/docs/source/en/model_doc/phi.md
@@ -81,6 +81,9 @@ echo -e "The weather is so nice here" | transformers-cli run --task text-classif
 
 - The tokenizer used for this model is identical to the [CodeGenTokenizer](https://huggingface.co/docs/transformers/v4.51.3/en/model_doc/codegen#transformers.CodeGenTokenizer).
 
+ ## PhiConfig
+[[autodoc]] PhiConfig
+
 ## PhiModel
 
 [[autodoc]] PhiModel - forward

From fba21c2244f49ebbe5fc2b14f9890082ad37b265 Mon Sep 17 00:00:00 2001
From: JihadHammoud02 <94748033+JihadHammoud02@users.noreply.github.com>
Date: Fri, 18 Apr 2025 09:48:11 +0200
Subject: [PATCH 05/13] Update docs/source/en/model_doc/phi.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/phi.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/phi.md b/docs/source/en/model_doc/phi.md
index 73f5d3c738a5..39f4fc1f3760 100644
--- a/docs/source/en/model_doc/phi.md
+++ b/docs/source/en/model_doc/phi.md
@@ -23,7 +23,7 @@ rendered properly in your Markdown viewer.
 
 # Phi
 
-[Phi](https://huggingface.co/papers/2306.11644) (phi-1) is a lightweight transformer model developed by Microsoft, optimized for Python code generation. With just 1.3 billion parameters, phi-1 demonstrates strong performance on standard coding benchmarks like HumanEval and MBPP, thanks to its focus on high-quality training data rather than scale. What makes phi-1 unique is its training strategy: instead of using large volumes of generic web data, it was trained on a carefully curated dataset of "textbook-quality" code examples and exercises, including synthetic Python problems generated by GPT-3.5. This data-centric approach allows phi-1 to compete with much larger models while remaining efficient and accessible for research and deployment.
+[Phi](https://huggingface.co/papers/2306.11644) is a 1.3B parameter transformer model optimized for Python code generation. It focuses on "textbook-quality" training data of code examples, exercises and synthetic Python problems rather than scaling the model size or compute.
 
 You can find all the original Phi checkpoints under the [Phi](https://huggingface.co/collections/microsoft/phi-1-6626e29134744e94e222d572) collection.
 

From 6bb71e4fcad9f553104a470700bd29ff6eb4ef22 Mon Sep 17 00:00:00 2001
From: JihadHammoud02 <94748033+JihadHammoud02@users.noreply.github.com>
Date: Fri, 18 Apr 2025 09:48:35 +0200
Subject: [PATCH 06/13] Update docs/source/en/model_doc/phi.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/phi.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/phi.md b/docs/source/en/model_doc/phi.md
index 39f4fc1f3760..5815b4845f3a 100644
--- a/docs/source/en/model_doc/phi.md
+++ b/docs/source/en/model_doc/phi.md
@@ -25,7 +25,7 @@ rendered properly in your Markdown viewer.
 
 [Phi](https://huggingface.co/papers/2306.11644) is a 1.3B parameter transformer model optimized for Python code generation. It focuses on "textbook-quality" training data of code examples, exercises and synthetic Python problems rather than scaling the model size or compute.
 
-You can find all the original Phi checkpoints under the [Phi](https://huggingface.co/collections/microsoft/phi-1-6626e29134744e94e222d572) collection.
+You can find all the original Phi checkpoints under the [Phi-1](https://huggingface.co/collections/microsoft/phi-1-6626e29134744e94e222d572) collection.
 
 > [!TIP]
 > Click on the Phi models in the right sidebar for more examples of how to apply Phi to different language tasks.

From db76c1e4b306835ab898e66f49cf392fb8dbf091 Mon Sep 17 00:00:00 2001
From: JihadHammoud02 <94748033+JihadHammoud02@users.noreply.github.com>
Date: Fri, 18 Apr 2025 09:48:49 +0200
Subject: [PATCH 07/13] Update docs/source/en/model_doc/phi.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/phi.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/phi.md b/docs/source/en/model_doc/phi.md
index 5815b4845f3a..cb9fe9bbd690 100644
--- a/docs/source/en/model_doc/phi.md
+++ b/docs/source/en/model_doc/phi.md
@@ -30,7 +30,7 @@ You can find all the original Phi checkpoints under the [Phi-1](https://huggingf
 > [!TIP]
 > Click on the Phi models in the right sidebar for more examples of how to apply Phi to different language tasks.
 
-The example below demonstrates how to generate and classify text with [`Pipeline`], [`AutoModel`] and from the command line.
+The example below demonstrates how to generate text with [`Pipeline`], [`AutoModel`] and from the command line.
 
 <hfoptions id="usage">
 <hfoption id="Pipeline">
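At this point in the series the Pipeline snippet is still being reworked, so a runnable reference version may help when checking the doc locally. The sketch below is illustrative rather than part of any patch: it assumes the `microsoft/phi-1.5` checkpoint is reachable, and `max_new_tokens=32` is an arbitrary choice to keep the smoke test fast.

```py
# Quick local smoke test for the text-generation pipeline shown in the doc.
# device_map="auto" falls back to CPU automatically if no GPU is available.
import torch
from transformers import pipeline

generator = pipeline(
    task="text-generation",
    model="microsoft/phi-1.5",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

result = generator("def print_prime(n):", max_new_tokens=32)
print(result[0]["generated_text"])
```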
From cfcd09324ca483c0ad549656b60b1e459680ace6 Mon Sep 17 00:00:00 2001
From: JihadHammoud02 <94748033+JihadHammoud02@users.noreply.github.com>
Date: Fri, 18 Apr 2025 09:50:02 +0200
Subject: [PATCH 08/13] Update docs/source/en/model_doc/phi.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/phi.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/phi.md b/docs/source/en/model_doc/phi.md
index cb9fe9bbd690..3c5cc44cbae2 100644
--- a/docs/source/en/model_doc/phi.md
+++ b/docs/source/en/model_doc/phi.md
@@ -69,7 +69,7 @@ tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokeniza
 <hfoption id="transformers CLI">
 
 ```bash
-echo -e "The weather is so nice here" | transformers-cli run --task text-classification --model microsoft/phi-1.5 --device 0
+echo -e "'''def print_prime(n): Print all primes between 1 and n'''" | transformers-cli run --task text-generation --model microsoft/phi-1.5 --device 0
 ```
 
 </hfoption>

From 414f7463844a742a266e09c3654c1ba3d0109892 Mon Sep 17 00:00:00 2001
From: Jihad
Date: Fri, 18 Apr 2025 14:54:18 +0200
Subject: [PATCH 09/13] Updated model card

---
 docs/source/en/model_doc/phi.md | 76 +++++++++++++++++++++++++++------
 1 file changed, 62 insertions(+), 14 deletions(-)

diff --git a/docs/source/en/model_doc/phi.md b/docs/source/en/model_doc/phi.md
index 3c5cc44cbae2..b66984ce4334 100644
--- a/docs/source/en/model_doc/phi.md
+++ b/docs/source/en/model_doc/phi.md
@@ -39,12 +39,8 @@ The example below demonstrates how to generate text with [`Pipeline`], [`AutoMod
 import torch
 from transformers import pipeline
 
-pipe = pipeline(
-task="text-generation",
-model="microsoft/phi-1.5"
-)
-message = "Why Lebanon is a special country?"
-output = pipe(message)
+pipeline = pipeline(task="text-generation", model="microsoft/phi-1.5", device=0, torch_dtype=torch.bfloat16)
+pipeline('''def print_prime(n): """Print all primes between 1 and n"""''')
 ```
 
 </hfoption>
@@ -53,16 +49,19 @@
 <hfoption id="AutoModel">
 
 ```py
+import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
 
-model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1")
 tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1")
+model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1", torch_dtype=torch.float16, device_map="auto", attn_implementation="sdpa")
 
-prompt = "Hey, are you conscious? Can you talk to me?"
-inputs = tokenizer(prompt, return_tensors="pt")
-generate_ids = model.generate(inputs.input_ids, max_length=30)
+input_ids = tokenizer('''def print_prime(n):
+    """
+    Print all primes between 1 and n
+    """''', return_tensors="pt").to("cuda")
 
-tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+output = model.generate(**input_ids, cache_implementation="static")
+print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
 
 </hfoption>
@@ -77,13 +76,62 @@ echo -e "'''def print_prime(n): Print all primes between 1 and n'''" | transfor
 </hfoption>
 </hfoptions>
 
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
+
+The example below uses 4-bit weight-only quantization with NF4 [https://huggingface.co/docs/transformers/en/quantization/bitsandbytes] to quantize only the model weights, reducing memory usage while maintaining performance.
+
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+import torch
+
+model_id = "microsoft/phi-1_5"
+bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True)
+
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    device_map="auto",
+    quantization_config=bnb_config,
+
+)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+inputs = tokenizer('''def print_prime(n):
+    """
+    Print all primes between 1 and n
+    """''', return_tensors="pt").to("cuda")
+
+outputs = model.generate(**inputs)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+
+```
+
 ## Notes
 
-- This model is quite similar to `Llama` with the main difference in `PhiDecoderLayer`, where they used `PhiAttention` and `PhiMLP` layers in parallel configuration.
+- If you're using Transformers < 4.37.0.dev, set `trust_remote_code=True` in [~AutoModel.from_pretrained]. Otherwise, make sure you update Transformers to the latest stable version.
+
+```py
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1")
+model = AutoModelForCausalLM.from_pretrained(
+    "microsoft/phi-1",
+    torch_dtype=torch.float16,
+    device_map="auto",
+    trust_remote_code=True,
+    attn_implementation="sdpa")
+
+input_ids = tokenizer('''def print_prime(n):
+    """
+    Print all primes between 1 and n
+    """''', return_tensors="pt").to("cuda")
+
+output = model.generate(**input_ids, cache_implementation="static")
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
 
-- The tokenizer used for this model is identical to the [CodeGenTokenizer](https://huggingface.co/docs/transformers/v4.51.3/en/model_doc/codegen#transformers.CodeGenTokenizer).
+## PhiConfig
 
- ## PhiConfig
 [[autodoc]] PhiConfig
 
 ## PhiModel

From 53847fc014a4fc12eca27de146c6e574452fd685 Mon Sep 17 00:00:00 2001
From: JihadHammoud02 <94748033+JihadHammoud02@users.noreply.github.com>
Date: Fri, 18 Apr 2025 21:09:29 +0200
Subject: [PATCH 10/13] Update phi.md

---
 docs/source/en/model_doc/phi.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/phi.md b/docs/source/en/model_doc/phi.md
index b66984ce4334..448742200431 100644
--- a/docs/source/en/model_doc/phi.md
+++ b/docs/source/en/model_doc/phi.md
@@ -76,7 +76,7 @@ echo -e "'''def print_prime(n): Print all primes between 1 and n'''" | transfor
 
 Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
 
-The example below uses 4-bit weight-only quantization with NF4 [https://huggingface.co/docs/transformers/en/quantization/bitsandbytes] to quantize only the model weights, reducing memory usage while maintaining performance.
+The example below uses [bitsandbytes](https://huggingface.co/docs/transformers/en/quantization/bitsandbytes) to only quantize the weights to 4-bits.
 
 ```py
 from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
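The memory claim in the revised lead-in is easy to verify during review by comparing a full-precision load against a 4-bit load. The sketch below is a review aid rather than part of the patch; it assumes `bitsandbytes` is installed and a CUDA GPU is present, and relies on `get_memory_footprint`, which reports parameter memory in bytes.

```py
# Compare the parameter memory of an fp16 load against a 4-bit NF4 load.
# Load the models one at a time if GPU memory is tight.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_fp16 = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-1", torch_dtype=torch.float16, device_map="auto"
)
print(f"fp16 footprint: {model_fp16.get_memory_footprint() / 1e9:.2f} GB")

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
model_4bit = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-1", quantization_config=bnb_config, device_map="auto"
)
print(f"4-bit footprint: {model_4bit.get_memory_footprint() / 1e9:.2f} GB")
```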
```py from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig From 955f4a24a280cf1c48a5ea97569ae87aa198e270 Mon Sep 17 00:00:00 2001 From: JihadHammoud02 <94748033+JihadHammoud02@users.noreply.github.com> Date: Sat, 19 Apr 2025 00:25:43 +0200 Subject: [PATCH 11/13] Update phi.md --- docs/source/en/model_doc/phi.md | 33 +++++++++++++++------------------ 1 file changed, 15 insertions(+), 18 deletions(-) diff --git a/docs/source/en/model_doc/phi.md b/docs/source/en/model_doc/phi.md index 448742200431..a3d420ff78e8 100644 --- a/docs/source/en/model_doc/phi.md +++ b/docs/source/en/model_doc/phi.md @@ -79,28 +79,20 @@ Quantization reduces the memory burden of large models by representing the weigh The example below uses [bitsandbytes](https://huggingface.co/docs/transformers/en/quantization/bitsandbytes) to only quantize the weights to 4-bits. ```py -from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig import torch +from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM -model_id = "microsoft/phi-1_5" bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True) +tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1") +model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1", torch_dtype=torch.float16, device_map="auto", attn_implementation="sdpa", quantization_config=bnb_config) -model = AutoModelForCausalLM.from_pretrained( - model_id, - device_map="auto", - quantization_config=bnb_config, - -) -tokenizer = AutoTokenizer.from_pretrained(model_id) - -inputs = tokenizer('''def print_prime(n): +input_ids = tokenizer('''def print_prime(n): """ Print all primes between 1 and n """''', return_tensors="pt").to("cuda") -outputs = model.generate(**inputs) -print(tokenizer.decode(outputs[0], skip_special_tokens=True)) - +output = model.generate(**input_ids, cache_implementation="static") +print(tokenizer.decode(output[0], skip_special_tokens=True)) ``` ## Notes @@ -134,16 +126,21 @@ print(tokenizer.decode(output[0], skip_special_tokens=True)) ## PhiModel -[[autodoc]] PhiModel - forward +[[autodoc]] PhiModel + - forward ## PhiForCausalLM -[[autodoc]] PhiForCausalLM - forward - generate +[[autodoc]] PhiForCausalLM + - forward + - generate ## PhiForSequenceClassification -[[autodoc]] PhiForSequenceClassification - forward +[[autodoc]] PhiForSequenceClassification + - forward ## PhiForTokenClassification -[[autodoc]] PhiForTokenClassification - forward +[[autodoc]] PhiForTokenClassification + - forward From df1b15f36e5895fc3364a1c74882ce42266242d2 Mon Sep 17 00:00:00 2001 From: JihadHammoud02 <94748033+JihadHammoud02@users.noreply.github.com> Date: Sat, 19 Apr 2025 10:42:25 +0200 Subject: [PATCH 12/13] Update phi.md --- docs/source/en/model_doc/phi.md | 40 ++++++++++++++++----------------- 1 file changed, 20 insertions(+), 20 deletions(-) diff --git a/docs/source/en/model_doc/phi.md b/docs/source/en/model_doc/phi.md index a3d420ff78e8..c5c0bdfc9fe5 100644 --- a/docs/source/en/model_doc/phi.md +++ b/docs/source/en/model_doc/phi.md @@ -99,26 +99,26 @@ print(tokenizer.decode(output[0], skip_special_tokens=True)) - If you're using Transformers < 4.37.0.dev, set `trust_remote_code=True` in [~AutoModel.from_pretrained]. Otherwise, make sure you update Transformers to the latest stable version. 
-```py
-import torch
-from transformers import AutoTokenizer, AutoModelForCausalLM
-
-tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1")
-model = AutoModelForCausalLM.from_pretrained(
-    "microsoft/phi-1",
-    torch_dtype=torch.float16,
-    device_map="auto",
-    trust_remote_code=True,
-    attn_implementation="sdpa")
-
-input_ids = tokenizer('''def print_prime(n):
-    """
-    Print all primes between 1 and n
-    """''', return_tensors="pt").to("cuda")
-
-output = model.generate(**input_ids, cache_implementation="static")
-print(tokenizer.decode(output[0], skip_special_tokens=True))
-```
+    ```py
+    import torch
+    from transformers import AutoTokenizer, AutoModelForCausalLM
+
+    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1")
+    model = AutoModelForCausalLM.from_pretrained(
+        "microsoft/phi-1",
+        torch_dtype=torch.float16,
+        device_map="auto",
+        trust_remote_code=True,
+        attn_implementation="sdpa")
+
+    input_ids = tokenizer('''def print_prime(n):
+        """
+        Print all primes between 1 and n
+        """''', return_tensors="pt").to("cuda")
+
+    output = model.generate(**input_ids, cache_implementation="static")
+    print(tokenizer.decode(output[0], skip_special_tokens=True))
+    ```
 
 ## PhiConfig
 

From 593adf33a0e066960c604087f1cab1e882ea76bf Mon Sep 17 00:00:00 2001
From: JihadHammoud02 <94748033+JihadHammoud02@users.noreply.github.com>
Date: Mon, 21 Apr 2025 18:33:57 +0200
Subject: [PATCH 13/13] Update docs/source/en/model_doc/phi.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/phi.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/phi.md b/docs/source/en/model_doc/phi.md
index c5c0bdfc9fe5..37db41bae0ac 100644
--- a/docs/source/en/model_doc/phi.md
+++ b/docs/source/en/model_doc/phi.md
@@ -97,7 +97,7 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
 
 ## Notes
 
-- If you're using Transformers < 4.37.0.dev, set `trust_remote_code=True` in [~AutoModel.from_pretrained]. Otherwise, make sure you update Transformers to the latest stable version.
+- If you're using Transformers < 4.37.0.dev, set `trust_remote_code=True` in [`~AutoModel.from_pretrained`]. Otherwise, make sure you update Transformers to the latest stable version.
 
     ```py
     import torch
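The finished model card documents `PhiForSequenceClassification` and `PhiForTokenClassification`, but no commit in the series adds a usage example for them. A minimal sketch follows for completeness; it is hypothetical in that `microsoft/phi-1` ships no classification head, so the head below is randomly initialized and its predictions are only meaningful after fine-tuning.

```py
import torch
from transformers import AutoTokenizer, PhiForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1")
# num_labels=2 is an arbitrary choice; the classification head is freshly initialized.
model = PhiForSequenceClassification.from_pretrained("microsoft/phi-1", num_labels=2)

# Phi has no dedicated padding token, so reuse EOS when batching inputs.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("def print_prime(n):", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1))
```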