From 1421921da0a6b083c17c9fe85b5b5f8beebd7216 Mon Sep 17 00:00:00 2001
From: Aamir Nazir
Date: Fri, 12 Sep 2025 13:05:24 +0400
Subject: [PATCH] Update README.md with quantization paragraph

---
 examples/openvino/llama/README.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/examples/openvino/llama/README.md b/examples/openvino/llama/README.md
index d357f038781..7a97e27410c 100644
--- a/examples/openvino/llama/README.md
+++ b/examples/openvino/llama/README.md
@@ -24,6 +24,24 @@ python -m executorch.extension.llm.export.export_llm \
 +base.tokenizer_path="${LLAMA_TOKENIZER:?}"
 ```
 
+### Compress Model Weights and Export
+The OpenVINO backend also supports quantizing Llama models during export. The available modes are INT4 group-wise and per-channel weight compression and INT8 per-channel weight compression, enabled with the `+pt2e_quantize="openvino_4wo"` option. The group size can be adjusted with `group_size`; the default of 128 is chosen for optimal performance on the NPU.
+
+```
+LLAMA_CHECKPOINT=/consolidated.00.pth
+LLAMA_PARAMS=/params.json
+LLAMA_TOKENIZER=/tokenizer.model
+
+python -m executorch.extension.llm.export.export_llm \
+  --config llama3_2_ov_4wo.yaml \
+  +backend.openvino.device="CPU" \
+  +base.model_class="llama3_2" \
+  +pt2e_quantize="openvino_4wo" \
+  +base.checkpoint="${LLAMA_CHECKPOINT:?}" \
+  +base.params="${LLAMA_PARAMS:?}" \
+  +base.tokenizer_path="${LLAMA_TOKENIZER:?}"
+```
+
 ## Build OpenVINO C++ Runtime with Llama Runner:
 First, build the backend libraries by executing the script below in `/backends/openvino/scripts` folder:
 ```bash