From 1421921da0a6b083c17c9fe85b5b5f8beebd7216 Mon Sep 17 00:00:00 2001
From: Aamir Nazir
Date: Fri, 12 Sep 2025 13:05:24 +0400
Subject: [PATCH] Update README.md with quantization paragraph

---
 examples/openvino/llama/README.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/examples/openvino/llama/README.md b/examples/openvino/llama/README.md
index d357f038781..7a97e27410c 100644
--- a/examples/openvino/llama/README.md
+++ b/examples/openvino/llama/README.md
@@ -24,6 +24,24 @@ python -m executorch.extension.llm.export.export_llm \
 +base.tokenizer_path="${LLAMA_TOKENIZER:?}"
 ```
 
+### Compress Model Weights and Export
+The OpenVINO backend also supports quantizing Llama models during export. The available modes are INT4 group-wise and per-channel weight compression and INT8 per-channel weight compression, enabled with the `+pt2e_quantize="openvino_4wo"` option. The group size can be adjusted with `group_size`; the default of 128 is chosen for optimal performance on the NPU.
+
+```
+LLAMA_CHECKPOINT=/consolidated.00.pth
+LLAMA_PARAMS=/params.json
+LLAMA_TOKENIZER=/tokenizer.model
+
+python -m executorch.extension.llm.export.export_llm \
+  --config llama3_2_ov_4wo.yaml \
+  +backend.openvino.device="CPU" \
+  +base.model_class="llama3_2" \
+  +pt2e_quantize="openvino_4wo" \
+  +base.checkpoint="${LLAMA_CHECKPOINT:?}" \
+  +base.params="${LLAMA_PARAMS:?}" \
+  +base.tokenizer_path="${LLAMA_TOKENIZER:?}"
+```
+
 ## Build OpenVINO C++ Runtime with Llama Runner:
 First, build the backend libraries by executing the script below in `/backends/openvino/scripts` folder:
 ```bash