18 changes: 18 additions & 0 deletions examples/openvino/llama/README.md
@@ -24,6 +24,24 @@ python -m executorch.extension.llm.export.export_llm \
+base.tokenizer_path="${LLAMA_TOKENIZER:?}"
```

### Compress Model Weights and Export
The OpenVINO backend also offers quantization support for Llama models at export time. The available quantization modes are INT4 group-wise and per-channel weight compression, and INT8 per-channel weight compression. INT4 compression is enabled with the `--pt2e_quantize openvino_4wo` flag, and the group size can be changed with `--group_size`. The default group size of 128 achieves optimal performance with the NPU.

```bash
LLAMA_CHECKPOINT=<path/to/model/folder>/consolidated.00.pth
LLAMA_PARAMS=<path/to/model/folder>/params.json
LLAMA_TOKENIZER=<path/to/model/folder>/tokenizer.model

python -m executorch.extension.llm.export.export_llm \
--config llama3_2_ov_4wo.yaml \
+backend.openvino.device="CPU" \
+base.model_class="llama3_2" \
+pt2e_quantize="openvino_4wo" \
+base.checkpoint="${LLAMA_CHECKPOINT:?}" \
+base.params="${LLAMA_PARAMS:?}" \
+base.tokenizer_path="${LLAMA_TOKENIZER:?}"
```
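
Conceptually, the INT4 group-wise mode splits each weight row into groups of `group_size` values and stores one scale per group, so an outlier only degrades precision within its own group. The NumPy sketch below illustrates the idea only; it is not the backend's actual implementation:

```python
import numpy as np

def quantize_groupwise_int4(weights: np.ndarray, group_size: int = 128):
    """Illustrative symmetric INT4 group-wise quantization.

    Each row is split into groups of `group_size` values; every group
    gets its own scale, mapping its max magnitude to 7 (the positive
    end of the signed INT4 range [-8, 7]).
    """
    rows, cols = weights.shape
    assert cols % group_size == 0, "cols must be divisible by group_size"
    groups = weights.reshape(rows, cols // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero groups
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q.reshape(rows, cols), scales

def dequantize(q: np.ndarray, scales: np.ndarray, group_size: int = 128):
    """Reconstruct float weights from INT4 codes and per-group scales."""
    rows, cols = q.shape
    groups = q.reshape(rows, cols // group_size, group_size).astype(np.float32)
    return (groups * scales).reshape(rows, cols)
```

A smaller group size means more scales (higher memory overhead) but tighter per-group ranges and lower rounding error, which is the trade-off the `--group_size` flag controls.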

## Build OpenVINO C++ Runtime with Llama Runner:
First, build the backend libraries by executing the script below in `<executorch_root>/backends/openvino/scripts` folder:
```bash