[OpenVINO] NNCF Data-Aware Compression Algorithms Support for OVQuantizer #16002
Merged: mergennachin merged 25 commits into pytorch:main from anzr299:an/openvino/nncf_compress_pt2e on Feb 25, 2026
Changes from all commits (25 commits):
1. ece4a0b: Extend quantizer to support compress_pt2e (anzr299)
2. 9cc0991: integrate compress_pt2e into the example (anzr299)
3. b7bac57: Merge branch 'main' into an/openvino/nncf_compress_pt2e (anzr299)
4. 6c0d766: remove extra directories (anzr299)
5. fcd40bb: Merge branch 'an/openvino/nncf_compress_pt2e' of https://github.com/a… (anzr299)
6. f9c782b: review changes (anzr299)
7. 24f684f: lint (anzr299)
8. 0963b73: add unit test (anzr299)
9. 792caf2: add some corner case checks in llm compression (anzr299)
10. dc3b219: clean unused imports (anzr299)
11. 0d3d681: lint (anzr299)
12. 12efc70: review changes (anzr299)
13. 1236dfc: comprae reference scale values in tests (anzr299)
14. 019b2cc: remove dead code (anzr299)
15. 659a834: Merge branch 'main' into an/openvino/nncf_compress_pt2e (anzr299)
16. 562261f: lint fixes (anzr299)
17. ecd5b8a: extend test for error (anzr299)
18. d72466d: lint (anzr299)
19. 6e349c3: Merge branch 'main' into an/openvino/nncf_compress_pt2e (anzr299)
20. 83f0fb8: remove leading space in error message (anzr299)
21. 42fc491: Merge branch 'pytorch:main' into an/openvino/nncf_compress_pt2e (anzr299)
22. ba68d56: Merge branch 'pytorch:main' into an/openvino/nncf_compress_pt2e (anzr299)
23. b1b2fb2: Merge branch 'pytorch:main' into an/openvino/nncf_compress_pt2e (anzr299)
24. 0093592: update nncf version to 3.0.0 (anzr299)
25. 0c82495: Merge branch 'main' into an/openvino/nncf_compress_pt2e (suryasidd)
Quantizer package `__init__.py`:

```diff
@@ -1,3 +1,9 @@
+from .llm_compression import apply_nncf_data_aware_compression
 from .quantizer import OpenVINOQuantizer, QuantizationMode, quantize_model

-__all__ = ["OpenVINOQuantizer", "quantize_model", "QuantizationMode"]
+__all__ = [
+    "OpenVINOQuantizer",
+    "quantize_model",
+    "QuantizationMode",
+    "apply_nncf_data_aware_compression",
+]
```
New file: `llm_compression.py` (+133 lines):

```python
# Copyright (c) Intel Corporation
#
# Licensed under the BSD License (the "License"); you may not use this file
# except in compliance with the License. See the license file found in the
# LICENSE file in the root directory of this source tree.

# mypy: disable-error-code=import-not-found

from typing import Tuple

import torch
from executorch.extension.llm.export.builder import LLMEdgeManager
from torchao.quantization.pt2e.quantizer import Quantizer

try:
    import nncf  # type: ignore[import-untyped]
    from pytorch_tokenizers import get_tokenizer  # type: ignore[import-untyped]
except ImportError:
    raise ImportError("Please install nncf via backends/openvino/requirements.txt")


# This code is adapted from https://github.com/pytorch/executorch/blob/0c54fd0483314da173f8e14d63d2ed9591c7133a/extension/llm/export/builder.py#L278
def get_calibration_data(
    module: torch.fx.GraphModule, tokenizer, prompts: str, max_len: int
):
    """
    Obtains calibration data from a prompt so that the algorithm is calibrated
    not only with the dataset but also with the tokens the model itself generates.
    Currently, this method is only tested with Llama models.
    """
    # TODO: change criteria & support batch inputs if necessary
    pos = 0
    token_list = tokenizer.encode(prompts, bos=True, eos=False)

    with torch.no_grad():
        while token_list[-1] != tokenizer.eos_id and pos < max_len:
            logits = module(
                torch.full((1, 1), token_list[pos]),
                {"input_pos": torch.tensor((pos,))},
            )
            pos += 1
            if pos >= len(token_list):
                token_list.append(torch.argmax(logits[:], dim=-1).item())

    token_list = [
        (
            torch.tensor(pos, dtype=torch.int64),
            token,
        )
        for pos, token in enumerate(token_list)
    ]
    return token_list


def transform_fn(token_pos_map: Tuple[int, int]):
    """
    Transforms and returns an input from the dataset so that it is acceptable
    by the model. Currently, this method is only tested with Llama models.

    :param token_pos_map: A pair containing the position and its token ID.
    """
    inputs = (
        torch.tensor([[token_pos_map[1]]]),
        {"input_pos": torch.tensor([token_pos_map[0]])},
    )

    return inputs


def apply_nncf_data_aware_compression(
    builder_exported: LLMEdgeManager,
    quantizer: Quantizer,
    awq: bool,
    scale_estimation: bool,
) -> LLMEdgeManager:
    """
    Applies NNCF data-aware weight compression to the exported LLM graph.
    Uses the builder's tokenizer and calibration prompt to generate token-level
    calibration data, then runs `nncf.experimental.torch.fx.compress_pt2e` with
    the given quantizer and optional AWQ / scale estimation enabled.

    :param builder_exported: LLMEdgeManager containing the FX graph, tokenizer path,
        calibration prompt, and max sequence length.
    :param quantizer: TorchAO quantizer to use for compression.
    :param awq: If True, enables Activation-aware Weight Quantization (AWQ).
    :param scale_estimation: If True, enables NNCF's scale estimation algorithm.
    :return: The updated LLMEdgeManager with the compressed torch FX model.
    """
    nncf_calibration_data = None
    if (
        builder_exported.calibration_seq_length is not None
        and builder_exported.calibration_data is not None
        and builder_exported.tokenizer_path is not None
        and (awq or scale_estimation)
    ):
        tokenizer = get_tokenizer(builder_exported.tokenizer_path)
        nncf_calibration_data = nncf.Dataset(
            get_calibration_data(
                builder_exported.pre_autograd_graph_module,  # type: ignore[arg-type]
                tokenizer,
                builder_exported.calibration_data,
                builder_exported.calibration_seq_length,
            ),
            transform_func=transform_fn,
        )

    # AWQ can work without a dataset as well.
    if scale_estimation and not nncf_calibration_data:
        missing_params = []
        if builder_exported.calibration_data is None:
            missing_params.append("calibration_data")
        if builder_exported.calibration_seq_length is None:
            missing_params.append("calibration_seq_length")
        if builder_exported.tokenizer_path is None:
            missing_params.append("tokenizer_path")
        if missing_params:
            msg = (
                "Missing required calibration parameter(s): "
                + ", ".join(missing_params)
                + ". Please provide calibration_data, calibration_seq_length, and tokenizer_path."
            )
            raise ValueError(msg)

    builder_exported.pre_autograd_graph_module = (
        nncf.experimental.torch.fx.compress_pt2e(
            builder_exported.pre_autograd_graph_module,
            quantizer=quantizer,
            dataset=nncf_calibration_data,
            awq=awq,
            scale_estimation=scale_estimation,
        )
    )

    return builder_exported
```
File: backends/openvino/requirements.txt

```diff
@@ -1 +1 @@
-git+https://github.com/openvinotoolkit/nncf@3d753ac#egg=nncf
+nncf==3.0.0
```
New file: backends/openvino/tests/quantizer/synthetic_test_models.py (22 additions, 0 deletions):

```python
import torch


class ExportLlamaTestModel(torch.nn.Module):
    def __init__(self, vocab_size=5, hidden_size=2, num_layers=1):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, hidden_size)
        self.layers = torch.nn.ModuleList(
            [torch.nn.Linear(hidden_size, hidden_size) for _ in range(num_layers)]
        )
        self.lm_head = torch.nn.Linear(hidden_size, vocab_size)
        self.vocab_size = vocab_size

    def forward(self, tokens, input_pos):
        x = self.embed(tokens)

        for layer in self.layers:
            x = torch.relu(layer(x))

        logits = self.lm_head(x)

        return logits
```
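A quick smoke run of the synthetic model, mirroring the one-token-per-step shape used by the calibration loop (the class is restated here so the snippet is self-contained; the `(1, 1)` input and expected `(1, 1, 5)` logits shape follow from the layer definitions above):

```python
import torch


class ExportLlamaTestModel(torch.nn.Module):
    def __init__(self, vocab_size=5, hidden_size=2, num_layers=1):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, hidden_size)
        self.layers = torch.nn.ModuleList(
            [torch.nn.Linear(hidden_size, hidden_size) for _ in range(num_layers)]
        )
        self.lm_head = torch.nn.Linear(hidden_size, vocab_size)
        self.vocab_size = vocab_size

    def forward(self, tokens, input_pos):
        x = self.embed(tokens)
        for layer in self.layers:
            x = torch.relu(layer(x))
        return self.lm_head(x)


model = ExportLlamaTestModel(vocab_size=5, hidden_size=2, num_layers=1)
model.eval()
with torch.no_grad():
    # A single token id in a (1, 1) tensor plus its position, as in the
    # calibration loop; input_pos is unused by this synthetic model.
    logits = model(torch.full((1, 1), 3), torch.tensor((0,)))
print(logits.shape)  # torch.Size([1, 1, 5])
```

The output has one logit per vocabulary entry, so `torch.argmax(logits, dim=-1)` yields the next token id exactly as `get_calibration_data` expects.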
Review discussion:
- Reviewer: Can you add a basic unit test that calls this flow?
- Reply: Done