AMMO Integration with Llama2 Post-Training Quantization Example and Tests #8444
ericharper merged 33 commits into main
Conversation
```python
"Once upon a time, in the middle of a dense forest, there was a small house, where lived a pretty little girl "
"named Little Red Riding Hood.",
```
Check warning (Code scanning / CodeQL): Implicit string concatenation in a list
```python
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore "
"magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea "
"commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat "
"nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit "
"anim id est laborum...",
```
Check warning (Code scanning / CodeQL): Implicit string concatenation in a list
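The CodeQL warning flagged here concerns adjacent string literals inside a list. A minimal illustration of why the pattern is risky, and one explicit alternative (this example is illustrative and not taken from the PR):

```python
# A missing comma between adjacent string literals silently merges two
# intended list elements into one; this is what CodeQL warns about.
items_buggy = [
    "alpha",
    "beta"  # missing comma: "beta" and "gamma" fuse into one element
    "gamma",
]
assert items_buggy == ["alpha", "betagamma"]

# When the concatenation IS intended (a long text split across lines),
# an explicit "+" makes the intent unambiguous and silences the warning:
prompt = (
    "Once upon a time, in the middle of a dense forest, there was a small house, "
    + "where lived a pretty little girl named Little Red Riding Hood."
)
assert prompt.endswith("Little Red Riding Hood.")
```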
Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
reinstall.sh (outdated)
```diff
  python -m build --no-isolation --wheel
  DIST_FILE=$(find ./dist -name "*.whl" | head -n 1)
- ${PIP} install "${DIST_FILE}[all]"
+ ${PIP} install --extra-index-url https://pypi.nvidia.com "${DIST_FILE}[all]"
```
Review comment: Don't add the --extra-index-url here; install AMMO separately.
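Following the reviewer's suggestion, the NVIDIA index would be used only for a separate AMMO install step rather than for the main wheel install. A sketch of what that split might look like; the nvidia-ammo package name is an assumption, not confirmed by this PR:

```shell
# Build and install the NeMo wheel without touching the NVIDIA index.
python -m build --no-isolation --wheel
DIST_FILE=$(find ./dist -name "*.whl" | head -n 1)
${PIP} install "${DIST_FILE}[all]"

# Install AMMO as a separate, optional step from the NVIDIA index.
# The package name below is an assumption for illustration.
${PIP} install --extra-index-url https://pypi.nvidia.com nvidia-ammo
```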
```python
    return checkpoint_dir


def save_artifacts(model, output_dir: str, use_abspath: bool = False) -> None:
```
Review comment: Do we need this, or can we use the existing implementation, i.e. the save/restore connector?
Review comment: Should be able to use register_artifact for this.
Reply: I took time to revisit this. This helper just copies artifacts from a source NeMo model (tar or directory) to a folder with the quantized weights. I would prefer using the helper for two main reasons:
- Most importantly, the quantized model is actually a directory produced by the AMMO export step here: https://github.com/NVIDIA/NeMo/blob/b56ff60381b80d0add4456297dab0fb52b30cf1e/nemo/export/quantize/quantizer.py#L184-L191 as opposed to a NeMo model offering the register_artifact method.
- Artifacts saved with register_artifact are prefixed with an MD5 hash. On the other hand, utilities in the NeMo Inference container typically assume hardcoded "plain" paths like tokenizer.model instead of 449ae6fd76d84842bf152e4ae4701764_tokenizer.model (for example). So I would need to perform an extra operation to remove this prefix somewhere.

Defining the save_artifacts helper gives me the flexibility I need. Are you OK with this?
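As a rough illustration of what such a helper does, copying artifacts from a model given as either a tarball or an extracted directory: this sketch is not the actual NeMo implementation, and the function name and behavior are assumptions.

```python
import os
import shutil
import tarfile


def copy_model_artifacts(model_path: str, output_dir: str) -> None:
    """Copy auxiliary artifacts (e.g. tokenizer files) from a source model,
    given as a tarball or an extracted directory, into output_dir alongside
    the quantized weights. Illustrative sketch only."""
    os.makedirs(output_dir, exist_ok=True)
    if os.path.isdir(model_path):
        # Directory case: copy top-level files as-is, keeping plain names.
        for name in os.listdir(model_path):
            src = os.path.join(model_path, name)
            if os.path.isfile(src):
                shutil.copy(src, os.path.join(output_dir, name))
    else:
        # Tarball case: extract members directly into the output directory.
        with tarfile.open(model_path, "r:*") as tar:
            tar.extractall(path=output_dir)
```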
nemo/export/quantize/quantizer.py (outdated)
```python
import tarfile
from typing import List, Optional

import ammo.torch.quantization as atq
```
nemo/export/quantize/quantizer.py (outdated)
```python
import ammo.torch.quantization as atq
import torch.distributed as dist
from ammo.torch.export import export_model_config
```
nemo/export/quantize/quantizer.py (outdated)
```
1. Loading a Nemo model from disk using appropriate parallelism strategy
2. Calibrating the model to obtain appropriate algorithm-specific scaling factors
3. Producing .qnemo tarball with model config (JSON), quantized weights (safetensors)
```
Review comment: We use extracted .nemo files for LLMs, i.e. just directories, so the idea of a .qnemo tarball probably doesn't make sense.
Reply: We could enable producing either a directory or a tarball depending on user choice via model_save.endswith(".qnemo"). The ".qnemo" extension was an initial suggestion for what to pass to a NeMo Inference container. I agree that directories are more convenient to work with.
Reply: Both options are enabled now via 3a7f07e
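The directory-or-tarball choice described above can be sketched as follows. This is an illustrative stand-in, not the actual quantizer code; save_quantized is a hypothetical name and the archiving details are assumptions.

```python
import os
import tarfile


def save_quantized(export_dir: str, model_save: str) -> None:
    """Package an export directory either as a .qnemo tarball or leave it as
    a plain directory, based on the target path's extension.
    Illustrative sketch; the real logic lives in nemo/export/quantize."""
    if model_save.endswith(".qnemo"):
        # Tarball requested: archive the directory contents at the root,
        # so consumers see plain paths like config.json after extraction.
        with tarfile.open(model_save, "w:gz") as tar:
            tar.add(export_dir, arcname="./")
    elif model_save != export_dir:
        # Directory requested: just move the export directory into place.
        os.rename(export_dir, model_save)
```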
nemo/export/quantize/quantizer.py (outdated)
```
3. Producing .qnemo tarball with model config (JSON), quantized weights (safetensors)
   and tokenizer config (yaml).

The .qnemo file produced is intended to be consumed by the TensorRT-LLM toolbox for inference.
```
Review comment: We use extracted .nemo files for LLMs, i.e. just directories, so the idea of a .qnemo tarball probably doesn't make sense.
Reply: Both are enabled; addressed in #8444 (comment) above.
jenkins
jenkins
jenkins
```python
def get_calib_dataloader(data="cnn_dailymail", batch_size=64, calib_size=512, max_sequence_length=512):
    if data == "pileval":
        dataset = load_dataset("json", data_files="https://the-eye.eu/public/AI/pile/val.jsonl.zst", split="train")
```
Review comment: This link doesn't work. This one should be okay: https://huggingface.co/datasets/monology/pile-uncopyrighted
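Independent of which dataset source is used, the calibration loader above boils down to truncating and batching a fixed number of text samples. A dataset-agnostic sketch; the function name and exact behavior are illustrative, not the PR's actual code:

```python
def get_calib_batches(texts, batch_size=64, calib_size=512, max_sequence_length=512):
    """Yield batches of truncated text samples for PTQ calibration.
    Generic sketch: any iterable of strings stands in for the real dataset."""
    # Take only calib_size samples and cap each at max_sequence_length chars.
    samples = [t[:max_sequence_length] for t in texts[:calib_size]]
    # Slice the samples into fixed-size batches for the calibration loop.
    for i in range(0, len(samples), batch_size):
        yield samples[i : i + batch_size]
```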
ericharper left a comment:
LGTM. Thanks! Please send a follow-up PR with documentation.
AMMO Integration with Llama2 Post-Training Quantization Example and Tests (NVIDIA-NeMo#8444)
* AMMO integration with Llama2 PTQ example and tests
* Jenkins megatron_llama_quantization.py test setup
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* License headers
* Add AMMO to requirements_nlp.txt with --extra-index-url for pip install
* Bump AMMO version to latest
* Guards workaround on spec definition
* Save artifacts and tokenizer config at once
* Extend nemo.utils package with new tools
* Reorganize & reformat
* Tests for FP8 and INT4 AWQ
* Add load_config helper function
* Unused import removal
* Fix FP8 Jenkins test
* Fix TP=2 test cont'd: no need to use mpirun
* Allow for patches in AMMO versioning
* Drop AWQ test for now (need to debug)
* Allow for patches in AMMO versioning cont'd
* Use AMMO spec from MCore as it has been published
* Make AMMO optional dependency and properly import guard it
* Add Llama2 AWQ test and update some paths
* Enable specifying quantization.algorithm=null for baseline accuracy checks
* Enable exporting qnemo tarball or just to a directory
* Drop AWQ testing for now
* Test case for export.inference_tensor_parallel=2
* Flag to export TRT-LLM config.json

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
What does this PR do?
This PR integrates the AMMO library into the project and provides utilities for quantizing models, with a Llama2 PTQ example. Several quantization algorithms are available, including INT8 SmoothQuant, INT4 AWQ, and FP8.
The main class Quantizer from the nemo.export.quantize submodule produces a directory or a .qnemo tarball to be consumed by the TensorRT-LLM toolbox for efficient inference. This will be a part of the NeMo Framework Inference Container.
Collection: [NLP]
Changelog
- New nemo.export.quantize submodule for quantizing models
- New tests.setup module to facilitate Jenkins setup
Usage
Example for the INT8 SmoothQuant method:
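The concrete example command was not preserved in this excerpt. A plausible sketch of such an invocation, assuming Hydra-style overrides: the script name megatron_llama_quantization.py and the option names quantization.algorithm, model_save, and export.inference_tensor_parallel appear elsewhere in this PR, while the file names, the model_file option name, and the int8_sq value are assumptions for illustration only.

```shell
# Hypothetical invocation of the Llama2 PTQ example (names are assumptions).
python examples/nlp/language_modeling/megatron_llama_quantization.py \
    model_file=llama2-7b.nemo \
    quantization.algorithm=int8_sq \
    export.inference_tensor_parallel=1 \
    model_save=llama2-7b-int8-sq.qnemo
```

Passing a model_save path without the .qnemo extension would produce a plain directory instead of a tarball.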
Jenkins CI
To run Jenkins, a NeMo User with write access must comment jenkins on the PR.
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines contain specific people who can review PRs to various areas.
Additional Information
For a more transparent and easier review process, some components were isolated into individual MRs: