Add ONNX export support for TE modules #41

Merged
ksivaman merged 17 commits into NVIDIA:main from asfiyab-nvidia:dev-onnx-export-support
Jan 18, 2023

@asfiyab-nvidia
Contributor

  • Add TorchScript Operators
  • Add symbolic methods to ONNX exporter
  • Add tests for the ONNX export

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
Signed-off-by: Neta Zmora <nzmora@nvidia.com>

@ptrendx
Member

ptrendx commented Dec 14, 2022

Hi @asfiyab-nvidia, what is that libcustom .so file?

@asfiyab-nvidia
Contributor Author

@ptrendx it contains the ONNX Runtime (ORT) implementations of the FP8 functionality. It is used to test the ONNX export by validating the ORT outputs against the TE outputs (the code is under tests/test_onnx_export.py).
The .so is included in the PR so that there are no dependencies on external sources.
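
A minimal sketch of the validation flow described above, assuming the shared library registers custom FP8 ops with ONNX Runtime. The helper names `make_ort_session` and `outputs_match`, the path arguments, and the tolerance are hypothetical and not taken from the PR. `SessionOptions.register_custom_ops_library` and `InferenceSession` are the standard onnxruntime Python calls.

```python
from typing import Sequence


def make_ort_session(model_path: str, custom_ops_lib: str, providers: Sequence[str]):
    """Create an ORT session with a custom-op shared library registered.

    All arguments are placeholders supplied by the caller (hypothetical helper).
    """
    import onnxruntime as ort  # imported lazily; only needed when a session is built

    opts = ort.SessionOptions()
    opts.register_custom_ops_library(custom_ops_lib)
    return ort.InferenceSession(model_path, sess_options=opts, providers=list(providers))


def outputs_match(te_out: Sequence[float], ort_out: Sequence[float], atol: float = 1e-3) -> bool:
    """Return True when both flattened outputs agree element-wise within atol."""
    if len(te_out) != len(ort_out):
        return False
    return all(abs(a - b) <= atol for a, b in zip(te_out, ort_out))
```

The comparison is kept separate from session creation so the tolerance check can be exercised without a GPU or the shared library present.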

@ptrendx
Member

ptrendx commented Dec 14, 2022

Does that have to be closed source? If so, can we at least move it to the tests directory instead of the top-level one? If it does not have to be closed source, then maybe we can keep the source inside tests and compile it on the fly?

@asfiyab-nvidia asfiyab-nvidia force-pushed the dev-onnx-export-support branch from 693ee53 to 8ed54a8 Compare December 14, 2022 18:37
@asfiyab-nvidia
Contributor Author

Moving the .so to the tests directory seems to be the better approach at the moment. We can potentially include the source code in a follow-up PR.

@asfiyab-nvidia asfiyab-nvidia force-pushed the dev-onnx-export-support branch from b39f87e to 0cf5e16 Compare December 27, 2022 19:22
@asfiyab-nvidia asfiyab-nvidia force-pushed the dev-onnx-export-support branch from 65f4196 to b9b5477 Compare January 4, 2023 21:32
@ptrendx
Member

ptrendx commented Jan 5, 2023

/te-ci

@ptrendx
Member

ptrendx commented Jan 6, 2023

Please fix the tests (see the results for commit 4812408). The biggest problem is that you try to run tests requiring FP8 on non-Hopper GPUs, which triggers the assertion failure. I am working on enabling Hopper GPUs in CI, so we should be able to get the FP8 tests running soon too.

@netaz

netaz commented Jan 8, 2023

@ptrendx is there some code in TE we can leverage to query the SM version, or do you recommend that we install a library (e.g. pynvml)?
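
One possible way to gate the FP8 tests, sketched under the assumption stated in this thread that they require Hopper (compute capability 9.0). It uses PyTorch's `torch.cuda.get_device_capability()` rather than an extra library. The helper names are hypothetical and this is not TE's own mechanism.

```python
from typing import Optional, Tuple

HOPPER_CAPABILITY = (9, 0)  # compute capability (SM 9.0) of Hopper GPUs


def is_hopper_or_newer(capability: Tuple[int, int]) -> bool:
    """Compare a (major, minor) compute capability against Hopper's."""
    return tuple(capability) >= HOPPER_CAPABILITY


def current_capability() -> Optional[Tuple[int, int]]:
    """Return the current device's compute capability, or None without CUDA."""
    try:
        import torch  # imported lazily so the pure check above has no dependencies
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


def fp8_tests_should_run() -> bool:
    """True only when a Hopper-or-newer GPU is visible (assumption from this thread)."""
    cap = current_capability()
    return cap is not None and is_hopper_or_newer(cap)
```

A test could then be decorated with `pytest.mark.skipif(not fp8_tests_should_run(), reason="FP8 requires Hopper")` so that it is skipped on older devices instead of failing.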

@asfiyab-nvidia
Contributor Author

/te-ci

@ptrendx
Member

ptrendx commented Jan 10, 2023

/te-ci

@asfiyab-nvidia
Contributor Author

@ptrendx can you please authorize a pipeline run for the latest commit? It contains fixes for the failures from the last run. Thanks

@ptrendx
Member

ptrendx commented Jan 11, 2023

/te-ci

asfiyab-nvidia and others added 14 commits January 17, 2023 20:18
* Add TorchScript Operators
* Add symbolic methods to ONNX exporter
* Add tests for the ONNX export

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
* Increase layernorm FP16 threshold
* Normalize onnx file names: _ separates configs; - separates words in a single config
* Add get_attn_mask_str and fix mask string
* Add missing ONNX files
* Moved generated ONNX files to tests/gen_onnx_models/
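
The naming rule in the commit message above (`_` separates configs, `-` separates words within a single config) could be sketched as follows. The function name `onnx_file_name` and the example config strings are hypothetical and not taken from the PR.

```python
from typing import Sequence


def onnx_file_name(module: str, configs: Sequence[str]) -> str:
    """Build a file name where '_' separates configs and '-' separates words.

    Underscores and whitespace inside a single part are treated as word breaks.
    """
    parts = [module, *configs]
    normalized = ["-".join(p.replace("_", " ").split()) for p in parts]
    return "_".join(normalized) + ".onnx"


# Example with made-up config names:
print(onnx_file_name("layernorm linear", ["fp8", "with_bias"]))
# prints: layernorm-linear_fp8_with-bias.onnx
```

Keeping the two separators distinct means a file name can be split back into its configs with a plain `split("_")`.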

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
1. remove List import for pylint failure
2. address comments: remove state tensors from GPU
3. address comments: Update reverse_map_dtype function and add to namespace

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
1. skip FP8 tests on non-Hopper devices
2. minor fix for C++ lint check

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
1. update copyrights
2. update path to ORT .so

Signed-off-by: Asfiya Baig <asfiyab@nvidia.com>
@asfiyab-nvidia asfiyab-nvidia force-pushed the dev-onnx-export-support branch from 119a0ec to ab4410f Compare January 17, 2023 20:35
Member

@ksivaman ksivaman left a comment

Initial comments

Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: asfiyab-nvidia <117682710+asfiyab-nvidia@users.noreply.github.com>
@ksivaman
Member

/te-ci

@ksivaman
Member

/te-ci

Member

@ksivaman ksivaman left a comment

LGTM

@ksivaman ksivaman merged commit 6c9ce17 into NVIDIA:main Jan 18, 2023
zhiyu-deep pushed a commit to zhiyu-deep/TransformerEngine that referenced this pull request Sep 3, 2024
[New API] Added support for Reshape operation.
[New API] Added support for DgradDreluBNBwdWeight operation

[Minor Enhancement] Added cudnn frontend enums to simplify Resample operation creation.
[Minor Enhancement] Added alpha and beta values as key for the plan caches.

[Bug Fix] Fixed an error which was causing reference code to fail with segmentation fault.
[Bug Fix] Fixed an issue where stride/padding and dilation values were incorrectly cached for 2d convolutions.
[Bug Fix] Fixed issues where error statuses were not handled correctly during tensor creation.

[Samples] Added a new sample to showcase how an fMHA graph can be programmed through the FE API. This sample contains both fprop and backprop graphs.
[Samples] Added a new sample to showcase the DgradDreluBNBwdWeight operation.

[Samples] Added a modular block which models the fprop of a ResNet residual block.

Co-authored-by: Anerudhan Gopal <agopal@nvidia.com>