[PyTorch] Build custom ORT ops before running ONNX export tests #1252
timmoon10 merged 4 commits into NVIDIA:main
Conversation
Signed-off-by: Tim Moon <tmoon@nvidia.com>
/te-ci pytorch
Matches internal impl of TE kernels. Signed-off-by: Tim Moon <tmoon@nvidia.com>
/te-ci pytorch
I'm not sure what changed when I bumped the ONNX Runtime version, but I've experienced test failures for some FP8 operations (GeLU, LayerNorm, RMSNorm). Our test tolerances basically require bit-perfect FP8 casting. However, this is not reasonable, since the exported ONNX ops may do intermediate compute in FP16 while the TE kernels do all intermediate compute in FP32. I've modified the ONNX export infrastructure to match TE and do intermediate compute in FP32. In the future we should consider loosening the numerical tolerances to handle the numerical error expected with FP8.
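As a minimal sketch of why the intermediate precision matters: computing an activation with FP16 intermediates almost always lands on slightly different values than computing it with FP32 intermediates, so a tolerance that demands bit-perfect agreement after the final downcast can fail. The tanh-approximation GeLU below is only illustrative; the exact formulation used by the TE kernels and the exported graph may differ, and the final cast back to FP32 here stands in for the cast to FP8.

```python
import numpy as np

def gelu_tanh(x, dtype):
    # tanh-approximation of GeLU with all intermediate math in `dtype`,
    # then a final cast back to FP32 (a stand-in for the FP8 cast)
    x = x.astype(dtype)
    c0 = dtype(0.7978845608)  # sqrt(2/pi)
    c1 = dtype(0.044715)
    half = dtype(0.5)
    one = dtype(1.0)
    y = half * x * (one + np.tanh(c0 * (x + c1 * x * x * x)))
    return y.astype(np.float32)

rng = np.random.default_rng(0)
x = rng.standard_normal(1024).astype(np.float32)

out_fp32 = gelu_tanh(x, np.float32)  # TE-kernel style: FP32 intermediates
out_fp16 = gelu_tanh(x, np.float16)  # FP16 intermediates, as a naive export might do

frac = float(np.mean(out_fp32 != out_fp16))
print(f"fraction of outputs that differ: {frac:.3f}")
```

Most elements differ in the low mantissa bits; whenever one of those values sits near an FP8 rounding boundary, the two paths round to different FP8 codes and a bit-exact comparison fails.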
@timmoon10 I haven't looked but there are some differences between the
Perhaps, but I don't think any of those changes should have affected numerics. In any case, the changes in this PR make the ONNX export more correct, so I don't think there's much risk to merging if the tests pass.
…IA#1252)

* Build custom ORT ops before running ONNX tests
* Remove ONNX from context parallelism tests
* Export ONNX ops that do compute in FP32 (matches internal impl of TE kernels)
* Add build script for custom ORT ops

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Description
We have experienced failures in the ONNX export tests when running with Python 3.12 because PyPI does not have an available distribution of ONNX Runtime 1.13.1. Instead of manually rebuilding the custom ONNX Runtime ops at libcustom_ort_fp8_qdq_ops.so, I figure it's a good time to add logic to build the ops automatically before running the tests.

Related: #41
Pinging @nzmora-nvidia and @asfiyab-nvidia.
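A rough sketch of the build-then-register flow the tests need, under stated assumptions: the function and script names below (`ensure_custom_ops_built`, `build_custom_ort_ops.sh`) are hypothetical stand-ins for whatever this PR actually adds; only the shared-library name comes from the description, and `SessionOptions.register_custom_ops_library` is the standard ONNX Runtime API for loading custom-op libraries.

```python
import os
import subprocess

def ensure_custom_ops_built(so_path="libcustom_ort_fp8_qdq_ops.so",
                            build_script="build_custom_ort_ops.sh"):
    """Build the custom ORT ops if the shared library is not already present.

    `build_script` is a hypothetical name for the build script added in
    this PR. Returns the library path if present (or freshly built),
    else None.
    """
    if os.path.exists(so_path):
        return so_path
    if os.path.exists(build_script):
        subprocess.run(["bash", build_script], check=True)
        return so_path if os.path.exists(so_path) else None
    return None

# Usage in a test session (requires onnxruntime to be installed):
#   import onnxruntime as ort
#   so = ort.SessionOptions()
#   so.register_custom_ops_library(ensure_custom_ops_built())
#   sess = ort.InferenceSession("model_with_fp8_qdq.onnx", sess_options=so)
```

Building the library on demand avoids pinning the tests to a prebuilt binary that only matches one ONNX Runtime version.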
Type of change
Changes
Checklist: