Debug CI tests on Ada #397

Merged
timmoon10 merged 27 commits into NVIDIA:main from timmoon10:ada-ci-debug
Oct 12, 2023

@timmoon10
Collaborator

This applies the changes in #393 to the PyTorch and Paddle tests. In particular, it only runs tests involving cuDNN fused attention on compute capabilities 8.0 and 9.0.
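A minimal sketch of how such a gate could look in a Python test suite. The helper name `should_run_fused_attn_test` and the constant are hypothetical and are not taken from this PR; only the two compute capabilities come from the description above.

```python
# Hypothetical helper: decide whether cuDNN fused attention tests should run,
# based on the (major, minor) compute capability of the current GPU.
SUPPORTED_FUSED_ATTN_CAPABILITIES = {(8, 0), (9, 0)}


def should_run_fused_attn_test(capability):
    """Return True only for the compute capabilities named in the PR description."""
    return tuple(capability) in SUPPORTED_FUSED_ATTN_CAPABILITIES


# Example wiring with pytest and PyTorch (left as a comment so the sketch
# runs without a GPU or either library installed):
#
#   import pytest, torch
#   capability = torch.cuda.get_device_capability()  # e.g. (8, 9) on an L40
#   @pytest.mark.skipif(
#       not should_run_fused_attn_test(capability),
#       reason="cuDNN fused attention tests only run on sm80 and sm90",
#   )
#   def test_fused_attention(): ...
```

With this gate, an Ada GPU such as the L40 (compute capability 8.9) would skip these tests rather than fail them.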

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10 timmoon10 added the bug Something isn't working label Aug 23, 2023
@timmoon10 timmoon10 requested review from cyanguwa and ksivaman August 23, 2023 21:12
@timmoon10
Collaborator Author

Pipeline 9489089

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Avoid split-k kernels on Ada.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10
Collaborator Author

Running on an L40, I found that the JAX FP8 GEMM tests on integer matrices were failing. It seems cuBLAS chooses a split-k kernel that prevents us from getting bit-wise correct results, although it is still within the expected FP8 error. I've changed the matrix dims to help cuBLAS pick a nicer kernel.
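The thread does not say what cuBLAS does internally, so the following is only an illustration of why a change in accumulation order can alter results even for integer inputs. It assumes a deliberately narrow accumulator (FP16, emulated with the standard `struct` module), which is not how the real GEMM accumulates. All function names are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def fp16_sum(values):
    """Accumulate left to right, rounding to FP16 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + v)
    return acc


def split_k_sum(values, num_splits):
    """Sum equal-sized chunks separately, then reduce the partial sums.

    Assumes len(values) is divisible by num_splits.
    """
    chunk = len(values) // num_splits
    partials = [
        fp16_sum(values[i * chunk:(i + 1) * chunk]) for i in range(num_splits)
    ]
    return fp16_sum(partials)


# Integer-valued inputs whose exact sum is 2051.
values = [2048.0, 1.0, 1.0, 1.0]
sequential = fp16_sum(values)   # each +1 is rounded away above 2048
split = split_k_sum(values, 2)  # the second chunk sums to 2 before the reduce
print(sequential, split)
```

The two orderings disagree with each other (and with the exact sum), which is the kind of effect that makes bit-wise comparisons fragile when the kernel choice changes.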

Pipeline 9504876.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10
Collaborator Author

Pipeline 9617488 is green.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10
Collaborator Author

I've tweaked the PyTorch and JAX fused attention tests so that they check whether there's a supported backend (namely F16_arbitrary_seqlen on Ada). These pass when I run them manually on an L40, and I've launched pipeline 9938409.
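A rough sketch of the idea, assuming the test can ask the library which backend it would pick. The function `fused_attn_skip_reason` and the `query_backend` callable are hypothetical stand-ins, not the actual Transformer Engine API.

```python
from typing import Callable, Optional


def fused_attn_skip_reason(
    query_backend: Callable[[], Optional[str]],
) -> Optional[str]:
    """Return a skip message if no fused attention backend is available.

    `query_backend` is a hypothetical callable that returns the name of the
    selected backend (for example "F16_arbitrary_seqlen") or None.
    """
    backend = query_backend()
    if backend is None:
        return "no supported fused attention backend on this GPU"
    return None


# A test would skip itself when a reason is returned, e.g. with pytest:
#
#   reason = fused_attn_skip_reason(my_backend_query)
#   if reason is not None:
#       pytest.skip(reason)
```

Compared with hard-coding compute capabilities, this keeps the test in step with whatever the backend selection logic actually supports.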

#403 adds some PyTorch attention tests and #411 adds backend detection logic to Paddle. We should hold off on merging until those are in.

@timmoon10 timmoon10 requested a review from cyanguwa October 4, 2023 00:19
@timmoon10
Collaborator Author

timmoon10 commented Oct 4, 2023

This PR is now good to go, pending pipeline 10094388 pipeline 70748350.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Collaborator

@cyanguwa cyanguwa left a comment

Just that one comment, otherwise looks good!

Review suggestion from @cyanguwa

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10 timmoon10 changed the title Debug PyTorch and Paddle tests on Ada Debug CI tests on Ada Oct 11, 2023
@timmoon10
Collaborator Author

Tests passed in pipeline 10211932.

Member

@ksivaman ksivaman left a comment

Looks good

@timmoon10 timmoon10 merged commit 4ae3476 into NVIDIA:main Oct 12, 2023
@timmoon10 timmoon10 deleted the ada-ci-debug branch October 12, 2023 19:30