
Conversation

@areusch (Contributor) commented May 21, 2021

  • Allows test to pass on ci-gpu with new 18.04 image.

@tkonolige @tqchen

@comaniac (Contributor) commented:

FYI: It's possible that this error is caused by the CUBLAS math flag:

inline void CUBLASTryEnableTensorCore(cublasHandle_t hdl) {

According to the cuBLAS documentation, this flag is being deprecated. We also ran some tests around this flag previously and found that it takes effect even for float32, meaning the cuBLAS kernel internally casts float32 to float16, does the computation, and casts the results back. As a result, this flag may introduce accuracy issues.
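
For context, the flag being discussed is cuBLAS's CUBLAS_TENSOR_OP_MATH math mode. Below is a minimal sketch of how a helper like CUBLASTryEnableTensorCore would enable it (illustrative only, assuming the cuBLAS v2 API; not the exact TVM source):

```cpp
#include <cublas_v2.h>

// Illustrative sketch: ask cuBLAS to use Tensor Cores on this handle.
// With CUBLAS_TENSOR_OP_MATH set, cuBLAS may use Tensor Cores even for
// float32 GEMMs by internally down-converting inputs to float16, which is
// the source of the accuracy concern discussed above.
inline void TryEnableTensorCore(cublasHandle_t hdl) {
  int version = 0;
  if (cublasGetVersion(hdl, &version) != CUBLAS_STATUS_SUCCESS) return;
  // Tensor Core math mode requires cuBLAS 9.0 or later (version >= 9000).
  if (version >= 9000) {
    cublasSetMathMode(hdl, CUBLAS_TENSOR_OP_MATH);
  }
}
```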

@areusch (Contributor, Author) commented May 24, 2021

@comaniac I was sending this through CI for @tkonolige as he was out Friday. I'll let him reply to your comment.

@tkonolige (Contributor) commented:

@comaniac Disabling the flag makes the tests pass. What should we do here? Accept lower accuracy for performance?
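
For reference, "disabling the flag" here would amount to leaving the handle at (or restoring it to) the default math mode; a hedged sketch, not the exact change made in this PR:

```cpp
#include <cublas_v2.h>

// Illustrative sketch: restore the default math mode so float32 GEMMs are
// computed in full float32 precision rather than via FP16 Tensor Cores.
inline void UseDefaultMathMode(cublasHandle_t hdl) {
  cublasSetMathMode(hdl, CUBLAS_DEFAULT_MATH);
}
```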

@comaniac (Contributor) commented:

> @comaniac Disabling the flag makes the tests pass. What should we do here? Accept lower accuracy for performance?

I personally prefer to keep the accuracy, because it doesn't seem right to tolerate an error of 1e-2 for a single batch_matmul op. It means the end-to-end error of any model using cublas.batch_matmul could exceed 1e-2. cc @Hzfengsy @Laurawly, as they added this flag back when it had not yet been deprecated.

@Hzfengsy (Member) commented:

I also prefer to keep the accuracy. As @comaniac said, a 1e-2 tolerance is too loose for larger end-to-end workloads.

@areusch (Contributor, Author) commented May 27, 2021

Superseded by #8130.

@areusch closed this May 27, 2021