Add new batch_gemm types #466
@Rbiessy, cc |
@AidanBeltonS Thanks for the PR. Before going through the review in more detail, what is your plan for this issue? Why does OpenBLAS come into the picture here? I would prefer to have all applicable backends working before adding these new APIs. "I have been unable to test the mkl backends as I was running into some problems regarding duplicate definitions between the mkl headers and the openBlas/CBlas headers."
hjabird
left a comment
Since the reference cblas implementation doesn't support some of the operations that are being added (as I understand it), is the new functionality actually tested?
Hey @mmeterel, I checked with Aidan about the issue with the MKL backends. The duplicate definitions seemed to be an issue with the setup or build commands used. We ran into another issue with undefined references with Looking at We can use 2024.0 for the tests for now. Aidan is running more tests. |
@Rbiessy @AidanBeltonS AFAIK, there should not be any issues with missing symbols with 2024.1. This version has been in CI for a while now. I would suspect it can be a rebase issue on your branch. We should make it functional with 2024.1. |
@andrewtbarker Will you be able to help with this review? |
Sure, I will take a look. |
andrewtbarker
left a comment
Thanks for the PR, there is a lot of good work here. Most of my comments are just about style and naming consistency.
Yes, this should have been fixed in #445. If not, we should fix it.
Have you tested the PR with hipSYCL/AdaptiveCpp? Can you please add the logs?
e436f1c to
20a7057
No, I have not tested hipSYCL. I have attached the other backend tests below. Netlib and portBLAS are passing fine. MKL has some failing tests due to tolerances, which I am investigating further. It seems it deviates more from the reference implementation in some cases. MKL tests error:
It looks like |
The failures in Dot are due to error. DotU is an odd one; however, it does not appear to be related to my changes.
I have resolved all but one issue with GemmBatch's tests. The CPU MKL implementation has significant amounts of error compared to the GPU. I believe there may be a fundamental difference in the precision of the calculation for the CPU. One possible fix would be to increase the tolerance significantly just for the CPU. I'm not a fan of this approach, as it is a bit of a brute-force solution. Does anyone have any recommendations on how they would like to see this handled?
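To illustrate how accumulator precision alone could produce a gap of this kind, here is a minimal standalone sketch. It is not taken from the PR or from any backend: `dot_accumulate` and `accumulation_gap` are hypothetical helpers, and the input values are arbitrary. It computes the same length-`k` dot product of `float` inputs once with a `float` accumulator and once with a `double` accumulator, then reports the relative difference.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: dot product of two float vectors, accumulated in type Acc.
template <typename Acc>
Acc dot_accumulate(const std::vector<float>& a, const std::vector<float>& b) {
    Acc sum = Acc(0);
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += static_cast<Acc>(a[i]) * static_cast<Acc>(b[i]);
    }
    return sum;
}

// Hypothetical helper: relative difference between float and double
// accumulation for a length-k dot product over arbitrary test data.
double accumulation_gap(std::size_t k) {
    std::vector<float> a(k), b(k);
    for (std::size_t i = 0; i < k; ++i) {
        a[i] = 0.1f + 1e-3f * static_cast<float>(i % 7);
        b[i] = 0.3f - 1e-3f * static_cast<float>(i % 5);
    }
    const double in_double = dot_accumulate<double>(a, b);
    const double in_float = static_cast<double>(dot_accumulate<float>(a, b));
    return std::fabs(in_double - in_float) / std::fabs(in_double);
}
```

Calling `accumulation_gap(k)` for increasing `k` gives a rough feel for how the float/double discrepancy grows with the inner dimension, which is one input to deciding whether a tolerance should scale with `k` rather than being raised by a fixed factor.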
Can you please test hipSYCL backend as well? |
andrewtbarker
left a comment
Some suggestions to make interpreting failed test results easier - I flagged a few places but there are similar issues in most of the new tests.
If, as we suspect, the CPU backend is doing accumulation in double while the GPU backend does it in float, one option would be changing what reference gemm from |
What is the status here? As I see it we have three outstanding items:
(1) may be a larger issue with CI that in my opinion can be dealt with separately in another PR. (2) is minor and should be easy to fix. I hope (3) is also minor but I'm not sure, is there any progress understanding it? |
Hi @andrewtbarker, I have updated the status by email as it was easier to discuss issues with testing AdaptiveCpp on the CI. In short, there are a few issues @AidanBeltonS will need to look at once he is back from holiday next week!
I have addressed items 2. and 3. |
andrewtbarker
left a comment
I think we're good to go. Thanks for sticking with this one!
6e70f68 to
7dee1c1
I have also disabled the int8, float combination for MKLCPU/GPU as I found similar precision issues.
I have confirmed the tests pass with AdaptiveCpp on AMD and Nvidia HW. |
Add support for more batch_gemm types to follow the specification. Some combinations using int8 are disabled on some backends due to precision issues.
Description
This adds new data types for the gemm_batch operation, to better be in line with the oneMKL spec. The types added are <half, half, float, float>, <int8, int8, float, float>, and <int8, int8, int32, float>.
New testing is added for these data types. Tests where the scalar type does not match the input type require a higher tolerance as the reference calculation is being performed at a much higher precision.
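As a rough illustration of what a mixed-type reference calculation can look like, here is a minimal standalone sketch. It is not the reference used by the test suite: `ref_gemm` is a hypothetical name, the layout is assumed row-major with no transposes, and accumulating in `double` is an assumption made only to mirror the "much higher precision" reference described above. The `half` type is left out because standard C++ has no portable half-precision type before C++23; `std::int8_t` and `std::int32_t` stand in for the int8 and int32 types.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference GEMM for a single batch entry (row-major, no transposes):
//   C = alpha * A * B + beta * C
// A is m x k, B is k x n, C is m x n.
// Ta/Tb are the input types, Tc the output type, Ts the scalar type.
template <typename Ta, typename Tb, typename Tc, typename Ts>
void ref_gemm(std::size_t m, std::size_t n, std::size_t k, Ts alpha,
              const std::vector<Ta>& a, const std::vector<Tb>& b, Ts beta,
              std::vector<Tc>& c) {
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            // Assumption: accumulate in double regardless of the input types.
            double acc = 0.0;
            for (std::size_t p = 0; p < k; ++p) {
                acc += static_cast<double>(a[i * k + p]) *
                       static_cast<double>(b[p * n + j]);
            }
            const double result = static_cast<double>(alpha) * acc +
                                  static_cast<double>(beta) *
                                      static_cast<double>(c[i * n + j]);
            // Convert to the output type only once, at the end.
            c[i * n + j] = static_cast<Tc>(result);
        }
    }
}
```

With `std::vector<std::int8_t>` inputs, a `std::vector<std::int32_t>` output, and `float` scalars, the template arguments are deduced as the `<int8, int8, int32, float>` combination; swapping the output vector for `std::vector<float>` gives `<int8, int8, float, float>`. A backend that accumulates in a narrower type would be expected to differ from this sketch by rounding error, which is why such tests compare against a tolerance rather than for exact equality.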
Test logs:
rocblas_test_log.txt
cublas_test_log.txt
I have been unable to test the mkl backends as I was running into some problems regarding duplicate definitions between the mkl headers and the openBlas/CBlas headers.
Fixes #446
Checklist
All Submissions
New interfaces
it was accepted? # (RFC)
New features