FP4 cast kernel through FusionExecutorCache#4748
Conversation
|
Review updated until commit 91d9c34 Description
Changes walkthrough 📝
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
|
!test --diff |
|
!test |
|
Bench failure might be related, looking into it. |
Should be fixed now |
|
!test --diff |
| auto rhs_bit_multiple = | ||
| broadcast_bit_multiples[break_point_i].rhs_multiple; | ||
|
|
||
| // Estimate transfer cost with this break point |
There was a problem hiding this comment.
Hmm... I don't think the original code of this part makes sense. The unit of cur_transfer_size and right_transfer_size were in the unit of byte ^ N, where N is the number of dimensions on the left/right of the break point. My PR will change it to bit ^ N, so in this sense, my PR is NOT a no-op. But I believe we should not block my PR because of this. Instead, we should fix it in a later PR.
There was a problem hiding this comment.
Later in code, we are comparing the transfer sizes with L2 cache. This makes no sense. We can not compare a quantity with unit byte ^ N with a quantity with unit byte.
There was a problem hiding this comment.
Ah, so that's because there's * lhs_byte_multiple and * rhs_byte_multiple inside the loops. Yeah, that doesn't make sense.
|
I'm not sure if the diff results are benign or not. For example, This is one of the segments of |
I believe it is due to #4748 (comment) |
OK, yeah, that's likely the case. |
naoyam
left a comment
There was a problem hiding this comment.
LGTM. The issue with the breakpoint analysis is concerning, but I agree with you that it should not block this PR.
I begin with adding a test `Fp4CastToHighPrecisionFusionExecutorCache` and start to fix errors. But end up changing many things in our system to use bit instead of byte. Except for `Fp4CastToHighPrecisionFusionExecutorCache`, I expect this PR to have no behavioral change other than byte -> bit.
I begin with adding a test
Fp4CastToHighPrecisionFusionExecutorCacheand start to fix errors. But end up changing many things in our system to use bit instead of byte. Except forFp4CastToHighPrecisionFusionExecutorCache, I expect this PR to have no behavioral change other than byte -> bit.