Conversation
!test --diff
Review updated until commit 329f19a
!test --diff

!test --diff
```cpp
SegmentedGroup* group = to_visit.front();
to_visit.pop_front();

if (group->exprs().empty()) {
```
Not strictly related, but while testing this PR, I was hitting a bug that would have been easily spotted if this check had been in place.
```cpp
// Grab factory ops that should be forwarded. Add created tensors to
// the fusion input list to make them handled like fusion inputs
// TODO: Handle more factory methods such as IotaOp, EyeOp,
```
I was originally planning to add support for other factory ops but decided to leave them as follow-up extensions.
!test --diff
Codediff results

TestNvFuserFrontend::test_issue1872: This seems expected because the full-op result is used in two slices.

jit_codegen_diff_20_6/7, jit_codegen_diff_20_7/7: Something seems off in these two cases. They are showing lots of new and removed tests. Maybe the script somehow fails to match two results? CC: @jacobhinkle
```cpp
for (auto expr : completeFusion()->exprs()) {
  if (expr->isA<FullOp>() &&
      // Don't bother if it's a fusion output
      !expr->output(0)->isFusionOutput()) {
```
Should we forward these even when they are fusion outputs? That would still save the bandwidth of reading gmem in other consumers, right?
That's true in theory, but in practice it'll likely be cached anyway, so the impact would be minimal.
Implementation-wise, we would need to further extend the forwarding logic, since it could cause something unexpected if multiple segments had the same op producing the same fusion output.
I just don't think it's worthwhile supporting such a case.
The fusion segmenter sets aside certain sequences of unary ops starting with fusion inputs, which we call forwarding. It effectively works as an optimization by recomputing (cheap) unary ops in each consumer segment instead of passing tensors from one segment to another.
This PR extends the forwarding optimization to those starting with factory methods. Here's a motivating example (Litgpt Llama 3 RoPE backward):
The T81 tensor is the output of a full op. The tensor is used inside both the yellow and gray segments. The op itself is in the yellow segment, so the tensor is created there and passed, through gmem, to the gray segment. Obviously, cheap ops like this should just be replicated in the gray segment instead of passing a full tensor. Here's another way to see it:

And T81 is used in the next segment:

There are multiple ways to achieve that. What seems to make the most sense to me is to extend the existing forwarding method to handle cases like this. The existing method only considers ops starting with fusion inputs, which do not include factory-created tensors.
This PR applies a small change to the forwarding logic to include factory ops as well. The end result, with the above example, is that the full result is no longer passed around. Here's the first segment:
Notice that T81 is no longer a segment output. And here's the second segment: