Fixing fusion segmenter dropping allocation domain #1033
Conversation
jacobhinkle
left a comment
Good idea. It seems like this implementation will work for stride order, but not if there are Exprs between alloc and rfactor. In that case we would need a replay.
Good point.

!build

I am not 100% sure, but I was thinking BestEffortReplay could do this.

!build
That would be
CI is all green. What's the parse job failure that was reported?! @xwang233

That's from an old run (with the same commit), since GitHub doesn't allow deleting a commit status. You may ignore it. Sorry for the confusion.

!build
Errr, nuking it didn't work. I'm seeing failures on: and the second test failure actually killed the node. 😠

lol, the error message gives me a headache. https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/jobs/71022350/raw
This reverts commit 32b0561.
Here's the failing test fusion printMath. We are segmenting the pointwise op after the pad into a separate kernel, and this is what we have when we clean up the rfactor domain on inputs. So I compared the before and after on that kernel. Looks like, with the rfactor thing in place, the input tensor view is carrying something coming from the previous pad operation?!
I put up a separate issue to track the suggested removal: #1040. For this PR, I'll revert that and refactor to use the replay suggested by Jacob/Naoya.
!build

This is a little bit tricky. See #840. That carryover scalar comes from concretization, where we currently modify extents downstream of dynamic reshapes. Note also #912, which would preserve extents during concretization instead, but which also hits this issue on the first segment instead of the second one; that one is relatively easy to fix by just including the output sizes in the For #840 I need to fix the traversal before merging, since currently it includes scalars that are already included as extents of input TVs. Once that is done, these tests should pass again.
IIUC, we can just wait until your fixes are in and re-evaluate nuking the function. Meanwhile, does this PR serve as a reasonable workaround at this time, so we can move forward with our allocation domain plumbing on inputs?

Yes, I think so. We can address #1040 once #840 goes through, or in conjunction. Thanks!
@jjsjann123 @jacobhinkle Another potential error would be misusing root domains in place of rfactor domains. In the early days we tended to think only about root domains even when rfactor should have been used. There may still be some code doing that for fusion/segment inputs, as they are (currently) guaranteed not to have rfactor domains.
csrc/fusion_segmenter.cpp (outdated)

```cpp
    new_root_domain, tv->domain()->contiguity());

if (tv->domain()->hasAllocation()) {
  // we need to replay the new root domain following the old rfactor domain
```
Could you add tests that actually require this replay? The current one only reorders the domains.
Turns out testing this is tricky, since we don't have convertInputRfactorsToRoots exposed.
So we only hit this function when we go through the automatic scheduler, which triggers segmentation. But the automatic scheduler does not support complex transforms on the allocation domain.
The reason is that setAllocationDomain has a check requiring the allocation domain to be "consistent" with the leaf domain (see nvfuser::ir_utils::validateDomainEquivalence). So in order to set a merge or something like that on the allocation domain, I need to apply the same transformation to the leaf domain.
But we are not supposed to hand a transformed fusion to the automatic scheduler.
Here's the issue when I try to run this example.
```cpp
auto tv0 = TensorViewBuilder().ndims(3).shape({-1, -1, -1}).build();
fusion->addInput(tv0);
auto s0 = IrBuilder::create<Val>(5, DataType::Float);
auto tv1 = add(tv0, s0);
fusion->addOutput(tv1);
// tv0->merge(0, 1); // sorry this code shouldn't have been here. It's copied after I started changing the code to manual scheduling.
std::vector<IterDomain*> alloc_domain = {tv0->axis(2)};
IterDomain* merged_id =
    IterDomainBuilder(
        fusion->zeroVal(),
        mul(tv0->axis(0)->extent(), tv0->axis(1)->extent()))
        .parallel_type(tv0->axis(0)->getParallelType())
        .iter_type(tv0->axis(0)->getIterType())
        .build();
IrBuilder::create<Merge>(
    tv0->axis(0)->container(), merged_id, tv0->axis(0), tv0->axis(1));
alloc_domain.push_back(merged_id);
tv0->setAllocationDomain(alloc_domain, {false, true});
```

```
C++ exception with description "derived_domain_ == frontier_ INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/ir/utils.cpp":877, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Invalid derived domain. Initial domain: iS2{i2}, iS3{( i0 * i1 )}. Derived domain: iS0{i0}, iS1{i1}, iS2{i2}
```
If we really want to test this code path with transformations, I guess we can expose this function directly. But that feels a bit too much. Tagging @naoyam for a second opinion.
Is there a typo here: `std::vector<IterDomain*> alloc_domain = {tv0->axis(2)};`? Since tv0 is merged to a 2D tensor, `tv0->axis(2)` should not work, right?
Sorry, I copied the wrong code... I was trying to manually schedule it after running into the error.
Then I realized that manual scheduling won't hit the same code path unless I expose the function.
If you look at this:

```cpp
std::vector<IterDomain*> alloc_domain = {tv0->axis(2)};
IterDomain* merged_id =
    IterDomainBuilder(
        fusion->zeroVal(),
        mul(tv0->axis(0)->extent(), tv0->axis(1)->extent()))
        .parallel_type(tv0->axis(0)->getParallelType())
        .iter_type(tv0->axis(0)->getIterType())
        .build();
IrBuilder::create<Merge>(
    tv0->axis(0)->container(), merged_id, tv0->axis(0), tv0->axis(1));
alloc_domain.push_back(merged_id);
```

You can see that I was manually merging the two IterDomains.
I'm not entirely sure what you're trying to do here, but the error is because of the manual merge, which doesn't update the leaf domain. The check in setAllocationDomain is to make sure the allocation domain sits between the root and leaf domains, and now the code is trying to set an allocation domain that's outside of the path.
Conceptually, this should be completely fine, since the new allocation domain is equivalent to the leaf domain. However, I don't think the current indexing works with that, since it basically assumes the allocation domain consists of IterDomains that are ancestors of the leaf IterDomains. Fundamentally, it should be possible to traverse from the leaf IterDomains to whatever allocation domain we want, but that's not how it works at the moment.
For this PR, can we just set up a tensor with an allocation domain that sits between root and rfactor domains? Isn't there a test doing that already?
Thanks for trying it out. Looks like there are still many things we need to do in the scheduler, as expected.
For the time being, please add an assertion that the allocation domain is just a reordered rfactor domain. Or we could expose this function as public so that we could test it without going through the scheduler/segmenter. Either seems fine to me.
I'll add an assertion for now, since that's easier to remove later when we try to fix other utils in the scheduler.
Rolled everything back to the original commit without the replay refactor.

!build
The failing CI is irrelevant and patched here: #1058

!build
naoyam
left a comment
LGTM, but why are there lots of diff changes? False alarms?
Hmmm, I'm surprised that the code changes here cause a codegen diff... I'll try to take a look at that before merging this one.
OOM on those CI machines? I'm also seeing some diff coming from here. cc'ing @xwang233
Probably the build concurrency was too high. I've added a limit on the concurrency. Feel free to restart a build.
Should we increase this value? Line 318 in ecc9123 |
I'm gonna ignore the issue then, since it feels low-risk for this PR to impact generated code.
Fixes #1021