[fix] Change the rule of determining compact dataflow. #10420
Conversation
Array<StmtSRef> child_block_srefs = GetChildBlockSRefOnSRefTree(self, block_sref);
if (!block->init.defined() && child_block_srefs.size() == 1 && all_iter_vars_data_parallel) {
  const StmtSRef& child_block_sref = child_block_srefs[0];
  if (IsDominantBlock(self->GetBlockScope(block_sref), child_block_sref)) {
Hey, I'm curious why we require the child's dominance property?
This is to avoid something like:

with T.block("outer"):
    vi, vj = T.axis.remap("SS", [i, j])
    b[vi, vj, 0] = b[vi, vj, 1] + b[vi, vj, 2]
    for k, l in T.grid(16, 16):
        with T.block("inner"):
            vk, vl = T.axis.remap("SR", [k, l])
            with T.init():
                b[vi, vj, vk] = 0.0
            b[vi, vj, vk] = b[vi, vj, vk] + a[vi, vj, vk, vl]

But unfortunately, IsDominantBlock returns true here...
Okay... To avoid this, IsDominantBlock definitely doesn't work, since IsDominantBlock("inner") checks “whether inner is the only block writing to b under outer,” while b[vi, vj, 0] = b[vi, vj, 1] + b[vi, vj, 2] isn't wrapped by any sub-block.
IMO an alternative is to require outer to have a single child on the AST. What do you think of this idea?
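Roughly, such a check might look like the sketch below. This is a hypothetical helper, not code from the PR; the name has_single_child_on_ast and the reliance on SeqStmt are assumptions about how one could express the idea:

# Hypothetical helper: "outer" has a single child on the AST only if its body
# is not a SeqStmt with more than one statement. A stray BufferStore sitting
# next to the inner block would make the body a SeqStmt and fail this check.
from tvm import tir

def has_single_child_on_ast(block: tir.Block) -> bool:
    body = block.body
    if isinstance(body, tir.SeqStmt):
        return len(body.seq) == 1
    return True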
Yes, when I checked the implementation of IsDominantBlock I found this case was not considered. But I wonder whether that is the desired behavior?
Checking the AST sounds good, and I'm working on that.
> Yes, I checked the implementation of IsDominantBlock, I found this case was not considered. But I wonder if it's desired behavior?
Right. “Block B is dominant” here means B is the only writer block of all the buffers it writes to, under the scope of B’s parent block. Hence we check all blocks under the parent block of B.
In case some BufferStore is not wrapped by a sub-block, the check indeed misses that BufferStore... I guess we expect all such BufferStores to be wrapped by some block, and that might explain why we only check the blocks.
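A rough Python sketch of that definition follows. It is an illustrative paraphrase with assumed helper names, not the actual C++ IsDominantBlock:

from tvm import tir

def is_dominant(parent_block: tir.Block, block: tir.Block) -> bool:
    """Return True if no other block under `parent_block` writes any buffer
    that `block` writes. BufferStores that are not wrapped in a sub-block are
    invisible to this walk, which is exactly the gap discussed above."""
    written = {region.buffer for region in block.writes}
    conflict = [False]

    def visit(stmt):
        if isinstance(stmt, tir.Block) and not stmt.same_as(block):
            if any(region.buffer in written for region in stmt.writes):
                conflict[0] = True

    tir.stmt_functor.post_order_visit(parent_block.body, visit)
    return not conflict[0]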
We can skip this corner case and use IsDominantBlock for now, since this case is a known problem in TIR and I will fix it together with that.
Thanks @yzh119 for pointing this out. On the other side, can we somehow determine …
By this approach, we may meet another case that we should take care of:

for i, j in T.grid(16, 16):
    with T.block("outer_1"):
        vi, vj = T.axis.remap("SS", [i, j])
        for k, l in T.grid(16, 16):
            with T.block("inner_1"):
                vk, vl = T.axis.remap("SR", [k, l])
                with T.init():
                    b[vi, vj, vk] = 0.0
                b[vi, vj, vk] = b[vi, vj, vk] + a[vi, vj, vk, vl]
    with T.block("outer_2"):
        vi, vj = T.axis.remap("SS", [i, j])
        for k, l in T.grid(16, 16):
            with T.block("inner_2"):
                vk, vl = T.axis.remap("SR", [k, l])
                with T.init():
                    b[vi, vj, vk] = 0.0
                b[vi, vj, vk] = b[vi, vj, vk] + a[vi, vj, vk, vl]
@Hzfengsy @MasterJH5574 @junrushao1994 WDYT?
Why is it not a complete block?
@Hzfengsy I think the nested block would not influence how we determine complete blocks, and the read and write regions should not overlap. The only tricky thing is reduction blocks, where there are overlaps between reads and writes. If there is a reduction iter-var in the current block, then there should be a …
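For instance, a minimal reduction block like the sketch below (buffer names and shapes are illustrative, not from the PR) reads and writes the same region of B, so it can never satisfy the non-overlap requirement of a complete block:

from tvm.script import tir as T

@T.prim_func
def row_sum(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (16, 16), "float32")
    B = T.match_buffer(b, (16,), "float32")
    for i, k in T.grid(16, 16):
        with T.block("sum"):
            vi, vk = T.axis.remap("SR", [i, k])
            with T.init():
                B[vi] = 0.0
            # B[vi] appears on both sides: the block's read and write
            # regions overlap, so it can only be a reduction block.
            B[vi] = B[vi] + A[vi, vk]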
This case might help explain my point:
After discussion, we decided to change the read region of reduction blocks instead.
… blocks. (#10638)
After discussion w/ @spectrometerHBH @Hzfengsy, we decided to exclude a buffer access from the read regions if the buffer is being written to inside a reduction block. In this way, the outer block no longer sees its read and write regions overlap, which solves the issue mentioned in #10420. One tricky case is how to handle opaque memory access in `GetBlockReadWriteRegion`, where we have no hint about which buffer is being written to. There I keep the original behavior: the opaque access is added to both the read and write regions of a block, no matter whether it's a reduction block or not.
Previously, we could not bind the loops i/j to any data-parallel physical threads in the following example, because outer is determined as neither a CompleteBlock nor a ReductionBlock:

- outer writes and reads b simultaneously, so it's not a complete block.
- outer has no init sub-block, so it's not a reduction block.

Such a case might happen after performing blockize or block isolation in Sparse TIR.
In this PR I changed the rule by which we determine reduction blocks: if there is no init block and there are sub-blocks, we check the following rules:
cc @Hzfengsy @MasterJH5574 @spectrometerHBH