Conversation

@yzh119 (Member) commented Mar 1, 2022

Previously, we could not bind the loops i/j to any data-parallel physical threads in the following example, because outer is determined to be neither a CompleteBlock nor a ReductionBlock:

  1. outer writes and reads b simultaneously, so it is not a complete block.
  2. outer has no init sub-block, so it is not a reduction block.
@T.prim_func
def nested_block_bind(a_ptr: T.handle, b_ptr: T.handle):
    a = T.match_buffer(a_ptr, [16, 16, 16, 16], "float32")
    b = T.match_buffer(b_ptr, [16, 16, 16], "float32")
    for i, j in T.grid(16, 16):
        with T.block("outer"):
            vi, vj = T.axis.remap("SS", [i, j])
            for k, l in T.grid(16, 16):
                with T.block("inner"):
                    vk, vl = T.axis.remap("SR", [k, l])
                    with T.init():
                        b[vi, vj, vk] = 0.0
                    b[vi, vj, vk] = b[vi, vj, vk] + a[vi, vj, vk, vl]

Such a case might happen after performing blockize or block isolation in Sparse TIR.

In this PR I changed the rule by which we determine reduction blocks: if a block has no init statement but contains sub-blocks, we check the following rules (see the sketch after this list):

  1. all block iters in the current block are data-parallel.
  2. all sub-blocks are complete or reduction blocks (this implies they are dominant).
  3. there is at least one reduction sub-block (i.e., a sub-block with an init block inside it).
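
A minimal sketch of this rule in Python (the helpers is_complete and is_reduction are hypothetical stand-ins for the corresponding TVM checks, not the actual API):

def is_reduction_by_subblocks(block, sub_blocks):
    # 1. All block iters of the current block must be data-parallel.
    if not all(iv.kind == "data_parallel" for iv in block.iter_vars):
        return False
    # 2. Every sub-block must be complete or a reduction block,
    #    which implies each of them is dominant in its scope.
    if not all(is_complete(b) or is_reduction(b) for b in sub_blocks):
        return False
    # 3. At least one sub-block must be a reduction block,
    #    i.e. it carries an init statement.
    return any(is_reduction(b) for b in sub_blocks)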

cc @Hzfengsy @MasterJH5574 @spectrometerHBH

The review thread below concerns this part of the implementation:

Array<StmtSRef> child_block_srefs = GetChildBlockSRefOnSRefTree(self, block_sref);
// If the block has no init but exactly one child block, and all its block
// iters are data-parallel, fall through to checking the child block.
if (!block->init.defined() && child_block_srefs.size() == 1 && all_iter_vars_data_parallel) {
  const StmtSRef& child_block_sref = child_block_srefs[0];
  // The child must be the only block writing to the buffers it touches.
  if (IsDominantBlock(self->GetBlockScope(block_sref), child_block_sref)) {
Contributor:
Heyyy, I'm curious why we require the child's dominance property?

@yzh119 (Member Author) Mar 2, 2022:

This is to avoid something like:

with T.block("outer"):
    vi, vj = T.axis.remap("SS", [i, j])
    b[vi, vj, 0] = b[vi, vj, 1] + b[vi, vj, 2]
    for k, l in T.grid(16, 16):
        with T.block("inner"):
            vk, vl = T.axis.remap("SR", [k, l])
            with T.init():
                b[vi, vj, vk] = 0.0
            b[vi, vj, vk] = b[vi, vj, vk] + a[vi, vj, vk, vl]

But unfortunately IsDominantBlock returns true here...

Contributor:

Okay... To catch this case, IsDominantBlock certainly doesn't work, since IsDominantBlock("inner") checks whether inner is the only block writing to b under outer, while b[vi, vj, 0] = b[vi, vj, 1] + b[vi, vj, 2] isn't wrapped by any sub-block, so it is never inspected.

IMO an alternative is to require outer to have a single child on the AST. What do you think of this idea?

@yzh119 (Member Author):

Yes, I checked the implementation of IsDominantBlock and found this case was not considered. But I wonder whether that is the desired behavior?

@yzh119 (Member Author):

Checking the AST sounds good, and I'm working on that.

Contributor:

> Yes, I checked the implementation of IsDominantBlock and found this case was not considered. But I wonder whether that is the desired behavior?

Right. “Block B is dominant” here means B is the only writer block of all the buffers it writes to, under the scope of B’s parent block. Hence we check all blocks under the parent block of B.

In case some BufferStore is not wrapped by a sub-block, the check indeed misses that BufferStore... I guess we expect all such BufferStores to be wrapped by some block, which might explain why we only check the blocks.

Member:

We can skip this corner case and use IsDominantBlock for now, since this case is a known problem in TIR, and I will fix them together.

@Hzfengsy (Member) commented Mar 2, 2022

Thanks, @yzh119, for pointing this out. On the other hand, can we somehow determine that outer is a complete block?

With this approach, we may meet another case that we should take care of:

for i, j in T.grid(16, 16):
    with T.block("outer_1"):
        vi, vj = T.axis.remap("SS", [i, j])
        for k, l in T.grid(16, 16):
            with T.block("inner_1"):
                vk, vl = T.axis.remap("SR", [k, l])
                with T.init():
                    b[vi, vj, vk] = 0.0
                b[vi, vj, vk] = b[vi, vj, vk] + a[vi, vj, vk, vl]
    with T.block("outer_2"):
        vi, vj = T.axis.remap("SS", [i, j])
        for k, l in T.grid(16, 16):
            with T.block("inner_2"):
                vk, vl = T.axis.remap("SR", [k, l])
                with T.init():
                    b[vi, vj, vk] = 0.0
                b[vi, vj, vk] = b[vi, vj, vk] + a[vi, vj, vk, vl]

outer_1 and outer_2 are neither complete blocks nor reduction blocks, while inner_1 and inner_2 are both reduction blocks within their own scopes. However, in this case, we cannot simply bind the thread index.

@yzh119 force-pushed the complete-block-rules branch from 294b402 to 6931872 on March 2, 2022 20:24
@yzh119 force-pushed the complete-block-rules branch from 6931872 to e5c27e1 on March 12, 2022 02:22
@yzh119 (Member Author) commented Mar 12, 2022

@Hzfengsy @MasterJH5574 @junrushao1994
I changed the rule by which we determine reduction blocks: if a block has no init statement but contains sub-blocks, we check the following rules:

  1. all block iters in the current block are data-parallel.
  2. all sub-blocks are complete or reduction blocks (this implies they are dominant).
  3. there is at least one reduction sub-block (i.e., a sub-block with an init block inside it).

WDYT?

@Hzfengsy (Member) commented Mar 12, 2022

> 1. all block iters in the current block are data-parallel.

Why is it not a complete block?

@yzh119 (Member Author) commented Mar 12, 2022

> > 1. all block iters in the current block are data-parallel.
>
> Why is it not a complete block?

@Hzfengsy I think nested blocks do not influence how we determine complete blocks, and the read and write regions should not overlap.

The only tricky part is reduction blocks, where reads and writes overlap, and the init may reside in some of the sub-blocks.

If there is a reduction iter var in the current block, then there should be an init block directly inside the current block, so we only need to consider the case where all iter vars are data-parallel.
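
For example, the following nested block (a hedged sketch in the same TVMScript style as the earlier examples) would be classified as complete under this reasoning: all of outer's block iters are data-parallel, and its read region (buffer a) does not overlap its write region (buffer b) despite the nesting:

@T.prim_func
def nested_complete(a_ptr: T.handle, b_ptr: T.handle):
    a = T.match_buffer(a_ptr, [16, 16, 16, 16], "float32")
    b = T.match_buffer(b_ptr, [16, 16, 16, 16], "float32")
    for i, j in T.grid(16, 16):
        with T.block("outer"):
            vi, vj = T.axis.remap("SS", [i, j])
            for k, l in T.grid(16, 16):
                with T.block("inner"):
                    vk, vl = T.axis.remap("SS", [k, l])
                    # reads a, writes b: no overlap, so the nesting does
                    # not prevent "outer" from being a complete block
                    b[vi, vj, vk, vl] = a[vi, vj, vk, vl] * 2.0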

@yzh119 closed this on Mar 15, 2022
@yzh119 (Member Author) commented Mar 15, 2022

After discussion, we decided to change the read regions of reduction blocks instead.

junrushao pushed a commit that referenced this pull request Mar 23, 2022
… blocks. (#10638)

After discussion w/ @spectrometerHBH @Hzfengsy, we decided to exclude a buffer access from the read regions if the buffer is being written to inside a reduction block. In this way, the outer block would not find an overlap between its read and write regions, thus solving the issue mentioned in #10420.

One tricky case is how to handle opaque memory accesses in `GetBlockReadWriteRegion`, where we have no hint about which buffer is being written to. I kept the original behavior: an opaque access is added to both the read and write regions of a block, regardless of whether it is a reduction block.
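
To illustrate the effect on the example from this PR's description (a sketch of the block regions, not the exact printer output): after the change, the written buffer no longer appears in the reduction block's read region, so the enclosing outer block sees no read/write overlap on b:

with T.block("inner"):
    vk, vl = T.axis.remap("SR", [k, l])
    # before: T.reads([b[vi, vj, vk], a[vi, vj, vk, vl]]), since the update
    # step reads b, so b appeared in both regions
    T.reads([a[vi, vj, vk, vl]])  # after: the written buffer is excluded
    T.writes([b[vi, vj, vk]])
    with T.init():
        b[vi, vj, vk] = 0.0
    b[vi, vj, vk] = b[vi, vj, vk] + a[vi, vj, vk, vl]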
pfk-beta pushed a commit to pfk-beta/tvm that referenced this pull request Apr 11, 2022