Conversation

@yzh119 (Member) commented Mar 1, 2022

Previously, we could not bind the loops i/j to any data-parallel physical threads in the following example, because outer is determined to be neither a CompleteBlock nor a ReductionBlock:

  1. outer writes and reads b simultaneously, so it is not a complete block.
  2. outer has no init sub-block, so it is not a reduction block.
@T.prim_func
def nested_block_bind(a_ptr: T.handle, b_ptr: T.handle):
    a = T.match_buffer(a_ptr, [16, 16, 16, 16], "float32")
    b = T.match_buffer(b_ptr, [16, 16, 16], "float32")
    for i, j in T.grid(16, 16):
        with T.block("outer"):
            vi, vj = T.axis.remap("SS", [i, j])
            for k, l in T.grid(16, 16):
                with T.block("inner"):
                    vk, vl = T.axis.remap("SR", [k, l])
                    with T.init():
                        b[vi, vj, vk] = 0.0
                    b[vi, vj, vk] = b[vi, vj, vk] + a[vi, vj, vk, vl]

Such a case might happen after performing blockize or block isolation in Sparse TIR.

In this PR I changed the rule by which we determine reduction blocks: if a block has no init statement but contains sub-blocks, we check the following rules (see the sketch after this list):

  1. all block iters in the current block are data-parallel.
  2. all sub-blocks are complete or reduction blocks (this implies they are dominant).
  3. there is at least one reduction sub-block (i.e., a sub-block with an init block inside it).
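
A minimal sketch of this rule in Python (the helpers is_complete and is_reduction are hypothetical stand-ins for the corresponding TVM checks, not the actual API):

def is_reduction_by_subblocks(block, sub_blocks):
    # 1. All block iters of the current block must be data-parallel.
    if not all(iv.kind == "data_parallel" for iv in block.iter_vars):
        return False
    # 2. Every sub-block must be complete or a reduction block,
    #    which implies each of them is dominant in its scope.
    if not all(is_complete(b) or is_reduction(b) for b in sub_blocks):
        return False
    # 3. At least one sub-block must be a reduction block,
    #    i.e. it carries an init statement.
    return any(is_reduction(b) for b in sub_blocks)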

cc @Hzfengsy @MasterJH5574 @spectrometerHBH

The review thread below concerns this part of the implementation:

Array<StmtSRef> child_block_srefs = GetChildBlockSRefOnSRefTree(self, block_sref);
// If the block has no init but exactly one child block, and all its block
// iters are data-parallel, fall through to checking the child block.
if (!block->init.defined() && child_block_srefs.size() == 1 && all_iter_vars_data_parallel) {
  const StmtSRef& child_block_sref = child_block_srefs[0];
  // The child must be the only block writing to the buffers it touches.
  if (IsDominantBlock(self->GetBlockScope(block_sref), child_block_sref)) {
Contributor:
Heyyy, I'm curious why we require the child's dominance property?

@yzh119 (Member Author) Mar 2, 2022:

This is to avoid something like:

with T.block("outer"):
    vi, vj = T.axis.remap("SS", [i, j])
    b[vi, vj, 0] = b[vi, vj, 1] + b[vi, vj, 2]
    for k, l in T.grid(16, 16):
        with T.block("inner"):
            vk, vl = T.axis.remap("SR", [k, l])
            with T.init():
                b[vi, vj, vk] = 0.0
            b[vi, vj, vk] = b[vi, vj, vk] + a[vi, vj, vk, vl]

But unfortunately IsDominantBlock returns true here...

Contributor:

Okay... To catch this case, IsDominantBlock certainly doesn't work, since IsDominantBlock("inner") checks whether inner is the only block writing to b under outer, while b[vi, vj, 0] = b[vi, vj, 1] + b[vi, vj, 2] isn't wrapped by any sub-block, so it is never inspected.

IMO an alternative is to require outer to have a single child on the AST. What do you think of this idea?

@yzh119 (Member Author):

Yes, I checked the implementation of IsDominantBlock and found this case was not considered. But I wonder whether that is the desired behavior?

@yzh119 (Member Author):

Checking the AST sounds good, and I'm working on that.

Contributor:

> Yes, I checked the implementation of IsDominantBlock and found this case was not considered. But I wonder whether that is the desired behavior?

Right. “Block B is dominant” here means B is the only writer block of all the buffers it writes to, under the scope of B’s parent block. Hence we check all blocks under the parent block of B.

In case some BufferStore is not wrapped by a sub-block, the check indeed misses that BufferStore... I guess we expect all such BufferStores to be wrapped by some block, which might explain why we only check the blocks.

Member:

We can skip this corner case and use IsDominantBlock for now, since this case is a known problem in TIR, and I will fix them together.

@Hzfengsy (Member) commented Mar 2, 2022

Thanks, @yzh119, for pointing this out. On the other hand, can we somehow determine that outer is a complete block?

With this approach, we may meet another case that we should take care of:

for i, j in T.grid(16, 16):
    with T.block("outer_1"):
        vi, vj = T.axis.remap("SS", [i, j])
        for k, l in T.grid(16, 16):
            with T.block("inner_1"):
                vk, vl = T.axis.remap("SR", [k, l])
                with T.init():
                    b[vi, vj, vk] = 0.0
                b[vi, vj, vk] = b[vi, vj, vk] + a[vi, vj, vk, vl]
    with T.block("outer_2"):
        vi, vj = T.axis.remap("SS", [i, j])
        for k, l in T.grid(16, 16):
            with T.block("inner_2"):
                vk, vl = T.axis.remap("SR", [k, l])
                with T.init():
                    b[vi, vj, vk] = 0.0
                b[vi, vj, vk] = b[vi, vj, vk] + a[vi, vj, vk, vl]

outer_1 and outer_2 are neither complete blocks nor reduction blocks, while inner_1 and inner_2 are both reduction blocks within their own scopes. However, in this case, we cannot simply bind the thread index.

@yzh119 force-pushed the complete-block-rules branch from 294b402 to 6931872 on March 2, 2022 20:24
@yzh119 force-pushed the complete-block-rules branch from 6931872 to e5c27e1 on March 12, 2022 02:22
@yzh119 (Member Author) commented Mar 12, 2022

@Hzfengsy @MasterJH5574 @junrushao1994
I changed the rule by which we determine reduction blocks: if a block has no init statement but contains sub-blocks, we check the following rules:

  1. all block iters in the current block are data-parallel.
  2. all sub-blocks are complete or reduction blocks (this implies they are dominant).
  3. there is at least one reduction sub-block (i.e., a sub-block with an init block inside it).

WDYT?

@Hzfengsy (Member) commented Mar 12, 2022

> 1. all block iters in the current block are data-parallel.

Why is it not a complete block?

@yzh119 (Member Author) commented Mar 12, 2022

> > 1. all block iters in the current block are data-parallel.
>
> Why is it not a complete block?

@Hzfengsy I think nested blocks do not influence how we determine complete blocks, and the read and write regions should not overlap.

The only tricky part is reduction blocks, where reads and writes overlap, and the init may reside in some of the sub-blocks.

If there is a reduction iter var in the current block, then there should be an init block directly inside the current block, so we only need to consider the case where all iter vars are data-parallel.
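
For example, the following nested block (a hedged sketch in the same TVMScript style as the earlier examples) would be classified as complete under this reasoning: all of outer's block iters are data-parallel, and its read region (buffer a) does not overlap its write region (buffer b) despite the nesting:

@T.prim_func
def nested_complete(a_ptr: T.handle, b_ptr: T.handle):
    a = T.match_buffer(a_ptr, [16, 16, 16, 16], "float32")
    b = T.match_buffer(b_ptr, [16, 16, 16, 16], "float32")
    for i, j in T.grid(16, 16):
        with T.block("outer"):
            vi, vj = T.axis.remap("SS", [i, j])
            for k, l in T.grid(16, 16):
                with T.block("inner"):
                    vk, vl = T.axis.remap("SS", [k, l])
                    # reads a, writes b: no overlap, so the nesting does
                    # not prevent "outer" from being a complete block
                    b[vi, vj, vk, vl] = a[vi, vj, vk, vl] * 2.0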

@yzh119 closed this on Mar 15, 2022
@yzh119 (Member Author) commented Mar 15, 2022

After discussion, we decided to change the read regions of reduction blocks instead.

junrushao pushed a commit that referenced this pull request Mar 23, 2022
… blocks. (#10638)

After discussion w/ @spectrometerHBH @Hzfengsy, we decided to exclude a buffer access from the read regions if the buffer is being written to inside a reduction block. In this way, the outer block would not find an overlap between its read and write regions, thus solving the issue mentioned in #10420.

One tricky case is how to handle opaque memory accesses in `GetBlockReadWriteRegion`, where we have no hint about which buffer is being written to. I kept the original behavior: an opaque access is added to both the read and write regions of a block, regardless of whether it is a reduction block.
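
To illustrate the effect on the example from this PR's description (a sketch of the block regions, not the exact printer output): after the change, the written buffer no longer appears in the reduction block's read region, so the enclosing outer block sees no read/write overlap on b:

with T.block("inner"):
    vk, vl = T.axis.remap("SR", [k, l])
    # before: T.reads([b[vi, vj, vk], a[vi, vj, vk, vl]]), since the update
    # step reads b, so b appeared in both regions
    T.reads([a[vi, vj, vk, vl]])  # after: the written buffer is excluded
    T.writes([b[vi, vj, vk]])
    with T.init():
        b[vi, vj, vk] = 0.0
    b[vi, vj, vk] = b[vi, vj, vk] + a[vi, vj, vk, vl]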
pfk-beta pushed a commit to pfk-beta/tvm that referenced this pull request Apr 11, 2022