
Conversation

@llehtahw (Contributor) commented Apr 19, 2021

After #7497, this snippet fails:

import tvm


def test_tensorize_reuse_compute():
    def get_compute_args():
        l = 2
        a = tvm.te.placeholder([l], name="a")
        b = tvm.te.placeholder([l], name="b")
        return a, b, tvm.te.compute([l], lambda i: a[i] + b[i])

    a, b, c = get_compute_args()

    def get_intrin():
        def _intrin_func(ins, outs):
            return tvm.tir.call_packed("fakeadd", ins[0], ins[1], outs[0])
        return tvm.te.decl_tensor_intrin(c.op, _intrin_func)

    s = tvm.te.create_schedule([c.op])
    s[c].tensorize(c.op.axis[0], get_intrin())
    tvm.lower(s, (a, b, c))
Check failed: (expr_equal(lhs, rhs)) is false: Failed to match the compute with TensorIntrin tensor_intrin's declaration  provided= (a[i] + b[i]), intrin=  (a[i] + b[i])

When the same compute body is used in both decl_tensor_intrin and create_schedule, this Substitute changes the addresses of some nodes, and the check then fails.

I haven't figured out why this happens when the compute body is reused, but I think we can drop the comparison when the two exprs are exactly the same object.
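
To illustrate the distinction at play (reference identity versus structural comparison of the expression trees), here is a minimal sketch against TVM's Python API; the variable names are illustrative only:

import tvm
from tvm import te

l = 2
a = te.placeholder([l], name="a")
b = te.placeholder([l], name="b")
c = te.compute([l], lambda i: a[i] + b[i])

body = c.op.body[0]
# same_as checks reference identity: both handles point at the same node
assert body.same_as(c.op.body[0])
# structural_equal walks the two expression trees instead of comparing addresses
assert tvm.ir.structural_equal(body, body)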

@llehtahw changed the title from "[Tensorize]Fix tensorize error while reusignbai" to "[Tensorize]Fix tensorize error while reusing compute" on Apr 19, 2021
@llehtahw (Contributor, Author) commented

@leeexyz @comaniac do you have any suggestion?

@comaniac (Contributor) commented

This example doesn't seem right to me. At first glance, I don't think we should reuse the compute in this case.
@leeexyz @tqchen please share your thoughts.

@llehtahw (Contributor, Author) commented

> This example doesn't seem right to me. At first glance, I don't think we should reuse the compute in this case.

I agree with you. And it's not just this example; reuse can be avoided in every case.

But fundamentally, reuse is not a special case for expr analysis or tensorize matching. Should TVM complain about that usage?

If reuse should be avoided, I think we still need a clearer error message that tells the user how to correct the code.

@leeexyz (Contributor) commented Apr 21, 2021

@llehtahw @comaniac The modification did not consider this reuse case. :( Substituting the LHS (aka the provide) as well should solve it. The reason is that after the Normalize step, IterVar i is rebased, but only the RHS (aka the intrin) has been updated:

// src/te/operation/tensorize.cc:330
// proposed: substitute the rebased itervars into the LHS (the provide) as well
    PrimExpr lhs = ana.Simplify(Substitute(body[i], value_map));

I think it is okay to reuse compute, since it is a way to reuse the compute concept to describe the behavior of a HW intrinsic. Actually, we do some reuse too, but we don't use the same compute object directly; what we do looks like the following.

    a, b, c = get_compute_args()
    s = tvm.te.create_schedule([c.op])
    # create a fresh op with an identical body, rather than reusing c
    _, _, intrin_c = get_compute_args()
    # declare the intrinsic from the fresh op
    intrin_c = get_intrin(intrin_c)
    # do the tensorize
    s[c].tensorize(c.op.axis[0], intrin_c)
    tvm.lower(s, (a, b, c))
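
For this workaround to run, get_intrin has to accept the fresh tensor instead of closing over c as in the reproducer above; a minimal sketch of that adaptation (the parameterized signature is an assumption here):

def get_intrin(tensor):
    def _intrin_func(ins, outs):
        # same fake packed call as in the reproducer
        return tvm.tir.call_packed("fakeadd", ins[0], ins[1], outs[0])

    # declare the intrinsic against the fresh op, so the schedule's compute
    # and the intrinsic's compute are distinct node trees
    return tvm.te.decl_tensor_intrin(tensor.op, _intrin_func)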

@llehtahw (Contributor, Author) commented

Thank you, @leeexyz.

> _, _, intrin_c = get_compute_args()

That's also my workaround.

I think both usages should be considered valid. Would you like to create another PR to complete #7497?

And then this one will be closed.

@leeexyz (Contributor) commented Apr 21, 2021

@llehtahw Thanks for pointing this out. :) One more thing: why not use call_extern, etc., to achieve the same goal?
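
(For context, a sketch of what a call_extern variant of the intrinsic body could look like; the return dtype and the access_ptr usage are illustrative assumptions:)

def _intrin_func(ins, outs):
    # call_extern names a C symbol and takes an explicit return dtype,
    # whereas call_packed dispatches through the PackedFunc registry
    return tvm.tir.call_extern(
        "int32",
        "fakeadd",
        ins[0].access_ptr("r"),
        ins[1].access_ptr("r"),
        outs[0].access_ptr("w"),
    )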

@leeexyz (Contributor) commented Apr 21, 2021

What I should explain a bit more: in our case, we use compute to define the HW intrinsic, and it may look the same as an operator such as an element-wise Add. But it is not the same compute (I mean the same addresses of all the nodes), only the same concept. :)

For example, an Add op has 4 dims (i0, i1, i2, i3), the three innermost axes (i1, i2, i3) are split, and then after a reorder the region of the inner axes (i1.i, i2.i, i3.i) is tensorized to an Intrin_Add. In this situation, the Add op and the Intrin_Add share the compute concept.
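
Roughly, that pattern could be sketched as below (a hedged illustration, not code from this PR: the shapes, split factors, the "hw_add" symbol, and the strided-buffer binds are all assumptions):

import tvm
from tvm import te


def intrin_add(l1, l2, l3):
    a = te.placeholder((l1, l2, l3), name="a")
    b = te.placeholder((l1, l2, l3), name="b")
    c = te.compute((l1, l2, l3), lambda i, j, k: a[i, j, k] + b[i, j, k], name="c")

    # strided buffers let the intrinsic match an inner region of a larger tensor
    def make_buf(t, name):
        return tvm.tir.decl_buffer(
            t.shape, t.dtype, name=name,
            strides=[te.var(name + "_s%d" % i) for i in range(3)],
            offset_factor=1,
        )

    def _intrin_func(ins, outs):
        return tvm.tir.call_packed("hw_add", ins[0], ins[1], outs[0])

    binds = {t: make_buf(t, t.name) for t in (a, b, c)}
    return te.decl_tensor_intrin(c.op, _intrin_func, binds=binds)


A = te.placeholder((4, 8, 8, 8), name="A")
B = te.placeholder((4, 8, 8, 8), name="B")
C = te.compute((4, 8, 8, 8), lambda i0, i1, i2, i3: A[i0, i1, i2, i3] + B[i0, i1, i2, i3], name="C")

s = te.create_schedule(C.op)
i0, i1, i2, i3 = s[C].op.axis
i1o, i1i = s[C].split(i1, factor=2)
i2o, i2i = s[C].split(i2, factor=2)
i3o, i3i = s[C].split(i3, factor=2)
# gather the inner axes into a contiguous innermost region ...
s[C].reorder(i0, i1o, i2o, i3o, i1i, i2i, i3i)
# ... and hand that (2, 2, 2) region to the Intrin_Add
s[C].tensorize(i1i, intrin_add(2, 2, 2))
print(tvm.lower(s, [A, B, C]))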

So, in the reuse case, it would be more natural to use call_extern, I think. Anyway, I think it is better to fix this issue by updating the LHS as well.

@llehtahw (Contributor, Author) commented

> Why not use call_extern, etc., to achieve the same goal?

In fact, my example was only drafted to show the Check failed error, so it's not important whether call_extern or call_packed is used here.

Actually, I use tensorize to take over the whole compute, with the compute body generated elsewhere, so tensorize is the only thing I do in such a schedule. I reused the compute body by accident; it worked for a long time, and then cost a lot to debug 😆

> Anyway, I think it is better to fix this issue by updating the LHS as well.

+1

Thanks for the explanation, @leeexyz.
