Conversation

@asparkhi (Contributor) commented Dec 8, 2021

This PR fixes the following issue: the MergeComposite callback function from the pattern table returns a non-boolean data type.

@u99127 left a comment

Would be good to have a test case for this.

Ramana

@asparkhi (Contributor, Author) commented Dec 8, 2021

True that. I clubbed it with another bug fix which I'm still working on. If it's okay with you, can I add that test later?

@u99127 commented Dec 8, 2021

We should be looking to add tests for bug fixes at the time the commits go in, to show what problem they are solving. Sometimes that is obvious, but in this case we don't have any coverage in the test suite for this failure.

Ideally, a PR should also fix one bug, not multiple bugs bundled together in an unclear fashion.

I'm not a committer or a reviewer in the TVM community, so others might have a different view.

@asparkhi (Contributor, Author) commented Dec 8, 2021

I agree with you on all the points there. Just wanted to avoid the long CI cycle. I will add the test.

```diff
 def check_qnn_binary_op(extract):
     """Check if multiply is supported by CMSIS-NN."""
-    return (
+    return bool(
```
@mshawcroft (Contributor) commented Dec 8, 2021:

I'm puzzled by this change. Specifically, I don't understand how inserting the bool() cast here changes the return type of the function; superficially, the two subclauses are str == str comparisons, each resulting in a bool, and applying `and` to two bools results in a bool. So what does the cast achieve?

The same logic applies to each of the previous bool casts inserted above...

What am I missing?

@asparkhi (Contributor, Author) replied Dec 8, 2021:

The problem originated in the checks for the Conv2D function, check_qnn_conv2d().

Here is how it looks:

```python
conv2d.attrs.out_dtype == "int32"
and conv2d.attrs.padding[2] == 0
and conv2d.attrs.padding[3] == 0
and conv2d_input.checked_type.dtype == "int8"
and conv2d_weight.checked_type.dtype == "int8"
and pattern.checked_type.dtype == "int8"
and bias_dtype == "int32"
and all([zp == 0 for zp in kernel_zp])
and (not is_depthwise or bias_add is not None)
```

The check conv2d.attrs.padding[2] == 0 returns an IntImm instead of a bool. So the fix could be casting just conv2d.attrs.padding[2] instead of casting the whole complex boolean expression. Just to be safe, I chose the latter for all the operators that involve accessing attributes.
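To see the failure mode concretely, here is a minimal, self-contained sketch (the IntImm class below is a stand-in for TVM's tir.IntImm, not TVM code). Python's `and` returns one of its operands rather than forcing a bool, so a single non-bool subclause leaks its type through the whole expression:

```python
class IntImm:
    """Stand-in for TVM's tir.IntImm: == yields another IntImm, not a bool."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        # Mimics the FFI-backed attribute comparison described above.
        return IntImm(int(self.value == int(other)))

    def __int__(self):
        return self.value

    def __bool__(self):
        return self.value != 0


padding = IntImm(0)

broken = ("int32" == "int32") and (padding == 0)    # `and` returns its RHS operand
fixed = ("int32" == "int32") and int(padding) == 0  # narrow cast restores a bool

print(type(broken).__name__)  # IntImm
print(type(fixed).__name__)   # bool
```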

@mshawcroft (Contributor) replied Dec 8, 2021:

Ah, OK, that's curious; presumably, once one of those subclauses has an integer type, the subsequent application of `and` just propagates the numerical type rather than giving a bool.

Thanks for the explanation.

This explanation makes sense for the conv2d case, where the padding comparison returns an IntImm, but presumably it does not apply to several of the other cases where bool() has been inserted, for instance the qnn_binary_op case that I commented on. In those cases it would probably be somewhat less confusing to the future reader of this code if we didn't insert the unnecessary bool() casts.

For the case(s) where == returns an IntImm instead of a bool, I wonder whether this is the actual bug, or whether the actual bug is that the underlying == machinery is broken and returns an inappropriate type. It is certainly not illegal to have == return an arbitrary object in Python, but there is a strong convention that __eq__ returns a bool; I'm thinking specifically of the principle of least astonishment.

If this really is the bug (and == really should be returning a non-bool object), then I'd suggest that, from the reader's perspective, inserting the bool() cast just around the subexpression that needs it, rather than around the entire expression, is less confusing. A comment as to why the seemingly spurious bool() cast is required might also be helpful.
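For reference, Python does indeed allow __eq__ to return an arbitrary object; NumPy is the best-known example, returning an elementwise array rather than a single bool:

```python
import numpy as np

# Array == array yields an elementwise boolean array, not a bool; this is
# why `a == b` cannot be used directly in an `if` without .any()/.all().
a = np.array([1, 2])
b = np.array([1, 3])
print(a == b)        # [ True False]
print(type(a == b))  # <class 'numpy.ndarray'>
```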

@asparkhi (Contributor, Author) replied:

Tagging @leandron in case he knows something about conv2d.attrs.padding[2] == 0 returning an IntImm instead of a bool. It's definitely surprising.

I'll update the code to cast just the subexpression, as you've suggested, since that makes the code more readable. Thanks @mshawcroft

@manupak (Contributor) commented:

Where do you see the IntImm?

I have seen Python-side bools become IntImms when accessed from C++ via TVM's FFIs. Is that the case here?
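A hedged illustration of this FFI round trip (requires a TVM installation; the exact returned class and dtype vary across TVM versions, so treat this as a sketch rather than a guaranteed output):

```python
import tvm

# tvm.runtime.convert maps Python scalars into TVM objects; a Python bool
# typically comes back as a TIR integer immediate rather than a Python bool.
converted = tvm.runtime.convert(True)
print(type(converted))              # e.g. <class 'tvm.tir.expr.IntImm'>
print(isinstance(converted, bool))  # False
```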

@manupak (Contributor) left a comment

Thanks! @ashutosh-arm

Sorry, I'm not fully following what is being fixed here. I think we need a test to get this PR in. I suppose it is the IntImm comparison?

I agree with @u99127 that we should break the padding refactor out of this PR and publish it as a separate PR concurrently. I don't think that will create a substantial delay, because we can review both PRs concurrently.

Then, if everything gets resolved by the end of the day, it will create a maximum delay of one day, which IMO is reasonable.

@asparkhi (Contributor, Author) commented Dec 9, 2021

@manupa-arm If you look at the latest changes I've pushed, they contain both the fix for the exact issue and the corresponding test.

The issue was that MergeComposite() failed because Conv2D's check function failed on cifar_10. By check function, I mean the callback registered with MergeComposite(). As you can see in the code snippet posted above in the comments, the IntImm comparison conv2d.attrs.padding[2] == 0 was causing the equality to return a non-bool value. As you pointed out, we are accessing C++ fields from Python here.

Regarding the test, @u99127 suggested adding a corresponding test, which I've pushed with my last commit. It is a negative test that introduces negative padding to trigger a failure in the check function.

So we now have both the fix and the test, as per the reviewers' comments.

@asparkhi (Contributor, Author) commented:

@u99127 @manupa-arm any other thoughts on this?

@manupak (Contributor) left a comment

Hi @ashutosh-arm

I did a pass and I am still a bit unclear on how the testing scope is increased to cover the bug. Is it because it checks for "VALID" padding as well? Earlier, the testing scope seems to have covered only "SAME" padding. Even then, the earlier version of the test seemed to generate values for all of pad_top, pad_left, pad_bottom and pad_right. So how was this not captured before?

Also, I found out that we don't have any TFLite-based single-operator E2E test. I think we should have that, in the absence of unit tests for the compilation pipeline at its different levels (i.e. graph partitioning).

@ekalda would you be able to take a look as well?

```diff
     conv2d.attrs.out_dtype == "int32"
-    and conv2d.attrs.padding[2] == 0
-    and conv2d.attrs.padding[3] == 0
+    and int(conv2d.attrs.padding[2]) == 0
```
@manupak (Contributor) commented:

I believe this is the actual fix, am I right? (The others are changes to test code.) What test in this PR would fail without this change?

I think we could have a simpler test to check whether graph partitioning works for a conv2d with padding values for top, left, bottom and right; see the sketch below.
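A rough sketch of such a test: build a standalone int8 qnn.conv2d with distinct top/left/bottom/right padding and run it through CMSIS-NN partitioning. The shapes, layouts, and quantization parameters here are illustrative assumptions, and a real offload check would also need the rest of the registered pattern (e.g. a following requantize), so this only shows the structure:

```python
import numpy as np
import tvm
from tvm import relay
from tvm.relay.op.contrib.cmsisnn import partition_for_cmsisnn

data = relay.var("data", shape=(1, 16, 16, 3), dtype="int8")
weight = relay.const(np.random.randint(-128, 128, (3, 3, 3, 8), dtype="int8"), "int8")
conv = relay.qnn.op.conv2d(
    data,
    weight,
    input_zero_point=relay.const(0, "int32"),
    kernel_zero_point=relay.const(0, "int32"),
    input_scale=relay.const(1.0, "float32"),
    kernel_scale=relay.const(1.0, "float32"),
    kernel_size=(3, 3),
    channels=8,
    padding=(1, 1, 2, 2),  # distinct top, left, bottom, right values
    data_layout="NHWC",
    kernel_layout="HWIO",
    out_dtype="int32",
)
mod = tvm.IRModule.from_expr(relay.Function([data], conv))

# Partition and inspect whether the conv2d ended up in a CMSIS-NN function.
partitioned = partition_for_cmsisnn(mod)
print(partitioned)
```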



```diff
-def get_same_padding(data, kernel, dilation, stride, cmsisnn_padding=True):
+def get_padding(data, kernel, dilation, stride, padding):
```
@manupak (Contributor) commented:

Is this change necessary to fix the bug here?

```python
kernel_w = kernel_shape[w_index]
a = relay.var("input", shape=shape, dtype=dtype)
p = (0, 0, 0, 0)
if padding == "SAME":
```
@manupak (Contributor) commented:

This seems orthogonal to the fix here.

```diff
-    pad_left, pad_right = (pad, 0) if cmsisnn_padding else (0, pad)
-    return [pad_top, pad_left, pad_bottom, pad_right]
+    pad_left, pad_right = (pad, 0) if padding == "SAME" else (0, pad)
+    return [pad_top, pad_left, pad_bottom, pad_right], True
```
@manupak (Contributor) commented:

Please add documentation of what this function returns. It used to return just the pad top, left, bottom and right values, which was simpler; augmenting that with a boolean is confusing.

I would recommend returning a named tuple here to indicate what the boolean corresponds to, along the lines of the sketch below.
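A small sketch of that shape (the field name needs_pad is my label, not a name from the PR):

```python
from collections import namedtuple

# Naming the boolean makes call sites self-describing: pad.needs_pad
# rather than an anonymous second return value called `result`.
Padding = namedtuple("Padding", ["top", "left", "bottom", "right", "needs_pad"])

def get_padding_sketch(pad_top, pad_left, pad_bottom, pad_right, padding_mode):
    # Hypothetical wrapper: the same four values as before, plus a named flag.
    return Padding(pad_top, pad_left, pad_bottom, pad_right, padding_mode == "SAME")

pad = get_padding_sketch(1, 1, 0, 0, "SAME")
if pad.needs_pad:
    print([pad.top, pad.left, pad.bottom, pad.right])  # [1, 1, 0, 0]
```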

```diff
-    pad_ = get_same_padding((shape[1], shape[2]), pool_size, dilation, strides)
+    dilation = (1, 1)
+    pad_, result = get_padding((shape[1], shape[2]), pool_size, dilation, strides, padding)
+    if result:
```
@manupak (Contributor) commented Dec 10, 2021:

I have added a comment below; it is a bit difficult to follow what "result" means here. I think a named tuple might help.

However, looking at this, I feel it is better to keep get_padding(..) returning just the pad top, bottom, left and right values, and to use the padding check here to create the nn.pad, rather than passing it into get_padding to return another boolean as 'result'.

```diff
     out = int(math.ceil(float(data[0]) / float(stride[0])))
     pad = max(0, (out - 1) * stride[0] + dilated_kernel_h - data[0])
-    pad_top, pad_bottom = (pad, 0) if cmsisnn_padding else (0, pad)
+    pad_top, pad_bottom = (pad, 0) if padding == "SAME" else (0, pad)
```
@manupak (Contributor) commented:

What pad mode does the else correspond to here?

@ekalda (Contributor) left a comment

I added some questions for clarification. In general, I agree with Manupa that it would be good to have some TFLite-based E2E tests as well if there aren't any; they might help us catch bugs early. But that's probably outside the scope of this PR :)


```diff
-def get_same_padding(data, kernel, dilation, stride, cmsisnn_padding=True):
+def get_padding(data, kernel, dilation, stride, padding):
     """Provides CMSIS-NN padding when output dim == input dim"""
```
@ekalda (Contributor) commented:

For my enlightenment, what is CMSIS-NN padding, and why is it provided only when output dim == input dim?

@asparkhi (Contributor, Author) replied:

CMSIS-NN padding is the same as TFLu padding. This function provides the "SAME" kind of padding defined by the TFLite / TF specs: when we expect the output to have the same dims as the input, the input to Conv2D / other ops needs to be padded first.
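A quick worked example of the arithmetic from the snippet quoted earlier (the numbers are illustrative): for a 16-wide input, a 3-wide kernel, stride 1, and dilation 1, keeping the output at width 16 requires 2 columns of total padding:

```python
import math

data, kernel, stride, dilation = 16, 3, 1, 1

dilated_kernel = (kernel - 1) * dilation + 1              # 3
out = int(math.ceil(float(data) / float(stride)))         # 16: SAME keeps the dim
pad = max(0, (out - 1) * stride + dilated_kernel - data)  # 2 total pad columns

# Sanity check: conv output width with that padding really is 16.
print((data + pad - dilated_kernel) // stride + 1)  # 16
```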



```diff
-def get_same_padding(data, kernel, dilation, stride, cmsisnn_padding=True):
+def get_padding(data, kernel, dilation, stride, padding):
```
@ekalda (Contributor) commented:

What does data represent here? If it is the shape of a tensor, then I think the parameter name should reflect that (or some documentation could help).

@asparkhi (Contributor, Author) replied:

Good point. I will take care of it.

Comment on lines 51 to 54:

```python
op = relay.nn.pad(
    op,
    pad_width=[(0, 0), (pad_[0], pad_[2]), (pad_[1], pad_[3]), (0, 0)],
```
@ekalda (Contributor) commented:

For enlightenment, why do we add a pad operator to the graph instead of using the padding attribute of pool2d?

@asparkhi (Contributor, Author) replied:

That troubled me as well at the time of implementing it :)
In CMSIS-NN, the padding is only used for calculating the appropriate start and end for the pooling kernel. I think this should work either way; I will try it out without the preceding padding op.

@asparkhi (Contributor, Author) commented:

We should definitely discuss having a TFLu E2E test with a wider audience.

@manupak (Contributor) commented Dec 10, 2021

Agreed! That part is out of scope for this PR; it was meant as a general comment :)

As we discussed offline, if we can simply use the get_<type>_padding() helpers, as well as an invalid case to exercise the non-offloading scenario, that should be good enough to get the PR in.

@asparkhi force-pushed the merge_composite_error branch from 0eb97a5 to 0541a0c on December 10, 2021 at 16:02.
@manupak (Contributor) left a comment

LGTM (modulo others' comments) now! Thanks for working with me on this.

@u99127 commented Dec 10, 2021

> LGTM (modulo others' comments) now! Thanks for working with me on this.

LGTM.

Thanks,
Ramana

@manupak merged commit a28a8bf into apache:main on Dec 11, 2021.
@manupak (Contributor) commented Dec 11, 2021

Thanks @ashutosh-arm @u99127 @mshawcroft @ekalda. This is merged now!

@asparkhi deleted the merge_composite_error branch on December 15, 2021.