
Conversation


@guberti guberti commented Sep 21, 2022

This pull request removes a number of restrictions on the usage of the depthwise_conv2d schedule for microTVM. It:

  • Adds support for the int16 input data type
  • Adds support for arbitrarily sized (including asymmetric) kernels
  • Allows the int16 version to run on Cortex-M0 and M3 (which do not have SIMD instructions)
    • This is accomplished without performance loss
  • Adds unit tests for these features

It also removes use of the SMLAD instruction, which cannot improve performance on NHWC layouts (the comments below discuss why). This yields a slight performance boost for depthwise convolutions where kernel_width * kernel_height is odd, and no change otherwise. The pull request also contains some readability improvements.

Known issue: my call to topi.reshape in depthwise_conv2d.py results in useless C code being generated that unnecessarily duplicates an array. The performance hit from this is small, but it is a gross issue.


areusch commented Sep 21, 2022

cc @Mousius @ekalda @leandron

@guberti guberti force-pushed the micro/depthwise-weights-reorg branch from 711a213 to 5bba3d0 on September 27, 2022 13:04
@guberti guberti changed the title [microTVM] Replace fancy depthwise_conv2d kernel packing scheme [microTVM] Generalize depthwise_conv2d schedule Sep 27, 2022

guberti commented Sep 27, 2022

Why doesn't SMLAD improve performance?

Recall that the SMLAD instruction takes two 32-bit values, each packing a pair of int16 values x1::x2 and y1::y2, plus an accumulator z, and computes z += x1 * y1 + x2 * y2. For NHWC layouts, however, the relevant x1::x2 values are not adjacent in the input tensor. Previously, we used the DSP-specific halfword packing instruction PKHBT to fix this and then called SMLAD afterward: two instructions for two multiplies.
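To make the packed semantics concrete, here is a minimal Python emulation of what SMLAD computes (names are illustrative only; saturation and Q-flag behavior are not modeled):

```python
def _s16(v: int) -> int:
    """Interpret the low 16 bits of v as a signed int16."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v

def smlad(x: int, y: int, z: int) -> int:
    """Emulate Arm SMLAD: multiply the signed 16-bit halves of the 32-bit
    values x and y pairwise, then add both products to the accumulator z."""
    x1, x2 = _s16(x), _s16(x >> 16)
    y1, y2 = _s16(y), _s16(y >> 16)
    return z + x1 * y1 + x2 * y2
```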

However, there is a non-DSP instruction SMLAxy that is present on all Cortex-M cores (see docs). It lets us read just one 16-bit half of a 32-bit register while performing a multiply-accumulate, so we can skip the PKHBT instruction. Doing the multiplies this way is just as fast, while being far more versatile and simpler.
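In the same illustrative style (reusing _s16 from the sketch above), SMLABT is one of the four SMLAxy variants; the B/T suffixes pick the bottom or top half of each operand, which is exactly what removes the need for PKHBT:

```python
def smlabt(x: int, y: int, z: int) -> int:
    """Emulate Arm SMLABT: multiply the Bottom 16-bit half of x by the
    Top 16-bit half of y and add the single product to the accumulator z."""
    return z + _s16(x) * _s16(y >> 16)
```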

This means SMLAD cannot be used to speed things up for input tensors in NHWC format: nowhere in the input tensor is the relevant data packed the way SMLAD expects, so at least one extra instruction is always needed to fix the layout. That extra instruction alone erases any benefit SMLAD has over the non-DSP SMLAxy instruction.

Note that for the NCHW format, SMLAD would be very helpful. We should look into changing the format in the Relay graph, as this would yield a major performance improvement.
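As a quick illustration of the layout point (plain NumPy, not TVM code): in NCHW, neighboring width positions of a single channel sit in adjacent int16 slots that SMLAD could consume as one 32-bit word, while in NHWC they are separated by the channel stride:

```python
import numpy as np

nchw = np.zeros((1, 8, 4, 4), dtype="int16")  # N, C, H, W
nhwc = np.zeros((1, 4, 4, 8), dtype="int16")  # N, H, W, C

print(nchw.strides[3])  # 2 bytes: W-neighbors of one channel are adjacent
print(nhwc.strides[2])  # 16 bytes: W-neighbors are 8 int16 elements apart
```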


@areusch areusch left a comment


Thanks @guberti, this looks pretty good. Leaving a couple comments. @tkonolige, could you give the AlterOpLayout part a look?

KH, KW, _, _ = get_const_tuple(kernel.shape)
simd_width = get_dtype_simd_width(data.dtype)

HWOI_kernel_np = inputs[1].data.numpy()
Contributor

You'll need to check that the kernel is a constant and fall back to a different implementation if it is not.

Member Author

I'm not sure how easy it is to check if the kernel is a constant from python/tvm/relay/op/strategy/arm_cpu.py, but you're right that it is a thing we should check. I've added an assertion, though it is a bit of a stopgap solution.

Contributor

Could you add a message to the assert, plus a comment about what needs to be done so it's no longer a stopgap?

Member Author

Unfortunately, a clean solution is hard, as the strategy function does not have access to the needed information. When conv2d_alter_op is called, inputs[1] (the kernel) has the form:

meta[relay.Constant][0] /* ty=Tensor[(3, 3, 3, 8), int16] */

However when the Relay strategy functions are called, inputs[1] (the kernel) looks like:

Tensor(shape=[3, 3, 8, 1], op.name=placeholder)

Nowhere inside relay/op/strategy do any of the strategy functions check whether the relevant tensors are constant, so there's not much we can do. I've added comments explaining this, but please let me know if you have ideas for how this could be done.
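For reference, a check along these lines works at alter-op time, where the kernel is still a Relay expression (a sketch only; the helper name is hypothetical):

```python
from tvm import relay

def kernel_is_constant(inputs) -> bool:
    # At conv2d_alter_op time, inputs[1] is still a Relay expression, so we
    # can test whether it is bound to a constant. By the time the strategy
    # functions run, it has been lowered to a TE placeholder and this
    # information is no longer available.
    return isinstance(inputs[1], relay.Constant)
```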


guberti commented Sep 27, 2022

Thanks for the detailed comments @areusch @tkonolige! I've addressed your comments with dee04b1 - please take another look.

@guberti guberti force-pushed the micro/depthwise-weights-reorg branch from dee04b1 to 5198aea on September 27, 2022 21:09

@ekalda ekalda left a comment


LGTM and interesting discussion about using DSP instructions for depthwise! cc @ashutosh-arm for visibility


@tkonolige tkonolige left a comment


Thanks @guberti!

@areusch areusch merged commit e3a6cb6 into apache:main Sep 28, 2022
@areusch areusch mentioned this pull request Sep 28, 2022
AndrewZhaoLuo pushed a commit that referenced this pull request Sep 28, 2022
* Method without SMLAD
* Remove kernel packing without decreasing speed
* Finish removing weights reorg
* Unit tests for larger kernels
* Prototype int16 depthwise schedule
* Bugfixes and unit tests
* Formatting and linting
* Linting fix
* Address comments from code review
* Fix accidental winograd bug
* Clarifying comment about Relay constant assertion
* Another round of code review comments
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
guberti added a commit that referenced this pull request Jan 13, 2023
