
Conversation


@guberti guberti commented Sep 21, 2022

This pull request removes a number of restrictions on the usage of the depthwise_conv2d schedule for microTVM. It:

  • Adds support for the int16 input data type
  • Adds support for arbitrarily sized (including asymmetric) kernels
  • Allows the int16 version to run on Cortex-M0 and M3 (which do not have SIMD instructions)
    • This is accomplished without performance loss
  • Adds unit tests for these features

It also removes use of the SMLAD instruction, which cannot improve performance on NHWC layouts (the comments below discuss why). This yields a slight performance boost for depthwise convolutions where kernel_width * kernel_height is odd, and no change otherwise. The pull request also contains some readability improvements.

Known issue: my call to topi.reshape in depthwise_conv2d.py results in useless C code being generated that unnecessarily duplicates an array. The performance hit from this is small, but it is a gross issue.


areusch commented Sep 21, 2022

cc @Mousius @ekalda @leandron

@guberti guberti force-pushed the micro/depthwise-weights-reorg branch from 711a213 to 5bba3d0 on September 27, 2022 13:04
@guberti guberti changed the title [microTVM] Replace fancy depthwise_conv2d kernel packing scheme [microTVM] Generalize depthwise_conv2d schedule Sep 27, 2022

guberti commented Sep 27, 2022

Why doesn't SMLAD improve performance?

Recall that the SMLAD instruction takes two 32-bit values, each packing a pair of int16 values x1::x2 and y1::y2, plus an accumulator z, and computes z += x1 * y1 + x2 * y2. For NHWC layouts, however, the relevant x1::x2 values are not adjacent in the input tensor. Previously, we used the DSP-specific halfword packing instruction PKHBT to fix this and then called SMLAD afterward: two instructions for two multiplies.
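To make the packed semantics concrete, here is a minimal Python emulation of what SMLAD computes (names are illustrative only; saturation and Q-flag behavior are not modeled):

```python
def _s16(v: int) -> int:
    """Interpret the low 16 bits of v as a signed int16."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v

def smlad(x: int, y: int, z: int) -> int:
    """Emulate Arm SMLAD: multiply the signed 16-bit halves of the 32-bit
    values x and y pairwise, then add both products to the accumulator z."""
    x1, x2 = _s16(x), _s16(x >> 16)
    y1, y2 = _s16(y), _s16(y >> 16)
    return z + x1 * y1 + x2 * y2
```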

However, there is a non-DSP instruction SMLAxy that is present on all Cortex-M cores (see docs). It lets us read just one 16-bit half of a 32-bit register while performing a multiply-accumulate, so we can skip the PKHBT instruction. Doing the multiplies this way is just as fast, while being far more versatile and simpler.
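In the same illustrative style (reusing _s16 from the sketch above), SMLABT is one of the four SMLAxy variants; the B/T suffixes pick the bottom or top half of each operand, which is exactly what removes the need for PKHBT:

```python
def smlabt(x: int, y: int, z: int) -> int:
    """Emulate Arm SMLABT: multiply the Bottom 16-bit half of x by the
    Top 16-bit half of y and add the single product to the accumulator z."""
    return z + _s16(x) * _s16(y >> 16)
```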

This means SMLAD cannot be used to speed things up for input tensors in NHWC format: nowhere in the input tensor is the relevant data packed the way SMLAD expects, so at least one extra instruction is always needed to fix the layout. That extra instruction alone erases any benefit SMLAD has over the non-DSP SMLAxy instruction.

Note that for the NCHW format, SMLAD would be very helpful. We should look into changing the format in the Relay graph, as this would yield a major performance improvement.
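As a quick illustration of the layout point (plain NumPy, not TVM code): in NCHW, neighboring width positions of a single channel sit in adjacent int16 slots that SMLAD could consume as one 32-bit word, while in NHWC they are separated by the channel stride:

```python
import numpy as np

nchw = np.zeros((1, 8, 4, 4), dtype="int16")  # N, C, H, W
nhwc = np.zeros((1, 4, 4, 8), dtype="int16")  # N, H, W, C

print(nchw.strides[3])  # 2 bytes: W-neighbors of one channel are adjacent
print(nhwc.strides[2])  # 16 bytes: W-neighbors are 8 int16 elements apart
```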


@areusch areusch left a comment


Thanks @guberti, this looks pretty good. Leaving a couple comments. @tkonolige, could you give the AlterOpLayout part a look?

KH, KW, _, _ = get_const_tuple(kernel.shape)
simd_width = get_dtype_simd_width(data.dtype)

HWOI_kernel_np = inputs[1].data.numpy()
Contributor

You'll need to check that the kernel is a constant and fall back to a different implementation if it is not.

Member Author

I'm not sure how easy it is to check if the kernel is a constant from python/tvm/relay/op/strategy/arm_cpu.py, but you're right that it is a thing we should check. I've added an assertion, though it is a bit of a stopgap solution.

Contributor

Could you add a message to the assert, plus a comment about what needs to be done so it's no longer a stopgap?

Member Author

Unfortunately, a clean solution is hard, as the strategy function does not have access to the needed information. When conv2d_alter_op is called, inputs[1] (the kernel) has the form:

meta[relay.Constant][0] /* ty=Tensor[(3, 3, 3, 8), int16] */

However when the Relay strategy functions are called, inputs[1] (the kernel) looks like:

Tensor(shape=[3, 3, 8, 1], op.name=placeholder)

Nowhere inside relay/op/strategy do any of the strategy functions check whether the relevant tensors are constant, so there's not much we can do. I've added comments explaining this, but please let me know if you have ideas for how this could be done.
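For reference, a check along these lines works at alter-op time, where the kernel is still a Relay expression (a sketch only; the helper name is hypothetical):

```python
from tvm import relay

def kernel_is_constant(inputs) -> bool:
    # At conv2d_alter_op time, inputs[1] is still a Relay expression, so we
    # can test whether it is bound to a constant. By the time the strategy
    # functions run, it has been lowered to a TE placeholder and this
    # information is no longer available.
    return isinstance(inputs[1], relay.Constant)
```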


guberti commented Sep 27, 2022

Thanks for the detailed comments @areusch @tkonolige! I've addressed your comments with dee04b1 - please take another look.

@guberti guberti force-pushed the micro/depthwise-weights-reorg branch from dee04b1 to 5198aea on September 27, 2022 21:09

@ekalda ekalda left a comment


LGTM and interesting discussion about using DSP instructions for depthwise! cc @ashutosh-arm for visibility


@tkonolige tkonolige left a comment


Thanks @guberti!

@areusch areusch merged commit e3a6cb6 into apache:main Sep 28, 2022
@areusch areusch mentioned this pull request Sep 28, 2022
AndrewZhaoLuo pushed a commit that referenced this pull request Sep 28, 2022
* Method without SMLAD
* Remove kernel packing without decreasing speed
* Finish removing weights reorg
* Unit tests for larger kernels
* Prototype int16 depthwise schedule
* Bugfixes and unit tests
* Formatting and linting
* Linting fix
* Address comments from code review
* Fix accidental winograd bug
* Clarifying comment about Relay constant assertion
* Another round of code review comments
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
guberti added a commit that referenced this pull request Jan 13, 2023
