[TOPI] depthwise-conv2d in NCHW[x]c layout for x86 #2045
Conversation
cc @FrozenGene, what is the relation of this PR to #2028?
@tqchen This is for x86; #2028 is for ARM. I'm a little curious about #2028: I tested the ARM schedule on an x86 CPU (c5.9xlarge). @FrozenGene's branch got ~21.4 ms (I added the Intel CPU registrations manually, since #2028 removes them), while the previous ARM schedule (in current tvm) got ~2.2 ms. I know it is not fair to benchmark an ARM schedule on an x86 CPU, and it is not tuned, but the current ARM depthwise schedule is also not tuned on x86, so such a large performance drop looks somewhat weird. @merrymercy @FrozenGene could you double check? I can also help test on ARM once I get a device.

Another comment: on x86, the most efficient layout for normal conv2d turns out to be NCHW[x]c, so by keeping depthwise-conv in the same layout we can get rid of the layout transformations between layers. Thus I believe that on x86 the solution in this PR would be better than the NCHW ARM schedule.
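For readers unfamiliar with the NCHW[x]c layout, a minimal numpy illustration (the shapes here are hypothetical, with x = 16):

```python
import numpy as np

# NCHW: (batch, channel, height, width); hypothetical shapes, x = 16
nchw = np.random.rand(1, 64, 56, 56).astype("float32")

# NCHW[x]c splits the channel axis into chunks of x and moves the
# inner block to the last (vectorizable) dimension:
# (1, 64, 56, 56) -> (1, 4, 56, 56, 16)
x = 16
nchwc = nchw.reshape(1, 64 // x, x, 56, 56).transpose(0, 1, 3, 4, 2)
assert nchwc.shape == (1, 4, 56, 56, 16)
```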
We should add dilation arguments as in #1970; then we won't have to convert the logs when we start to optimize for dilation.
As @merrymercy said, we haven't uploaded any tuned config logs for #2028. I want to know how you compared: did you tune both schedules on the x86 CPU and then run, or just run without tuning? If you tuned, as noted in #2028, you should set the XGBTuner constructor's feature-type argument to feature_type='knob', i.e. XGBTuner(tsk, loss_type='rank', feature_type='knob').
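For concreteness, a minimal sketch of the suggested tuner setup; the task `tsk` and the tuning budget are assumptions, only the constructor arguments come from the comment above:

```python
from tvm import autotvm
from tvm.autotvm.tuner import XGBTuner

# tsk: an autotvm task extracted beforehand (assumed here);
# feature_type='knob' is what #2028 recommends for its templates
tuner = XGBTuner(tsk, loss_type='rank', feature_type='knob')
tuner.tune(n_trial=1000,  # illustrative budget
           measure_option=autotvm.measure_option(
               builder=autotvm.LocalBuilder(),
               runner=autotvm.LocalRunner(number=10)))
```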
@FrozenGene I ran without tuning. My question is rather about the default/fallback schedule; the previous one looks far better. When running the previous one, I also got warnings.

Let's continue this discussion in #2028.
@merrymercy Thanks, I'll check out and verify.
nnvm/python/nnvm/top/nn.py (outdated)

```python
padding = attrs.get_int_tuple("padding")
strides = attrs.get_int_tuple("strides")
dilation = attrs.get_int_tuple("dilation")
channels = attrs.get_int("channels")
```
Just out of curiosity: are channels here the output channels? If yes, it would be better to name it appropriately for clarity.
Yes, will do.
```python
from ..util import simplify

# workload description of depthwise-conv2d
Workload = namedtuple('Workload',
```
Do we want dilation in here? Or is that for a separate PR?
Workload here is for getting the default schedule; since dilation so far does not affect how we calculate configs, I'd rather keep it simple for now.
Makes sense. This is resolved from my side.
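As a rough sketch of what such a workload record could look like; the field names here are illustrative assumptions, not necessarily the exact ones in the PR:

```python
from collections import namedtuple

# Illustrative only: a depthwise-conv2d workload keyed on shapes,
# padding, and strides, deliberately without dilation (see above).
Workload = namedtuple('Workload',
                      ['in_dtype', 'out_dtype', 'height', 'width',
                       'in_filter', 'out_filter', 'hkernel', 'wkernel',
                       'hpad', 'wpad', 'hstride', 'wstride'])

wkl = Workload('float32', 'float32', 112, 112, 32, 32, 3, 3, 1, 1, 1, 1)
```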
```python
data_pad = data

# depthconv stage
di = tvm.reduce_axis((0, filter_height), name='di')
```
Are kh, kw better names for di, dj? Just trying to be consistent with other Intel CPU schedules.
```python
dj = tvm.reduce_axis((0, filter_width), name='dj')
Output = tvm.compute(
    (batch, out_channel_chunk, out_height, out_width, out_channel_block),
    lambda b, oco, i, j, oci: tvm.sum(
```
Same as above with oh, ow for i, j?
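To make the compute being reviewed concrete, here is a minimal sketch of a depthwise conv2d in NCHW[x]c layout. It assumes channel_multiplier == 1, unit stride, pre-padded input, and a (oco, kh, kw, oci) filter layout (inferred from the axis names in the diff), so the indexing stays simple:

```python
import tvm

# hypothetical shapes
batch, in_channel_chunk, in_height, in_width, in_channel_block = 1, 2, 18, 18, 16
filter_height, filter_width = 3, 3
out_height = in_height - filter_height + 1
out_width = in_width - filter_width + 1
# with channel_multiplier == 1, output channels mirror input channels
out_channel_chunk, out_channel_block = in_channel_chunk, in_channel_block

data_pad = tvm.placeholder(
    (batch, in_channel_chunk, in_height, in_width, in_channel_block),
    name='data_pad')
Filter = tvm.placeholder(
    (out_channel_chunk, filter_height, filter_width, out_channel_block),
    name='Filter')

di = tvm.reduce_axis((0, filter_height), name='di')
dj = tvm.reduce_axis((0, filter_width), name='dj')
Output = tvm.compute(
    (batch, out_channel_chunk, out_height, out_width, out_channel_block),
    lambda b, oco, i, j, oci: tvm.sum(
        data_pad[b, oco, i + di, j + dj, oci] * Filter[oco, di, dj, oci],
        axis=[di, dj]),
    name='DepthwiseConv2d')
```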
```python
s[C].vectorize(ic_block)
parallel_axis = s[C].fuse(ic_chunk, oh)
s[C].parallel(parallel_axis)
s[C].unroll(ow_block)
```
Out of curiosity, do we need this if we are unrolling the s[CC] block later?
No, we don't. I'll also remove the vectorize above.
```python
_, ic_chunk, oh, ow, ic_block = s[C].op.axis
ow_chunk, ow_block = s[C].split(ow, factor=tile_ow)
s[C].reorder(ic_chunk, oh, ow_chunk, ow_block, ic_block)
s[C].vectorize(ic_block)
```
Out of curiosity, do we need this if we are vectorizing ic_block for s[CC] later?
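For context, the overall scheduling pattern under discussion, as a minimal sketch continuing from the compute sketch above (the axis names follow the diff; tile_ow = 8 is an arbitrary illustrative factor, normally chosen by autotvm):

```python
s = tvm.create_schedule(Output.op)
C = Output
tile_ow = 8  # illustrative tiling factor

batch_ax, ic_chunk, oh, ow, ic_block = s[C].op.axis
ow_chunk, ow_block = s[C].split(ow, factor=tile_ow)
s[C].reorder(ic_chunk, oh, ow_chunk, ow_block, ic_block)
s[C].vectorize(ic_block)              # SIMD over the inner channel block
parallel_axis = s[C].fuse(ic_chunk, oh)
s[C].parallel(parallel_axis)          # threads over channel-chunk x rows
```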
```python
                                 dtype=DepthwiseConv2d.dtype), ctx)
relu_tvm = tvm.nd.array(np.zeros(shape=get_const_tuple(Relu.shape), dtype=Relu.dtype), ctx)
# launch kernel 1 (depthwise_conv2d)
timer_1 = f1.time_evaluator(f1.entry_name, ctx, number=1)
```
If 'number' here means how many times we run the experiment, then we should use a higher number.
Well, since we only check functionality here, I'd rather remove the time_evaluator and just call f1(...) instead.
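That is, something along these lines; the array names are assumptions based on the test context above, not the exact ones in the PR:

```python
# functionality check only: run once, then compare with the reference
f1(input_tvm, filter_tvm, depthwise_conv2d_tvm)
np.testing.assert_allclose(depthwise_conv2d_tvm.asnumpy(),
                           depthwise_conv2d_ref, rtol=1e-5)
```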
```python
def _transform_data(data, bn):
    # NCHW -> NCHW[x]c
    batch_size, channel, height, width = data.shape
    data = np.transpose(data, (0, 2, 3, 1))
```
First reshape and then transpose? We only need one transpose here.
```python
# channel, channel_multiplier, kh, kw -> out_channel_chunk, kh, kw, out_channel_block
channel, channel_multiplier, kh, kw = kernel.shape
out_channel = channel * channel_multiplier
kernel = np.transpose(kernel, (2, 3, 0, 1))
```
Same here.
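A sketch of the suggested reshape-then-transpose variant for the data transform (the kernel transform would be analogous); this shows how the reviewer's suggestion could look, not necessarily the final code:

```python
import numpy as np

def _transform_data(data, bn):
    # NCHW -> NCHW[x]c with a single transpose, as suggested above
    batch_size, channel, height, width = data.shape
    data = data.reshape(batch_size, channel // bn, bn, height, width)
    return data.transpose(0, 1, 3, 4, 2)
```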
@merrymercy @anijain2305 @kevinthesun Please review again.
Can you add depthwise convolution support to the tune_nnvm_x86 tutorial?
@merrymercy Tutorial updated.
@tqchen Could you help merge if it looks good?
Thanks @yizhi @anijain2305 @merrymercy @kevinthesun @FrozenGene! This is now merged.
Improves mobilenet1.0 from ~2.2 ms (autotvm-tuned ARM CPU schedule) to ~1.5 ms on EC2 c5.9xlarge (an 18-physical-core Intel Skylake CPU).
Reviewers @merrymercy @kevinthesun please review.