Conversation

@yzhliu (Member) commented Oct 31, 2018

Improves mobilenet1.0 from ~2.2 ms (with the autotvm-tuned ARM CPU schedule) to ~1.5 ms on an EC2 c5.9xlarge (18 physical-core Intel Skylake CPU).

Reviewers @merrymercy, @kevinthesun: please review.

@tqchen (Member) commented Oct 31, 2018

cc @FrozenGene. What is the relation of this PR to #2028?

@yzhliu (Member, Author) commented Oct 31, 2018

@tqchen this is for x86; #2028 is for ARM.

I'm a little curious about #2028. I tested the ARM schedule on an x86 CPU (the same c5.9xlarge): @FrozenGene's branch got ~21.4 ms (I added registers for the Intel CPU manually, since #2028 removes them), while the previous ARM schedule (in current TVM) got ~2.2 ms. I know it is not fair to benchmark an ARM schedule on an x86 CPU, and it is not tuned, but I believe the current ARM depthwise schedule is also not tuned on x86 CPU, so such a large performance drop looks somewhat weird. @merrymercy @FrozenGene could you double-check?

I can also help test that on ARM once I get a device.

Another comment: on x86 the most efficient layout for normal conv2d turns out to be NCHW[x]c, so by keeping depthwise conv in the same layout we can get rid of layout transformations between layers. Thus I believe that on x86 the solution in this PR will be better than the NCHW ARM schedule.
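To illustrate the point, a hedged sketch with assumed shapes (block size x = 16 and a mobilenet-like layer are assumptions for illustration):

# with both normal conv2d and depthwise conv2d consuming and producing
# NCHW[x]c, consecutive layers chain without any layout transform
data_nchw  = (1, 32, 112, 112)      # framework layout, NCHW
data_nchwc = (1, 2, 112, 112, 16)   # NCHW16c: 32 channels = 2 chunks of 16
# the depthwise output keeps shape (1, 2, 112, 112, 16), which the next
# NCHW16c conv2d consumes directly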

@merrymercy (Member) commented:

Did you tune for #2028? #2028 didn't upload any tuned configs.

@merrymercy (Member) commented:

We should add dilation arguments as in #1970; then we won't have to convert logs when we start to optimize for dilation.

@FrozenGene (Member) commented Nov 1, 2018

> @tqchen this is for x86; #2028 is for ARM. I'm a little curious about #2028 […] such a large performance drop looks somewhat weird. @merrymercy @FrozenGene could you double-check?

As @merrymercy said, we haven't uploaded any tuned config logs for #2028. I want to know how you compared: did you tune both schedules on the x86 CPU before running, or just run without tuning? If you tuned, note that, as #2028 says, you should set the XGBTuner constructor's feature-type argument to 'knob', i.e. XGBTuner(tsk, loss_type='rank', feature_type='knob').
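For reference, a minimal tuning-loop sketch of that suggestion; the task list `tasks`, the trial count, and the log filename are assumptions from the usual autotvm setup, and only the feature_type='knob' argument comes from #2028:

from tvm import autotvm
from tvm.autotvm.tuner import XGBTuner

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10))

for tsk in tasks:  # tasks extracted from the network beforehand
    # 'knob' features are what #2028 recommends for its template
    tuner = XGBTuner(tsk, loss_type='rank', feature_type='knob')
    tuner.tune(n_trial=1000,
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file('tune.log')])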

@yzhliu (Member, Author) commented Nov 1, 2018

@FrozenGene I ran without tuning. My question is rather about the default/fallback schedule; the previous one looks far better. When running the previous one, I also got warnings like:

WARNING:autotvm:Cannot find config for target=llvm -mcpu=skylake-avx512, workload=('depthwise_conv2d_nchw', (1, 32, 112, 112, 'float32'), (32, 1, 3, 3, 'float32'), (1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
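For context: this warning disappears once tuned configs are supplied at compile time. A minimal sketch, assuming an NNVM-era build where `net` and `params` come from the frontend and the log file name 'tune.log' is hypothetical:

import nnvm.compiler
from tvm import autotvm

# pick the best config found during tuning for each matching workload
with autotvm.apply_history_best('tune.log'):
    graph, lib, params = nnvm.compiler.build(
        net, target='llvm -mcpu=skylake-avx512',
        shape={'data': (1, 3, 224, 224)}, params=params)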

Let's continue this discussion in #2028

@merrymercy (Member) commented:

@yzhliu Your error should not happen. Maybe it is due to bug #2047.

@yzhliu (Member, Author) commented Nov 1, 2018

@merrymercy Thanks, I'll check out and verify.

padding = attrs.get_int_tuple("padding")
strides = attrs.get_int_tuple("strides")
dilation = attrs.get_int_tuple("dilation")
channels = attrs.get_int("channels")
Contributor:

Just out of curiosity: is `channels` here "out channels"? If yes, it would be better to name it appropriately for clarity.

Member Author:

Yes, will do.

from ..util import simplify

# workload description of depthwise-conv2d
Workload = namedtuple('Workload',
Contributor:

Do we want dilation in here? Or is that for a separate PR?

Member Author:

Workload here is for getting the default schedule. Since dilation so far does not impact how we calculate configs, I'd rather keep it simple for now (see the sketch below).
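For illustration, a hedged sketch of such a workload description; the field names are illustrative, not necessarily the exact ones in the PR:

from collections import namedtuple

# captures only the fields that influence default-config selection;
# dilation is deliberately left out for now
Workload = namedtuple('Workload',
                      ['in_dtype', 'out_dtype', 'height', 'width',
                       'channel', 'kernel_h', 'kernel_w',
                       'pad_h', 'pad_w', 'stride_h', 'stride_w'])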

Contributor:

Makes sense. This is resolved from my side.

data_pad = data

# depthconv stage
di = tvm.reduce_axis((0, filter_height), name='di')
Contributor:

Would kh, kw be better names for di, dj? Just trying to be consistent with other Intel CPU schedules.

dj = tvm.reduce_axis((0, filter_width), name='dj')
Output = tvm.compute(
(batch, out_channel_chunk, out_height, out_width, out_channel_block),
lambda b, oco, i, j, oci: tvm.sum(
Contributor:

Same as above with oh, ow for i, j?
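A hedged sketch of how the excerpt would read with the suggested names; data_pad, Filter, stride_h/stride_w, and out_dtype follow the surrounding code and are partly assumptions, and channel_multiplier == 1 is assumed:

kh = tvm.reduce_axis((0, filter_height), name='kh')
kw = tvm.reduce_axis((0, filter_width), name='kw')
Output = tvm.compute(
    (batch, out_channel_chunk, out_height, out_width, out_channel_block),
    lambda b, oco, oh, ow, oci: tvm.sum(
        # each output channel block reads its own input channel block
        (data_pad[b, oco, oh*stride_h + kh, ow*stride_w + kw, oci] *
         Filter[oco, kh, kw, oci]).astype(out_dtype),
        axis=[kh, kw]),
    name='DepthwiseConv2d')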

s[C].vectorize(ic_block)
parallel_axis = s[C].fuse(ic_chunk, oh)
s[C].parallel(parallel_axis)
s[C].unroll(ow_block)
Contributor:

Out of curiosity: do we need this if we are unrolling the s[CC] block later?

Member Author:

No, we don't. I'll also remove the vectorize above.

_, ic_chunk, oh, ow, ic_block = s[C].op.axis
ow_chunk, ow_block = s[C].split(ow, factor=tile_ow)
s[C].reorder(ic_chunk, oh, ow_chunk, ow_block, ic_block)
s[C].vectorize(ic_block)
Contributor:

Out of curiosity: do we need this if we are vectorizing ic_block for s[CC] later?

dtype=DepthwiseConv2d.dtype), ctx)
relu_tvm = tvm.nd.array(np.zeros(shape=get_const_tuple(Relu.shape), dtype=Relu.dtype), ctx)
# launch kernel 1 (depthwise_conv2d)
timer_1 = f1.time_evaluator(f1.entry_name, ctx, number=1)
Contributor:

If 'number' here means how many times we run the experiment, then we should use a higher number.

Member Author:

Well, since here we only check functionality, I'd rather remove the time_evaluator and just call f1(...) instead.
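A small sketch contrasting the two options; the tensor argument names are partly assumed from the test excerpt above:

# correctness check only: one direct call is enough
f1(input_tvm, filter_tvm, depthwise_conv2d_tvm)

# performance measurement: average over many runs to reduce noise
timer = f1.time_evaluator(f1.entry_name, ctx, number=100, repeat=3)
mean_ms = timer(input_tvm, filter_tvm, depthwise_conv2d_tvm).mean * 1e3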

def _transform_data(data, bn):
# NCHW -> NCHW[x]c
batch_size, channel, height, width = data.shape
data = np.transpose(data, (0, 2, 3, 1))
Contributor:

First reshape and then transpose? We only need one transpose here (see the sketch below).
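A hedged sketch of the suggestion, with `data` assumed to be a NumPy array in NCHW and `bn` the channel block size x in NCHW[x]c:

def _transform_data(data, bn):
    # NCHW -> NCHW[x]c: split the channel axis with one reshape, then
    # move the block axis innermost with a single transpose
    batch, channel, height, width = data.shape
    data = data.reshape(batch, channel // bn, bn, height, width)
    return data.transpose(0, 1, 3, 4, 2)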

# channel, channel_multiplier, kh, kw -> out_channel_chunk, kh, kw, out_channel_block
channel, channel_multiplier, kh, kw = kernel.shape
out_channel = channel * channel_multiplier
kernel = np.transpose(kernel, (2, 3, 0, 1))
Contributor:

Same here: one reshape plus one transpose also works for the kernel.

@yzhliu (Member, Author) commented Nov 4, 2018

@merrymercy @anijain2305 @kevinthesun Please review again.

@merrymercy (Member) commented:

Can you add depthwise convolution support to the tune_nnvm_x86 tutorial?

@yzhliu (Member, Author) commented Nov 12, 2018

@merrymercy tutorial updated.

@yzhliu (Member, Author) commented Nov 13, 2018

@tqchen Could you help merge if it looks good?

@tqchen merged commit 5712ea6 into apache:master on Nov 13, 2018
@tqchen (Member) commented Nov 13, 2018

Thanks, @yzhliu @anijain2305 @merrymercy @kevinthesun @FrozenGene! This is now merged.
