
Conversation

@vinx13 (Member) commented Sep 19, 2018

This PR adds an int8 conv2d using the NCHW[x]c layout, where x is a multiple of 4. I obtained the best performance with x = 4.
The template accepts either NCHW-layout input or pre-packed data (NCHW4c) and kernel (OIHW4o4i).

Inference time (ms) of different models (before the classifier) on an NVIDIA 1080, batch size = 1:

| model | TVM (int8) | TVM (fp32) | TensorRT (int8) | mxnet + cuDNN (fp32) |
| --- | --- | --- | --- | --- |
| vgg-16 | 1.64 | 4.13 | 1.44 | 5.14 |
| resnet-50 | 1.46 | 3.95 | 1.39 | 5.38 |
| inception_v3 | 2.55 | 7.98 | 2.30 | 12.51 |
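For context, here is a minimal sketch (not the PR's exact code) of packing NCHW int8 data into the NCHW4c layout described above; the shapes and names are illustrative assumptions, written against the same tvm.compute API that the template uses.

```python
import tvm

# Illustrative shapes; the real template derives these from the input tensor.
ic_block_factor = 4
batch, channels, height, width = 1, 64, 56, 56

data = tvm.placeholder((batch, channels, height, width), dtype="int8", name="data")

# Pack NCHW -> NCHW4c: split the channel axis into (channels // 4) blocks of 4.
packed_data = tvm.compute(
    (batch, channels // ic_block_factor, height, width, ic_block_factor),
    lambda n, c, h, w, vc: data[n, c * ic_block_factor + vc, h, w],
    name="packed_data")
```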

cc @merrymercy @tqchen

@vinx13 force-pushed the topi/conv2d_int8 branch 2 times, most recently from 7d0046f to f07e0b4 on September 19, 2018 at 11:18
@vinx13 changed the title from "[TOPI] Add conv2d int8 template" to "[WIP] [TOPI] Add conv2d int8 template" on Sep 19, 2018
@tqchen (Member) commented Sep 19, 2018

@masahi @nishi-t please help review this as well

@tqchen self-assigned this on Sep 19, 2018
assert channels % ic_block_factor == 0, \
"Number of input channels should be multiple of {}".format(
ic_block_factor)
packed_data = tvm.compute((batch, channels/ic_block_factor, height, width, ic_block_factor),
@masahi (Member) Sep 19, 2018: channels // (i.e. use floor division so the packed shape stays an integer)

"Number of output channels should be multiple of {}".format(
oc_block_factor)
packed_kernel = tvm.compute(
(out_channels / oc_block_factor, in_channels / ic_block_factor, kernel_h, kernel_w,
(Member): // (same as above: use floor division for these channel divisions too)

return s


@conv2d_NCHWc_int8_prepacked.register(["cuda", "gpu"])
(Member): "gpu" should be removed, since you are using CUDA-specific intrinsics.

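A sketch of the suggested fix; the generic function and its signature below are stand-ins (assumptions), and only the registration pattern is the point.

```python
import tvm

# Stand-in (assumed) for the PR's dispatchable template; the real signature
# may differ.
@tvm.target.generic_func
def conv2d_NCHWc_int8_prepacked(data, kernel, stride, padding, layout, out_dtype):
    raise NotImplementedError("no generic implementation")

# Register for "cuda" only (dropping "gpu"), since the schedule relies on
# CUDA-specific intrinsics.
@conv2d_NCHWc_int8_prepacked.register(["cuda"])
def _conv2d_NCHWc_int8_prepacked_cuda(data, kernel, stride, padding, layout, out_dtype):
    raise NotImplementedError("body elided in this sketch")
```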
ic_block_factor)
packed_data = tvm.compute((batch, channels // ic_block_factor, height, width,
ic_block_factor),
lambda n, c, h, w, vc: kernel[n,
@FrozenGene (Member) Sep 21, 2018: Should this be data, not kernel?

packed_kernel.shape)

stride_h, stride_w = (stride, stride) if isinstance(
stride, int) else stride
(Contributor):

I'd suggest

    if isinstance(stride, int):
        stride_h = stride_w = stride
    else:
        stride_h, stride_w = stride

the same as this

new_attrs['layout'] = 'NCHW4c'
new_attrs['out_layout'] = 'NCHW4c'
new_attrs['kernel_layout'] = 'OIHW4o4i'
return sym.contrib.conv2d_NCHWc_int8_prepacked(*copy_inputs, **new_attrs)
@merrymercy (Member) Sep 21, 2018:

Can we just use sym.conv2d? I think the new contrib symbol is redundant.

We needed a new symbol for winograd because it has to use a different infer_shape, but for conv2d the infer_shape in nnvm already supports these layouts.
Both your template and nnvm.sym.conv2d can take the arguments produced by alter_op_layout, so we can just return sym.conv2d with the new arguments.
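A hedged sketch of what the suggestion amounts to at the end of the alter-layout function; copy_inputs and new_attrs come from the snippet quoted above, and the surrounding function is assumed.

```python
# Keep the packed layouts in the attributes, but return the regular conv2d
# symbol instead of the extra contrib symbol.
new_attrs['layout'] = 'NCHW4c'
new_attrs['out_layout'] = 'NCHW4c'
new_attrs['kernel_layout'] = 'OIHW4o4i'
return sym.conv2d(*copy_inputs, **new_attrs)
```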

(Member): conv2d currently only accepts NCHW / NHWC. If we pass NCHW4c, we currently hit a layout assert error.

@merrymercy (Member) Sep 21, 2018:

(Member): And compute_conv2d also has this assert.

@vinx13 (Member, Author): I remember I ran into some problems when using sym.conv2d, but I will check again whether this change works.

(Member): @vinx13 Yes, the problem is the one @FrozenGene just pointed out. I think we can fix it.

@vinx13 (Member, Author): @merrymercy another problem is that we can't call a topi template directly with a packed layout; instead, I registered a workload function that creates the original workload (in NCHW layout) from the packed input. Adding this works:
https://github.com/dmlc/tvm/blob/ecad8bf05e80804218f8ef02bbc5c4337d247783/nnvm/python/nnvm/top/nn.py#L108


s[output].bind(bf, tvm.thread_axis("blockIdx.y"))
s[output].bind(bx, tvm.thread_axis("blockIdx.x"))
s[output].bind(vf, tvm.thread_axis("vthread"))
(Member): We bind n but don't bind or fuse by. Can you explain why you chose this strategy for the batch dimension?

@vinx13 (Member, Author): Binding n can be very effective when the batch size is large. In some cases I tested, fusing by was slower, but I guess it should be tunable.
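For illustration, a minimal sketch of the binding strategy being discussed, assuming the schedule objects from the snippet above; the actual thread axis chosen for n in the PR may differ.

```python
# Give the batch axis n its own grid dimension instead of fusing it into by;
# per the discussion above, fusing was measured to be slower in some cases.
s[output].bind(n, tvm.thread_axis("blockIdx.z"))
```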

@vinx13 force-pushed the topi/conv2d_int8 branch 2 times, most recently from a0810d2 to 240a39f on September 25, 2018 at 08:09
if groups == 1 and layout == "NCHW":
return topi.generic.schedule_conv2d_nchw(outs)
elif groups == 1 and layout == "NCHW4c":
return topi.generic.schedule_conv2d_NCHWc_int8_prepacked(outs)
@vinx13 (Member, Author): @merrymercy I can't check the dtype of the input here. Could you comment here?

(Member): Can we get the dtype from outs[0].dtype?

@vinx13 (Member, Author): @merrymercy outs[0].dtype is a user input, which can be arbitrary.
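One hypothetical way around this (not code from the PR) is to walk the producer chain of outs[0] and look at the dtypes of the tensors feeding it, rather than at the user-chosen output dtype.

```python
def _has_int8_input(outs):
    """Return True if an int8/uint8 tensor feeds outs[0]. Illustrative only."""
    stack = [outs[0].op]
    visited = set()
    while stack:
        op = stack.pop()
        if op in visited:
            continue
        visited.add(op)
        for tensor in op.input_tensors:
            if tensor.dtype in ("int8", "uint8"):
                return True
            stack.append(tensor.op)
    return False
```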

@vinx13 (Member, Author) commented Sep 25, 2018

Currently, we cannot run a conv2d+bn model directly because of an error with int8 bn weights.
In sym.batch_norm(data=int8_conv), an error occurs when it tries to compute the sqrt of an int in fuse___add_scalar___sqrt___rdiv_scalar___elemwise_mul, because the bn params are int8.
Some hack may be needed to compute the bn params in fp32.
@tqchen could you help?

@tqchen (Member) commented Sep 27, 2018

We can just insert a type cast before the bn to cast it to fp32, then cast things back to int8 later.
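A sketch of this workaround in nnvm symbol form, with int8_conv standing in for the int8 conv2d output mentioned above (the batch_norm parameters are left at their defaults here).

```python
import nnvm.symbol as sym

# Stand-in for the int8 conv2d output from the thread.
int8_conv = sym.Variable("int8_conv")

# Cast to fp32 so batch_norm's sqrt runs in floating point, then cast back.
fp32_conv = sym.cast(int8_conv, dtype="float32")
bn = sym.batch_norm(data=fp32_conv)
int8_out = sym.cast(bn, dtype="int8")
```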

@tqchen (Member) commented Sep 29, 2018

@vinx13 changed the title from "[WIP] [TOPI] Add conv2d int8 template" to "[TOPI] Add conv2d int8 template" on Sep 30, 2018
@masahi (Member) left a comment:

Great!

@FrozenGene (Member) commented Sep 30, 2018

Just one question: could we avoid fixing NCHW(x)c to NCHW4c explicitly and instead let AutoTVM decide how to split the input channel? Then we would handle NCHW(x)c in general rather than NCHW4c specially. For example, we have _contrib_conv2d_NCHWc to handle it.

@vinx13 (Member, Author) commented Sep 30, 2018

@FrozenGene yes, we could let AutoTVM tune the input channel split factor, but that would need some changes in alter_conv2d_layout, and some extra layout transforms would be needed. I think this will be a good idea once we have the graph tuner. Actually, I have tried x = 4/8/16 on all resnet layers and found that the performance ordering is 4 > 8 > 16.

@FrozenGene (Member): @vinx13 Yes. Your data layout is one special case of NCHW(x)c, with x set to 4 based on the ResNet benchmarks. However, if we use it on other models, how do we know 4 is the best choice? That is why I raised the question. On x86, AutoTVM takes a different route: https://github.com/dmlc/tvm/pull/1772/files. I prefer that approach, since it does not set the number manually.
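For comparison, a hypothetical sketch of what letting AutoTVM choose the block factor could look like inside the template, assuming an autotvm configuration object cfg; this is not what the PR implements.

```python
# Let the tuner pick the input-channel block factor instead of hard-coding 4.
cfg.define_knob("ic_block_factor", [4, 8, 16])
ic_block_factor = cfg["ic_block_factor"].val
```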

@tqchen (Member) commented Oct 2, 2018

Thanks @merrymercy @FrozenGene @nishi-t for the review and @vinx13 for the contribution; this is merged. Let us follow up to see if we can generalize the layout changes.

@tqchen merged commit 06f91dd into apache:master on Oct 2, 2018
FrozenGene pushed a commit to FrozenGene/tvm that referenced this pull request Dec 27, 2018