
Conversation

@vinx13 (Member) commented Sep 19, 2018

This PR adds an int8 conv2d using the NCHW[x]c layout, where x is a multiple of 4. I obtained the best performance with x = 4.
The template accepts either NCHW-layout input or pre-packed data (NCHW4c) and kernel (OIHW4o4i).

Inference time (ms) of different models (before the classifier) on an NVIDIA 1080, batch size = 1:

| model | TVM (int8) | TVM (fp32) | TensorRT (int8) | mxnet + cuDNN (fp32) |
| --- | --- | --- | --- | --- |
| vgg-16 | 1.64 | 4.13 | 1.44 | 5.14 |
| resnet-50 | 1.46 | 3.95 | 1.39 | 5.38 |
| inception_v3 | 2.55 | 7.98 | 2.30 | 12.51 |
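For context, here is a minimal sketch (not the PR's exact code) of packing NCHW int8 data into the NCHW4c layout described above; the shapes and names are illustrative assumptions, written against the same tvm.compute API that the template uses.

```python
import tvm

# Illustrative shapes; the real template derives these from the input tensor.
ic_block_factor = 4
batch, channels, height, width = 1, 64, 56, 56

data = tvm.placeholder((batch, channels, height, width), dtype="int8", name="data")

# Pack NCHW -> NCHW4c: split the channel axis into (channels // 4) blocks of 4.
packed_data = tvm.compute(
    (batch, channels // ic_block_factor, height, width, ic_block_factor),
    lambda n, c, h, w, vc: data[n, c * ic_block_factor + vc, h, w],
    name="packed_data")
```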

cc @merrymercy @tqchen

@vinx13 force-pushed the topi/conv2d_int8 branch 2 times, most recently from 7d0046f to f07e0b4 on September 19, 2018 at 11:18
@vinx13 changed the title from "[TOPI] Add conv2d int8 template" to "[WIP] [TOPI] Add conv2d int8 template" on Sep 19, 2018
@tqchen (Member) commented Sep 19, 2018

@masahi @nishi-t please help review this as well

@tqchen self-assigned this on Sep 19, 2018
assert channels % ic_block_factor == 0, \
"Number of input channels should be multiple of {}".format(
ic_block_factor)
packed_data = tvm.compute((batch, channels/ic_block_factor, height, width, ic_block_factor),
@masahi (Member) Sep 19, 2018: channels // (i.e. use floor division so the packed shape stays an integer)

"Number of output channels should be multiple of {}".format(
oc_block_factor)
packed_kernel = tvm.compute(
(out_channels / oc_block_factor, in_channels / ic_block_factor, kernel_h, kernel_w,
(Member): // (same as above: use floor division for these channel divisions too)

return s


@conv2d_NCHWc_int8_prepacked.register(["cuda", "gpu"])
(Member): "gpu" should be removed, since you are using CUDA-specific intrinsics.

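A sketch of the suggested fix; the generic function and its signature below are stand-ins (assumptions), and only the registration pattern is the point.

```python
import tvm

# Stand-in (assumed) for the PR's dispatchable template; the real signature
# may differ.
@tvm.target.generic_func
def conv2d_NCHWc_int8_prepacked(data, kernel, stride, padding, layout, out_dtype):
    raise NotImplementedError("no generic implementation")

# Register for "cuda" only (dropping "gpu"), since the schedule relies on
# CUDA-specific intrinsics.
@conv2d_NCHWc_int8_prepacked.register(["cuda"])
def _conv2d_NCHWc_int8_prepacked_cuda(data, kernel, stride, padding, layout, out_dtype):
    raise NotImplementedError("body elided in this sketch")
```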
ic_block_factor)
packed_data = tvm.compute((batch, channels // ic_block_factor, height, width,
ic_block_factor),
lambda n, c, h, w, vc: kernel[n,
@FrozenGene (Member) Sep 21, 2018: Should this be data, not kernel?

packed_kernel.shape)

stride_h, stride_w = (stride, stride) if isinstance(
stride, int) else stride
(Contributor):

I'd suggest

    if isinstance(stride, int):
        stride_h = stride_w = stride
    else:
        stride_h, stride_w = stride

the same as this

new_attrs['layout'] = 'NCHW4c'
new_attrs['out_layout'] = 'NCHW4c'
new_attrs['kernel_layout'] = 'OIHW4o4i'
return sym.contrib.conv2d_NCHWc_int8_prepacked(*copy_inputs, **new_attrs)
@merrymercy (Member) Sep 21, 2018:

Can we just use sym.conv2d? I think the new contrib symbol is redundant.

We needed a new symbol for winograd because it has to use a different infer_shape, but for conv2d the infer_shape in nnvm already supports these layouts.
Both your template and nnvm.sym.conv2d can take the arguments produced by alter_op_layout, so we can just return sym.conv2d with the new arguments.
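A hedged sketch of what the suggestion amounts to at the end of the alter-layout function; copy_inputs and new_attrs come from the snippet quoted above, and the surrounding function is assumed.

```python
# Keep the packed layouts in the attributes, but return the regular conv2d
# symbol instead of the extra contrib symbol.
new_attrs['layout'] = 'NCHW4c'
new_attrs['out_layout'] = 'NCHW4c'
new_attrs['kernel_layout'] = 'OIHW4o4i'
return sym.conv2d(*copy_inputs, **new_attrs)
```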

(Member): conv2d currently only accepts NCHW / NHWC. If we pass NCHW4c, we currently hit a layout assert error.

@merrymercy (Member) Sep 21, 2018:

(Member): And compute_conv2d also has this assert.

@vinx13 (Member, Author): I remember I ran into some problems when using sym.conv2d, but I will check again whether this change works.

(Member): @vinx13 Yes, the problem is the one @FrozenGene just pointed out. I think we can fix it.

@vinx13 (Member, Author): @merrymercy another problem is that we can't call a topi template directly with a packed layout; instead, I registered a workload function that creates the original workload (in NCHW layout) from the packed input. Adding this works:
https://github.com/dmlc/tvm/blob/ecad8bf05e80804218f8ef02bbc5c4337d247783/nnvm/python/nnvm/top/nn.py#L108


s[output].bind(bf, tvm.thread_axis("blockIdx.y"))
s[output].bind(bx, tvm.thread_axis("blockIdx.x"))
s[output].bind(vf, tvm.thread_axis("vthread"))
(Member): We bind n but don't bind or fuse by. Can you explain why you chose this strategy for the batch dimension?

@vinx13 (Member, Author): Binding n can be very effective when the batch size is large. In some cases I tested, fusing by was slower, but I guess it should be tunable.
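For illustration, a minimal sketch of the binding strategy being discussed, assuming the schedule objects from the snippet above; the actual thread axis chosen for n in the PR may differ.

```python
# Give the batch axis n its own grid dimension instead of fusing it into by;
# per the discussion above, fusing was measured to be slower in some cases.
s[output].bind(n, tvm.thread_axis("blockIdx.z"))
```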

@vinx13 force-pushed the topi/conv2d_int8 branch 2 times, most recently from a0810d2 to 240a39f on September 25, 2018 at 08:09
if groups == 1 and layout == "NCHW":
return topi.generic.schedule_conv2d_nchw(outs)
elif groups == 1 and layout == "NCHW4c":
return topi.generic.schedule_conv2d_NCHWc_int8_prepacked(outs)
@vinx13 (Member, Author): @merrymercy I can't check the dtype of the input here. Could you comment here?

(Member): Can we get the dtype from outs[0].dtype?

@vinx13 (Member, Author): @merrymercy outs[0].dtype is a user input, which can be arbitrary.
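One hypothetical way around this (not code from the PR) is to walk the producer chain of outs[0] and look at the dtypes of the tensors feeding it, rather than at the user-chosen output dtype.

```python
def _has_int8_input(outs):
    """Return True if an int8/uint8 tensor feeds outs[0]. Illustrative only."""
    stack = [outs[0].op]
    visited = set()
    while stack:
        op = stack.pop()
        if op in visited:
            continue
        visited.add(op)
        for tensor in op.input_tensors:
            if tensor.dtype in ("int8", "uint8"):
                return True
            stack.append(tensor.op)
    return False
```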

@vinx13 (Member, Author) commented Sep 25, 2018

Currently, we cannot run a conv2d+bn model directly because of an error with int8 bn weights.
In sym.batch_norm(data=int8_conv), an error occurs when it tries to compute the sqrt of an int in fuse___add_scalar___sqrt___rdiv_scalar___elemwise_mul, because the bn params are int8.
Some hack may be needed to compute the bn params in fp32.
@tqchen could you help?

@tqchen (Member) commented Sep 27, 2018

We can just insert a type cast before the bn to cast it to fp32, then cast things back to int8 later.
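A sketch of this workaround in nnvm symbol form, with int8_conv standing in for the int8 conv2d output mentioned above (the batch_norm parameters are left at their defaults here).

```python
import nnvm.symbol as sym

# Stand-in for the int8 conv2d output from the thread.
int8_conv = sym.Variable("int8_conv")

# Cast to fp32 so batch_norm's sqrt runs in floating point, then cast back.
fp32_conv = sym.cast(int8_conv, dtype="float32")
bn = sym.batch_norm(data=fp32_conv)
int8_out = sym.cast(bn, dtype="int8")
```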

@tqchen (Member) commented Sep 29, 2018

@vinx13 changed the title from "[WIP] [TOPI] Add conv2d int8 template" to "[TOPI] Add conv2d int8 template" on Sep 30, 2018
@masahi (Member) left a comment:

Great!

@FrozenGene (Member) commented Sep 30, 2018

Just one question: could we avoid fixing NCHW(x)c to NCHW4c explicitly and instead let AutoTVM decide how to split the input channel? Then we would handle NCHW(x)c in general rather than NCHW4c specially. For example, we have _contrib_conv2d_NCHWc to handle it.

@vinx13 (Member, Author) commented Sep 30, 2018

@FrozenGene yes, we could let AutoTVM tune the input channel split factor, but that would need some changes in alter_conv2d_layout, and some extra layout transforms would be needed. I think this will be a good idea once we have the graph tuner. Actually, I have tried x = 4/8/16 on all resnet layers and found that the performance ordering is 4 > 8 > 16.

@FrozenGene (Member): @vinx13 Yes. Your data layout is one special case of NCHW(x)c, with x set to 4 based on the ResNet benchmarks. However, if we use it on other models, how do we know 4 is the best choice? That is why I raised the question. On x86, AutoTVM takes a different route: https://github.com/dmlc/tvm/pull/1772/files. I prefer that approach, since it does not set the number manually.
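For comparison, a hypothetical sketch of what letting AutoTVM choose the block factor could look like inside the template, assuming an autotvm configuration object cfg; this is not what the PR implements.

```python
# Let the tuner pick the input-channel block factor instead of hard-coding 4.
cfg.define_knob("ic_block_factor", [4, 8, 16])
ic_block_factor = cfg["ic_block_factor"].val
```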

@tqchen (Member) commented Oct 2, 2018

Thanks @merrymercy @FrozenGene @nishi-t for the review and @vinx13 for the contribution; this is merged. Let us follow up to see if we can generalize the layout changes.

@tqchen merged commit 06f91dd into apache:master on Oct 2, 2018
FrozenGene pushed a commit to FrozenGene/tvm that referenced this pull request Dec 27, 2018