[TOPI] Add conv2d int8 template #1735
Conversation
Force-pushed from 7d0046f to f07e0b4.
topi/python/topi/cuda/conv2d_int8.py
Outdated
    assert channels % ic_block_factor == 0, \
        "Number of input channels should be multiple of {}".format(
            ic_block_factor)
    packed_data = tvm.compute((batch, channels/ic_block_factor, height, width, ic_block_factor),
channels // — use floor division here.
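A quick, self-contained illustration of why the floor division matters (the values are examples, not taken from the PR):

    # Under Python 3, "/" is true division, so the extent becomes a float;
    # "//" keeps it an integer, which is what a shape extent needs.
    channels, ic_block_factor = 64, 4
    print(channels / ic_block_factor)    # 16.0  (float)
    print(channels // ic_block_factor)   # 16    (int)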
topi/python/topi/cuda/conv2d_int8.py
Outdated
| "Number of output channels should be multiple of {}".format( | ||
| oc_block_factor) | ||
| packed_kernel = tvm.compute( | ||
| (out_channels / oc_block_factor, in_channels / ic_block_factor, kernel_h, kernel_w, |
Use // (floor division) here as well.
topi/python/topi/cuda/conv2d_int8.py
Outdated
        return s

    @conv2d_NCHWc_int8_prepacked.register(["cuda", "gpu"])
"gpu" should be removed, since you are using cuda specific intrinsic
Force-pushed from dde3d65 to 6e05403.
Force-pushed from 6e05403 to fb5b883.
topi/python/topi/cuda/conv2d_int8.py
Outdated
        ic_block_factor)
    packed_data = tvm.compute((batch, channels // ic_block_factor, height, width,
                               ic_block_factor),
                              lambda n, c, h, w, vc: kernel[n,
Should this be data, not kernel?
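For reference, a self-contained sketch of the packing compute with both review points applied (floor division, and data instead of kernel). The index expression inside the lambda is filled in for illustration, since the diff above is truncated; it follows the standard NCHW[x]c packing rather than the PR's exact code.

    import tvm

    batch, channels, height, width, ic_block_factor = 1, 64, 56, 56, 4
    data = tvm.placeholder((batch, channels, height, width), dtype="int8", name="data")
    packed_data = tvm.compute(
        (batch, channels // ic_block_factor, height, width, ic_block_factor),
        lambda n, c, h, w, vc: data[n, c * ic_block_factor + vc, h, w],
        name="packed_data")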
topi/python/topi/cuda/conv2d_int8.py
Outdated
        packed_kernel.shape)

    stride_h, stride_w = (stride, stride) if isinstance(
        stride, int) else stride
I'd suggest:

    if isinstance(stride, int):
        stride_h = stride_w = stride
    else:
        stride_h, stride_w = stride

the same as this.
Force-pushed from 8922528 to 248c425.
    new_attrs['layout'] = 'NCHW4c'
    new_attrs['out_layout'] = 'NCHW4c'
    new_attrs['kernel_layout'] = 'OIHW4o4i'
    return sym.contrib.conv2d_NCHWc_int8_prepacked(*copy_inputs, **new_attrs)
Can we just use sym.conv2d?
I think the new contrib symbol is redundant.
We need a new symbol for winograd because we have to use a different infer_shape, but for conv2d, the infer_shape in nnvm already supports these layouts.
Both your template and nnvm.sym.conv2d can take the arguments produced by alter_op_layout, so we can just return sym.conv2d with the new arguments.
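A rough sketch of what that return would look like, reusing the attribute assignments from the diff above. The wrapper function and its signature are hypothetical, not the real alter_op_layout hook:

    import nnvm.symbol as sym

    def _return_plain_conv2d(copy_inputs, new_attrs):
        # Keep the layout rewrite, but return the existing conv2d symbol
        # instead of a new contrib op.
        new_attrs['layout'] = 'NCHW4c'
        new_attrs['out_layout'] = 'NCHW4c'
        new_attrs['kernel_layout'] = 'OIHW4o4i'
        return sym.conv2d(*copy_inputs, **new_attrs)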
conv2d currently only accepts NCHW / NHWC. If we pass NCHW4c, we will currently hit a layout assert error.
Do you mean this?
https://github.com/dmlc/tvm/blob/ec0d497c69ca307fb998c3d81c0a7e48bb5f18d6/nnvm/python/nnvm/top/nn.py#L140-L146
It's better to change this than to define a new op.
And compute_conv2d also has this assert.
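A rough sketch of the kind of relaxation being suggested for those checks; the helper name is made up and this is not a verbatim patch of the linked file:

    def _check_conv2d_layout(layout):
        # Let packed layouts such as NCHW4c through, instead of asserting
        # layout in ("NCHW", "NHWC") only.
        assert layout.startswith("NCHW") or layout == "NHWC", \
            "unsupported conv2d layout: %s" % layout

    _check_conv2d_layout("NCHW4c")  # passes with the relaxed check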
I remember I ran into some problems when using sym.conv2d, but I will check again whether changing this works.
@vinx13 Yes, the problem is the one @FrozenGene just pointed out. I think we can fix it.
@merrymercy another problem is that we can't call a TOPI template directly with a packed layout; instead, I registered a workload function to create the original workload (in NCHW layout) from the packed input.
Adding this works.
https://github.com/dmlc/tvm/blob/ecad8bf05e80804218f8ef02bbc5c4337d247783/nnvm/python/nnvm/top/nn.py#L108
    s[output].bind(bf, tvm.thread_axis("blockIdx.y"))
    s[output].bind(bx, tvm.thread_axis("blockIdx.x"))
    s[output].bind(vf, tvm.thread_axis("vthread"))
We bind n but don't bind or fuse by. Can you explain why you chose this strategy for the batch axis?
Binding n can be very effective when the batch size is large. In some cases I tested, fusing by was slower, but I guess it should be tunable.
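A standalone toy schedule showing the batch-binding idea being discussed; this is not the template's actual schedule, and the shapes and axis names are illustrative:

    import tvm

    A = tvm.placeholder((8, 64, 14, 14), name="A")
    B = tvm.compute((8, 64, 14, 14), lambda n, c, h, w: A[n, c, h, w] * 2, name="B")
    s = tvm.create_schedule(B.op)
    n, c, h, w = s[B].op.axis
    hw = s[B].fuse(h, w)
    s[B].bind(n, tvm.thread_axis("blockIdx.z"))   # batch axis gets its own block dim
    s[B].bind(c, tvm.thread_axis("blockIdx.y"))
    s[B].bind(hw, tvm.thread_axis("blockIdx.x"))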
Force-pushed from a0810d2 to 240a39f.
Force-pushed from 240a39f to ecad8bf.
    if groups == 1 and layout == "NCHW":
        return topi.generic.schedule_conv2d_nchw(outs)
    elif groups == 1 and layout == "NCHW4c":
        return topi.generic.schedule_conv2d_NCHWc_int8_prepacked(outs)
@merrymercy I can't check the dtype of the input here. Could you comment here?
Can we get dtype from outs[0].dtype?
@merrymercy outs[0].dtype is a user input, which can be arbitrary
Currently, we cannot run a
We can just insert the type casting before bn to cast it to fp32, then cast things back to int8 later.
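A small nnvm sketch of that workaround; the variable names are placeholders and the surrounding graph is omitted:

    import nnvm.symbol as sym

    x = sym.Variable("conv_out")          # stands in for the int8 conv2d output
    gamma = sym.Variable("bn_gamma")
    beta = sym.Variable("bn_beta")
    moving_mean = sym.Variable("bn_moving_mean")
    moving_var = sym.Variable("bn_moving_var")

    y = sym.cast(x, dtype="float32")      # up-cast before batch_norm
    y = sym.batch_norm(y, gamma, beta, moving_mean, moving_var, name="bn")
    y = sym.cast(y, dtype="int8")         # cast back to int8 afterwards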
@masahi @merrymercy @FrozenGene can you follow up on this and https://docs.tvm.ai/contribute/code_review.html#approve-and-request-changes-explicitly
masahi left a comment:
Great!
Just one question: could we avoid setting NCHW(x)c to NCHW4c explicitly and instead let AutoTVM decide how to split the input channel? Then we would handle NCHW(x)c generally, not NCHW4c specially. For example, we have _contrib_conv2d_NCHWc to handle it.
@FrozenGene yes, we could let AutoTVM tune the input channel split factor. But this will need some change in
@vinx13 Yes. Your data layout is one special case of NCHW(x)c, and it is set to 4 based on the ResNet model benchmark. However, if we use it on other models, how do we make sure 4 is the best choice? That is the reason I raised the question before. AutoTVM on x86 takes another road to achieve this: https://github.com/dmlc/tvm/pull/1772/files I prefer that approach; it doesn't set one number manually.
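For reference, a hedged sketch of what letting AutoTVM pick the factor could look like inside a template; the function name and candidate values are made up:

    def choose_ic_block_factor(cfg):
        # `cfg` is the AutoTVM config entity passed to a registered template;
        # the knob replaces the hard-coded 4 used for NCHW4c / OIHW4o4i.
        cfg.define_knob("ic_block_factor", [4, 8, 16])
        return cfg["ic_block_factor"].val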
Thanks @merrymercy @FrozenGene @nishi-t for the review and @vinx13 for the contribution, this is merged. Let us follow up to see if we can generalize the layout changes.
This PR adds an int8 conv2d template using the NCHW[x]c layout, where x is a multiple of 4. I obtained the best performance with x = 4.
The template can accept either NCHW layout input or pre-packed data (NCHW4c) and kernel (OIHW4o4i).
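For readers unfamiliar with the packed layout, a plain NumPy illustration of the NCHW to NCHW4c data packing described above (the shapes are examples; the OIHW4o4i kernel packing is analogous):

    import numpy as np

    n, c, h, w, bc = 1, 64, 56, 56, 4
    data_nchw = np.random.randint(-128, 128, size=(n, c, h, w)).astype("int8")
    data_nchw4c = data_nchw.reshape(n, c // bc, bc, h, w).transpose(0, 1, 3, 4, 2)
    assert data_nchw4c.shape == (n, c // bc, h, w, bc)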
inference time (ms) of different models (before classifier) on NVIDIA 1080, batch size = 1
cc @merrymercy @tqchen