
Conversation

@anijain2305
Contributor

@anijain2305 anijain2305 commented Aug 31, 2018

This PR implements an INT8 conv operator for Intel Skylake and upcoming Intel processors. Currently, it supports input in NCHWc format. Later, an NNVM effort will pick up this kernel for conv and perform the kernel transform using "CorrectLayout". This PR tackles only the schedule for the INT8 conv kernel.

Background

Hardware support
Skylake provides hardware support for performing a dot product of two vectors of 4 int8 values each while keeping the accumulation precision at INT32. On Skylake these instructions are vpmaddubsw and vpmaddwd. This support will be further enhanced by the VNNI instructions. More details can be found at this link (https://software.intel.com/en-us/articles/lower-numerical-precision-deep-learning-inference-and-training).
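
A minimal NumPy sketch of the dot-product semantics these instructions provide (the function name and the uint8-data/int8-kernel convention are illustrative, not part of this PR):

```python
import numpy as np

def int8_dot_product_16_lanes(data, kernel):
    """data: uint8[4], kernel: int8[16][4] -> int32[16].

    Each of the 16 output lanes is the dot product of the 4 data bytes
    with one 4-byte kernel row, accumulated at int32 precision
    (ignoring vpmaddubsw's intermediate int16 saturation).
    """
    acc = (kernel.astype(np.int32) * data.astype(np.int32)).sum(axis=1)
    return acc.astype(np.int32)

out = int8_dot_product_16_lanes(np.array([1, 2, 3, 4], dtype=np.uint8),
                                np.ones((16, 4), dtype=np.int8))
print(out)  # every lane is 1 + 2 + 3 + 4 = 10
```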

Why new schedule?
These instructions require some modifications to the current FP32 schedule. The current schedule does not perform any reduction across elements of a vector register, whereas these Intel instructions reduce across groups of 4 int8 values. Therefore, a new schedule is required.

Why not just rely on LLVM for codegen?
LLVM codegen is not mature enough to generate these instructions. LLVM has very restrictive pattern matching for lowering to these INT8 operations. We would need a decent effort on both the LLVM and TVM sides to reach an IR from which LLVM can directly generate these instructions. Therefore, I am currently calling the LLVM intrinsics directly from TVM.

Performance Speedup

These are different conv layers from the ResNet network. I don't have the NNVM changes yet to run an end-to-end experiment; I will update these numbers when we have that.

| Workload | Kernel size | FP32 time | INT8 time | Speedup |
| --- | --- | --- | --- | --- |
| Workload#0 | 3x3 | 1.27E-04 | 7.61E-05 | 1.668205764 |
| Workload#1 | 1x1 | 1.45E-05 | 1.62E-05 | 0.8956588432 |
| Workload#2 | 3x3 | 9.77E-05 | 4.52E-05 | 2.163945956 |
| Workload#3 | 1x1 | 9.56E-06 | 8.55E-06 | 1.117662311 |
| Workload#4 | 3x3 | 0.000124318896 | 9.35E-05 | 1.32998721 |
| Workload#5 | 3x3 | 8.53E-05 | 4.26E-05 | 2.00352076 |
| Workload#6 | 1x1 | 9.68E-06 | 8.45E-06 | 1.145552654 |
| Workload#7 | 3x3 | 0.000101081785 | 8.07E-05 | 1.252166183 |
| Workload#8 | 3x3 | 7.25E-05 | 4.82E-05 | 1.502493515 |
| Workload#9 | 1x1 | 9.35E-06 | 7.04E-06 | 1.328638704 |
| Workload#10 | 3x3 | 0.000118993864 | 9.70E-05 | 1.226515769 |
| Workload#11 | 1x1 | 0.000117436952 | 5.97E-05 | 1.966779456 |
| Workload#12 | 1x1 | 0.000121834512 | 6.08E-05 | 2.00419261 |
| Workload#13 | 1x1 | 6.25E-05 | 2.78E-05 | 2.245298107 |
| Workload#14 | 1x1 | 0.000106934483 | 4.96E-05 | 2.157467654 |
| Workload#15 | 1x1 | 0.000280448215 | 9.23E-05 | 3.038109489 |
| Workload#16 | 1x1 | 0.000118953154 | 5.06E-05 | 2.351587926 |
| Workload#17 | 1x1 | 3.67E-05 | 2.20E-05 | 1.666902107 |
| Workload#18 | 1x1 | 4.73E-05 | 3.66E-05 | 1.293318986 |
| Workload#19 | 1x1 | 0.000132912593 | 7.11E-05 | 1.868658243 |
| Workload#20 | 1x1 | 5.77E-05 | 3.91E-05 | 1.474495608 |
| Workload#21 | 1x1 | 3.15E-05 | 1.70E-05 | 1.858185831 |
| Workload#22 | 1x1 | 5.08E-05 | 2.74E-05 | 1.851147766 |
| Workload#23 | 1x1 | 0.000111464776 | 5.45E-05 | 2.044977186 |
| Workload#24 | 1x1 | 5.70E-05 | 3.06E-05 | 1.85923854 |
| Mean | | | | 1.732588287 |

Limitations

  • Current implementation requires input_channels to be a multiple of 4 and output_channels to be a multiple of 16. For other conv layers, we plan to use the FP32 schedule and not use the AVX512BW instructions (a minimal sketch of this dispatch condition follows below). If performance becomes a big concern, we can look into input channel padding.
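
A tiny sketch of that fallback check (hypothetical helper name; the actual dispatch lives in the x86 schedule code):

```python
def can_use_int8_schedule(in_channels, out_channels):
    # The int8 intrinsic reduces over 4 input channels at a time and
    # produces 16 output channels per vector register.
    return in_channels % 4 == 0 and out_channels % 16 == 0
```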

@anijain2305
Contributor Author

@yidawang @yzhliu @zhiics Please review. Also please feel free to add other reviewers who might be interested.

@FrozenGene
Member

@yzhliu Could we start converting the x86 CPU schedules to AutoTVM? I think we can leverage the ARM CPU AutoTVM template. Then, as in this PR, we could avoid adding workloads manually.

@yidawang
Contributor

@FrozenGene We are indeed working on applying AutoTVM to x86 CPUs. This PR is about INT8 quantization using the intrinsics provided by AVX-512 BW, which is potentially applicable to AutoTVM as well, but we still need to set it up manually first.

@yidawang
Contributor

@anijain2305 Can you edit the PR description to include the preliminary performance results?

Contributor

@yidawang yidawang left a comment

In addition, please identify and fix the lint issues by running make lint locally.


target_name = 'llvm -mcpu=skylake-avx512'
avx2_len = 16
ctx = tvm.context(target_name, 0);
Contributor

No need for the semicolon. The same comment applies to other similar lines in the Python code.

_, oc_chunk, oh, ow, oc_block = s[CC].op.axis
ic_outer, ic_f_inner, ic_s_inner = s[CC].op.reduce_axis

# Sylake and future processors have 16 vector lanes
Contributor

Skylake


ow_chunk, ow_block = s[CC].split(ow, factor=sch.reg_n)

# Sylake and future processors have 16 vector lanes
Contributor

Skylake

@anijain2305
Contributor Author

Thanks @yidawang for the comments :) I will start working on them

@tqchen
Member

tqchen commented Aug 31, 2018

cc @ajtulloch @eqy @cowanmeg

avx2_len = 16
ctx = tvm.context(target_name, 0);

def getShape(im_height, im_width, in_filter, out_filter, kh, kw, hpad, wpad,
Member

Keep naming consistent: s/getShape/get_shape

def getShape(im_height, im_width, in_filter, out_filter, kh, kw, hpad, wpad,
hstride, wstride, outDtype):
## Find shapes
dataShape = (1, in_filter/avx2_len, im_height, im_width, avx2_len)
Member

Same naming style for variables, s/dataShape/data_shape
It also applies to the other parts.

Contributor

Be careful about / vs. //; in this case we will get a floating-point value.
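
For illustration, the difference the reviewer is pointing at (Python 3 semantics; variable names taken from the quoted test script):

```python
in_filter, avx2_len = 64, 16
print(in_filter / avx2_len)   # 4.0 -> float, not valid as a tensor shape dimension
print(in_filter // avx2_len)  # 4   -> integer division, what the shape tuple needs
```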

s = tvm.create_schedule(out.op);
func = tvm.build(s, [data, kernel, out], target=target_name, name='out')
func(a, b, cOrig)
#print(tvm.lower(s, [data, kernel], simple_mode=True));
Member

remove debugging code?

else:
HSTR, WSTR = stride, stride
assert data.dtype == kernel.dtype, \
assert data.dtype == kernel.dtype or (data.dtype == 'uint8' and kernel.dtype == 'int8'), \
Member

(data.dtype == kernel.dtype)

cSch = tvm.nd.array(np.zeros(oShape, dtype=outDtype), ctx);


with tvm.target.create(target_name):
Member

I understand that quantization currently only works for x86 conv2d, so you want to invoke this "specialized" function directly. In the long term, I think the annotation I am doing could help here if quantization works on more devices. You can annotate the node with the target and call conv2d_nchwc from a layer higher so that the dispatcher can find the correct compute.

Contributor Author

Agreed. This is a specific use case to trigger the x86 conv2d compute/schedule.

indices.push_back(i);
}
return builder_->CreateShuffleVector(v0, v1, indices);
} else if (op->is_intrinsic("broadcast16")){
Member

I think we may want to avoid using string literals directly on both the backend and frontend sides, because it might be error-prone or user-unfriendly as the number of them increases. Instead we can probably create a "mapping" or "enum" to do this. But again, this is fine for now.

def getShape(im_height, im_width, in_filter, out_filter, kh, kw, hpad, wpad,
hstride, wstride, outDtype):
## Find shapes
dataShape = (1, in_filter/avx2_len, im_height, im_width, avx2_len)
Contributor

Be careful about / vs. //; in this case we will get a floating-point value.

## Find shapes
dataShape = (1, in_filter/avx2_len, im_height, im_width, avx2_len)

if outDtype == 'int32':
Contributor

cosmetics: keep CamelCase vs. snake_case consistent

else:
a = tvm.nd.array(np.random.randint(100, size=dataShape).astype(dataDtype));
b = tvm.nd.array(np.random.randint(100, size=kernelShape).astype(kernelDtype));
#a = tvm.nd.array(np.ones(dataShape, dtype='uint8'), ctx);
Contributor

delete comments if they are not useful here

avx2_len = 16
else:
return s
assert(avx2_len != -1)
Contributor

Parentheses are not needed here (lint may complain about this).

"""
This function sets up the compute for INT8 conv 2d
Inputs are in INT8 datatype
Ouptut is in INT32 datatype
Contributor

Output

Contributor

@yidawang yidawang left a comment

LGTM

@ajtulloch
Contributor

a) Would you be able to report achieved GOPS (ideally as a fraction of peak) instead of just time? Additionally, could you compare against MKL-DNN or similar for fp32/int8? (i.e. using benchdnn from MKL-DNN)
b) Do you find padding to be particularly expensive (either spatial or channel padding)? I've noticed that the codegen for tvm_if_then_else seems to be particularly poor, and I wonder if it's worth tackling that at some point.

@anijain2305
Contributor Author

@ajtulloch Both good points.

I will update the numbers sometime next week. I agree GOPS is a much better metric than just time; it tells us how much room is left to optimize.

I did not do anything specific for padding. The kernel is built on top of the current x86 NCHWc kernel, which hides the padding handling from my implementation. But I will look deeper and see if the speedup for padded kernels is worse.
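
For reference, a minimal sketch of the GOPS computation requested above (assuming 2 ops per multiply-accumulate; the helper name and example dimensions are illustrative):

```python
def conv_gops(batch, in_c, out_c, out_h, out_w, kh, kw, seconds):
    """Achieved giga-ops per second for a direct convolution."""
    macs = batch * in_c * out_c * out_h * out_w * kh * kw
    return 2.0 * macs / seconds / 1e9

# e.g. batch 1, 64 -> 64 channels, 56x56 output, 3x3 kernel, measured at 1 ms
print(conv_gops(1, 64, 64, 56, 56, 3, 3, 1e-3))  # ~231 GOPS
```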

strides=[1])
b_buffer = tvm.decl_buffer(kernel.shape, dtype='int8', name="b_buffer",
offset_factor=1,
strides=[tvm.var('ldw'), 1])
Member

I feel all these strides bindings are unnecessary and can be removed.

Contributor Author

Actually, I didn't have strides earlier. The memory accesses were wrong in that case, so I had to add the strides.

Honestly, I am not fully aware of what these different parameters of tvm.decl_buffer mean. I will look into it in more detail to make sure I understand why the presence of strides makes it work.

Member

That's bizarre. In my understanding, the strides are implicitly inferred (given the input tensor is compact), and var('ldw') is for binding the inferred stride. Actually, if you changed the innermost stride 1 to some other number, I would expect it to fail with a binding mismatch error.
@tqchen Could you help with this? I also don't fully understand the strides for buffers.

Member

#1725 The usage here is correct, thus it does not block merging this PR.
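
For context, a minimal sketch of the stride binding under discussion, using the decl_buffer arguments quoted above (top-level TVM API as of this PR; purely illustrative):

```python
import tvm

kernel = tvm.placeholder((16, 4), dtype='int8', name='kernel')
# 'ldw' is a symbolic stride that gets bound to the actual (inferred) row
# stride when the intrinsic is matched at tensorize time; the innermost
# stride must match the constant 1.
b_buffer = tvm.decl_buffer(kernel.shape, dtype='int8', name="b_buffer",
                           offset_factor=1,
                           strides=[tvm.var('ldw'), 1])
```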

Contributor

@ajtulloch ajtulloch left a comment

Just some minor nits.

"""
raise ValueError("missing register for topi.nn.conv2d_winograd_without_weight_transform")

def check_skylake(target):
Contributor

This doesn't really belong in a generic file like nn/conv2d.py, right? Shouldn't this be in some x86-specific directory?

target = tvm.target.current_target(allow_none=False)
for opt in target.options:
if opt == '-mcpu=skylake-avx512':
fp32_vec_len = 16
Contributor

Shouldn't this reuse the check_skylake function?

@yzhliu
Member

yzhliu commented Sep 17, 2018

@ajtulloch Could you take a look again and approve explicitly if it is good? thanks.

Member

@yzhliu yzhliu left a comment

Also @tqchen please review again.

@@ -0,0 +1,107 @@
"""Core kernel of dot product of 4 Int8 operations"""
#pylint: disable=invalid-name
import tvm
Member

Let us rename it to tensor_intrin.py to be consistent with #1707


if __name__ == "__main__":
LOGGER.info("Workload, Kernel_size, FP32_time, INT8_time, Speedup")
SPEEDUP_ARRAY = []
Member

Since this is a unit test, it needs to be written in the form of nose tests and skipped when the target is not supported. Alternatively, move it to topi/recipe for now.

with tvm.build_config(offset_factor=1, partition_const_loop=True):
return tvm.decl_tensor_intrin(C.op, _intrin_func, binds={data:a_buffer, kernel:b_buffer})

def _intrin_reduce4int8_1x1(vec_size, num_elements_intel):
Member

Remove the _intrin prefix if it is already in the tensor_intrin.py file. Make it a public function, and document all the arguments and return types.

@tqchen
Member

tqchen commented Sep 17, 2018

I made some comments, mainly on documenting and making the code clear.

@tqchen
Member

tqchen commented Sep 20, 2018

related PR for CUDA #1735

@tqchen
Member

tqchen commented Sep 20, 2018

@anijain2305 please follow up to fix the recent review comments and let us bring this in

Int8 dot product by every 4 elements using AVX2 Skylake instructions

Parameters
-------------
Member

Docstring issue: the underline should be the same length as the heading. See https://docs.tvm.ai/contribute/document.html#document-python

Contributor Author

Thanks a lot for the pointer


def reduce_4int8_1x1(int32_lanes, num_elements_intel):
"""
Int8 dot product by every 4 elements using AVX2 Skylake instructions
Member

Can we give a more detailed example of the semantics here, i.e. what is the input and what is the output? The parameter naming also seems obscure to me.

@tqchen
Member

tqchen commented Sep 21, 2018

Thanks for all the changes. The only complaint I have is that the intrinsic functions' parameter naming seems confusing, and it is hard for me to tell what they do exactly. We should be cautious about how we name the API since it is going to be used by users. Maybe one way to make things clear is to document the behavior of the intrinsic using arrays and pseudo code.

Everyone is also welcome to weigh in on the API @ajtulloch @vinx13 @yizhi

@anijain2305
Contributor Author

@tqchen Thanks for helping out with clear documentation. I have thought more carefully about the API and realized that it doesn't need any parameters, as the tensor intrinsic is specific to the Skylake machine. I have added a small summary with pseudo code. Please review again and let me know if it needs more improvement.

function returns a TensorIntrin that can be used to tensorize
a schedule.

Parameters
Member

If there are no parameters, we do not need a Parameters section.

datatype. Each entry of output array output[i] is equal to dot product
of data[4] and corresponding kernel[i][4]. The pseudo code is as follows

for (int i = 0; i < 16; i++)
Member

We can embed C code in the docstring via a reStructuredText tag. See the example in https://docs.tvm.ai/contribute/document.html#document-python (look for .. code::).

Member

It is helpful to declare the pseudo code as a function, like

void intrin_name(int8 data[4], int8 kernel[16][4], int32 output[16]) {
    body of the code
}

import tvm


def reduce_4int8_common():
Member

What does "common" mean here?

Contributor Author

There are 2 different schedules for Intel x86. The first one is for 1x1 kernels and the second one is for 3x3 kernels. "Common" here means other kernel sizes.

One way to resolve this confusion is to remove "common"; the other one can keep 1x1. Thoughts?

@tqchen
Member

tqchen commented Sep 21, 2018

Thanks for the set of changes. Maybe we could put a bit more thought into the intrinsic naming. I think there are two sensible ways to do so:

  • Use the native name for the intrinsic, e.g. dp4a
  • Use the mathematical meaning of the intrinsic
    • Most intrinsic are doing matrix-vector product or dot product
      • We could use things like dot_8x1x8_int8_int8_int32

Thoughts?

@anijain2305
Contributor Author

I like the second option better as it is more accurate.
For naming, how about we put the vector length and data type together, e.g. dot_4xint8_16x4xint8_16xint32? (This is AVX512, so there are 16 vector lanes.)

@tqchen
Member

tqchen commented Sep 25, 2018

Thanks @anijain2305, @yzhliu this can be merged

@yzhliu yzhliu merged commit 72ad9a3 into apache:master Sep 25, 2018
@yzhliu
Member

yzhliu commented Sep 25, 2018

Thanks for everyone's effort!

@masahi
Member

masahi commented Oct 15, 2018

@anijain2305 what LLVM version do I need to run test_conv_int8_intel.py? I'm getting

AssertionError: llvm.x86.avx512.pmaddubs.w.512 is not an LLVM intrinsic

with LLVM 6.0.

@anijain2305
Contributor Author

This error is due to an older LLVM version; it looks like LLVM 6.0 does not support these AVX512BW intrinsics.
I am using LLVM 8.0 and it works with that.

@masahi
Member

masahi commented Oct 16, 2018

thanks, got it working with llvm trunk.

FrozenGene pushed a commit to FrozenGene/tvm that referenced this pull request Dec 27, 2018
…chines (apache#1680)

* Int8 implementation for convolution operator on Intel Skylake

* Int8 implementation for convolution operator on Intel Skylake

* PR changes

* PR changes

* PR changes

* Fixing an error

* Fixing an error

* Minor typos fix

* Minor typos fix

* Removing the broadcast16 CPP code. Using astype feature instead

* Replacing constant by variable name num_elements_intel

* Name fixes and tensorize update rule updated

* Fixing the bug about checking skylake

* Replacing bitcast with reinterpret

* Isolating INT8 and FP32 schedules to ease out future AutoTVM PR merge

* Putting check_skylake function in the x86 directory

* Added documentation and organizing files to better locations

* Tensor intrin renaming. Avoid code duplication for intrin by kernel reshaping
