INT8 conv operator implementation with NCHWc data layout for Intel machines #1680
Conversation
@yzhliu Could we start converting the x86 CPU schedules to AutoTVM? I think we can leverage the ARM CPU AutoTVM templates. Then, like in this PR, we could avoid adding workloads manually.
@FrozenGene We are indeed working on applying AutoTVM to x86 CPUs. This PR is about INT8 quantization using the intrinsics provided by AVX-512 BW, which is potentially applicable to AutoTVM as well, but we still need to set it up manually first.
@anijain2305 Can you edit the PR description to include the preliminary performance results?
yidawang left a comment:
In addition, please identify and fix the lint issues by running make lint locally.
    target_name = 'llvm -mcpu=skylake-avx512'
    avx2_len = 16
    ctx = tvm.context(target_name, 0);
No need for the semicolon. The same comment applies to other similar lines in the Python code.
    _, oc_chunk, oh, ow, oc_block = s[CC].op.axis
    ic_outer, ic_f_inner, ic_s_inner = s[CC].op.reduce_axis

    # Sylake and future processors have 16 vector lanes
Skylake
    ow_chunk, ow_block = s[CC].split(ow, factor=sch.reg_n)
    # Sylake and future processors have 16 vector lanes
Skylake
Thanks @yidawang for the comments :) I will start working on them.
    avx2_len = 16
    ctx = tvm.context(target_name, 0);

    def getShape(im_height, im_width, in_filter, out_filter, kh, kw, hpad, wpad,
Keep naming consistent: s/getShape/get_shape
    def getShape(im_height, im_width, in_filter, out_filter, kh, kw, hpad, wpad,
                 hstride, wstride, outDtype):
        ## Find shapes
        dataShape = (1, in_filter/avx2_len, im_height, im_width, avx2_len)
Use the same naming style for variables: s/dataShape/data_shape.
The same applies to the other parts.
Be careful about / vs. //; in this case we will get a floating-point value.
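For reference, a minimal Python 3 illustration of the difference (the values are illustrative):

```python
in_filter = 64
avx2_len = 16

print(in_filter / avx2_len)    # 4.0 -- true division always returns a float in Python 3
print(in_filter // avx2_len)   # 4   -- floor division keeps it an integer

# A float inside a shape tuple breaks code that expects integer extents,
# so the NCHWc shape should be built with //:
data_shape = (1, in_filter // avx2_len, 224, 224, avx2_len)
```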
    s = tvm.create_schedule(out.op);
    func = tvm.build(s, [data, kernel, out], target=target_name, name='out')
    func(a, b, cOrig)
    #print(tvm.lower(s, [data, kernel], simple_mode=True));
remove debugging code?
topi/python/topi/nn/conv2d.py
    else:
        HSTR, WSTR = stride, stride
    assert data.dtype == kernel.dtype, \
    assert data.dtype == kernel.dtype or (data.dtype == 'uint8' and kernel.dtype == 'int8'), \
(data.dtype == kernel.dtype)
    cSch = tvm.nd.array(np.zeros(oShape, dtype=outDtype), ctx);

    with tvm.target.create(target_name):
I understand that quantization currently only works for x86 conv2d, so you want to invoke this "specialized" function directly. In the long term, I think the annotation work I am doing could help here once quantization works on more devices: you can annotate the node with the target and call conv2d_nchwc from a layer higher up so that the dispatcher can find the correct compute.
Agreed. This is a specific use case to trigger the x86 conv2d compute/schedule.
src/codegen/llvm/codegen_llvm.cc
        indices.push_back(i);
      }
      return builder_->CreateShuffleVector(v0, v1, indices);
    } else if (op->is_intrinsic("broadcast16")){
I think we may want to avoid using the string literals directly from both backend and frontend sides because it might be error prone or user unfriendly as the number of them increases. Instead we can probably create a "mapping" or "enum" to do this. But again, this is fine for now.
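A hypothetical sketch of such a mapping on the Python side (the names and layout here are illustrative, not part of this PR); the C++ backend would check op names against the same table:

```python
# Hypothetical central registry of intrinsic names, so that string literals such
# as "broadcast16" are defined exactly once instead of being repeated in the
# frontend and in the LLVM codegen.
INTRINSIC_NAMES = {
    "BROADCAST16": "broadcast16",
}

def intrinsic_name(key):
    """Look up an intrinsic name; a typo raises KeyError instead of silently
    emitting an unknown string."""
    return INTRINSIC_NAMES[key]
```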
        ## Find shapes
        dataShape = (1, in_filter/avx2_len, im_height, im_width, avx2_len)

        if outDtype == 'int32':
cosmetics: keep CamelCase vs. snake_case consistent
    else:
        a = tvm.nd.array(np.random.randint(100, size=dataShape).astype(dataDtype));
        b = tvm.nd.array(np.random.randint(100, size=kernelShape).astype(kernelDtype));
        #a = tvm.nd.array(np.ones(dataShape, dtype='uint8'), ctx);
delete comments if they are not useful here
        avx2_len = 16
    else:
        return s
    assert(avx2_len != -1)
Parentheses are not needed here (lint may complain about this).
| """ | ||
| This function sets up the compute for INT8 conv 2d | ||
| Inputs are in INT8 datatype | ||
| Ouptut is in INT32 datatype |
Output
yidawang left a comment:
LGTM
a) Would you be able to report achieved GOPS (ideally as a fraction of peak) instead of just time? Additionally, could you compare against MKL-DNN or similar for FP32/INT8 (e.g. using benchdnn from MKL-DNN)?
@ajtulloch Both good points. I will update the numbers sometime next week. I agree that GOPS is a much better metric than just time; it tells us how much is left to optimize for. As for padding, I did not do anything specific: the kernel is built on top of the current x86 NCHWc kernel, which hid the handling of padding from my implementation. But I will look deeper and see whether the speedup for padded kernels is worse.
                               strides=[1])
    b_buffer = tvm.decl_buffer(kernel.shape, dtype='int8', name="b_buffer",
                               offset_factor=1,
                               strides=[tvm.var('ldw'), 1])
I feel all these strides bindings are unnecessary and can be removed.
Actually, I didn't have strides earlier, and the memory accesses were wrong in that case, so I had to add strides.
Honestly, I am not fully aware of what these different parameters of tvm.decl_buffer mean. I will look into it in more detail to make sure I have a good understanding of why the presence of strides makes it work.
That's bizarre. In my understanding, the strides are implicitly inferred (given that the input tensor is compact), and var('ldw') is for binding the inferred strides. Actually, if you changed the innermost stride from 1 to some other number, I would expect it to fail with a binding-mismatch error.
@tqchen Could you help with this? I also don't fully understand the strides for buffers.
#1725 The usage here is correct, thus it does not block merging this PR.
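For context, here is the strides-binding pattern under discussion in isolation (a sketch against the TVM API of that time; the 16x4 shape is illustrative):

```python
import tvm

# Kernel tile consumed by the tensor intrinsic: 16 output lanes x 4 int8 values.
kernel = tvm.placeholder((16, 4), dtype='int8', name='kernel')

# strides=[tvm.var('ldw'), 1] binds the inferred leading stride to the symbolic
# variable 'ldw' that the intrinsic body can use for addressing, while the
# innermost stride is pinned to 1 (contiguous int8 elements).
b_buffer = tvm.decl_buffer(kernel.shape, dtype='int8', name="b_buffer",
                           offset_factor=1,
                           strides=[tvm.var('ldw'), 1])
```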
ajtulloch left a comment:
Just some minor nits.
topi/python/topi/nn/conv2d.py
| """ | ||
| raise ValueError("missing register for topi.nn.conv2d_winograd_without_weight_transform") | ||
|
|
||
| def check_skylake(target): |
This doesn't really belong in a generic file like nn/conv2d.py right? Shouldn't this be in some x86/ specific directory?
topi/python/topi/x86/conv2d.py
    target = tvm.target.current_target(allow_none=False)
    for opt in target.options:
        if opt == '-mcpu=skylake-avx512':
            fp32_vec_len = 16
Shouldn't this reuse the check_skylake function?
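For reference, a sketch of what the reuse could look like, based on the snippet above (where the helper finally lives and its exact signature may differ in the PR):

```python
import tvm

def check_skylake(target):
    """Return True if the target enables the Skylake AVX-512 code path."""
    for opt in target.options:
        if opt == '-mcpu=skylake-avx512':
            return True
    return False

# The schedule can then derive the vector length without duplicating the scan,
# e.g. (the non-Skylake default shown here is hypothetical):
#   target = tvm.target.current_target(allow_none=False)
#   fp32_vec_len = 16 if check_skylake(target) else 8
```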
@ajtulloch Could you take another look and approve explicitly if it looks good? Thanks.
yzhliu left a comment:
Also @tqchen please review again.
    @@ -0,0 +1,107 @@
    """Core kernel of dot product of 4 Int8 operations"""
    #pylint: disable=invalid-name
    import tvm
Let us rename it to tensor_intrin.py to be consistent with #1707
    if __name__ == "__main__":
        LOGGER.info("Workload, Kernel_size, FP32_time, INT8_time, Speedup")
        SPEEDUP_ARRAY = []
Since this is in the unit-test cases, it needs to be written as a nose test and skipped when the target is not supported. Alternatively, move it to topi/recipe for now.
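One way to express the skip, following the early-return convention common in TVM tests at the time (a sketch only; a complete check would also need to verify that AVX-512 BW can actually be used):

```python
import tvm

def test_conv_int8_intel():
    target_name = 'llvm -mcpu=skylake-avx512'
    # Skip gracefully on machines/toolchains without the required support so
    # the test suite does not fail where the target cannot be built.
    if not tvm.module.enabled('llvm'):
        print("Skip test_conv_int8_intel: llvm is not enabled")
        return
    # ... build and run the INT8 conv workloads here ...
```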
    with tvm.build_config(offset_factor=1, partition_const_loop=True):
        return tvm.decl_tensor_intrin(C.op, _intrin_func, binds={data:a_buffer, kernel:b_buffer})

    def _intrin_reduce4int8_1x1(vec_size, num_elements_intel):
Remove the _intrin prefix if it is already in the tensor_intrin.py file. Make it a public function and document all the arguments and return types.
Made some comments, mainly on documentation and making the code clearer.
Related PR for CUDA: #1735
@anijain2305 please follow up to fix the recent review comments and let us bring this in.
    Int8 dot product by every 4 elements using AVX2 Skylake instructions

    Parameters
    -------------
Docstring issue: the underline should be the same length as the section title, see https://docs.tvm.ai/contribute/document.html#document-python
Thanks a lot for the pointer
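A minimal example of the expected layout, with the underline matching the section title (the parameter descriptions are my reading of the code, not text from the PR):

```python
def reduce_4int8_1x1(int32_lanes, num_elements_intel):
    """Int8 dot product by every 4 elements using AVX-512 Skylake instructions.

    Parameters
    ----------
    int32_lanes : int
        Number of int32 lanes in the output vector register (16 on Skylake).
    num_elements_intel : int
        Number of int8 values reduced into each int32 lane (4 here).

    Returns
    -------
    intrin : TensorIntrin
        Tensor intrinsic that can be used to tensorize a schedule.
    """
```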
    def reduce_4int8_1x1(int32_lanes, num_elements_intel):
        """
        Int8 dot product by every 4 elements using AVX2 Skylake instructions
Can we give a more detailed example of the semantics here, i.e. what is the input and what is the output? The parameter naming also seems obscure to me.
Thanks for all the changes. The only complaint I have is that the intrinsic functions' parameter naming seems confusing, and it is hard for me to tell what they do exactly; we should be cautious about how we name the API since it is going to be used by users. Maybe one way to make things clear is to document the behavior of the intrinsic using arrays and pseudo code. Everyone is also welcome to weigh in on the API @ajtulloch @vinx13 @yizhi
@tqchen Thanks for helping out with clear documentation. I have thought more carefully about the API and realized that it doesn't need any parameters, as the tensor intrin is specific to Skylake machines. I have added a small summary with pseudo code. Please review again and let me know if it needs more improvement.
    function returns a TensorIntrin that can be used to tensorize
    a schedule.

    Parameters
If there are no parameters, we do not need a Parameters section.
    datatype. Each entry of output array output[i] is equal to dot product
    of data[4] and corresponding kernel[i][4]. The pseudo code is as follows

    for (int i = 0; i < 16; i++)
We can embed C code in the docstring via a reStructuredText tag; see the example in https://docs.tvm.ai/contribute/document.html#document-python and look for (.. code::).
It is helpful to declare the pseudo code as a function, like:

    void intrin_name(int8 data[4], int8 kernel[16][4], int32 output[16]) {
        // body of the code
    }
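Putting the two suggestions together, the docstring could look roughly like this (the function name is a placeholder; per the compute definition the data is uint8 and the kernel int8):

```python
def intrin_dot_product():
    """16-lane dot product over groups of 4 int8 values, accumulated into int32.

    Semantics of the generated intrinsic:

    .. code:: c

        void intrin_name(uint8 data[4], int8 kernel[16][4], int32 output[16]) {
            for (int i = 0; i < 16; i++) {
                for (int k = 0; k < 4; k++) {
                    output[i] += data[k] * kernel[i][k];
                }
            }
        }
    """
```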
    import tvm

    def reduce_4int8_common():
What does "common" mean here?
There are two different schedules for Intel x86: the first is for 1x1 kernels and the second is for 3x3 kernels. The "common" here means other kernel sizes.
One way to resolve this confusion is to remove "common"; the other function can keep "1x1". Thoughts?
Thanks for the set of changes. Maybe we could put a bit more thought into the intrinsic naming; I think there are two sensible ways to do so. Thoughts?
I like the second option better as it is more accurate.
Thanks @anijain2305. @yzhliu, this can be merged.
Thanks for everyone's effort!
@anijain2305 what LLVM version do I need to run test_conv_int8_intel.py? I'm getting an error with LLVM 6.0.
This error is due to an older LLVM version. It looks like LLVM 6.0 does not support the AVX-512 BW instructions.
Thanks, got it working with LLVM trunk.
INT8 conv operator implementation with NCHWc data layout for Intel machines (apache#1680)

* Int8 implementation for convolution operator on Intel Skylake
* Int8 implementation for convolution operator on Intel Skylake
* PR changes
* PR changes
* PR changes
* Fixing an error
* Fixing an error
* Minor typos fix
* Minor typos fix
* Removing the broadcast16 CPP code. Using astype feature instead
* Replacing constant by variable name num_elements_intel
* Name fixes and tensorize update rule updated
* Fixing the bug about checking skylake
* Replacing bitcast with reinterpret
* Isolating INT8 and FP32 schedules to ease out future AutoTVM PR merge
* Putting check_skylake function in the x86 directory
* Added documentation and organizing files to better locations
* Tensor intrin renaming. Avoid code duplication for intrin by kernel reshaping
This PR implements a conv operator for INT8 operations on Intel Skylake and upcoming Intel processors. Currently, it supports input in the NCHWc format. Later, there will be an NNVM effort to pick up this kernel for conv and perform the kernel transform using "CorrectLayout". This PR tackles only the schedule for the INT8 conv kernel.
Background
Hardware support
Skylake provides hardware support for performing a dot product of two vectors of four INT8 values while keeping the accumulation at INT32 precision. On Skylake these instructions are vpmaddubsw and vpmaddwd. This support will be enhanced with the VNNI instructions on upcoming processors. More details can be found at this link (https://software.intel.com/en-us/articles/lower-numerical-precision-deep-learning-inference-and-training).
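Schematically, the two instructions compose as follows for a single INT32 lane (a Python rendering of the semantics, ignoring the int16 saturation in vpmaddubsw; this is not code from the PR):

```python
def dot_4_int8(data, kernel, acc):
    """data: 4 uint8 values, kernel: 4 int8 values, acc: int32 accumulator."""
    # vpmaddubsw: multiply uint8 x int8 pairs and add adjacent products into int16
    t0 = data[0] * kernel[0] + data[1] * kernel[1]
    t1 = data[2] * kernel[2] + data[3] * kernel[3]
    # vpmaddwd (against a vector of ones): widen and add each int16 pair into int32
    t = t0 + t1
    # vpaddd: accumulate into the running int32 result
    return acc + t
```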
Why new schedule?
These instructions require some modifications to the current FP32 schedule. The current schedule does not perform reduction across different elements of a vector register, but these Intel instructions reduce across groups of four INT8 values within a register. Therefore, a new schedule is required.
Why not just rely on LLVM for codegen?
LLVM codegen is not yet mature enough to generate these instructions: LLVM has very restrictive pattern matching for lowering to these INT8 operations. We would need decent effort on both the LLVM and TVM sides to reach an IR from which LLVM can directly generate these instructions. Therefore, I am currently calling the LLVM intrinsics directly from TVM.
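For illustration, the direct-intrinsic call from a tensorize body looks roughly like this (a sketch from memory of the TVM API of that era; the uint32 constant encodes the number of signature arguments, and the exact dtypes and intrinsic IDs should be checked against the PR):

```python
import tvm

def pairwise_int8_dot(vec_a, vec_b):
    # vec_a: uint8x64 data vector, vec_b: int8x64 kernel vector.
    # vpmaddubsw: 64 uint8 x int8 products pair-summed into 32 int16 values.
    pair = tvm.call_llvm_intrin('int16x32', 'llvm.x86.avx512.pmaddubs.w.512',
                                tvm.const(0, 'uint32'), vec_a, vec_b)
    # vpmaddwd against a vector of ones: 16 int32 sums of 4 products each.
    ones = tvm.const(1, 'int16x32')
    return tvm.call_llvm_intrin('int32x16', 'llvm.x86.avx512.pmaddw.d.512',
                                tvm.const(0, 'uint32'), pair, ones)
```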
Performance Speedup
These are different conv layers from the ResNet network. I don't have the NNVM changes yet to run an end-to-end experiment; I will update these numbers when we have that.
Limitations