[TOPI,x86] Improve performance on int8 conv2d on x86 #9966
Conversation
Appended fused operations in conv2d for int8 were computed in a separate loop nest from the main conv2d computation:
```
for i in ... parallel
  for j in ...
    accumulator = 0
    for k in ...
      vectorized_multiply_add(accumulator, data, kernel)
    out = accumulator
  for k in ...
    out = out + fused subsequent ops
```
This patch moves the fused ops one more loop level inwards to get:
```
for i in ... parallel
  for j in ...
    accumulator = 0
    for k in ...
      vectorized_multiply_add(accumulator, data, kernel)
    out = accumulator + fused subsequent ops
```
On quantized mobilenetv2, this results in approximately a 30% speedup.
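In TVM scheduling terms, the change corresponds to computing the conv stage at an inner axis of the fused output stage, so the epilogue runs while the freshly written accumulator is still hot. Below is a minimal, hypothetical sketch of that idea using the TE API; it uses a stand-in matmul-style int8 compute and an illustrative elementwise epilogue (names `conv`, `fused`, and the shapes are assumptions), not the actual TOPI conv2d schedule.
```
# Hypothetical illustration of the scheduling idea, not the actual TOPI code.
import tvm
from tvm import te

N = 64
data = te.placeholder((N, N), name="data", dtype="int8")
kernel = te.placeholder((N, N), name="kernel", dtype="int8")
k = te.reduce_axis((0, N), name="k")

# Stand-in for the int8 conv2d compute: int32 accumulation over a reduce axis.
conv = te.compute(
    (N, N),
    lambda i, j: te.sum(
        data[i, k].astype("int32") * kernel[k, j].astype("int32"), axis=k
    ),
    name="conv",
)
# Stand-in for the appended fused ops (e.g. bias add / requantization).
fused = te.compute((N, N), lambda i, j: conv[i, j] + 1, name="fused")

s = te.create_schedule(fused.op)
# Without the line below, conv materializes its whole output and a separate
# loop applies the fused ops. Computing conv at the inner axis of the fused
# stage applies the epilogue right after the accumulation for each element.
s[conv].compute_at(s[fused], fused.op.axis[1])
print(tvm.lower(s, [data, kernel, fused], simple_mode=True))
```
Printing the lowered IR with and without the `compute_at` line shows the two loop structures from the pseudocode above.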
masahi left a comment:
makes sense, and good speedup!
Curious how much it helps for other models. At first I thought this would be a bigger win for larger workloads (with a correspondingly larger write cache). It is easy to test quantized resnet50, inception v3, or mobilenet v3 via PyTorch.
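As a rough sketch (not part of this PR) of how such a comparison might be run: load a pre-quantized torchvision model, import it through TVM's PyTorch frontend, and time it on the target CPU. The model choice, input name, and `-mcpu` flag below are assumptions to adapt for the test machine.
```
# Hypothetical benchmarking sketch: import a pre-quantized torchvision model
# into TVM and time it on an x86 CPU. Adjust -mcpu for the actual test machine.
import torch
import torchvision
import tvm
from tvm import relay
from tvm.contrib import graph_executor

inp = torch.rand(1, 3, 224, 224)
model = torchvision.models.quantization.resnet50(pretrained=True, quantize=True).eval()
scripted = torch.jit.trace(model, inp).eval()

# "input" is just the name given to the graph input here.
mod, params = relay.frontend.from_pytorch(scripted, [("input", (1, 3, 224, 224))])

target = "llvm -mcpu=cascadelake"  # assumed VNNI-capable CPU
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

dev = tvm.cpu()
rt = graph_executor.GraphModule(lib["default"](dev))
rt.set_input("input", tvm.nd.array(inp.numpy()))
print(rt.benchmark(dev, repeat=3))
```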
mbrookhart left a comment:
Yay!
@masahi @mbrookhart