tkonolige (Contributor) commented on Jan 18, 2022

Appended fused operations in conv2d for int8 were computed in a separate loop from the main conv2d computation:

```
for i in ... parallel
  for j in ...
    accumulator = 0
    for k in ...
      vectorized_multiply_add(accumulator, data, kernel)
    out = accumulator
  for j in ...
    out = out + fused subsequent ops
```

This patch moves the fused ops one more loop nesting inwards to get

```
for i in ... parallel
  for j in ...
    accumulator = 0
    for k in ...
      vectorized_multiply_add(accumulator, data, kernel)
    out = accumulator + fused subsequent ops
```

On quantized mobilenetv2, this results in approximately a 30% speedup.
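
For readers following along in code, here is a minimal sketch of the scheduling idea in TVM's tensor expression (TE) language. The stage and axis names are illustrative assumptions, not the actual x86 int8 conv2d schedule this patch touches:

```python
# Minimal TE sketch (illustrative names, not the patched schedule itself).
# "conv" stands in for the int8 conv2d stage, "out" for the fused epilogue.
import tvm
from tvm import te

n = 1024
data = te.placeholder((n, n), dtype="int8", name="data")
kernel = te.placeholder((n, n), dtype="int8", name="kernel")
k = te.reduce_axis((0, n), name="k")
conv = te.compute(
    (n, n),
    lambda i, j: te.sum(
        data[i, k].astype("int32") * kernel[j, k].astype("int32"), axis=k
    ),
    name="conv",
)
# The fused subsequent op appended after the conv stage (here: a plain add).
out = te.compute((n, n), lambda i, j: conv[i, j] + 1, name="out")

s = te.create_schedule(out.op)
i, j = s[out].op.axis
# compute_at(..., i) would leave the fused add in its own j loop (the
# "before" shape above); computing at j merges it into the same iteration.
s[conv].compute_at(s[out], j)
s[out].parallel(i)
print(tvm.lower(s, [data, kernel, out], simple_mode=True))
```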

@masahi @mbrookhart

masahi (Member) left a comment


Makes sense, and a good speedup!

Curious how much it helps for other models. At first I thought this would be a bigger win for larger workloads (with corresponding larger write cache). It is easy to test quantized resnet50, inception v3, or mobilenet v3 via PyTorch.
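
A hedged sketch of that test path, following the standard TVM PyTorch frontend flow for pre-quantized models; the model choice, input shape, and target string are assumptions for illustration:

```python
# Sketch: benchmark a pre-quantized torchvision model through TVM.
import torch
import torchvision
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# torchvision ships pre-quantized variants of resnet50, inception_v3,
# and mobilenet_v3.
model = torchvision.models.quantization.resnet50(pretrained=True, quantize=True)
model.eval()

inp = torch.rand(1, 3, 224, 224)
scripted = torch.jit.trace(model, inp).eval()

mod, params = relay.frontend.from_pytorch(scripted, [("input", (1, 3, 224, 224))])
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm -mcpu=cascadelake", params=params)

# Time the compiled module with the graph executor.
dev = tvm.cpu()
rt = graph_executor.GraphModule(lib["default"](dev))
rt.set_input("input", inp.numpy())
print(rt.benchmark(dev, number=100))
```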

mbrookhart (Contributor) left a comment


Yay!


tkonolige (Contributor, Author) commented on Jan 18, 2022

| Model        | Before (ms) | After (ms) |
|--------------|-------------|------------|
| resnet50     | 5.5282      | 4.7501     |
| inception_v3 | 3.4320      | 3.2994     |
| mobilenetv2  | 1.5239      | 1.1931     |
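
Measured as a throughput ratio (before/after), these numbers work out to roughly a 16% speedup on resnet50, 4% on inception_v3, and 28% on mobilenetv2, consistent with the ~30% figure in the description.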

masahi merged commit 19717aa into apache:main on Jan 19, 2022
yuanfz98 pushed a commit to yuanfz98/tvm that referenced this pull request Jan 24, 2022
crazydemo pushed a commit to crazydemo/tvm that referenced this pull request Jan 27, 2022
ylc pushed a commit to ylc/tvm that referenced this pull request Feb 16, 2022