tkonolige (Contributor) commented on Jan 18, 2022

Appended fused operations in conv2d for int8 were computed in a separate loop from the main conv2d computation:

```
for i in ... parallel
  for j in ...
    accumulator = 0
    for k in ...
      vectorized_multiply_add(accumulator, data, kernel)
    out = accumulator
  for j in ...
    out = out + fused subsequent ops
```

This patch moves the fused ops one more loop nesting inwards to get

```
for i in ... parallel
  for j in ...
    accumulator = 0
    for k in ...
      vectorized_multiply_add(accumulator, data, kernel)
    out = accumulator + fused subsequent ops
```

On quantized mobilenetv2, this results in approximately a 30% speedup.
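
For readers following along in code, here is a minimal sketch of the scheduling idea in TVM's tensor expression (TE) language. The stage and axis names are illustrative assumptions, not the actual x86 int8 conv2d schedule this patch touches:

```python
# Minimal TE sketch (illustrative names, not the patched schedule itself).
# "conv" stands in for the int8 conv2d stage, "out" for the fused epilogue.
import tvm
from tvm import te

n = 1024
data = te.placeholder((n, n), dtype="int8", name="data")
kernel = te.placeholder((n, n), dtype="int8", name="kernel")
k = te.reduce_axis((0, n), name="k")
conv = te.compute(
    (n, n),
    lambda i, j: te.sum(
        data[i, k].astype("int32") * kernel[j, k].astype("int32"), axis=k
    ),
    name="conv",
)
# The fused subsequent op appended after the conv stage (here: a plain add).
out = te.compute((n, n), lambda i, j: conv[i, j] + 1, name="out")

s = te.create_schedule(out.op)
i, j = s[out].op.axis
# compute_at(..., i) would leave the fused add in its own j loop (the
# "before" shape above); computing at j merges it into the same iteration.
s[conv].compute_at(s[out], j)
s[out].parallel(i)
print(tvm.lower(s, [data, kernel, out], simple_mode=True))
```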

@masahi @mbrookhart

masahi (Member) left a comment


Makes sense, and a good speedup!

Curious how much it helps for other models. At first I thought this would be a bigger win for larger workloads (with corresponding larger write cache). It is easy to test quantized resnet50, inception v3, or mobilenet v3 via PyTorch.
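
A hedged sketch of that test path, following the standard TVM PyTorch frontend flow for pre-quantized models; the model choice, input shape, and target string are assumptions for illustration:

```python
# Sketch: benchmark a pre-quantized torchvision model through TVM.
import torch
import torchvision
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# torchvision ships pre-quantized variants of resnet50, inception_v3,
# and mobilenet_v3.
model = torchvision.models.quantization.resnet50(pretrained=True, quantize=True)
model.eval()

inp = torch.rand(1, 3, 224, 224)
scripted = torch.jit.trace(model, inp).eval()

mod, params = relay.frontend.from_pytorch(scripted, [("input", (1, 3, 224, 224))])
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm -mcpu=cascadelake", params=params)

# Time the compiled module with the graph executor.
dev = tvm.cpu()
rt = graph_executor.GraphModule(lib["default"](dev))
rt.set_input("input", inp.numpy())
print(rt.benchmark(dev, number=100))
```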

mbrookhart (Contributor) left a comment


Yay!


tkonolige (Contributor, Author) commented on Jan 18, 2022

| Model        | Before (ms) | After (ms) |
|--------------|-------------|------------|
| resnet50     | 5.5282      | 4.7501     |
| inception_v3 | 3.4320      | 3.2994     |
| mobilenetv2  | 1.5239      | 1.1931     |
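
Measured as a throughput ratio (before/after), these numbers work out to roughly a 16% speedup on resnet50, 4% on inception_v3, and 28% on mobilenetv2, consistent with the ~30% figure in the description.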

masahi merged commit 19717aa into apache:main on Jan 19, 2022
yuanfz98 pushed a commit to yuanfz98/tvm that referenced this pull request Jan 24, 2022
crazydemo pushed a commit to crazydemo/tvm that referenced this pull request Jan 27, 2022
ylc pushed a commit to ylc/tvm that referenced this pull request Feb 16, 2022