[ARM][Performance]Improve ARM CPU depthwise convolution performance #2028
The CI's test_topi_depthwise_conv2d.py test error is because I have modified the schedule, which doesn't have
I'm very sorry that I committed the merge code previously; I hope this doesn't disturb you. I have opened a new PR, #2345, to continue this work, and will keep this PR as a reference in case people are interested in the background. Sorry again for my mistake.
This PR leverages the existing spatial pack schedule and adds a tunable compute_at knob to re-implement depthwise convolution for ARM CPUs.
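For readers unfamiliar with the operator being optimized, below is a naive pure-Python reference of depthwise convolution in NCHW-like layout (stride 1, no padding). This is only an illustrative sketch, not TVM code; the function name and layout are assumptions. It shows why the op is memory-bound: each channel is convolved with its own filter, with no cross-channel accumulation.

```python
def depthwise_conv2d(data, kernel):
    """Naive depthwise convolution (illustration only, not the TVM schedule).

    data:   [C][H][W] input
    kernel: [C][KH][KW] per-channel filters
    returns [C][H-KH+1][W-KW+1] output (stride 1, no padding)
    """
    C = len(data)
    H, W = len(data[0]), len(data[0][0])
    KH, KW = len(kernel[0]), len(kernel[0][0])
    OH, OW = H - KH + 1, W - KW + 1
    out = [[[0.0] * OW for _ in range(OH)] for _ in range(C)]
    for c in range(C):                      # each channel uses its own filter
        for i in range(OH):
            for j in range(OW):
                acc = 0.0
                for di in range(KH):
                    for dj in range(KW):
                        acc += data[c][i + di][j + dj] * kernel[c][di][dj]
                out[c][i][j] = acc          # no summation across channels
    return out
```

The spatial pack schedule in this PR reorganizes these loops into tiled, vectorized form; the new compute_at knob lets the tuner choose at which loop level the packed data is computed.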
On my Cortex-A53@2.0GHz ARM CPU (MTK6763), this boosts performance by 1.6X compared with the previous depthwise convolution schedule in the MobileNet V1 model (I have also verified the correctness of this schedule).
The following is the AutoTVM tuning GFLOPS log for the TensorFlow MobileNet V1 model.
Before this PR:
[Task 2/20] Current/Best: 0.98/ 2.32 GFLOPS | Progress: (1427/2000) | 2679.82 s Done.
[Task 4/20] Current/Best: 0.56/ 1.15 GFLOPS | Progress: (1072/2000) | 2461.27 s Done.
[Task 6/20] Current/Best: 1.08/ 2.78 GFLOPS | Progress: (1084/2000) | 1987.91 s Done.
[Task 8/20] Current/Best: 0.39/ 1.19 GFLOPS | Progress: (1815/2000) | 2744.70 s Done.
[Task 10/20] Current/Best: 1.09/ 2.33 GFLOPS | Progress: (1222/2000) | 1866.02 s Done.
[Task 12/20] Current/Best: 0.42/ 0.90 GFLOPS | Progress: (1716/2000) | 2528.94 s Done.
[Task 14/20] Current/Best: 1.89/ 2.63 GFLOPS | Progress: (1284/2000) | 2288.55 s Done.
[Task 16/20] Current/Best: 0.47/ 0.96 GFLOPS | Progress: (1467/2000) | 2282.65 s Done.
[Task 18/20] Current/Best: 1.43/ 2.61 GFLOPS | Progress: (1007/2000) | 1525.76 s Done.
After this PR:
[Task 2/20] Current/Best: 0.00/ 4.83 GFLOPS | Progress: (1682/2000) | 1470.40 s Done.
[Task 4/20] Current/Best: 1.35/ 3.17 GFLOPS | Progress: (1257/2000) | 1032.80 s Done.
[Task 6/20] Current/Best: 2.04/ 5.49 GFLOPS | Progress: (1904/2000) | 1623.10 s Done.
[Task 8/20] Current/Best: 0.75/ 3.15 GFLOPS | Progress: (1885/2000) | 1546.22 s Done.
[Task 10/20] Current/Best: 2.09/ 6.07 GFLOPS | Progress: (2000/2000) | 1640.41 s Done.
[Task 12/20] Current/Best: 2.99/ 3.80 GFLOPS | Progress: (1853/2000) | 1547.13 s Done.
[Task 14/20] Current/Best: 4.59/ 6.06 GFLOPS | Progress: (1355/2000) | 1091.93 s Done.
[Task 16/20] Current/Best: 1.96/ 4.01 GFLOPS | Progress: (2000/2000) | 1586.18 s Done.
[Task 18/20] Current/Best: 2.33/ 4.63 GFLOPS | Progress: (2000/2000) | 1599.89 s Done.
The total depthwise convolution execution time on a single A53@2.0GHz improves from 45.3839 ms to 28.1945 ms. One thing you must note when using this schedule: you MUST set the XGBTuner constructor's feature type argument to feature_type='knob', i.e. XGBTuner(tsk, loss_type='rank', feature_type='knob'). Otherwise your program may hang forever.
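A minimal AutoTVM tuning-loop sketch with the required feature_type='knob' setting is shown below. The device key, RPC host/port, and log file name are placeholders you would replace for your own setup; `tasks` is assumed to come from task extraction on your model.

```python
# Sketch of an AutoTVM tuning loop for this schedule (not a complete script).
# 'device_key', host, port, and 'tune.log' are placeholder values.
from tvm import autotvm
from tvm.autotvm.tuner import XGBTuner

for i, tsk in enumerate(tasks):
    # feature_type='knob' is required for this schedule; with the default
    # feature type the program may hang forever, as noted above.
    tuner = XGBTuner(tsk, loss_type='rank', feature_type='knob')
    tuner.tune(
        n_trial=2000,
        measure_option=autotvm.measure_option(
            builder=autotvm.LocalBuilder(),
            runner=autotvm.RPCRunner('device_key', host='0.0.0.0', port=9190)),
        callbacks=[autotvm.callback.log_to_file('tune.log')])
```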
@merrymercy @tqchen Please review.