[ARM][Performance]Improve ARM CPU depthwise convolution performance #2028
The CI's test_topi_depthwise_conv2d.py test error is because I have modified the schedule, which doesn't have
I'm very sorry that I committed the merge code previously; I hope this doesn't disturb you. I have opened a new PR, #2345, to continue this work, and will keep this PR as a reference in case people are interested in the background. Sorry again for my mistake.
This PR leverages the existing spatial pack schedule and adds a tunable compute_at knob to re-implement depthwise convolution for ARM CPUs.
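For readers unfamiliar with the operator being optimized, below is a naive pure-Python reference of depthwise convolution in NCHW-like layout (stride 1, no padding). This is only an illustrative sketch, not TVM code; the function name and layout are assumptions. It shows why the op is memory-bound: each channel is convolved with its own filter, with no cross-channel accumulation.

```python
def depthwise_conv2d(data, kernel):
    """Naive depthwise convolution (illustration only, not the TVM schedule).

    data:   [C][H][W] input
    kernel: [C][KH][KW] per-channel filters
    returns [C][H-KH+1][W-KW+1] output (stride 1, no padding)
    """
    C = len(data)
    H, W = len(data[0]), len(data[0][0])
    KH, KW = len(kernel[0]), len(kernel[0][0])
    OH, OW = H - KH + 1, W - KW + 1
    out = [[[0.0] * OW for _ in range(OH)] for _ in range(C)]
    for c in range(C):                      # each channel uses its own filter
        for i in range(OH):
            for j in range(OW):
                acc = 0.0
                for di in range(KH):
                    for dj in range(KW):
                        acc += data[c][i + di][j + dj] * kernel[c][di][dj]
                out[c][i][j] = acc          # no summation across channels
    return out
```

The spatial pack schedule in this PR reorganizes these loops into tiled, vectorized form; the new compute_at knob lets the tuner choose at which loop level the packed data is computed.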
On my Cortex-A53@2.0GHz ARM CPU (MTK6763), this boosts performance by 1.6X compared with the previous depthwise convolution schedule in the MobileNet V1 model (I have also verified the correctness of this schedule).
The following is the AutoTVM tuning GFLOPS log for the TensorFlow MobileNet V1 model.
Before this PR:
[Task 2/20] Current/Best: 0.98/ 2.32 GFLOPS | Progress: (1427/2000) | 2679.82 s Done.
[Task 4/20] Current/Best: 0.56/ 1.15 GFLOPS | Progress: (1072/2000) | 2461.27 s Done.
[Task 6/20] Current/Best: 1.08/ 2.78 GFLOPS | Progress: (1084/2000) | 1987.91 s Done.
[Task 8/20] Current/Best: 0.39/ 1.19 GFLOPS | Progress: (1815/2000) | 2744.70 s Done.
[Task 10/20] Current/Best: 1.09/ 2.33 GFLOPS | Progress: (1222/2000) | 1866.02 s Done.
[Task 12/20] Current/Best: 0.42/ 0.90 GFLOPS | Progress: (1716/2000) | 2528.94 s Done.
[Task 14/20] Current/Best: 1.89/ 2.63 GFLOPS | Progress: (1284/2000) | 2288.55 s Done.
[Task 16/20] Current/Best: 0.47/ 0.96 GFLOPS | Progress: (1467/2000) | 2282.65 s Done.
[Task 18/20] Current/Best: 1.43/ 2.61 GFLOPS | Progress: (1007/2000) | 1525.76 s Done.
After this PR:
[Task 2/20] Current/Best: 0.00/ 4.83 GFLOPS | Progress: (1682/2000) | 1470.40 s Done.
[Task 4/20] Current/Best: 1.35/ 3.17 GFLOPS | Progress: (1257/2000) | 1032.80 s Done.
[Task 6/20] Current/Best: 2.04/ 5.49 GFLOPS | Progress: (1904/2000) | 1623.10 s Done.
[Task 8/20] Current/Best: 0.75/ 3.15 GFLOPS | Progress: (1885/2000) | 1546.22 s Done.
[Task 10/20] Current/Best: 2.09/ 6.07 GFLOPS | Progress: (2000/2000) | 1640.41 s Done.
[Task 12/20] Current/Best: 2.99/ 3.80 GFLOPS | Progress: (1853/2000) | 1547.13 s Done.
[Task 14/20] Current/Best: 4.59/ 6.06 GFLOPS | Progress: (1355/2000) | 1091.93 s Done.
[Task 16/20] Current/Best: 1.96/ 4.01 GFLOPS | Progress: (2000/2000) | 1586.18 s Done.
[Task 18/20] Current/Best: 2.33/ 4.63 GFLOPS | Progress: (2000/2000) | 1599.89 s Done.
The total depthwise convolution execution time on a single A53@2.0GHz improves from 45.3839 ms to 28.1945 ms. One thing you must note when using this schedule: you MUST set the XGBTuner constructor's feature type argument to feature_type='knob', i.e. XGBTuner(tsk, loss_type='rank', feature_type='knob'). Otherwise your program may hang forever.
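A minimal AutoTVM tuning-loop sketch with the required feature_type='knob' setting is shown below. The device key, RPC host/port, and log file name are placeholders you would replace for your own setup; `tasks` is assumed to come from task extraction on your model.

```python
# Sketch of an AutoTVM tuning loop for this schedule (not a complete script).
# 'device_key', host, port, and 'tune.log' are placeholder values.
from tvm import autotvm
from tvm.autotvm.tuner import XGBTuner

for i, tsk in enumerate(tasks):
    # feature_type='knob' is required for this schedule; with the default
    # feature type the program may hang forever, as noted above.
    tuner = XGBTuner(tsk, loss_type='rank', feature_type='knob')
    tuner.tune(
        n_trial=2000,
        measure_option=autotvm.measure_option(
            builder=autotvm.LocalBuilder(),
            runner=autotvm.RPCRunner('device_key', host='0.0.0.0', port=9190)),
        callbacks=[autotvm.callback.log_to_file('tune.log')])
```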
@merrymercy @tqchen Please review.