Conversation

@FrozenGene FrozenGene commented Dec 27, 2018

Original discussion about this PR: #2028

This PR leverages the existing spatial pack schedule and adds a tunable compute_at knob to re-implement ARM CPU depthwise convolution. As discussed in #2028, this PR names the schedule contrib_spatial_pack.

On my A53@2.0GHz ARM CPU (MTK6763), it boosts depthwise convolution performance in the MobileNet V1 model by 1.6x compared with the previous depthwise convolution schedule (I have also checked the correctness of this schedule). @yzhliu has also verified that this PR boosts performance on x86 CPUs.
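
To make the idea of a tunable compute_at knob concrete, here is a minimal, self-contained sketch of an AutoTVM template. It is a toy two-stage pipeline written against the pre-0.7 tvm/autotvm API, not the actual depthwise convolution schedule in this PR:

```python
# Illustrative sketch only (NOT this PR's schedule): expose the compute_at
# location of an intermediate stage as a tunable knob in an AutoTVM template.
import tvm
from tvm import autotvm

@autotvm.template
def two_stage_with_compute_at_knob(n):
    A = tvm.placeholder((n, n), name='A')
    B = tvm.compute((n, n), lambda i, j: A[i, j] + 1.0, name='B')
    C = tvm.compute((n, n), lambda i, j: B[i, j] * 2.0, name='C')

    s = tvm.create_schedule(C.op)
    cfg = autotvm.get_config()
    i, j = s[C].op.axis

    # Enumerate candidate attachment points for stage B; the tuner picks the
    # best one, in the same spirit as making compute_at tunable for the
    # packed/intermediate stages of depthwise convolution.
    cfg.define_knob('B_attach', ['root', 'axis_i', 'axis_j', 'inline'])
    if cfg['B_attach'].val == 'axis_i':
        s[B].compute_at(s[C], i)
    elif cfg['B_attach'].val == 'axis_j':
        s[B].compute_at(s[C], j)
    elif cfg['B_attach'].val == 'inline':
        s[B].compute_inline()
    # 'root' keeps B as a standalone stage

    return s, [A, C]
```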

The following is the AutoTVM tuning GFLOPS log for the TensorFlow MobileNet V1 model.
Before this PR:
[Task 2/20] Current/Best: 0.98/ 2.32 GFLOPS | Progress: (1427/2000) | 2679.82 s Done.
[Task 4/20] Current/Best: 0.56/ 1.15 GFLOPS | Progress: (1072/2000) | 2461.27 s Done.
[Task 6/20] Current/Best: 1.08/ 2.78 GFLOPS | Progress: (1084/2000) | 1987.91 s Done.
[Task 8/20] Current/Best: 0.39/ 1.19 GFLOPS | Progress: (1815/2000) | 2744.70 s Done.
[Task 10/20] Current/Best: 1.09/ 2.33 GFLOPS | Progress: (1222/2000) | 1866.02 s Done.
[Task 12/20] Current/Best: 0.42/ 0.90 GFLOPS | Progress: (1716/2000) | 2528.94 s Done.
[Task 14/20] Current/Best: 1.89/ 2.63 GFLOPS | Progress: (1284/2000) | 2288.55 s Done.
[Task 16/20] Current/Best: 0.47/ 0.96 GFLOPS | Progress: (1467/2000) | 2282.65 s Done.
[Task 18/20] Current/Best: 1.43/ 2.61 GFLOPS | Progress: (1007/2000) | 1525.76 s Done.

After this PR:
[Task 2/20] Current/Best: 0.00/ 4.83 GFLOPS | Progress: (1682/2000) | 1470.40 s Done.
[Task 4/20] Current/Best: 1.35/ 3.17 GFLOPS | Progress: (1257/2000) | 1032.80 s Done.
[Task 6/20] Current/Best: 2.04/ 5.49 GFLOPS | Progress: (1904/2000) | 1623.10 s Done.
[Task 8/20] Current/Best: 0.75/ 3.15 GFLOPS | Progress: (1885/2000) | 1546.22 s Done.
[Task 10/20] Current/Best: 2.09/ 6.07 GFLOPS | Progress: (2000/2000) | 1640.41 s Done.
[Task 12/20] Current/Best: 2.99/ 3.80 GFLOPS | Progress: (1853/2000) | 1547.13 s Done.
[Task 14/20] Current/Best: 4.59/ 6.06 GFLOPS | Progress: (1355/2000) | 1091.93 s Done.
[Task 16/20] Current/Best: 1.96/ 4.01 GFLOPS | Progress: (2000/2000) | 1586.18 s Done.
[Task 18/20] Current/Best: 2.33/ 4.63 GFLOPS | Progress: (2000/2000) | 1599.89 s Done.

The total depthwise convolution execution time on a single A53@2.0GHz drops from 45.3839 ms to 28.1945 ms.

One thing you must note when using this schedule: you MUST set the XGBTuner constructor's feature type argument to feature_type='knob', i.e. XGBTuner(tsk, loss_type='rank', feature_type='knob'). Otherwise your program may hang forever.
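
For concreteness, here is a minimal sketch of a tuning loop with this setting. The task list `tasks`, the RPC device key, host/port, trial count, and log file name are placeholders, not part of this PR:

```python
from tvm import autotvm
from tvm.autotvm.tuner import XGBTuner

# Placeholder measurement setup: the device key, host and port are illustrative.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.RPCRunner('mtk6763', host='0.0.0.0', port=9190, number=5))

# `tasks` is assumed to be extracted beforehand, e.g. via
# autotvm.task.extract_from_program(...).
for tsk in tasks:
    # feature_type='knob' is required for this schedule; other feature types
    # may cause the program to hang (see the note above).
    tuner = XGBTuner(tsk, loss_type='rank', feature_type='knob')
    tuner.tune(n_trial=2000,
               measure_option=measure_option,
               callbacks=[autotvm.callback.progress_bar(2000),
                          autotvm.callback.log_to_file('depthwise_tuning.log')])
```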

This schedule is not the default schedule (i.e. direct) for ARM CPU / x86 CPU depthwise convolution. I will update the ARM CPU auto-tuning tutorial to show how to use this contrib_spatial_pack schedule in a follow-up PR.
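
Until that tutorial lands, one possible pattern is to re-create the extracted tasks with the contrib_spatial_pack template key, modeled on how alternative templates (such as winograd) are selected in the existing ARM CPU tuning tutorial. The task name check and the use of template_key below are assumptions for illustration, not something this PR prescribes:

```python
from tvm import autotvm

# Assumed pattern: re-create the extracted depthwise conv2d tasks with the
# contrib_spatial_pack template key. `tasks` comes from
# autotvm.task.extract_from_program(...); the task name below is an assumption.
for i in range(len(tasks)):
    if tasks[i].name == 'topi_nn_depthwise_conv2d_nchw':
        tasks[i] = autotvm.task.create(tasks[i].name, tasks[i].args,
                                       tasks[i].target, tasks[i].target_host,
                                       template_key='contrib_spatial_pack')
```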

@merrymercy @yzhliu @tqchen, please review it.

@FrozenGene FrozenGene (Member, Author) commented:

Hi @merrymercy, could you spend some time reviewing? Thanks.

<< "output channels must divide group size";

TShape wshape({param.channels / param.groups,
// Restore depthwise conv2d kernel layout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the old code is strange and incorrect while your code is straightforward and correct. We can delete this comment.

Diff context under review:

    # Currently, Mali schedule doesn't use it like conv2d.
    if cfg.is_fallback:
        ref_log = autotvm.tophub.load_reference_log('arm_cpu', 'rk3399', 'depthwise_conv2d_nchw',
                                                    'direct')

Suggested change (Member):

    -                                               'direct')
    +                                               'contrib_spatial_pack')

@merrymercy merrymercy left a comment

Overall looks good. But recently we updated alter_op_layout to support Relay (#2356; a new argument 'F' was introduced). Please resolve the conflict.

@FrozenGene FrozenGene force-pushed the arm_cpu_depthwise_convolution branch from f26cdfd to 4cfc239 on January 10, 2019 09:59
@FrozenGene FrozenGene force-pushed the arm_cpu_depthwise_convolution branch from 4cfc239 to 1467f34 on January 10, 2019 10:19
@FrozenGene
Copy link
Member Author

@merrymercy I have modified the code as you suggested. Please review it again.

@FrozenGene FrozenGene force-pushed the arm_cpu_depthwise_convolution branch 3 times, most recently from 6fccb6d to ff30450 on January 10, 2019 10:55
@FrozenGene FrozenGene force-pushed the arm_cpu_depthwise_convolution branch from ff30450 to 0e86fe0 on January 10, 2019 10:55
@merrymercy merrymercy merged commit 394cf9f into apache:master Jan 11, 2019
@ZihengJiang ZihengJiang mentioned this pull request Feb 1, 2019
wweic pushed a commit to neo-ai/tvm that referenced this pull request Feb 20, 2019
…pache#2345)

* Add sptialpack schedule for arm cpu depthwise convolution

* Supply comments.
@FrozenGene FrozenGene deleted the arm_cpu_depthwise_convolution branch September 10, 2019 13:39