Optimization for subpixel layer on Tensor core

I found that the Depth_to_Space layer spends too much time changing the data layout (NHWC <-> NCHW) when running on Tensor Cores. The transposes alone take up to 25% of the run time.

Is it possible to reduce this kind of unnecessary data manipulation, for example by combining the reshape and/or transpose into one op?

A sample network:
```
scale = 4

conv(3, 64),
conv(64, scale**2),
subpixel(scale),
conv(64 // scale**2, 3)
```

It is then executed as:
```
scale = 4

nchwToNhwc(), 
conv(3, 64),
conv(64, scale**2),
nhwcToNchw(),
reshape and transpose
nchwToNhwc(), 
conv(64 // scale**2, 3)
nhwcToNchw(),
```
I think nchwToNhwc is inserted automatically by CUDA. Converting the whole network to NHWC once, before using Tensor Cores, might be a better choice, since the intermediate layout conversions would then no longer be needed.
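To illustrate why staying in NHWC should be possible, here is a minimal NumPy sketch (not TVM code) of the subpixel reshape and transpose written directly for NHWC, checked against the usual NCHW version. The function names `depth_to_space_nchw` and `depth_to_space_nhwc` are hypothetical, the tensor shapes are arbitrary, and the channel ordering `c * r * r + i * r + j` is an assumption (the pixel-shuffle convention); some frameworks order the channels differently.

```python
import numpy as np


def depth_to_space_nchw(x, r):
    """Subpixel / pixel shuffle on an NCHW tensor.

    Assumes input channel index = c_out * r * r + i * r + j.
    """
    n, c, h, w = x.shape
    c_out = c // (r * r)
    x = x.reshape(n, c_out, r, r, h, w)   # axes: n, c_out, i, j, h, w
    x = x.transpose(0, 1, 4, 2, 5, 3)     # axes: n, c_out, h, i, w, j
    return x.reshape(n, c_out, h * r, w * r)


def depth_to_space_nhwc(x, r):
    """Same operation on an NHWC tensor, with no NCHW round trip."""
    n, h, w, c = x.shape
    c_out = c // (r * r)
    x = x.reshape(n, h, w, c_out, r, r)   # axes: n, h, w, c_out, i, j
    x = x.transpose(0, 1, 4, 2, 5, 3)     # axes: n, h, i, w, j, c_out
    return x.reshape(n, h * r, w * r, c_out)


# Illustrative shapes only: batch 2, 16 channels, 3x5 feature map, scale 2.
x_nchw = np.arange(2 * 16 * 3 * 5, dtype=np.float32).reshape(2, 16, 3, 5)
x_nhwc = x_nchw.transpose(0, 2, 3, 1)

ref = depth_to_space_nchw(x_nchw, 2)      # NCHW result
out = depth_to_space_nhwc(x_nhwc, 2)      # NHWC result

# Both paths agree once the reference is viewed as NHWC.
assert np.array_equal(out, ref.transpose(0, 2, 3, 1))
```

Under these assumptions the NHWC variant needs only one reshape/transpose/reshape, so the nhwcToNchw() and nchwToNhwc() calls around the subpixel step would be redundant if the surrounding convolutions also ran in NHWC.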

Alternatively, a feature like the one in this PR might help: https://github.com/apache/incubator-tvm/pull/4335

Optimization for subpixel layer on Tensor core #4523