Optimization for subpixel layer on Tensor core #4523

@kice

I found that the Depth_to_Space layer spends too much time changing the data layout (NHWC <-> NCHW) when Tensor Cores are used. The transposes take up to 25% of the run time.

Is it possible to reduce this kind of unnecessary data manipulation, for example by combining the reshape and/or transpose into a single op?

A sample network:

scale = 4

conv(3, 64),
conv(64, scale**2),
subpixel(scale),
conv(64 // scale**2, 3)

At run time, this executes as:

scale = 4

nchwToNhwc(), 
conv(3, 64),
conv(64, scale**2),
nhwcToNchw(),
reshape and transpose
nchwToNhwc(), 
conv(64 // scale**2, 3)
nhwcToNchw(),

I think the nchwToNhwc conversion is inserted automatically by CUDA. Converting the whole network to NHWC once, before it reaches the Tensor Cores, might be a better choice.

Or perhaps a feature like the one in PR #4335.
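To illustrate the request, here is a minimal NumPy sketch (not the framework's actual kernel) showing that the subpixel step can be done directly in NHWC with one reshape, one transpose, and one reshape, which gives the same result as the NHWC -> NCHW -> subpixel -> NHWC round trip. The function names are hypothetical, the channel ordering is assumed to follow the common pixel-shuffle convention (channel index = c * r * r + i * r + j), and the input is assumed to carry 64 channels so that the output has 64 // scale**2 channels.

```python
import numpy as np


def depth_to_space_nchw(x, r):
    """Reference subpixel op on an NCHW tensor (assumed pixel-shuffle ordering)."""
    n, c, h, w = x.shape
    c_out = c // (r * r)
    x = x.reshape(n, c_out, r, r, h, w)        # split channels into (c_out, i, j)
    x = x.transpose(0, 1, 4, 2, 5, 3)          # -> (n, c_out, h, i, w, j)
    return x.reshape(n, c_out, h * r, w * r)


def depth_to_space_nhwc(x, r):
    """Same op applied directly to an NHWC tensor, with no layout round trip."""
    n, h, w, c = x.shape
    c_out = c // (r * r)
    x = x.reshape(n, h, w, c_out, r, r)        # split channels into (c_out, i, j)
    x = x.transpose(0, 1, 4, 2, 5, 3)          # -> (n, h, i, w, j, c_out)
    return x.reshape(n, h * r, w * r, c_out)


def depth_to_space_roundtrip(x_nhwc, r):
    """What the issue describes: nhwcToNchw, reshape/transpose, nchwToNhwc."""
    x_nchw = x_nhwc.transpose(0, 3, 1, 2)      # nhwcToNchw
    y_nchw = depth_to_space_nchw(x_nchw, r)    # reshape and transpose
    return y_nchw.transpose(0, 2, 3, 1)        # nchwToNhwc


scale = 4
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 7, 64)).astype(np.float32)  # NHWC activations

direct = depth_to_space_nhwc(x, scale)
via_nchw = depth_to_space_roundtrip(x, scale)
print(direct.shape, np.array_equal(direct, via_nchw))
```

If the framework kept activations in NHWC between the convolutions, an op like `depth_to_space_nhwc` would replace the two layout conversions around the subpixel layer with a single data movement.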
