Closed
Description
I found that the Depth_to_Space layer spends too much time changing the data layout (NHWC <-> NCHW) when using Tensor Cores. The transposes take up to 25% of the run time.
Is it possible to reduce this kind of unnecessary data manipulation, for example by combining the reshape and/or transpose into one op?
A sample network:
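To illustrate what combining the reshape and transpose could look like, here is a minimal NumPy sketch that does depth-to-space directly on an NHWC tensor, with no NCHW round trip. The function names are hypothetical, and it assumes the channel axis is ordered as (c_out, r, r); other channel-ordering conventions would need a different permutation.

```python
import numpy as np

def depth_to_space_nchw(x, r):
    """Reference: depth-to-space on an NCHW tensor (channels ordered as c_out, r, r)."""
    n, c, h, w = x.shape
    c_out = c // (r * r)
    x = x.reshape(n, c_out, r, r, h, w)
    x = x.transpose(0, 1, 4, 2, 5, 3)      # -> (n, c_out, h, r, w, r)
    return x.reshape(n, c_out, h * r, w * r)

def depth_to_space_nhwc(x, r):
    """Same operation applied directly to an NHWC tensor, so no layout conversion is needed."""
    n, h, w, c = x.shape
    c_out = c // (r * r)
    x = x.reshape(n, h, w, c_out, r, r)
    x = x.transpose(0, 1, 4, 2, 5, 3)      # -> (n, h, r, w, r, c_out)
    return x.reshape(n, h * r, w * r, c_out)
```

Both functions produce the same values, only in different layouts, so the NHWC variant could in principle sit between two NHWC convolutions without any conversion around it.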
scale = 4
conv(3, 64),
conv(64, scale**2),
subpixel(scale),
conv(64 // scale**2, 3)
What actually gets executed is:
scale = 4
nchwToNhwc(),
conv(3, 64),
conv(64, scale**2),
nhwcToNchw(),
reshape and transpose
nchwToNhwc(),
conv(64 // scale**2, 3)
nhwcToNchw(),
I think nchwToNhwc is done automatically by CUDA. Converting the whole network to NHWC once, before using Tensor Cores, might be a better choice.
Or a feature like the one in PR #4335.
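As a rough sketch of the idea (not how any particular framework does it), the redundant conversions in the trace above could be removed by a small peephole pass. The op names are just the labels from the trace, and treating `subpixel` as able to run in either layout is an assumption.

```python
# Layout conversions and their inverses, using the labels from the trace above.
INVERSE = {"nhwcToNchw": "nchwToNhwc", "nchwToNhwc": "nhwcToNchw"}
# Assumption: these ops have an implementation for either layout.
LAYOUT_AGNOSTIC = {"subpixel"}

def cancel_layout_roundtrips(ops):
    """Remove conversion pairs that undo each other, directly or around a layout-agnostic op."""
    out = []
    for op in ops:
        out.append(op)
        changed = True
        while changed:
            changed = False
            # Pattern: conversion, layout-agnostic op, inverse conversion.
            if (len(out) >= 3 and out[-2] in LAYOUT_AGNOSTIC
                    and INVERSE.get(out[-3]) == out[-1]):
                middle = out[-2]
                del out[-3:]
                out.append(middle)
                changed = True
            # Pattern: conversion immediately followed by its inverse.
            elif len(out) >= 2 and INVERSE.get(out[-2]) == out[-1]:
                del out[-2:]
                changed = True
    return out

trace = ["nchwToNhwc", "conv", "conv", "nhwcToNchw",
         "subpixel", "nchwToNhwc", "conv", "nhwcToNchw"]
optimized = cancel_layout_roundtrips(trace)
```

On this trace the pass keeps only the first and last conversions, so the whole network runs in NHWC between them.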