Conversation

@icemelon
Member

@icemelon icemelon commented Apr 2, 2020

Using cuDNN can improve softmax performance on NVIDIA GPUs.

@yzhliu @Laurawly @ZihengJiang
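For reference, the call this PR offloads to is cuDNN's dedicated softmax entry point. Below is a minimal sketch of the invocation (not the PR's exact code); handle, x_desc, y_desc, and the device pointers x and y are assumed to be set up elsewhere:

// Hedged sketch: the cuDNN handle and tensor descriptors are assumed
// to be created elsewhere.
const float alpha = 1.0f, beta = 0.0f;
cudnnSoftmaxForward(handle,
                    CUDNN_SOFTMAX_ACCURATE,       // subtracts the row max first for stability
                    CUDNN_SOFTMAX_MODE_INSTANCE,  // softmax per instance across C,H,W
                    &alpha, x_desc, x,
                    &beta, y_desc, y);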


// Set mode and shape descriptor
if (axis == ndim - 1) {
  int64_t N = 1;  // accumulates the product of the leading dims
Member

I'm confused about why we need int64_t here when we later cast it to int.

Member Author

It's because DLTensor defines its shape as int64_t, so there will be a cast anyway.
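For context, a short sketch of where the narrowing happens. The DLTensor fields are abridged from dlpack.h, and the descriptor call uses cuDNN's actual cudnnSetTensor4dDescriptor signature; tensor, desc, dtype, and ndim are assumed from the surrounding code:

// dlpack.h (abridged): shape extents are 64-bit.
//   typedef struct { ...; int32_t ndim; int64_t* shape; ... } DLTensor;
// cudnnSetTensor4dDescriptor takes plain ints, so the product of the
// leading dims has to be narrowed before it reaches cuDNN.
int64_t N = 1;
for (int i = 0; i < ndim - 1; ++i) N *= tensor->shape[i];
cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, dtype,
                           static_cast<int>(N),
                           static_cast<int>(tensor->shape[ndim - 1]),
                           /*h=*/1, /*w=*/1);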

@tqchen
Member

tqchen commented Apr 2, 2020

As a matter of principle, it would be great if we could look into making the native op as fast.

@icemelon
Member Author

icemelon commented Apr 3, 2020

@tqchen Yes, I understand that. But the latency difference between the TVM schedule and cuDNN can be 10x for an input shape like [100, 1024] on a V100. I suspect that reaching such performance requires fusion across multiple stages of reduction, which does not seem easy to implement in TIR.

@tqchen
Member

tqchen commented Apr 3, 2020

OK, I am not trying to block the PR, merely trying to say it would be great to have such an investigation :)

@tqchen tqchen merged commit 799ff35 into apache:master Apr 6, 2020
@tqchen
Member

tqchen commented Apr 6, 2020

@wpan11nv @yongfeng-nv can you suggest some possible optimizations that could be done?

@yongfeng-nv
Contributor

> @wpan11nv @yongfeng-nv can you suggest some possible optimizations that could be done?

We don't know the details, but will look into it.

icemelon added a commit to icemelon/tvm that referenced this pull request Apr 14, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Apr 16, 2020
zhiics pushed a commit to neo-ai/tvm that referenced this pull request Apr 17, 2020
@wpan11nv
Contributor

wpan11nv commented Apr 20, 2020

> @wpan11nv @yongfeng-nv can you suggest some possible optimizations that could be done?

The CUDA schedule emits 4 kernels, which causes a lot of I/O overhead. Ideally, we could emit a single kernel for small reduction sizes (e.g., reduction dim n <= 1024).
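To make the single-kernel idea concrete, here is a sketch of a fused row softmax: one thread block per row, a strided loop over the row, and a power-of-two block size. The kernel name and launch shape are illustrative, not what TVM's schedule generates:

#include <cuda_runtime.h>
#include <math.h>

// All three stages (max, sum of exp, normalize) run in one kernel, so the
// intermediates never round-trip through global memory.
__global__ void softmax_fused(const float* __restrict__ x,
                              float* __restrict__ y, int n) {
  extern __shared__ float buf[];  // one float per thread
  const float* row_in = x + blockIdx.x * (size_t)n;
  float* row_out = y + blockIdx.x * (size_t)n;
  int tid = threadIdx.x;

  // Stage 1: row max via strided loads, then a shared-memory tree reduction.
  float m = -INFINITY;
  for (int i = tid; i < n; i += blockDim.x) m = fmaxf(m, row_in[i]);
  buf[tid] = m;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) buf[tid] = fmaxf(buf[tid], buf[tid + s]);
    __syncthreads();
  }
  m = buf[0];
  __syncthreads();

  // Stage 2: sum of exp(x - max), same reduction pattern.
  float sum = 0.f;
  for (int i = tid; i < n; i += blockDim.x) sum += expf(row_in[i] - m);
  buf[tid] = sum;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) buf[tid] += buf[tid + s];
    __syncthreads();
  }
  sum = buf[0];

  // Stage 3: normalize.
  for (int i = tid; i < n; i += blockDim.x)
    row_out[i] = expf(row_in[i] - m) / sum;
}

Launched as softmax_fused<<<num_rows, 256, 256 * sizeof(float)>>>(x, y, n), the row max and sum stay on chip instead of being written back to global memory between separate kernels.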

dpankratz pushed a commit to dpankratz/incubator-tvm that referenced this pull request Apr 24, 2020
@tqchen
Member

tqchen commented Jun 5, 2020

See #5600 for improving softmax with warp shuffle.
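For reference, the warp-shuffle primitive that change builds on lets a warp reduce without shared memory; a minimal sketch (the helper name is illustrative):

// Sum reduction within one warp using __shfl_down_sync: after five
// halving steps, lane 0 holds the sum of all 32 lanes.
__device__ __forceinline__ float warp_reduce_sum(float v) {
  for (int offset = 16; offset > 0; offset >>= 1)
    v += __shfl_down_sync(0xffffffff, v, offset);
  return v;
}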

@icemelon icemelon deleted the softmax-cudnn branch July 21, 2020 22:53