GPU version of SoftmaxWithLossLayer #1789

Merged
shelhamer merged 1 commit into BVLC:dev from SaganBolliger:softmax_loss_gpu on Feb 7, 2015

Conversation

@SaganBolliger
Contributor

Just thought I'd contribute something to what seems like a great community. The new forward and backward functions pass the existing test cases for SoftmaxWithLossLayer.

I'm not entirely sure about the lines in which I call cpu_data() on a blob of count() 1 so that I can pass the retrieved value to caffe_gpu_scal as the scalar. I'm not proficient in profiling CUDA code, so I don't know whether this is a bottleneck worth eliminating with a solution that avoids device-to-host transfer entirely, or whether transferring a single scalar incurs such a minor performance hit that it's not worth optimizing.
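Concretely, the pattern in question is roughly the following (a simplified sketch, not the exact diff; the blob names are illustrative):

```cpp
// Simplified sketch (names illustrative, not the exact diff). The loss
// weight lives in a blob of count() 1 on the device; reading cpu_data()
// forces a device-to-host copy of that single scalar so it can be passed
// to caffe_gpu_scal as a plain value.
const Dtype loss_weight = top[0]->cpu_data()[0];  // D2H copy of one value
caffe_gpu_scal(bottom[0]->count(), loss_weight,
               bottom[0]->mutable_gpu_diff());    // scale gradient on GPU
```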

Cheers!

@shelhamer
Member

Hey @SaganBolliger, thanks for addressing this longtime TODO with your first contribution. Check the suggestions at http://caffe.berkeleyvision.org/development.html, and in particular run

make lint

and fix any complaints it makes. Once that's done, please squash these changes into a single commit and push to your branch for us to review.

@SaganBolliger
Contributor Author

You're most welcome. I've fixed the lint issues and squashed the commits. Let me know if you have any further feedback.

@shelhamer
Member

@longjon this should be coordinated with #1654. Should this be merged now, with #1654 to follow once it's updated with the GPU code? It's up to you.

@SaganBolliger
Contributor Author

Once there are tests written for #1654, I can update the GPU code to pass the tests.

@SaganBolliger
Contributor Author

Just an FYI: I've been using the code from this PR in my own modeling experiments and I'm running into some issues that the existing test cases don't catch. I'm trying to debug it and write additional test cases to cover the problem. I'll update here when it's fixed, so please don't merge until then. Thanks!

@longjon
Contributor

longjon commented Jan 27, 2015

This looks pretty nice. Since @SaganBolliger has volunteered to update for #1654, that plan sounds good to me. Just two comments from a quick glance:

  • The kernel names don't seem quite right; SoftmaxForwardGPU actually does (part of) the multinomial logistic forward, while the softmax layer handles the softmax forward itself, and SoftmaxBackwardGPU does (part of) the softmax loss backward... it would be better to name the kernels according to what they actually compute, IMO.
  • Have you done a simple timing test, CPU vs GPU? I'm sure this is faster, but it would be nice to have a ballpark number.

@SaganBolliger
Contributor Author

I've updated the code to reflect the changes in #1654. It passes the new test cases that were added there.

@shelhamer Looks like the issues I was running into aren't related to this code, so I'd say this is safe to merge.

@longjon I've renamed the kernels to SoftmaxLoss{Backward,Forward}GPU to avoid confusion with the SoftmaxLayer itself. It's a bit tricky to find a concise name for the exact computation in the kernels, though, so I'm open to suggestions.
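For context, the forward kernel boils down to roughly this (a simplified sketch with the spatial dimensions and any ignore-label handling omitted; not the exact diff):

```cuda
// Simplified sketch of the forward kernel: it computes the per-example
// multinomial logistic loss from the already-computed softmax
// probabilities (spatial dims omitted; names illustrative).
template <typename Dtype>
__global__ void SoftmaxLossForwardGPU(const int num, const int dim,
    const Dtype* prob_data, const Dtype* label, Dtype* loss) {
  CUDA_KERNEL_LOOP(index, num) {  // Caffe's grid-stride loop macro
    const int label_value = static_cast<int>(label[index]);
    // Clamp the probability at FLT_MIN so log() never sees zero.
    loss[index] = -log(max(prob_data[index * dim + label_value],
                           Dtype(FLT_MIN)));
  }
}
```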

I'm currently working on an older MacBook with a slow GPU, so benchmarking the CPU vs GPU versions produces comparable results; that's par for the course with this GPU, as most networks take slightly longer to train on the GPU than on the CPU for me. If anyone with access to better hardware would be willing to benchmark it, that would be great. Otherwise, I'm planning to set up an AWS instance in the next week for some more serious computation and could run a benchmark there.

@shelhamer
Member

I benchmarked a 1000 x 1000 x 1 x 1 softmax loss over 100 iterations on the CPU, on dev with the host-device roundtrip, and on this PR, which does give a speedup. The improvement will scale as the output grows, as it does for sliding-window detection or semantic segmentation, since it saves the roundtrip time of sending the output map back to the host.

CPU mode:

loss    forward: 42.2155 ms.
loss    backward: 0.00059 ms.

dev GPU mode (device-host communication):

loss    forward: 752.946 ms.
loss    backward: 0.001216 ms.

This PR:

loss    forward: 3.5701 ms.
loss    backward: 0.00121536 ms.
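For anyone who wants to reproduce these numbers, a minimal timing loop along these lines works (a sketch assuming the layer and its bottom/top blobs are already set up as in the layer tests; not the exact benchmark script):

```cpp
#include "caffe/util/benchmark.hpp"  // caffe::Timer

// Sketch only: `layer`, `bottom_vec`, and `top_vec` are assumed to be
// configured beforehand, as in the SoftmaxWithLossLayer tests.
caffe::Timer timer;
timer.Start();
for (int i = 0; i < 100; ++i) {
  layer.Forward(bottom_vec, top_vec);
}
timer.Stop();
LOG(INFO) << "loss forward: " << timer.MilliSeconds() / 100 << " ms.";
```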

@shelhamer
Member

Thanks @SaganBolliger (and don't worry about the scalar for the loss weight)!

shelhamer added a commit that referenced this pull request on Feb 7, 2015: GPU version of SoftmaxWithLossLayer

@shelhamer merged commit 4905366 into BVLC:dev on Feb 7, 2015
@longjon
Contributor

longjon commented Feb 7, 2015

Just a couple post-mortem notes on an excellent PR:

  • The new kernel names are fine, definitely less confusing.
  • Using the diff for storage in forward is something we've discussed before. In general we'd prefer to avoid it, and might outlaw it in the future, but right now it's a technique we do use sometimes (for example in the hinge loss layer), so I'm okay with it here; well noted. (A rough sketch of the technique follows this list.)
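For readers unfamiliar with the technique, it amounts to roughly this (an illustrative sketch, not the exact PR code):

```cpp
// Illustrative sketch: during Forward, a bottom blob's diff buffer is
// unused (gradients are only written in Backward), so it can be
// borrowed as scratch space for the per-example losses.
Dtype* loss_data = bottom[0]->mutable_gpu_diff();  // borrowed, not a gradient
// ... a kernel writes `nthreads` per-example losses into loss_data ...
Dtype loss;
caffe_gpu_asum(nthreads, loss_data, &loss);  // sum the losses on the GPU
```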
