GPU version of SoftmaxWithLossLayer #1789
Conversation
Hey @SaganBolliger, thanks for addressing this longtime TODO with your first contribution. Check the suggestions at http://caffe.berkeleyvision.org/development.html, and in particular run the lint check and fix any complaints it makes. Once that's done, please squash these changes into a single commit and push to your branch for us to review.
(force-pushed from 9545437 to d770cda)
You're most welcome. I've fixed the lint issues and squashed the commits. Let me know if you have any further feedback.
Once there are tests written for #1654, I can update the GPU code to pass the tests.
Just an FYI, I've been using the code I submitted in this PR in my own modelling experiments and I'm running into some issues that are not being caught by the test cases. I'm trying to debug it and write some additional test cases to catch the issue. I'll update here when I've fixed the issue, so please don't merge until then. Thanks!
This looks pretty nice. Since @SaganBolliger has volunteered to update for #1654, that plan sounds good to me. Just two comments from a quick glance:
(force-pushed from d770cda to 4c894d0)
I've updated the code to reflect the changes in #1654, and it passes the new test cases that were added there. @shelhamer Looks like the issues I was running into aren't related to this code, so I'd say this is safe to merge.

@longjon I've renamed the kernels SoftmaxLoss{Backward,Forward}GPU to avoid confusion with the SoftmaxLayer itself. It's a bit tricky to find a concise name for the exact computation the kernels perform, though I'm open to suggestions.

I'm currently working on an older MacBook with a slow GPU, so benchmarking the CPU vs. GPU versions produces comparable results; that's par for the course with this GPU, as most networks take slightly longer to train on the GPU than on the CPU. If anyone with access to better hardware would be willing to benchmark it, that would be great. Otherwise, I'm planning to set up an AWS instance in the next week for some more serious computation and could run a benchmark there.
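For context, the computation in the forward kernel is roughly the following (a sketch, assuming Caffe's CUDA_KERNEL_LOOP grid-stride macro and a prob blob laid out as num x channels x spatial positions; the signature in the actual diff may differ):

```cuda
// Sketch of the per-position forward kernel. Each thread handles one
// (image n, spatial position s) pair and writes -log(prob of the true
// class) into a per-position loss buffer, which can then be summed on
// the device (e.g. with caffe_gpu_asum) to avoid any host roundtrip.
#include <cfloat>                            // FLT_MIN
#include "caffe/util/device_alternate.hpp"   // CUDA_KERNEL_LOOP

template <typename Dtype>
__global__ void SoftmaxLossForwardGPU(const int nthreads,
    const Dtype* prob_data, const Dtype* label, Dtype* loss,
    const int dim, const int spatial_dim) {
  CUDA_KERNEL_LOOP(index, nthreads) {
    const int n = index / spatial_dim;  // image in the batch
    const int s = index % spatial_dim;  // position within the image
    const int label_value = static_cast<int>(label[n * spatial_dim + s]);
    // Clamp away from zero so log() stays finite for vanishing probabilities.
    loss[index] = -log(max(prob_data[n * dim + label_value * spatial_dim + s],
                           Dtype(FLT_MIN)));
  }
}
```

The kernel would be launched with Caffe's usual CAFFE_GET_BLOCKS(nthreads) blocks of CAFFE_CUDA_NUM_THREADS threads, one thread per (image, position) pair.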
I benchmarked a 1000 x 1000 x 1 x 1 softmax loss over 100 iterations in three configurations: CPU, dev GPU (which does the host-device roundtrip), and this PR, which does give a speedup. The improvement will scale as the output grows, as it does for sliding-window detection or semantic segmentation, since it saves the roundtrip time of sending the output map back to the host.

CPU mode:
dev GPU mode (device-host communication):
This PR:
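For anyone who wants to repeat the measurement, a rough harness using caffe::Timer could look like this (a sketch: layer and blob setup are elided and assumed to match the SoftmaxWithLossLayer test fixtures):

```cpp
#include <vector>
#include "caffe/blob.hpp"
#include "caffe/layer.hpp"
#include "caffe/util/benchmark.hpp"  // caffe::Timer

// Time `iterations` forward/backward passes of an already set-up layer.
// In GPU mode caffe::Timer uses CUDA events, so MilliSeconds() reflects
// completed device work rather than just kernel-launch time.
template <typename Dtype>
float TimeForwardBackward(caffe::Layer<Dtype>* layer,
    const std::vector<caffe::Blob<Dtype>*>& bottom,
    const std::vector<caffe::Blob<Dtype>*>& top,
    const std::vector<bool>& propagate_down,
    const int iterations) {
  caffe::Timer timer;
  timer.Start();
  for (int i = 0; i < iterations; ++i) {
    layer->Forward(bottom, top);
    layer->Backward(top, propagate_down, bottom);
  }
  return timer.MilliSeconds();
}
```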
Thanks @SaganBolliger (and don't worry about the scalar for the loss weight)!
Just a couple post-mortem notes on an excellent PR:
Just thought I'd contribute something to what seems like a great community. The new forward and backward functions pass the existing test cases for SoftmaxWithLossLayer.
I'm not entirely sure about the lines where I call cpu_data() on a blob of count() 1 so that I can then call caffe_gpu_scal with the retrieved value as the scalar. I'm not proficient at profiling CUDA code, so I don't know whether this is a bottleneck that justifies writing a solution with no device-to-host data transfer at all, or whether transferring a single scalar incurs such a minor performance hit that it isn't worth optimizing.
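Concretely, the pattern in question looks something like this (a sketch with illustrative names, not the exact diff):

```cpp
// `weight` is a blob with count() == 1 whose data lives on the device.
// cpu_data() triggers a device-to-host copy of that single value so it
// can be passed to caffe_gpu_scal as a plain host-side scalar; the array
// being scaled never leaves the device.
const Dtype alpha = weight.cpu_data()[0];        // tiny D2H transfer
caffe_gpu_scal<Dtype>(diff.count(), alpha,       // scale in place
                      diff.mutable_gpu_diff());  // data stays on the GPU
```

One way to avoid the transfer entirely would be cuBLAS's device pointer mode (cublasSetPointerMode with CUBLAS_POINTER_MODE_DEVICE), which lets the scalar itself live in device memory, though for one 4- or 8-byte copy per backward pass the savings are likely negligible.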
Cheers!