
Put the acc_data in a new syncedmemory block #6141

Closed

sonack wants to merge 2 commits into BVLC:master from sonack:patch-1

Conversation


sonack commented Dec 30, 2017

When I was using the Accuracy layer in the training phase, I found something weird: the loss was increasing and the gradient was much larger than normal.
An easy reproduction is to uncomment the include_phase of the accuracy layer in caffe/examples/cifar10/cifar10_quick_train_test.prototxt and train with train_quick.sh.
After painstakingly reviewing the source code, I found that this line in accuracy_layer.cu is the cause. Its comment claims: "Since this memory is not used for anything, we use it here to avoid having to allocate new GPU memory to accumulate intermediate results in the kernel."
But that assumption is wrong: when a blob feeds both a loss layer and an accuracy layer, Caffe inserts a Split layer to duplicate it, and during the backward pass the diff of the split copy feeding the accuracy layer is never written (overwritten), so it should always stay zero. Here, however, its value is overwritten with the accuracy results, which makes the gradient below the Split layer much larger than it should be.
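
To illustrate, here is a simplified sketch of what the inserted Split layer's backward pass effectively does (paraphrased, not the exact Caffe source):

```cpp
// SplitLayer backward (simplified): the bottom gradient is the SUM of the
// diffs of all split copies. The copy feeding the loss layer carries the
// real gradient; the copy feeding the accuracy layer should stay zero.
caffe_copy(count_, top[0]->gpu_diff(), bottom[0]->mutable_gpu_diff());
for (int i = 1; i < top.size(); ++i) {
  // If top[i] is the accuracy layer's input and its diff was overwritten
  // with intermediate accuracy results, those values are added into the
  // gradient here, inflating it.
  caffe_gpu_axpy(count_, Dtype(1.), top[i]->gpu_diff(),
                 bottom[0]->mutable_gpu_diff());
}
```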

I modified this line to allocate a new SyncedMemory object to store the acc_data, which makes it independent of bottom[0]'s diff field and fixes the issue above.
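
Here is a minimal sketch of the idea, assuming Caffe's SyncedMemory API (the actual member name and placement in the patch may differ; acc_data_ below is illustrative):

```cpp
// Before: scratch space was borrowed from bottom[0]'s diff, which the Split
// layer's backward pass later reads:
//   Dtype* acc_data = bottom[0]->mutable_gpu_diff();

// After: accumulate intermediate results in a separately allocated buffer,
// independent of any blob's diff. acc_data_ is a hypothetical member, e.g.
//   shared_ptr<SyncedMemory> acc_data_;
acc_data_.reset(new SyncedMemory(bottom[0]->count() * sizeof(Dtype)));
Dtype* acc_data = static_cast<Dtype*>(acc_data_->mutable_gpu_data());
```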

Noiredd (Member) commented Jan 3, 2018

This problem was first spotted in #5981 and has a temporary workaround in #5987. How is your PR different from it? Let's pick one to work on rather than waste energy simultaneously developing two solutions to one problem.

