clear bottom diffs after used as temp memory in softmax/sigmoid loss …#6186

Closed
orzlibo wants to merge 1 commit into BVLC:master from orzlibo:master

Conversation

orzlibo commented Jan 18, 2018

This is basically the same problem as the one recently found in the Accuracy layer. The softmax/sigmoid loss layers use the bottom diff as temporary memory during Forward; when such a layer does not propagate down (i.e. Backward_gpu is never executed) but the bottom blob is shared with other layers, the leftover temporary data leads to invalid gradients.

This is common when multiple losses are used and some of them have loss_weight set to 0. A simple example shows it: duplicate the SoftmaxWithLoss layer in the MNIST example and set the copy's loss_weight to 0, and the loss goes to inf (~87) after a few hundred iterations.
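
To make the proposal concrete, here is a minimal sketch of where the clearing would go, assuming (as described above) that the GPU forward of the loss layer uses bottom[0]'s diff as scratch space. It is illustrative only, not the exact diff in this commit:

```cpp
// Sketch: at the end of SoftmaxWithLossLayer<Dtype>::Forward_gpu (and
// likewise SigmoidCrossEntropyLossLayer<Dtype>::Forward_gpu), after the
// per-element loss written into bottom[0]->mutable_gpu_diff() has been
// reduced into top[0], zero the borrowed scratch memory so that a shared
// (split) bottom never picks up stale values when this layer's Backward is
// skipped (e.g. because loss_weight is 0).
caffe_gpu_set(bottom[0]->count(), Dtype(0), bottom[0]->mutable_gpu_diff());
```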

I have also read the recent discussions about the Accuracy layer. In my view it is reasonable to use the bottom diff as temporary memory and then clear it (at least for now), because data diffs (i.e. those of top and bottom blobs) are meant to be **set**, while parameter diffs are meant to be **accumulated**. Some evidence supports this:

  1. Some widely used layers already follow this convention: they set the bottom diff and accumulate the parameter diff, the InnerProduct layer being one example. Likewise, the current solvers clear parameter diffs after iter_size ForwardBackward passes, but never clear the diffs of data blobs.
  2. Caffe handles shared bottoms with (automatically inserted) Split layers, which can be seen in the training log. In the Split layer, multiple tops share data with a single bottom, but their diffs are not shared; instead the Split layer's Backward accumulates the top diffs into the bottom diff (a rough sketch follows this list). This is exactly where the problem enters: in the current framework a layer cannot easily tell which of its tops actually need back-prop (only propagate_down is given in LayerParameter), so the Split layer has to treat all of its top diffs alike and ends up accumulating stale values from tops whose consuming layer never executed Backward().
  3. This design (bottom/top diffs set, parameter diffs accumulated) has its advantages. Consider the implementation of iter_size: if diffs of shared bottoms were accumulated instead, they would still have to be cleared between iterations, and bottom/top blobs are much larger than parameters (e.g. feature maps vs. weights in CNNs), so it is not obvious that this would be faster than the current Split-layer design.
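
For context on point 2, the gradient accumulation in the Split layer looks roughly like the following. This is a paraphrase of SplitLayer<Dtype>::Backward_cpu written for illustration, not a verbatim copy of the Caffe source:

```cpp
// The bottom diff is *set* to the sum of all top diffs. There is no way to
// skip a top whose consumer never ran Backward() and therefore left stale
// (scratch) data in its diff.
template <typename Dtype>
void SplitLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
  if (!propagate_down[0]) { return; }
  if (top.size() == 1) {
    // Single consumer: copy its diff straight down.
    caffe_copy(count_, top[0]->cpu_diff(), bottom[0]->mutable_cpu_diff());
    return;
  }
  // Two or more consumers: sum the first two diffs, then add the rest.
  caffe_add(count_, top[0]->cpu_diff(), top[1]->cpu_diff(),
            bottom[0]->mutable_cpu_diff());
  for (int i = 2; i < top.size(); ++i) {
    caffe_axpy(count_, Dtype(1.), top[i]->cpu_diff(),
               bottom[0]->mutable_cpu_diff());
  }
}
```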

That is as far as my understanding goes for now. My use of Caffe is limited to vision tasks, so further discussion may be needed for other tasks. I am also a newbie on GitHub, and I apologize for any inconvenience or offense in the discussion above.

orzlibo (author) commented Jan 18, 2018

Allocating internal buffers (as recently suggested) and clearing bottom diffs can be regarded as two sides of a trade-off between memory usage and computation. In my opinion, neither is absolutely the preferable choice.
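
To illustrate the other side of that trade-off, a dedicated scratch blob owned by the loss layer would look roughly like this. This is a hypothetical sketch (the member name loss_scratch_ is made up, not taken from any Caffe PR): it costs extra memory on the order of the bottom blob, but leaves the bottom diff untouched so no clearing is needed:

```cpp
// Hypothetical alternative: keep a per-layer scratch Blob instead of
// borrowing bottom[0]->mutable_gpu_diff() as temporary memory.
Blob<Dtype> loss_scratch_;  // hypothetical member of the loss layer

// In Reshape(): size the scratch like the input.
loss_scratch_.ReshapeLike(*bottom[0]);

// In Forward_gpu(): write the per-element loss here instead of into the diff.
Dtype* loss_data = loss_scratch_.mutable_gpu_data();
```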

shelhamer (member) commented:

See #6202 for a combined fix with further comments. It includes this fix and makes the analogous fix for the Accuracy layer.

shelhamer (member) commented:

@orzlibo Thanks for your thoroughly explained and straightforward pull request. Your interpretation and fix were right so I included them in a slightly more systematic PR #6202. That has just been merged so I'm closing this.
