Decouple the computational batch size and minibatch size by accumulating gradients #1663
Closed
Conversation
longjon added a commit to longjon/caffe that referenced this pull request on Dec 31, 2014: Decouple the computational batch size and minibatch size by accumulating gradients (a4d2e6d to d76653a)
longjon added a commit to longjon/caffe that referenced this pull request on Dec 31, 2014: Decouple the computational batch size and minibatch size by accumulating gradients
longjon added a commit to longjon/caffe that referenced this pull request on Jan 1, 2015: Decouple the computational batch size and minibatch size by accumulating gradients
longjon added a commit to longjon/caffe that referenced this pull request on Jan 2, 2015: Decouple the computational batch size and minibatch size by accumulating gradients
longjon added a commit to longjon/caffe that referenced this pull request on Jan 2, 2015: Decouple the computational batch size and minibatch size by accumulating gradients
longjon added a commit to longjon/caffe that referenced this pull request on Jan 3, 2015: Decouple the computational batch size and minibatch size by accumulating gradients
longjon added a commit to longjon/caffe that referenced this pull request on Jan 3, 2015: Decouple the computational batch size and minibatch size by accumulating gradients
With layers whose backward passes accumulate gradients, this effectively decouples the computational batch from the SGD minibatch: each iteration accumulates gradients over iter_size batches and then updates the parameters.
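As a rough illustration of the scheme (a minimal standalone sketch, not Caffe's Solver code; `Param`, `SgdStep`, and the commented `net` calls are placeholders), the effective minibatch becomes batch_size * iter_size while each forward/backward pass only ever processes batch_size examples:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Placeholder parameter with data and an accumulated gradient (diff).
struct Param { std::vector<float> data, diff; };

// One SGD iteration with gradient accumulation: run iter_size smaller
// forward/backward passes, let the diffs add up, then apply a single
// update averaged over the accumulated batches.
void SgdStep(std::vector<Param>& params, int iter_size, float lr) {
  for (Param& p : params)
    std::fill(p.diff.begin(), p.diff.end(), 0.0f);  // start from zeroed diffs
  for (int i = 0; i < iter_size; ++i) {
    // net.Forward(); net.Backward();  // each backward pass *adds* into diff
  }
  for (Param& p : params)
    for (std::size_t k = 0; k < p.data.size(); ++k)
      p.data[k] -= lr * p.diff[k] / iter_size;      // averaged update
}
```

The memory footprint stays that of one computational batch, while the gradient seen by the update is that of iter_size batches.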
Contributor
Have we thought about how to handle the case where we're sharing parameters but using different learning rates? I would be okay with simply disallowing that case, since it would probably be a pretty weird thing to do. Otherwise, the only other way I can think of to handle it is pretty messy -- we could have a special case where, e.g., if blobs_lr is 2 in one layer but 1 in all others, the Net could prescale the top_diff for the layer with blobs_lr 2 by a factor of 2... Actually, even that wouldn't work if the layer has other shared param blobs that don't also have the same relative LR.
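A toy numeric example of why per-layer rates and shared, accumulated diffs clash (hypothetical numbers, not from the PR):

```cpp
#include <cstdio>

// One parameter blob shared by layer A (blobs_lr = 1) and layer B (blobs_lr = 2).
int main() {
  float gA = 0.3f;             // gradient contributed by layer A's backward pass
  float gB = 0.5f;             // gradient contributed by layer B's backward pass
  float shared_diff = 0.0f;
  shared_diff += gA;           // accumulation mixes both contributions...
  shared_diff += gB;           // ...into a single diff of 0.8
  float desired = 1.0f * gA + 2.0f * gB;  // with per-layer rates applied: 1.3
  std::printf("accumulated %.1f, desired %.1f\n", shared_diff, desired);
  // No single scale of shared_diff recovers 1.3; the per-layer rate would
  // have to be applied before accumulation (prescaling top_diff), and even
  // that fails if a layer has other shared blobs with different relative LRs.
  return 0;
}
```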
philkr added a commit to philkr/caffe that referenced this pull request on Jan 25, 2015: Decouple the computational batch size and minibatch size by accumulating gradients
Member
Always accumulating is simple and good, but let's review the weight-sharing and solver issues before merging.
This was referenced on Feb 15, 2015 and again on Feb 21, 2015.
Member
Replaced by #1977 (merged).
After #1615, so that this code already supports the deconv layer. (The actual diff is just +37/-40 lines.)
This PRs the gradient accumulation branch living at https://github.com/shelhamer/caffe/tree/accum-grad. I took a lighter approach here than the one there: parameter gradients are always accumulated; there is no other option. The gradient checker is made correct by zero-initing parameter diffs.
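For concreteness, the zero-init fix amounts to something like the following (a sketch assuming the standard Blob/caffe_set API, not the actual GradientChecker change; the helper name is illustrative):

```cpp
#include <vector>

#include "caffe/blob.hpp"
#include "caffe/util/math_functions.hpp"

// Clear every parameter diff so that a single Forward/Backward leaves
// exactly one pass's gradient in the diffs, which is what the checker
// compares against its numeric estimate.
void ZeroParamDiffs(const std::vector<caffe::Blob<float>*>& params) {
  for (caffe::Blob<float>* blob : params) {
    caffe::caffe_set(blob->count(), 0.0f, blob->mutable_cpu_diff());
  }
}
```

Called before each analytic gradient computation, this keeps the always-accumulating Backward consistent with what the checker expects.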
Issues:
- The semantics of Backward change: external code that used Backward is likely to break, if there is any.
- This interacts with SGDSolver, but I haven't thought carefully about that yet.
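To make the first issue concrete, here is a toy standalone model (not Caffe code; `Backward` and `diff` are stand-ins) of what a caller that expects per-pass gradients would now observe:

```cpp
#include <cstdio>

static float diff = 0.0f;

// Stand-in for an always-accumulating backward pass: it *adds* the current
// gradient into diff rather than overwriting it.
void Backward(float grad) { diff += grad; }

int main() {
  Backward(0.5f);                        // diff == 0.5, as old code expected
  Backward(0.5f);                        // same data, second call
  std::printf("diff = %.1f\n", diff);    // prints 1.0: gradients have piled up
  // Callers that want a per-pass gradient must now clear diff between calls.
  return 0;
}
```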