Conversation
|
Cool. Isn't |
src/caffe/layers/loss_layer.cpp
I'm a big fan of the 1/2 on the regularizer, but I don't usually see it on the squared hinge term (cf. liblinear, for example).
Yeah, it will be great to have L1, L2 and L1/L2 regularizers in caffe.
Er, not sure I communicated what I meant to here... what I mean is, I usually see the L2 hinge loss formulated as (lambda/2) ||w||^2 + sum_i xi_i^2, but you seem to have implemented (lambda/2) ||w||^2 + (1/2) sum_i xi_i^2 (where the regularization is implicit in the weight decay; sometimes C is used in place of lambda, and sometimes the 1/2 is omitted on the first term, although I don't like that). So, unless you are following a convention I'm not aware of, I would drop the multiplication by Dtype(0.5) and add a factor of 2 to the gradient.
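For reference, the two objectives being contrasted can be written out (same notation as in the comment above, with $\xi_i$ the hinge slack of example $i$):

$$\text{usual L2-SVM:}\quad \frac{\lambda}{2}\lVert w\rVert^2 + \sum_i \xi_i^2, \qquad \xi_i = \max\bigl(0,\, 1 - y_i\, w^{\top} x_i\bigr),$$

$$\text{as implemented here:}\quad \frac{\lambda}{2}\lVert w\rVert^2 + \frac{1}{2}\sum_i \xi_i^2.$$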
I was trying to separate the L2 hinge loss from the L2 weight regularization. One could use L2 hinge loss without any weight regularization or with L1 weight regularization. So I was trying to separate C from lambda, although it is not clear if it is the right approach.
The idea of multiplying the loss by 0.5 instead of multiplying the gradient by 2 was to simplify the gradient computation and reduce rounding errors (this is similar to the Caffe implementation of dropout). But I would be happy to change it to a more common formulation.
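To spell out the simplification being referred to (writing $s$ for the margin $y\,w^{\top}x$): with the 1/2, the squared-hinge term differentiates without an extra constant,

$$\frac{\partial}{\partial s}\,\tfrac{1}{2}\max(0,\,1-s)^2 = -\max(0,\,1-s), \qquad \frac{\partial}{\partial s}\,\max(0,\,1-s)^2 = -2\max(0,\,1-s).$$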
I was following this implementation of the L2-SVM loss:
% Reference MATLAB implementation of the one-vs-all L2-SVM loss and gradient.
% Here X is the M x D data matrix, theta is the D x K weight matrix, y is a
% column vector of labels in 1..K (M examples), and C trades off the hinge
% term against the (optional) regularizer.
Y = bsxfun(@(y, ypos) 2 * (y == ypos) - 1, y, 1:K);  % one-vs-all labels in {-1, +1}
margin = max(0, 1 - Y .* (X * theta));               % hinge slacks, M x K
if reg
  loss = (0.5 * sum(theta.^2)) + C * mean(margin.^2);
else
  loss = C * mean(margin.^2);
end
loss = sum(loss);  % sum the per-class losses
if reg
  g = theta - 2 * C / M * (X' * (margin .* Y));
else
  g = -2 * C / M * (X' * (margin .* Y));
end
|
I'm not sure if |
|
@longjon many thanks for your comments. I will incorporate them. I haven't tried LeNet yet, but I should, as a sanity check. |
|
@sguada I think C and weight_decay won't play the same role in general, but you can always set a per-layer learning rate (as in my own implementation and experiments) to get exactly the same effect as C. But anyway, a separate C parameter would be preferable to avoid misunderstanding. |
|
@s9xie (cc @sguada):
There certainly could be improvements made to the way parameters are regularized (e.g., weight decay could be upgraded to a general function of parameters, thus supporting things like |
|
@longjon Totally agreed. My point was that, when seeing a loss function with specific parameters, it would be great if one could easily write it down in a prototxt without reading the code and struggling to figure out the mess of setting lr, lambda, weight decay, C, etc. But I do agree this decoupling could easily be achieved once we have an explicit regularizer layer. |
|
@longjon If let |
|
@longjon I have moved the 0.5 from the loss into the 2 for the gradient, but for now I'm going to leave the |
|
Any successful attempts on the ImageNet dataset? It seems the parameters for softmax and hinge loss differ greatly. |
|
@winstywang try with |
|
@sguada It seems scaling it down by a factor of the number of classes makes more sense. I'm still experimenting with it... |
|
@winstywang has it started converging? The training loss should start going down after a few thousand iterations. |
|
@sguada Still running... only 1000 iters so far. I manually checked the initial gradients of softmax and L2 hinge loss, and this setting at least matches the gradient magnitudes. This setting works on CIFAR-10. Let's wait and see... |
|
Let's get this in? |
|
Yeah, I don't like to merge my own PR. |
|
Wait, was @longjon's concern about C ever resolved? |
|
Well, the last few comments are kind of unresolved. Did stuff converge? |
|
@sergeyk I could get a reasonable result on CIFAR, but not ImageNet. For ImageNet, the training loss does decrease, but at a significantly slower rate. BTW, I always set C to the ratio of the number of positive samples to negative samples. |
|
Yeah, it converges if one sets C=0.1. With C=1, the default value, it does I think. |
|
Better to finish the conversation and rebase for a clean merge than just cross our fingers it's ok. |
|
I would vote for no C, and @sguada should add documentation of this layer. |
|
To be clear, the gradients we compute are... So, if we do include scaling on (any) loss, it is for the convenience of adjusting one parameter in the net prototxt instead of two in the solver prototxt, and it comes at the cost of a bit of redundancy in the parameter space. (If we do decide we want that, I would be in favor of @sguada's offline suggestion of naming the scale factor something other than |
|
@longjon agreed, I would rename... Actually, one should change the |
|
@sguada, is what I said above not true? I think this confusion is precisely why I don't find omitting the loss scaling parameter as onerous as you do. Isn't it true that if you scale the loss, you get exactly the same effect by adjusting base_lr and weight_decay? I agree that having to finagle layer multipliers to get the right tradeoff is rather obnoxious, but I don't see why that should be necessary here. I'll check this to be sure when I have a chance... |
|
@longjon true, but when the learning rate is updated, the factor by which weight_decay was initially updated to scale the loss becomes incorrect, no? |
|
@sergeyk, unless I've missed something, I see no problem with updating the learning rate. Learning rate schedules always have the form of a multiplier on base_lr, so scaling base_lr scales the rate at every iteration and the tradeoff is preserved. Viewed another way, |
|
@longjon Agreed, one would need to keep updating the |
|
@sguada: No, I am saying exactly the opposite: setting the loss scale is equivalent to adjusting base_lr and weight_decay in the solver. I'll try this empirically by tomorrow and post the results here. You're quite right that we need a way to trade off multiple losses. But I think it would be better to do that generically in another PR, since all losses will need that option. In the end, the user will have the option to not use the loss scale with a single loss, gaining unique specification of the objective, or to scale the loss directly just as you've done here. |
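For reference, the equivalence under discussion can be sketched for plain SGD with learning rate $\eta$ and weight decay $\lambda$ (momentum is linear in the gradient, so it does not change the argument). Scaling the loss by $C$ gives the update

$$w \leftarrow w - \eta\bigl(C\,\nabla L(w) + \lambda\,w\bigr),$$

which is identical to the update with an unscaled loss, base_lr $\eta' = C\eta$, and weight_decay $\lambda' = \lambda / C$:

$$w \leftarrow w - \eta'\bigl(\nabla L(w) + \lambda'\,w\bigr) = w - C\eta\,\nabla L(w) - \eta\lambda\,w.$$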
|
@longjon okay, then I will remove the loss_scale from the |
Conflicts: src/caffe/layers/loss_layer.cpp src/caffe/proto/caffe.proto src/caffe/test/test_l2_hinge_loss_layer.cpp
Conflicts: include/caffe/vision_layers.hpp src/caffe/layers/loss_layer.cpp src/caffe/proto/caffe.proto src/caffe/test/test_l2_hinge_loss_layer.cpp
Conflicts: src/caffe/layers/loss_layer.cpp src/caffe/test/test_hinge_loss_layer.cpp
Conflicts: src/caffe/layers/loss_layer.cpp
|
@longjon could you take a look and merge? |
|
For the record, I can confirm empirically what I said above: one gets the same performance on LeNet either by setting (the old) C or, equivalently, by adjusting base_lr and weight_decay. I'll review the code as it is now and make sure it works for me by tomorrow. |
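In concrete terms (a hypothetical illustration; the numbers and the assumed original settings of base_lr: 0.01, weight_decay: 0.0005, C = 0.1 are made up, not taken from the actual experiment), the equivalent solver settings with an unscaled loss would be:

# solver.prototxt sketch -- illustrative numbers only
base_lr: 0.001        # original base_lr (0.01) multiplied by C = 0.1
weight_decay: 0.005   # original weight_decay (0.0005) divided by C = 0.1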
|
@longjon many thanks for verifying empirically that scaling the loss has the same effect as changing the base_lr and weight_decay. |
|
Okay, one big thing:
and a few finicky things:
|
|
@longjon It seems that I ran the tests in a different branch. It should be okay now. |
|
Merged. Thanks @sguada! There was a lint error (trailing whitespace), but I went ahead and fixed it myself to avoid another iteration. I also added "L1" to the names of the original tests to be explicit. |
|
Thanks @longjon for fixing the last bits. :) |
|
Has anyone managed to obtain good performance on ImageNet using this L2 SVM as the final layer? Most applications of transfer learning in computer vision replace the top layer of AlexNet with an L2 SVM to achieve state of the art, so it would be awesome to have this working in Caffe! @sguada in your first post you say that |
|
In your citation (as well as many others) they trained a linear SVM on dumped CNN layer features. This is of course not the same as replacing the softmax with a hinge loss layer, though an SVM is actually a shallow network that can be trained with Caffe. People have discussed this (Charlie Tang's paper and R-CNN). Based on my experience on ImageNet-2012, one-vs-all SVMs cannot work very well for datasets with a larger number of classes (say, 100 and above), and it is extremely difficult to tune the hyperparameters. This might be due to the class imbalance problem. If anyone has tried one-vs-all hinge loss instead of softmax on ImageNet and seen a performance boost, I'd appreciate hearing about it. |
|
@s9xie thanks for the reply, that's taught me a few things! These points are off-topic, but:
|
|
I have been looking for a prototxt file myself explaining how to use Caffe to train a linear SVM. Does anyone have an example that I could use?
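A minimal sketch of what such a net prototxt might look like (a rough illustration, not taken from this PR: the layer names are made up, the exact syntax depends on the Caffe version, and the L2 weight regularization itself comes from weight_decay in the solver):

# Linear one-vs-all SVM: an InnerProduct layer followed by a hinge loss.
layer {
  name: "fc"
  type: "InnerProduct"
  bottom: "data"
  top: "fc"
  inner_product_param { num_output: 1000 }  # number of classes
}
layer {
  name: "loss"
  type: "HingeLoss"
  bottom: "fc"
  bottom: "label"
  hinge_loss_param { norm: L2 }  # squared hinge; the default norm is L1
}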
This PR extends #303 by adding a C param to the loss and by adding an L2 norm option to hinge_loss. These should help implement L1 (L2) SVMs with L1 (L2) regularization.
@longjon could you take a look, and let me know if your examples are still working.
By default it uses C=1.0 and hinge_norm=L1, so it behaves as the standard hinge loss.
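For reference, the two norms select between the following objectives (roughly, up to the normalization over the batch; writing $t_{ik} \in \{-1, +1\}$ for the one-vs-all label of example $i$ and class $k$, and $s_{ik}$ for the corresponding score):

$$\text{L1 hinge:}\quad \sum_{i}\sum_{k}\max\bigl(0,\,1 - t_{ik}\,s_{ik}\bigr), \qquad \text{L2 hinge:}\quad \sum_{i}\sum_{k}\max\bigl(0,\,1 - t_{ik}\,s_{ik}\bigr)^2.$$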