Conversation
|
Why L2 regularization in InnerProductLayer? Should be equivalent to weight decay, no? (Though your implementation does save an axpy if using lambda instead of weight_decay, with weight_decay set to 0, but seems potentially hazardous if we're not going to remove weight_decay altogether and do something similar in all layers with parameters imho.)
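For reference, the equivalence at issue: the gradient of an L2 penalty (lambda/2) * ||W||^2 is lambda * W, which is exactly the term the solver adds for weight decay, so doing it inside a layer only saves the one extra axpy mentioned above. A minimal sketch (not the committed code), assuming Caffe's caffe_axpy(N, alpha, X, Y) with Y += alpha * X semantics:

```cpp
#include "caffe/util/math_functions.hpp"  // caffe_axpy

// Sketch only: adding the gradient of (lambda / 2) * ||W||^2 to a
// parameter's diff is a single axpy, the same update weight decay performs.
template <typename Dtype>
void AddL2PenaltyGradient(const int count, const Dtype lambda,
                          const Dtype* weights, Dtype* weight_diff) {
  // weight_diff += lambda * weights
  caffe::caffe_axpy<Dtype>(count, lambda, weights, weight_diff);
}
```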
|
Good point, wasn't thinking about that. One might conceivably want a different tradeoff parameter at the
|
Inside each
|
src/caffe/layers/loss_layer.cpp
Outdated
caffe_copy(count, bottom_data, bottom_diff)
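For context, a rough sketch (assumed names and blob layout, not necessarily the committed code) of how the suggested caffe_copy fits into a one-vs-all hinge loss forward pass, with the bottom diff used as scratch space for the margins:

```cpp
#include <algorithm>
#include "caffe/util/math_functions.hpp"  // caffe_copy, caffe_cpu_asum

// Sketch of a CPU forward pass for the one-vs-all hinge loss.
// bottom_data: num x dim scores; label: num true-class indices;
// bottom_diff: num x dim buffer that ends up holding the margins.
template <typename Dtype>
Dtype HingeForwardSketch(const int num, const int dim,
                         const Dtype* bottom_data, const Dtype* label,
                         Dtype* bottom_diff) {
  const int count = num * dim;
  caffe::caffe_copy(count, bottom_data, bottom_diff);  // diff = x
  for (int i = 0; i < num; ++i) {
    // flip the true class so every entry becomes -y_ij * x_ij
    bottom_diff[i * dim + static_cast<int>(label[i])] *= -1;
  }
  for (int i = 0; i < count; ++i) {
    // hinge: max(0, 1 - y_ij * x_ij)
    bottom_diff[i] = std::max(Dtype(0), Dtype(1) + bottom_diff[i]);
  }
  // loss = (1/n) sum_ij max(0, 1 - y_ij x_ij)
  return caffe::caffe_cpu_asum(count, bottom_diff) / num;
}
```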
|
@longjon can you share the prototxt for LeNet/hinge? I've got numerical overflow on gradient computations with your loss...
|
My apologies, @s9xie, I accidentally clobbered the working commit with a broken one (an errant minus sign). I've put up a fixed version that gets (e.g.) 0.9921 accuracy after 10k iterations. The only change to the prototxt is to replace
|
I'm new to hinge loss; how can it be applied to a multi-class problem?
|
@zgxiangyang, this layer implements one-vs-all hinge loss, so the loss for each example is the hinge loss for the binary problem of separating the true class of that example from all other classes. There is also (not implemented here) a different multiclass hinge loss, the Crammer and Singer version, that some feel is more natural (and that extends naturally to structured prediction problems); one-vs-all is, however, more common in practice.
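To make the distinction concrete, a small standalone sketch (plain C++, not Caffe code) of both losses for one example with scores x and true class t:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// One-vs-all hinge loss: one binary hinge per class, with target +1 for the
// true class and -1 for every other class: sum_j max(0, 1 - y_j * x_j).
double OneVsAllHinge(const std::vector<double>& x, int t) {
  double loss = 0.0;
  for (int j = 0; j < static_cast<int>(x.size()); ++j) {
    const double y = (j == t) ? 1.0 : -1.0;
    loss += std::max(0.0, 1.0 - y * x[j]);
  }
  return loss;
}

// Crammer-Singer multiclass hinge loss (not what this layer implements):
// only the largest violating margin is penalized,
// max(0, 1 + max_{j != t} x_j - x_t).
double CrammerSingerHinge(const std::vector<double>& x, int t) {
  double rival = -std::numeric_limits<double>::infinity();
  for (int j = 0; j < static_cast<int>(x.size()); ++j) {
    if (j != t) rival = std::max(rival, x[j]);
  }
  return std::max(0.0, 1.0 + rival - x[t]);
}
```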
|
@longjon thanks!
This layer implements a "one-vs-all" hinge loss, (1/n) sum_ij max(0, 1 - y_ij x_ij), with bottom blob x_ij (i ranging over examples and j over classes), and y_ij = +1/-1 indicating the label. No regularization is included, since regularization is done via weight decay or using the parameters of another layer. The gradient is taken to be zero at the hinge point. This commit only provides the CPU implementation.
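A sketch of the corresponding (sub)gradient, with the convention above that it is zero at the hinge point, i.e. dL/dx_ij = -y_ij / n when 1 - y_ij x_ij > 0 and 0 otherwise (names are assumed for illustration, not the committed code):

```cpp
// Sketch of the backward pass for L = (1/n) sum_ij max(0, 1 - y_ij x_ij):
// dL/dx_ij = -y_ij / n if 1 - y_ij x_ij > 0, else 0 (zero at the hinge point).
template <typename Dtype>
void HingeBackwardSketch(const int num, const int dim,
                         const Dtype* bottom_data, const Dtype* label,
                         Dtype* bottom_diff) {
  for (int i = 0; i < num; ++i) {
    for (int j = 0; j < dim; ++j) {
      const Dtype y = (j == static_cast<int>(label[i])) ? Dtype(1) : Dtype(-1);
      const Dtype margin = Dtype(1) - y * bottom_data[i * dim + j];
      bottom_diff[i * dim + j] = (margin > 0) ? -y / num : Dtype(0);
    }
  }
}
```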
In theory, layer functions could be nonsmooth anywhere; in all cases in use so far, they are nonsmooth at either zero or +1 and -1. In the future, it might be necessary to generalize the kink mechanism beyond this stopgap measure.
Based on SoftmaxWithLossLayerTest.
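For the test, the kink mechanism presumably amounts to telling the gradient checker not to take finite differences near the nonsmooth point. A hedged sketch, assuming a GradientChecker constructed as (stepsize, threshold, seed, kink, kink_range) as in Caffe's gradient check utility; the include paths and helper name are assumptions:

```cpp
#include <vector>

#include "caffe/layers/hinge_loss_layer.hpp"        // path assumed
#include "caffe/test/test_gradient_check_util.hpp"  // GradientChecker

// Sketch: place the kink at a margin of 1 and exclude a small range around
// it from the finite-difference comparison, so the check never straddles
// the hinge point.
template <typename Dtype>
void CheckHingeGradientSketch(const caffe::LayerParameter& layer_param,
                              const std::vector<caffe::Blob<Dtype>*>& bottom,
                              const std::vector<caffe::Blob<Dtype>*>& top) {
  caffe::HingeLossLayer<Dtype> layer(layer_param);
  caffe::GradientChecker<Dtype> checker(1e-2, 2e-3, 1701, 1, 0.01);
  checker.CheckGradientExhaustive(&layer, bottom, top);
}
```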
|
Now using caffe_copy. Can someone confirm that it is okay to store intermediate computations in the bottom diff? Other than that, this is ready for review.
|
looks great, thanks Jon!
|
@longjon I think in the long run we probably don't want to use
HingeLossLayer
- HingeLossLayer only provides a CPU implementation
- Adding L2 regularization directly to InnerProductLayer might seem heavy-handed (although the implementation is simple)
- AFAICT Implement regularizers #258 does not address regularization of parameters, so it will not make 86ef499 go away, but future work might provide a more general solution