Conversation

Looks like some #ifdef CPU and cmake support need to be added.

Seems to me that you are doing per-neuron batch normalization, which is feasible for a fully connected layer. But in a conv layer (and also a fully connected layer), you should do per-channel batch normalization according to Google's paper. Update: Here is my quick and messy implementation based on MVNLayer: https://github.com/ChenglongChen/batch_normalization
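
For reference, "per-channel" here means that each channel's mean and variance are pooled over the batch and all spatial positions of that channel. A minimal standalone sketch of that computation (not taken from the linked implementation; the function name and the eps value are illustrative):

```cpp
#include <cmath>
#include <vector>

// Illustrative sketch only: per-channel batch normalization statistics for an
// N x C x H x W blob stored as a flat row-major array. Each channel c is
// normalized with a mean/variance pooled over the batch and all spatial
// positions of that channel.
void channel_batch_norm(std::vector<float>& data, int N, int C, int H, int W,
                        float eps = 1e-5f) {
  const int spatial = H * W;
  const int count = N * spatial;  // number of elements pooled per channel
  for (int c = 0; c < C; ++c) {
    double mean = 0.0, var = 0.0;
    for (int n = 0; n < N; ++n)
      for (int s = 0; s < spatial; ++s)
        mean += data[(n * C + c) * spatial + s];
    mean /= count;
    for (int n = 0; n < N; ++n)
      for (int s = 0; s < spatial; ++s) {
        const double d = data[(n * C + c) * spatial + s] - mean;
        var += d * d;
      }
    var /= count;  // biased (mini-batch) variance, as used in the forward pass
    const double inv_std = 1.0 / std::sqrt(var + eps);
    for (int n = 0; n < N; ++n)
      for (int s = 0; s < spatial; ++s) {
        float& x = data[(n * C + c) * spatial + s];
        x = static_cast<float>((x - mean) * inv_std);
      }
  }
}
```

The per-neuron variant used for a fully connected layer is the special case H = W = 1.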

@ChenglongChen

@ducha-aiki,

@ChenglongChen

@ChenglongChen Perhaps having the layer normalize over num() and channels() is a better default. But by my reading, it's not actually possible to do proper convolutional batch normalization without deep integration into the conv layer itself. I assumed Caffe would have to implement that separately. They apply normalization to each feature map independently, and feature maps overlap in the convolution operation, so it's not possible to get the desired result via composition with the current conv layer. Essentially, the normalization is not just over num() and channels(), but also over 3x3 or 5x5 patches of width() and height(). Correct me where I'm wrong.

Edit: It looks like others are interpreting the feature maps as the entire width() and height() for a given channel, rather than kxk kernel regions, so you can get away with implementing the layer separately after all. It seems more natural to me to normalize across the local receptive fields, but perhaps the authors ran into the same difficulty of implementing that route. It would be interesting to know if they had tried.

@Russell91, @ChenglongChen's implementation looks like it handles this part properly. Actually, it is very similar to yours.

@ChenglongChen rebased and fixed in #1891

@ducha-aiki,

Could you clarify something for me, @ChenglongChen? Do I miss something, or should there be a ReLU or sigmoid function after the BN layer? I tried both @ChenglongChen's version and @ducha-aiki's, and neither of them converges on my data. I do not know why.
@melgor you are right. My version only passes the tests, but it also shows strange behaviour on some datasets. So any help is appreciated.
Just because the original caffe-lenet also has no activation function, I think.

@melgor, I also observed some sudden blow-ups in some mini-batches during training (with my older version). I suspect it is due to the variance division, but I am not very sure.
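
One common guard against such blow-ups, and what the paper itself does, is to add a small epsilon inside the square root so that a near-zero mini-batch variance cannot explode the normalized output. A minimal sketch (the eps value is illustrative):

```cpp
#include <cmath>

// Illustrative sketch only: normalize one activation with an epsilon guard so
// that a near-zero mini-batch variance cannot blow up the output.
inline float bn_normalize(float x, float mean, float variance,
                          float eps = 1e-5f) {
  return (x - mean) / std::sqrt(variance + eps);
}
```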

According to Algorithm 2, lines 8-12: before testing, we should compute the mean and variance from some training mini-batches. I think this piece is still missing; should it be integrated in solver.cpp?
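
For what Algorithm 2, lines 8-12 describe: after training, the population mean is the average of the per-mini-batch means, and the population variance is the average of the per-mini-batch variances scaled by the unbiased correction m/(m-1). A minimal sketch for a single channel (function and variable names are illustrative, not from any of the linked branches):

```cpp
#include <vector>

// Illustrative sketch only: estimate the inference-time statistics of one
// channel from the means/variances recorded over several training
// mini-batches, as described in Algorithm 2 of the paper. 'm' is the number
// of values that contributed to each mini-batch statistic.
void estimate_inference_stats(const std::vector<double>& batch_means,
                              const std::vector<double>& batch_vars,
                              int m, double* pop_mean, double* pop_var) {
  double mean = 0.0, var = 0.0;
  for (double mu : batch_means) mean += mu;
  for (double v : batch_vars) var += v;
  mean /= batch_means.size();
  // Unbiased population variance: E_B[sigma_B^2] * m / (m - 1).
  var = (var / batch_vars.size()) * (static_cast<double>(m) / (m - 1));
  *pop_mean = mean;
  *pop_var = var;
}
```

At test time the layer then uses these fixed statistics instead of the current mini-batch's, which is why they need to be computed and stored before deployment, whether in solver.cpp or in the layer itself.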

Closing in favor of #1965 as it seems to be more complete -- comment if not.

