Conversation

Looks like some #ifdef CPU and cmake support need to be added.

Seems to me that you are doing per-neuron batch normalization, which is feasible for a fully connected layer. But in a conv layer (and also a fully connected layer), you should do per-channel batch normalization according to Google's paper. Update: Here is my quick and messy implementation based on MVNLayer: https://github.com/ChenglongChen/batch_normalization
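
For reference, "per-channel" here means that each channel's mean and variance are pooled over the batch and all spatial positions of that channel. A minimal standalone sketch of that computation (not taken from the linked implementation; the function name and the eps value are illustrative):

```cpp
#include <cmath>
#include <vector>

// Illustrative sketch only: per-channel batch normalization statistics for an
// N x C x H x W blob stored as a flat row-major array. Each channel c is
// normalized with a mean/variance pooled over the batch and all spatial
// positions of that channel.
void channel_batch_norm(std::vector<float>& data, int N, int C, int H, int W,
                        float eps = 1e-5f) {
  const int spatial = H * W;
  const int count = N * spatial;  // number of elements pooled per channel
  for (int c = 0; c < C; ++c) {
    double mean = 0.0, var = 0.0;
    for (int n = 0; n < N; ++n)
      for (int s = 0; s < spatial; ++s)
        mean += data[(n * C + c) * spatial + s];
    mean /= count;
    for (int n = 0; n < N; ++n)
      for (int s = 0; s < spatial; ++s) {
        const double d = data[(n * C + c) * spatial + s] - mean;
        var += d * d;
      }
    var /= count;  // biased (mini-batch) variance, as used in the forward pass
    const double inv_std = 1.0 / std::sqrt(var + eps);
    for (int n = 0; n < N; ++n)
      for (int s = 0; s < spatial; ++s) {
        float& x = data[(n * C + c) * spatial + s];
        x = static_cast<float>((x - mean) * inv_std);
      }
  }
}
```

The per-neuron variant used for a fully connected layer is the special case H = W = 1.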

@ChenglongChen

@ducha-aiki,

@ChenglongChen

@ChenglongChen Perhaps having the layer normalize over num() and channels() is a better default. But by my reading, it's not actually possible to do proper convolutional batch normalization without deep integration into the conv layer itself. I assumed Caffe would have to implement that separately. They apply normalization to each feature map independently, and feature maps overlap in the convolution operation, so it's not possible to get the desired result via composition with the current conv layer. Essentially, the normalization is not just over num() and channels(), but also over 3x3 or 5x5 patches of width() and height(). Correct me where I'm wrong.

Edit: It looks like others are interpreting the feature maps as the entire width() and height() for a given channel, rather than kxk kernel regions, so you can get away with implementing the layer separately after all. It seems more natural to me to normalize across the local receptive fields, but perhaps the authors ran into the same difficulty of implementing that route. It would be interesting to know if they had tried.

@Russell91, @ChenglongChen's implementation looks like it handles this part properly. Actually, it is very similar to yours.

@ChenglongChen rebased and fixed in #1891

@ducha-aiki,

Could you clarify something for me, @ChenglongChen? Do I miss something, or should there be a ReLU or sigmoid function after the BN layer? I tried both @ChenglongChen's version and @ducha-aiki's, and neither of them converges on my data. I do not know why.
@melgor you are right. My version only passes the tests, but it also shows strange behaviour on some datasets. So any help is appreciated.
Just because the original caffe-lenet also has no activation function, I think.

@melgor, I also observed some sudden blow-ups in some mini-batches during training (with my older version). I suspect it is due to the variance division, but I am not very sure.
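
One common guard against such blow-ups, and what the paper itself does, is to add a small epsilon inside the square root so that a near-zero mini-batch variance cannot explode the normalized output. A minimal sketch (the eps value is illustrative):

```cpp
#include <cmath>

// Illustrative sketch only: normalize one activation with an epsilon guard so
// that a near-zero mini-batch variance cannot blow up the output.
inline float bn_normalize(float x, float mean, float variance,
                          float eps = 1e-5f) {
  return (x - mean) / std::sqrt(variance + eps);
}
```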

According to Algorithm 2, lines 8-12: before testing, we should compute the mean and variance from some training mini-batches. I think this piece is still missing; should it be integrated in solver.cpp?
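
For what Algorithm 2, lines 8-12 describe: after training, the population mean is the average of the per-mini-batch means, and the population variance is the average of the per-mini-batch variances scaled by the unbiased correction m/(m-1). A minimal sketch for a single channel (function and variable names are illustrative, not from any of the linked branches):

```cpp
#include <vector>

// Illustrative sketch only: estimate the inference-time statistics of one
// channel from the means/variances recorded over several training
// mini-batches, as described in Algorithm 2 of the paper. 'm' is the number
// of values that contributed to each mini-batch statistic.
void estimate_inference_stats(const std::vector<double>& batch_means,
                              const std::vector<double>& batch_vars,
                              int m, double* pop_mean, double* pop_var) {
  double mean = 0.0, var = 0.0;
  for (double mu : batch_means) mean += mu;
  for (double v : batch_vars) var += v;
  mean /= batch_means.size();
  // Unbiased population variance: E_B[sigma_B^2] * m / (m - 1).
  var = (var / batch_vars.size()) * (static_cast<double>(m) / (m - 1));
  *pop_mean = mean;
  *pop_var = var;
}
```

At test time the layer then uses these fixed statistics instead of the current mini-batch's, which is why they need to be computed and stored before deployment, whether in solver.cpp or in the layer itself.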

Closing in favor of #1965 as it seems to be more complete -- comment if not.

