Rework Xavier to be more flexible #32
Another point we should discuss is the calculation of fan_out. Currently we have:

    fan_in = prod(dims[2:end])
    fan_out = dims[1]

But following [1] and [4], "input blob has shape (num, a, b, c) where a * b * c = fan_in and num * b * c = fan_out". We should maybe have (and if somebody could double-check my logic :) ):

    fan_in = prod(dims[2:end])
    fan_out = prod(dims[1:end]) / dims[2]

The Caffe code you cited looks definitely weird to me. The Also, things are quite different when it comes to the FullyConnected layer: the weights should be a matrix (instead of a 4D tensor), and the fan-in/fan-out calculation should handle this gracefully.
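
To make the two fan computations from this thread concrete, here is a minimal Python sketch. It assumes the `(num, a, b, c)` layout quoted above and uses 0-based indexing, so `dims[1]` here corresponds to `dims[2]` in the snippets; the function names `fans_current` and `fans_proposed` are hypothetical and not part of any library.

```python
from math import prod


def fans_current(dims):
    """Rule described as current in the thread: fan_out is the first dimension only."""
    fan_in = prod(dims[1:])
    fan_out = dims[0]
    return fan_in, fan_out


def fans_proposed(dims):
    """Rule proposed in the thread: fan_out = prod(dims) / dims[1] (0-based index)."""
    if len(dims) < 2:
        raise ValueError("expected a weight shape with at least 2 dimensions")
    fan_in = prod(dims[1:])
    fan_out = prod(dims) // dims[1]
    return fan_in, fan_out


# Hypothetical shapes, chosen only for illustration.
conv_shape = (64, 3, 5, 5)   # (num, a, b, c) convolution weights
fc_shape = (128, 256)        # plain weight matrix, as in a fully connected layer

print(fans_current(conv_shape), fans_proposed(conv_shape))
print(fans_current(fc_shape), fans_proposed(fc_shape))
```

Under these assumptions the two rules differ only for the 4D shape, where the proposed rule multiplies fan_out by the spatial extent `b * c`. For a 2D matrix both rules return the same pair, which suggests the proposed formula would cover the fully connected case without a special branch.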

Yeah, I am unsure about 710dd01. In [1]

So for me this would be ready.
Rework Xavier to be more flexible
Following the discussion in apache/mxnet#610 I took another swing at Xavier.
The idea is that the concepts proposed in the papers [1, 2, 3] can be generalized by choosing a regularization factor (`1/fan_in`, `1/fan_out`, or `2/(fan_out + fan_in)`), the distribution to sample from, and a magnitude scaling factor. [1] proposes `3/fan_in` and `6/(fan_out + fan_in)`, and [2, 3] propose `2/fan_in`.

[1] X. Glorot and Y. Bengio (2010) http://jmlr.csail.mit.edu/proceedings/papers/v9/glorot10a.html
[2] K. He, X. Zhang, S. Ren, and J. Sun (2015) http://arxiv.org/abs/1502.01852
[3] A. M. Saxe, J. L. McClelland, and S. Ganguli (2013/2014) http://arxiv.org/abs/1312.6120v3
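
As a rough illustration of the generalization described above, the following Python sketch combines a factor choice, a distribution choice, and a magnitude into one scale. It is not the code from this PR: the names `xavier_scale`, `xavier_sample`, `factor_type`, `distribution`, and `magnitude` are hypothetical, and reading the scale as a uniform bound or as a normal standard deviation is an assumption made for the example.

```python
import math
import random


def xavier_scale(fan_in, fan_out, factor_type="avg", magnitude=3.0):
    """Return sqrt(magnitude / factor) for the chosen factor type."""
    factors = {
        "in": fan_in,                      # corresponds to 1/fan_in
        "out": fan_out,                    # corresponds to 1/fan_out
        "avg": (fan_in + fan_out) / 2.0,   # corresponds to 2/(fan_out + fan_in)
    }
    if factor_type not in factors:
        raise ValueError(f"unknown factor_type: {factor_type!r}")
    return math.sqrt(magnitude / factors[factor_type])


def xavier_sample(n, fan_in, fan_out, distribution="uniform",
                  factor_type="avg", magnitude=3.0, rng=random):
    """Draw n weights; the scale is a uniform bound or a normal std (assumed)."""
    scale = xavier_scale(fan_in, fan_out, factor_type, magnitude)
    if distribution == "uniform":
        return [rng.uniform(-scale, scale) for _ in range(n)]
    if distribution == "normal":
        return [rng.gauss(0.0, scale) for _ in range(n)]
    raise ValueError(f"unknown distribution: {distribution!r}")


# magnitude=3 with the averaged factor gives the bound sqrt(6/(fan_in + fan_out)).
weights = xavier_sample(1000, fan_in=100, fan_out=200)
# magnitude=2 with factor_type="in" gives a scale of sqrt(2/fan_in).
he_style = xavier_sample(1000, 100, 200, distribution="normal",
                         factor_type="in", magnitude=2.0)
```

With these assumed parameters, `magnitude=3` reproduces the `3/fan_in` and `6/(fan_out + fan_in)` forms mentioned above, and `magnitude=2` with the `fan_in` factor reproduces `2/fan_in`.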