MNIST autoencoder example (sigmoid cross-entropy loss layer, sparse gaussian filler)

Conversation
Ready to merge, @shelhamer. The unit test problem with …

While I could propagate down to the 2nd input, the semantics of this layer (take the sigmoid of the first input and compute the cross-entropy error between the sigmoidal outputs and the second input, which is assumed to already be in the 0-1 range) seem like they would make it odd to use with weights below the second input, and it's wasteful to propagate down to the second input if there aren't any weights below. I think this should eventually be fixed somehow, e.g. by making the …
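A minimal standalone sketch of the loss described here may help readers following the thread. This is not Caffe's implementation; the function name and the per-element mean normalization are illustrative assumptions. The first input (the logits) is passed through a sigmoid, and cross-entropy is measured against the second input, which is assumed to already lie in [0, 1].

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sigmoid cross-entropy: sigmoid the first input, then measure
// cross-entropy against the second input (a target already in [0, 1]).
// Written for clarity, not numerical stability.
double SigmoidCrossEntropy(const std::vector<double>& logits,
                           const std::vector<double>& targets) {
  double loss = 0.0;
  for (std::size_t i = 0; i < logits.size(); ++i) {
    const double p = 1.0 / (1.0 + std::exp(-logits[i]));  // sigmoid
    const double t = targets[i];                          // assumed in [0, 1]
    loss -= t * std::log(p) + (1.0 - t) * std::log(1.0 - p);
  }
  return loss / logits.size();  // mean over elements (assumed normalization)
}
```

With targets that are exactly 0 or 1 this reduces to the usual logistic loss on the logits.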
Ok, looks good. Thanks Jeff!
This seems like a good way to fix it to me. Agreed with not propagating to the 2nd input for now, for the reason given. For anyone with a model where both paths have parameters, they could hack in a field in the loss layer like …
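To make the "don't propagate to the 2nd input" choice concrete, here is a hedged sketch of a backward pass under that convention. It is not the actual Caffe code; the helper name and the mean normalization are assumptions. The gradient flows only to the logit input, using the standard sigmoid(x) - t form, and nothing is computed for the target input.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative backward pass: gradient only for the first (logit) input.
// No diff is produced for the second (target) input, matching the decision
// above not to propagate down to it.
void SigmoidCrossEntropyBackward(const std::vector<double>& logits,
                                 const std::vector<double>& targets,
                                 std::vector<double>* logit_diff) {
  logit_diff->resize(logits.size());
  for (std::size_t i = 0; i < logits.size(); ++i) {
    const double p = 1.0 / (1.0 + std::exp(-logits[i]));  // sigmoid
    (*logit_diff)[i] = (p - targets[i]) / logits.size();  // d(mean loss)/d(logit)
  }
}
```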
MNIST autoencoder example
I'd be very happy about some comments in mnist_autoencoder.prototxt or an accompanying readme about how exactly the autoencoder works. In particular, the roles of the two distinct loss layers (and why one of them operates on …) could use an explanation.
I agree with @moi90, a readme and tutorials would be very nice for anyone who wants to use them.
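Pending an actual readme, here is one plausible reading of why there are two loss-like computations, sketched as standalone code rather than quoted from the prototxt (the function and variable names are hypothetical): the sigmoid cross-entropy term is the training objective and consumes the raw decoder output, since that loss layer applies the sigmoid internally, while an L2 (Euclidean) error on the sigmoided reconstruction is computed only for reporting, which would match the reconstruction-error figures quoted in the description below.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch of the two roles a reconstruction can play in this example:
// a cross-entropy training loss on the pre-sigmoid decoder output, and a
// reported L2 error on the sigmoided reconstruction.
std::pair<double, double> AutoencoderLosses(
    const std::vector<double>& decoder_output,  // pre-sigmoid reconstruction
    const std::vector<double>& data) {          // flattened input in [0, 1]
  double cross_entropy = 0.0;  // training objective
  double l2_error = 0.0;       // monitoring metric
  for (std::size_t i = 0; i < data.size(); ++i) {
    const double p = 1.0 / (1.0 + std::exp(-decoder_output[i]));  // sigmoid
    cross_entropy -= data[i] * std::log(p) + (1.0 - data[i]) * std::log(1.0 - p);
    const double d = p - data[i];
    l2_error += d * d;  // squared L2 distance, summed over pixels
  }
  return {cross_entropy, l2_error};
}
```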
This PR moves examples/lenet/ to examples/mnist/ and adds the necessary files to train an autoencoder on MNIST with the architecture of Hinton & Salakhutdinov [1], using SGD with no pre-training (e.g. via RBM). It uses a sparse Gaussian initialization (added to filler.hpp) as suggested by [2] as a strategy for training autoencoders via SGD* without pretraining. It uses a fixed LR of 0.0001, which could probably be greatly improved upon, but I haven't played with it much.

After 2 million iterations (which took a few hours on the GPU -- I didn't originally intend to train it this long, but this is what it was at when I came back to it) the test L2 reconstruction error was around 1.5-1.6. (For an idea of what this means, a reconstruction with 2 out of 784 pixels flipped from perfectly white to perfectly black, or vice versa, would have an L2 reconstruction error of 2.0.)
*actually [2] used Nesterov's accelerated gradient, but this is not currently implemented in Caffe and SGD seems to be fairly effective.
[1] http://www.cs.toronto.edu/~hinton/science.pdf
[2] http://jmlr.org/proceedings/papers/v28/sutskever13.pdf
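As a quick check of the flipped-pixel figure in the description, assuming the reported error is the plain sum of squared per-pixel differences (my reading of the 2.0 claim; a Euclidean loss layer may apply additional normalization that changes the scale):

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Flipping 2 of 784 pixels between perfectly black (0) and perfectly white (1)
// contributes 2 * (1.0)^2 = 2.0 of summed squared error.
int main() {
  std::vector<double> truth(784, 0.0), recon(784, 0.0);
  recon[0] = 1.0;  // two flipped pixels
  recon[1] = 1.0;
  double err = 0.0;
  for (std::size_t i = 0; i < truth.size(); ++i) {
    const double d = recon[i] - truth[i];
    err += d * d;
  }
  std::cout << "L2 reconstruction error: " << err << "\n";  // prints 2
  return 0;
}
```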