This repository was archived by the owner on Jul 16, 2021. It is now read-only.

Updated neural net implementation#126

Merged
zackmdavis merged 38 commits into AtheMathmo:master from NivenT:master
Oct 30, 2017

Conversation

@NivenT
Contributor

@NivenT NivenT commented Sep 9, 2016

This PR is a response to #91, but is not yet ready to be merged.

This PR introduces the NetLayer trait for describing the requirements for being a layer in a neural network. It changes the method of storing layers in neural networks, from only storing their sizes to explicitly storing a Vec<Box<NetLayer>>, allowing different types of layers to be defined and combined.

The code currently passes all tests, and the AND gate example works as well, but I could use some feedback before this is merge-ready. More specifically, there were several decisions I made while writing this code that I am not sure were the best choices.

Linear doesn't have a bias

Linear layers (which are standard fully connected layers) don't include a bias term. Instead there is a separate Bias layer which appends a 1 to the end of the input, allowing a following linear layer to emulate having a bias. This feels a little roundabout, but not including a bias simplifies the linear code, and lets the user easily choose whether or not she wants to include a bias.
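As a rough illustration of that trick (using plain `Vec`s rather than rusty-machine's `Matrix` type, and a hypothetical `bias_forward` helper, not the PR's actual code), the Bias layer's forward pass amounts to:

```rust
// Hypothetical sketch: a Bias "layer" appends a constant 1.0 to each input
// row, so that a following linear layer's weight matrix gains one extra row
// that acts exactly like a bias vector.
fn bias_forward(input: &[Vec<f64>]) -> Vec<Vec<f64>> {
    input
        .iter()
        .map(|row| {
            let mut r = row.clone();
            r.push(1.0); // the appended bias unit
            r
        })
        .collect()
}

fn main() {
    let x = vec![vec![0.5, 2.0]];
    let y = bias_forward(&x);
    assert_eq!(y, vec![vec![0.5, 2.0, 1.0]]);
}
```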

num_params/param_shape redundant

The NetLayer trait requires a function that returns the number of parameters, and a separate function that returns the shape of the parameter matrix. You could easily get the number of parameters from their shape, so is the first function necessary?

Name of linear layer

I am not sure Linear is the most obvious name for the layer, but I couldn't think of a better one.

Default Implementations on NetLayer

Some layers (Bias, ActivationFunc, pooling layers in the future) have no parameters. Would it be worthwhile to have default implementations for num_params, default_params, etc. on the NetLayer trait that assume there are no parameters, or is it better not to have those, to make sure no one forgets to implement them on a layer that does have parameters?
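A minimal sketch of what such defaults could look like (the trait and methods here are stand-ins for illustration, not the PR's actual `NetLayer` definition):

```rust
// Hypothetical sketch: parameterless defaults on the trait mean that
// "shape-only" layers like Bias or an activation layer need no boilerplate,
// at the risk that a parameterised layer silently inherits them.
trait NetLayer {
    fn num_params(&self) -> usize {
        0 // default: the layer has no trainable parameters
    }
    fn default_params(&self) -> Vec<f64> {
        vec![] // default: nothing to initialise
    }
}

struct Bias;
impl NetLayer for Bias {} // inherits the parameterless defaults

fn main() {
    assert_eq!(Bias.num_params(), 0);
    assert!(Bias.default_params().is_empty());
}
```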

Box<NetLayer> vs &NetLayer

Neural networks currently store a vector of Box<NetLayer>. Should they instead store &NetLayer?

NetLayer/ActivationFunc extend other traits

NetLayer inherits from Debug, and ActivationFunc now inherits from Clone + Debug. I ran into issues with NetLayer needing Debug since BaseNeuralNetwork derives it. I ran into similar issues that resulted in ActivationFunc inheriting Clone + Debug.

Old weight functions

If you compile the code as is, the compiler will warn about a few unused functions, all having to do with the weights of the network. These were all written with the previous way of storing layers in mind. I was planning on just deleting them, but decided to leave them there for now in case someone sees value in keeping them, adapted for the new storage method.

Aside from those, the only other major thing to know is that the implementation doesn't yet take regularization into account during backpropagation. Any feedback, complaints, or improvements?

@NivenT
Contributor Author

NivenT commented Sep 9, 2016

Does anyone know why the list of added commits includes ones that have already been merged, and how I can get rid of them if possible?

@AtheMathmo
Owner

AtheMathmo commented Sep 9, 2016

Does anyone know why the list of added commits includes ones that have already been merged, and how I can get rid of them if possible?

You can rebase and do commit --amend but I wouldn't worry. I've been squashing and merging PRs to keep the master branch cleaner (I also find it makes it a little easier to write up the change log notes).

Edit: I just figured out why they appear - it is because I squashed these commits in earlier. It's not going to be an issue here I believe.

I haven't looked through the PR in detail yet but I figured I can give a few comments to things first. (I'll also have to brush up on my neural nets as it has been a little while since I've delved into the details).

Linear doesn't have a bias

I don't object to this not being the default, but requiring bias as a separate layer seems counterintuitive to me. If there were some way to have it included in the layer but optional, that would be preferable to me. Maybe I will understand the motivation more after digging into the code.

num_params/param_shape redundant

Will have to read a little more before answering this.

Name of linear layer

I think it sort of makes sense but I agree that something else might be better. Mostly to be familiar to users coming from other frameworks.

NetLayer/ActivationFunc extend other traits

I think this is fine but from my brief look-over it would be nice if we could cut down on clones. For example I saw input.clone() - this probably exists already but should be removed if at all possible in the new implementation. We should use something other than apply here.

Edit: oh and I think I'd prefer to make a new nnet/mod.rs and put the layers module in this folder. It seems better to keep them together than to put it into toolkit - that makes it hard to find the code (and documentation).

@AtheMathmo
Owner

After looking through a bit more I think that we'd like to get rid of the Linear layer and instead have an Activation layer. If we combine the two we can capture the Linear layer by using a Linear activation function. Did you intentionally avoid this construction?

@NivenT
Contributor Author

NivenT commented Sep 9, 2016

I did intentionally avoid that. I wasn't sure if there would be a use case where you apply an activation function after something other than a linear layer, so I wanted to keep the two separate in case there is. Part of my philosophy with the implementation was that having more layers allows for more modularity and simpler calculations (per layer), although admittedly, at the expense of more verbose/complicated net construction.

@AtheMathmo
Owner

I'd be interested to see some benchmarks to see whether there are performance advantages to this approach. Actually, there is a benchmark we can use already to check...

My feeling is that the added complexity is definitely not worth it right now. Rusty-machine is far from being a competitor in the neural network space - I think its simplicity is one of the few reasons I thought the neural nets might be valuable. I'd like to see this changed in the future but it is a lot of work.

I think if we have to provide lots of information on how to use the neural networks in rusty-machine they become a lot less valuable. Perhaps I am being short-sighted - what do you think?

@NivenT
Contributor Author

NivenT commented Sep 9, 2016

I understand what you mean. There's definitely value in keeping the API simple.

@NivenT
Contributor Author

NivenT commented Sep 9, 2016

I started working on combining the layers, and ran into an issue. The new Activation layer computes f(xW+b), so when calculating gradients, you use the chain rule and need to know the value of xW+b. This means that when calling compute_grad, xW+b is calculated twice, once during forward propagation and once during backward propagation. I can't think of a nice way to store xW+b during forward propagation to avoid the second calculation since it's an intermediate step in the layer and not the layer's output.

I think what I'll do is keep the Linear layer separate from the ActivationFunc layer (this allows things to only be calculated once), but make it optionally include a bias like you had said before. Users will still have to know to add an activation layer after a linear one, but at least they won't have to add a bias layer beforehand.

@AtheMathmo
Owner

Thanks for making that change - it is definitely cleaner I think.

As for the Activation layer I'm not sure I understand the problem (and won't be able to dig into it until a bit later). But as a quick-and-dirty solution, could we not put a field on the Activation layer, something like Option<Matrix<f64>>, to store the information on the forward pass?

I'll try to figure out a better way later. If it really is impossible then maybe we can think of another approach to keep the API simple?

Comment thread src/learning/toolkit/net_layer.rs Outdated
}

/// Construct a Linear layer with a bias term
pub fn default(input_size: usize, output_size: usize) -> Linear {
Owner


I think it would make more sense to call this function with_bias?

In fact perhaps Linear::new creates a layer with bias by default, and we add a new function Linear::without_bias for the no bias case? I'm suggesting this as I think the layer without bias will be very rare and should impact the API very little as a result. What do you think?

Just to be clear I don't think the current approach is particularly bad and will be happy with keeping it if you think it is best

Contributor Author


I can make that change. new will include a bias and there'll be a separate without_bias.

@NivenT
Contributor Author

NivenT commented Sep 9, 2016

The issue is that during the forward pass, the layer would compute f(xW+b). During the backward pass, it would need to know the value of xW+b to calculate the gradient, but would only be given x, W, and b separately, causing it to perform the same matrix multiplication twice. The output of the layer f(xW+b) can be stored during the forward pass easily, but since xW+b is an intermediate calculation, storing it would require mutability, and compute_grad takes an immutable reference.
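One hypothetical workaround, not used in the PR, is interior mutability: a `std::cell::RefCell` field lets the forward pass cache xW+b even though the gradient method only receives `&self`. A sketch with made-up names, a plain `Vec` in place of a matrix, and ReLU as the example activation:

```rust
use std::cell::RefCell;

// Hypothetical sketch: cache the pre-activation value x*W + b behind a
// RefCell so the backward pass can reuse it instead of recomputing the
// matrix product. All names here are illustrative, not the PR's API.
struct Activation {
    cached_pre_activation: RefCell<Option<Vec<f64>>>,
}

impl Activation {
    fn forward(&self, pre_activation: Vec<f64>) -> Vec<f64> {
        // apply the nonlinearity (ReLU here), then stash x*W + b
        let out: Vec<f64> = pre_activation.iter().map(|&z| z.max(0.0)).collect();
        *self.cached_pre_activation.borrow_mut() = Some(pre_activation);
        out
    }

    fn backward(&self) -> Option<Vec<f64>> {
        // gradient of ReLU, computed from the cached pre-activation
        self.cached_pre_activation
            .borrow()
            .as_ref()
            .map(|z| z.iter().map(|&v| if v > 0.0 { 1.0 } else { 0.0 }).collect())
    }
}

fn main() {
    let layer = Activation { cached_pre_activation: RefCell::new(None) };
    assert_eq!(layer.forward(vec![-1.0, 2.0]), vec![0.0, 2.0]);
    assert_eq!(layer.backward(), Some(vec![0.0, 1.0]));
}
```

The trade-off is that the cache makes the layer stateful, which complicates sharing a layer across threads or calls.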

Comment thread src/learning/toolkit/net_layer.rs Outdated

impl Linear {
/// Construct a new Linear layer
pub fn with_bias(input_size: usize, output_size: usize) -> Linear {
Owner


Sorry I was a little unclear. I meant that this should be called new - we're fairly consistent in naming our constructors new and so users may search for this explicitly. Either way, we should also describe in the docs exactly what these methods do (i.e. They append a column of ones to the left of the input matrices if bias is on). Some examples would be nice so that we can enforce consistency later with doc tests too.

This is all really minor stuff so you can hold off on changes until I've been able to actually read the PR properly and give more informed feedback!

Contributor Author


Sorry, my bad. I'll remember to fix that.

Owner


It's not your fault at all! I just can't help being super picky with user-facing things (like docs and public function names). Feel free to let me know if you disagree (I'll be more than happy to hear you out). Also you can hang onto these minor changes until later if you prefer.

Comment thread examples/nnet-and_gate.rs
let layers = &[2, 1];
let criterion = BCECriterion::new(Regularization::L2(0.));
let mut model = NeuralNet::new(layers, criterion, StochasticGD::default());
let mut model = NeuralNet::mlp(layers, criterion, StochasticGD::default(), Sigmoid);
Owner


This isn't heavily related to your work but I think an inline comment here explaining what we're creating is a good idea.

i.e. // Create a new multilayer perceptron with Stochastic gradient descent and Sigmoid activation

Comment thread src/learning/nnet.rs
//!
//! // We will just use the default stochastic gradient descent.
//! let mut model = NeuralNet::new(layers, criterion, StochasticGD::default());
//! let mut model = NeuralNet::mlp(layers, criterion, StochasticGD::default(), Sigmoid);
Owner


Similar to comment on examples - as this is the module docs I think we should be super clear about what this line is doing.

@AtheMathmo
Owner

There is a merge conflict on this PR

I've had a chance to look through the PR now and there are a couple things that are worth resolving still (I think):

  • We shouldn't use apply in the Activation layer. Instead, use iterators with map and collect.
  • We should be able to remove the activations.push(inputs.clone()); line by replacing the access of the first element of activations with inputs. This makes the code more verbose but more efficient. If you have a better suggestion I'd love to hear it!
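A rough sketch of the single-pass style being suggested (the `activate` helper and `tanh` choice are illustrative, not the PR's actual code):

```rust
// Hypothetical sketch: instead of cloning the input and mutating it in
// place with apply, build the activated output in one iterator pass
// (map + collect), so each element is touched exactly once.
fn activate(input: &[f64]) -> Vec<f64> {
    input.iter().map(|&x| x.tanh()).collect()
}

fn main() {
    let out = activate(&[0.0, 1.0]);
    assert!(out[0].abs() < 1e-12);
    assert!((out[1] - 1.0f64.tanh()).abs() < 1e-12);
}
```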

Other than these two points you've done a great job of cutting down on the clones, thanks!


I've also done a little more digging into the Linear vs Activation layers that we were discussing before. It seems that other libraries also adopt the separation of these two and in fact I believe that it does make it easier to adapt more complex Activation functions.

Finally we might want to put some thought into composability and how we would extend this to more general layers. In particular:

  • Can we easily support other non-linearities like SoftMin/SoftMax?
  • Can we make it possible to combine layers for easy building? For example wrapping a Linear and an Activation layer together and adding these to the network?
  • What are the requirements for us to tackle higher-dimensional layers like convolution?

@NivenT
Contributor Author

NivenT commented Sep 21, 2016

I have been busy lately and haven't had time to look at your comments. Hopefully over the next few days I'll have some free time to take a close look at everything you said and resolve any issues. I've only glanced at your comments so far.

@AtheMathmo
Owner

No problem - take your time!

@AtheMathmo
Owner

I've just merged in a breaking change with #133 . Let me know if you need any help resolving - it's mostly some added trait imports.

We're also pretty close to doing a 0.5.0 release with some breaking changes. I am happy to wait on this to be finished (so we can include it as a breaking change). If you'd prefer to take some more time to work through this slowly we can postpone this to 0.6.0 (as we're pre 1.0.0 this can happen pretty soon after too!). Let me know if you have any thoughts.

@NivenT
Contributor Author

NivenT commented Sep 21, 2016

I'm fine with this being a part of either 0.5.0 or 0.6.0. If after a while I haven't been able to update everything to satisfactory standards, then go ahead and postpone. Otherwise, include it in 0.5.0.

While looking through the comments, I remembered that you wanted a nnet/mod.rs. Would you prefer to essentially just rename nnet.rs to nnet/mod.rs, or to keep nnet/mod.rs simple (pretty much just uses, similar to a C++ header file) and have a separate nnet/nnet.rs where the neural network implementation is kept?

As for the things you wanted to keep in mind:

  • I think something like SoftMax/SoftMin should be doable as things are now
  • I am not 100% sure what you mean by wrapping layers together. Do you mean something like this?
struct JointLayer<T, U> where T: NetLayer, U: NetLayer {
    etc.
}
  • I would need to think about that more to have a good answer.

@AtheMathmo
Owner

AtheMathmo commented Sep 21, 2016

I think the nnet/mod.rs approach is fine. I wouldn't want users to need to import like this: use learning::nnet::nnet::NeuralNet. We can always spread it out more later if needed. A separate module for layers would be good though: nnet/layers.rs.

The JointLayer you gave is an example of what I meant. I was wondering if we can get more general. For example allow arbitrary combinations of layers to be chained, something like:

#[derive(Clone)]
struct Container {
   layers: Vec<Box<Layer>>,
}

impl Layer for Container { ... }

This way we could do something like:

let tanh_container = Container::new(vec![linear, tanh]);

for _ in 0..5 {
    nnet.add(tanh_container.clone());
}

I don't remember how consistent I am with the new API but hopefully that makes what I mean clear. Is there an even nicer way that we can do this?

To be clear - I'm not expecting this to be included as part of the current PR. Just some thoughts about how we can proceed in future would be nice.

@NivenT
Contributor Author

NivenT commented Nov 22, 2016

I have an idea for why this PR is slower than the existing code. I think the slowdown may be due to dynamic dispatch: since the network uses trait objects, the compiler can't perform some of the optimizations it could before. This idea just came to me, and I'm not yet sure how this knowledge (if it's correct) could be used to speed things up.

On a brighter note, I think your earlier idea of having something like a Container struct for adding multiple layers at once is doable. Trait objects may not be clonable, but the objects implementing the traits can be. So, we can do something like this

trait NetLayer {
  fn box_clone(&self) -> Box<NetLayer>;
}

impl Clone for Box<NetLayer> {
  fn clone(&self) -> Self {
    self.box_clone()
  }
}

#[derive(Clone)]
struct Layer;

impl NetLayer for Layer {
  fn box_clone(&self) -> Box<NetLayer> {
    Box::new(self.clone())
  }
}

and that would allow us to clone Box<NetLayer>. On the off chance there is eventually a NetLayer that cannot be cloned, we can always make box_clone return an Option<_> instead.

@NivenT
Contributor Author

NivenT commented Nov 22, 2016

Since we don't really need random access to the layers, one possible way of getting rid of the dynamic dispatch would be to have some dedicated type just for storing the layers that we could implement as a heterogeneous list. It might look something like this

trait NetList: MarkerTrait {}

struct Nil;
impl NetList for Nil {}

struct NetCons<L: NetLayer, T: NetList>(L, T);
impl<L: NetLayer, T: NetList> NetList for NetCons<L, T> {}

We could then implement Iterator for NetList (along with some other functions like add/push) and use it basically the same way we currently use Vec<Box<NetLayer>>. I got this idea when I came across this.

It might be worth mentioning that storing the layers in a linked list like this makes adding new layers O(n) instead of O(1). This is probably inconsequential in almost all cases, since layers are usually only added once, and there usually isn't a ridiculous number of them.
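A self-contained sketch of that cons-cell encoding, with stand-in unit types for the layers and an illustrative `len` method added so the list can be walked; the whole chain is known at compile time, so no trait objects are involved:

```rust
// Hypothetical sketch: a heterogeneous "cons list" of layers. Each link
// stores one concrete layer type, so the full network type is fixed at
// compile time (here: NetCons<Linear, NetCons<Sigmoid, Nil>>).
trait NetLayer {}
struct Linear;
struct Sigmoid;
impl NetLayer for Linear {}
impl NetLayer for Sigmoid {}

trait NetList {
    fn len(&self) -> usize;
}

struct Nil; // empty list terminator
impl NetList for Nil {
    fn len(&self) -> usize { 0 }
}

struct NetCons<L: NetLayer, T: NetList>(L, T); // one layer plus the rest
impl<L: NetLayer, T: NetList> NetList for NetCons<L, T> {
    fn len(&self) -> usize { 1 + self.1.len() } // recurse down the tail
}

fn main() {
    let net = NetCons(Linear, NetCons(Sigmoid, Nil));
    assert_eq!(net.len(), 2);
}
```

As the discussion below notes, iterating such a list generically is the hard part, since each link yields a different type.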

@AtheMathmo
Owner

I'm so glad you came back to this! I think that the dynamic dispatch looks like a fairly likely culprit (in the absence of anything else).

The box clone is a pretty neat way to get around the issue but I'd suggest we hold off on that for now. Maybe we can come up with a tidier way once more of the pieces have settled.

I agree that the increased cost of adding layers isn't really an issue. I must admit I'm not totally sure how this would work (and probably wouldn't be until I saw it in code). But I think it is an idea definitely worth exploring. If it can improve the performance it would be a great change!

@NivenT
Contributor Author

NivenT commented Nov 23, 2016

I'm not gonna lie, I'm not totally sure how this will work either. This is my first time seeing heterogeneous lists, so I do not know how they are usually implemented, but they certainly look interesting. Luckily, other people have implemented them before, and so there should be resources out there for learning more about them, so I think figuring this out is doable. However, the more I think about how this would work, the more it feels like we'll still need some dynamic dispatching, but hopefully with fewer indirect calls than we have now.

@NivenT
Contributor Author

NivenT commented Dec 20, 2016

Sorry it's been a while. I worked on this for a couple days, and then stopped when I had to focus on school. I wish I had better news about the progress I've made. I am less confident that this is a good way to get rid of indirection.

As far as I have been able to figure out, creating a heterogeneous list is not too bad, but using one is proving difficult. The code I have working allows the user to create a heterogeneous list, but using one in the BaseNeuralNet struct would require storing the layers as a Box<LayerList> (the trait used by the list), which still uses dynamic dispatch. I don't know a way to get rid of this, since you would otherwise need to know at compile time which layers every net was using. This indirection alone (probably) wouldn't be too bad, but I'm not sure of a way to iterate over the layers without indirection. You can't use an iterator because it would be returning different types (one call to next may want to yield a Sigmoid and another a Linear, but iterators can only return a single type). You can't make a map function in the LayerList trait that applies a function to every layer, because the function would be applied to different types. I do not know what you can do to avoid using a Box<NetLayer>, and if we use that, then nothing has changed.

@AtheMathmo
Owner

Don't worry about the delay - it's been a busy few weeks for me too. And will likely be even busier over the next month.

It's a shame that the heterogeneous list didn't provide the answer. I'm hoping I'll find some time this week to sit down and read over the code. Maybe there's something else to spot that will help.

It may also be a good idea to confirm that dynamic dispatch is the issue. Try modifying the model to accept only one layer type and see whether the static model is faster than the equivalent dynamic model. At least this way we know we have targeted the right thing to spend resources on.

@NivenT
Contributor Author

NivenT commented Dec 22, 2016

Good idea to make sure dynamic dispatch is the issue. You may want to run your own test to confirm this, but when I tested it, it turned out to not be making a difference. I removed the activation layers from the mlp function, so the network in the benchmark only had Linear layers. I compared doing only this to both doing this and changing the type of the layers parameter to layers: Vec<Linear>, and there was no noticeable difference in benchmarks. Looks like this might not be the issue, which leaves me very confused as to what's going on.

@AtheMathmo
Owner

Thank you for checking - I guess in a way this is good news! It seems that fixing the dynamic dispatch issue would be quite some work...

I'm going to take another look at this tomorrow to try to figure it out. It will take me a little while to familiarize myself with the code again but hopefully I can figure something out. At this point I am quite tempted to accept the performance hit in exchange for more expressive code. Besides - if someone needs a performance critical neural net there are far better options. (Not to say performance in rusty-machine isn't important!).

@NivenT
Contributor Author

NivenT commented Jan 6, 2017

I was able to speed things up some. When I benched it, the latest commit was consistently faster than before, but still slightly slower than the current master branch.

My optimizations focused on the ActivationFunc layer.
The first one was to introduce a new function func_grad_from_output(y: f64) -> f64 as part of the ActivationFunc trait. I had Sigmoid specifically in mind when I thought of this. Its gradient can be calculated in terms of its output (f'(x)=f(x)(1-f(x))), so I thought it might help to be able to do this explicitly in terms of y=f(x) instead of recalculating f(x) given only x, since that involves relatively costly operations like division and exponentiation. While I had Sigmoid in mind, the same logic applies to the other activation functions here. With this, I added the layer's output as an argument to the back_input and back_params functions so this new function could be utilized.
The second thing I did was replace the clones that were there with explicit Vec constructions, so that the input (or output) matrix is traversed only once instead of twice.
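A sketch of the first idea for Sigmoid (function names mirror the description above, but the code is illustrative rather than the PR's implementation):

```rust
// Hypothetical sketch: since sigmoid(x) = 1 / (1 + e^-x), its gradient can
// be written purely in terms of the forward output y = f(x) as y * (1 - y),
// so the backward pass can skip a second call to exp().
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

fn func_grad(x: f64) -> f64 {
    let y = sigmoid(x); // recomputes the exponential
    y * (1.0 - y)
}

fn func_grad_from_output(y: f64) -> f64 {
    y * (1.0 - y) // no exp() needed when the forward output is available
}

fn main() {
    let x = 0.7;
    let y = sigmoid(x);
    // both formulations agree, but the second reuses the stored output
    assert!((func_grad(x) - func_grad_from_output(y)).abs() < 1e-12);
}
```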

There's still a performance hit, but at least it's smaller now.

@AtheMathmo
Owner

Thank you so much for your commitment to this! I've been busy moving country and so haven't had as much time as I'd have hoped to look into this.

This weekend I will try to find some time to sit down and read through the PR again properly. I think even with the performance hit this may be worth merging - for the reasons given in the above comment.

@botev

botev commented Jan 17, 2017

I came across this project by chance. I just want to point out that you should be thinking about autodiff, otherwise you are just brute-forcing something that is automatable. All of these "layers" and things like that should be reserved for basic ops only. Additionally, if you ever want to compete with the big frameworks you will need to build computation graphs. Anyway, good luck!

@AtheMathmo
Owner

@botev thanks for the comments!

You are right that this library is a long way away from being considered as an alternative for deep learning. But this has never really been a goal for me and the effort it would take is colossal. Maybe we will see some improvements in the directions you mention in the future, but they are a long way away! (GPU support would be necessary too).

@NivenT
Contributor Author

NivenT commented Jan 31, 2017

I have been looking at this some more over the past week, but I think I am all out of ideas. If you are still willing to accept the performance hit, I'm fine with that. If you are not, I will continue to try to find things that can be improved.

@AtheMathmo
Owner

Sorry for the delay - it's been a busy couple months for me and I'm just starting to get up and running again.

I took another look and only spotted one thing, which wouldn't affect the benchmarks anyway. When we run this with `RUSTFLAGS="-C target-cpu=native"` there is only a ~5% difference between the current master and this branch. I think this is a very acceptable performance loss given the other gains.

I expect that neural nets will see more large API changes in the future but I want to get this merged and put in as part of the upcoming breaking changes release. If you have time could you take one last look through the code and see if things look good to you @NivenT ?

@NivenT
Contributor Author

NivenT commented Feb 25, 2017

No need to apologize.

I just took a look at it, and I think things look good. I did two small things. I slightly rearranged the code in forward_prop so that it parallels how forward propagation is done in compute_grad. I think this way is cleaner. I had previously kept a pointer instead of an index because I thought it might be a little faster, but as far as I can tell, there's no real difference.

The second change was in your remove_first_col function, which was a really nice addition by the way. If I understand it correctly, it moves the first column of the matrix to the end of its underlying data array and then just ignores it. Since we aren't using the first column for anything afterwards, I figured there was no point in keeping it in memory, so I replaced the call to v.swap() with an assignment

v[i-del]=v[i]
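A hypothetical reconstruction of that shifting scheme on a row-major buffer (not the actual rusty-machine code): every kept element is shifted left over the slots freed by the dropped column, with `del` counting how many dropped entries precede index `i`, and the tail is then truncated.

```rust
// Hypothetical sketch: drop the first column of a rows x cols matrix stored
// row-major in a flat Vec, using the v[i - del] = v[i] assignment style.
fn remove_first_col(v: &mut Vec<f64>, rows: usize, cols: usize) {
    let mut del = 0;
    for i in 0..rows * cols {
        if i % cols == 0 {
            del += 1; // first entry of each row belongs to the dropped column
        } else {
            v[i - del] = v[i]; // shift kept entries left over the freed slots
        }
    }
    v.truncate(rows * (cols - 1)); // discard the now-unused tail
}

fn main() {
    // 2x3 matrix [[1,2,3],[4,5,6]] -> [[2,3],[5,6]]
    let mut v = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    remove_first_col(&mut v, 2, 3);
    assert_eq!(v, vec![2.0, 3.0, 5.0, 6.0]);
}
```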

@NivenT
Contributor Author

NivenT commented May 4, 2017

Do you have any remaining concerns or anything else that should be modified? @AtheMathmo

@zackmdavis
Collaborator

Thanks @NivenT and @tafia! 💖

@zackmdavis zackmdavis merged commit 34ea5d3 into AtheMathmo:master Oct 30, 2017