Updated neural net implementation #126
Conversation
|
Does anyone know why the list of added commits includes ones that have already been merged, and how I can get rid of them if possible? |
You can rebase and do

Edit: I just figured out why they appear - it is because I squashed these commits in earlier. It's not going to be an issue here, I believe. I haven't looked through the PR in detail yet, but I figured I could give a few comments on things first. (I'll also have to brush up on my neural nets, as it has been a little while since I've delved into the details.)
I don't object to this not being the default, but requiring bias as a separate layer seems counterintuitive to me. If there were some way to have it be included in the layer but optional, that would be preferable to me. Maybe I will understand the motivation more after digging into the code.
Will have to read a little more before answering this.
I think it sort of makes sense, but I agree that something else might be better - mostly so it's familiar to users coming from other frameworks.
I think this is fine, but from my brief look-over it would be nice if we could cut down on clones. For example I saw

Edit: oh, and I think I'd prefer to make a new |
|
After looking through a bit more I think that we'd like to get rid of the |
|
I did intentionally avoid that. I wasn't sure if there would be a use case where you apply an activation function after something other than a linear layer, so I wanted to keep the two separate in case there is. Part of my philosophy with the implementation was that having more layers allows for more modularity and simpler calculations (per layer), although admittedly, at the expense of more verbose/complicated net construction. |
|
I'd be interested to see some benchmarks to see whether there are performance advantages to this approach. Actually, there is a benchmark we can use already to check... My feeling is that the added complexity is definitely not worth it right now. Rusty-machine is far off being a competitor in the neural network space - I think its simplicity is one of the few reasons I thought the neural nets might be valuable. I'd like to see this changed in the future, but it is a lot of work. I think if we have to provide lots of information on how to use the neural networks in rusty-machine, they become a lot less valuable. Perhaps I am being short-sighted - what do you think? |
|
I understand what you mean. There's definitely value in keeping the API simple. |
|
I started working on combining the layers, and ran into an issue. The new

I think what I'll do is keep the |
|
Thanks for making that change - it is definitely cleaner, I think. As for the

I'll try to figure out a better way later. If it really is impossible, then maybe we can think of another approach to keep the API simple? |
```rust
/// Construct a Linear layer with a bias term
pub fn default(input_size: usize, output_size: usize) -> Linear {
```
I think it would make more sense to call this function `with_bias`?

In fact, perhaps `Linear::new` creates a layer with bias by default, and we add a new function `Linear::without_bias` for the no-bias case? I'm suggesting this as I think the layer without bias will be very rare, and should impact the API very little as a result. What do you think?

Just to be clear, I don't think the current approach is particularly bad, and I will be happy to keep it if you think it is best.
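To illustrate, a minimal sketch of the naming scheme being suggested (the struct fields and `has_bias` flag here are assumptions for illustration, not the PR's actual `Linear` definition):

```rust
// Hypothetical sketch of the constructor naming under discussion.
struct Linear {
    input_size: usize,
    output_size: usize,
    has_bias: bool,
}

impl Linear {
    /// Bias on by default, matching the common case.
    fn new(input_size: usize, output_size: usize) -> Linear {
        Linear { input_size, output_size, has_bias: true }
    }

    /// Rarely needed, so the longer name keeps the main API surface small.
    fn without_bias(input_size: usize, output_size: usize) -> Linear {
        Linear { input_size, output_size, has_bias: false }
    }
}

fn main() {
    assert!(Linear::new(2, 1).has_bias);
    assert!(!Linear::without_bias(2, 1).has_bias);
}
```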
I can make that change. `new` will include a bias, and there'll be a separate `without_bias`.
|
The issue is that during the forward pass, the layer would compute |
|
|
```rust
impl Linear {
    /// Construct a new Linear layer
    pub fn with_bias(input_size: usize, output_size: usize) -> Linear {
```
Sorry, I was a little unclear. I meant that this should be called `new` - we're fairly consistent in naming our constructors `new`, and so users may search for this explicitly. Either way, we should also describe in the docs exactly what these methods do (i.e. they append a column of ones to the left of the input matrices if bias is on). Some examples would be nice, so that we can enforce consistency later with doc tests too.

This is all really minor stuff, so you can hold off on changes until I've been able to actually read the PR properly and give more informed feedback!
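The column-of-ones trick mentioned above can be sketched like this (helper name and plain `Vec` representation are assumptions for illustration; the real code works on rulinalg matrices):

```rust
// Hypothetical sketch: prepending a constant 1 to each input row lets the
// first row of a following linear layer's weight matrix act as a bias term.
fn with_ones_column(input: &[Vec<f64>]) -> Vec<Vec<f64>> {
    input
        .iter()
        .map(|row| {
            let mut augmented = Vec::with_capacity(row.len() + 1);
            augmented.push(1.0); // constant entry that multiplies the bias weights
            augmented.extend_from_slice(row);
            augmented
        })
        .collect()
}

fn main() {
    let batch = vec![vec![2.0, 3.0], vec![4.0, 5.0]];
    let augmented = with_ones_column(&batch);
    assert_eq!(augmented[0], vec![1.0, 2.0, 3.0]);
    assert_eq!(augmented[1], vec![1.0, 4.0, 5.0]);
}
```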
Sorry, my bad. I'll remember to fix that.
It's not your fault at all! I just can't help being super picky with user-facing things (like docs and public function names). Feel free to let me know if you disagree (I'll be more than happy to hear you out). Also you can hang onto these minor changes until later if you prefer.
```diff
  let layers = &[2, 1];
  let criterion = BCECriterion::new(Regularization::L2(0.));
- let mut model = NeuralNet::new(layers, criterion, StochasticGD::default());
+ let mut model = NeuralNet::mlp(layers, criterion, StochasticGD::default(), Sigmoid);
```
This isn't heavily related to your work but I think an inline comment here explaining what we're creating is a good idea.
i.e. `// Create a new multilayer perceptron with stochastic gradient descent and Sigmoid activation`
```diff
  //!
  //! // We will just use the default stochastic gradient descent.
- //! let mut model = NeuralNet::new(layers, criterion, StochasticGD::default());
+ //! let mut model = NeuralNet::mlp(layers, criterion, StochasticGD::default(), Sigmoid);
```
Similar to comment on examples - as this is the module docs I think we should be super clear about what this line is doing.
|
There is a merge conflict on this PR.

I've had a chance to look through the PR now, and there are a couple of things that are worth resolving still (I think):
Other than these two points, you've done a great job of cutting down on the clones - thanks! I've also done a little more digging into the

Finally, we might want to put some thought into composability, and how we would extend this to more general layers. In particular:
|
|
I have been busy lately and haven't had time to look at your comments. Hopefully over the next few days I'll have some free time to take a close look at everything you said and resolve any issues. I've only glanced at your comments so far. |
|
No problem - take your time! |
|
I've just merged in a breaking change with #133 . Let me know if you need any help resolving - it's mostly some added trait imports. We're also pretty close to doing a 0.5.0 release with some breaking changes. I am happy to wait on this to be finished (so we can include it as a breaking change). If you'd prefer to take some more time to work through this slowly we can postpone this to 0.6.0 (as we're pre 1.0.0 this can happen pretty soon after too!). Let me know if you have any thoughts. |
|
I'm fine with this being a part of either 0.5.0 or 0.6.0. If after a while I haven't been able to update everything to satisfactory standards, then go ahead and postpone. Otherwise, include it in 0.5.0. While looking through the comments, I remembered that you wanted a

As for the things you wanted to keep in mind: |
```rust
struct JointLayer<T, U> where T: NetLayer, U: NetLayer {
    // etc.
}
```
|
|
I think the

The

```rust
#[derive(Clone)]
struct Container {
    layers: Vec<Box<Layer>>,
}

impl Layer for Container { ... }
```

This way we could do something like:

```rust
let tanh_container = Container::new(vec![linear, tanh]);
for _ in 0..5 {
    nnet.add(tanh_container.clone());
}
```

I don't remember how consistent I am with the new API, but hopefully that makes what I mean clear. Is there an even nicer way that we can do this? To be clear - I'm not expecting this to be included as part of the current PR. Just some thoughts about how we can proceed in future would be nice. |
|
I have an idea for why this PR is slower than the existing code. I think the slowdown may be due to it using dynamic dispatch, since it uses trait objects, stopping the compiler from performing some of the optimizations it could before. This idea just came to me, and I'm not yet sure how this knowledge (if it's correct) could be used to speed things up.

On a brighter note, I think your earlier idea of having something like a

```rust
trait NetLayer {
    fn box_clone(&self) -> Box<NetLayer>;
}

impl Clone for Box<NetLayer> {
    fn clone(&self) -> Self {
        self.box_clone()
    }
}

#[derive(Clone)]
struct Layer;

impl NetLayer for Layer {
    fn box_clone(&self) -> Box<NetLayer> {
        Box::new(self.clone())
    }
}
```

and that would allow us to clone |
|
Since we don't really need random access to the layers, one possible way of getting rid of the dynamic dispatch would be to have some dedicated type just for storing the layers, which we could implement as a heterogeneous list. It might look something like this:

```rust
trait NetList: MarkerTrait {}

struct Nil;
impl NetList for Nil {}

struct NetCons<L: NetLayer, T: NetList>(L, T);
impl<L: NetLayer, T: NetList> NetList for NetCons<L, T> {}
```

We could then implement

It might be worth mentioning that storing the layers in a linked list like this makes adding new layers O(n) instead of O(1). This is probably inconsequential in almost all cases, since layers are usually only added once, and there usually isn't a ridiculous number of them. |
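As a self-contained illustration of how a forward pass could fold over such a cons-list with purely static dispatch, here is a sketch using scalar "layers" instead of matrices (all names besides `NetCons`/`Nil` are assumptions, not the PR's actual API):

```rust
// Hypothetical sketch: a statically-dispatched forward pass over a
// heterogeneous cons-list of layers.
trait NetLayer {
    fn forward(&self, x: f64) -> f64;
}

trait NetList {
    fn forward_all(&self, x: f64) -> f64;
}

struct Nil;
impl NetList for Nil {
    fn forward_all(&self, x: f64) -> f64 { x } // empty list: identity
}

struct NetCons<L: NetLayer, T: NetList>(L, T);
impl<L: NetLayer, T: NetList> NetList for NetCons<L, T> {
    fn forward_all(&self, x: f64) -> f64 {
        self.1.forward_all(self.0.forward(x)) // apply head, recurse on tail
    }
}

struct Double;
impl NetLayer for Double {
    fn forward(&self, x: f64) -> f64 { 2.0 * x }
}

fn main() {
    let net = NetCons(Double, NetCons(Double, Nil));
    assert_eq!(net.forward_all(1.5), 6.0);
}
```

Since each `NetCons` is a distinct concrete type, every `forward` call here is monomorphized rather than going through a vtable.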
|
I'm so glad you came back to this! I think that the dynamic dispatch looks like a fairly likely culprit (in the absence of anything else). The box clone is a pretty neat way to get around the issue, but I'd suggest we hold off on that for now. Maybe we can come up with a tidier way once more of the pieces have settled. I agree that the cost of adding layers increasing isn't really an issue. I must admit I'm not totally sure how this would work (and probably wouldn't be until I saw it in code). But I think it is an idea definitely worth exploring. If it can improve the performance it would be a great change! |
|
I'm not gonna lie, I'm not totally sure how this will work either. This is my first time seeing heterogeneous lists, so I do not know how they are usually implemented, but they certainly look interesting. Luckily, other people have implemented them before, and so there should be resources out there for learning more about them, so I think figuring this out is doable. However, the more I think about how this would work, the more it feels like we'll still need some dynamic dispatching, but hopefully with fewer indirect calls than we have now. |
|
Sorry it's been a while. I worked on this for a couple of days, and then stopped when I had to focus on school. I wish I had better news about the progress I've made. I am less confident that this is a good way to get rid of indirection. As far as I have been able to figure out, creating a heterogeneous list is not too bad, but using one is proving difficult. The code I have working allows the user to create a heterogeneous list, but to use one in the |
|
Don't worry about the delay - it's been a busy few weeks for me too, and will likely be even busier over the next month. It's a shame that the heterogeneous list didn't provide the answer. I'm hoping I'll find some time this week to sit down and read over the code. Maybe there's something else to spot that will help. It may also be a good idea to confirm that dynamic dispatch is the issue. Try modifying the model to only accept one layer type, and see if the static model is faster than the equivalent dynamic model. At least this way we know we have targeted the right thing to spend resources on. |
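A minimal way to set up that A/B comparison (all types here are hypothetical stand-ins, not rusty-machine's API) is to hold the same layer behind a generic parameter in one model and behind a trait object in the other, then benchmark both forward passes:

```rust
// Hypothetical sketch: static vs dynamic dispatch over the same layer type.
trait NetLayer {
    fn forward(&self, x: f64) -> f64;
}

struct Scale(f64);
impl NetLayer for Scale {
    fn forward(&self, x: f64) -> f64 { self.0 * x }
}

// Statically dispatched: monomorphized, calls can be inlined.
struct StaticNet<L: NetLayer> { layers: Vec<L> }
impl<L: NetLayer> StaticNet<L> {
    fn forward(&self, mut x: f64) -> f64 {
        for layer in &self.layers { x = layer.forward(x); }
        x
    }
}

// Dynamically dispatched: each call goes through the vtable.
struct DynNet { layers: Vec<Box<dyn NetLayer>> }
impl DynNet {
    fn forward(&self, mut x: f64) -> f64 {
        for layer in &self.layers { x = layer.forward(x); }
        x
    }
}

fn main() {
    let s = StaticNet { layers: vec![Scale(2.0), Scale(3.0)] };
    let d = DynNet { layers: vec![Box::new(Scale(2.0)), Box::new(Scale(3.0))] };
    assert_eq!(s.forward(1.0), 6.0);
    assert_eq!(d.forward(1.0), 6.0);
}
```

If the `StaticNet` variant benchmarks no faster than `DynNet` on the same workload, dynamic dispatch probably isn't the bottleneck.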
|
Good idea to make sure dynamic dispatch is the issue. You may want to run your own test to confirm this, but when I tested it, it turned out to not be making a difference. I removed the activation layers from the |
|
Thank you for checking - I guess in a way this is good news! It seems that fixing the dynamic dispatch issue would be quite some work... I'm going to take another look at this tomorrow to try to figure it out. It will take me a little while to familiarize myself with the code again but hopefully I can figure something out. At this point I am quite tempted to accept the performance hit in exchange for more expressive code. Besides - if someone needs a performance critical neural net there are far better options. (Not to say performance in rusty-machine isn't important!). |
|
I was able to speed things up some. When I benched it, the latest commit was consistently faster than before, but still slightly slower than the current master branch. My optimizations focused on the

There's still a performance hit, but at least it's smaller now. |
|
Thank you so much for your commitment to this! I've been busy moving country and so haven't had as much time as I'd have hoped to look into this. This weekend I will try to find some time to sit down and read through the PR again properly. I think even with the performance hit this may be worth merging - for the reasons given in the above comment. |
|
I came across this project by chance. I just want to point out that you should be thinking about autodiff; otherwise you are just brute-forcing something that is automatable. All of these "layers" and the like should be reserved for basic ops only. Additionally, if you ever want to compete with the big frameworks, you will need to build computation graphs. Anyway, good luck! |
|
@botev thanks for the comments! You are right that this library is a long way away from being considered as an alternative for deep learning. But this has never really been a goal for me and the effort it would take is colossal. Maybe we will see some improvements in the directions you mention in the future, but they are a long way away! (GPU support would be necessary too). |
|
I have been looking at this some more over the past week, but I think I am all out of ideas. If you are still willing to accept the performance hit, I'm fine with that. If you are not, I will continue to try to find things that can be improved. |
|
Sorry for the delay - it's been a busy couple of months for me and I'm just starting to get up and running again. I took another look and only spotted one thing, which wouldn't affect the benchmarks anyway. When we run this with `RUSTFLAGS="-C target-cpu=native"` there is only a ~5% difference between the current master and this. I think this is a very acceptable performance loss given the other gains. I expect that neural nets will see more large API changes in the future, but I want to get this merged and put in as part of the upcoming breaking changes release. If you have time, could you take one last look through the code and see if things look good to you @NivenT ? |
|
No need to apologize. I just took a look at it, and I think things look good. I did two small things. I slightly rearranged the code in

The second change was in your

`v[i-del]=v[i]` |
|
Do you have any remaining concerns or anything else that should be modified? @AtheMathmo |
This PR is a response to #91, but is not yet ready to be merged.
This PR introduces the `NetLayer` trait for describing the requirements for being a layer in a neural network. It changes the method of storing layers in neural networks, from only storing their sizes to explicitly storing a `Vec<Box<NetLayer>>`, allowing different types of layers to be defined and combined.

The code currently passes all tests, and the AND gate example works as well, but I could use some feedback before this is merge-ready. More specifically, there were several decisions I made writing this code that I am not sure were the best choices.
Linear doesn't have a bias
`Linear` layers (which are standard fully connected layers) don't include a bias term. Instead there is a separate `Bias` layer which appends a 1 to the end of the input, allowing a following linear layer to emulate having a bias. This feels a little roundabout, but not including a bias simplifies the linear code, and lets the user easily choose whether or not she wants to include a bias.

num_params/param_shape redundant
The `NetLayer` trait requires a function that returns the number of parameters, and a separate function that returns the shape of the parameter matrix. You could easily get the number of parameters from their shape, so is the first function necessary?

Name of linear layer
I am not sure `Linear` is the most obvious name for the layer, but I couldn't think of a better one.

Default Implementations on NetLayer
Some layers (`Bias`, `ActivationFunc`, pooling layers in the future) have no parameters. Would it be worthwhile to have default implementations for `num_params`, `default_params`, etc. on the `NetLayer` trait that assume there are no parameters, or is it better not to have those, to make sure no one forgets to implement them on a layer that does have parameters?
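For concreteness, a minimal sketch of what such default implementations could look like (method names match the question above, but the bodies and the example layers are assumptions, not the PR's actual code):

```rust
// Hypothetical sketch: default no-parameter implementations on a
// NetLayer-like trait.
trait NetLayer {
    /// Defaults describe a parameterless layer.
    fn num_params(&self) -> usize { 0 }
    fn param_shape(&self) -> (usize, usize) { (0, 0) }
}

struct Bias;
impl NetLayer for Bias {} // relies entirely on the defaults

struct Linear { rows: usize, cols: usize }
impl NetLayer for Linear {
    // Layers with parameters must remember to override both methods.
    fn param_shape(&self) -> (usize, usize) { (self.rows, self.cols) }
    fn num_params(&self) -> usize { self.rows * self.cols }
}

fn main() {
    assert_eq!(Bias.num_params(), 0);
    assert_eq!(Linear { rows: 3, cols: 2 }.num_params(), 6);
}
```

The trade-off is exactly the one raised above: the defaults remove boilerplate for `Bias`-like layers, but the compiler can no longer force a parameterful layer to override them.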
`Box<NetLayer>` vs `&NetLayer`

Neural networks currently store a vector of `Box<NetLayer>`. Should they instead store `&NetLayer`?

NetLayer/ActivationFunc extend other traits
`NetLayer` inherits from `Debug`, and `ActivationFunc` now inherits from `Clone + Debug`. I ran into issues with `NetLayer` needing `Debug`, since `BaseNeuralNetwork` derives it. I ran into similar issues that resulted in me having `ActivationFunc` inherit `Clone + Debug`.

Old weight functions
If you compile the code as is, the compiler will warn about a few unused functions, all having to do with the weights of the network. These were all created with the previous way of storing layers in mind. I was planning on just deleting them, but decided to leave them there for now, in case someone sees value in keeping them adapted for the new storage method.
Aside from those, the only other major thing to know is that the implementation doesn't take regularization into account yet during backpropagation. Any feedback, complaints, or improvements?