This repository was archived by the owner on Jul 16, 2021. It is now read-only.

Add cross validation#125

Merged
AtheMathmo merged 34 commits into AtheMathmo:master from theotherphil:crossvalidation on Sep 18, 2016

Conversation

@theotherphil
Contributor

@theotherphil theotherphil commented Sep 4, 2016

Not ready to merge yet, but wanted to get a bit of feedback.

The main issue at the moment is that we allocate fresh matrices for each of the k times we train and validate. As discussed previously, we should be able to allocate once and reuse the allocation for each of the folds. However, this would require the inputs and targets of k_fold_validate to be MatrixSlices, and I don't think any of the existing models support this yet.

The other issue is that there's a lot of boilerplate here to avoid allocating or copying when creating the indices to train and test on for each fold. This seems a bit overkill given all the work we do to copy all the sample data around (even if we manage to get rid of the allocations in copy_rows).

Finally, we might want to run this in parallel, but then we really would need a separate copy of the input data for each fold.

Any thoughts?
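The per-fold allocation question above can be illustrated with a minimal, library-independent sketch of how k-fold index partitions are typically built. All names here are hypothetical and not rusty-machine's API; this version assumes n % k == 0 for simplicity.

```rust
// Hypothetical sketch of k-fold index partitioning, independent of
// rusty-machine's types: for each fold, one contiguous chunk of row
// indices is held out for validation and the rest are used for training.
fn k_fold_indices(n: usize, k: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    let fold_size = n / k; // assumes n % k == 0 for simplicity
    (0..k)
        .map(|fold| {
            let start = fold * fold_size;
            let end = start + fold_size;
            // Held-out rows for this fold.
            let test: Vec<usize> = (start..end).collect();
            // Everything else is used for training.
            let train: Vec<usize> = (0..n).filter(|i| *i < start || *i >= end).collect();
            (train, test)
        })
        .collect()
}

fn main() {
    for (train, test) in k_fold_indices(6, 3) {
        println!("train: {:?}, test: {:?}", train, test);
    }
}
```

Note that each call allocates two fresh index vectors per fold, which is exactly the overhead the comment above would like to avoid by reusing allocations or slicing.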

@theotherphil
Contributor Author

I suspect that merging this might have to wait until the model trait APIs have been revamped. At least it gives us a concrete example of something we'd like to support but is currently impossible.

@AtheMathmo
Owner

I have only had a brief look but other than the things you already mention this looks good to me.

At first I wasn't sure about having this in learning::toolkit. My plan for now was to put things used internally by the models in that module, and to create a new module at the root level which wraps the cost functions and provides cross validation etc. But I think this makes more sense for now.

for &row in rows {
    for col in 0..mat.cols() {
        data[idx] = mat[[row, col]];
        idx += 1;
Owner

@AtheMathmo AtheMathmo Sep 4, 2016


We can make this loop more efficient and remove the bounds checks on both mat and data. Something like this:

let mut data = Vec::with_capacity(num_rows * mat.cols());

for &row in rows {
    unsafe {
        data.extend_from_slice(mat.get_row_unchecked(row));
    }
}

Edit: I realise that when we get rulinalg 0.3.1 we will get the equivalent of what I've said above.

Contributor Author


Yes, I was going to either use select_rows when #117 is merged, or wait until MatrixSlices become widespread and do this without the allocations.

@theotherphil
Contributor Author

I've updated the pull request to handle n % k != 0.
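One common convention for handling n % k != 0 (an assumption about the general approach, not necessarily what this PR implements) is to give the first n % k folds one extra sample each, so that every sample lands in exactly one test fold:

```rust
// Hypothetical sketch: distribute n samples over k folds when k does
// not divide n. The first n % k folds each take one extra sample.
fn fold_sizes(n: usize, k: usize) -> Vec<usize> {
    let base = n / k;
    let extra = n % k;
    (0..k)
        .map(|i| if i < extra { base + 1 } else { base })
        .collect()
}

fn main() {
    // 10 samples over 3 folds -> sizes 4, 3, 3
    println!("{:?}", fold_sizes(10, 3));
}
```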

@AtheMathmo
Owner

Hey @theotherphil!

I'm sorry that I've let this sit for a while... I recently pushed some breaking changes which will be part of a 0.5.0 release. This means that training/predicting from a model will return a Result instead of panicking. Would you mind updating this PR with the new changes?

I think this will mean the k_fold_validate function returning a LearningResult<Vec<f64>> and wrapping the train/predict functions with a try! macro.

I think once this is done I'd like to get this merged - if you are happy with that as well, of course. Let me know!
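The LearningResult/try! shape suggested above can be sketched with stand-in types; the real rusty-machine Error type and model traits differ in detail, and train_on_fold/score_fold here are made-up placeholders:

```rust
// Sketch of error propagation through k-fold validation: each fold's
// train/predict can fail, and `?` (the modern spelling of the `try!`
// macro) propagates the first error out of k_fold_validate.
#[derive(Debug, PartialEq)]
struct Error(&'static str);
type LearningResult<T> = Result<T, Error>;

// Stand-in for a model's fallible training step on one fold.
fn train_on_fold(fold: usize, k: usize) -> LearningResult<()> {
    if fold < k { Ok(()) } else { Err(Error("invalid fold")) }
}

// Stand-in for predicting and scoring one fold.
fn score_fold(fold: usize) -> LearningResult<f64> {
    Ok(fold as f64) // placeholder per-fold score
}

fn k_fold_validate(k: usize) -> LearningResult<Vec<f64>> {
    let mut scores = Vec::with_capacity(k);
    for fold in 0..k {
        train_on_fold(fold, k)?; // was: try!(train_on_fold(fold, k))
        scores.push(score_fold(fold)?);
    }
    Ok(scores)
}

fn main() {
    println!("{:?}", k_fold_validate(3));
}
```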

@theotherphil
Contributor Author

No worries, I wasn't in any rush. I'll update later today to reflect your changes. Are you also merging #117 as part of the 0.5 release?

@theotherphil
Contributor Author

theotherphil commented Sep 17, 2016

A few things I'm not sure about:

  1. Should the return type really be LearningResult<Vec<f64>>? LearningResult, and the Error type it uses, appear to be used only for the results of training a single model once, rather than for the outcome of something that just happened to involve training models. If the result should be LearningResult, perhaps the documentation/members of ErrorKind should be updated to explain that it is used more generally than to describe the outcome of a single call to train
  2. Ignoring the which-Result-type issue, is Vec<f64> the information we want to return? Should we wrap this in another type, and/or provide some statistics rather than just per-fold scores?
  3. Should we be using the existing CostFunc trait here, or should I add a new trait just for this? Multiple traits for basically the same thing seem unnecessary, but a) we don't want to require users to provide a derivative here as we don't use it, and b) maybe CostFunc has the wrong sign? Normally higher scores would be better, but here we're returning a vector of costs.
  4. Should model: &mut M be replaced with a function to create fresh models? This function currently calls model.train() multiple times, and expects that this means to retrain from scratch on the new data. Is that currently the case for all models in this library, and if so could we document the requirement? If not then I'll definitely have to change this method.
  5. Should I add a pub use cross_validation statement to lib.rs?

@AtheMathmo
Owner

Are you also merging 117 as part of the 0.5 release?

That's the plan! There are a lot more breaking changes to follow that - with #124 and others - but for now I'll cut the bump before those.

With regards to your other comments:

  1. The LearningResult is used for more than just training a single time. It is also currently used for predicting and is meant to be used for other issues in the learning module. I think that in this case the LearningResult feels appropriate to me but we may have to make some changes. More on this in point 3.
  2. I think it would be great to have a more robust structure than Vec. Something which can handle other forms of cross validation would be ideal. I'll leave this one up to you - I'm happy to merge in Vec if you want to handle this later.
  3. I think that we should have a new trait for this. I also think that the cross validation module should exist elsewhere. We need some kind of module for model analysis really - things like AUC, ROC, model selection. I think this should be separate from the learning models and perhaps have its own Result types. Do you have any suggestions on what this should look like? Maybe rulinalg::analysis::cross_validation ?
  4. Right now (at least) model.train will always mean training from scratch. At some point in the future I'd like to add a new trait so that we can update a model to incrementally train. I think that keeping the signature as-is makes the most sense to me right now. If you disagree I'd be happy to hear your thoughts though!
  5. I think if we move to a new module this won't be necessary. For now I want to keep the root fairly clean - this may change later if the API is more stable.

@theotherphil
Contributor Author

Great, thanks.

@theotherphil
Contributor Author

What're your thoughts on single-function traits? e.g. should I have a Scorer<T> trait with a single score function or just use F: Fn(&T, &T) -> f64? The former makes it easier to explain conventions (e.g. higher is always better) and leads to shorter trait bounds, but the latter is more flexible.

@AtheMathmo
Owner

I think I'd prefer the latter, F: Fn(&T, &T) -> f64 is probably easier to maintain. But I think I'd need to see the context to really be sure.

In terms of explaining the convention we can hopefully do that with docs? I'm not against you using the Scorer<T> trait if you prefer this though!
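The closure-based option can be sketched like this; validate_with and row_accuracy are illustrative names rather than the library's, and slices stand in for the matrix types:

```rust
// Sketch of the `F: Fn(&T, &T) -> f64` option: any plain function or
// closure with the right shape works as a scorer, no extra trait needed.
fn validate_with<F>(outputs: &[f64], targets: &[f64], score: F) -> f64
where
    F: Fn(&[f64], &[f64]) -> f64,
{
    score(outputs, targets)
}

// Convention (documented rather than enforced by a trait): higher is better.
fn row_accuracy(outputs: &[f64], targets: &[f64]) -> f64 {
    let hits = outputs.iter().zip(targets).filter(|(o, t)| o == t).count();
    hits as f64 / targets.len() as f64
}

fn main() {
    // A named function and an ad-hoc closure both satisfy the bound.
    let acc = validate_with(&[1.0, 0.0, 1.0], &[1.0, 1.0, 1.0], row_accuracy);
    println!("accuracy: {}", acc);
}
```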

@theotherphil
Contributor Author

I'll go with your suggestion, thanks.

@theotherphil
Contributor Author

theotherphil commented Sep 18, 2016

I've now done everything you suggested, apart from speeding up copy_rows - I'll leave that one and just replace it with select_rows when #117 is merged.

The current API isn't great, but making it better is at least partly blocked by #124. I think having a non-empty analysis module with something at least usable in it is better than nothing, and lowers the barrier to adding more useful functions in this area. However, I definitely plan on changing the API in the fairly near future, so if you'd rather avoid constant breaking changes we could hold off on merging.

If you're happy to merge I'll (find out how to) squash the commits.

Edit: I tried to squash the commits. Looks like I failed!

@theotherphil
Contributor Author

A change that I plan on making soon-ish (post-merge) is making k_fold_validate a wrapper for a cross_validate function which just maps an Iterator of "partitions" to an Iterator of (output, target) pairs. k-fold vs pre-specified vs stratified k-fold vs whatever then just becomes a choice of partition iterator, and any scoring is just a function of the results iterator.

However, it's quite tricky to make this work nicely. If we make Partition just a pair of Vec<usize> (one for the indices to use for training and the other for testing) then we allocate two fresh vectors of indices for every fold (or at least copy all the indices a lot - we could probably pre-allocate the memory for the vectors). If we instead make partitions a pair of iterators of arbitrary type with Item=&'a usize then the boilerplate required to actually do anything goes through the roof. As we currently allocate and copy the feature data everywhere the extra impact from doing this for the indices too is probably pretty negligible. However, we probably want it all to be nice and efficient eventually.

That's possibly too vague to comment sensibly on. I'll have a go at both approaches (pairs of vectors, vs arbitrary iterators of usize references) somewhere so that we have something concrete to look at.

Hopefully getting a nice answer here will give some useful feedback for designing other parts of this library, rather than just being massive over-engineering of a very simple function.
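One possible shape for the "pair of vectors" variant of that idea, as a rough standalone sketch (Partition, cross_validate, and the toy evaluate closure are all hypothetical):

```rust
// Hypothetical sketch of cross_validate over an iterator of partitions:
// k-fold, stratified, or pre-specified splits all become just different
// ways of producing (train, test) index pairs.
type Partition = (Vec<usize>, Vec<usize>);

fn cross_validate<I, F>(partitions: I, evaluate: F) -> Vec<f64>
where
    I: Iterator<Item = Partition>,
    F: Fn(&[usize], &[usize]) -> f64,
{
    partitions
        .map(|(train, test)| evaluate(&train, &test))
        .collect()
}

fn main() {
    // A pre-specified two-fold split; the "score" here just counts test rows.
    let folds = vec![
        (vec![2, 3], vec![0, 1]),
        (vec![0, 1], vec![2, 3]),
    ];
    let scores = cross_validate(folds.into_iter(), |_train, test| test.len() as f64);
    println!("{:?}", scores);
}
```

This is the allocation-heavy variant: each partition owns its two index vectors, which is simple but copies indices per fold, exactly the trade-off described above.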

Implement copy_rows

Implement Folds so that only a single array of indices is allocated. It's currently a bit of a mess...

Tidy cross-validation implementation.

Fix comment

Support n % k != 0 in Folds iterator

Format comment

Fix build now that train and predict return Results. Proper error handling still TODO

Add analysis/score.rs

Use Fn(&Mat, &Mat) -> f64 in cross validation instead of CostFunc

Move cross_validation into analysis module

Remove unwraps from k_fold_validate, change return type to LearningResult<Vec<f64>>

Move k_fold_validate example from examples into doc comment. Remove redundant module comments

Very WIP cross validation

@AtheMathmo
Owner

Thanks for the extra work on this! I think that it is really close. The only issues I see are with documentation.

Is there a reason you removed the Naive Bayes cross validation example (in the examples directory)? I think it's really valuable to have a full working example of the cross validation there. This is a good place for us to provide verbose explanations of what each component is.

In the docs for k_fold_validate we should explain what each of the arguments is. I think this is especially important because of the Fn argument.

Oh and don't worry about squashing the commits I can do that just before merging.

@theotherphil
Contributor Author

I've added documentation for the arguments. I moved the Naive Bayes example into the doc comments for k_fold_validate. I think the example is more useful in its new location but can move it back/duplicate it if you feel strongly about it.

@AtheMathmo
Owner

I think it's fine to keep in the docs - though in the future I'd like to have a full example in the examples directory.

I think we should at least add some inline comments to the doc comment example - for example describing what row_accuracy does. I would also modify the let accuracy line a little:

let accuracy_per_fold: Vec<f64> = k_fold_validate(
    &mut model,
    &inputs,
    &targets,
    3,
    row_accuracy).unwrap();

@theotherphil
Contributor Author

Done.

@AtheMathmo
Owner

This looks good! Thanks for your patience.

Are you happy for me to merge this now? I think I'll try to get #117 in within the next few days - if you'd prefer to wait to make the other changes that's fine too!

@theotherphil
Contributor Author

Great, thanks. I'm happy for this to be merged now.

@AtheMathmo AtheMathmo merged commit cb55fb1 into AtheMathmo:master Sep 18, 2016