Conversation
I suspect that merging this might have to wait until the model trait APIs have been revamped. At least it gives us a concrete example of something we'd like to support but is currently impossible.

I have only had a brief look, but other than the things you already mention this looks good to me. At first I wasn't sure about having this in

for &row in rows {
    for col in 0..mat.cols() {
        data[idx] = mat[[row, col]];
        idx += 1;
    }
}
We can make this loop more efficient and remove the bounds checks on both mat and data. Something like this:
let mut data = Vec::with_capacity(num_rows * mat.cols());
for &row in rows {
unsafe {
data.extend_from_slice(mat.get_row_unchecked(row));
}
}

Edit: I realise that when we get rulinalg 0.3.1 we will get the equivalent of what I've said above.
Yes, I was going to either use select_rows when #117 is merged, or wait until MatrixSlices become widespread and do this without the allocations.

I've updated the pull request to handle n % k != 0.
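For context, a standard way to handle n % k != 0 is to give the first n % k folds one extra sample, so that fold sizes differ by at most one and every sample is used exactly once. A minimal standalone sketch of that sizing rule (illustrative only; `fold_sizes` is not the PR's actual code):

```rust
/// Hypothetical helper: sizes of `k` folds over `n` samples.
/// The first `n % k` folds get one extra sample, so the sizes
/// always sum back to `n` even when `k` does not divide `n`.
fn fold_sizes(n: usize, k: usize) -> Vec<usize> {
    let base = n / k;
    let extra = n % k;
    (0..k)
        .map(|i| if i < extra { base + 1 } else { base })
        .collect()
}

fn main() {
    // 10 samples over 3 folds: sizes 4, 3, 3.
    assert_eq!(fold_sizes(10, 3), vec![4, 3, 3]);
    // Sizes always sum back to n.
    assert_eq!(fold_sizes(7, 4).iter().sum::<usize>(), 7);
}
```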
Hey @theotherphil! I'm sorry that I've let this sit for a while... I recently pushed some breaking changes which will be part of a 0.5.0 release. This means that training/predicting from a model will return a Result. I think once this is done I'd like to get this merged - if you are happy with that as well, of course. Let me know!

No worries, I wasn't in any rush. I'll update later today to reflect your changes. Are you also merging #117 as part of the 0.5 release?

A few things I'm not sure about:
That's the plan! There are a lot more breaking changes to follow that - with #124 and others - but for now I'll cut the version bump before those. With regards to your other comments:

Great, thanks.

What're your thoughts on single-function traits? e.g. should I have a

I think I'd prefer the latter. In terms of explaining the convention we can hopefully do that with docs? I'm not against you using the

I'll go with your suggestion, thanks.
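For reference, here is a sketch of the two shapes being weighed up. The names (`Score`, `evaluate`) and the placeholder `&[f64]` types are illustrative, not the PR's identifiers; the commits below settled on the second form, a plain `Fn` bound (`Fn(&Mat, &Mat) -> f64` instead of a `CostFunc` trait):

```rust
// Option 1: a single-method trait that callers must implement.
trait Score {
    fn score(&self, outputs: &[f64], targets: &[f64]) -> f64;
}

// Option 2 (the one preferred above): accept any closure or function
// with the right signature; no trait or impl block needed at call sites.
fn evaluate<F>(outputs: &[f64], targets: &[f64], score: F) -> f64
where
    F: Fn(&[f64], &[f64]) -> f64,
{
    score(outputs, targets)
}

fn main() {
    // Mean squared error as an ordinary closure.
    let mse = |o: &[f64], t: &[f64]| {
        o.iter()
            .zip(t)
            .map(|(a, b)| (a - b) * (a - b))
            .sum::<f64>()
            / o.len() as f64
    };
    let s = evaluate(&[1.0, 2.0], &[1.0, 4.0], mse);
    assert!((s - 2.0).abs() < 1e-12); // (0 + 4) / 2
}
```

The `Fn` version keeps the call site lightweight (a bare closure or free function), at the cost of the convention being documented rather than enforced by a named trait.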
I've now done everything you suggested, apart from speeding up copy_rows. The current API isn't great, but making it better is at least partly blocked by #124. If you're happy to merge I'll (find out how to) squash the commits.

Edit: I tried to squash the commits. Looks like I failed!

A change that I plan on making soon-ish (post-merge) is making Partition lazy. However, it's quite tricky to make this work nicely if we make Partition just a pair of vectors.

That's possibly too vague to comment sensibly on. I'll have a go at both approaches (pairs of vectors, vs arbitrary iterators of indices). Hopefully getting a nice answer here will give some useful feedback for designing other parts of this library, rather than just being massive over-engineering of a very simple function.
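To make the trade-off a bit more concrete, this is roughly what the "pair of vectors" representation looks like. All names here are illustrative and the index assignment is deliberately naive (no shuffling); this is not the PR's code:

```rust
/// Illustrative "pair of vectors" representation of one fold's
/// partition: owned index vectors, allocated eagerly per fold.
struct Partition {
    train_indices: Vec<usize>,
    validation_indices: Vec<usize>,
}

/// Build k partitions over n samples by sending row i to validation
/// fold i % k. A real implementation would shuffle the indices first.
fn partitions(n: usize, k: usize) -> Vec<Partition> {
    (0..k)
        .map(|fold| {
            // Iterator::partition splits the indices into two Vecs,
            // which is exactly the eager allocation being discussed.
            let (train_indices, validation_indices) =
                (0..n).partition(|i| i % k != fold);
            Partition { train_indices, validation_indices }
        })
        .collect()
}

fn main() {
    let parts = partitions(6, 3);
    // Fold 0 validates on rows 0 and 3 and trains on the rest.
    assert_eq!(parts[0].validation_indices, vec![0, 3]);
    assert_eq!(parts[0].train_indices, vec![1, 2, 4, 5]);
}
```

The iterator-based alternative would avoid these per-fold allocations, but every consumer's signature then has to carry the iterator's generic parameters, which is part of the trickiness mentioned above.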
Implement copy_rows
Implement Folds so that only a single array of indices is allocated. It's currently a bit of a mess...
Tidy cross-validation implementation.
Fix comment
Support n % k != 0 in Folds iterator
Format comment
Fix build now that train and predict return Results. Proper error handling still TODO
Add analysis/score.rs
Use Fn(&Mat, &Mat) -> f64 in cross validation instead of CostFunc
Move cross_validation into analysis module
Remove unwraps from k_fold_validate, change return type to LearningResult<Vec<f64>>
Move k_fold_validate example from examples into doc comment.
Remove redundant module comments
Very WIP cross validation
Thanks for the extra work on this! I think that it is really close. The only issues I see are with documentation. Is there a reason you removed the Naive Bayes cross validation example (in the examples directory)? I think it's really valuable to have a full working example of the cross validation there. This is a good place for us to provide verbose explanations of what each component is. In the docs for k_fold_validate it would also be good to describe the arguments.

Oh and don't worry about squashing the commits - I can do that just before merging.

I've added documentation for the arguments. I moved the Naive Bayes example into the doc comments for k_fold_validate.

I think it's fine to keep in the docs - though in the future I'd like to have a full example in the examples directory. I think we should at least add some inline comments to the doc comment example - for example describing what each argument is here:

let accuracy_per_fold: Vec<f64> = k_fold_validate(
    &mut model,
    &inputs,
    &targets,
    3,
    row_accuracy).unwrap();

Done.

This looks good! Thanks for your patience. Are you happy for me to merge this now? I think I'll try to get #117 in within the next few days - if you'd prefer to wait to make the other changes that's fine too!

Great, thanks. I'm happy for this to be merged now.
Not ready to merge yet, but wanted to get a bit of feedback.
The main issue at the moment is that we allocate fresh matrices for each of the k times we train and validate. As discussed previously, we should be able to allocate once and reuse the allocation for each of the folds. However, this would require the inputs and targets of k_fold_validate to be MatrixSlices, and I don't think any of the existing models support this yet.
The other issue is that there's a lot of boilerplate here to avoid allocating or copying when creating the indices to train and test on for each fold, which seems a bit overkill given all the work we do copying the sample data around (even if we manage to get rid of the allocations in copy_rows).
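For what it's worth, the per-fold allocation can be hoisted out even before MatrixSlices land, by copying into a caller-owned buffer whose capacity survives across folds. A sketch using a plain row-major Vec<f64> in place of rulinalg's Matrix (so the representation and the signature here are illustrative, not the PR's actual copy_rows):

```rust
/// Copy the selected rows of a row-major matrix (`cols` columns wide)
/// into `out`. Clearing keeps `out`'s capacity, so calling this once
/// per fold reuses a single allocation instead of allocating fresh
/// matrices k times.
fn copy_rows(data: &[f64], cols: usize, rows: &[usize], out: &mut Vec<f64>) {
    out.clear(); // drops old contents but keeps the capacity
    for &row in rows {
        out.extend_from_slice(&data[row * cols..(row + 1) * cols]);
    }
}

fn main() {
    // A 3x2 matrix with rows [1, 2], [3, 4], [5, 6].
    let mat = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let mut buf = Vec::new();

    // "Fold" 1: rows 0 and 2.
    copy_rows(&mat, 2, &[0, 2], &mut buf);
    assert_eq!(buf, vec![1.0, 2.0, 5.0, 6.0]);

    // "Fold" 2 reuses buf's existing allocation.
    copy_rows(&mat, 2, &[1], &mut buf);
    assert_eq!(buf, vec![3.0, 4.0]);
}
```

Using extend_from_slice over whole rows also sidesteps the per-element bounds checks from the earlier review comment. Parallel execution would, as noted below, still need one buffer per worker.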
Finally, we might want to run this in parallel, but then we really would need a separate copy of the input data for each fold.
Any thoughts?