ENH: Add datasets by sinhrks · Pull Request #161 · AtheMathmo/rusty-machine

sinhrks · 2016-12-16T15:09:04Z

Closes #115. Adds a Dataset struct with data() and target() methods (intended for supervised learning).

I will add more datasets once the API looks OK.

AtheMathmo

Just need to add a feature flag - or at least discuss this.

Otherwise this looks ready to go.

AtheMathmo · 2016-12-17T09:07:33Z

+/// Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml].
+/// Irvine, CA: University of California, School of Information and Computer Science.
+pub fn load_iris() -> Dataset<Matrix<f64>, Vector<usize>> {
+    let data: Vec<f64> = vec![5.1, 3.5, 1.4, 0.2,


Minor: It might be easier to use the matrix! macro here. My thinking is that if we need to add a row there's a little less work.
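The maintenance point can be sketched without pulling in the linear algebra crate. In the sketch below, from_rows is a hypothetical stand-in for a row-wise constructor such as the matrix! macro, and from_flat models the flat Vec layout used in the diff; both names are assumptions made for illustration. The second data row is included only to give the example more than one row.

```rust
/// Hypothetical stand-in for a row-wise constructor such as `matrix!`:
/// the shape is derived from the rows, so it cannot drift out of sync.
pub fn from_rows(rows: &[[f64; 4]]) -> (usize, usize, Vec<f64>) {
    let flat: Vec<f64> = rows.iter().flat_map(|r| r.iter().copied()).collect();
    (rows.len(), 4, flat)
}

/// Flat layout: the caller states the row count and must update it by hand
/// whenever a row is added.
pub fn from_flat(n_rows: usize, data: Vec<f64>) -> Option<(usize, usize, Vec<f64>)> {
    if data.len() == n_rows * 4 {
        Some((n_rows, 4, data))
    } else {
        None // stated row count and data length disagree
    }
}
```

With the row-wise form, adding a row is a one-line change; with the flat form, the row count has to be edited as well.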

AtheMathmo · 2016-12-17T09:08:27Z

+
+/// Dataset container
+#[derive(Clone, Debug)]
+pub struct Dataset<D, T> where D: Clone + Debug, T: Clone + Debug {


I think this makes sense for now. We might want stricter bounds in the future if we ever need to be generic over Datasets, but I don't think that is something we will ever want to do.
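For context, a minimal sketch of what such a container could look like with the Clone + Debug bounds from the diff. The field names and the new constructor are assumptions; the PR excerpt only shows the struct header and mentions data() and target().

```rust
use std::fmt::Debug;

/// Dataset container pairing input data with supervised-learning targets.
#[derive(Clone, Debug)]
pub struct Dataset<D, T>
where
    D: Clone + Debug,
    T: Clone + Debug,
{
    data: D,   // assumed field name
    target: T, // assumed field name
}

impl<D, T> Dataset<D, T>
where
    D: Clone + Debug,
    T: Clone + Debug,
{
    /// Hypothetical constructor, not shown in the PR excerpt.
    pub fn new(data: D, target: T) -> Self {
        Dataset { data, target }
    }

    /// Borrow the input data.
    pub fn data(&self) -> &D {
        &self.data
    }

    /// Borrow the targets.
    pub fn target(&self) -> &T {
        &self.target
    }
}
```

Because the bounds are only Clone + Debug, any pair of types works here, which is the looseness the comment above accepts for now.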

AtheMathmo · 2016-12-17T09:11:29Z

 }
+
+/// Module for datasets.
+pub mod datasets;


We should feature gate this. My thinking is that if we have a few datasets users will not want to download all of this data by default.

To do this:

Add a new feature to Cargo.toml

In lib.rs add a feature flag
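The two steps could look roughly like the following sketch. The Cargo.toml fragment is shown as a comment so the example stays in one language; the module body and the datasets_enabled helper are placeholders added for illustration, not code from the PR.

```rust
// Cargo.toml (assumed fragment):
//
// [features]
// datasets = []

/// Module for datasets, compiled only when the crate is built with
/// `--features datasets`.
#[cfg(feature = "datasets")]
pub mod datasets {
    /// Placeholder loader; the real module would hold the full data.
    pub fn load_iris_first_row() -> [f64; 4] {
        [5.1, 3.5, 1.4, 0.2]
    }
}

/// Hypothetical helper: reports whether the gate was enabled at compile time.
pub fn datasets_enabled() -> bool {
    cfg!(feature = "datasets")
}
```

A user would then opt in from their own Cargo.toml by listing "datasets" in the features of the dependency.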

sinhrks · 2016-12-18T03:41:50Z

Added feature gates. Is it OK to include it by default for now (as it is likely to be used in most tests)?

AtheMathmo · 2016-12-18T07:47:17Z

Thanks for the update.

It looks good but I'm a little cautious about having the datasets flag included by default. I wanted it feature flagged specifically so that it had to be opted-in. I can see that we will probably want to use it in some tests but I'd try a few ways around this first.

Can we fit all tests in the datasets module?
Can we somehow put the tests in the tests folder behind the feature flag too?
As a last resort, do as you have done and keep the "datasets" flag for the "test" feature.

Finally note that we will need to modify the travis CI matrix to include the "datasets" flag.
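One way the second option might work is to gate the tests themselves on the same feature, as in this sketch. The class_count helper and the test name are hypothetical; only the cfg attributes and the cargo flag are standard.

```rust
/// Hypothetical helper under test: counts distinct class labels.
pub fn class_count(targets: &[usize]) -> usize {
    let mut seen: Vec<usize> = targets.to_vec();
    seen.sort_unstable();
    seen.dedup();
    seen.len()
}

// For a file under tests/, the whole file can be gated by putting the inner
// attribute #![cfg(feature = "datasets")] on its first line.
// Here the same idea is applied to a unit-test module.
#[cfg(all(test, feature = "datasets"))]
mod dataset_tests {
    use super::class_count;

    #[test]
    fn counts_three_classes() {
        assert_eq!(class_count(&[0, 0, 1, 2, 2]), 3);
    }
}
```

With this layout, a plain cargo test skips the gated tests and cargo test --features datasets runs them, which is why the CI matrix needs an entry that passes the flag.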

sinhrks · 2016-12-20T04:01:38Z

OK, made datasets optional and fixed the Travis tests.

AtheMathmo · 2016-12-20T09:46:30Z

This looks good to me but before merging I'd like to check out the branch and play around with it a little.

Thanks!

AtheMathmo · 2016-12-26T08:19:22Z

I checked out the code and I have a few thoughts. I am happy to merge this in without any further changes but we should at least write up a tracking issue for improvements.

I think the description of load_iris should include more information on the dataset. Or a link directly to the iris dataset: http://archive.ics.uci.edu/ml/datasets/Iris (instead of just the /ml root). We should say what the attributes and classes are. From the link:

Attribute Information:

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
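As a sketch of how this information could be carried in code as well as in the doc comment, the constants below list the attributes and classes in the order given above. The constant and function names are hypothetical, and the 0/1/2 label encoding is an assumption: the diff only shows that targets are stored as usize.

```rust
/// Attribute (column) names, in the order listed on the UCI page.
pub const IRIS_ATTRIBUTES: [&str; 4] = [
    "sepal length in cm",
    "sepal width in cm",
    "petal length in cm",
    "petal width in cm",
];

/// Class names, in the order listed on the UCI page.
/// Assumption: label 0 is the first class, 1 the second, 2 the third.
pub const IRIS_CLASSES: [&str; 3] = ["Iris Setosa", "Iris Versicolour", "Iris Virginica"];

/// Map a numeric target label to its class name, if the label is valid.
pub fn class_name(label: usize) -> Option<&'static str> {
    IRIS_CLASSES.get(label).copied()
}
```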

Also I think it might be a good idea to organize the datasets module a little differently. If we add more datasets the module is going to get large quickly and become difficult to manage. I think we should move the iris data into a new iris module. This iris module should have a pub fn load() -> Dataset. As a user I would be quite happy to do:

use rusty_machine::datasets::iris;

let (inputs, targets) = iris::load();

But we could also use the current format and have a load_iris function in the datasets module that calls iris::load. In this case we might want to keep the iris module private?
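The second layout could be sketched as follows. The Dataset here is a simplified, non-generic stand-in for the PR's Dataset<D, T>, the data is truncated to the first row from the diff, and the target value 0 for that row is an assumption about the label encoding.

```rust
pub mod datasets {
    /// Simplified stand-in for the PR's generic `Dataset<D, T>`.
    #[derive(Clone, Debug)]
    pub struct Dataset {
        data: Vec<[f64; 4]>,
        target: Vec<usize>,
    }

    impl Dataset {
        pub fn data(&self) -> &[[f64; 4]] {
            &self.data
        }

        pub fn target(&self) -> &[usize] {
            &self.target
        }
    }

    // One module per dataset keeps the parent module small. It is private
    // here (`mod`, not `pub mod`); making it `pub mod` would expose
    // `iris::load()` directly, as in the first option.
    mod iris {
        use super::Dataset;

        /// Load iris dataset (truncated here to the first row).
        pub fn load() -> Dataset {
            Dataset {
                data: vec![[5.1, 3.5, 1.4, 0.2]],
                target: vec![0], // assumed label for the first row
            }
        }
    }

    /// Thin wrapper that preserves the existing `load_iris` entry point.
    pub fn load_iris() -> Dataset {
        iris::load()
    }
}
```

Keeping iris private means callers depend only on load_iris, so the per-dataset modules can be reorganized later without breaking users.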

Let me know what you think. If you don't want to make any of these changes now I'll merge and move this information to a separate ticket.

sinhrks · 2016-12-27T01:51:57Z

Thanks for the comment. I've made the requested changes. Please take a look when you have time.

AtheMathmo

Looks good to me now!

I have a minor nitpick for the features section but so far as I can tell it makes no real difference.

AtheMathmo · 2016-12-27T08:42:30Z

+default = []
 stats = []
+datasets = []
+test = []


We don't need to include the test or default features. These already exist as defined here.

sinhrks · 2016-12-27

Correct, removed.

AtheMathmo · 2016-12-27T08:43:01Z

+
+use super::Dataset;
+
+/// Load iris dataset.


This description is great!

AtheMathmo · 2016-12-27T08:55:23Z

Thank you! Merging now.

AtheMathmo suggested changes Dec 17, 2016

sinhrks added 2 commits December 20, 2016 07:51

ENH: Add datasets

c262ac4

add feature gates

dcbbd89

sinhrks force-pushed the datasets branch 2 times, most recently from 43d6e2d to 90e1944 Compare December 19, 2016 22:53

Fix travis tests

367316f

sinhrks force-pushed the datasets branch from 90e1944 to 367316f Compare December 19, 2016 22:56

AtheMathmo approved these changes Dec 20, 2016

AtheMathmo mentioned this pull request Dec 24, 2016

ENH: implement Agglomerative (hierarchical) clustering #162

Open

sinhrks force-pushed the datasets branch 2 times, most recently from 4cdb18a to 06e52c4 Compare December 27, 2016 01:48

Fix lib structure / added comment

7570fe4

sinhrks force-pushed the datasets branch from 06e52c4 to 7570fe4 Compare December 27, 2016 01:51

AtheMathmo reviewed Dec 27, 2016

do not add defaults and tests features

ad0ecb0

AtheMathmo merged commit 34a5417 into AtheMathmo:master Dec 27, 2016

sinhrks deleted the datasets branch December 27, 2016 08:56
