This repository was archived by the owner on Jul 16, 2021. It is now read-only.

ENH: Add pca #158

Closed

sinhrks wants to merge 6 commits into AtheMathmo:master from sinhrks:pca

Conversation

@sinhrks
Contributor

@sinhrks sinhrks commented Dec 15, 2016

Closes #104.

Is there a better way to check whether float matrices are almost equal?

@AtheMathmo
Owner

Thanks for the PR! I'm really looking forward to reviewing it :D

As for your question, rulinalg has an assert_matrix_eq! macro which has good support for floating point comparison. This was released in 0.3.7 and so you may need to bump the version up in rusty-machine to access it.
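Since rulinalg is an external crate, here is a dependency-free sketch of the "almost equal" idea behind that macro; the `comp = float` comparator syntax mentioned in the comment is referenced from memory and the helper names below are illustrative:

```rust
// Dependency-free sketch of elementwise "almost equal" matrix comparison,
// the idea behind rulinalg's `assert_matrix_eq!(a, b, comp = float)`.
// Plain Vec<Vec<f64>> stands in for rulinalg's Matrix type.
fn matrices_approx_eq(a: &[Vec<f64>], b: &[Vec<f64>], tol: f64) -> bool {
    a.len() == b.len()
        && a.iter().zip(b).all(|(ra, rb)| {
            ra.len() == rb.len()
                && ra.iter().zip(rb).all(|(x, y)| (x - y).abs() <= tol)
        })
}

fn main() {
    let a = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let b = vec![vec![1.0 + 1e-12, 2.0], vec![3.0, 4.0]];
    assert!(matrices_approx_eq(&a, &b, 1e-8));
    assert!(!matrices_approx_eq(&a, &[vec![9.0, 2.0], vec![3.0, 4.0]], 1e-8));
    println!("approx comparisons passed");
}
```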

@sinhrks
Contributor Author

sinhrks commented Dec 16, 2016

Updated to use the assert... macro. Please review when you have time.

@AtheMathmo
Owner

As the weekend is finally here I'll have some time tomorrow to go over all of these PRs. Thank you again for the amount of effort you've put into them - there's a lot of things here that I've been trying to find time to do for a while!

Owner

@AtheMathmo AtheMathmo left a comment

So first of all I must confess that I don't know too much about PCA - having studied it little and used it less.

Overall the code looks good to me and I'm encouraged by what looks like some solid tests.

There are a few comments which need addressing in the code. I also have a more general question.

I was under the impression that with PCA you should be able to choose how many components to keep (by selecting the largest singular values). I also think that this PCA implementation does not take into account the case where input rows > input columns?

If I am correct then we need to do a little more work here. Check out the sklearn documentation for some ideas.
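On the component-selection point: the components to keep correspond to the largest singular values, or equivalently the dominant eigenvectors of the covariance matrix. As a dependency-free illustration of that idea (not the PR's rulinalg-based implementation, which computes all components at once via SVD), here is the first principal component of 2-D data via power iteration:

```rust
// First principal component of 2-D data via power iteration on the
// sample covariance matrix. Illustrative only; assumes the data has
// nonzero variance so the iteration converges.
fn first_component(data: &[[f64; 2]]) -> [f64; 2] {
    let n = data.len() as f64;
    // Column means.
    let mean = data
        .iter()
        .fold([0.0; 2], |m, r| [m[0] + r[0] / n, m[1] + r[1] / n]);
    // 2x2 sample covariance matrix.
    let mut c = [[0.0; 2]; 2];
    for r in data {
        let d = [r[0] - mean[0], r[1] - mean[1]];
        for i in 0..2 {
            for j in 0..2 {
                c[i][j] += d[i] * d[j] / (n - 1.0);
            }
        }
    }
    // Power iteration: repeated multiply-and-normalize converges to the
    // eigenvector with the largest eigenvalue.
    let mut v = [1.0, 1.0];
    for _ in 0..100 {
        let w = [
            c[0][0] * v[0] + c[0][1] * v[1],
            c[1][0] * v[0] + c[1][1] * v[1],
        ];
        let norm = (w[0] * w[0] + w[1] * w[1]).sqrt();
        v = [w[0] / norm, w[1] / norm];
    }
    v
}

fn main() {
    // Points on the line y = x: the first component is [1, 1] / sqrt(2).
    let v = first_component(&[[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]);
    println!("first component: {:?}", v);
}
```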

Comment thread src/learning/pca.rs Outdated
}

#[test]
fn test_centering() {
Owner

This test should live in a tests module to follow convention:

#[cfg(test)]
mod tests {
   // ...
}

Comment thread src/learning/pca.rs
//!
//! # Examples
//!
//! ```
Owner

I think this example will be a little confusing for people who are not very familiar with PCA, in particular due to some API confusion that I'll bring up in another comment.

I think it could be improved by explaining a little more what the prediction part does, and moreover what the assertion means: what is this value -0.6686 that we are comparing against?

Contributor Author

Fixed the example to compare with mapped matrix.

Comment thread src/learning/pca.rs Outdated
match self.centers {
None => return Err(Error::new_untrained()),
Some(ref centers) => {
let data = centering(inputs, &centers);
Owner

This could be dangerous.

You are assuming that inputs and centers have the same number of columns and using get_unchecked in the centering function. However, our user could give an input to the predict function which does not match up.

To correct this we should check that the column counts of the internal centers and new input are the same.
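A minimal sketch of such a guard (with `Result<(), String>` standing in for rusty-machine's `Error` type, which the real fix would use):

```rust
// Guard before centering: the new input must have exactly as many columns
// as the centers vector learned at training time. `String` stands in for
// rusty-machine's `Error` type here.
fn check_dimensions(input_cols: usize, center_len: usize) -> Result<(), String> {
    if input_cols == center_len {
        Ok(())
    } else {
        Err(format!(
            "input has {} columns but the model was trained on {}",
            input_cols, center_len
        ))
    }
}

fn main() {
    assert!(check_dimensions(3, 3).is_ok());
    assert!(check_dimensions(2, 3).is_err());
    println!("dimension guard works");
}
```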

Contributor Author

Thanks. Added a dimension check.

Comment thread src/learning/pca.rs Outdated
} else {
inputs.clone()
};
let (_, _, v) = data.svd().unwrap();
Owner

Just so that you are aware, there are some known issues with SVD: AtheMathmo/rulinalg#48

These affect performance and in some cases accuracy. It might be worth adding a note to the PCA module description stating that this is experimental and will not work on large data sets.

@AtheMathmo
Owner

Thanks for updating this so quickly!

I won't be able to take a look for a couple of days, sadly. I had a quick glance and it looked good to me. There was just one thing (that I might have missed): have you handled the case where a matrix has more columns than rows?

Comment thread src/learning/pca.rs Outdated
}

/// Subtract center Vector from each row
fn centering(inputs: &Matrix<f64>, centers: &Vector<f64>) -> Matrix<f64> {
Owner

This is minor but I think we should make this function unsafe. This way the emphasis is on whoever is using it to make sure it works properly. In your case it looks fine but if someone else makes changes later it would be a welcome reminder.
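Marking the function `unsafe` would also push a `# Safety` contract into the docs. A dependency-free sketch of what that might look like (plain `Vec`s stand in for rulinalg's `Matrix` and `Vector` types):

```rust
/// Subtract `centers[j]` from column `j` of every row.
///
/// # Safety
/// Every row of `inputs` must have at least `centers.len()` elements;
/// `get_unchecked` skips bounds checks entirely.
unsafe fn centering(inputs: &[Vec<f64>], centers: &[f64]) -> Vec<Vec<f64>> {
    inputs
        .iter()
        .map(|row| {
            (0..centers.len())
                .map(|j| *row.get_unchecked(j) - *centers.get_unchecked(j))
                .collect()
        })
        .collect()
}

fn main() {
    let data = vec![vec![1.0, 4.0], vec![3.0, 8.0]];
    // Safe here: both rows have exactly 2 elements, matching `centers`.
    let centered = unsafe { centering(&data, &[2.0, 6.0]) };
    assert_eq!(centered, vec![vec![-1.0, -2.0], vec![1.0, 2.0]]);
    println!("centered: {:?}", centered);
}
```

The `unsafe` keyword forces every caller to write an `unsafe` block, which is exactly the "welcome reminder" described above.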

@AtheMathmo
Owner

I have been feeling a little conflicted about having PCA implement the UnSupModel trait. I worry that the predict function doesn't make much sense. It seems the flow would be something like:

// The data that I want to reduce to principal components
let data = Matrix::new(...);

// Train a new pca model
let mut pca = PCA::default();
pca.train(&data);

// Reduce the data to get components
let p_components = pca.predict(&data);

I was thinking that maybe the Transformer trait would be a better fit?

What do you think? I think it's safe to say that you know more about this than I do.

@AtheMathmo
Owner

Thanks for making these changes! The code looks good to me now - but I'd still like to discuss whether the predict function is really a good fit here. I'm using the sklearn documentation as a reference.

@sinhrks
Contributor Author

sinhrks commented Dec 25, 2016

Yeah, Transformer should be clearer. How about the definition below?

    pub trait Transformer<T, U> {
        /// Transform inputs.
        fn transform(&self, inputs: &T) -> LearningResult<U>;

        /// Fit the transformer using inputs.
        fn fit(&mut self, inputs: &T) -> LearningResult<()>;
    }

I feel .transform should return the transformed data using the components, rather than the components themselves. This is the same as sklearn (and R's predict.prcomp behaves the same way).
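To make the proposed two-method trait concrete, here is a toy implementation for a mean-centering transformer. Plain `Vec`s stand in for rulinalg matrices, `Result<T, String>` for `LearningResult`, and everything besides the trait itself is illustrative:

```rust
type LearningResult<T> = Result<T, String>; // stand-in for rusty-machine's alias

pub trait Transformer<T, U> {
    /// Transform inputs.
    fn transform(&self, inputs: &T) -> LearningResult<U>;

    /// Fit the transformer using inputs.
    fn fit(&mut self, inputs: &T) -> LearningResult<()>;
}

/// Toy transformer: subtracts the per-column means learned in `fit`.
struct MeanCenter {
    means: Option<Vec<f64>>,
}

impl Transformer<Vec<Vec<f64>>, Vec<Vec<f64>>> for MeanCenter {
    fn fit(&mut self, inputs: &Vec<Vec<f64>>) -> LearningResult<()> {
        let rows = inputs.len();
        if rows == 0 {
            return Err("empty input".to_string());
        }
        let mut means = vec![0.0; inputs[0].len()];
        for row in inputs {
            for (m, x) in means.iter_mut().zip(row) {
                *m += x / rows as f64;
            }
        }
        self.means = Some(means);
        Ok(())
    }

    fn transform(&self, inputs: &Vec<Vec<f64>>) -> LearningResult<Vec<Vec<f64>>> {
        // Transforming before fitting is the "untrained" error case.
        let means = self.means.as_ref().ok_or_else(|| "untrained".to_string())?;
        Ok(inputs
            .iter()
            .map(|row| row.iter().zip(means).map(|(x, m)| x - m).collect())
            .collect())
    }
}

fn main() {
    let data = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let mut mc = MeanCenter { means: None };
    mc.fit(&data).unwrap();
    let out = mc.transform(&data).unwrap();
    assert_eq!(out, vec![vec![-1.0, -1.0], vec![1.0, 1.0]]);
    println!("{:?}", out);
}
```

Splitting `fit` from `transform` also lets a model be fitted on training data and then applied to unseen data, which a single combined method cannot express.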

@sinhrks sinhrks mentioned this pull request Dec 25, 2016
@AtheMathmo
Owner

I feel .transform should return transformed data using components

I agree. The approach I would take would be to have the transform function transform the data, store the components in the PCA model, and then return the transformed inputs. I mean like this:

impl<T: Float> Transformer<Matrix<T>> for PCA {
    fn transform(&mut self, inputs: Matrix<T>) -> LearningResult<Matrix<T>> {
        let (components, transformed_inputs) = do_pca(inputs);
        self.components = Some(components);
        Ok(transformed_inputs)
    }
}

This is definitely not perfect. Sadly the API needs a little tweaking but really it is impossible to describe all possible machine learning actions with a small set of traits. Because of this we need to try and stretch the traits a little.

With all that said, I think updating the Transformer definition as you have isn't a bad idea. I just wanted to explain the idea behind the current trait a little.

@sinhrks sinhrks mentioned this pull request Dec 27, 2016
@AtheMathmo
Owner

I think that with the new Transformer trait changes we can get this PR in a position to merge.

@AtheMathmo
Owner

@sinhrks just pinging you on this. Let me know if you plan to complete the work now that the Transformer changes are in place.

@sinhrks
Contributor Author

sinhrks commented Feb 20, 2017

Sure, will do.

@AtheMathmo
Owner

Note that #176 will land soon too (hopefully I'll check it over and merge tomorrow)

@posix4e

posix4e commented Oct 24, 2017

This is great. May it be successfully rebased and merged. I am going to start using it for some machine learning work.

@zackmdavis
Collaborator

I'm inclined to merge this now and let making it use Transformer rather than UnsupModel a separate issue. (To be blunt about it, "integrating ancient PRs so that the project can be alive rather than dead" is currently a higher priority than "avoid API instability on master.")

@zackmdavis
Collaborator

By the power vested in me as Acting Deputy Lieutenant Comaintainer (junior grade) of rusty-machine, this has been rebased and merged (#188).

@sinhrks thank you!! 💖

@posix4e The rusty-machine maintainers regret the delay. Good luck with your work! Please file an issue if you have any problems.

@zackmdavis zackmdavis closed this Dec 27, 2017
