This repository was archived by the owner on Jul 16, 2021. It is now read-only.

ENH: Add pca #158

Closed

sinhrks wants to merge 6 commits into AtheMathmo:master from sinhrks:pca

Conversation

@sinhrks
Contributor

@sinhrks sinhrks commented Dec 15, 2016

Closes #104.

Is there a better way to check whether float matrices are almost equal?

@AtheMathmo
Owner

Thanks for the PR! I'm really looking forward to reviewing it :D

As for your question, rulinalg has an assert_matrix_eq! macro which has good support for floating point comparison. This was released in 0.3.7 and so you may need to bump the version up in rusty-machine to access it.
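Since rulinalg is an external crate, here is a dependency-free sketch of the "almost equal" idea behind that macro; the `comp = float` comparator syntax mentioned in the comment is referenced from memory and the helper names below are illustrative:

```rust
// Dependency-free sketch of elementwise "almost equal" matrix comparison,
// the idea behind rulinalg's `assert_matrix_eq!(a, b, comp = float)`.
// Plain Vec<Vec<f64>> stands in for rulinalg's Matrix type.
fn matrices_approx_eq(a: &[Vec<f64>], b: &[Vec<f64>], tol: f64) -> bool {
    a.len() == b.len()
        && a.iter().zip(b).all(|(ra, rb)| {
            ra.len() == rb.len()
                && ra.iter().zip(rb).all(|(x, y)| (x - y).abs() <= tol)
        })
}

fn main() {
    let a = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let b = vec![vec![1.0 + 1e-12, 2.0], vec![3.0, 4.0]];
    assert!(matrices_approx_eq(&a, &b, 1e-8));
    assert!(!matrices_approx_eq(&a, &[vec![9.0, 2.0], vec![3.0, 4.0]], 1e-8));
    println!("approx comparisons passed");
}
```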

@sinhrks
Contributor Author

sinhrks commented Dec 16, 2016

Updated to use the assert... macro. Please review when you have time.

@AtheMathmo
Owner

As the weekend is finally here I'll have some time tomorrow to go over all of these PRs. Thank you again for the amount of effort you've put into them - there's a lot of things here that I've been trying to find time to do for a while!

Owner

@AtheMathmo AtheMathmo left a comment

So first of all I must confess that I don't know too much about PCA - having studied it little and used it less.

Overall the code looks good to me and I'm encouraged by what looks like some solid tests.

There are a few comments which need addressing in the code. I also have a more general question.

I was under the impression that with PCA you should be able to choose how many components to keep (by selecting the largest singular values). I also think that this PCA implementation does not take into account the case where input rows > input columns?

If I am correct then we need to do a little more work here. Check out the sklearn documentation for some ideas.
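On the component-selection point: the components to keep correspond to the largest singular values, or equivalently the dominant eigenvectors of the covariance matrix. As a dependency-free illustration of that idea (not the PR's rulinalg-based implementation, which computes all components at once via SVD), here is the first principal component of 2-D data via power iteration:

```rust
// First principal component of 2-D data via power iteration on the
// sample covariance matrix. Illustrative only; assumes the data has
// nonzero variance so the iteration converges.
fn first_component(data: &[[f64; 2]]) -> [f64; 2] {
    let n = data.len() as f64;
    // Column means.
    let mean = data
        .iter()
        .fold([0.0; 2], |m, r| [m[0] + r[0] / n, m[1] + r[1] / n]);
    // 2x2 sample covariance matrix.
    let mut c = [[0.0; 2]; 2];
    for r in data {
        let d = [r[0] - mean[0], r[1] - mean[1]];
        for i in 0..2 {
            for j in 0..2 {
                c[i][j] += d[i] * d[j] / (n - 1.0);
            }
        }
    }
    // Power iteration: repeated multiply-and-normalize converges to the
    // eigenvector with the largest eigenvalue.
    let mut v = [1.0, 1.0];
    for _ in 0..100 {
        let w = [
            c[0][0] * v[0] + c[0][1] * v[1],
            c[1][0] * v[0] + c[1][1] * v[1],
        ];
        let norm = (w[0] * w[0] + w[1] * w[1]).sqrt();
        v = [w[0] / norm, w[1] / norm];
    }
    v
}

fn main() {
    // Points on the line y = x: the first component is [1, 1] / sqrt(2).
    let v = first_component(&[[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]);
    println!("first component: {:?}", v);
}
```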

Comment thread src/learning/pca.rs Outdated
}

#[test]
fn test_centering() {
Owner

This test should live in a tests module to follow convention:

#[cfg(test)]
mod tests {
   // ...
}

Comment thread src/learning/pca.rs
//!
//! # Examples
//!
//! ```
Owner

I think this example will be a little confusing for people who are not very familiar with PCA, in particular due to some API confusion that I'll bring up in another comment.

I think it could be improved by explaining a little more what the prediction part does, and moreover what the assertion means: what is this value -0.6686 that we are comparing against?

Contributor Author

Fixed the example to compare with mapped matrix.

Comment thread src/learning/pca.rs Outdated
match self.centers {
None => return Err(Error::new_untrained()),
Some(ref centers) => {
let data = centering(inputs, &centers);
Owner

This could be dangerous.

You are assuming that inputs and centers have the same number of columns and using get_unchecked in the centering function. However, our user could give an input to the predict function which does not match up.

To correct this we should check that the column counts of the internal centers and new input are the same.
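A minimal sketch of such a guard (with `Result<(), String>` standing in for rusty-machine's `Error` type, which the real fix would use):

```rust
// Guard before centering: the new input must have exactly as many columns
// as the centers vector learned at training time. `String` stands in for
// rusty-machine's `Error` type here.
fn check_dimensions(input_cols: usize, center_len: usize) -> Result<(), String> {
    if input_cols == center_len {
        Ok(())
    } else {
        Err(format!(
            "input has {} columns but the model was trained on {}",
            input_cols, center_len
        ))
    }
}

fn main() {
    assert!(check_dimensions(3, 3).is_ok());
    assert!(check_dimensions(2, 3).is_err());
    println!("dimension guard works");
}
```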

Contributor Author

Thanks. Added a dimension check.

Comment thread src/learning/pca.rs Outdated
} else {
inputs.clone()
};
let (_, _, v) = data.svd().unwrap();
Owner

Just so that you are aware, there are some known issues with SVD: AtheMathmo/rulinalg#48

These affect performance and in some cases accuracy. It might be worth adding a note to the PCA module description stating that this is experimental and will not work on large data sets.

@AtheMathmo
Owner

Thanks for updating this so quickly!

I won't be able to take a look for a couple of days, sadly. I had a quick glance and it looked good to me. There was just one thing (that I might have missed): have you handled the case where a matrix has more columns than rows?

Comment thread src/learning/pca.rs Outdated
}

/// Subtract center Vector from each row
fn centering(inputs: &Matrix<f64>, centers: &Vector<f64>) -> Matrix<f64> {
Owner

This is minor but I think we should make this function unsafe. This way the emphasis is on whoever is using it to make sure it works properly. In your case it looks fine but if someone else makes changes later it would be a welcome reminder.
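Marking the function `unsafe` would also push a `# Safety` contract into the docs. A dependency-free sketch of what that might look like (plain `Vec`s stand in for rulinalg's `Matrix` and `Vector` types):

```rust
/// Subtract `centers[j]` from column `j` of every row.
///
/// # Safety
/// Every row of `inputs` must have at least `centers.len()` elements;
/// `get_unchecked` skips bounds checks entirely.
unsafe fn centering(inputs: &[Vec<f64>], centers: &[f64]) -> Vec<Vec<f64>> {
    inputs
        .iter()
        .map(|row| {
            (0..centers.len())
                .map(|j| *row.get_unchecked(j) - *centers.get_unchecked(j))
                .collect()
        })
        .collect()
}

fn main() {
    let data = vec![vec![1.0, 4.0], vec![3.0, 8.0]];
    // Safe here: both rows have exactly 2 elements, matching `centers`.
    let centered = unsafe { centering(&data, &[2.0, 6.0]) };
    assert_eq!(centered, vec![vec![-1.0, -2.0], vec![1.0, 2.0]]);
    println!("centered: {:?}", centered);
}
```

The `unsafe` keyword forces every caller to write an `unsafe` block, which is exactly the "welcome reminder" described above.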

@AtheMathmo
Owner

I have been feeling a little conflicted about having PCA implement the UnSupModel trait. I worry that the predict function doesn't make much sense. It seems the flow would be something like:

// The data that I want to reduce to principal components
let data = Matrix::new(...);

// Train a new pca model
let mut pca = PCA::default();
pca.train(&data);

// Reduce the data to get components
let p_components = pca.predict(&data);

I was thinking that maybe the Transformer trait would be a better fit?

What do you think? I think it's safe to say that you know more about this than I do.

@AtheMathmo
Owner

Thanks for making these changes! The code looks good to me now - but I'd still like to discuss whether the predict function is really a good fit here. I'm using the sklearn documentation as a reference.

@sinhrks
Contributor Author

sinhrks commented Dec 25, 2016

Yeah, Transformer should be clearer. How about the definition below?

    pub trait Transformer<T, U> {
        /// Transform inputs.
        fn transform(&self, inputs: &T) -> LearningResult<U>;

        /// Fit the transformer using inputs.
        fn fit(&mut self, inputs: &T) -> LearningResult<()>;
    }

I feel .transform should return the transformed data using the components, rather than the components themselves. This is the same as sklearn (and R's predict.prcomp behaves the same way).
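To make the proposed two-method trait concrete, here is a toy implementation for a mean-centering transformer. Plain `Vec`s stand in for rulinalg matrices, `Result<T, String>` for `LearningResult`, and everything besides the trait itself is illustrative:

```rust
type LearningResult<T> = Result<T, String>; // stand-in for rusty-machine's alias

pub trait Transformer<T, U> {
    /// Transform inputs.
    fn transform(&self, inputs: &T) -> LearningResult<U>;

    /// Fit the transformer using inputs.
    fn fit(&mut self, inputs: &T) -> LearningResult<()>;
}

/// Toy transformer: subtracts the per-column means learned in `fit`.
struct MeanCenter {
    means: Option<Vec<f64>>,
}

impl Transformer<Vec<Vec<f64>>, Vec<Vec<f64>>> for MeanCenter {
    fn fit(&mut self, inputs: &Vec<Vec<f64>>) -> LearningResult<()> {
        let rows = inputs.len();
        if rows == 0 {
            return Err("empty input".to_string());
        }
        let mut means = vec![0.0; inputs[0].len()];
        for row in inputs {
            for (m, x) in means.iter_mut().zip(row) {
                *m += x / rows as f64;
            }
        }
        self.means = Some(means);
        Ok(())
    }

    fn transform(&self, inputs: &Vec<Vec<f64>>) -> LearningResult<Vec<Vec<f64>>> {
        // Transforming before fitting is the "untrained" error case.
        let means = self.means.as_ref().ok_or_else(|| "untrained".to_string())?;
        Ok(inputs
            .iter()
            .map(|row| row.iter().zip(means).map(|(x, m)| x - m).collect())
            .collect())
    }
}

fn main() {
    let data = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let mut mc = MeanCenter { means: None };
    mc.fit(&data).unwrap();
    let out = mc.transform(&data).unwrap();
    assert_eq!(out, vec![vec![-1.0, -1.0], vec![1.0, 1.0]]);
    println!("{:?}", out);
}
```

Splitting `fit` from `transform` also lets a model be fitted on training data and then applied to unseen data, which a single combined method cannot express.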

@sinhrks sinhrks mentioned this pull request Dec 25, 2016
@AtheMathmo
Owner

I feel .transform should return transformed data using components

I agree. The approach I would take would be to have the transform function transform the data, store the components in the PCA model, and then return the transformed inputs. I mean like this:

impl<T: Float> Transformer<Matrix<T>> for PCA {
    fn transform(&mut self, inputs: Matrix<T>) -> LearningResult<Matrix<T>> {
        let (components, transformed_inputs) = do_pca(inputs);
        self.components = Some(components);
        Ok(transformed_inputs)
    }
}

This is definitely not perfect. Sadly the API needs a little tweaking but really it is impossible to describe all possible machine learning actions with a small set of traits. Because of this we need to try and stretch the traits a little.

With all that said, I think updating the Transformer definition as you have isn't a bad idea. I just wanted to explain the idea behind the current trait a little.

@sinhrks sinhrks mentioned this pull request Dec 27, 2016
@AtheMathmo
Owner

I think that with the new Transformer trait changes we can get this PR in a position to merge.

@AtheMathmo
Owner

@sinhrks just pinging you on this. Let me know if you plan to complete the work now that the Transformer changes are in place.

@sinhrks
Contributor Author

sinhrks commented Feb 20, 2017

Sure, will do.

@AtheMathmo
Owner

Note that #176 will land soon too (hopefully I'll check it over and merge tomorrow)

@posix4e

posix4e commented Oct 24, 2017

This is great. May it be successfully rebased and merged. I am going to start using it for some machine learning work.

@zackmdavis
Collaborator

I'm inclined to merge this now and let making it use Transformer rather than UnsupModel a separate issue. (To be blunt about it, "integrating ancient PRs so that the project can be alive rather than dead" is currently a higher priority than "avoid API instability on master.")

@zackmdavis
Collaborator

By the power vested in me as Acting Deputy Lieutenant Comaintainer (junior grade) of rusty-machine, this has been rebased and merged (#188).

@sinhrks thank you!! 💖

@posix4e The rusty-machine maintainers regret the delay. Good luck with your work! Please file an issue if you have any problems.

@zackmdavis zackmdavis closed this Dec 27, 2017
