Skip to content

[WIP] Bauhaus PCA#896

Merged
kain88-de merged 39 commits intoMDAnalysis:developfrom
jdetle:PCA
Aug 14, 2016
Merged

[WIP] Bauhaus PCA#896
kain88-de merged 39 commits intoMDAnalysis:developfrom
jdetle:PCA

Conversation

@jdetle
Copy link
Contributor

@jdetle jdetle commented Jul 8, 2016

Changes made in this Pull Request:

  • Added PCA Module conforming to Bauhaus API

TODO

  • AnalysisBase _single_frame call is a little contrived, up to advice of others to get rid of or not.
  • Should inverse transform be a function, should it recreate a trajectory? [Pseudoinverse now...]
  • Should covariance be created manually?
  • Found what I think is an error in @kain88-de's gist, PCA transform should be transpose to assure shapes are equal when a subset of the PCs are chosen, or am I mistaken somewhere?
  • Frame iteration for transform
  • Subselection of atoms for transform
  • Test for cosine content in data
  • correct inverse transform for low-dimensional PCA points.
  • Test for different transform parameters
  • Test for different input parameters
  • Double checks docs
  • Clean up tests

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?



def _single_frame(self):
self._xyz[self._frame_index] = self._atoms.positions.copy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope this should slowly calculate the covariance matrix. Otherwise you are possibly loading hundreds of gigabytes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So are you saying I should start calculating the covariance matrix here with a loop inside _single_frame?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'm saying is that the sum goes over the frames. So you have to slowly build up the sum for each entry of the matrix as you iterate over the frame. Something like this should do the trick.

def _prelude(self):
    n_dim = self.atoms.n_atom * 3
    self.cov = np.zeros(n_dim, n_dim)

def _single_frame(self):
     x = self.atoms.positions.flat  # maybe you need to use ravel as this only iterates!
     self.cov += np.dot(x[:, np.newaxis], x[:, np.newaxis].T)

def _conclude(self):
    self.cov /= self.n_frames - 1

Please check the dot product. To get more information about the correct normalization have a look at the np.cov docs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't use np.cov directly in _single_frame because it works different for 1D and 2D arrays.

The dot product could be faster in cython but I doubt it will be noticeable compared with the slow loading of
a trajectory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that's a good point. You're only as slow as your bottleneck and loading from a disk is pretty slow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I took a few hours to mull it over just to be sure but this method that you wrote out about above will not be zero-mean. This will give a different answer than the method you wrote out in your gist. Did you overlook this or is it intentional?

Copy link
Contributor Author

@jdetle jdetle Jul 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for i, ts in enumerate(u.trajectory[0:-1:1]):
    x = ca.positions.flatten() 
    mean += x

mean /= u.trajectory.n_frames

for i, ts in enumerate(u.trajectory[0:-1:1]):
    x = ca.positions.flatten()
    x -= mean
    prac_cov += np.dot(x[:, np.newaxis], x[:, np.newaxis].T)

prac_cov /= u.trajectory.n_frames -1

This is a snippet for code that replicates the np.cov method

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry about the confusion. But yeah my example wasn't complete. It was a quick example how I thought it should work.

@coveralls
Copy link

coveralls commented Jul 8, 2016

Coverage Status

Coverage increased (+0.03%) to 80.414% when pulling 41f1ef6 on jdetle:PCA into 5d78f03 on MDAnalysis:develop.

@kain88-de
Copy link
Member

about the transform. For a PCA it makes sense to accept a atomgroup and return a array of the transformed coordinates. This allows to apply the same PCA on different simulations. Returning the array is consistent with the DIffusionMaps implementation.

@jdetle
Copy link
Contributor Author

jdetle commented Jul 20, 2016

Okay, and this atom group should be the same atom group used to generate
the covariance matrix, right?

On Wed, Jul 20, 2016 at 2:24 PM Max Linke notifications@github.com wrote:

about the transform. For a PCA it makes sense to accept a atomgroup and
return a array of the transformed coordinates. This allows to apply the
same PCA on different simulations. Returning the array is consistent with
the DIffusionMaps implementation.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#896 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AKcARt1H9oyPEu4BfeixQxN7wTILv2WUks5qXpIcgaJpZM4JIXwX
.

Have a wonderful day,
John J. Detlefs

@kain88-de
Copy link
Member

Well kind of yeah. Lets say we have a test like this.

if self.atoms.n_atoms != atoms.n_atoms:
    raise ValueError()
elif self.atoms != atoms:  # I hope that checks for atom types and not reference id
    warnings.warn('transform for different atom types')

This could be something people might want to do in exotic cases. But we warn the normal user that he should be careful.

@coveralls
Copy link

coveralls commented Jul 21, 2016

Coverage Status

Coverage increased (+0.006%) to 83.146% when pulling 3e245b6 on jdetle:PCA into 9a632a9 on MDAnalysis:develop.

@coveralls
Copy link

coveralls commented Jul 21, 2016

Coverage Status

Coverage increased (+0.03%) to 83.168% when pulling 3e245b6 on jdetle:PCA into 9a632a9 on MDAnalysis:develop.

self._n_atoms = self._atoms.n_atoms
self._calculated = False

def fit(self, n_components=None, start=None, stop=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why fit? This goes away from the consistency we are trying to achieve for the analysis methods. They all should have a run method that then does everything required. So I would assume I can use the PCA like this.

>>> pca = analysis.pca.PCA(atomgroup).run()
>>> pca.variance
>>> pca_coord = pca.transform(atomgroup, n_components=4)
>>> print (pca_coord.shape)
(n_frames, 4)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry thats a good point. I called it fit in an unnecessary attempt to mirror the sklearn api.

@coveralls
Copy link

coveralls commented Jul 24, 2016

Coverage Status

Coverage increased (+0.02%) to 83.159% when pulling 8ee2426 on jdetle:PCA into 9a632a9 on MDAnalysis:develop.


def __init__(self, atomgroup, select='all', n_components=None,
**kwargs):
def __init__(self, atomgroup, select='all', mean_free=True, reference=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please call the option demean this is used in scikit-learn and statsmodel to mean that the mean should be subtracted before any calculations are done. I would like to use the same terminology to other python science packages to minimize surprise and friction for people using mdanalysis.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh okay. Haven't really got to addressing that yet but thanks in advance.

self.cov = np.zeros((n_dim, n_dim))
self.mean = np.zeros(n_dim,)

for i, ts in enumerate(self._u.trajectory[self.start:self.stop:self.step]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't need to be done anymore. You have aligned the structures.

@jdetle
Copy link
Contributor Author

jdetle commented Jul 26, 2016

Right now I'm kind of struggling to get through what should be insignificant roadblocks so I'm just going to right some crappy code and iterate from there.

self._atoms, self._ref_cog,
self._atoms.center_of_geometry())
# now all structures are aligned to reference
x = mobile_atoms.positions.ravel()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this out of the if-else clause then you can skip the else clause completely. (and remove code duplication)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mobile_atoms is different than self._atoms right? Or am I misunderstanding things?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah you are right. I overlooked that.

@jdetle
Copy link
Contributor Author

jdetle commented Aug 12, 2016

I think I've been creating a headache for you by squashing commits and force pushing. I'll stop that now.


if mean is None:
warning.warn('In order to demean to generate the covariance matrix'
' the frames have to be iterated over twice. To avoid'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I normally let the white space at the end of the previous line. This is also what we do in most line breaking strings in mdanalysis (afaik that is some kind of unspoken standard between languages)


This module constructs a covariance matrix wherein each element of the matrix
is denoted by (i,j) row-column coordinates. The (i,j) coordinate is the
influence of the of the ith frame's coordinates on the jth frame's coordinates
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This text is still wrong. I doesn't refer to a frame but simple a coordinates.

After initializing and calling method with a universe or an atom group,
principal components ordering the atom coordinate data by decreasing
variance will be available for analysis. Please refer to the
:ref:`PCA-tutorial` for more detailed instructions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this doesn't work in an interactive session. You have to do a simple code example.
Something like this is enough to get a very short overview. It is still good to link to the tutorial but that is only seen in the online docs as a link.

pca = PCA(atomgroup, select='backbone').run()
pca_space =  pca.transform(atomgroup.select_atoms('backbone'), 3)

@kain88-de
Copy link
Member

Don't worry about the force push. It's just that I don't see always the changes in github since you change (rightly) different lines then I commend on.

@kain88-de
Copy link
Member

I'm more sorry I only comment on the doc issues now and still see some minor things that I missed earlier. Seems like a lot of small issues that I could have mentioned all at once to save you some work.

@jdetle
Copy link
Contributor Author

jdetle commented Aug 12, 2016

No problem on my end, makes me feel like I have work to do :)

variance will be available for analysis. Please refer to the
:ref:`PCA-tutorial` for more detailed instructions.
variance will be available for analysis. As an example:
>>> pca = PCA(atomgroup, select='backbone').run()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to wrap this in a code block so it shows up in the Sphinx docs right?

Copy link
Contributor Author

@jdetle jdetle Aug 12, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, I just fixed a typo there though. Wait, I don't know if I understood your question. I gave it the code blocks so the code is formatted in sphinx, yes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use in my code blocks

.. code::python
  >>> # now comes the code

Have a look around for this in the other doc strings and you see what I mean.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked this is already correctly rendered so you can ignore my comment.

@coveralls
Copy link

coveralls commented Aug 12, 2016

Coverage Status

Changes Unknown when pulling 6099a79 on jdetle:PCA into * on MDAnalysis:develop*.

>>> from MDAnalysis.tests.datafiles import PSF, DCD

Given a universe containing trajectory data we can perform PCA using
:class:`PCA`:: and retrieve the principal components.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the last :: are not necessary. And please mention in the test that PCA is also a class. In the rendered docs this reads to perform PCA we can use PCA. something like to perform a PCA we can use the PCA class is better.

@kain88-de
Copy link
Member

One last comment about the doc text/

variance is explained by the components. This cumulated variance by the
components is conveniently stored in the one-dimensional array attribute
`cumulated_variance`. The value at the ith index of `cumulated_variance` is the
sum of the variances from 0 to i.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from 0 to 1 and not to i

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

soory seems like I can't read in the morning.

@kain88-de kain88-de merged commit b375b4b into MDAnalysis:develop Aug 14, 2016
@kain88-de
Copy link
Member

Thanks @jdetle

@jdetle jdetle deleted the PCA branch August 14, 2016 17:47
@orbeckst
Copy link
Member

@jdetle , this PR is as good as any of the other ones that you contributed to say many thanks for all your excellent contributions to MDAnalysis – both the code and the community – during your GSoC. It would be great if you were to hang around a bit more even after GSoC.

abiedermann pushed a commit to abiedermann/mdanalysis that referenced this pull request Jan 5, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants