Conversation
package/MDAnalysis/analysis/pca.py
Outdated
|
|
||
|
|
||
| def _single_frame(self): | ||
| self._xyz[self._frame_index] = self._atoms.positions.copy() |
There was a problem hiding this comment.
Nope this should slowly calculate the covariance matrix. Otherwise you are possibly loading hundreds of gigabytes
There was a problem hiding this comment.
So are you saying I should start calculating the covariance matrix here with a loop inside _single_frame?
There was a problem hiding this comment.
What I'm saying is that the sum goes over the frames. So you have to slowly build up the sum for each entry of the matrix as you iterate over the frame. Something like this should do the trick.
def _prelude(self):
n_dim = self.atoms.n_atom * 3
self.cov = np.zeros(n_dim, n_dim)
def _single_frame(self):
x = self.atoms.positions.flat # maybe you need to use ravel as this only iterates!
self.cov += np.dot(x[:, np.newaxis], x[:, np.newaxis].T)
def _conclude(self):
self.cov /= self.n_frames - 1Please check the dot product. To get more information about the correct normalization have a look at the np.cov docs
There was a problem hiding this comment.
We can't use np.cov directly in _single_frame because it works different for 1D and 2D arrays.
The dot product could be faster in cython but I doubt it will be noticeable compared with the slow loading of
a trajectory.
There was a problem hiding this comment.
I guess that's a good point. You're only as slow as your bottleneck and loading from a disk is pretty slow.
There was a problem hiding this comment.
Okay, I took a few hours to mull it over just to be sure but this method that you wrote out about above will not be zero-mean. This will give a different answer than the method you wrote out in your gist. Did you overlook this or is it intentional?
There was a problem hiding this comment.
for i, ts in enumerate(u.trajectory[0:-1:1]):
x = ca.positions.flatten()
mean += x
mean /= u.trajectory.n_frames
for i, ts in enumerate(u.trajectory[0:-1:1]):
x = ca.positions.flatten()
x -= mean
prac_cov += np.dot(x[:, np.newaxis], x[:, np.newaxis].T)
prac_cov /= u.trajectory.n_frames -1This is a snippet for code that replicates the np.cov method
There was a problem hiding this comment.
sorry about the confusion. But yeah my example wasn't complete. It was a quick example how I thought it should work.
|
about the transform. For a PCA it makes sense to accept a atomgroup and return a array of the transformed coordinates. This allows to apply the same PCA on different simulations. Returning the array is consistent with the DIffusionMaps implementation. |
|
Okay, and this atom group should be the same atom group used to generate On Wed, Jul 20, 2016 at 2:24 PM Max Linke notifications@github.com wrote:
|
|
Well kind of yeah. Lets say we have a test like this. if self.atoms.n_atoms != atoms.n_atoms:
raise ValueError()
elif self.atoms != atoms: # I hope that checks for atom types and not reference id
warnings.warn('transform for different atom types')This could be something people might want to do in exotic cases. But we warn the normal user that he should be careful. |
package/MDAnalysis/analysis/pca.py
Outdated
| self._n_atoms = self._atoms.n_atoms | ||
| self._calculated = False | ||
|
|
||
| def fit(self, n_components=None, start=None, stop=None, |
There was a problem hiding this comment.
why fit? This goes away from the consistency we are trying to achieve for the analysis methods. They all should have a run method that then does everything required. So I would assume I can use the PCA like this.
>>> pca = analysis.pca.PCA(atomgroup).run()
>>> pca.variance
>>> pca_coord = pca.transform(atomgroup, n_components=4)
>>> print (pca_coord.shape)
(n_frames, 4)There was a problem hiding this comment.
Sorry thats a good point. I called it fit in an unnecessary attempt to mirror the sklearn api.
package/MDAnalysis/analysis/pca.py
Outdated
|
|
||
| def __init__(self, atomgroup, select='all', n_components=None, | ||
| **kwargs): | ||
| def __init__(self, atomgroup, select='all', mean_free=True, reference=None, |
There was a problem hiding this comment.
Please call the option demean this is used in scikit-learn and statsmodel to mean that the mean should be subtracted before any calculations are done. I would like to use the same terminology to other python science packages to minimize surprise and friction for people using mdanalysis.
There was a problem hiding this comment.
Oh okay. Haven't really got to addressing that yet but thanks in advance.
package/MDAnalysis/analysis/pca.py
Outdated
| self.cov = np.zeros((n_dim, n_dim)) | ||
| self.mean = np.zeros(n_dim,) | ||
|
|
||
| for i, ts in enumerate(self._u.trajectory[self.start:self.stop:self.step]): |
There was a problem hiding this comment.
this doesn't need to be done anymore. You have aligned the structures.
|
Right now I'm kind of struggling to get through what should be insignificant roadblocks so I'm just going to right some crappy code and iterate from there. |
| self._atoms, self._ref_cog, | ||
| self._atoms.center_of_geometry()) | ||
| # now all structures are aligned to reference | ||
| x = mobile_atoms.positions.ravel() |
There was a problem hiding this comment.
move this out of the if-else clause then you can skip the else clause completely. (and remove code duplication)
There was a problem hiding this comment.
mobile_atoms is different than self._atoms right? Or am I misunderstanding things?
There was a problem hiding this comment.
yeah you are right. I overlooked that.
|
I think I've been creating a headache for you by squashing commits and force pushing. I'll stop that now. |
package/MDAnalysis/analysis/pca.py
Outdated
|
|
||
| if mean is None: | ||
| warning.warn('In order to demean to generate the covariance matrix' | ||
| ' the frames have to be iterated over twice. To avoid' |
There was a problem hiding this comment.
I normally let the white space at the end of the previous line. This is also what we do in most line breaking strings in mdanalysis (afaik that is some kind of unspoken standard between languages)
package/MDAnalysis/analysis/pca.py
Outdated
|
|
||
| This module constructs a covariance matrix wherein each element of the matrix | ||
| is denoted by (i,j) row-column coordinates. The (i,j) coordinate is the | ||
| influence of the of the ith frame's coordinates on the jth frame's coordinates |
There was a problem hiding this comment.
This text is still wrong. I doesn't refer to a frame but simple a coordinates.
package/MDAnalysis/analysis/pca.py
Outdated
| After initializing and calling method with a universe or an atom group, | ||
| principal components ordering the atom coordinate data by decreasing | ||
| variance will be available for analysis. Please refer to the | ||
| :ref:`PCA-tutorial` for more detailed instructions. |
There was a problem hiding this comment.
Sorry this doesn't work in an interactive session. You have to do a simple code example.
Something like this is enough to get a very short overview. It is still good to link to the tutorial but that is only seen in the online docs as a link.
pca = PCA(atomgroup, select='backbone').run()
pca_space = pca.transform(atomgroup.select_atoms('backbone'), 3)|
Don't worry about the force push. It's just that I don't see always the changes in github since you change (rightly) different lines then I commend on. |
|
I'm more sorry I only comment on the doc issues now and still see some minor things that I missed earlier. Seems like a lot of small issues that I could have mentioned all at once to save you some work. |
|
No problem on my end, makes me feel like I have work to do :) |
| variance will be available for analysis. Please refer to the | ||
| :ref:`PCA-tutorial` for more detailed instructions. | ||
| variance will be available for analysis. As an example: | ||
| >>> pca = PCA(atomgroup, select='backbone').run() |
There was a problem hiding this comment.
You have to wrap this in a code block so it shows up in the Sphinx docs right?
There was a problem hiding this comment.
Nope, I just fixed a typo there though. Wait, I don't know if I understood your question. I gave it the code blocks so the code is formatted in sphinx, yes.
There was a problem hiding this comment.
I use in my code blocks
.. code::python
>>> # now comes the code
Have a look around for this in the other doc strings and you see what I mean.
There was a problem hiding this comment.
I just checked this is already correctly rendered so you can ignore my comment.
|
Changes Unknown when pulling 6099a79 on jdetle:PCA into * on MDAnalysis:develop*. |
package/MDAnalysis/analysis/pca.py
Outdated
| >>> from MDAnalysis.tests.datafiles import PSF, DCD | ||
|
|
||
| Given a universe containing trajectory data we can perform PCA using | ||
| :class:`PCA`:: and retrieve the principal components. |
There was a problem hiding this comment.
the last :: are not necessary. And please mention in the test that PCA is also a class. In the rendered docs this reads to perform PCA we can use PCA. something like to perform a PCA we can use the PCA class is better.
|
One last comment about the doc text/ |
| variance is explained by the components. This cumulated variance by the | ||
| components is conveniently stored in the one-dimensional array attribute | ||
| `cumulated_variance`. The value at the ith index of `cumulated_variance` is the | ||
| sum of the variances from 0 to i. |
There was a problem hiding this comment.
soory seems like I can't read in the morning.
|
Thanks @jdetle |
|
@jdetle , this PR is as good as any of the other ones that you contributed to say many thanks for all your excellent contributions to MDAnalysis – both the code and the community – during your GSoC. It would be great if you were to hang around a bit more even after GSoC. |
[WIP] Bauhaus PCA
Changes made in this Pull Request:
TODO
_single_framecall is a little contrived, up to advice of others to get rid of or not.PR Checklist