MemoryReader #976

Merged: richardjgowers merged 9 commits into MDAnalysis:develop from wouterboomsma:develop on Sep 15, 2016

Conversation

@wouterboomsma (Contributor) commented Sep 9, 2016

MemoryReader: Support for loading MDAnalysis trajectories into a numpy array, interfaced through the standard MDAnalysis reader interface. Makes it possible to manipulate coordinates in memory - for instance aligning frames in-memory.

PR Checklist

@coveralls commented Sep 9, 2016

Coverage increased (+0.1%) to 84.453% when pulling d5090ac on wouterboomsma:develop into e8ad9b2 on MDAnalysis:develop.

from six.moves import range
import six

import logging as log
Contributor: Is this used anywhere?

Contributor Author: Nope. Removed in 5710477.

@jbarnoud (Contributor)

I will be quite busy until the end of next week. I'll review the PR around the 19th if nobody has done it by then.

return ts

-    def timeseries(self, asel, start=None, stop=None, step=None, skip=None,
+    def timeseries(self, asel=None, start=None, stop=None, step=None, skip=None,
Contributor: asel=None should be documented

Contributor Author: Good point. Fixed in 5710477.

@coveralls

Coverage decreased (-3.0%) to 81.341% when pulling 5710477 on wouterboomsma:develop into e8ad9b2 on MDAnalysis:develop.

@coveralls

Coverage increased (+0.1%) to 84.452% when pulling 5710477 on wouterboomsma:develop into e8ad9b2 on MDAnalysis:develop.

:Maintainer: Wouter Boomsma <wb@bio.ku.dk>, wouterboomsma on github


.. versionadded:: 0.15.0
Member: Going to be 0.16.0

Contributor Author: Strange that I missed that one. Fixed in 2637a20.

@orbeckst (Member)

I ticked tests and docs (although see comments).

@orbeckst orbeckst added this to the 0.16.0 milestone Sep 10, 2016
@richardjgowers richardjgowers self-assigned this Sep 11, 2016

"""

format = 'memory'
Member: format here must be upper case

Contributor Author:

I followed the style used by other readers. The DCDReader class for instance has a format class variable. However, I'd be happy to change it if there is a consensus about this - just let me know.

@orbeckst (Member), Sep 11, 2016:

@richardjgowers – when I look through the readers/writers I only see lower case format.

grep -e 'Format =' MDAnalysis/coordinates/*py

Or do you mean the format value should be "Memory" or "MEMORY"?

Member: Right, sorry, I meant format = 'MEMORY'. All the guessers etc currently cast all format strings to uppercase.

Contributor Author: Ah. That makes sense. Fixed in 2b30138.

@wouterboomsma (Contributor Author)

@orbeckst You expressed some interest in the merge_universes function in the utils.py file of encore (PR #797). Would it make sense to make this function part of this PR, perhaps adding it somewhere under MDAnalysis.lib? If so, where?
Here it is, for reference:

def merge_universes(universes):
    """Merge a list of universes into one.

    Parameters
    ----------
    universes : list of Universe objects

    Returns
    -------
    Universe object
    """
    for universe in universes:
        universe.transfer_to_memory()

    return mda.Universe(
        universes[0].filename,
        np.concatenate(tuple(e.trajectory.timeseries() for e in universes),
                       axis=1),
        format=MemoryReader)
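For reference, the axis=1 in the np.concatenate call above appends frames because, in this version, timeseries() returns coordinates with atoms on axis 0 and frames on axis 1. A numpy-only sketch with illustrative stand-in arrays (the shapes are made up, not taken from any real trajectory):

```python
import numpy as np

# Hypothetical stand-ins for each universe's u.trajectory.timeseries() result.
# Layout is (n_atoms, n_frames, 3), so axis=1 is the frame axis.
traj_a = np.zeros((5, 10, 3))   # 5 atoms, 10 frames
traj_b = np.ones((5, 7, 3))     # 5 atoms, 7 frames

# Concatenating along axis=1 appends the frames of traj_b after traj_a
merged = np.concatenate((traj_a, traj_b), axis=1)
print(merged.shape)  # (5, 17, 3)
```

All universes must share the same atom count for this to work, which is why merge_universes passes universes[0].filename as the common topology.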

@coveralls commented Sep 11, 2016

Coverage increased (+0.1%) to 84.461% when pulling 2637a20 on wouterboomsma:develop into e8ad9b2 on MDAnalysis:develop.

@orbeckst (Member)

Re: merge_universes(): This is an interesting function that should be more generally exposed, but lib is not the right place because anything under lib should be independent of the core MDAnalysis machinery such as AtomGroup, Universe, etc. At the moment it would fit into AtomGroup.py (similar to the Merge() function), but because the #363 rewrite will completely re-organize AtomGroup.py and friends, I suggest we hold off moving merge_universes() elsewhere and wait until the dust around PR #815 settles. Maybe we find a good spot for "topology manipulation".

On the plus side for this PR: less work for you :-).

@richardjgowers (Member)

@wouterboomsma looking good once last comments are done

@kain88-de (Member)

@wouterboomsma did you ever test how big the speed improvement is for iterative algorithms using the memory reader? It would be nice to know for the announcement of the new version.

@richardjgowers (Member)

@wouterboomsma yeah that's neater, though thinking about it, if isinstance(self.trajectory, MemoryReader) will be most future-proof

@wouterboomsma (Contributor Author)

@richardjgowers I agree completely. Fixed in f4e3b65.

@richardjgowers (Member)

@wouterboomsma sorry about the double change, late night code reviews!

WRT speed changes, I think we're thinking of advertising it in a blog post, so something basic like

import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PSF, DCD  # use the reference system

u = mda.Universe(PSF, DCD)

ag = u.select_atoms('resid 1')
for ts in u.trajectory:
    print(ag.center_of_mass())

To see if this gets much faster if the trajectory is transferred to memory (and the setup cost of transferring into memory).

There's a few calculations I can think of (eg mean squared displacement) which require a lot of thrashing around the trajectory, so it'd be cool if the MemoryReader made this faster. A quick iteration test should tell if this is the case.
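A mean squared displacement pass illustrates why in-memory data pays off: with all frames in one array, each lag reduces to a vectorized difference. This is a numpy-only sketch on a synthetic (n_frames, n_atoms, 3) array; the shapes, the msd helper, and the simple per-lag loop are illustrative, not MDAnalysis API:

```python
import numpy as np

rng = np.random.default_rng(0)
positions = rng.standard_normal((100, 20, 3))  # synthetic (n_frames, n_atoms, 3)

def msd(pos, max_lag=50):
    """MSD(tau) averaged over atoms and time origins, for tau = 1..max_lag."""
    out = np.empty(max_lag)
    for lag in range(1, max_lag + 1):
        disp = pos[lag:] - pos[:-lag]              # all displacements at this lag
        out[lag - 1] = np.mean(np.sum(disp**2, axis=-1))
    return out

curve = msd(positions)
print(curve.shape)  # (50,)
```

With a file-backed reader, each of the max_lag passes would re-read the trajectory; with the array in memory the slicing is essentially free.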

@kain88-de (Member)

mean squared displacement

That is a good example since it requires reading the trajectory a lot. Or just running a Contacts/Align analysis that we provide.

@coveralls commented Sep 12, 2016

Coverage increased (+0.7%) to 85.047% when pulling f4e3b65 on wouterboomsma:develop into e8ad9b2 on MDAnalysis:develop.

@mnmelo mnmelo mentioned this pull request Sep 12, 2016
@wouterboomsma (Contributor Author)

@richardjgowers, @kain88-de Generally, I wouldn't expect large improvements as long as you formulate it as an iterative problem. The benefits would arise by letting numpy do the entire calculation at once - but this means you cannot use the calculations in AtomGroup directly, at least as far as I can see. Here is an extrapolation of the example you sent, where the first two versions use the standard iterative approach with and without in_memory, and the last uses a pure numpy approach:

import time
import numpy as np
import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PSF, DCD

t0 = time.time()
u1 = mda.Universe(PSF, DCD)
t1 = time.time()
print("Read universe from DCD: {0}".format(t1-t0))

u2 = mda.Universe(PSF, DCD)

t0 = time.time()
u2.transfer_to_memory()
t1 = time.time()
print("Transfer_to_memory: {0}".format(t1-t0))

ag = u1.select_atoms('resid 1')
coms1 = []
t0 = time.time()
for ts in u1.trajectory:
    coms1.append(ag.center_of_mass())
t1 = time.time()
print("Iterative procedure: {0}".format(t1-t0))

ag2 = u2.select_atoms('resid 1')
coms2 = []
t0 = time.time()
for ts in u2.trajectory:
    coms2.append(ag2.center_of_mass())
t1 = time.time()
print("Iterative procedure in memory: {0}".format(t1-t0))

t0 = time.time()
coms3 = np.average(u2.trajectory.timeseries(ag2), weights=(ag2.masses/np.sum(ag2.masses)), axis=0)
t1 = time.time()
print("Non-iterative procedure: {0}".format(t1-t0))

# Verify that solutions are the same
assert (abs(np.array(coms1) - np.array(coms2)) < 0.0001).all()
assert (abs(np.array(coms1) - coms3) < 0.0001).all()

The timings I get are:

Read universe from DCD: 0.120114088058
Transfer_to_memory: 0.0131919384003
Iterative procedure: 0.00638604164124
Iterative procedure in memory: 0.00462794303894
Non-iterative procedure: 0.00023889541626

Several things are clear. The first read dominates any calculations that we do afterwards, so this is probably not the best test. Other than that, we can see that the numpy calculation is about 20 times as fast - but only if you execute it as a non-iterative procedure. I don't think the difference between "iterative procedure" and "iterative procedure in memory" is significant (they fluctuate around similar values).
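The 20x gap can be reproduced on a synthetic array: the loop calls np.average once per frame, while the vectorized form averages over the atom axis in one call. A minimal sketch, assuming the atom-first (n_atoms, n_frames, 3) layout that timeseries() returns; all names and shapes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n_atoms, n_frames = 8, 200
coords = rng.standard_normal((n_atoms, n_frames, 3))  # synthetic "timeseries"
masses = rng.uniform(1.0, 16.0, size=n_atoms)

# Loop version: one mass-weighted average per frame
coms_loop = np.array([np.average(coords[:, f, :], weights=masses, axis=0)
                      for f in range(n_frames)])

# Vectorized version: a single call averaging over the atom axis
coms_vec = np.average(coords, weights=masses, axis=0)

print(np.allclose(coms_loop, coms_vec))  # True
```

Both give a (n_frames, 3) array of centers of mass; only the per-frame Python overhead differs.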

@wouterboomsma (Contributor Author)

Btw, I got this strange TimeOutError problem again, leading to a failed Travis build. This is probably caused by the fact that test_atomgroup.py has been expanded a bit. I therefore just upped the travis process timeout to 400s.

@kain88-de (Member)

@wouterboomsma DCD is a bit unfair since it is almost just a memory map. XTC or PDB should give a fairer comparison since they take longer to read. That would also be good information for people who use Gromacs.

Still, I'm a bit disappointed that the in-memory approach doesn't help that much in the iterative case. I hoped for better values even with DCD. This is not the fault of your memory reader though; it's a general problem of the iterations.

@wouterboomsma (Contributor Author)

@kain88-de

Here is an updated version, now using the XTCReader, and repeating each of the main tests a thousand times to get some more reliable numbers:

import time
import numpy as np
import MDAnalysis.analysis.align as align
import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PSF, DCD, GRO, XTC

# Setup universes
t0 = time.time()
u1 = mda.Universe(GRO, XTC)
t1 = time.time()
print("Read universe: {0}".format(t1-t0))
u2 = mda.Universe(GRO, XTC)

# Measure time for transfer to memory
t0 = time.time()
u2.transfer_to_memory()
t1 = time.time()
print("Transfer_to_memory: {0}".format(t1-t0))

# Standard case: iterative procedure
ag = u1.select_atoms('resid 1')
t0 = time.time()
for i in range(1000):
    coms1 = []
    for ts in u1.trajectory:
        coms1.append(ag.center_of_mass())
t1 = time.time()
print("Iterative procedure: {0}".format(t1-t0))

# Iterative procedure in-memory
ag2 = u2.select_atoms('resid 1')
t0 = time.time()
for i in range(1000):
    coms2 = []
    for ts in u2.trajectory:
        coms2.append(ag2.center_of_mass())
t1 = time.time()
print("Iterative procedure in memory: {0}".format(t1-t0))

# Direct numpy calculation in-memory
t0 = time.time()
for i in range(1000):
    coms3 = np.average(u2.trajectory.timeseries(ag2), weights=(ag2.masses/np.sum(ag2.masses)), axis=0)
t1 = time.time()
print("Non-iterative procedure: {0}".format(t1-t0))

# Direct numpy calculation without overhead of timeseries() call
array = u2.trajectory.timeseries(ag2)
t0 = time.time()
for i in range(1000):
    coms3 = np.average(array, weights=(ag2.masses/np.sum(ag2.masses)), axis=0)
t1 = time.time()
print("Non-iterative procedure excluding timeseries() call: {0}".format(t1-t0))

# Verify solutions are the same
assert (abs(np.array(coms1) - np.array(coms2)) < 0.0001).all()
assert (abs(np.array(coms1) - coms3) < 0.0001).all()

Times on my laptop are now:

Read universe: 0.62521314621
Transfer_to_memory: 0.0403649806976
Iterative procedure: 36.6376540661
Iterative procedure in memory: 0.486598968506
Non-iterative procedure: 6.78375792503
Non-iterative procedure excluding timeseries() call: 0.0540280342102

Indeed, we here see some interesting differences. The in-memory version is about 75 times faster than the standard XTCReader. Surprisingly, in this case, it is also faster than the non-iterative numpy procedure. It turns out that this is due to the overhead in the timeseries() call. The last value shows that there is another potential factor of 10 to be gained by extracting the numpy array once and for all (or by removing some of the overhead in the timeseries() call).

@kain88-de (Member)

Cool, that's more like it.

@richardjgowers (Member) left a comment

Looking good.

As a tangent, would it be possible to save/load these trajectories using np.save/load (or something more robust like hdf5)? You could then even associate the .npy file extension with MemoryReader.
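The np.save/np.load tangent would amount to a roundtrip like the following numpy-only sketch; the array is synthetic and the hand-off to MemoryReader is only hinted at in a comment, since no such wiring exists in this PR:

```python
import os
import tempfile
import numpy as np

# Synthetic (n_atoms, n_frames, 3) coordinates standing in for the array
# a MemoryReader would hold.
coords = np.random.default_rng(2).standard_normal((4, 30, 3)).astype(np.float32)

path = os.path.join(tempfile.mkdtemp(), "traj.npy")
np.save(path, coords)       # persist the in-memory trajectory as .npy
restored = np.load(path)    # read it back; this array could seed a new reader

print(np.array_equal(coords, restored))  # True
```

The .npy roundtrip is lossless for float32 data, which is what makes the suggested .npy-to-MemoryReader association plausible.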

- ``False``: show all status messages and do not change the logging
  level (default)

*in_memory*
Member: Maybe mention that this is faster?

Contributor Author: I haven't tested the speed very systematically for the alignment, so I've just added ", which can improve performance substantially in some cases".

writer = None
if in_memory:
    traj.transfer_to_memory()
    frames = traj.trajectory
Member: You can just move the above frames = to below this if block to simplify

Contributor Author: Do you mean moving the frames = traj.trajectory line from the first line of the method (line 684) to after the if-block (line 709)? If so, this would not work, since the else branch of the if-statement also uses the frames variable. Perhaps I'm misunderstanding what you meant.

Member: Ah, my mistake, I didn't see that.

@orbeckst (Member)

CHANGELOG entry and done?

@wouterboomsma (Contributor Author)

@richardjgowers Regarding your tangent, do you mean implementing a MemoryWriter? For completeness, it would make sense - but do you think it is a common use case? Regarding the hdf5 format, that would also be nice to have - but wouldn't it be most natural to have this functionality in a dedicated HDF5Reader and HDF5Writer?

@wouterboomsma (Contributor Author)

Regarding the CHANGELOG, there seem to be no entries for 0.16.0 yet, and I'm not entirely sure about the format, but here is a suggestion:

Enhancements

  • Added 'MemoryReader' class to allow manipulation of trajectory data in-memory, which can provide substantial speed-ups to certain calculations. The constructor of 'Universe' now takes an 'in_memory' option, which will transfer the corresponding trajectory to memory. Likewise a new 'Universe.transfer_to_memory()' method makes it possible to make this transfer after construction of the universe.
  • Added 'in_memory' option to 'rms_fit_trj', which makes it possible to do in-place (in-memory) alignment of trajectories.

@richardjgowers richardjgowers merged commit 2ee7bbf into MDAnalysis:develop Sep 15, 2016
richardjgowers added a commit that referenced this pull request Sep 15, 2016
@richardjgowers (Member)

@wouterboomsma thanks, took care of the CHANGELOG for you

@orbeckst (Member)

Put it under the 0.15.1 heading in your CHANGELOG. Once you merge/rebase against develop you will see that the 0.15.1 was changed to 0.16.0.

Keep your entries shorter, put them under Enhancements, and reference the PR (or issue) number. Don't forget to add authors' handles to the entry, and if they don't appear yet in AUTHORS, add everyone to that file as well.

@orbeckst (Member)

Hooray, congratulations and thanks a lot!

orbeckst added a commit that referenced this pull request Sep 15, 2016
- changed "Maintained by Oliver Beckstein and Elizabeth Denning" to
  "Maintained by the MDAnalysis Core Devs Team"
- added Wouter Boomsma (for PR #976)
- removed explicit mentions for KDTree (Hamelryck) because since 0.11.0
  we no longer include the actual code (see PR #395 and #383)
- removed explicit mentions for helanal (Hall) because Ben Hall is
  a contributing author
@wouterboomsma (Contributor Author)

Great! Thank you all for your time in reviewing this PR.

abiedermann pushed a commit to abiedermann/mdanalysis that referenced this pull request Jan 5, 2017
* Added MemoryReader and tests.

* Added documentation, and removed unused import

* Minor bugfixes and documentation updates related to MemoryReader.

* Changed reader format string: memory->MEMORY

* Removed check for existence of group.universe

* Use isinstance instead of checking against format class variable.

* Increased travis process timeout to 400s

* Bugfix in MemoryReader: now forces numpy array copy when using in_memory option of Universe constructor

* Extra documentation and minor bugfix
abiedermann pushed a commit to abiedermann/mdanalysis that referenced this pull request Jan 5, 2017
abiedermann pushed a commit to abiedermann/mdanalysis that referenced this pull request Jan 5, 2017