MemoryReader #976

Merged: richardjgowers merged 9 commits into MDAnalysis:develop from wouterboomsma:develop on Sep 15, 2016

Conversation

@wouterboomsma (Contributor) commented Sep 9, 2016

MemoryReader: Support for loading MDAnalysis trajectories into a numpy array, interfaced through the standard MDAnalysis reader interface. Makes it possible to manipulate coordinates in memory - for instance aligning frames in-memory.

PR Checklist

@coveralls commented Sep 9, 2016

Coverage increased (+0.1%) to 84.453% when pulling d5090ac on wouterboomsma:develop into e8ad9b2 on MDAnalysis:develop.

from six.moves import range
import six

import logging as log
Contributor: Is this used anywhere?

Contributor Author: Nope. Removed in 5710477.

@jbarnoud (Contributor)

I will be quite busy until the end of next week. I'll review the PR around the 19th if nobody has done it by then.

return ts

-    def timeseries(self, asel, start=None, stop=None, step=None, skip=None,
+    def timeseries(self, asel=None, start=None, stop=None, step=None, skip=None,
Contributor: asel=None should be documented

Contributor Author: Good point. Fixed in 5710477.

@coveralls

Coverage decreased (-3.0%) to 81.341% when pulling 5710477 on wouterboomsma:develop into e8ad9b2 on MDAnalysis:develop.

@coveralls

Coverage increased (+0.1%) to 84.452% when pulling 5710477 on wouterboomsma:develop into e8ad9b2 on MDAnalysis:develop.

:Maintainer: Wouter Boomsma <wb@bio.ku.dk>, wouterboomsma on github


.. versionadded:: 0.15.0
Member: Going to be 0.16.0

Contributor Author: Strange that I missed that one. Fixed in 2637a20.

@orbeckst (Member)

I ticked tests and docs (although see comments).

@orbeckst orbeckst added this to the 0.16.0 milestone Sep 10, 2016
@richardjgowers richardjgowers self-assigned this Sep 11, 2016

"""

format = 'memory'
Member: format here must be upper case

Contributor Author:

I followed the style used by other readers. The DCDReader class for instance has a format class variable. However, I'd be happy to change it if there is a consensus about this - just let me know.

@orbeckst (Member), Sep 11, 2016:

@richardjgowers – when I look through the readers/writers I only see lower case format.

grep -e 'Format =' MDAnalysis/coordinates/*py

Or do you mean the format value should be "Memory" or "MEMORY"?

Member: Right, sorry, I meant format = 'MEMORY'. All the guessers etc currently cast all format strings to uppercase.

Contributor Author: Ah. That makes sense. Fixed in 2b30138.

@wouterboomsma (Contributor Author)

@orbeckst You expressed some interest in the merge_universes function in the utils.py file of encore (PR #797). Would it make sense to make this function part of this PR, perhaps adding it somewhere under MDAnalysis.lib? If so, where?
Here it is, for reference:

def merge_universes(universes):
    """Merge a list of universes into one.

    Parameters
    ----------
    universes : list of Universe objects

    Returns
    -------
    Universe object
    """
    for universe in universes:
        universe.transfer_to_memory()

    return mda.Universe(
        universes[0].filename,
        np.concatenate(tuple(e.trajectory.timeseries() for e in universes),
                       axis=1),
        format=MemoryReader)
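For reference, the axis=1 in the np.concatenate call above appends frames because, in this version, timeseries() returns coordinates with atoms on axis 0 and frames on axis 1. A numpy-only sketch with illustrative stand-in arrays (the shapes are made up, not taken from any real trajectory):

```python
import numpy as np

# Hypothetical stand-ins for each universe's u.trajectory.timeseries() result.
# Layout is (n_atoms, n_frames, 3), so axis=1 is the frame axis.
traj_a = np.zeros((5, 10, 3))   # 5 atoms, 10 frames
traj_b = np.ones((5, 7, 3))     # 5 atoms, 7 frames

# Concatenating along axis=1 appends the frames of traj_b after traj_a
merged = np.concatenate((traj_a, traj_b), axis=1)
print(merged.shape)  # (5, 17, 3)
```

All universes must share the same atom count for this to work, which is why merge_universes passes universes[0].filename as the common topology.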

@coveralls commented Sep 11, 2016

Coverage increased (+0.1%) to 84.461% when pulling 2637a20 on wouterboomsma:develop into e8ad9b2 on MDAnalysis:develop.

@orbeckst (Member)

Re: merge_universes(): This is an interesting function that should be more generally exposed, but lib is not the right place because anything under lib should be independent of the core MDAnalysis machinery such as AtomGroup, Universe, etc. At the moment it would fit into AtomGroup.py (similar to the Merge() function), but because the #363 rewrite will completely re-organize AtomGroup.py and friends, I suggest we hold off moving merge_universes() elsewhere and wait until the dust around PR #815 settles. Maybe we find a good spot for "topology manipulation".

On the plus side for this PR: less work for you :-).

@richardjgowers (Member)

@wouterboomsma looking good once last comments are done

@kain88-de (Member)

@wouterboomsma did you ever test how big the speed improvement is for iterative algorithms using the memory reader? It would be nice to know for the announcement of the new version.

@richardjgowers (Member)

@wouterboomsma yeah that's neater, though thinking about it, if isinstance(self.trajectory, MemoryReader) will be most future-proof

@wouterboomsma (Contributor Author)

@richardjgowers I agree completely. Fixed in f4e3b65.

@richardjgowers (Member)

@wouterboomsma sorry about the double change, late night code reviews!

WRT speed changes, I think we're thinking of advertising it in a blog post, so something basic like

import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PSF, DCD  # use the reference system

u = mda.Universe(PSF, DCD)

ag = u.select_atoms('resid 1')
for ts in u.trajectory:
    print(ag.center_of_mass())

To see if this gets much faster if the trajectory is transferred to memory (and the setup cost of transferring into memory).

There's a few calculations I can think of (eg mean squared displacement) which require a lot of thrashing around the trajectory, so it'd be cool if the MemoryReader made this faster. A quick iteration test should tell if this is the case.
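A mean squared displacement pass illustrates why in-memory data pays off: with all frames in one array, each lag reduces to a vectorized difference. This is a numpy-only sketch on a synthetic (n_frames, n_atoms, 3) array; the shapes, the msd helper, and the simple per-lag loop are illustrative, not MDAnalysis API:

```python
import numpy as np

rng = np.random.default_rng(0)
positions = rng.standard_normal((100, 20, 3))  # synthetic (n_frames, n_atoms, 3)

def msd(pos, max_lag=50):
    """MSD(tau) averaged over atoms and time origins, for tau = 1..max_lag."""
    out = np.empty(max_lag)
    for lag in range(1, max_lag + 1):
        disp = pos[lag:] - pos[:-lag]              # all displacements at this lag
        out[lag - 1] = np.mean(np.sum(disp**2, axis=-1))
    return out

curve = msd(positions)
print(curve.shape)  # (50,)
```

With a file-backed reader, each of the max_lag passes would re-read the trajectory; with the array in memory the slicing is essentially free.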

@kain88-de (Member)

mean squared displacement

That is a good example since it requires reading the trajectory a lot. Or just running a Contacts/Align analysis that we provide.

@coveralls commented Sep 12, 2016

Coverage increased (+0.7%) to 85.047% when pulling f4e3b65 on wouterboomsma:develop into e8ad9b2 on MDAnalysis:develop.

@mnmelo mnmelo mentioned this pull request Sep 12, 2016
@wouterboomsma (Contributor Author)

@richardjgowers, @kain88-de Generally, I wouldn't expect large improvements as long as you formulate it as an iterative problem. The benefits would arise by letting numpy do the entire calculation at once - but this means you cannot use the calculations in AtomGroup directly, at least as far as I can see. Here is an extrapolation of the example you sent, where the first two versions use the standard iterative approach with and without in_memory, and the last uses a pure numpy approach:

import time
import numpy as np
import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PSF, DCD

t0 = time.time()
u1 = mda.Universe(PSF, DCD)
t1 = time.time()
print("Read universe from DCD: {0}".format(t1-t0))

u2 = mda.Universe(PSF, DCD)

t0 = time.time()
u2.transfer_to_memory()
t1 = time.time()
print("Transfer_to_memory: {0}".format(t1-t0))

ag = u1.select_atoms('resid 1')
coms1 = []
t0 = time.time()
for ts in u1.trajectory:
    coms1.append(ag.center_of_mass())
t1 = time.time()
print("Iterative procedure: {0}".format(t1-t0))

ag2 = u2.select_atoms('resid 1')
coms2 = []
t0 = time.time()
for ts in u2.trajectory:
    coms2.append(ag2.center_of_mass())
t1 = time.time()
print("Iterative procedure in memory: {0}".format(t1-t0))

t0 = time.time()
coms3 = np.average(u2.trajectory.timeseries(ag2), weights=(ag2.masses/np.sum(ag2.masses)), axis=0)
t1 = time.time()
print("Non-iterative procedure: {0}".format(t1-t0))

# Verify that solutions are the same
assert (abs(np.array(coms1) - np.array(coms2)) < 0.0001).all()
assert (abs(np.array(coms1) - coms3) < 0.0001).all()

The timings I get are:

Read universe from DCD: 0.120114088058
Transfer_to_memory: 0.0131919384003
Iterative procedure: 0.00638604164124
Iterative procedure in memory: 0.00462794303894
Non-iterative procedure: 0.00023889541626

Several things are clear. The first read dominates any calculations that we do afterwards, so this is probably not the best test. Other than that, we can see that the numpy calculation is about 20 times as fast - but only if you execute it as a non-iterative procedure. I don't think the difference between "iterative procedure" and "iterative procedure in memory" is significant (they fluctuate around similar values).
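The 20x gap can be reproduced on a synthetic array: the loop calls np.average once per frame, while the vectorized form averages over the atom axis in one call. A minimal sketch, assuming the atom-first (n_atoms, n_frames, 3) layout that timeseries() returns; all names and shapes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n_atoms, n_frames = 8, 200
coords = rng.standard_normal((n_atoms, n_frames, 3))  # synthetic "timeseries"
masses = rng.uniform(1.0, 16.0, size=n_atoms)

# Loop version: one mass-weighted average per frame
coms_loop = np.array([np.average(coords[:, f, :], weights=masses, axis=0)
                      for f in range(n_frames)])

# Vectorized version: a single call averaging over the atom axis
coms_vec = np.average(coords, weights=masses, axis=0)

print(np.allclose(coms_loop, coms_vec))  # True
```

Both give a (n_frames, 3) array of centers of mass; only the per-frame Python overhead differs.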

@wouterboomsma (Contributor Author)

Btw, I got this strange TimeOutError problem again, leading to a failed Travis build. This is probably caused by the fact that test_atomgroup.py has been expanded a bit. I therefore just upped the travis process timeout to 400s.

@kain88-de (Member)

@wouterboomsma DCD is a bit unfair since it is almost just a memory map. XTC or PDB should give a fairer comparison since they take longer to read. That would also be good information for people who use Gromacs.

Still, I'm a bit disappointed that the in-memory approach doesn't help that much in the iterative case. I hoped for better values even with DCD. This is not the fault of your memory reader though; it's a general problem of the iterations.

@wouterboomsma (Contributor Author)

@kain88-de

Here is an updated version, now using the XTCReader, and repeating each of the main tests a thousand times to get some more reliable numbers:

import time
import numpy as np
import MDAnalysis.analysis.align as align
import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PSF, DCD, GRO, XTC

# Setup universes
t0 = time.time()
u1 = mda.Universe(GRO, XTC)
t1 = time.time()
print("Read universe: {0}".format(t1-t0))
u2 = mda.Universe(GRO, XTC)

# Measure time for transfer to memory
t0 = time.time()
u2.transfer_to_memory()
t1 = time.time()
print("Transfer_to_memory: {0}".format(t1-t0))

# Standard case: iterative procedure
ag = u1.select_atoms('resid 1')
t0 = time.time()
for i in range(1000):
    coms1 = []
    for ts in u1.trajectory:
        coms1.append(ag.center_of_mass())
t1 = time.time()
print("Iterative procedure: {0}".format(t1-t0))

# Iterative procedure in-memory
ag2 = u2.select_atoms('resid 1')
t0 = time.time()
for i in range(1000):
    coms2 = []
    for ts in u2.trajectory:
        coms2.append(ag2.center_of_mass())
t1 = time.time()
print("Iterative procedure in memory: {0}".format(t1-t0))

# Direct numpy calculation in-memory
t0 = time.time()
for i in range(1000):
    coms3 = np.average(u2.trajectory.timeseries(ag2), weights=(ag2.masses/np.sum(ag2.masses)), axis=0)
t1 = time.time()
print("Non-iterative procedure: {0}".format(t1-t0))

# Direct numpy calculation without overhead of timeseries() call
array = u2.trajectory.timeseries(ag2)
t0 = time.time()
for i in range(1000):
    coms3 = np.average(array, weights=(ag2.masses/np.sum(ag2.masses)), axis=0)
t1 = time.time()
print("Non-iterative procedure excluding timeseries() call: {0}".format(t1-t0))

# Verify solutions are the same
assert (abs(np.array(coms1) - np.array(coms2)) < 0.0001).all()
assert (abs(np.array(coms1) - coms3) < 0.0001).all()

Times on my laptop are now:

Read universe: 0.62521314621
Transfer_to_memory: 0.0403649806976
Iterative procedure: 36.6376540661
Iterative procedure in memory: 0.486598968506
Non-iterative procedure: 6.78375792503
Non-iterative procedure excluding timeseries() call: 0.0540280342102

Indeed, we here see some interesting differences. The in-memory version is about 75 times faster than the standard XTCReader. Surprisingly, in this case, it is also faster than the non-iterative numpy procedure. It turns out that this is due to the overhead in the timeseries() call. The last value shows that there is another potential factor of 10 to be gained by extracting the numpy array once and for all (or by removing some of the overhead in the timeseries() call).

@kain88-de (Member)

Cool, that's more like it.

@richardjgowers (Member) left a comment

Looking good.

As a tangent, would it be possible to save/load these trajectories using np.save/load (or something more robust like hdf5)? You could then even associate the .npy file extension with MemoryReader.
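The np.save/np.load tangent would amount to a roundtrip like the following numpy-only sketch; the array is synthetic and the hand-off to MemoryReader is only hinted at in a comment, since no such wiring exists in this PR:

```python
import os
import tempfile
import numpy as np

# Synthetic (n_atoms, n_frames, 3) coordinates standing in for the array
# a MemoryReader would hold.
coords = np.random.default_rng(2).standard_normal((4, 30, 3)).astype(np.float32)

path = os.path.join(tempfile.mkdtemp(), "traj.npy")
np.save(path, coords)       # persist the in-memory trajectory as .npy
restored = np.load(path)    # read it back; this array could seed a new reader

print(np.array_equal(coords, restored))  # True
```

The .npy roundtrip is lossless for float32 data, which is what makes the suggested .npy-to-MemoryReader association plausible.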

- ``False``: show all status messages and do not change the logging
  level (default)

*in_memory*
Member: Maybe mention that this is faster?

Contributor Author: I haven't tested the speed very systematically for the alignment, so I've just added ", which can improve performance substantially in some cases".

writer = None
if in_memory:
    traj.transfer_to_memory()
    frames = traj.trajectory
Member: You can just move the above frames = to below this if block to simplify

Contributor Author: Do you mean moving the frames = traj.trajectory line from the first line of the method (line 684) to after the if-block (line 709)? If so, this would not work, since the else branch of the if-statement also uses the frames variable. Perhaps I'm misunderstanding what you meant.

Member: Ah, my mistake, I didn't see that.

@orbeckst (Member)

CHANGELOG entry and done?

@wouterboomsma (Contributor Author)

@richardjgowers Regarding your tangent, do you mean implementing a MemoryWriter? For completeness, it would make sense - but do you think it is a common use case? Regarding the hdf5 format, that would also be nice to have - but wouldn't it be most natural to have this functionality in a dedicated HDF5Reader and HDF5Writer?

@wouterboomsma (Contributor Author)

Regarding the CHANGELOG, there seem to be no entries for 0.16.0 yet, and I'm not entirely sure about the format, but here is a suggestion:

Enhancements

  • Added 'MemoryReader' class to allow manipulation of trajectory data in-memory, which can provide substantial speed-ups to certain calculations. The constructor of 'Universe' now takes an 'in_memory' option, which will transfer the corresponding trajectory to memory. Likewise a new 'Universe.transfer_to_memory()' method makes it possible to make this transfer after construction of the universe.
  • Added 'in_memory' option to 'rms_fit_trj', which makes it possible to do in-place (in-memory) alignment of trajectories.

@richardjgowers richardjgowers merged commit 2ee7bbf into MDAnalysis:develop Sep 15, 2016
richardjgowers added a commit that referenced this pull request Sep 15, 2016
@richardjgowers (Member)

@wouterboomsma thanks, took care of the CHANGELOG for you

@orbeckst (Member)

Put it under the 0.15.1 heading in your CHANGELOG. Once you merge/rebase against develop you will see that the 0.15.1 was changed to 0.16.0.

Keep your entries shorter, put them under Enhancements, and reference the PR (or issue) number. Don't forget to add authors' handles to the entry, and if they don't appear yet in AUTHORS, add everyone to that file as well.

@orbeckst (Member)

Hooray, congratulations and thanks a lot!

orbeckst added a commit that referenced this pull request Sep 15, 2016
- changed "Maintained by Oliver Beckstein and Elizabeth Denning" to
  "Maintained by the MDAnalysis Core Devs Team"
- added Wouter Boomsma (for PR #976)
- removed explicit mentions for KDTree (Hamelryck) because since 0.11.0
  we no longer include the actual code (see PR #395 and #383)
- removed explicit mentions for helanal (Hall) because Ben Hall is
  a contributing author
@wouterboomsma (Contributor Author)

Great! Thank you all for your time in reviewing this PR.

abiedermann pushed a commit to abiedermann/mdanalysis that referenced this pull request Jan 5, 2017
* Added MemoryReader and tests.

* Added documentation, and removed unused import

* Minor bugfixes and documentation updates related to MemoryReader.

* Changed reader format string: memory->MEMORY

* Removed check for existence of group.universe

* Use isinstance instead of checking against format class variable.

* Increased travis process timeout to 400s

* Bugfix in MemoryReader: now forces numpy array copy when using in_memory option of Universe constructor

* Extra documentation and minor bugfix
abiedermann pushed a commit to abiedermann/mdanalysis that referenced this pull request Jan 5, 2017
abiedermann pushed a commit to abiedermann/mdanalysis that referenced this pull request Jan 5, 2017