Change positions order from 'F' to 'C' == faster wrapping #2211

zemanj wants to merge 27 commits into MDAnalysis:develop from
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           develop    #2211      +/-  ##
===========================================
- Coverage    90.71%   89.57%    -1.15%
===========================================
  Files           15      158      +143
  Lines         1982    19692    +17710
  Branches         0     2760     +2760
===========================================
+ Hits          1798    17639    +15841
- Misses         184     1458     +1274
- Partials         0      595      +595
```

Continue to review the full report at Codecov.
In some tests, Travis/conda/whatever seems to screw up. EDIT:
…- to C-contiguous
…or contiguous AtomGroups)
…argwhere_int_1d()
…ib.util.unique_int_1d
…nd fragments faster
…upBase._compound_indices() to AtomGroup.compound_indices()
…rameter to unique_int_1d and unique_masks_1d
…ns by AtomGroup._positions_view_or_copy and AtomGroup._positions_c_view_or_copy; use new util functions in *Group.wrap() and *Group.center(); new parameter check_weights in *Group.center()
Performance Report

Test Code

```python
# -*- coding: utf-8 -*-
import numpy as np
import MDAnalysis as mda
from timeit import timeit

########## HELPER FUNCTIONS ##########

def run(func, nframes):
    for ts in u.trajectory[:nframes]:
        func()

def trajectory_read(nframes):
    for ts in u.trajectory[:nframes]:
        pass

def time_group_wrap(group, compound, center, inplace, nruns, nframes_per_run,
                    t_read):
    func = lambda: group.wrap(compound=compound, center=center, inplace=inplace)
    t_wrap = timeit(lambda: run(func, nframes_per_run), number=nruns)
    t_wrap *= 1.0e6 / (nruns * nframes_per_run)
    t = t_wrap - t_read
    print("| {:9s} | {:6s} | {:7s} | {:10.1f} |"
          "".format(compound, center, str(inplace), t))

def run_measurements(groups, compounds, centers, nruns, nframes_per_run):
    t_read = timeit(lambda: trajectory_read(nframes_per_run), number=nruns)
    t_read *= 1.0e6 / (nruns * nframes_per_run)
    for cont, group in groups:
        print("\n{} {}\n".format(cont, repr(group).strip("<>")))
        print("| compound | center | inplace | µs / frame |")
        print("| :--- | :---: | :--- | ---: |")
        for compound in compounds:
            if compound == 'atoms':
                _centers = ['n/a']
            else:
                _centers = centers
            for center in _centers:
                for inplace in [False, True]:
                    time_group_wrap(group, compound, center, inplace,
                                    nruns, nframes_per_run, t_read)

########## SETUP ##########
# Load Universe:
u = mda.Universe("topol.tpr", "traj_comp.xtc")
# populate u._fraginfo:
_ = u.atoms.fragments
# Set up groups, compounds, and centers:
np.random.seed(1337)
rnd_aix = u.atoms.ix.copy()
np.random.shuffle(rnd_aix)
groups = [("contiguous", u.atoms),
          ("non-contiguous", u.atoms[::-1]),
          ("scrambled", u.atoms[rnd_aix])]
compounds = ['atoms', 'group', 'segments', 'residues', 'molecules', 'fragments']
centers = ['com', 'cog']
# measurement parameters:
nruns = 1000
nframes_per_run = 100
print("{} runs, {} frames / run".format(nruns, nframes_per_run))

########## MEASUREMENTS ##########
run_measurements(groups, compounds, centers, nruns, nframes_per_run)
```

Runtime Comparison

Test CPU: AMD Ryzen Threadripper 1920X 12-Core Processor @ 3.5 GHz, process pinned to core 0

Note that the runtimes presented below are per-frame execution times of the wrap call with the trajectory read time subtracted.

contiguous AtomGroup with 15000 atoms (u.atoms)
| compound | center | inplace | µs / frame (new) | µs / frame (old) | speedup factor (old/new) |
|---|---|---|---|---|---|
| atoms | n/a | False | 199.6 | 211.3 | 1.06 |
| atoms | n/a | True | 49.8 | 394.7 | 7.93 |
| group | com | False | 238.2 | 722.4 | 3.03 |
| group | com | True | 85.3 | 905.6 | 10.62 |
| group | cog | False | 206.4 | 599.8 | 2.91 |
| group | cog | True | 56.0 | 782.7 | 13.98 |
| segments | com | False | 489.3 | 2011.1 | 4.11 |
| segments | com | True | 339.9 | 2194.2 | 6.46 |
| segments | cog | False | 457.6 | 1858.4 | 4.06 |
| segments | cog | True | 299.7 | 2041.2 | 6.81 |
| residues | com | False | 1198.2 | 33676.5 | 28.11 |
| residues | com | True | 1035.7 | 33874.6 | 32.71 |
| residues | cog | False | 1147.2 | 33407.9 | 29.12 |
| residues | cog | True | 994.9 | 33745.0 | 33.92 |
| molecules | com | False | 843.2 | 18311.0 | 21.72 |
| molecules | com | True | 690.1 | 18520.9 | 26.84 |
| molecules | cog | False | 803.5 | 18148.0 | 22.59 |
| molecules | cog | True | 650.1 | 18377.9 | 28.27 |
| fragments | com | False | 829.2 | 25149.5 | 30.33 |
| fragments | com | True | 674.4 | 25387.2 | 37.64 |
| fragments | cog | False | 788.2 | 24823.8 | 31.49 |
| fragments | cog | True | 635.7 | 25004.9 | 39.33 |
| average | com/cog | False | 654.6 | 14447.2 | 22.07 |
| average | com/cog | True | 501.1 | 14657.2 | 29.25 |
| average | com/cog | False/True | 577.8 | 14552.2 | 25.18 |
non-contiguous AtomGroup with 15000 atoms (u.atoms[::-1])
| compound | center | inplace | µs / frame (new) | µs / frame (old) | speedup factor (old/new) |
|---|---|---|---|---|---|
| atoms | n/a | False | 197.9 | 213.0 | 1.08 |
| atoms | n/a | True | 176.8 | 415.9 | 2.35 |
| group | com | False | 233.3 | 724.2 | 3.10 |
| group | com | True | 140.6 | 906.3 | 6.45 |
| group | cog | False | 205.5 | 606.5 | 2.95 |
| group | cog | True | 112.7 | 793.2 | 7.04 |
| segments | com | False | 479.4 | 2026.0 | 4.23 |
| segments | com | True | 386.6 | 2206.9 | 5.71 |
| segments | cog | False | 440.4 | 1872.4 | 4.25 |
| segments | cog | True | 348.1 | 2062.9 | 5.93 |
| residues | com | False | 1183.0 | 33750.9 | 28.53 |
| residues | com | True | 1079.9 | 33996.4 | 31.48 |
| residues | cog | False | 1136.6 | 33460.7 | 29.44 |
| residues | cog | True | 1040.0 | 33677.7 | 32.38 |
| molecules | com | False | 854.0 | 18210.1 | 21.32 |
| molecules | com | True | 766.6 | 18426.2 | 24.04 |
| molecules | cog | False | 816.3 | 18034.4 | 22.09 |
| molecules | cog | True | 726.9 | 18206.0 | 25.05 |
| fragments | com | False | 840.0 | 25127.3 | 29.91 |
| fragments | com | True | 753.1 | 25364.3 | 33.68 |
| fragments | cog | False | 798.9 | 24892.8 | 31.16 |
| fragments | cog | True | 709.5 | 25024.5 | 35.27 |
| average | com/cog | False | 653.2 | 14447.1 | 22.12 |
| average | com/cog | True | 567.3 | 14643.7 | 25.81 |
| average | com/cog | False/True | 610.3 | 14545.4 | 23.83 |
scrambled AtomGroup with 15000 atoms (u.atoms[shuffled_indices])
| compound | center | inplace | µs / frame (new) | µs / frame (old) | speedup factor (old/new) |
|---|---|---|---|---|---|
| atoms | n/a | False | 201.7 | 213.2 | 1.06 |
| atoms | n/a | True | 378.4 | 396.1 | 1.05 |
| group | com | False | 238.7 | 723.0 | 3.03 |
| group | com | True | 411.8 | 904.5 | 2.20 |
| group | cog | False | 217.0 | 599.8 | 2.76 |
| group | cog | True | 383.6 | 784.2 | 2.04 |
| segments | com | False | 538.4 | 2015.5 | 3.74 |
| segments | com | True | 705.2 | 2201.4 | 3.12 |
| segments | cog | False | 489.3 | 1863.1 | 3.81 |
| segments | cog | True | 670.7 | 2047.4 | 3.05 |
| residues | com | False | 1771.1 | 33739.9 | 19.05 |
| residues | com | True | 1940.2 | 33851.5 | 17.45 |
| residues | cog | False | 1719.7 | 33423.1 | 19.44 |
| residues | cog | True | 1892.0 | 34512.8 | 18.24 |
| molecules | com | False | 1377.8 | 18672.9 | 13.55 |
| molecules | com | True | 1546.2 | 18348.2 | 11.87 |
| molecules | cog | False | 1323.1 | 17975.6 | 13.59 |
| molecules | cog | True | 1494.7 | 18158.8 | 12.15 |
| fragments | com | False | 1359.3 | 25020.9 | 18.41 |
| fragments | com | True | 1528.4 | 25237.3 | 16.51 |
| fragments | cog | False | 1319.2 | 24755.0 | 18.77 |
| fragments | cog | True | 1485.6 | 24898.2 | 16.76 |
| average | com/cog | False | 959.6 | 14454.7 | 15.06 |
| average | com/cog | True | 1130.6 | 14667.3 | 12.97 |
| average | com/cog | False/True | 1045.1 | 14561.0 | 13.93 |
@MDAnalysis/coredevs Should be good to go. I'm sorry that this got a little out of hand; maybe I should have split it into several PRs...
```
coordinates[i, 2] = coordinates[i, 2] + vector[2]
```

```
def _coords_add_vectors(np.ndarray[np.float32_t, ndim=2] coordinates not None,
```
Is there a compelling reason to use the old buffer interface instead of a typed memoryview here?
@tylerjereddy Yes, there is. Depending on input dtype and/or shape, I use numpy operations such as `ndarray.astype()` or in-place adds like `coordinates += vectors`. The problem with memoryviews is that they don't offer such functionality. Having memoryviews as input would require unnecessary instantiations of ndarray objects with `np.asarray()` prior to these operations... You see my point?
So we really do want to have numpy.ndarray objects as input here. As you might have noticed, I also initialize a memoryview of the same data in line 203:

```
cdef float[:, :] coords = coordinates
```

Whenever I use custom Cython operations, this memoryview is used instead of the ndarray object.
I still try to guard against introduction of the old buffer interface in new code. Usually there is a way to work around using the old approach with careful design.
I agree it is a pain to deal with asarray() stuff a lot.
Usually there is a way to work around using the old approach with careful design.
Do you have anything specific in mind? Like a Python wrapper taking care of array initialization and returning? (If that's not too costly)
Not sure, but probably not a blocking issue here if you don't have something in mind already.
I'm also not sure if Cython will ever really deprecate the old approach. I've had a hard time replacing the old buffers with memoryviews in SciPy for the reasons you outline--reviewers don't like performance drops when the buffer is just a medium for transmitting the ndarray.
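To make the trade-off in this thread concrete, here is a small pure-Python/NumPy sketch of the round trip being discussed, with a Python `memoryview` standing in for a Cython typed memoryview (this is an illustration, not the actual Cython code):

```python
import numpy as np

# A Python memoryview stands in for a Cython typed memoryview here.
arr = np.arange(6, dtype=np.float32).reshape(2, 3)
mv = memoryview(arr)

# Wrapping the memoryview back into an ndarray, as a memoryview-based
# signature would force before calling ndarray-only operations:
back = np.asarray(mv)

assert back is not arr               # a new ndarray object is created...
assert np.shares_memory(back, arr)   # ...although the buffer is shared
```

No data is copied, but a fresh ndarray object must be constructed on every call, which is exactly the overhead the ndarray-typed signature avoids.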
```
.. versionadded:: 0.20.0
"""
cdef np.intp_t n = coordinates.shape[0]
cdef np.ndarray[np.float64_t, ndim=2] center
```
This is there for a reason:
The center object is what the function returns in the end, and that should be a numpy.ndarray.
If I used a memoryview instead, I'd have to return np.asarray(center). Since center is initialized with np.zeros(...), which returns an ndarray anyway, this would mean throwing away this first ndarray object and creating a new one in the return statement, effectively killing performance.
```
cdef np.intp_t n_unique
cdef np.ndarray[np.intp_t, ndim=1] counts
cdef np.ndarray[object, ndim=1] masks
cdef np.ndarray[np.intp_t, ndim=1] unique = np.empty(n_values,
```
old buffer interfaces on lines above
That's here for the same reasons as above: counts, masks, and unique are the arrays being returned in the end and we want to avoid unnecessary calls to np.asarray(). Note that the low-level implementations of unique_int_1d() operate on memoryviews of these buffers.
```
.. versionadded:: 0.20.0
"""
cdef np.ndarray[np.intp_t, ndim=1] result = np.empty(arr.shape[0], dtype=np.intp)
```

```
return (x == INFINITY) | (x == -INFINITY)
```

```
def coords_add_vector(float[:, :] coordinates not None,
```
I don't really get this, but I guess I need to find the time to play with it myself. I don't like the idea that we're implementing what should already be done optimally in numpy. If we start having functions like this, the library could quickly balloon with similar ones...
I absolutely agree that this should be done with numpy. However, the current implementation in numpy is dead slow (about a factor of 10 slower than it could be) due to the way numpy implements broadcasting (resulting in a gazillion of `__memmove_ssse3()` calls) before calling its vectorized addition routine (`sse2_binary_add_FLOAT()`). I should probably think about a more efficient way of implementing this in numpy instead of implementing a workaround in MDAnalysis...
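For reference, the operation under discussion is just a broadcast in-place add of a single shift vector to an (N, 3) coordinate array; a minimal NumPy sketch (array sizes here are arbitrary, and no timing claims are made):

```python
import numpy as np

# Broadcast in-place add of one (3,) shift vector to an (N, 3)
# coordinate array -- the pattern whose buffered-iteration overhead
# is discussed in this thread.
coords = np.zeros((15000, 3), dtype=np.float32)
vector = np.array([1.0, 2.0, 3.0], dtype=np.float32)

coords += vector  # NumPy broadcasts `vector` across all rows

assert np.allclose(coords, vector)  # every row was shifted by the vector
```

The custom `lib.util.coords_add_vector()` in this PR performs the same row-wise addition with an explicit loop over a typed memoryview, sidestepping NumPy's broadcasting machinery.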
+1 for moving upstream instead of here if there's some fundamental flaw in NumPy operations that can be changed without drawbacks
@tylerjereddy I'm currently investigating where and how shortcuts in/around numpy's npyiter_copy_to_buffers() might be implemented. Unfortunately, understanding this section of the numpy code base is not an entirely trivial task...
@zemanj I went ahead and tried to summarize your idea in the weekly NumPy meeting community topics section. I hope that's ok & you are welcome to join / listen in on the call using the link on that doc.
Feel free to clean up my summary there too--anyone can edit that.
The call is in a few hours--12 pm Pacific, which may be rather late for you.
@tylerjereddy Thanks! I added two links, one to the gdb log and one to the call stack visualization I posted in the discussion of #2210. I won't be able to join the meeting today, so feel free to postpone the topic.
This part of numpy is really complex and probably can be improved. And yes it is complex...
The buffer code should skip copying to the buffer when the correct data is already in it (at least in most cases). But I am not sure it is very smart about trying to ensure that this actually happens. I have to check the exact case you are observing, possibly there are also some cases which could dispatch to a non-buffered path, but do not?
richardjgowers left a comment:
I think this needs the reimplementation of basic numpy operations (coords += pos) removed before we can merge
I agree, and consequently, I should also revert the change in memory layout. This has to wait until numpy offers more efficient in-place operations for C-contiguous arrays.
@zemanj @richardjgowers there's a ton of improvements here that aren't what was implemented in #3132 (i.e. going beyond 'F' -> 'C'). Probably worth opening a new PR that targets those?
Fixes #2210
Changes made in this Pull Request:

- Changed `coordinates.base.Timestep.order` to `'C'`
- Added `return_counts`, `return_masks`, and `assume_unsorted` keywords to `lib.util.unique_int_1d()` and improved its performance
- Added `lib.util.iscontiguous_int_1d()`
- Added `lib.util.indices_to_slice_1d()`
- Added `*Group.iscontiguous` property
- Added `*Group._ix_or_slice` property
- Added `AtomGroup._positions_view_or_copy` property returning a view for sliceable AtomGroups (otherwise a copy)
- Added `AtomGroup._positions_c_view_or_copy` property returning a C-contiguous view for contiguous AtomGroups (otherwise a copy)
- Added `lib.util.unique_masks_int_1d()`
- Added `lib.util.argwhere_int_1d()` (will be used in a future PR for faster unwrapping)
- Added `lib.util.check_compound()`
- Added `AtomGroup.compound_indices()`
- Added `lib.util.coords_add_vector()`
- Added `lib.util._coords_add_vectors()`
- Replaced `Universe._fragdict` by `Universe._fraginfo` property (accelerates `AtomGroup.fragments` and `AtomGroup.fragindices`)
- Faster `*Group.center()`, especially for per-compound computations and contiguous AtomGroups
- Faster `*Group.wrap()`, especially for per-compound computations and contiguous AtomGroups (up to a factor of 39.3)

PR Checklist
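The contiguity helpers in this PR hinge on one idea: when an AtomGroup's index array is a contiguous ascending range, fancy indexing can be replaced by a slice, so positions become a view rather than a copy. A simplified pure-Python sketch of that idea (these are hypothetical stand-ins, not the actual `lib.util` implementations):

```python
import numpy as np

def iscontiguous_int_1d(ix):
    """Sketch: True if ix is a strictly ascending range with step 1."""
    ix = np.asarray(ix)
    return ix.size > 0 and bool(np.all(np.diff(ix) == 1))

def indices_to_slice_1d(ix):
    """Sketch: convert a contiguous index array to an equivalent slice."""
    if iscontiguous_int_1d(ix):
        return slice(int(ix[0]), int(ix[-1]) + 1)
    return ix  # fall back to fancy indexing

positions = np.zeros((10, 3), dtype=np.float32)

view = positions[indices_to_slice_1d(np.array([2, 3, 4]))]
assert view.base is positions        # slicing yields a view

copy = positions[indices_to_slice_1d(np.array([2, 5, 4]))]
assert copy.base is not positions    # fancy indexing yields a copy
```

A view avoids both the copy on read and the write-back on in-place operations, which is where much of the speedup for contiguous AtomGroups comes from.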
coordinates.base.Timestep.orderto 'C'return_counts,return_masks, andassume_unsortedtolib.util.unique_int_1d()and improved its performancelib.util.iscontiguous_int_1d()lib.util.indices_to_slice_1d()*Group.iscontiguousproperty*Group._ix_or_slicepropertyAtomGroup._positions_view_or_copyproperty returning a view for sliceable AtomGroups (otherwise a copy)AtomGroup._positions_c_view_or_copyproperty returning a C-contiguous view for contiguous AtomGroups (otherwise a copy)lib.util.unique_masks_int_1d()lib.util.argwhere_int_1d()(will be used in a future PR for faster unwrapping)lib.util.check_compound()AtomGroup.compound_indices()lib.util.coords_add_vector()lib.util._coords_add_vectors()Universe._fragdictbyUniverse._fraginfoproperty (acceleratesAtomGroup.fragmentsandAtomGroup.fragindices)*Group.center(), especially for per-compound computations and contiguous AtomGroups*Group.wrap(), especially for per-compound computations and contiguous AtomGroups (up to a factor of 39.3)PR Checklist