Issue 2046 lib distances rework continued#2114
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #2114 +/- ##
===========================================
+ Coverage 89.41% 89.41% +<.01%
===========================================
Files 157 157
Lines 18765 18766 +1
Branches 2711 2711
===========================================
+ Hits 16778 16779 +1
Misses 1382 1382
Partials 605 605
Continue to review full report at Codecov.
|
| assert line == line.lstrip() | ||
|
|
||
|
|
||
| class TestCheckBox(object): |
There was a problem hiding this comment.
I just copied this over from lib/test_distances.py. The code itself remained unchanged.
| assert line == line.lstrip() | ||
|
|
||
|
|
||
| class TestCheckBox(object): |
There was a problem hiding this comment.
I just copied this over from lib/test_distances.py. The code itself remained unchanged.
| return lines[0].lstrip() + "\n" + textwrap.dedent("\n".join(lines[1:])) | ||
|
|
||
|
|
||
| def check_box(box): |
There was a problem hiding this comment.
I just cut & pasted this from lib/distances.py. The code itself is unchanged.
zemanj
left a comment
There was a problem hiding this comment.
Some explanatory comments
| from .c_distances_openmp import OPENMP_ENABLED as USED_OPENMP | ||
|
|
||
|
|
||
| def _check_box(box): |
There was a problem hiding this comment.
This function is not specific to distance computation and might be useful elsewhere --> moved to lib.util.check_box()
| if N > 1: | ||
| distvec = self_distance_array(reference, box=box) | ||
| dist = np.full((N, N), max_cutoff, dtype=np.float64) | ||
| dist = np.full((N, N), np.finfo(np.float64).max, dtype=np.float64) |
There was a problem hiding this comment.
Old fill value max_cutoff doesn't work with cutoff criterion distance <= max_cutoff.
np.finfo(np.float64).max is the largest number representable by np.float64, so this fill value is guaranteed to work for all possible values of max_cutoff
| cdef real rvec_norm2(const rvec a) nogil: | ||
| return a[XX]*a[XX] + a[YY]*a[YY] + a[ZZ]*a[ZZ] | ||
|
|
||
| cdef void rvec_clear(rvec a) nogil: |
| # mdamath.triclinic_vectors explicitly sets the off-diagonal | ||
| # elements to zero if the box is orthogonal, so we can | ||
| # safely check floating point values for equality here | ||
| if box[i, j] != 0.0: |
There was a problem hiding this comment.
This is to correspond to lib.util.check_box() and therefore all functions in lib.distances.
| return lines[0].lstrip() + "\n" + textwrap.dedent("\n".join(lines[1:])) | ||
|
|
||
|
|
||
| def check_box(box): |
There was a problem hiding this comment.
This is just cut & pasted from lib/distances.py, no changes in the code.
| from MDAnalysis.lib import mdamath | ||
| from MDAnalysis.tests.datafiles import PSF, DCD, TRIC | ||
|
|
||
| class TestCheckBox(object): |
There was a problem hiding this comment.
Moved to lib/test_util.py
| assert line == line.lstrip() | ||
|
|
||
|
|
||
| class TestCheckBox(object): |
There was a problem hiding this comment.
This is just cut & pasted from lib/test_distances.py, no changes in the code.
zemanj
left a comment
There was a problem hiding this comment.
Some explanatory comments
|
Sorry for the comment mess. GitHub is extremely unresponsive for me today. It took half an hour until I could finally see that all my seemingly unsuccessful attempts to add comments did in fact work. 😠 |
|
|
||
| ctypedef np.int_t ns_int | ||
| ctypedef np.float32_t real | ||
| ctypedef np.float64_t dreal |
There was a problem hiding this comment.
Is there a performance different for moving to double precision? (probably yes?) If so, maybe we should open an issue about precision in our distance calculations. Especially as for this Cython it's easy to used a fused type (aka template)
There was a problem hiding this comment.
Good call, I'll run some benchmarks.
|
@zemanj I'll find some time to go over this in detail, but it looks good at a glance |
|
@richardjgowers I tested the performance impact of switching from single to double precision distance computation in >>> import MDAnalysis as mda
>>> from MDAnalysisTests.datafiles import TRR, TPR
>>> from MDAnalysis.lib.distances import capped_distance, self_capped_distance
>>> u = mda.Universe(TPR, TRR)
>>> water_oxygen = u.select_atoms('resname SOL and name OW')
>>> print(water_oxygen.n_atoms)
11084
>>> print(len(u.trajectory))
10
>>> %%timeit
>>> for ts in u.trajectory:
>>> pairs, dists = capped_distance(water_oxygen.positions,
... water_oxygen.positions,
... 15, box=ts.dimensions,
... method='nsgrid')
>>> %%timeit
>>> for ts in u.trajectory:
>>> pairs, dists = self_capped_distance(water_oxygen.positions,
... 15, box=ts.dimensions,
... method='nsgrid')Output for capped_distance: Output for self_capped_distance: So I'd say that the performance impact is rather negligible. |
richardjgowers
left a comment
There was a problem hiding this comment.
Cool thanks @zemanj
Interesting that the performance loss is so small, but I guess a good thing for now
|
Looks like it needs a rebase onto develop |
…d_distances() methods to <= max_cutoff
…cutoff; added underscore prefix to non-user classes
81e3f5a to
42b856a
Compare
|
@richardjgowers I think the small performance impact indicates one or both of the following:
I guess it's a bit of both since I already identified some rather crude thingies in |
|
I recently noticed an increasing amount of CI hiccups where jobs fail due to errors unrelated to the actual PR. Usually, those are I/O-related errors. I started collecting these cases and will open an issue once I gathered enough information. |
|
@richardjgowers Thank you for reviewing! |
|
@zemanj yeah probably because it's memory bound. WRT the I/O hiccups, it might be possible to fix some of these using longer scoped fixtures.... There's currently the encore HES thing and badzipfile I can think of :) |
Fixes (partially) #2046. This is the fifth of a series of related PRs following PRs #2048, #2062, #2070, and #2083.
Changes made in this Pull Request:
lib.distances._check_box()tolib.util.check_box()capped_distance()have the same cut-off criterion (distances <= max_cutoff). According to thescipy.spatial.cKDTree.query_ball_treedocs and according to some tests I ran, this should also be the case for the pKDTree method. In practice, this is rather pathological (comparing floating point values for equality, you know...) but should still be consistent.lib.distancesandlib.nsgridmake arbitrary assumptions about when a box is orthogonal. Since the number 90 has an exact float representation, we should refrain from defining arbitrary margins around that value. This is definitely a case where comparing floating point values for equality makes perfect sense (as it has always been done bylib.mdamath.triclinic_vectorsand the box checks inlib.distances(now inlib.utils)).lib.nsgridnow uses double precision for distance computation.lib.nsgridwith an underscore.lib.nsgrid.lib.utilsI stumbled accross.PR Checklist