Skip to content

Issue 2046 lib distances rework continued#2114

Merged
zemanj merged 5 commits intoMDAnalysis:developfrom
zemanj:issue-2046-lib-distances-rework
Oct 25, 2018
Merged

Issue 2046 lib distances rework continued#2114
zemanj merged 5 commits intoMDAnalysis:developfrom
zemanj:issue-2046-lib-distances-rework

Conversation

@zemanj
Copy link
Copy Markdown
Member

@zemanj zemanj commented Oct 19, 2018

Fixes (partially) #2046. This is the fifth of a series of related PRs following PRs #2048, #2062, #2070, and #2083.

Changes made in this Pull Request:

  • Moved lib.distances._check_box() to lib.util.check_box()
  • Ensured that all methods used in capped_distance() have the same cut-off criterion (distances <= max_cutoff). According to the scipy.spatial.cKDTree.query_ball_tree docs and according to some tests I ran, this should also be the case for the pKDTree method. In practice, this is rather pathological (comparing floating point values for equality, you know...) but should still be consistent.
  • Ensured that none of the functions in lib.distances and lib.nsgrid make arbitrary assumptions about when a box is orthogonal. Since the number 90 has an exact float representation, we should refrain from defining arbitrary margins around that value. This is definitely a case where comparing floating point values for equality makes perfect sense (as it has always been done by lib.mdamath.triclinic_vectors and the box checks in lib.distances (now in lib.utils)).
  • Just like all other distance-related functions, lib.nsgridnow uses double precision for distance computation.
  • Prefixed all non-user-level classes in lib.nsgrid with an underscore.
  • Adjusted all affected tests accordingly.
  • Fixed up half-broken docs of lib.nsgrid.
  • Minor doc fix-ups in lib.utils I stumbled accross.

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@richardjgowers richardjgowers self-assigned this Oct 19, 2018
@codecov
Copy link
Copy Markdown

codecov bot commented Oct 19, 2018

Codecov Report

Merging #2114 into develop will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2114      +/-   ##
===========================================
+ Coverage    89.41%   89.41%   +<.01%     
===========================================
  Files          157      157              
  Lines        18765    18766       +1     
  Branches      2711     2711              
===========================================
+ Hits         16778    16779       +1     
  Misses        1382     1382              
  Partials       605      605
Impacted Files Coverage Δ
package/MDAnalysis/lib/util.py 88.31% <100%> (+0.13%) ⬆️
package/MDAnalysis/lib/distances.py 97.4% <100%> (-0.06%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e08e817...42b856a. Read the comment docs.

assert line == line.lstrip()


class TestCheckBox(object):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just copied this over from lib/test_distances.py. The code itself remained unchanged.

assert line == line.lstrip()


class TestCheckBox(object):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just copied this over from lib/test_distances.py. The code itself remained unchanged.

return lines[0].lstrip() + "\n" + textwrap.dedent("\n".join(lines[1:]))


def check_box(box):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just cut & pasted this from lib/distances.py. The code itself is unchanged.

Copy link
Copy Markdown
Member Author

@zemanj zemanj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some explanatory comments

from .c_distances_openmp import OPENMP_ENABLED as USED_OPENMP


def _check_box(box):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is not specific to distance computation and might be useful elsewhere --> moved to lib.util.check_box()

if N > 1:
distvec = self_distance_array(reference, box=box)
dist = np.full((N, N), max_cutoff, dtype=np.float64)
dist = np.full((N, N), np.finfo(np.float64).max, dtype=np.float64)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old fill value max_cutoff doesn't work with cutoff criterion distance <= max_cutoff.
np.finfo(np.float64).max is the largest number representable by np.float64, so this fill value is guaranteed to work for all possible values of max_cutoff

cdef real rvec_norm2(const rvec a) nogil:
return a[XX]*a[XX] + a[YY]*a[YY] + a[ZZ]*a[ZZ]

cdef void rvec_clear(rvec a) nogil:
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was unused

# mdamath.triclinic_vectors explicitly sets the off-diagonal
# elements to zero if the box is orthogonal, so we can
# safely check floating point values for equality here
if box[i, j] != 0.0:
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to correspond to lib.util.check_box() and therefore all functions in lib.distances.

return lines[0].lstrip() + "\n" + textwrap.dedent("\n".join(lines[1:]))


def check_box(box):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just cut & pasted from lib/distances.py, no changes in the code.

from MDAnalysis.lib import mdamath
from MDAnalysis.tests.datafiles import PSF, DCD, TRIC

class TestCheckBox(object):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to lib/test_util.py

assert line == line.lstrip()


class TestCheckBox(object):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just cut & pasted from lib/test_distances.py, no changes in the code.

Copy link
Copy Markdown
Member Author

@zemanj zemanj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some explanatory comments

@zemanj
Copy link
Copy Markdown
Member Author

zemanj commented Oct 22, 2018

Sorry for the comment mess. GitHub is extremely unresponsive for me today. It took half an hour until I could finally see that all my seemingly unsuccessful attempts to add comments did in fact work. 😠
Hope you can see the comments, they only appear for me on every fifth page reload or so.


ctypedef np.int_t ns_int
ctypedef np.float32_t real
ctypedef np.float64_t dreal
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a performance different for moving to double precision? (probably yes?) If so, maybe we should open an issue about precision in our distance calculations. Especially as for this Cython it's easy to used a fused type (aka template)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, I'll run some benchmarks.

@richardjgowers
Copy link
Copy Markdown
Member

@zemanj I'll find some time to go over this in detail, but it looks good at a glance

@zemanj
Copy link
Copy Markdown
Member Author

zemanj commented Oct 23, 2018

@richardjgowers I tested the performance impact of switching from single to double precision distance computation in lib.nsgrid with the following code:

>>> import MDAnalysis as mda
>>> from MDAnalysisTests.datafiles import TRR, TPR
>>> from MDAnalysis.lib.distances import capped_distance, self_capped_distance
>>> u = mda.Universe(TPR, TRR)
>>> water_oxygen = u.select_atoms('resname SOL and name OW')
>>> print(water_oxygen.n_atoms)
11084
>>> print(len(u.trajectory))
10

>>> %%timeit
>>> for ts in u.trajectory:
>>>     pairs, dists = capped_distance(water_oxygen.positions,
...                                   water_oxygen.positions,
...                                   15, box=ts.dimensions,
...                                   method='nsgrid')
>>> %%timeit
>>> for ts in u.trajectory:
>>>     pairs, dists = self_capped_distance(water_oxygen.positions,
...                                         15, box=ts.dimensions,
...                                         method='nsgrid')

Output for capped_distance:
1 loop, best of 3: 24.3 s per loop (single precision)
1 loop, best of 3: 25.6 s per loop (double precision)
➡️ Performance loss: ~5%

Output for self_capped_distance:
1 loop, best of 3: 21.1 s per loop (single precision)
1 loop, best of 3: 21.9 s per loop (double precision)
➡️ Performance loss: ~4%

So I'd say that the performance impact is rather negligible.

Copy link
Copy Markdown
Member

@richardjgowers richardjgowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool thanks @zemanj

Interesting that the performance loss is so small, but I guess a good thing for now

@richardjgowers richardjgowers added this to the 0.19.x milestone Oct 24, 2018
@richardjgowers
Copy link
Copy Markdown
Member

Looks like it needs a rebase onto develop

@zemanj zemanj force-pushed the issue-2046-lib-distances-rework branch from 81e3f5a to 42b856a Compare October 25, 2018 05:52
@zemanj
Copy link
Copy Markdown
Member Author

zemanj commented Oct 25, 2018

@richardjgowers I think the small performance impact indicates one or both of the following:

  • the algorithm is strongly memory-bound
  • the implementation might have some inefficiencies

I guess it's a bit of both since I already identified some rather crude thingies in nsgrid.pyx (such as the way neighboring cells are determined). However, I'll need to do some profiling before I can tell more.

@zemanj
Copy link
Copy Markdown
Member Author

zemanj commented Oct 25, 2018

I recently noticed an increasing amount of CI hiccups where jobs fail due to errors unrelated to the actual PR. Usually, those are I/O-related errors. I started collecting these cases and will open an issue once I gathered enough information.

@zemanj zemanj merged commit c9a37e3 into MDAnalysis:develop Oct 25, 2018
@zemanj
Copy link
Copy Markdown
Member Author

zemanj commented Oct 25, 2018

@richardjgowers Thank you for reviewing!

@zemanj zemanj deleted the issue-2046-lib-distances-rework branch October 25, 2018 11:15
@richardjgowers
Copy link
Copy Markdown
Member

@zemanj yeah probably because it's memory bound.

WRT the I/O hiccups, it might be possible to fix some of these using longer scoped fixtures.... There's currently the encore HES thing and badzipfile I can think of :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants