
Added capped function for grid search #2008

Merged
richardjgowers merged 48 commits into MDAnalysis:develop from ayushsuhane:cappedgrid
Aug 7, 2018

Conversation

ayushsuhane (Contributor) commented Jul 27, 2018:

Following #1996, I added a capped function on top of the grid search to find all pairs between the reference and configuration coordinates, together with their respective distances. I have also modified the __init__ of class FastNS; for now, it takes the box and atom positions as input parameters.

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

Note

The function cannot handle non-periodic cases in its current state.
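As context for reviewers, the idea of a "capped" search can be summarised with a brute-force NumPy sketch. The function name and shapes below are illustrative only; this is not the grid-based implementation in this PR.

```python
import numpy as np

def capped_distance_bruteforce(reference, configuration, max_cutoff):
    """Brute-force illustration (hypothetical helper): return all (i, j)
    pairs with ||reference[i] - configuration[j]|| < max_cutoff, plus the
    corresponding distances."""
    # Full pairwise distance matrix between the two coordinate sets
    deltas = reference[:, None, :] - configuration[None, :, :]
    dists = np.linalg.norm(deltas, axis=-1)
    i_idx, j_idx = np.where(dists < max_cutoff)
    pairs = np.stack([i_idx, j_idx], axis=1)
    return pairs, dists[i_idx, j_idx]

ref = np.array([[0.0, 0.0, 0.0]])
conf = np.array([[0.5, 0.0, 0.0], [5.0, 0.0, 0.0]])
pairs, dist = capped_distance_bruteforce(ref, conf, max_cutoff=1.0)
# only the first configuration atom lies within the cutoff
```

The grid search in this PR produces the same pair list while avoiding the O(N*M) distance matrix.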

codecov bot commented Jul 28, 2018:

Codecov Report

Merging #2008 into develop will increase coverage by <.01%.
The diff coverage is 89.65%.


@@             Coverage Diff             @@
##           develop    #2008      +/-   ##
===========================================
+ Coverage    88.59%   88.59%   +<.01%     
===========================================
  Files          143      143              
  Lines        17305    17363      +58     
  Branches      2649     2658       +9     
===========================================
+ Hits         15331    15383      +52     
- Misses        1376     1379       +3     
- Partials       598      601       +3
Impacted Files Coverage Δ
package/MDAnalysis/lib/__init__.py 100% <100%> (ø) ⬆️
package/MDAnalysis/lib/distances.py 87.88% <89.47%> (+0.24%) ⬆️

Last update e90dfd2...b7751d1.

if min_cutoff is not None:
    idx = pair_distance > min_cutoff
    pairs, pair_distance = pairs[idx], pair_distance[idx]
if pairs.size > 0:
Member:
@ayushsuhane I don't really understand what this step is doing?

Contributor Author:
The problem with concatenating and using search ids is that it would not mask out pairs formed within the search ids. The pairs returned from the capped function are such that, in a pair (i, j), i comes from the reference positions and j from the configuration positions. But here it would return all indices, i.e. a pair (i, j) might have both i and j as indices of the search group.
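The masking issue described above can be shown with a small NumPy sketch (the index values are made up for illustration): after concatenating [configuration, search] coordinates, a raw search may return pairs whose members both come from the search block, and those have to be filtered out.

```python
import numpy as np

n_configuration = 3          # indices 0..2 are configuration atoms
raw_pairs = np.array([[3, 0],   # search atom vs configuration atom -> keep
                      [4, 2],   # search atom vs configuration atom -> keep
                      [3, 4]])  # both indices from the search block -> drop

# Keep only pairs whose second member indexes the configuration block
mask = raw_pairs[:, 1] < n_configuration
wanted = raw_pairs[mask]
```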

Member:
Ok so it sounds like this is much better for self_capped_distance than capped_distance right now. Ideally we'd create 2 nsgrids and tell them to search each other?

search_ids = np.arange(len(configuration), len(all_coords))
results = gridsearch.search(search_ids=search_ids)
pairs = results.get_pairs()
pairs[:, 1] = undo_augment(pairs[:, 1], mapping, len(configuration))
Member:
Why are we doing an undo_augment if we never augmented? Is this because we had to concatenate the coordinates together?

Contributor Author:
Yeah, I understand it could become confusing, but I used it since we already had a function that served the purpose. Since we are concatenating, and to get the same results as the other capped functions, all the indices should be translated back to their corresponding group indices.
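A minimal sketch of the index-translation step being discussed. This is a simplified stand-in for undo_augment that only subtracts the concatenation offset; the real helper also maps augmented ghost images back through a mapping array.

```python
import numpy as np

def translate_to_group(indices, offset):
    """Map indices from a concatenated array back to group-local indices
    (hypothetical simplified helper)."""
    return indices - offset

n_configuration = 3                     # search block starts at index 3
pairs = np.array([[3, 1], [5, 0]])      # column 0 indexes the search block
pairs[:, 0] = translate_to_group(pairs[:, 0], n_configuration)
```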


self.grid = NSGrid(self.coords_bbox.shape[0], cutoff, self.box, max_gridsize, debug=debug)
self.prepared = False
if prepare:
Member:
I don't get why prepare would be optional. Can you just make it always prepare on init and remove the prepare method?

cdef real[:] coord_i, coord_j
from collections import defaultdict

indices_buffer = defaultdict(list)
Member:
@ayushsuhane can you use C++ vecs here, and does it benchmark faster if you do?

self.fast_update(box)


cdef void fast_pbc_dx(self, rvec ref, rvec other, rvec dx) nogil:

This comment was marked as resolved.

Member:
Ok yeah I was wrong here

for i in range(DIM - 1, -1, -1):
    while dx[i] > self.c_pbcbox.hbox_diag[i]:
        for j in range(i, -1, -1):
            dx[j] -= self.c_pbcbox.box[i][j]
Member:
@ayushsuhane Please have a close look at this loop. It does subtract the correct vector to land at point C in your example.

Contributor Author:
Yes, I understand this loop. But I was wondering: even though the particles are in a brick-shaped box, can we apply minimum_image_ortho here? While the above loop will land at C, I think applying minimum_image from calc_distances.h would return the distance corresponding to AD rather than AC.

Contributor:
Even if the particles are put inside a brick-shaped box, the box itself is not brick-shaped, so the loop is needed to use the correct box vectors; we can't just rely on minimum_image_ortho.
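The difference can be sketched in pure Python (illustrative only; the real implementation is the Cython loop quoted above): for a triclinic box, wrapping must subtract whole box vectors, highest axis first, rather than only the diagonal lengths.

```python
import numpy as np

def minimum_image_triclinic(dx, box):
    """Wrap a separation vector dx using the full (lower-triangular)
    triclinic box matrix, whose rows are the box vectors. Pure-Python
    sketch of the loop discussed above."""
    dx = dx.copy()
    for i in range(2, -1, -1):              # z, then y, then x
        while dx[i] > 0.5 * box[i][i]:
            dx -= box[i]                    # subtract the whole box vector
        while dx[i] < -0.5 * box[i][i]:
            dx += box[i]
    return dx

# Tilted box: the b vector has an x component, so wrapping in y also moves x
box = np.array([[10.0, 0.0, 0.0],
                [5.0, 10.0, 0.0],
                [0.0, 0.0, 10.0]])
dx = minimum_image_triclinic(np.array([7.0, 8.0, 0.0]), box)
```

Using only the diagonal (as an orthorhombic minimum image would) leaves the x component unwrapped in this example, which is the AD-versus-AC discrepancy mentioned above.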


boxes_1 = (np.array([1, 2, 3, 90, 90, 90], dtype=np.float32), # ortho
np.array([1, 2, 3, 30, 45, 60], dtype=np.float32), # tri_box
np.array([1, 2, 3, 45, 60, 90], dtype=np.float32), # tri_box
Member:
do the tests fail with the 30 angle?

Contributor Author:
No, it was a misunderstanding. I will change it again.

ayushsuhane (Contributor Author):

I tried generating maps of cell_neighbours as well, but it looks like a bad idea, probably because building the dictionary takes relatively longer than using vectors. Here is a snippet of the code:

ctypedef cset[ns_int] intset
ctypedef cmap[ns_int, intset] intmap
cdef intmap generate_neighbors(self):
        cdef ns_int xi, yi, zi
        cdef ns_int cell
        cdef ivec cellxyz, neighbor_cellxyz
        cdef ns_int neighbor_cellid
        
        self.cell_neighbors.empty()
        
        for cell in range(self.size):
            self.cellid2cellxyz(cell, cellxyz)
            for xi in range(DIM):
                for yi in range(DIM):
                    for zi in range(DIM):
                        neighbor_cellxyz[XX] = cellxyz[XX] + (xi - 1)
                        neighbor_cellxyz[YY] = cellxyz[YY] + (yi - 1)
                        neighbor_cellxyz[ZZ] = cellxyz[ZZ] + (zi - 1)
                        # Bring the cells back to the brick shaped box
                        for n in range(DIM-1, -1, -1):
                            while neighbor_cellxyz[n] > self.ncells[n]:
                                neighbor_cellxyz[n] -= self.ncells[n]
                            while neighbor_cellxyz[n] < 0:
                                neighbor_cellxyz[n] += self.ncells[n]
                        neighbor_cellid = neighbor_cellxyz[XX] + neighbor_cellxyz[YY]*self.cell_offsets[YY] + \
                                           neighbor_cellxyz[ZZ]*self.cell_offsets[ZZ]
                        self.cell_neighbors[cell].insert(neighbor_cellid)
        return self.cell_neighbors

I think using some alternative hashmap for the indices could serve the purpose if we want to go that way. In any case, iterating over indices is the better solution in my opinion: it will also be faster for small numbers of atoms, while iterating over cells leads to too many if evaluations.
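For reference, the neighbour-cell enumeration in the snippet can be written in plain Python (illustrative sketch, not the Cython code above): each cell has up to 27 periodic neighbours, wrapped back into the grid before converting to a linear cell id.

```python
def neighbor_cells(cellxyz, ncells):
    """Collect the (up to) 27 neighbouring cell ids of one cell in a
    periodic grid. Hypothetical pure-Python sketch of generate_neighbors."""
    nx, ny, nz = ncells
    neighbors = set()
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                x = (cellxyz[0] + dx) % nx   # wrap back into the grid
                y = (cellxyz[1] + dy) % ny
                z = (cellxyz[2] + dz) % nz
                neighbors.add(x + y * nx + z * nx * ny)
    return neighbors

cells = neighbor_cells((0, 0, 0), (3, 3, 3))
```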

@richardjgowers Is this how you were envisioning creating two grids and checking the pairs? As far as benchmarks of vectors are concerned, using vectors of vectors in place of a dictionary halves the time, whereas replacing vectors with lists does not change performance significantly. However, it improves the readability of the code and we don't need to manage the memory allocations ourselves.

@seb-buch @kain88-de @jbarnoud @richardjgowers @orbeckst, please review the code. I have tried to remove most of the parts that weren't desired (specifically get_coordinates). In my opinion, coordinates can be selected directly via python after we evaluate the indices and distances; for a single query, coordinates_buffer was consuming a significant amount of time. I plan to add the noPBC option first and then add the documentation, of course only if @seb-buch is OK with it. But in any case, merging before the GSoC would be helpful.

For self_search, it returns all the pairs and indices. To obtain unique pairs, np.unique should be used. If this is how the method is desired, I will add this to the documentation.
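A short sketch of the np.unique deduplication mentioned above (toy pair data): a self search reports both (i, j) and (j, i), so sorting each pair before np.unique leaves one canonical copy.

```python
import numpy as np

# Each contact appears twice, once in each orientation
pairs = np.array([[0, 1], [1, 0], [2, 3], [3, 2]])

# Sort within each row, then drop duplicate rows
unique_pairs = np.unique(np.sort(pairs, axis=1), axis=0)
```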

While the current version is faster than the methods implemented in MDAnalysis for most cases, there are still opportunities for improvement. For instance, using probe[XX] = searchcoords[current_beadid, XX] + (xi - 1) * self.cutoff[XX] in search to evaluate the cell_id, and skipping the distance evaluations if that cell_id has already been checked; currently, it evaluates all the adjacent cells. One way could be to use a stencil[27], using every index as a flag for each probe. However, I am not sure how to deal with the periodic case in an elegant way (without increasing the number of evaluations for every particle). This would be particularly useful when the cutoff is very small (but greater than the cellsize).

continue
# find distance between search coords[i] and coords[bid]
d2 = self.box.fast_distance2(&self.coords_bbox[current_beadid, XX], &self.coords_bbox[bid, XX])
if d2 < cutoff2:
Member:
looks like npairs has the wrong indentation here; you don't want to increment if d2 < EPSILON. I'd do if d2 < cutoff2 and d2 > EPSILON:


self.indices_buffer[beadid_i].push_back(beadid_j)

dist = sqrt(dist2)


self.distances_buffer[beadid_i].push_back(dist)

for i in range(self.searchcoords.shape[0]):
sorted_indices = np.argsort(self.indices_buffer[i])
Member:
I'm not sure we have to sort, to be honest; I'd just give back indices in the order we found them. You can sort inside the tests to check results.
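A sketch of what sorting inside the tests could look like (hypothetical helper, toy data): bring both pair lists into a canonical order before comparing, so the search itself can return pairs in any order.

```python
import numpy as np

def canonical(pairs):
    # lexicographic row sort: primary key is the first column, then the second
    order = np.lexsort((pairs[:, 1], pairs[:, 0]))
    return pairs[order]

found = np.array([[2, 3], [0, 1]])      # order the search happened to produce
expected = np.array([[0, 1], [2, 3]])   # reference order used by the test
```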

richardjgowers (Member) left a comment:
Need some simple docs on each method

self.pair_distances2_buffer.push_back(distance2)
self.npairs += 1

def get_pairs(self):
Member:
docs docs docs. It took me a while to understand what all the different get_X mean in this class

Contributor Author:
Yes, I get your point. I thought we should fix the API first and that I would add the docs afterwards. However, the docs are critical even for the reviewers to understand the code; I will add them today.

seb-buch (Contributor) commented Aug 2, 2018:
I started to write some docs on my branch... However, I am officially on vacation so I cannot spend much time on them.
Anyway, I will take some time tomorrow to merge the changes from @ayushsuhane and the very little I did on the docs.

package/setup.py Outdated
['MDAnalysis/lib/nsgrid' + source_suffix],
include_dirs=include_dirs,
language='c++',
libraries=parallel_libraries,
Member:
remove parallel libs here

richardjgowers (Member):
@ayushsuhane Looks like just a precision error on the test, use assert_almost_equal
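For reference, a minimal example of the suggested fix (toy values): assert_almost_equal from numpy.testing compares to a number of decimals, tolerating the small float32 differences that an exact comparison would reject.

```python
import numpy as np
from numpy.testing import assert_almost_equal

# Two float32 distances from different code paths, differing in the last bits
a = np.float32(4.7999997)
b = np.float32(4.8000002)

# Exact equality may fail; comparing to 5 decimals passes
assert_almost_equal(a, b, decimal=5)
```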

ayushsuhane (Contributor Author):
I am not sure about either of the errors.

(1) I didn't change anything in pkdtree, but it now raises an error where it didn't before.
(2) I checked the contiguous flags separately, which passes in the tests, but raises an error with numpy. Is it because of a different numpy version? How can I get past it?

ayushsuhane (Contributor Author):
To handle the no-pbc case, I have used python methods liberally inside the if box is not None branch. In your opinion, will it have a significant effect? If yes, how should I change it? Also, are you happy with the way it handles pbc?

richardjgowers (Member):
@ayushsuhane can you link me a line? I can't find it. In general, it's ok to have some python to check things at the start, you just can't have any inside loops

if (box is None) or (np.allclose(box[:3], 0.) and box.shape[0] == 6):
bmax = np.max(coords, axis=0)
bmin = np.min(coords, axis=0)
for i in range(DIM):
Contributor Author:
Mostly looping over all the coordinates to shift the origin, since the box origin is always (0, 0, 0), which is used to get the cellids later on. Similarly, np.min and np.max also show up dark yellow if you look at the annotation.

pseudobox[DIM + i] = 90.
box = pseudobox
# shift the origin
coords -= bmin
Member:
Here you're modifying coords in place, which changes the copy that the user holds (which might lead to tears). Better would be to move the self.coords = coords.copy() call below, and modify self.coords which is the class' copy.
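A sketch of the suggested fix (toy coordinates): copy first, then shift the copy, so the caller's array stays untouched.

```python
import numpy as np

user_coords = np.array([[1.0, 2.0, 3.0],
                        [4.0, 5.0, 6.0]])

coords = user_coords.copy()   # class-owned copy, as suggested above
bmin = coords.min(axis=0)
coords -= bmin                # shift the origin of the copy only
```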

ayushsuhane (Contributor Author):
@richardjgowers @jbarnoud Any suggestions on how I can remove the error? It says the numpy array is not C-contiguous. I have already tried np.ascontiguousarray (now I have changed it to a list type); it doesn't fail the tests locally, but fails on travis.
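For reference, a minimal reproduction of the contiguity issue and the usual fix (illustrative shapes): a strided view is not C-contiguous, so a typed memoryview rejects it, while np.ascontiguousarray returns a C-ordered copy when needed.

```python
import numpy as np

# Slicing every other row produces a strided view: not C-contiguous
arr = np.zeros((10, 3), dtype=np.float32)[::2]

# ascontiguousarray copies only if the input is not already C-ordered
fixed = np.ascontiguousarray(arr)
```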

ayushsuhane (Contributor Author) commented Aug 5, 2018:

@richardjgowers Can you review it, please? If this API is OK (and once self_capped is merged), I can quickly add the self capped function along with a modified _determine_method function based on benchmarks.

# find distance between search coords[i] and coords[bid]
d2 = self.box.fast_distance2(&searchcoords_bbox[current_beadid, XX], &self.coords_bbox[bid, XX])
if d2 < cutoff2 and d2 > EPSILON:
results.add_neighbors(current_beadid, bid, d2)
Contributor Author:
Most of the capped functions have a return signature such that the same atom appearing in both groups (query and reference coordinates) will be returned as a neighbour. In that case, should I remove EPSILON here? It would be kept only in the self_search function.

Member:
Yeah remove that epsilon hack so this works as other distance methods

richardjgowers self-assigned this Aug 5, 2018
richardjgowers (Member):
@ayushsuhane CHANGELOG then we're done I think :)

ayushsuhane (Contributor Author) commented Aug 7, 2018:

I was doing benchmarks for _determine_method, but I figured we can add them in a different PR, as this one is already big. If you disagree, I can add them here as well along with the modified tests. Here is the link to the benchmarks:

https://github.com/ayushsuhane/Benchmarks_Distance/blob/master/Notebooks/BM_CappedNS.ipynb

orbeckst (Member) commented Aug 7, 2018:

@ayushsuhane thanks for the benchmarks. Looks very encouraging!

Have you also thought about adding some benchmarks to https://github.com/MDAnalysis/mdanalysis/tree/develop/benchmarks ? It would be interesting to see how code that relies on distance calculations improves with your additions. (I am not saying that this should be in this PR, but overall it would be nice if one could point to https://www.mdanalysis.org/benchmarks/ for immediate evidence how your work improved the performance of MDAnalysis.)

orbeckst (Member) left a comment:
Please update CHANGELOG. Thanks!


Enhancements

* Added a wrapper of lib.nsgrid in lib.distances.self_capped_distance
Member:
@seb-buch also needs to be added as an author as he's authored some of the commits

richardjgowers (Member):
@ayushsuhane don't worry about the encore failures on appveyor

richardjgowers merged commit e2c2b56 into MDAnalysis:develop Aug 7, 2018
ayushsuhane deleted the cappedgrid branch August 20, 2018 06:00
if cutoff < 0:
raise ValueError("Cutoff must be positive!")
if cutoff * cutoff > self.box.c_pbcbox.max_cutoff2:
raise ValueError("Cutoff greater than maximum cutoff ({:.3f}) given the PBC")


Here the maximum cutoff is not displayed when raising the error: the "{:.3f}" placeholder is never filled in because .format is not called on the message.

Contributor:
Hi Charly. Thank you for reporting the problem.

A comment on a merged pull request is likely to be forgotten, though. Would you mind opening an issue?
