From 49e67f91be6826082a952b38a2ed5aefe164a503 Mon Sep 17 00:00:00 2001 From: ayush Date: Fri, 31 Aug 2018 16:50:48 -0700 Subject: [PATCH 01/10] initial blog post --- _posts/2018-08-30-gsoc18-distancesearch.md | 67 ++++++++++++++++++++++ 1 file changed, 67 insertions(+) create mode 100644 _posts/2018-08-30-gsoc18-distancesearch.md diff --git a/_posts/2018-08-30-gsoc18-distancesearch.md b/_posts/2018-08-30-gsoc18-distancesearch.md new file mode 100644 index 00000000..a959118a --- /dev/null +++ b/_posts/2018-08-30-gsoc18-distancesearch.md @@ -0,0 +1,67 @@ +--- +layout: post +title: Efficient distance search methods +--- + +We are pleased to announce another successful year of Google Summer of code with [NumFOCUS][] organization. Thanks to [Richard Gowers][] and [Jonathan Barnoud][] for mentoring the GSoC students. This year, one of the project, was to improve the performance of pairwise distance computations which are used quite frequently in MDAnalysis in one form or the other. MDAnalysis v0.19.0 and higher will include an internal functionality to speed up the distance computations using automatic method selection. As a result, a user need not bother with the technical details of implementations of different search algorithms. A flexible functionality with easily extendible interface ``capped_distance`` is introduced in ``MDAnalysis.lib.distances`` for developers. Aside from the major highlight of ``capped_distance`` and its improvements, additional functionality of ``augment_coordinates`` are also implemented for any curious user to implement other neighbor search algorithm with periodic boundary conditions. + +One of the major bottleneck in various analysis routines in MDAnalysis (and typically in Molecular Dynamics studies) is the evaluation of pairwise distances among the particles. The primary problem revolves around fixed radius neighbor search algorithms. 
MDAnalysis offers a suite of algorithms including brute force method, tree based binary search algorithm for such problems. While these methods are suitable for a variety of problems dealt in MDAnalysis, one of the question was whether one can improve the performance of distance calculations using other established neighbor search methods. + +This question led to the inception of Google Summer of Code [project][] with [NumFOCUS][]. [Ayush Suhane][] completed the project and was able to demonstrate performance improvements for specific cases of distance selections, identification of bonds and Radial distribution function in the analysis module of MDAnalysis. More details on the commit history, PR's and blog posts can be found in the final [report][] submitted to GSoC. Real-time benchmarks for specific modules in MDAnalysis can be found [here](https://www.mdanalysis.org/benchmarks/). + +The major highlight of the project is the introduction of ``capped_distance`` which allows automatic selection of method to evaluate pairwise distances based on predefined set of rules. It allows user-friendly interface for the developers to quickly implement any new algorithm throughout MDAnalysis modules. To test any new algorithm, one must comply with the following protocol. Define the new algorithm as: + +```python +def newmethod_capped(reference, configuration, max_cutoff, min_cutoff=None, box=None, return_distance=True): + """ + Algorithm to evaluate pairs between reference and configuration atoms + and corresponding distances + """ + return pairs, distances +``` + +Once the method is defined, register the function name in ``_determine_method`` in ``MDAnalysis.lib.distances`` as: + +```python +methods = {'bruteforce': _bruteforce_capped, + 'pkdtree': _pkdtree_capped, + 'nsgrid': _nsgrid_capped, + 'newmethod': newmethod_capped} +``` +That's it. The new method is ready to be tested across functions which use ``capped_distance``. 
For any specific application, it can be called as ``capped_distance(ref, conf, max_dist, method=newmethod)`` from the function. + +As mentioned above, MDAnalysis offers support of three different algorithms namely [bruteforce][] which is a naive pairwise distance calculation algorithm and implemented in MDAnalysis for parallel execution, [pkdtree][] which is a wrapper method around binary tree search algorithm, [nsgrid][] is an implementation of cell-list algorithm. During the tenure of GSoC'18, an additional method ``nsgrid`` is implemented in MDAnalysis with the help of [Sebastien Buchoux][]. For more information, the reader is encouraged to read the [blog], which include detailed information about different algorithms and their implementation. + +While implementing any new algorithm for Molecular dynamics trajectories, one additional requirement is to handle the periodic boundary conditions. A combination of versatile function ``augment_coordinates`` and ``undo_augment`` can be used with any algorithm to handle PBC. The main idea is to extend the box by generating duplicate particles in the vicinity of the box by ``augment_coordinates``. These duplicates as well as the original particles can now be used with any algorithm to evaluate the nearest neighbors. After the operation, the duplicate particles can be reverted back to their original particle using ``undo_augment``. These functions are available in ``MDAnalysis.lib._augment``. We encourage the interested readers to try different algorithms using these functions. Hopefully, you can help us improve the performance further with your feedbacks. 
As a starting point, the skeleton to enable PBC would look like: + +```python +def newmethod_search(coords, centers, radius, box=None): + aug, mapping = augment_coordinates(coords, box, radius) + all_coords = no.concatenate([coords, aug]) + """ + Perform operations for distance evaluations + with **all_coords** using the new algorithm + and obtain the result in indices + """ + indices = undo_augment(indices, mapping, len(coords)) + return indices +``` + +Finally, this function can be tested with ``capped_distance`` to check the performance against already implemented algorithms in MDAnalysis. + +This was a flavor of what work was done during GSoC'18. Apart from performance improvements, it is envisioned that this internal functionality will reduce the burden from the user to understand all the technical details of distance search algorithms and instead focus on their analysis segment, as well as allow future developers to easily implement any new algorithm which can exceed the present performance benchmarks. + +As a final note, we managed to get an improvement of ~ 2-3 times in rdf computation, ~ 10 times in identification of bonds, and ~ 10 times in distance based selections for the already existing benchmarks in MDAnalysis. The performance is also found to improve with larger datasets, but is not reported in benchmarks. Any motivated reader is welcome to submit their feedback about performance of the above mentioned functions on their data, and/or a benchmark which we would be happy to showcase to the world. 
+ + +[project]: https://summerofcode.withgoogle.com/projects/#5050592943144960 +[NumFOCUS]: https://numfocus.org/ +[Ayush Suhane]: https://github.com/ayushsuhane +[report]: https://gist.github.com/ayushsuhane/fd114cda20e93b0f61a8acb6d25d3276 +[bruteforce]: http://www.csl.mtu.edu/cs4321/www/Lectures/Lecture%206%20-%20Brute%20Force%20Closest%20Pair%20and%20Convex%20and%20Exhausive%20Search.htm +[pkdtree]: https://en.wikipedia.org/wiki/K-d_tree +[nsgrid]: https://en.wikipedia.org/wiki/Cell_lists +[blog]: https://ayushsuhane.github.io/ +[Sebastien Buchoux]: https://github.com/seb-buch +[Richard Gowers]: https://github.com/richardjgowers +[Jonathan Barnoud]: https://github.com/jbarnoud \ No newline at end of file From 4ba85b9c7e6ced8bb5a975a10c06fd842394a6e3 Mon Sep 17 00:00:00 2001 From: ayush Date: Sat, 1 Sep 2018 21:14:19 -0700 Subject: [PATCH 02/10] slight changes to the text --- _posts/2018-08-30-gsoc18-distancesearch.md | 38 +++++++++++----------- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/_posts/2018-08-30-gsoc18-distancesearch.md b/_posts/2018-08-30-gsoc18-distancesearch.md index a959118a..88923ff9 100644 --- a/_posts/2018-08-30-gsoc18-distancesearch.md +++ b/_posts/2018-08-30-gsoc18-distancesearch.md @@ -1,20 +1,20 @@ --- layout: post -title: Efficient distance search methods +title: Improvements in distance search methods --- -We are pleased to announce another successful year of Google Summer of code with [NumFOCUS][] organization. Thanks to [Richard Gowers][] and [Jonathan Barnoud][] for mentoring the GSoC students. This year, one of the project, was to improve the performance of pairwise distance computations which are used quite frequently in MDAnalysis in one form or the other. MDAnalysis v0.19.0 and higher will include an internal functionality to speed up the distance computations using automatic method selection. As a result, a user need not bother with the technical details of implementations of different search algorithms. 
A flexible functionality with easily extendible interface ``capped_distance`` is introduced in ``MDAnalysis.lib.distances`` for developers. Aside from the major highlight of ``capped_distance`` and its improvements, additional functionality of ``augment_coordinates`` are also implemented for any curious user to implement other neighbor search algorithm with periodic boundary conditions. +We are pleased to announce another successful year of Google Summer of Code with [NumFOCUS][] organization. Thanks to [Richard Gowers][] and [Jonathan Barnoud][] for mentoring the GSoC students. This year, one of the projects, was to improve the performance of pairwise distance computations, which is used quite frequently in MDAnalysis in different forms. MDAnalysis v 0.19.0 and higher will include an internal function to speed up the neighbor search computations using automatic method selection. As a result, a user need not bother with the technical details of implementations of different search algorithms. A flexible functionality with easily extendible interface ``capped_distance`` is introduced in ``MDAnalysis.lib.distances`` for this purpose. Aside from the major highlight of ``capped_distance`` and its improvements, additional functionality of ``augment_coordinates`` is also implemented for any curious user to implement different neighbor search algorithm with periodic boundary conditions. -One of the major bottleneck in various analysis routines in MDAnalysis (and typically in Molecular Dynamics studies) is the evaluation of pairwise distances among the particles. The primary problem revolves around fixed radius neighbor search algorithms. MDAnalysis offers a suite of algorithms including brute force method, tree based binary search algorithm for such problems. While these methods are suitable for a variety of problems dealt in MDAnalysis, one of the question was whether one can improve the performance of distance calculations using other established neighbor search methods. 
+One of the major bottleneck in various analysis routines in MDAnalysis (and typically in Molecular Dynamics studies) is the evaluation of pairwise distances among the particles. The primary problem revolves around fixed radius neighbor search algorithms. MDAnalysis offers a suite of algorithms including brute force method, tree-based binary search algorithms to solve such problems. While these methods are suitable for a variety of analysis functions using pairwise distances in MDAnalysis, one of the question was whether one can improve the performance of distance calculations using other established neighbor search methods. This question led to the inception of Google Summer of Code [project][] with [NumFOCUS][]. [Ayush Suhane][] completed the project and was able to demonstrate performance improvements for specific cases of distance selections, identification of bonds and Radial distribution function in the analysis module of MDAnalysis. More details on the commit history, PR's and blog posts can be found in the final [report][] submitted to GSoC. Real-time benchmarks for specific modules in MDAnalysis can be found [here](https://www.mdanalysis.org/benchmarks/). -The major highlight of the project is the introduction of ``capped_distance`` which allows automatic selection of method to evaluate pairwise distances based on predefined set of rules. It allows user-friendly interface for the developers to quickly implement any new algorithm throughout MDAnalysis modules. To test any new algorithm, one must comply with the following protocol. Define the new algorithm as: +The major highlight of the project is the introduction of ``capped_distance`` which allows automatic selection of methods based on predefined set of rules to evaluate pairs of atoms in the neighborhood of any particle. It allows a user-friendly interface for the developers to quickly implement any new algorithm throughout MDAnalysis modules. 
To test any new algorithm, one must comply with the following protocol: ```python def newmethod_capped(reference, configuration, max_cutoff, min_cutoff=None, box=None, return_distance=True): - """ - Algorithm to evaluate pairs between reference and configuration atoms + """ + An Algorithm to evaluate pairs between reference and configuration atoms and corresponding distances """ return pairs, distances @@ -30,28 +30,28 @@ methods = {'bruteforce': _bruteforce_capped, ``` That's it. The new method is ready to be tested across functions which use ``capped_distance``. For any specific application, it can be called as ``capped_distance(ref, conf, max_dist, method=newmethod)`` from the function. -As mentioned above, MDAnalysis offers support of three different algorithms namely [bruteforce][] which is a naive pairwise distance calculation algorithm and implemented in MDAnalysis for parallel execution, [pkdtree][] which is a wrapper method around binary tree search algorithm, [nsgrid][] is an implementation of cell-list algorithm. During the tenure of GSoC'18, an additional method ``nsgrid`` is implemented in MDAnalysis with the help of [Sebastien Buchoux][]. For more information, the reader is encouraged to read the [blog], which include detailed information about different algorithms and their implementation. +As mentioned above, MDAnalysis offers support of three different algorithms namely [bruteforce][] which is a naive pairwise distance calculation algorithm and implemented in MDAnalysis even for parallel execution, [pkdtree][] is a wrapper method around binary tree search algorithm, [nsgrid][] is an implementation of cell-list algorithm. During the tenure of GSoC'18, an additional method ``nsgrid`` is implemented in MDAnalysis with the help of [Sebastien Buchoux][]. For more information, the reader is encouraged to read the [blog], which include detailed information about different algorithms and their implementation. 
-While implementing any new algorithm for Molecular dynamics trajectories, one additional requirement is to handle the periodic boundary conditions. A combination of versatile function ``augment_coordinates`` and ``undo_augment`` can be used with any algorithm to handle PBC. The main idea is to extend the box by generating duplicate particles in the vicinity of the box by ``augment_coordinates``. These duplicates as well as the original particles can now be used with any algorithm to evaluate the nearest neighbors. After the operation, the duplicate particles can be reverted back to their original particle using ``undo_augment``. These functions are available in ``MDAnalysis.lib._augment``. We encourage the interested readers to try different algorithms using these functions. Hopefully, you can help us improve the performance further with your feedbacks. As a starting point, the skeleton to enable PBC would look like: +While implementing any new algorithm for Molecular dynamics trajectories, one additional requirement is to handle the periodic boundary conditions. A combination of versatile function ``augment_coordinates`` and ``undo_augment`` can be used with any algorithm to handle PBC. The main idea is to extend the box by generating duplicate particles in the vicinity of the box by ``augment_coordinates``. These duplicates, as well as the original particles, can now be used with any algorithm to evaluate the nearest neighbors. After the operation, the duplicate particles can be reverted back to their original particle indices using ``undo_augment``. These functions are available in ``MDAnalysis.lib._augment``. We encourage the interested readers to try different algorithms using these functions. Hopefully, you can help us improve the performance further with your feedback. 
As a starting point, the skeleton to enable PBC would take the following form: ```python def newmethod_search(coords, centers, radius, box=None): - aug, mapping = augment_coordinates(coords, box, radius) - all_coords = no.concatenate([coords, aug]) - """ - Perform operations for distance evaluations - with **all_coords** using the new algorithm - and obtain the result in indices - """ - indices = undo_augment(indices, mapping, len(coords)) - return indices + aug, mapping = augment_coordinates(coords, box, radius) + all_coords = no.concatenate([coords, aug]) + """ + Perform operations for distance evaluations + with **all_coords** using the new algorithm + and obtain the result in indices + """ + indices = undo_augment(indices, mapping, len(coords)) + return indices ``` Finally, this function can be tested with ``capped_distance`` to check the performance against already implemented algorithms in MDAnalysis. -This was a flavor of what work was done during GSoC'18. Apart from performance improvements, it is envisioned that this internal functionality will reduce the burden from the user to understand all the technical details of distance search algorithms and instead focus on their analysis segment, as well as allow future developers to easily implement any new algorithm which can exceed the present performance benchmarks. +This was a flavor of what work was done during GSoC'18. Apart from performance improvements, it is envisioned that this internal functionality will reduce the burden from the user to understand all the technical details of distance search algorithms and instead allow a user to focus on their analysis, as well as allow future developers to easily implement any new algorithm which can exceed the present performance benchmarks. -As a final note, we managed to get an improvement of ~ 2-3 times in rdf computation, ~ 10 times in identification of bonds, and ~ 10 times in distance based selections for the already existing benchmarks in MDAnalysis. 
The performance is also found to improve with larger datasets, but is not reported in benchmarks. Any motivated reader is welcome to submit their feedback about performance of the above mentioned functions on their data, and/or a benchmark which we would be happy to showcase to the world. +As a final note, we managed to get an improvement of ~ 2-3 times in Radial Distribution Function computation, ~ 10 times in identification of bonds, and ~ 10 times in distance based selections for the already existing benchmarks in MDAnalysis. The performance is also found to improve with larger datasets but is not reported in benchmarks. Any motivated reader is welcome to submit their feedbacks about the performance of the above-mentioned functions on their data, and/or a benchmark which we would be happy to showcase to the world. [project]: https://summerofcode.withgoogle.com/projects/#5050592943144960 From e61a7ef19b943a2a93eaba4d997e504a7119e9db Mon Sep 17 00:00:00 2001 From: Richard Gowers Date: Tue, 4 Sep 2018 17:00:01 -0500 Subject: [PATCH 03/10] clarified the travel grant --- _posts/2018-08-24-numfocus-workshop.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2018-08-24-numfocus-workshop.md b/_posts/2018-08-24-numfocus-workshop.md index 68be8349..9381616e 100644 --- a/_posts/2018-08-24-numfocus-workshop.md +++ b/_posts/2018-08-24-numfocus-workshop.md @@ -10,7 +10,7 @@ style="float: right; width: 10em;" /> We are happy to announce that thanks to a [NumFOCUS development grant][numfocus] we will be hosting a 2 day workshop and hackathon aimed at introducing researchers to MDAnalysis. The event will be free to attend and will be held at Northwestern University, Evanston IL on the 12th and 13th of November 2018. -In addition, there are travel grants available for people attending from outside Chicago +In addition, there are small travel grants available for people attending from other US institutions aimed at promoting diversity in STEM. 
Women and minorities are especially encouraged to apply! The first day of the workshop will focus on covering the basics of the Python programming language From 70640c55d28f20828c014284675a8e86ed419ba3 Mon Sep 17 00:00:00 2001 From: Max Linke Date: Tue, 18 Sep 2018 11:50:12 +0200 Subject: [PATCH 04/10] update scipy citation page numbers The page numbers are now taken from the official bibtex entry. --- pages/citations.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pages/citations.md b/pages/citations.md index 5064e1ff..a838cf26 100644 --- a/pages/citations.md +++ b/pages/citations.md @@ -24,7 +24,7 @@ When using MDAnalysis in published work, please cite the following two papers: I. M. Kenney, and O. Beckstein. [MDAnalysis: A Python package for the rapid analysis of molecular dynamics simulations](http://conference.scipy.org/proceedings/scipy2016/oliver_beckstein.html). In S. Benthall and S. Rostrup, editors, *Proceedings of the 15th Python in - Science Conference*, pages 102-109, Austin, TX, 2016. SciPy. + Science Conference*, pages 98-105, Austin, TX, 2016. SciPy. * N. Michaud-Agrawal, E. J. Denning, T. B. Woolf, and O. Beckstein. MDAnalysis: A Toolkit for the Analysis of Molecular Dynamics From 10eb3c312ea87cb398066d222b0c6c1c15b04b3c Mon Sep 17 00:00:00 2001 From: Oliver Beckstein Date: Sat, 6 Oct 2018 15:34:01 -0700 Subject: [PATCH 05/10] explained why both MDA papers should be cited fix #92 --- pages/citations.md | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/pages/citations.md b/pages/citations.md index 5064e1ff..871a1ce9 100644 --- a/pages/citations.md +++ b/pages/citations.md @@ -17,7 +17,15 @@ algorithms and libraries that were used in the program. ## MDAnalysis library ## -When using MDAnalysis in published work, please cite the following two papers: +When using MDAnalysis in published work, please cite the following two +papers. 
+ +(We are asking you to cite *both* papers if at all possible because +the 2016 paper describes many updates to the original 2011 paper and +neither paper on its own provides a comprehensive description of the +library. We will publish a complete self-contained paper with the +upcoming 1.0 release of MDAnalysis, which will then supersede these +two citations.) * R. J. Gowers, M. Linke, J. Barnoud, T. J. E. Reddy, M. N. Melo, S. L. Seyler, D. L. Dotson, J. Domanski, S. Buchoux, From e05fc69598f6238ec8c8f824b385d52e49b13b2b Mon Sep 17 00:00:00 2001 From: Oliver Beckstein Date: Thu, 1 Nov 2018 23:09:30 -0700 Subject: [PATCH 06/10] added link to workshop materials/program --- _posts/2018-08-24-numfocus-workshop.md | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/_posts/2018-08-24-numfocus-workshop.md b/_posts/2018-08-24-numfocus-workshop.md index 9381616e..73dfaad7 100644 --- a/_posts/2018-08-24-numfocus-workshop.md +++ b/_posts/2018-08-24-numfocus-workshop.md @@ -23,7 +23,15 @@ contributing to open source software. style="center" alt="Workshop program details" width="100%"/> +## Registration + To attend the Workshop [please register here.](https://goo.gl/forms/eoEthZEts8junr3V2) -Space is limited and registration will close on October 15th. +Space is limited and registration closes on October 15th. + +## Program and Materials + +The [workshop materials] are online and are available for attendants +and anyone else interested. 
[numfocus]: https://numfocus.org/blog/summer-2018-open-source-development-grants +[workshop materials]: {{site.baseurl}}/WorkshopHackathon2018/ From c36a7e851ff77ab487a2bbdf937d487e809e56f0 Mon Sep 17 00:00:00 2001 From: Richard Gowers Date: Fri, 16 Nov 2018 10:53:08 -0600 Subject: [PATCH 07/10] started rewrite --- _posts/2018-08-30-gsoc18-distancesearch.md | 67 ---------------- _posts/2018-11-16-gsoc18-distancesearch.md | 92 ++++++++++++++++++++++ 2 files changed, 92 insertions(+), 67 deletions(-) delete mode 100644 _posts/2018-08-30-gsoc18-distancesearch.md create mode 100644 _posts/2018-11-16-gsoc18-distancesearch.md diff --git a/_posts/2018-08-30-gsoc18-distancesearch.md b/_posts/2018-08-30-gsoc18-distancesearch.md deleted file mode 100644 index 88923ff9..00000000 --- a/_posts/2018-08-30-gsoc18-distancesearch.md +++ /dev/null @@ -1,67 +0,0 @@ ---- -layout: post -title: Improvements in distance search methods ---- - -We are pleased to announce another successful year of Google Summer of Code with [NumFOCUS][] organization. Thanks to [Richard Gowers][] and [Jonathan Barnoud][] for mentoring the GSoC students. This year, one of the projects, was to improve the performance of pairwise distance computations, which is used quite frequently in MDAnalysis in different forms. MDAnalysis v 0.19.0 and higher will include an internal function to speed up the neighbor search computations using automatic method selection. As a result, a user need not bother with the technical details of implementations of different search algorithms. A flexible functionality with easily extendible interface ``capped_distance`` is introduced in ``MDAnalysis.lib.distances`` for this purpose. Aside from the major highlight of ``capped_distance`` and its improvements, additional functionality of ``augment_coordinates`` is also implemented for any curious user to implement different neighbor search algorithm with periodic boundary conditions. 
- -One of the major bottleneck in various analysis routines in MDAnalysis (and typically in Molecular Dynamics studies) is the evaluation of pairwise distances among the particles. The primary problem revolves around fixed radius neighbor search algorithms. MDAnalysis offers a suite of algorithms including brute force method, tree-based binary search algorithms to solve such problems. While these methods are suitable for a variety of analysis functions using pairwise distances in MDAnalysis, one of the question was whether one can improve the performance of distance calculations using other established neighbor search methods. - -This question led to the inception of Google Summer of Code [project][] with [NumFOCUS][]. [Ayush Suhane][] completed the project and was able to demonstrate performance improvements for specific cases of distance selections, identification of bonds and Radial distribution function in the analysis module of MDAnalysis. More details on the commit history, PR's and blog posts can be found in the final [report][] submitted to GSoC. Real-time benchmarks for specific modules in MDAnalysis can be found [here](https://www.mdanalysis.org/benchmarks/). - -The major highlight of the project is the introduction of ``capped_distance`` which allows automatic selection of methods based on predefined set of rules to evaluate pairs of atoms in the neighborhood of any particle. It allows a user-friendly interface for the developers to quickly implement any new algorithm throughout MDAnalysis modules. 
To test any new algorithm, one must comply with the following protocol: - -```python -def newmethod_capped(reference, configuration, max_cutoff, min_cutoff=None, box=None, return_distance=True): - """ - An Algorithm to evaluate pairs between reference and configuration atoms - and corresponding distances - """ - return pairs, distances -``` - -Once the method is defined, register the function name in ``_determine_method`` in ``MDAnalysis.lib.distances`` as: - -```python -methods = {'bruteforce': _bruteforce_capped, - 'pkdtree': _pkdtree_capped, - 'nsgrid': _nsgrid_capped, - 'newmethod': newmethod_capped} -``` -That's it. The new method is ready to be tested across functions which use ``capped_distance``. For any specific application, it can be called as ``capped_distance(ref, conf, max_dist, method=newmethod)`` from the function. - -As mentioned above, MDAnalysis offers support of three different algorithms namely [bruteforce][] which is a naive pairwise distance calculation algorithm and implemented in MDAnalysis even for parallel execution, [pkdtree][] is a wrapper method around binary tree search algorithm, [nsgrid][] is an implementation of cell-list algorithm. During the tenure of GSoC'18, an additional method ``nsgrid`` is implemented in MDAnalysis with the help of [Sebastien Buchoux][]. For more information, the reader is encouraged to read the [blog], which include detailed information about different algorithms and their implementation. - -While implementing any new algorithm for Molecular dynamics trajectories, one additional requirement is to handle the periodic boundary conditions. A combination of versatile function ``augment_coordinates`` and ``undo_augment`` can be used with any algorithm to handle PBC. The main idea is to extend the box by generating duplicate particles in the vicinity of the box by ``augment_coordinates``. These duplicates, as well as the original particles, can now be used with any algorithm to evaluate the nearest neighbors. 
After the operation, the duplicate particles can be reverted back to their original particle indices using ``undo_augment``. These functions are available in ``MDAnalysis.lib._augment``. We encourage the interested readers to try different algorithms using these functions. Hopefully, you can help us improve the performance further with your feedback. As a starting point, the skeleton to enable PBC would take the following form: - -```python -def newmethod_search(coords, centers, radius, box=None): - aug, mapping = augment_coordinates(coords, box, radius) - all_coords = no.concatenate([coords, aug]) - """ - Perform operations for distance evaluations - with **all_coords** using the new algorithm - and obtain the result in indices - """ - indices = undo_augment(indices, mapping, len(coords)) - return indices -``` - -Finally, this function can be tested with ``capped_distance`` to check the performance against already implemented algorithms in MDAnalysis. - -This was a flavor of what work was done during GSoC'18. Apart from performance improvements, it is envisioned that this internal functionality will reduce the burden from the user to understand all the technical details of distance search algorithms and instead allow a user to focus on their analysis, as well as allow future developers to easily implement any new algorithm which can exceed the present performance benchmarks. - -As a final note, we managed to get an improvement of ~ 2-3 times in Radial Distribution Function computation, ~ 10 times in identification of bonds, and ~ 10 times in distance based selections for the already existing benchmarks in MDAnalysis. The performance is also found to improve with larger datasets but is not reported in benchmarks. Any motivated reader is welcome to submit their feedbacks about the performance of the above-mentioned functions on their data, and/or a benchmark which we would be happy to showcase to the world. 
- - -[project]: https://summerofcode.withgoogle.com/projects/#5050592943144960 -[NumFOCUS]: https://numfocus.org/ -[Ayush Suhane]: https://github.com/ayushsuhane -[report]: https://gist.github.com/ayushsuhane/fd114cda20e93b0f61a8acb6d25d3276 -[bruteforce]: http://www.csl.mtu.edu/cs4321/www/Lectures/Lecture%206%20-%20Brute%20Force%20Closest%20Pair%20and%20Convex%20and%20Exhausive%20Search.htm -[pkdtree]: https://en.wikipedia.org/wiki/K-d_tree -[nsgrid]: https://en.wikipedia.org/wiki/Cell_lists -[blog]: https://ayushsuhane.github.io/ -[Sebastien Buchoux]: https://github.com/seb-buch -[Richard Gowers]: https://github.com/richardjgowers -[Jonathan Barnoud]: https://github.com/jbarnoud \ No newline at end of file diff --git a/_posts/2018-11-16-gsoc18-distancesearch.md b/_posts/2018-11-16-gsoc18-distancesearch.md new file mode 100644 index 00000000..2c3555d1 --- /dev/null +++ b/_posts/2018-11-16-gsoc18-distancesearch.md @@ -0,0 +1,92 @@ +--- +layout: post +title: GSOC 2018: Improvements in distance search methods +--- + +We are pleased to announce another successful year of [Google Summer of Code][] with the [NumFOCUS][] organization, +thanks to [Richard Gowers][] and [Jonathan Barnoud][] for mentoring the GSoC students. +This year one of the projects was to improve the performance of pairwise distance computations, which is used quite frequently in MDAnalysis in different forms. +MDAnalysis v0.19.0 and higher will include the new functions `MDAnalysis.lib.distances.capped_distance` and `MDAnalysis.lib.distances.self_capped_distance` +which offer a much faster way to calculate all pairwise distances up to a certain maximum distance. +By only considering distances up to a certain maximum, we can use various algorithms to optimise the number of pairwise comparisons that are performed. 
+Behind the scenes, these functions are using one of three different algorithms: +[bruteforce][] which is a naive pairwise distance calculation algorithm, +[pkdtree][] which is a wrapper method around Scipy's KD tree search algorithm +and [nsgrid][] which is an implementation of cell-list algorithm. +This last algorithm uses the new ``MDAnalysis.lib.nsgrid`` module which was implemented with the help of [Sebastien Buchoux][]. +For more information on these algorithms the reader is encouraged to read this [blog], which includes a comparison of these approaches and their performance in different conditions. + + +One of the major bottleneck in various analysis routines in MDAnalysis (and typically in Molecular Dynamics studies) is the evaluation of pairwise distances among the particles. + + +The primary problem revolves around fixed radius neighbor search algorithms. +MDAnalysis offers a suite of algorithms including brute force method, tree-based binary search algorithms to solve such problems. +While these methods are suitable for a variety of analysis functions using pairwise distances in MDAnalysis, one of the question was whether one can improve the performance of distance calculations using other established neighbor search methods. + +This question led to the inception of Google Summer of Code [project][] with [NumFOCUS][]. +[Ayush Suhane][] completed the project and was able to demonstrate performance improvements for specific cases of distance selections, identification of bonds and Radial distribution function in the analysis module of MDAnalysis. +More details on the commit history, PR's and blog posts can be found in the final [report][] submitted to GSoC. Real-time benchmarks for specific modules in MDAnalysis can be found [here](https://www.mdanalysis.org/benchmarks/). 
+ + + +The major highlight of the project is the introduction of ``MDAnalysis.lib.distances.capped_distance`` which allows automatic selection of methods based on predefined set of rules to evaluate pairs of atoms in the neighborhood of any particle. It allows a user-friendly interface for the developers to quickly implement any new algorithm throughout MDAnalysis modules. To test any new algorithm, one must comply with the following API: + +```python +def newmethod_capped(reference, configuration, max_cutoff, min_cutoff=None, box=None, return_distance=True): + """ + An Algorithm to evaluate pairs between reference and configuration atoms + and corresponding distances + """ + return pairs, distances +``` + +Once the method is defined, register the function name in ``_determine_method`` in ``MDAnalysis.lib.distances`` as: + +```python +from MDAnalysis.lib import distances +distances_methods['newmethod'] = newmethod_capped +``` +That's it. The new method is ready to be tested across functions which use ``capped_distance``. For any specific application, it can be called as ``capped_distance(ref, conf, max_dist, method='newmethod')`` from the function. + + + +While implementing any new algorithm for Molecular dynamics trajectories, one additional requirement is to handle the periodic boundary conditions. +A combination of versatile function ``augment_coordinates`` and ``undo_augment`` can be used with any algorithm to handle PBC. +The main idea is to extend the box by generating duplicate particles in the vicinity of the box by ``augment_coordinates``. +These duplicates, as well as the original particles, can now be used with any algorithm to evaluate the nearest neighbors. +After the operation, the duplicate particles can be reverted back to their original particle indices using ``undo_augment``. +These functions are available in ``MDAnalysis.lib._augment``. We encourage the interested readers to try different algorithms using these functions. 
+Hopefully, you can help us improve the performance further with your feedback. As a starting point, the skeleton to enable PBC would take the following form: + +```python +def newmethod_search(coords, centers, radius, box=None): + aug, mapping = augment_coordinates(coords, box, radius) + all_coords = np.concatenate([coords, aug]) + """ + Perform operations for distance evaluations + with **all_coords** using the new algorithm + and obtain the result in indices + """ + indices = undo_augment(indices, mapping, len(coords)) + return indices +``` + +Finally, this function can be tested with ``capped_distance`` to check the performance against already implemented algorithms in MDAnalysis. + +This was a flavor of what work was done during GSoC'18. Apart from performance improvements, it is envisioned that this internal functionality will reduce the burden from the user to understand all the technical details of distance search algorithms and instead allow a user to focus on their analysis, as well as allow future developers to easily implement any new algorithm which can exceed the present performance benchmarks. + +As a final note, we managed to get an improvement of ~ 2-3 times in Radial Distribution Function computation, ~ 10 times in identification of bonds, and ~ 10 times in distance based selections for the already existing benchmarks in MDAnalysis. The performance is also found to improve with larger datasets but is not reported in benchmarks. Any motivated reader is welcome to submit their feedbacks about the performance of the above-mentioned functions on their data, and/or a benchmark which we would be happy to showcase to the world. 
+ + +[Google Summer of Code]: https://summerofcode.withgoogle.com/projects/#5050592943144960 +[NumFOCUS]: https://numfocus.org/ +[Ayush Suhane]: https://github.com/ayushsuhane +[report]: https://gist.github.com/ayushsuhane/fd114cda20e93b0f61a8acb6d25d3276 +[bruteforce]: http://www.csl.mtu.edu/cs4321/www/Lectures/Lecture%206%20-%20Brute%20Force%20Closest%20Pair%20and%20Convex%20and%20Exhausive%20Search.htm +[pkdtree]: https://en.wikipedia.org/wiki/K-d_tree +[nsgrid]: https://en.wikipedia.org/wiki/Cell_lists +[blog]: https://ayushsuhane.github.io/ +[Sebastien Buchoux]: https://github.com/seb-buch +[Richard Gowers]: https://github.com/richardjgowers +[Jonathan Barnoud]: https://github.com/jbarnoud From 6edf8459d1506eacbae4723efe142c0c65ac7ed2 Mon Sep 17 00:00:00 2001 From: Oliver Beckstein Date: Wed, 21 Nov 2018 00:08:49 -0700 Subject: [PATCH 08/10] updated capped_distance blog post - added sub-headings - added links - minor text changes --- _posts/2018-11-16-gsoc18-distancesearch.md | 23 ++++++++++++++-------- 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/_posts/2018-11-16-gsoc18-distancesearch.md b/_posts/2018-11-16-gsoc18-distancesearch.md index 2c3555d1..98c283f3 100644 --- a/_posts/2018-11-16-gsoc18-distancesearch.md +++ b/_posts/2018-11-16-gsoc18-distancesearch.md @@ -6,7 +6,7 @@ title: GSOC 2018: Improvements in distance search methods We are pleased to announce another successful year of [Google Summer of Code][] with the [NumFOCUS][] organization, thanks to [Richard Gowers][] and [Jonathan Barnoud][] for mentoring the GSoC students. This year one of the projects was to improve the performance of pairwise distance computations, which is used quite frequently in MDAnalysis in different forms. 
-MDAnalysis v0.19.0 and higher will include the new functions `MDAnalysis.lib.distances.capped_distance` and `MDAnalysis.lib.distances.self_capped_distance` +MDAnalysis v0.19.0 and higher include the new functions [`MDAnalysis.lib.distances.capped_distance`][] and [`MDAnalysis.lib.distances.self_capped_distance`][] which offer a much faster way to calculate all pairwise distances up to a certain maximum distance. By only considering distances up to a certain maximum, we can use various algorithms to optimise the number of pairwise comparisons that are performed. Behind the scenes, these functions are using one of three different algorithms: @@ -14,23 +14,26 @@ Behind the scenes, these functions are using one of three different algorithms: [pkdtree][] which is a wrapper method around Scipy's KD tree search algorithm and [nsgrid][] which is an implementation of cell-list algorithm. This last algorithm uses the new ``MDAnalysis.lib.nsgrid`` module which was implemented with the help of [Sebastien Buchoux][]. -For more information on these algorithms the reader is encouraged to read this [blog], which includes a comparison of these approaches and their performance in different conditions. +For more information on these algorithms the reader is encouraged to read @ayushsuhane's [blog], which includes a comparison of these approaches and their performance in different conditions. -One of the major bottleneck in various analysis routines in MDAnalysis (and typically in Molecular Dynamics studies) is the evaluation of pairwise distances among the particles. +## The GSoC project + +One of the major bottleneck in various analysis routines in MDAnalysis (and typically in molecular dynamics studies) is the evaluation of pairwise distances among the particles. The primary problem revolves around fixed radius neighbor search algorithms. MDAnalysis offers a suite of algorithms including brute force method, tree-based binary search algorithms to solve such problems. 
While these methods are suitable for a variety of analysis functions using pairwise distances in MDAnalysis, one of the question was whether one can improve the performance of distance calculations using other established neighbor search methods. -This question led to the inception of Google Summer of Code [project][] with [NumFOCUS][]. -[Ayush Suhane][] completed the project and was able to demonstrate performance improvements for specific cases of distance selections, identification of bonds and Radial distribution function in the analysis module of MDAnalysis. +This question led to the inception of a Google Summer of Code [project][] with [NumFOCUS][]. +[Ayush Suhane][] completed the project and was able to demonstrate performance improvements for specific cases of distance selections, identification of bonds, and radial distribution function in the analysis module of MDAnalysis. More details on the commit history, PR's and blog posts can be found in the final [report][] submitted to GSoC. Real-time benchmarks for specific modules in MDAnalysis can be found [here](https://www.mdanalysis.org/benchmarks/). +## The new `capped_distance()` function -The major highlight of the project is the introduction of ``MDAnalysis.lib.distances.capped_distance`` which allows automatic selection of methods based on predefined set of rules to evaluate pairs of atoms in the neighborhood of any particle. It allows a user-friendly interface for the developers to quickly implement any new algorithm throughout MDAnalysis modules. To test any new algorithm, one must comply with the following API: +The major highlight of the project is the introduction of [`MDAnalysis.lib.distances.capped_distance`][] which allows automatic selection of methods based on predefined set of rules to evaluate pairs of atoms in the neighborhood of any particle. It allows a user-friendly interface for the developers to quickly implement any new algorithm throughout MDAnalysis modules. 
To test any new algorithm, one must comply with the following API: ```python def newmethod_capped(reference, configuration, max_cutoff, min_cutoff=None, box=None, return_distance=True): @@ -50,6 +53,7 @@ distances_methods['newmethod'] = newmethod_capped That's it. The new method is ready to be tested across functions which use ``capped_distance``. For any specific application, it can be called as ``capped_distance(ref, conf, max_dist, method='newmethod')`` from the function. +## Treatment of periodic boundary conditions While implementing any new algorithm for Molecular dynamics trajectories, one additional requirement is to handle the periodic boundary conditions. A combination of versatile function ``augment_coordinates`` and ``undo_augment`` can be used with any algorithm to handle PBC. @@ -74,14 +78,17 @@ def newmethod_search(coords, centers, radius, box=None): Finally, this function can be tested with ``capped_distance`` to check the performance against already implemented algorithms in MDAnalysis. -This was a flavor of what work was done during GSoC'18. Apart from performance improvements, it is envisioned that this internal functionality will reduce the burden from the user to understand all the technical details of distance search algorithms and instead allow a user to focus on their analysis, as well as allow future developers to easily implement any new algorithm which can exceed the present performance benchmarks. +## Performance improvements -As a final note, we managed to get an improvement of ~ 2-3 times in Radial Distribution Function computation, ~ 10 times in identification of bonds, and ~ 10 times in distance based selections for the already existing benchmarks in MDAnalysis. The performance is also found to improve with larger datasets but is not reported in benchmarks. 
Any motivated reader is welcome to submit their feedbacks about the performance of the above-mentioned functions on their data, and/or a benchmark which we would be happy to showcase to the world. +This was a flavor of what work was done during GSoC'18. Apart from performance improvements, it is envisioned that this internal functionality will reduce the burden from the user to understand all the technical details of distance search algorithms and instead allow a user to focus on their analysis, as well as allow future developers to easily implement any new algorithm which can exceed the present performance benchmarks. +As a final note, we managed to get a speed improvement of ~ 2-3 times in Radial Distribution Function computation, ~ 10 times in identification of bonds, and ~ 10 times in distance based selections for the already existing benchmarks in MDAnalysis. The performance is also found to improve with larger datasets but is not reported in benchmarks. Any motivated reader is welcome to submit their feedbacks about the performance of the above-mentioned functions on their data, and/or a benchmark which we would be happy to showcase to the world. 
[Google Summer of Code]: https://summerofcode.withgoogle.com/projects/#5050592943144960 [NumFOCUS]: https://numfocus.org/ [Ayush Suhane]: https://github.com/ayushsuhane +[`MDAnalysis.lib.distances.capped_distance`]: https://www.mdanalysis.org/docs/documentation_pages/lib/distances.html#MDAnalysis.lib.distances.capped_distance +[`MDAnalysis.lib.distances.self_capped_distance`]: https://www.mdanalysis.org/docs/documentation_pages/lib/distances.html#MDAnalysis.lib.distances.self_capped_distance [report]: https://gist.github.com/ayushsuhane/fd114cda20e93b0f61a8acb6d25d3276 [bruteforce]: http://www.csl.mtu.edu/cs4321/www/Lectures/Lecture%206%20-%20Brute%20Force%20Closest%20Pair%20and%20Convex%20and%20Exhausive%20Search.htm [pkdtree]: https://en.wikipedia.org/wiki/K-d_tree From 2f35a47f9090f72d3d718ac86c9404efac16f1e9 Mon Sep 17 00:00:00 2001 From: Oliver Beckstein Date: Wed, 28 Nov 2018 17:03:32 -0700 Subject: [PATCH 09/10] today date for distances blog post --- ...oc18-distancesearch.md => 2018-11-28-gsoc18-distancesearch.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename _posts/{2018-11-16-gsoc18-distancesearch.md => 2018-11-28-gsoc18-distancesearch.md} (100%) diff --git a/_posts/2018-11-16-gsoc18-distancesearch.md b/_posts/2018-11-28-gsoc18-distancesearch.md similarity index 100% rename from _posts/2018-11-16-gsoc18-distancesearch.md rename to _posts/2018-11-28-gsoc18-distancesearch.md From 5142594ddc3527bb5a085ff10e574cd4529988fe Mon Sep 17 00:00:00 2001 From: Oliver Beckstein Date: Wed, 28 Nov 2018 17:03:49 -0700 Subject: [PATCH 10/10] finalized distances GSOC2018 blog post --- _posts/2018-11-28-gsoc18-distancesearch.md | 22 +++++++++++++++++----- 1 file changed, 17 insertions(+), 5 deletions(-) diff --git a/_posts/2018-11-28-gsoc18-distancesearch.md b/_posts/2018-11-28-gsoc18-distancesearch.md index 98c283f3..55639bcb 100644 --- a/_posts/2018-11-28-gsoc18-distancesearch.md +++ b/_posts/2018-11-28-gsoc18-distancesearch.md @@ -1,19 +1,21 @@ --- layout: 
post -title: GSOC 2018: Improvements in distance search methods +title: "GSOC 2018: Improvements in distance search methods" --- We are pleased to announce another successful year of [Google Summer of Code][] with the [NumFOCUS][] organization, thanks to [Richard Gowers][] and [Jonathan Barnoud][] for mentoring the GSoC students. -This year one of the projects was to improve the performance of pairwise distance computations, which is used quite frequently in MDAnalysis in different forms. -MDAnalysis v0.19.0 and higher include the new functions [`MDAnalysis.lib.distances.capped_distance`][] and [`MDAnalysis.lib.distances.self_capped_distance`][] +This year one of the projects was [to improve the performance of pairwise distance computations][], which is used quite frequently in MDAnalysis in different forms. + +MDAnalysis v0.19.0 and higher now include the _new functions [`MDAnalysis.lib.distances.capped_distance`][] and [`MDAnalysis.lib.distances.self_capped_distance`][]_ which offer a much faster way to calculate all pairwise distances up to a certain maximum distance. By only considering distances up to a certain maximum, we can use various algorithms to optimise the number of pairwise comparisons that are performed. Behind the scenes, these functions are using one of three different algorithms: [bruteforce][] which is a naive pairwise distance calculation algorithm, [pkdtree][] which is a wrapper method around Scipy's KD tree search algorithm and [nsgrid][] which is an implementation of cell-list algorithm. -This last algorithm uses the new ``MDAnalysis.lib.nsgrid`` module which was implemented with the help of [Sebastien Buchoux][]. +This last algorithm uses the new [`MDAnalysis.lib.nsgrid`][] module which was implemented with the help of [Sebastien Buchoux][]. + For more information on these algorithms the reader is encouraged to read @ayushsuhane's [blog], which includes a comparison of these approaches and their performance in different conditions. 
@@ -80,13 +82,22 @@ Finally, this function can be tested with ``capped_distance`` to check the perfo ## Performance improvements +As a final note, we managed to get a speed improvement of +- ~ 2-3 times in Radial Distribution Function computation, +- ~ 10 times in identification of bonds, and +- ~ 10 times in distance based selections for the already existing benchmarks in MDAnalysis. + +The performance is also found to improve with larger datasets but is not reported in benchmarks. Any motivated reader is welcome to submit their feedbacks about the performance of the above-mentioned functions on their data, and/or a benchmark which we would be happy to showcase to the world. + This was a flavor of what work was done during GSoC'18. Apart from performance improvements, it is envisioned that this internal functionality will reduce the burden from the user to understand all the technical details of distance search algorithms and instead allow a user to focus on their analysis, as well as allow future developers to easily implement any new algorithm which can exceed the present performance benchmarks. -As a final note, we managed to get a speed improvement of ~ 2-3 times in Radial Distribution Function computation, ~ 10 times in identification of bonds, and ~ 10 times in distance based selections for the already existing benchmarks in MDAnalysis. The performance is also found to improve with larger datasets but is not reported in benchmarks. Any motivated reader is welcome to submit their feedbacks about the performance of the above-mentioned functions on their data, and/or a benchmark which we would be happy to showcase to the world. 
+ +— [Ayush Suhane][], [Richard Gowers][] [Google Summer of Code]: https://summerofcode.withgoogle.com/projects/#5050592943144960 [NumFOCUS]: https://numfocus.org/ [Ayush Suhane]: https://github.com/ayushsuhane +[to improve the performance of pairwise distance computations]: {% post_url 2018-04-26-gsoc-students %}#ayush-suhane-improve-distance-search-methods-in-mdanalysis [`MDAnalysis.lib.distances.capped_distance`]: https://www.mdanalysis.org/docs/documentation_pages/lib/distances.html#MDAnalysis.lib.distances.capped_distance [`MDAnalysis.lib.distances.self_capped_distance`]: https://www.mdanalysis.org/docs/documentation_pages/lib/distances.html#MDAnalysis.lib.distances.self_capped_distance [report]: https://gist.github.com/ayushsuhane/fd114cda20e93b0f61a8acb6d25d3276 @@ -94,6 +105,7 @@ As a final note, we managed to get a speed improvement of ~ 2-3 times in Radial [pkdtree]: https://en.wikipedia.org/wiki/K-d_tree [nsgrid]: https://en.wikipedia.org/wiki/Cell_lists [blog]: https://ayushsuhane.github.io/ +[`MDAnalysis.lib.nsgrid`]: https://www.mdanalysis.org/docs/documentation_pages/lib/nsgrid.html [Sebastien Buchoux]: https://github.com/seb-buch [Richard Gowers]: https://github.com/richardjgowers [Jonathan Barnoud]: https://github.com/jbarnoud