Improving the RDKitConverter caching system by cbouy · Pull Request #2942 · MDAnalysis/mdanalysis

cbouy · 2020-09-13T23:12:30Z

The current "homemade" caching system in the RDKit converter only stores the most recent conversion.
This new version uses functools.lru_cache, which lets users select how many molecules should be cached and improves readability/maintainability IMO.

Also, the new caching system retrieves the converted items from the hash of all the arguments passed to the decorated atomgroup_to_mol function, instead of the id of the atomgroup and the arguments, which makes more sense. I didn't know what a hash was until recently so please forgive me for the rookie mistake :D
Now if you successively run u.atoms.convert_to("RDKIT") it will benefit from the caching system.

I needed to convert two different atomgroups (protein and ligand) while iterating over a trajectory, and the previous system would rebuild the whole topology (which takes quite some time for a protein) for each molecule at every frame, which is why I think this change is necessary. Now it works like a breeze.

Changes made in this Pull Request:

Replaces the RDKitConverter cache system with functools.lru_cache
Adds the set_converter_cache_size(maxsize) function to modify how many items are retained in the cache
Moves atomgroup_to_mol outside of the RDKitConverter class (it's not really needed there anyway); otherwise I would need to define the __hash__ and __eq__ dunders for the caching to work
Changes the default behavior of the RDKitConverter when an AtomGroup with no hydrogen is being converted: the converter now raises an AttributeError. The error is not raised when NoImplicit=False
Adds the force parameter to the RDKitConverter to ignore the above AttributeError and continue the conversion, which is mostly useful for inorganic molecules, CO2 and so on.
Adds a few more tests to increase the coverage of the RDKitConverter
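
To illustrate the protein/ligand use case described above, here is a minimal sketch of how functools.lru_cache behaves when two inputs alternate. The convert function and its no_implicit parameter are hypothetical stand-ins, not the MDAnalysis code; the only thing taken from the standard library is lru_cache itself.

```python
from functools import lru_cache

calls = []  # records every real (uncached) conversion


@lru_cache(maxsize=2)
def convert(group, no_implicit=True):
    """Hypothetical stand-in for an expensive conversion.

    lru_cache builds its key from the hash of the arguments, so
    `group` must be hashable.
    """
    calls.append(group)
    return f"mol({group}, no_implicit={no_implicit})"


# Alternate between two groups over three "frames", as when converting
# a protein and a ligand at every step of a trajectory.
for _frame in range(3):
    convert("protein")
    convert("ligand")

info = convert.cache_info()
print(calls)                   # only the first frame did real work
print(info.hits, info.misses)
```

With a cache that keeps only the most recent conversion, every call in this loop would be a miss, because the two groups keep evicting each other.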

PR Checklist

Tests?
Docs?
CHANGELOG updated?
Issue raised/referenced?

pep8speaks · 2020-09-13T23:12:35Z

Hello @cbouy! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-05-11 10:08:59 UTC

codecov · 2020-09-14T00:42:15Z

Codecov Report

Merging #2942 (2aceaf8) into develop (d0fc581) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff            @@
##           develop    #2942   +/-   ##
========================================
  Coverage    93.55%   93.56%           
========================================
  Files          176      176           
  Lines        22837    22837           
  Branches      3194     3195    +1     
========================================
+ Hits         21366    21368    +2     
+ Misses        1421     1418    -3     
- Partials        50       51    +1

Impacted Files	Coverage Δ
package/MDAnalysis/converters/RDKit.py	`98.08% <100.00%> (+0.76%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

cbouy · 2020-09-22T09:12:48Z

ping @IAlibay @richardjgowers

orbeckst

I like the use of the lru cache from the stdlib. Peripheral comments inline.

package/MDAnalysis/coordinates/RDKit.py

orbeckst · 2020-09-23T20:47:17Z

package/MDAnalysis/coordinates/RDKit.py

+        conversions in memory. Using ``maxsize=None`` will remove all limits
+        to the cache size, i.e. everything is cached.
+    """
+    global atomgroup_to_mol


justified use of global ;-)

This is probably not thread-safe – not a big deal, though, and I don't have a better idea.

(Although, we don't really encourage use of threads for parallelization; multiprocessing should do just fine.)
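
For readers who have not seen this pattern, the following is a rough sketch of how a module-level cached function can be rebound with a new cache size. The function names match the ones in this PR, but the bodies are assumptions made for illustration and are not the code under review.

```python
from functools import lru_cache


def _convert(ag, NoImplicit=True):
    """Hypothetical undecorated conversion function."""
    return ("mol", ag, NoImplicit)


# Module-level cached wrapper with a default size of 2.
atomgroup_to_mol = lru_cache(maxsize=2)(_convert)


def set_converter_cache_size(maxsize):
    """Rebind the cached wrapper with a new size (assumed implementation)."""
    global atomgroup_to_mol
    # lru_cache exposes the undecorated function as __wrapped__.
    # Rewrapping it creates a fresh, empty cache.
    atomgroup_to_mol = lru_cache(maxsize=maxsize)(atomgroup_to_mol.__wrapped__)


atomgroup_to_mol("ag")           # fills one slot of the old cache
set_converter_cache_size(None)   # unlimited cache, old entries are dropped
print(atomgroup_to_mol.cache_info())
```

The rebinding is where the thread-safety concern comes from: a thread that looked up the old wrapper just before the assignment keeps filling a cache that nothing else will read.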

IAlibay

Overall lgtm! Just a few comments, mainly to do with tests & docs.

package/MDAnalysis/coordinates/RDKit.py

testsuite/MDAnalysisTests/coordinates/test_rdkit.py

IAlibay

Thanks @cbouy, couple of near-final comments with very minimal changes (one of which probably can be just ignored). The main discussion point remains this implicit hydrogens thing.

testsuite/MDAnalysisTests/coordinates/test_rdkit.py

package/MDAnalysis/coordinates/RDKit.py

cbouy · 2020-11-03T16:08:16Z

I updated my first post with the new changes.
The Azure failures seem to be related to H5MD; I'll let you investigate.
Also coverage is not 100% even with the new tests 😭

cbouy · 2020-11-18T11:39:15Z

All tests are passing 💃 anything else ?

IAlibay · 2020-11-19T20:07:35Z

> All tests are passing 💃 anything else ?

Apologies for taking so long here, I'll re-review over the weekend but I think we should be good.

IAlibay · 2021-04-10T10:42:39Z

@cbouy if you want to update this against the current develop, it'll finally be on my list for the next thing I review.

Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>

cbouy · 2021-04-10T13:42:49Z

Okay I think I finally managed to run a proper git rebase upstream/develop this time 💁‍♂️

IAlibay

Sorry about the massive delay here @cbouy, overall lgtm!

Just a couple of small questions.

@orbeckst do you want to re-review or are we good to merge?

package/MDAnalysis/coordinates/RDKit.py

testsuite/MDAnalysisTests/coordinates/test_rdkit.py

IAlibay · 2021-04-22T19:50:29Z

testsuite/MDAnalysisTests/coordinates/test_rdkit.py

    def test_single_atom_mol(self, smi):
        u = mda.Universe.from_smiles(smi, addHs=False,
                                     generate_coordinates=False)
-        mol = u.atoms.convert_to("RDKIT")


Sorry, I think I'm just being silly and forgetting a very obvious thing. Could you remind me why these are all being switched away from convert_to?

convert_to doesn't pass arguments to the underlying converter; that was in a PR at some point though (#2882)

Doesn't this behaviour contradict the docstring? I.e. ":func:set_converter_cache_size. However, ag.convert_to("RDKIT")
followed by ag.convert_to("RDKIT", NoImplicit=False) will not use the"

Or was the argument that we would merge #2882 before this PR?

The point is that the converter modules weren't really documented to be instantiated like c = mda.coordinates.RDKit.RDKitConverter(); c.convert(...); they usually go through the convert_to AtomGroup method.
So yeah, I assumed #2882 would be merged before v2.0 comes out

alright, let's see if we can revive #2882 then

Switched back to using convert_to now that it's merged!
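
The docstring fragment quoted in this thread says that calling the converter again with different keyword arguments will not reuse the cached result. A small sketch of why that follows from lru_cache is below. The convert function is hypothetical and only borrows the NoImplicit keyword name from the discussion.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def convert(group, **kwargs):
    """Hypothetical converter; returns a fresh object per real conversion."""
    return object()


first = convert("ag")
again = convert("ag")                     # same arguments: cache hit
other = convert("ag", NoImplicit=False)   # different arguments: new entry

info = convert.cache_info()
print(first is again, first is other)
print(info.hits, info.misses)
```

Keyword arguments are part of the cache key, so this only works once convert_to actually forwards them to the converter.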

testsuite/MDAnalysisTests/coordinates/test_rdkit.py

orbeckst · 2021-04-23T19:42:10Z

@IAlibay sorry, don't have time today to review — I'll leave it to you.

IAlibay · 2021-04-23T21:38:16Z

I think this PR is complete, but I want to hold off on merging before we have a clearer idea of what's going on with #2882.

IAlibay

I don't know why I completely forgot 🙀 We need a changelog entry (and can you also add in an entry for #2926?).

cbouy · 2021-04-24T16:30:42Z

I'm adding the missing changelog now. For the changelog of this PR though, do I mention the fixes/changes I made, or just the enhancements (set_converter_cache_size(maxsize) and the force parameter) ? The RDKit converter isn't released yet so it's a bit weird fixing something that isn't officially out...

orbeckst · 2021-04-24T17:16:07Z

If there was an issue for the fixes then I'd still add a CHANGELOG entry even though it's a bit weird. However, for people living on the edge (using develop) it's still helpful.


IAlibay · 2021-05-10T15:20:29Z

> I'm adding the missing changelog now. For the changelog of this PR though, do I mention the fixes/changes I made, or just the enhancements (set_converter_cache_size(maxsize) and the force parameter) ? The RDKit converter isn't released yet so it's a bit weird fixing something that isn't officially out...

Sorry for the delayed response here @cbouy.

I'd add the following entries:

Enhancements:

Aromaticity and charge guessers using the RDKit converter (PR #2926)

Changes:

force parameter (PR #2942)

Fixes:

New cache system to fix Issue #2958

edit: once #2882 is done if you can add these and then update against develop I'll merge this.

IAlibay · 2021-05-10T20:28:17Z

RDKIT crashes are starting to happen too frequently for py3.6 + numpy 1.16 (see: #3287), I'm not sure if this is somehow linked to the new converter API, so I've updated this PR against the current develop to see if it fixes things.

@cbouy please do double check that I've not accidentally broken things!

edit: best way to check that this is fixed is just by re-running CI I guess -- number of successful CI runs: 3 (that should be enough)

cbouy · 2021-05-10T21:00:08Z

IAlibay

Thanks @cbouy lgtm!

orbeckst reviewed Sep 23, 2020

IAlibay requested changes Sep 24, 2020

package/MDAnalysis/coordinates/RDKit.py (resolved)

package/MDAnalysis/coordinates/RDKit.py (outdated, resolved)

package/MDAnalysis/coordinates/RDKit.py (resolved)

testsuite/MDAnalysisTests/coordinates/test_rdkit.py (resolved)

IAlibay requested changes Oct 31, 2020

testsuite/MDAnalysisTests/coordinates/test_rdkit.py (outdated, resolved)

testsuite/MDAnalysisTests/coordinates/test_rdkit.py (resolved)

package/MDAnalysis/coordinates/RDKit.py (outdated, resolved)

IAlibay added this to the 2.0 milestone Mar 14, 2021

cbouy force-pushed the rdkitcache branch from 5b13cd0 to bc3bb86 Compare April 10, 2021 13:03

Cédric Bouysset and others added 7 commits April 10, 2021 15:29

use lru_cache

a046ca0

Update package/MDAnalysis/coordinates/RDKit.py

550f7e1

Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>

add tests and docs

a59ee25

add force parameter + review tests

7614e07

pep8

9be1906

increase coverage

2283a18

bump

06b1181

cbouy force-pushed the rdkitcache branch from bc3bb86 to 06b1181 Compare April 10, 2021 13:36

This was referenced Apr 19, 2021

Erratic .convert_to("RDKIT") behaviour #3235

Closed

RDKIT tests sometimes fails #2958

Closed

IAlibay approved these changes Apr 22, 2021

IAlibay mentioned this pull request Apr 23, 2021

Aromaticity and Gasteiger charges guessers #2926

Merged

4 tasks

IAlibay requested changes Apr 23, 2021

Merge branch 'develop' into rdkitcache

c090c45

IAlibay mentioned this pull request May 10, 2021

Use Results class for WaterBridge analysis #3287

Merged

7 tasks

cbouy added 4 commits May 11, 2021 10:59

changelog update

55ae3d3

Merge branch 'rdkitcache' of github.com:cbouy/mdanalysis into rdkitcache

6c38a7c

Merge branch 'develop' into rdkitcache

d0378b3

pass kwargs to the convert_to method

2aceaf8

IAlibay approved these changes May 11, 2021

IAlibay merged commit 6d5ef34 into MDAnalysis:develop May 11, 2021

fiona-naughton added enhancement Component-Converters labels Sep 26, 2023
