Skip to content

Add SMARTS selection#2883

Merged
IAlibay merged 11 commits intoMDAnalysis:developfrom
cbouy:smarts
Aug 29, 2020
Merged

Add SMARTS selection#2883
IAlibay merged 11 commits intoMDAnalysis:developfrom
cbouy:smarts

Conversation

@cbouy
Copy link
Member

@cbouy cbouy commented Jul 31, 2020

Part of #2468
Depends on #2775

Changes made in this Pull Request:

  • Adds atom selection based on SMARTS queries: u.select_atoms("[C;R;!a]", smarts=True)

An example with Proline:
image

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@richardjgowers @IAlibay @fiona-naughton

@pep8speaks
Copy link

pep8speaks commented Jul 31, 2020

Hello @cbouy! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-26 16:53:09 UTC

Copy link
Member

@richardjgowers richardjgowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to have a smarts keyword instead of forcing one type or another of selection? Eg "smarts [#7] and not resname ASP"

@cbouy
Copy link
Member Author

cbouy commented Jul 31, 2020

Since some SMARTS have parenthesis in them I'd need to find a way to bypass the "split original selection string on spaces and parenthesis" thing, but should be doable

@cbouy
Copy link
Member Author

cbouy commented Jul 31, 2020

Done!

@richardjgowers
Copy link
Member

This is pretty awesome btw :)

if not pattern:
raise ValueError("{!r} is not a valid SMARTS query".format(
self.pattern))
mol = group.convert_to("RDKIT")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be nice to try and cache this in Topology.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mol (topology, not coordinates) is already cached in the converter so that we don't need to rebuild it for every frame of a trajectory:
https://github.com/cbouy/mdanalysis/blob/rdkit-converter/package/MDAnalysis/coordinates/RDKit.py#L253-L264

mol = group.convert_to("RDKIT")
matches = mol.GetSubstructMatches(pattern, useChirality=True)
indices = [idx for match in matches for idx in match]
indices = [mol.GetAtomWithIdx(i).GetIntProp("_MDAnalysis_index")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not predict what idx RDKit will assign? Ie do the atoms sometimes get reordered?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They get reordered when using reactions to "standardize" some functional groups like nitro, sulfone...etc into the correct form. I could also force reordering atoms at the end of the conversion but that requires building a new molecule again I think.

token = 'smarts'

def __init__(self, parser, tokens):
pattern = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it simpler to enforce no white space allowed in the pattern?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

White spaces are added automatically by the parser around parentheses:
https://github.com/cbouy/mdanalysis/blob/rdkit-converter/package/MDAnalysis/core/selection.py#L1182

@jbarnoud
Copy link
Contributor

jbarnoud commented Aug 5, 2020

That is a super cool feature!

I did not read the code so maybe it is obvious and super well documented, but I have 2 questions:

  • what happens if I do not have RDKit installed?
  • what happens if the SMART string matches in several places? What is these places overlap?

@cbouy
Copy link
Member Author

cbouy commented Aug 5, 2020

what happens if I do not have RDKit installed?

It raises an ImportError with a message that you need RDKit to use the SMARTS selection

what happens if the SMART string matches in several places? What if these places overlap?

You get a flat and "uniqued" AtomGroup with all atoms matching the query.

For example, with the molecule C(C=O)(C=C)C:
image

RDKit returns a tuple of matches for the query, i.e. if you query C-C it will return ((0, 1), (0, 3), (0, 5))
u.select_atoms("smarts C-C").indices will return array([0, 1, 3, 5])

There's a limit of 1000 matches (the default in RDKit).

Maybe it could be useful to add a kwarg to select_atoms so that it yields each match as a separate AtomGroup (i.e. [0, 1], then [0, 3]...etc.) ?

@jbarnoud
Copy link
Contributor

jbarnoud commented Aug 5, 2020

Maybe it could be useful to add a kwarg to select_atoms so that it yields each match as a separate AtomGroup (i.e. [0, 1], then [0, 3]...etc.) ?

For API consistency's sake, it would be better to have an other method to do this. I think it would be very useful, but it is clearly for later.

@cbouy
Copy link
Member Author

cbouy commented Aug 5, 2020

I agree that it could get confusing, a separate method is probably better

for ag in u.smarts_matches("C-C"):
    print(ag.indices)

@orbeckst orbeckst added the GSoC GSoC project label Aug 7, 2020
Copy link
Member

@richardjgowers richardjgowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code is fine, quick doc question

Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of things, but otherwise lgtm :)

Copy link
Contributor

@fiona-naughton fiona-naughton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one comment from me, otherwise looking good!

Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming CI returns green lgtm!

pinging @fiona-naughton and @richardjgowers for re-review, we can probably get this merged today/tomorrow.

@codecov
Copy link

codecov bot commented Aug 26, 2020

Codecov Report

Merging #2883 into develop will increase coverage by 0.10%.
The diff coverage is 92.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2883      +/-   ##
===========================================
+ Coverage    92.91%   93.01%   +0.10%     
===========================================
  Files          187      187              
  Lines        24591    24963     +372     
  Branches      3185     3263      +78     
===========================================
+ Hits         22848    23219     +371     
+ Misses        1697     1696       -1     
- Partials        46       48       +2     
Impacted Files Coverage Δ
package/MDAnalysis/core/groups.py 98.58% <ø> (-0.01%) ⬇️
package/MDAnalysis/core/selection.py 99.24% <92.00%> (-0.26%) ⬇️
package/MDAnalysis/coordinates/RDKit.py 97.31% <0.00%> (-2.69%) ⬇️
__init__.py 91.89% <0.00%> (-0.61%) ⬇️
package/MDAnalysis/__init__.py 91.89% <0.00%> (-0.61%) ⬇️
package/MDAnalysis/core/universe.py 97.45% <0.00%> (+0.25%) ⬆️
package/MDAnalysis/core/topologyattrs.py 96.49% <0.00%> (+0.31%) ⬆️
package/MDAnalysis/coordinates/base.py 95.12% <0.00%> (+0.48%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 23074c6...1d2bb91. Read the comment docs.

@IAlibay IAlibay merged commit b2b7bcb into MDAnalysis:develop Aug 29, 2020
PicoCentauri pushed a commit to PicoCentauri/mdanalysis that referenced this pull request Mar 30, 2021
Towards MDAnalysis#2468 

## Work done in this PR
Adds a new SMARTS based selection which uses the RDKit converter.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants