Skip to content

Providing Elements attribute to parsers (#2553)#2627

Merged
IAlibay merged 15 commits intoMDAnalysis:developfrom
AmeyaHarmalkar:test-develop
Mar 30, 2020
Merged

Providing Elements attribute to parsers (#2553)#2627
IAlibay merged 15 commits intoMDAnalysis:developfrom
AmeyaHarmalkar:test-develop

Conversation

@AmeyaHarmalkar
Copy link
Contributor

@AmeyaHarmalkar AmeyaHarmalkar commented Mar 14, 2020

Fixes #2553

Changes made in this Pull Request:

  • Added the elements attributes to the PDBParser.py
  • Added test cases with
    1. PDB that is correct and provides the elements attributes correctly (pdb added to testsuites)
    2. A PDB file with missing element types that prints approriate warnings. (pdb added to testsuites)
  • Added the test check for guessed and expected attributes.
  • Added necessary warning statements for the user.

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@AmeyaHarmalkar
Copy link
Contributor Author

@lilyminium @richardjgowers (sorry for so many pings, just bringing your attention to the alternative Pull Request for issue #2553)

@IAlibay
Copy link
Member

IAlibay commented Mar 15, 2020

@AmeyaHarmalkar this PR seems to include changes for both the PDBParser and TPRParser.
Since you are aiming to fix the PDBParser here, your branch should ideally only have changes associated with that.

@AmeyaHarmalkar
Copy link
Contributor Author

@IAlibay I committed to this branch by mistake(I switched the branch and used a different PR for that). However, I used git reset --hard on this branch to reset it to the parent. Is there any other way I should have done it?

@IAlibay
Copy link
Member

IAlibay commented Mar 15, 2020

@AmeyaHarmalkar that seems to have done it 👍

@AmeyaHarmalkar
Copy link
Contributor Author

I actually forced it on the head just now. Made a rookie mistake. Sorry about that!

Copy link
Member

@lilyminium lilyminium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @AmeyaHarmalkar, thanks for separating your changes.

# Guessed attributes
# Need to pull elements from Atom names
# Similar to the check for atomtypes function
if any(elements):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any returns True if anything in elements is truthy. This neither guarantees that all the elements are found, or that any are missing -- it only returns False when all of them are missing. Do you mean if not all?

>>> any(['N', '', 'O'])
True
>>> any(['N', 'H', 'O'])
True
>>> any(['', '', ''])
False

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic was if any of the element identifier exists, then get element information, else guess based on atom names, and report individual missing elements. However, if the entire row is missing, then the element identifier does not exist, and hence, we would be guessing completely based on the atom names. The subset of missing element identifiers can vary, hence I wanted to prompt a warning only in the cases when it was missing, hence I did a if any instead of a if not all. Because only specific elements are being guessed instead of the entire atom set in the PDB. In your opinion should I change this logic?

Comment on lines +316 to +319
for i,e in enumerate(elements):
if e == '':
elements[i] = guess_atom_element(names[i])
warnings.warn("Element record found to be non-physical. Guessing element from atom name: {}".format(names[i]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looping over the entire list instead of only the unique elements is slow, and means that the user will get a lot of warnings. You might find it easier to convert to a numpy array, and use np.unique(el_array, return_inverse=True).

elif len(name) <= 2:
name = ''
return name
name = "00" # probably element is on left not right
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this line is here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept '00' to indicate a missing element. From the discussion, the third case was that if we detect a missing element. Should I return a None?

name = "00" # probably element is on left not right

# if it's numbers
return no_symbols
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember why this is here, but from #2553 (comment) I think this function should either return a valid element symbol, or an empty string.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line will never execute now, so you can remove it.


u = mda.Universe(PDB_elements, format='PDB')
element_list = ['N', 'C', 'C', 'O', 'C', 'C', 'O', 'N', 'H', 'H', 'H',
'H','H','H', 'H', 'H', 'CU', 'FE', 'C', 'MG']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the elements all be upper case like that? Genuine question.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there was an initial suggestion in the code to compare everything in Upper case. I was unsure whether it was for some alternative functionality. So, I preferred to avoid the case change.

def test_PDB_elements():
from MDAnalysis.topology import tables

u = mda.Universe(PDB_elements, format='PDB')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really not have a PDB file on hand with elements? (I haven't looked.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have one with elements (metals.pdb for eg but it contains all hetatoms), but I took @richardjgowers opinion and created an alternative one, which contains an amino acid and some arbitrary metals. I can redirect it to some other pdb if you think that's a better alternative?

@orbeckst
Copy link
Member

@AmeyaHarmalkar – is there an existing issue that this PR and PR #2628 are addressing? If so, please reference it.

If there isn't an issue yet then it would be great if you raised one in the issue tracker and then reference your new issue. In this way we have a much better record of problems and the problem definition is neatly separated from the solution.

@AmeyaHarmalkar AmeyaHarmalkar changed the title Providing Elements attribute to parsers (#2553) Providing Elements attribute to parsers fixes #2553 Mar 20, 2020
@AmeyaHarmalkar AmeyaHarmalkar changed the title Providing Elements attribute to parsers fixes #2553 Providing Elements attribute to parsers (#2553) Mar 20, 2020
@jbarnoud
Copy link
Contributor

I don't have the time to work on this but I just wanted to point to #2442 where I started solving the issue. The discussion may be of interest.

@AmeyaHarmalkar
Copy link
Contributor Author

@lilyminium I made the changes you suggested. It would be really helpful if you could have a look and let me know if I need to make any further changes.
Thanks!

Copy link
Member

@lilyminium lilyminium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these changes @AmeyaHarmalkar -- right now I'm not sure there's a consensus on what to do for intermittently missing or invalid elements.

# Need to pull elements from Atom names
# Similar to the check for atomtypes function
if not any(elements):
warnings.warn("Element information missing. Guessing elements from atom names : {}".format(names))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PDB files can have up to 10k atoms and are human-readable. I don't think including the atom names in the warning message will be a net benefit, because users can just open the file up themselves to look at the names.

elements = [ guess_atom_element(names[k]) if k in indexlist else elements[k] for k in range(len(elements)) ]
attrs.append(Elements(elements, guessed=True))
else:
attrs.append(Elements(elements, guessed=True))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in no scenario is elements not guessed -- but I would think that the file column counts as un-guessed information, so long as the column elements are valid. If there is even one element provided, I would just check to see if it is valid (if not, turn it into the empty string '') and then not try to guess the others.

I think @jbarnoud and @richardjgowers had different opinions on that in #2442 if they want to chime in. And @zemanj has opinions on guessing in general -- what do you think about trying to validate read information?

return name

name = name[:-1]
# probably element is on left not right
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this comment back to the line it explains? (name = name[:-1])

name = "00" # probably element is on left not right

# if it's numbers
return no_symbols
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line will never execute now, so you can remove it.

"""
u = mda.Universe(StringIO(PDB_elements), format='PDB')
element_list = np.array(['N', 'C', 'C', 'O', 'C', 'C', 'O', 'N', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'CU', 'FE', 'MG', 'S', 'O', 'C', 'C', 'S', 'O', 'C', 'C'], dtype=object)
assert np.all(u.atoms.elements == element_list)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use assert_equal for this. The user guide has more tips on writing tests.

u = mda.Universe(StringIO(PDB_missing_ele), format='PDB')

assert len(record) == 1
assert record[0].message.args[0] == "Element information missing. Guessing elements from atom names : ['N', 'CA', 'C', 'O', 'CU', 'FE', 'Mg']"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, please break the message over multiple lines to follow PEP8. You can also test to check that the guessed elements are what you expect here.

u = mda.Universe(StringIO(PDB_wrong_ele), format='PDB')

assert len(record) == 2
assert record[1].message.args[0] == "Element record found to be non-physical. Guessing element from atom name: ['X', 'FE']"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, please follow PEP8. What's the first warning that you're expecting here?

"Element record found to be non-physical.

I don't see any non-physical elements in PDB_wrong_ele, just missing ones. You may need to reconsider how you guess the elements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ATOM 5 was named X, that was a non-physical element which is returned as a string.

# Checking if missing atoms are filled with an empty string.

element_list = np.array(['N', 'C', 'C', 'O', '', 'CU', 'FE', 'MG'], dtype=object)
assert np.all(u.atoms.elements == element_list)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, please use assert_equal.

Comment on lines +322 to +326
if any(np.isin(uni_elements, '')):
indexlist = [ j for j in range(len(indices)) if indices[j] == 0 ]
# This would be 0 always as guess_types would return a set
# of alphabets and empty string, and empty string will be
# sorted to be a index 0.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit convoluted. You're right that the empty string will always be sorted to be first, so you could just check if uni_elements[0] == ''. However what I had in mind was more checking the unique atom names. So you can get a masking array of missing = elements=='', use that to choose the names you want to look at with names_to_guess_from = names[missing], and apply np.unique to that array. (note: you may need to convert elements and names into numpy arrays first.)

@IAlibay
Copy link
Member

IAlibay commented Mar 21, 2020

Thanks for these changes @AmeyaHarmalkar -- right now I'm not sure there's a consensus on what to do for intermittently missing or invalid elements.

With the idea of getting this done on time for the GSOC deadline, maybe this PR should just aim to add the elements if they exist in the file, and if they don't warn the user and not populate elements. If needed, further changes could then be tackled in a separate PR (most likely based on #2630). Thoughts on this @lilyminium?

@lilyminium
Copy link
Member

@IAlibay sounds good. Just to be clear, you're suggesting that Elements only be added if all of the symbols are present and valid, and no guessing happen based on names?

Alternatively, @AmeyaHarmalkar, you could focus on #2628 since the TPR format is more straightforward.

@AmeyaHarmalkar
Copy link
Contributor Author

Sounds great! @lilyminium Should I finish this PR with the suggestions and changes you requested, merge it and then work on TPR next?

@zemanj
Copy link
Member

zemanj commented Mar 22, 2020

@IAlibay sounds good. Just to be clear, you're suggesting that Elements only be added if all of the symbols are present and valid, and no guessing happen based on names?

Very conservative, won't break anything - seems legit!

I think @jbarnoud and @richardjgowers had different opinions on that in #2442 if they want to chime in. And @zemanj has opinions on guessing in general -- what do you think about trying to validate read information?

I never though about that before. I think that (optional?) data validation (comparing with guessed data?) would be a handy feature! For this PR, though, I'd simply go with what @IAlibay said:

  • If all atoms in the PDB file have elements defined AND all these elements are known elements -> use them
  • Otherwise: discard all element info, throw warning, continue

@IAlibay
Copy link
Member

IAlibay commented Mar 22, 2020

@IAlibay sounds good. Just to be clear, you're suggesting that Elements only be added if all of the symbols are present and valid, and no guessing happen based on names?

Yeah exactly. It's possible that I misread your comment #2627 (review), but I thought this might be the fastest solution here.

I'm not super fond of my own suggestion, since it technically amounts to asking @AmeyaHarmalkar to throw away code they wrote. However, considering there's 9 days left on the GSOC submission deadline, and at least to me adding elements on pdbv3 complient files is already a good PR, I thought it might be reasonable to give @AmeyaHarmalkar an option to get this done without having to wait for consensus.

Alternatively, @AmeyaHarmalkar, you could focus on #2628 since the TPR format is more straightforward.

So I think we might face the same issue about intermittently missing or invalid elements in this one too. I was inspecting a few tprs yesterday and realised that in my protein-ligand complexes, my ligands all had -1 atomic numbers whilst my proteins were fine (some parmed conversion thing it seems) 😅

@lilyminium
Copy link
Member

@AmeyaHarmalkar yes I think it would be a good idea to finish off this PR following @IAlibay's and @zemanj's suggestions for element reading for now, just to get it in before the GSOC deadline :-)

…essing. Implementing comments and editing test cases.
Copy link
Member

@lilyminium lilyminium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @AmeyaHarmalkar, getting there

# Similar to the check for atomtypes function
if not any(elements):
warnings.warn("Element information absent.")
attrs.append(Elements(elements, guessed=True))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

guessed=True

But you are not guessing these. In addition, I interpreted @IAlibay's suggestion to just not add the Elements if they are not all present and valid.

# if needed.
indexlist = [ j for j in range(len(indices)) if missing[j] == True]
warnings.warn("Element record found to be either non-physical or missing for some elements.")
attrs.append(Elements(elements, guessed=True))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, I would simply check that all the read elements are valid, and only add them if so.

element_list = np.array([
'N', 'C', 'C', 'O', 'C', 'C', 'O', 'N', 'H',
'H', 'H', 'H', 'H', 'H', 'H', 'H', 'CU', 'FE',
'MG', 'S', 'O', 'C', 'C', 'S', 'O', 'C', 'C'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I think the elements should follow the capitalisation of the periodic table: 'Cu', 'Fe', etc.

# Checking if missing atoms are filled with an empty string.

element_list = np.array(['N', 'C', 'C', 'O', '', 'CU', '', 'MG'
], dtype=object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, I think we are simply not adding Elements if even a single one is missing.

'resnames', 'altLocs', 'icodes', 'occupancies',
'bonds', 'tempfactors', 'chainIDs']
guessed_attrs = ['types', 'masses']
guessed_attrs = ['types', 'masses','elements']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may have to remove elements from guessed_attrs if they are not all present in the files for the tests that subclass PDBBase.

@lilyminium
Copy link
Member

@IAlibay would you be able help out with this PR, as the GSOC deadline is coming up quite soon? @AmeyaHarmalkar is really close to being done and I think you know this issue very well. I won't have as much time to spend on MDAnalysis in the next few days due to dealing with departmental shutdown. Of course I'm happy to keep up with this if you have too much on your plate already (global pandemic).

@AmeyaHarmalkar
Copy link
Contributor Author

@IAlibay pushed the changes. I made the changes in the AUTHORS and CHANGELOG, but there is some issue which I am not sure of.

@IAlibay
Copy link
Member

IAlibay commented Mar 28, 2020

@IAlibay pushed the changes. I made the changes in the AUTHORS and CHANGELOG, but there is some issue which I am not sure of.

By issues, do you mean the merge conflicts?

Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two more changes then I think that this is done as long as travis comes back green.

@lilyminium @jbarnoud any (hopefully final) changes on your end?

# Getting element information from element column.
periodic_table = set(SYMB2Z)
if all(elements):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure from the thumbs up if you wanted to fix this one but didn't get around to in your latest commit?

@lilyminium
Copy link
Member

Looks good, thanks for the fix @AmeyaHarmalkar and for taking over @IAlibay! Travis won't run while there are merge conflicts, I think. It's just with AUTHORS and CHANGELOG so you could just resolve on GitHub if you had to

@AmeyaHarmalkar
Copy link
Contributor Author

I don't have right access, so I am unable to resolve the conflicts in AUTHORS and CHANGELOG. Let me know if the changes look good so far, and it would be really helpful if someone could help me with resolving the conflicts. Thanks @lilyminium and @IAlibay!

@lilyminium
Copy link
Member

Ok @AmeyaHarmalkar I fixed the merge conflicts on this end, now if you want to make changes you'll need to git pull origin test-develop to pull the changes into your local repo.

@AmeyaHarmalkar
Copy link
Contributor Author

Thanks @lilyminium!
The codecov/project check failed. Everything else seems alright. Any further suggestions to work on the code @IAlibay?


# Getting element information from element column.
if all(elements):
elements = [i.capitalize() for i in elements]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per @lilyminium's comment: #2627 (comment)

Doing a set comparison for elements here would be faster. So:

if all(elements):
    elem_set = set(i.capitalize() for i in i in set(elements))
    if all(element in SYMB2Z for element in elem_set):
        attrs.append(Elements(np.array([i.capitalize() for i in elements], dtype=object))
    else:
        .....

Note: I noticed that every other attrs being appended are actually being converted to numpy arrays, so we probably should be consistent here (unless there's a reason not to?).

# Getting element information from element column.
periodic_table = set(SYMB2Z)
if all(elements):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL set creation is way more expensive than I realised 🙃

Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two more changes and I think you're done.

* Added missing selection module to leaflet.py (Issue #2612)

Enhancements
* Added elements attribute to PDBParser (Issue #2553)
Copy link
Member

@IAlibay IAlibay Mar 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should also add issue #2647 here too.

@IAlibay
Copy link
Member

IAlibay commented Mar 29, 2020

@AmeyaHarmalkar I just got a notification with something along the lines of "I git pulled on my local repo and I can't import the mda universe due a module error. It has issues with importing the tqdm package (it was one of the newly merged changes I believe)."

For some reason the comment isn't showing up here, did you fix the issue?

@AmeyaHarmalkar
Copy link
Contributor Author

Yeah @IAlibay it was a rookie mistake. I figured it out. No worries!

Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry feels like I'm nitpicking, but you accidentally removed a couple of spaces in the changelog, so I tagged on a couple of extra cosmetic changes, they shouldn't take much to fix.
After this, assuming that Travis is green, I'm pretty sure this is good to go.

if all(elements):
element_set = set(i.capitalize() for i in set(elements))
if all( element in SYMB2Z for element in element_set):
attrs.append(Elements(np.array([i.capitalize() for i in elements],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 you're exceeding the character-per-line limit here.
There's no great solution here because either way flake8 tends to complain, but I'd just align things with the first bracket something like:

attrs.append(Elements(np.array(
             [i.capitalize() for i in elements],
             dtype=object)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually solve such situations with intermediate variables, e.g.

            if all(element in SYMB2Z for element in element_set):
                element_list = [i.capitalize() for i in elements]
                attrs.append(Elements(np.array(element_list, dtype=object)))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could make that change. I thought probably we are saving one line of code, but can definitely make the change if needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AmeyaHarmalkar probably in this case, adding one line for a bit more readability is worth it. If you feel up to making the change please do.

Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming Travis goes green, LGTM!

"elements attributes will not be populated.")
else:
warnings.warn("Element information is absent or missing for a few "
"residues. Elements attributes will not be populated.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AmeyaHarmalkar if you are fixing @zemanj's comment: #2627 (comment) then I probably would also change a few residues here to some atoms (since elements are atom-wise).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing. Updating and pushing through

Copy link
Member

@lilyminium lilyminium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks @AmeyaHarmalkar! @IAlibay do you want to merge?

@IAlibay IAlibay merged commit faebd3e into MDAnalysis:develop Mar 30, 2020
@IAlibay
Copy link
Member

IAlibay commented Mar 30, 2020

Thanks so much for all your work @AmeyaHarmalkar and congrats on your first merged MDA PR 🎉

@AmeyaHarmalkar
Copy link
Contributor Author

Thank you @IAlibay and @lilyminium for all your help! This was a great learning experience!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parsers should try and provide an Elements attribute

8 participants