Providing Elements attribute to parsers (#2553)#2627
Providing Elements attribute to parsers (#2553)#2627IAlibay merged 15 commits intoMDAnalysis:developfrom
Conversation
|
@lilyminium @richardjgowers (sorry for so many pings, just bringing your attention to the alternative Pull Request for issue #2553) |
|
@AmeyaHarmalkar this PR seems to include changes for both the PDBParser and TPRParser. |
|
@IAlibay I committed to this branch by mistake(I switched the branch and used a different PR for that). However, I used git reset --hard on this branch to reset it to the parent. Is there any other way I should have done it? |
53d5da5 to
c099c99
Compare
|
@AmeyaHarmalkar that seems to have done it 👍 |
|
I actually forced it on the head just now. Made a rookie mistake. Sorry about that! |
lilyminium
left a comment
There was a problem hiding this comment.
Hi @AmeyaHarmalkar, thanks for separating your changes.
| # Guessed attributes | ||
| # Need to pull elements from Atom names | ||
| # Similar to the check for atomtypes function | ||
| if any(elements): |
There was a problem hiding this comment.
any returns True if anything in elements is truthy. This neither guarantees that all the elements are found, or that any are missing -- it only returns False when all of them are missing. Do you mean if not all?
>>> any(['N', '', 'O'])
True
>>> any(['N', 'H', 'O'])
True
>>> any(['', '', ''])
FalseThere was a problem hiding this comment.
The logic was if any of the element identifier exists, then get element information, else guess based on atom names, and report individual missing elements. However, if the entire row is missing, then the element identifier does not exist, and hence, we would be guessing completely based on the atom names. The subset of missing element identifiers can vary, hence I wanted to prompt a warning only in the cases when it was missing, hence I did a if any instead of a if not all. Because only specific elements are being guessed instead of the entire atom set in the PDB. In your opinion should I change this logic?
| for i,e in enumerate(elements): | ||
| if e == '': | ||
| elements[i] = guess_atom_element(names[i]) | ||
| warnings.warn("Element record found to be non-physical. Guessing element from atom name: {}".format(names[i])) |
There was a problem hiding this comment.
Looping over the entire list instead of only the unique elements is slow, and means that the user will get a lot of warnings. You might find it easier to convert to a numpy array, and use np.unique(el_array, return_inverse=True).
| elif len(name) <= 2: | ||
| name = '' | ||
| return name | ||
| name = "00" # probably element is on left not right |
There was a problem hiding this comment.
I don't understand why this line is here?
There was a problem hiding this comment.
I kept '00' to indicate a missing element. From the discussion, the third case was that if we detect a missing element. Should I return a None?
| name = "00" # probably element is on left not right | ||
|
|
||
| # if it's numbers | ||
| return no_symbols |
There was a problem hiding this comment.
I don't remember why this is here, but from #2553 (comment) I think this function should either return a valid element symbol, or an empty string.
There was a problem hiding this comment.
This line will never execute now, so you can remove it.
|
|
||
| u = mda.Universe(PDB_elements, format='PDB') | ||
| element_list = ['N', 'C', 'C', 'O', 'C', 'C', 'O', 'N', 'H', 'H', 'H', | ||
| 'H','H','H', 'H', 'H', 'CU', 'FE', 'C', 'MG'] |
There was a problem hiding this comment.
Should the elements all be upper case like that? Genuine question.
There was a problem hiding this comment.
Yeah, there was an initial suggestion in the code to compare everything in Upper case. I was unsure whether it was for some alternative functionality. So, I preferred to avoid the case change.
| def test_PDB_elements(): | ||
| from MDAnalysis.topology import tables | ||
|
|
||
| u = mda.Universe(PDB_elements, format='PDB') |
There was a problem hiding this comment.
Do we really not have a PDB file on hand with elements? (I haven't looked.)
There was a problem hiding this comment.
We do have one with elements (metals.pdb for eg but it contains all hetatoms), but I took @richardjgowers opinion and created an alternative one, which contains an amino acid and some arbitrary metals. I can redirect it to some other pdb if you think that's a better alternative?
|
@AmeyaHarmalkar – is there an existing issue that this PR and PR #2628 are addressing? If so, please reference it. If there isn't an issue yet then it would be great if you raised one in the issue tracker and then reference your new issue. In this way we have a much better record of problems and the problem definition is neatly separated from the solution. |
|
I don't have the time to work on this but I just wanted to point to #2442 where I started solving the issue. The discussion may be of interest. |
…s from testsuite/MDAnalysisTests/data
|
@lilyminium I made the changes you suggested. It would be really helpful if you could have a look and let me know if I need to make any further changes. |
lilyminium
left a comment
There was a problem hiding this comment.
Thanks for these changes @AmeyaHarmalkar -- right now I'm not sure there's a consensus on what to do for intermittently missing or invalid elements.
| # Need to pull elements from Atom names | ||
| # Similar to the check for atomtypes function | ||
| if not any(elements): | ||
| warnings.warn("Element information missing. Guessing elements from atom names : {}".format(names)) |
There was a problem hiding this comment.
PDB files can have up to 10k atoms and are human-readable. I don't think including the atom names in the warning message will be a net benefit, because users can just open the file up themselves to look at the names.
| elements = [ guess_atom_element(names[k]) if k in indexlist else elements[k] for k in range(len(elements)) ] | ||
| attrs.append(Elements(elements, guessed=True)) | ||
| else: | ||
| attrs.append(Elements(elements, guessed=True)) |
There was a problem hiding this comment.
So in no scenario is elements not guessed -- but I would think that the file column counts as un-guessed information, so long as the column elements are valid. If there is even one element provided, I would just check to see if it is valid (if not, turn it into the empty string '') and then not try to guess the others.
I think @jbarnoud and @richardjgowers had different opinions on that in #2442 if they want to chime in. And @zemanj has opinions on guessing in general -- what do you think about trying to validate read information?
| return name | ||
|
|
||
| name = name[:-1] | ||
| # probably element is on left not right |
There was a problem hiding this comment.
Can you move this comment back to the line it explains? (name = name[:-1])
| name = "00" # probably element is on left not right | ||
|
|
||
| # if it's numbers | ||
| return no_symbols |
There was a problem hiding this comment.
This line will never execute now, so you can remove it.
| """ | ||
| u = mda.Universe(StringIO(PDB_elements), format='PDB') | ||
| element_list = np.array(['N', 'C', 'C', 'O', 'C', 'C', 'O', 'N', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'CU', 'FE', 'MG', 'S', 'O', 'C', 'C', 'S', 'O', 'C', 'C'], dtype=object) | ||
| assert np.all(u.atoms.elements == element_list) |
There was a problem hiding this comment.
Please use assert_equal for this. The user guide has more tips on writing tests.
| u = mda.Universe(StringIO(PDB_missing_ele), format='PDB') | ||
|
|
||
| assert len(record) == 1 | ||
| assert record[0].message.args[0] == "Element information missing. Guessing elements from atom names : ['N', 'CA', 'C', 'O', 'CU', 'FE', 'Mg']" |
There was a problem hiding this comment.
Again, please break the message over multiple lines to follow PEP8. You can also test to check that the guessed elements are what you expect here.
| u = mda.Universe(StringIO(PDB_wrong_ele), format='PDB') | ||
|
|
||
| assert len(record) == 2 | ||
| assert record[1].message.args[0] == "Element record found to be non-physical. Guessing element from atom name: ['X', 'FE']" |
There was a problem hiding this comment.
As above, please follow PEP8. What's the first warning that you're expecting here?
"Element record found to be non-physical.
I don't see any non-physical elements in PDB_wrong_ele, just missing ones. You may need to reconsider how you guess the elements.
There was a problem hiding this comment.
ATOM 5 was named X, that was a non-physical element which is returned as a string.
| # Checking if missing atoms are filled with an empty string. | ||
|
|
||
| element_list = np.array(['N', 'C', 'C', 'O', '', 'CU', 'FE', 'MG'], dtype=object) | ||
| assert np.all(u.atoms.elements == element_list) |
There was a problem hiding this comment.
As above, please use assert_equal.
| if any(np.isin(uni_elements, '')): | ||
| indexlist = [ j for j in range(len(indices)) if indices[j] == 0 ] | ||
| # This would be 0 always as guess_types would return a set | ||
| # of alphabets and empty string, and empty string will be | ||
| # sorted to be a index 0. |
There was a problem hiding this comment.
This is a bit convoluted. You're right that the empty string will always be sorted to be first, so you could just check if uni_elements[0] == ''. However what I had in mind was more checking the unique atom names. So you can get a masking array of missing = elements=='', use that to choose the names you want to look at with names_to_guess_from = names[missing], and apply np.unique to that array. (note: you may need to convert elements and names into numpy arrays first.)
With the idea of getting this done on time for the GSOC deadline, maybe this PR should just aim to add the elements if they exist in the file, and if they don't warn the user and not populate elements. If needed, further changes could then be tackled in a separate PR (most likely based on #2630). Thoughts on this @lilyminium? |
|
@IAlibay sounds good. Just to be clear, you're suggesting that Elements only be added if all of the symbols are present and valid, and no guessing happen based on names? Alternatively, @AmeyaHarmalkar, you could focus on #2628 since the TPR format is more straightforward. |
|
Sounds great! @lilyminium Should I finish this PR with the suggestions and changes you requested, merge it and then work on TPR next? |
Very conservative, won't break anything - seems legit!
I never though about that before. I think that (optional?) data validation (comparing with guessed data?) would be a handy feature! For this PR, though, I'd simply go with what @IAlibay said:
|
Yeah exactly. It's possible that I misread your comment #2627 (review), but I thought this might be the fastest solution here. I'm not super fond of my own suggestion, since it technically amounts to asking @AmeyaHarmalkar to throw away code they wrote. However, considering there's 9 days left on the GSOC submission deadline, and at least to me adding elements on pdbv3 complient files is already a good PR, I thought it might be reasonable to give @AmeyaHarmalkar an option to get this done without having to wait for consensus.
So I think we might face the same issue about intermittently missing or invalid elements in this one too. I was inspecting a few tprs yesterday and realised that in my protein-ligand complexes, my ligands all had -1 atomic numbers whilst my proteins were fine (some parmed conversion thing it seems) 😅 |
|
@AmeyaHarmalkar yes I think it would be a good idea to finish off this PR following @IAlibay's and @zemanj's suggestions for element reading for now, just to get it in before the GSOC deadline :-) |
…essing. Implementing comments and editing test cases.
lilyminium
left a comment
There was a problem hiding this comment.
Thanks @AmeyaHarmalkar, getting there
| # Similar to the check for atomtypes function | ||
| if not any(elements): | ||
| warnings.warn("Element information absent.") | ||
| attrs.append(Elements(elements, guessed=True)) |
There was a problem hiding this comment.
guessed=True
But you are not guessing these. In addition, I interpreted @IAlibay's suggestion to just not add the Elements if they are not all present and valid.
| # if needed. | ||
| indexlist = [ j for j in range(len(indices)) if missing[j] == True] | ||
| warnings.warn("Element record found to be either non-physical or missing for some elements.") | ||
| attrs.append(Elements(elements, guessed=True)) |
There was a problem hiding this comment.
Same as above, I would simply check that all the read elements are valid, and only add them if so.
| element_list = np.array([ | ||
| 'N', 'C', 'C', 'O', 'C', 'C', 'O', 'N', 'H', | ||
| 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'CU', 'FE', | ||
| 'MG', 'S', 'O', 'C', 'C', 'S', 'O', 'C', 'C' |
There was a problem hiding this comment.
Again, I think the elements should follow the capitalisation of the periodic table: 'Cu', 'Fe', etc.
| # Checking if missing atoms are filled with an empty string. | ||
|
|
||
| element_list = np.array(['N', 'C', 'C', 'O', '', 'CU', '', 'MG' | ||
| ], dtype=object) |
There was a problem hiding this comment.
As above, I think we are simply not adding Elements if even a single one is missing.
| 'resnames', 'altLocs', 'icodes', 'occupancies', | ||
| 'bonds', 'tempfactors', 'chainIDs'] | ||
| guessed_attrs = ['types', 'masses'] | ||
| guessed_attrs = ['types', 'masses','elements'] |
There was a problem hiding this comment.
You may have to remove elements from guessed_attrs if they are not all present in the files for the tests that subclass PDBBase.
|
@IAlibay would you be able help out with this PR, as the GSOC deadline is coming up quite soon? @AmeyaHarmalkar is really close to being done and I think you know this issue very well. I won't have as much time to spend on MDAnalysis in the next few days due to dealing with departmental shutdown. Of course I'm happy to keep up with this if you have too much on your plate already (global pandemic). |
|
@IAlibay pushed the changes. I made the changes in the AUTHORS and CHANGELOG, but there is some issue which I am not sure of. |
By issues, do you mean the merge conflicts? |
IAlibay
left a comment
There was a problem hiding this comment.
Two more changes then I think that this is done as long as travis comes back green.
@lilyminium @jbarnoud any (hopefully final) changes on your end?
| # Getting element information from element column. | ||
| periodic_table = set(SYMB2Z) | ||
| if all(elements): | ||
|
|
There was a problem hiding this comment.
Not sure from the thumbs up if you wanted to fix this one but didn't get around to in your latest commit?
|
Looks good, thanks for the fix @AmeyaHarmalkar and for taking over @IAlibay! Travis won't run while there are merge conflicts, I think. It's just with AUTHORS and CHANGELOG so you could just resolve on GitHub if you had to |
|
I don't have right access, so I am unable to resolve the conflicts in AUTHORS and CHANGELOG. Let me know if the changes look good so far, and it would be really helpful if someone could help me with resolving the conflicts. Thanks @lilyminium and @IAlibay! |
|
Ok @AmeyaHarmalkar I fixed the merge conflicts on this end, now if you want to make changes you'll need to |
|
Thanks @lilyminium! |
|
|
||
| # Getting element information from element column. | ||
| if all(elements): | ||
| elements = [i.capitalize() for i in elements] |
There was a problem hiding this comment.
As per @lilyminium's comment: #2627 (comment)
Doing a set comparison for elements here would be faster. So:
if all(elements):
elem_set = set(i.capitalize() for i in i in set(elements))
if all(element in SYMB2Z for element in elem_set):
attrs.append(Elements(np.array([i.capitalize() for i in elements], dtype=object))
else:
.....Note: I noticed that every other attrs being appended are actually being converted to numpy arrays, so we probably should be consistent here (unless there's a reason not to?).
| # Getting element information from element column. | ||
| periodic_table = set(SYMB2Z) | ||
| if all(elements): | ||
|
|
There was a problem hiding this comment.
TIL set creation is way more expensive than I realised 🙃
IAlibay
left a comment
There was a problem hiding this comment.
Two more changes and I think you're done.
package/CHANGELOG
Outdated
| * Added missing selection module to leaflet.py (Issue #2612) | ||
|
|
||
| Enhancements | ||
| * Added elements attribute to PDBParser (Issue #2553) |
|
@AmeyaHarmalkar I just got a notification with something along the lines of "I git pulled on my local repo and I can't import the mda universe due a module error. It has issues with importing the tqdm package (it was one of the newly merged changes I believe)." For some reason the comment isn't showing up here, did you fix the issue? |
|
Yeah @IAlibay it was a rookie mistake. I figured it out. No worries! |
IAlibay
left a comment
There was a problem hiding this comment.
Sorry feels like I'm nitpicking, but you accidentally removed a couple of spaces in the changelog, so I tagged on a couple of extra cosmetic changes, they shouldn't take much to fix.
After this, assuming that Travis is green, I'm pretty sure this is good to go.
| if all(elements): | ||
| element_set = set(i.capitalize() for i in set(elements)) | ||
| if all( element in SYMB2Z for element in element_set): | ||
| attrs.append(Elements(np.array([i.capitalize() for i in elements], |
There was a problem hiding this comment.
PEP8 you're exceeding the character-per-line limit here.
There's no great solution here because either way flake8 tends to complain, but I'd just align things with the first bracket something like:
attrs.append(Elements(np.array(
[i.capitalize() for i in elements],
dtype=object)There was a problem hiding this comment.
I usually solve such situations with intermediate variables, e.g.
if all(element in SYMB2Z for element in element_set):
element_list = [i.capitalize() for i in elements]
attrs.append(Elements(np.array(element_list, dtype=object)))There was a problem hiding this comment.
I could make that change. I thought probably we are saving one line of code, but can definitely make the change if needed.
There was a problem hiding this comment.
@AmeyaHarmalkar probably in this case, adding one line for a bit more readability is worth it. If you feel up to making the change please do.
IAlibay
left a comment
There was a problem hiding this comment.
Assuming Travis goes green, LGTM!
| "elements attributes will not be populated.") | ||
| else: | ||
| warnings.warn("Element information is absent or missing for a few " | ||
| "residues. Elements attributes will not be populated.") |
There was a problem hiding this comment.
@AmeyaHarmalkar if you are fixing @zemanj's comment: #2627 (comment) then I probably would also change a few residues here to some atoms (since elements are atom-wise).
There was a problem hiding this comment.
Sure thing. Updating and pushing through
lilyminium
left a comment
There was a problem hiding this comment.
LGTM thanks @AmeyaHarmalkar! @IAlibay do you want to merge?
|
Thanks so much for all your work @AmeyaHarmalkar and congrats on your first merged MDA PR 🎉 |
|
Thank you @IAlibay and @lilyminium for all your help! This was a great learning experience! |
Fixes #2553
Changes made in this Pull Request:
PR Checklist