Write out chainID to PDB instead of segID#3157
Write out chainID to PDB instead of segID#3157lilyminium merged 25 commits intoMDAnalysis:developfrom
Conversation
|
Hello @HenryKobin! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-04-08 03:18:48 UTC |
lilyminium
left a comment
There was a problem hiding this comment.
Thanks for fixing this @HenryKobin! Just a few small comments :-)
Please add yourself to the AUTHORS file, as this is your first contribution to MDAnalysis.
Co-authored-by: Lily Wang <31115101+lilyminium@users.noreply.github.com>
Co-authored-by: Lily Wang <31115101+lilyminium@users.noreply.github.com>
Codecov Report
@@ Coverage Diff @@
## develop #3157 +/- ##
========================================
Coverage 92.77% 92.78%
========================================
Files 169 169
Lines 22737 22760 +23
Branches 3227 3234 +7
========================================
+ Hits 21095 21118 +23
Misses 1594 1594
Partials 48 48
Continue to review full report at Codecov.
|
|
@lilyminium I'm having trouble passing this test. It appears this instance does have a chainID property, however they are all blank. I was under the impression I should only set the chainIDs to the SegmentID default if they don't have the chainID property. Am I misunderstanding? EDIT: To be more concise, the test is expecting A from GUA to be the ChainID, however the chainID attribute of this universe does exist, they are just all blank. The other tests in this file pass. |
|
Just copying over from the discord. The PDB standard is surprisingly quite clear about what a chain ID can be in a PDB file: In my opinion we shouldn't be allowing the writing of blank chain IDs. In an ideal world we would just throw an error and avoid writing out of standard, but I assume this might break a lot of things downstream. Probably the easiest answer here is to set the chainID to the SegmentID default if it exists but doesn't conform to the PDB standard? I'm sure others have views on this though, I am a bit of a PDB standard purist. |
|
Got it. I'll wait and see what the consensus is, and move forward from there |
|
Well, "non-blank" is pretty unambiguous 😅. I think, following the spirit of #2627 (where elements are read from the file), we could go for an all-or-nothing approach: if even one chainID is blank, we ignore them all and write the last letter of the segid. What do you think @HenryKobin @IAlibay? Edit: similar to the PDB elements reading, this would likely need a warning too because users probably haven't memorised the PDB spec. |
My interpretation is that it means
Honestly this is a bit of a difficult one, residues are modular so I can see folks setting some residues and leaving others blank. Personally I wouldn't default to segid[-1] at all (I think it's just easier to default to 'U' or something like that, i.e. like what things like openbabel do), although I assume there is some historical reason for doing so (@jbarnoud might have some ideas on this?). I should note that we now handle PDB element reading in a non-all-or-nothing approach (see #3001). I'm ok with going all-or-nothing here to keep the PR simple, but we'll need a long term solution that a) prevents us ever defaulting to writing out empty strings, b) defaults to a predictable value, c) allows modularity. |
|
@IAlibay If I switch to all or nothing a couple tests will break, so I might have to alter those to pass (they are ensuring that segID is being used to create a chainID). I like the idea of checking if there are any non alphanumeric ids in the list and defaulting to something expected if this occurs (although I am not an end user of the tool, so I don't know what the end user would want). something like the above EDIT: I just pushed a version that conforms to "use chainids if present and has all alphanumeric ID's else use end of segID". I can alter to whatever decision is ultimately made by the team, this is just a starting point / y'all can see what my current thinking is on how to solve the issue. |
|
I do not know what the best behaviour should be. As I see it, the issue is that chains and segments are pretty arbitrary concepts and they are not consistent between parsers/formats. If a chain is there, it should be used, regardless of what it is. The only constraint is that it should not be more than one character. "alpha-numerical" is ambiguous, even if you stay in the realm of ASCII characters, Defaulting to the last character of the segment name makes some assumptions. It assumes that:
One solution I implemented for other projects was to name chains |
|
@jbarnoud so would a solution perhaps be to check if a chainID attribute exists, and if it is anything other than 1 character long or non existent to find all unique segment names and set the chainID to A,B,C... changing each time the segment name changes? |
|
It is indeed what I would do rather than using the segment name. It is worth seeing what the others think about it, though. There are very good arguments to keep it simple and just name the chain So to summarize:
It is nothing but my own opinion, though; @lilyminium or @IAlibay may think differently. |
|
For alpha-numerical, I would take any non-empty single character as valid and emit a warning if it is not in the range |
|
@HenryKobin, thank you for your patience. I think consensus amongst @jbarnoud, @IAlibay and I is that:
What do you think? It's a bit of a conflict that segids are a segment-level attribute, and chainIDs are atom-level -- that means we can't treat them as conceptually the same, because in a segment different atoms could have different chainIDs. So I think we will opt for the simple case for now. |
Use "X" for the atoms that do not have a chainID. |
|
@lilyminium I think it makes sense. I just wanna make sure i'm following, though.
So is the plan is to replace individual chainID's with "X" if they are missing, not to change every chainID to "X" if one is missing, correct?
Thanks for clarifying. I'm not very familiar with many of these atom concepts. |
That is a very specific check. It is verbose but, as far as I can tell, not much slower than just checking One thing I'll note though is that |
|
No worries @lilyminium , I've been moving apartments so it's been hectic over here as well. I agree that using |
|
Sorry I've not been keeping up here. @HenryKobin just wanted to check what we can do to push this towards merging. Is there anything on our end that we still need to decide on? |
|
Hey @IAlibay , @lilyminium sorry for the delay. I've noticed that changing the segids to use |
lilyminium
left a comment
There was a problem hiding this comment.
@HenryKobin looking at the errors, you're right that messing with reading segids has far-reaching consequences. Unlike the chainID, which is just a topology attribute (essentially a label for each atom), segments are structurally important in defining different parts of the molecule. It's probably simpler to focus this issue on accurately writing out segids, and we can look at changing default segid later as this will require some thought.
|
|
||
| # If segids not present, try to use chainids | ||
| if not any(segids): | ||
| segids = chainids |
There was a problem hiding this comment.
If you put this back in, that second and third failing tests should pass. I think that this is expected behaviour from most users, that differences in chain are conceptually similar to differences in segment -- better not to change this part.
| else: | ||
| n_segments = 1 | ||
| attrs.append(Segids(np.array(['SYSTEM'], dtype=object))) | ||
| attrs.append(Segids(np.array([' '], dtype=object))) |
There was a problem hiding this comment.
Instead of using spaces, use an empty string for blank. That being said, the reason the first failing test is failing is that some functions use "segid xx and resid yy" to select atoms -- so if xx='' or xx=' ', the string "segid and resid yy" breaks the parser. You could change the code that the test shows is failing, but I would be wary of more things failing that tests aren't showing. If you change the default value back to "SYSTEM", that's not too much of an issue -- the original issue was just about writing out segids.
|
@lilyminium you're right! Whoops. I mixed up reading with writing... 😞 I reverted the changes. I did have to alter one test, though. Since we are defaulting to 'X' for a chainID if it is missing, there are always chainids - so segids will default to |
| missing_ids = False | ||
|
|
||
| for (i, chainid) in enumerate(chainids): | ||
| if chainid is None: |
There was a problem hiding this comment.
Is it ever None? Isn't the default value above " "?
There was a problem hiding this comment.
You're correct, it shouldn't ever be None.
lilyminium
left a comment
There was a problem hiding this comment.
Thanks @HenryKobin, looking through I just have a question about None and if you can parametrize the test :-)
| """check whether chainID comes from last character of segid (issue #2224)""" | ||
| ref_id = 'E' | ||
| u = universe2 | ||
| def test_chainid_validated(self, universe3, outfile): |
There was a problem hiding this comment.
Could you please use pytest.mark.parametrize (https://docs.pytest.org/en/stable/parametrize.html) to turn this into three tests with different arguments? There are a few examples around if you ctrl+F in the tests.
…nd corrected check for empty string in pdb.py
|
@lilyminium , I made the requested changes. |
lilyminium
left a comment
There was a problem hiding this comment.
Thanks @HenryKobin, if you just switch that space to the empty string I think it's good to go!
Co-authored-by: Lily Wang <31115101+lilyminium@users.noreply.github.com>
|
@lilyminium committed your suggested changes! Thanks! |
|
Thank you @HenryKobin for fixing this bug, my PDB files will be much happier for it! :-) |
|
@lilyminium thank you for your patience and assistance in walking me through everything! 😁 |
Fixes #3144
Changes made in this Pull Request:
PR Checklist