Skip to content

Write out chainID to PDB instead of segID#3157

Merged
lilyminium merged 25 commits intoMDAnalysis:developfrom
HenryKobin:pdb-chain-fix
Apr 8, 2021
Merged

Write out chainID to PDB instead of segID#3157
lilyminium merged 25 commits intoMDAnalysis:developfrom
HenryKobin:pdb-chain-fix

Conversation

@HenryKobin
Copy link
Contributor

Fixes #3144

Changes made in this Pull Request:

  • When writing to a file, a check will be made for if chainIDs exist, and if they do they will be used instead of being defaulted to the last letter of the segid. If there is no chainID present, then segid's last letter will be used (like how this was before).

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@pep8speaks
Copy link

pep8speaks commented Mar 14, 2021

Hello @HenryKobin! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-04-08 03:18:48 UTC

Copy link
Member

@lilyminium lilyminium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this @HenryKobin! Just a few small comments :-)

Please add yourself to the AUTHORS file, as this is your first contribution to MDAnalysis.

HenryKobin and others added 2 commits March 13, 2021 17:24
Co-authored-by: Lily Wang <31115101+lilyminium@users.noreply.github.com>
Co-authored-by: Lily Wang <31115101+lilyminium@users.noreply.github.com>
@codecov
Copy link

codecov bot commented Mar 14, 2021

Codecov Report

Merging #3157 (9f0998c) into develop (3aebcb5) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff            @@
##           develop    #3157   +/-   ##
========================================
  Coverage    92.77%   92.78%           
========================================
  Files          169      169           
  Lines        22737    22760   +23     
  Branches      3227     3234    +7     
========================================
+ Hits         21095    21118   +23     
  Misses        1594     1594           
  Partials        48       48           
Impacted Files Coverage Δ
package/MDAnalysis/coordinates/PDB.py 94.77% <100.00%> (+0.33%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3aebcb5...9f0998c. Read the comment docs.

@HenryKobin
Copy link
Contributor Author

HenryKobin commented Mar 14, 2021

@lilyminium I'm having trouble passing this test. It appears this instance does have a chainID property, however they are all blank. I was under the impression I should only set the chainIDs to the SegmentID default if they don't have the chainID property. Am I misunderstanding?

EDIT: To be more concise, the test is expecting A from GUA to be the ChainID, however the chainID attribute of this universe does exist, they are just all blank. The other tests in this file pass.

@IAlibay
Copy link
Member

IAlibay commented Mar 14, 2021

Just copying over from the discord.

The PDB standard is surprisingly quite clear about what a chain ID can be in a PDB file:
"Non-blank alphanumerical character is used for chain identifier."
(see: http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ATOM)

In my opinion we shouldn't be allowing the writing of blank chain IDs. In an ideal world we would just throw an error and avoid writing out of standard, but I assume this might break a lot of things downstream. Probably the easiest answer here is to set the chainID to the SegmentID default if it exists but doesn't conform to the PDB standard?

I'm sure others have views on this though, I am a bit of a PDB standard purist.

@HenryKobin
Copy link
Contributor Author

Got it. I'll wait and see what the consensus is, and move forward from there

@lilyminium
Copy link
Member

lilyminium commented Mar 14, 2021

Well, "non-blank" is pretty unambiguous 😅. I think, following the spirit of #2627 (where elements are read from the file), we could go for an all-or-nothing approach: if even one chainID is blank, we ignore them all and write the last letter of the segid. What do you think @HenryKobin @IAlibay?

Edit: similar to the PDB elements reading, this would likely need a warning too because users probably haven't memorised the PDB spec.

@IAlibay
Copy link
Member

IAlibay commented Mar 14, 2021

Well, "non-blank" is pretty unambiguous 😅

My interpretation is that it means str.isalnum() == True, i.e. non-whitespace alphanumeric. Maybe I'm misinterpreting something though? The PDB folks are quite responsive on twitter, so we can ping them if needed.

I think, following the spirit of #2627 (where elements are read from the file), we could go for an all-or-nothing approach: if even one chainID is blank, we ignore them all and write the last letter of the segid. What do you think @HenryKobin @IAlibay?

Edit: similar to the PDB elements reading, this would likely need a warning too because users probably haven't memorised the PDB spec.

Honestly this is a bit of a difficult one, residues are modular so I can see folks setting some residues and leaving others blank. Personally I wouldn't default to segid[-1] at all (I think it's just easier to default to 'U' or something like that, i.e. like what things like openbabel do), although I assume there is some historical reason for doing so (@jbarnoud might have some ideas on this?).

I should note that we now handle PDB element reading in a non-all-or-nothing approach (see #3001). I'm ok with going all-or-nothing here to keep the PR simple, but we'll need a long term solution that a) prevents us ever defaulting to writing out empty strings, b) defaults to a predictable value, c) allows modularity.

@HenryKobin
Copy link
Contributor Author

HenryKobin commented Mar 14, 2021

@IAlibay If I switch to all or nothing a couple tests will break, so I might have to alter those to pass (they are ensuring that segID is being used to create a chainID).

I like the idea of checking if there are any non alphanumeric ids in the list and defaulting to something expected if this occurs (although I am not an end user of the tool, so I don't know what the end user would want).

        if not any([id.isalnum() for id in chainids]):
            chainids = np.array(['U'] * len(atoms))

something like the above

EDIT: I just pushed a version that conforms to "use chainids if present and has all alphanumeric ID's else use end of segID". I can alter to whatever decision is ultimately made by the team, this is just a starting point / y'all can see what my current thinking is on how to solve the issue.

@jbarnoud
Copy link
Contributor

I do not know what the best behaviour should be. As I see it, the issue is that chains and segments are pretty arbitrary concepts and they are not consistent between parsers/formats.

If a chain is there, it should be used, regardless of what it is. The only constraint is that it should not be more than one character. "alpha-numerical" is ambiguous, even if you stay in the realm of ASCII characters, ¼ and ° are alpha-numerical.

Defaulting to the last character of the segment name makes some assumptions. It assumes that:

  • if the segment name changes from one "chain" to the next, the last letter reflects that
  • the last character will not be the same for 2 different chains.
    Basically, it assumes that the segment name encodes the chain name: segA, segB, etc...

One solution I implemented for other projects was to name chains A, B, C... and changing the chain name for new segments. If we go for that, I'd try to respect the chain ids that are already in the Universe if any: this means starting the sequence after the largest chain id already in use.

@HenryKobin
Copy link
Contributor Author

HenryKobin commented Mar 16, 2021

@jbarnoud so would a solution perhaps be to check if a chainID attribute exists, and if it is anything other than 1 character long or non existent to find all unique segment names and set the chainID to A,B,C... changing each time the segment name changes?
edit: Also working out a way to respect existing valid chainIDs. I'm just trying to make sure i'm following your comment.

@jbarnoud
Copy link
Contributor

It is indeed what I would do rather than using the segment name. It is worth seeing what the others think about it, though. There are very good arguments to keep it simple and just name the chain A, X, or U: somehow what I suggest is some sort of a guesser for chain ids and it is historically not always the best of ideas. Because the chainID is an attribute of the atom and NOT an attribute of the segment, every clever ideas based on segments can encounter nasty corner cases.

So to summarize:

  • we probably want to keep it simple
  • if we want to make it clever, I'll go for what I suggested above

It is nothing but my own opinion, though; @lilyminium or @IAlibay may think differently.

@jbarnoud
Copy link
Contributor

For alpha-numerical, I would take any non-empty single character as valid and emit a warning if it is not in the range [a-zA-Z0-9] (which is not what isalnum look for).

@IAlibay IAlibay mentioned this pull request Mar 16, 2021
9 tasks
@lilyminium
Copy link
Member

lilyminium commented Mar 16, 2021

@HenryKobin, thank you for your patience. I think consensus amongst @jbarnoud, @IAlibay and I is that:

  • if the chainID for all atoms exists and a valid, non-empty, [a-zA-Z0-9] character, use that
  • if the chainID for some atoms is missing, raise a warning and use "X" -- @jbarnoud was this right?
  • if the chainID is > 1 character for any atoms, raise a warning
  • if the chainID attribute is missing or is empty for all atoms, use "X" as a placeholder value

What do you think?

It's a bit of a conflict that segids are a segment-level attribute, and chainIDs are atom-level -- that means we can't treat them as conceptually the same, because in a segment different atoms could have different chainIDs. So I think we will opt for the simple case for now.

@jbarnoud
Copy link
Contributor

if the chainID for some atoms is missing, raise a warning [and use "X"] -- @jbarnoud was this right?

Use "X" for the atoms that do not have a chainID.

@HenryKobin
Copy link
Contributor Author

@lilyminium I think it makes sense. I just wanna make sure i'm following, though.

if the chainID for some atoms is missing, raise a warning and use "X" -- @jbarnoud was this right?
Use "X" for the atoms that do not have a chainID.

So is the plan is to replace individual chainID's with "X" if they are missing, not to change every chainID to "X" if one is missing, correct?

if the chainID is > 1 character for any atoms, raise a warning
In this case I should raise a warning and additionally use "X", right?

Thanks for clarifying. I'm not very familiar with many of these atom concepts.

@lilyminium
Copy link
Member

Just brainstorming, would something like this be overkill?

Of course I would like to avoid adding a new function, but perhaps it could be useful in the future depending on how granular our validation becomes. I could also try list comprehension, but that doesn't feel as pythonic since we are checking against multiple factors, so it could become hard to read.


        def validate_chainids(chainids, default):

            """Validate each atom's chainID



            chainids - np array of chainIDs

            default - default value in case chainID is considered invalid

            """

            invalid_length_ids = False

            invalid_char_ids = False

            missing_ids = False



            for (i, id) in enumerate(chainids):

                if id is None:

                    missing_ids = True

                    chainids[i] = default

                elif len(id) > 1:

                    invalid_length_ids = True

                    chainids[i] = default

                elif not id.isalnum():

                    invalid_char_ids = True

                    chainids[i] = default



            if invalid_length_ids:

                warnings.warn("Found chainIDs with invalid length"

                              " Corresponding atoms will use value of '{}'"

                              "".format(default))

            if invalid_char_ids:

                warnings.warn("Found chainIDs using unnaccepted character"

                              " Corresponding atoms will use value of '{}'"

                              "".format(default))

            if missing_ids:

                warnings.warn("Found missing chainIDs"

                              " Corresponding atoms will use value of '{}'"

                              "".format(default))

            return chainids



        chainids = validate_chainids(chainids, "X")

That is a very specific check. It is verbose but, as far as I can tell, not much slower than just checking if id is None or not id.isalnum() or len(id) !=1. I'm happy with it, but could be persuaded otherwise if @IAlibay and @jbarnoud disagree.

One thing I'll note though is that id is a built in function in Python and it's preferable to not overwrite it.

@HenryKobin
Copy link
Contributor Author

No worries @lilyminium , I've been moving apartments so it's been hectic over here as well. I agree that using id is probably not best practice for that function. I'll change it to chainid. As for the segid situation, I can set it so that it uses ' ' instead of defaulting to chainids (if no one objects)

@IAlibay
Copy link
Member

IAlibay commented Apr 3, 2021

Sorry I've not been keeping up here. @HenryKobin just wanted to check what we can do to push this towards merging. Is there anything on our end that we still need to decide on?

@IAlibay IAlibay added this to the 2.0 milestone Apr 6, 2021
@HenryKobin
Copy link
Contributor Author

HenryKobin commented Apr 7, 2021

Hey @IAlibay , @lilyminium sorry for the delay. I've noticed that changing the segids to use ' ' instead of defaulting to chainIDs may have some farther reaching consequences, as this causes a few tests to fail for reasons a bit unknown to me. I would appreciate any advice in this regard. Thank you.

===================================================== FAILURES ======================================================
________________________________________ TestRamachandran.test_None_removal _________________________________________
[gw7] linux -- Python 3.8.5 /home/henry/Documents/github/mdanalysis/mdanalysis/env/bin/python3

self = <MDAnalysisTests.analysis.test_dihedrals.TestRamachandran object at 0x7f59eb890be0>

    def test_None_removal(self):
        with pytest.warns(UserWarning):
            u = mda.Universe(PDB_rama)
>           rama = Ramachandran(u.select_atoms("protein").residues[1:-1])

/home/henry/Documents/github/mdanalysis/mdanalysis/testsuite/MDAnalysisTests/analysis/test_dihedrals.py:124: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/henry/Documents/github/mdanalysis/mdanalysis/package/MDAnalysis/analysis/dihedrals.py:374: in __init__
    prev = residues._get_prev_residues_by_resid()
/home/henry/Documents/github/mdanalysis/mdanalysis/package/MDAnalysis/core/topologyattrs.py:935: in _get_prev_residues_by_resid
    pvres[i] = u.select_atoms(sel.format(s, r)).residues[0]
/home/henry/Documents/github/mdanalysis/mdanalysis/package/MDAnalysis/core/universe.py:639: in select_atoms
    return self.atoms.select_atoms(*args, **kwargs)
/home/henry/Documents/github/mdanalysis/mdanalysis/package/MDAnalysis/core/groups.py:2949: in select_atoms
    selections = tuple((selection.Parser.parse(s, selgroups,
/home/henry/Documents/github/mdanalysis/mdanalysis/package/MDAnalysis/core/groups.py:2949: in <genexpr>
    selections = tuple((selection.Parser.parse(s, selgroups,
/home/henry/Documents/github/mdanalysis/mdanalysis/package/MDAnalysis/core/selection.py:1429: in parse
    parsetree = self.parse_expression(0)
/home/henry/Documents/github/mdanalysis/mdanalysis/package/MDAnalysis/core/selection.py:1437: in parse_expression
    exp1 = self._parse_subexp()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <MDAnalysis.core.selection.SelectionParser object at 0x7f5a0fb41cd0>

    def _parse_subexp(self):
        op = self.tokens.popleft()
    
        if op == '(':
            exp = self.parse_expression(0)
            self.expect(')')
            return exp
    
        try:
            return _SELECTIONDICT[op](self, self.tokens)
        except KeyError:
            errmsg = f"Unknown selection token: '{op}'"
            raise SelectionError(errmsg) from None
        except ValueError as e:
            errmsg = f"Selection failed: '{e}'"
>           raise SelectionError(errmsg) from None
E           MDAnalysis.exceptions.SelectionError: Selection failed: 'Unexpected token 'and''

/home/henry/Documents/github/mdanalysis/mdanalysis/package/MDAnalysis/core/selection.py:1461: SelectionError
_____________________________________________ test_new_chainid_new_res ______________________________________________
[gw11] linux -- Python 3.8.5 /home/henry/Documents/github/mdanalysis/mdanalysis/env/bin/python3

    def test_new_chainid_new_res():
        # parser must start new residue when chainid starts
    
        u = mda.Universe(PDB_chainidnewres)
    
>       assert len(u.residues) == 4
E       assert 3 == 4
E        +  where 3 = len(<ResidueGroup with 3 residues>)
E        +    where <ResidueGroup with 3 residues> = <Universe with 22 atoms>.residues

/home/henry/Documents/github/mdanalysis/mdanalysis/testsuite/MDAnalysisTests/topology/test_pdb.py:167: AssertionError
____________________________________________________ test_bonds _____________________________________________________
[gw8] linux -- Python 3.8.5 /home/henry/Documents/github/mdanalysis/mdanalysis/env/bin/python3

u = <Universe with 1877 atoms>

    def test_bonds(u):
        # need to force topology to load before querying individual atom bonds
>       bonds0 = u.select_atoms("segid B and (altloc A)")[0].bonds

/home/henry/Documents/github/mdanalysis/mdanalysis/testsuite/MDAnalysisTests/utils/test_altloc.py:47: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <AtomGroup with 0 atoms>, item = 0

    def __getitem__(self, item):
        # supports
        # - integer access
        # - boolean slicing
        # - fancy indexing
        # because our _ix attribute is a numpy array
        # it can be sliced by all of these already,
        # so just return ourselves sliced by the item
        if isinstance(item, numbers.Integral):
>           return self.level.singular(self.ix[item], self.universe)
E           IndexError: index 0 is out of bounds for axis 0 with size 0

/home/henry/Documents/github/mdanalysis/mdanalysis/package/MDAnalysis/core/groups.py:517: IndexError
============================================== short test summary info ==============================================
FAILED testsuite/MDAnalysisTests/analysis/test_dihedrals.py::TestRamachandran::test_None_removal - MDAnalysis.exce...
FAILED testsuite/MDAnalysisTests/topology/test_pdb.py::test_new_chainid_new_res - assert 3 == 4
FAILED testsuite/MDAnalysisTests/utils/test_altloc.py::test_bonds - IndexError: index 0 is out of bounds for axis ...
=========== 3 failed, 16902 passed, 408 skipped, 1 xfailed, 2 xpassed, 8463 warnings in 143.47s (0:02:23) ===========

Copy link
Member

@lilyminium lilyminium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HenryKobin looking at the errors, you're right that messing with reading segids has far-reaching consequences. Unlike the chainID, which is just a topology attribute (essentially a label for each atom), segments are structurally important in defining different parts of the molecule. It's probably simpler to focus this issue on accurately writing out segids, and we can look at changing default segid later as this will require some thought.


# If segids not present, try to use chainids
if not any(segids):
segids = chainids
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you put this back in, that second and third failing tests should pass. I think that this is expected behaviour from most users, that differences in chain are conceptually similar to differences in segment -- better not to change this part.

else:
n_segments = 1
attrs.append(Segids(np.array(['SYSTEM'], dtype=object)))
attrs.append(Segids(np.array([' '], dtype=object)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using spaces, use an empty string for blank. That being said, the reason the first failing test is failing is that some functions use "segid xx and resid yy" to select atoms -- so if xx='' or xx=' ', the string "segid and resid yy" breaks the parser. You could change the code that the test shows is failing, but I would be wary of more things failing that tests aren't showing. If you change the default value back to "SYSTEM", that's not too much of an issue -- the original issue was just about writing out segids.

@HenryKobin
Copy link
Contributor Author

@lilyminium you're right! Whoops. I mixed up reading with writing... 😞

I reverted the changes. I did have to alter one test, though. Since we are defaulting to 'X' for a chainID if it is missing, there are always chainids - so segids will default to 'X' when it parses the file. The test was formerly expecting 'SYSTEM'.

missing_ids = False

for (i, chainid) in enumerate(chainids):
if chainid is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ever None? Isn't the default value above " "?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct, it shouldn't ever be None.

Copy link
Member

@lilyminium lilyminium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @HenryKobin, looking through I just have a question about None and if you can parametrize the test :-)

"""check whether chainID comes from last character of segid (issue #2224)"""
ref_id = 'E'
u = universe2
def test_chainid_validated(self, universe3, outfile):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please use pytest.mark.parametrize (https://docs.pytest.org/en/stable/parametrize.html) to turn this into three tests with different arguments? There are a few examples around if you ctrl+F in the tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea!

@HenryKobin
Copy link
Contributor Author

@lilyminium , I made the requested changes.

Copy link
Member

@lilyminium lilyminium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @HenryKobin, if you just switch that space to the empty string I think it's good to go!

Co-authored-by: Lily Wang <31115101+lilyminium@users.noreply.github.com>
@HenryKobin
Copy link
Contributor Author

@lilyminium committed your suggested changes! Thanks!

@lilyminium lilyminium changed the title ChainID is overwritten by segid Write out chainID to PDB instead of segID Apr 8, 2021
@lilyminium lilyminium merged commit fe26097 into MDAnalysis:develop Apr 8, 2021
@lilyminium
Copy link
Member

Thank you @HenryKobin for fixing this bug, my PDB files will be much happier for it! :-)

@HenryKobin
Copy link
Contributor Author

@lilyminium thank you for your patience and assistance in walking me through everything! 😁

@HenryKobin HenryKobin deleted the pdb-chain-fix branch April 8, 2021 15:44
@IAlibay IAlibay added the defect label Sep 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ChainID is overwritten by segid

5 participants