Skip to content

[WIP] PDB parser and writer use the "elements" topology attribute#2442

Closed
jbarnoud wants to merge 5 commits intodevelopfrom
pdb-elements
Closed

[WIP] PDB parser and writer use the "elements" topology attribute#2442
jbarnoud wants to merge 5 commits intodevelopfrom
pdb-elements

Conversation

@jbarnoud
Copy link
Contributor

@jbarnoud jbarnoud commented Jan 9, 2020

Fixes #2422 #2423

Changes made in this Pull Request:

  • The PDB parser populates the elements topology attribute if the appropriate column is filled in the input. The content of elements is the same as atomtypes.
  • The PDB reader uses the elements topology attribute if available.

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@jbarnoud
Copy link
Contributor Author

jbarnoud commented Jan 9, 2020

The changes cause a difference in the behaviour of the writer which I do not know how to handle. The default for a missing element symbol is an empty string. When the element symbols are available to the writer, then the writer omits the missing one, leaving the field empty. This is different from the current behaviour that guesses all the element symbols.

This difference is made visible by a test where the initial PDB input fills some, but not all, the element symbols, and the output PDB is expected to have the missing ones.

My feeling is that guessing the missing symbols should be an active decision of the user, and I would be tempted to adapt the failing test. Though, I would like to hear what others would think about this.

if a PDB file specifies some, but not all, element symbols, then the missing symbols default to an empty string. If you then write a new PDB file

@richardjgowers
Copy link
Member

I think it makes most sense to either have the column or not. Can we guess elements ahead of time and only write if they are all defined?

@RMeli
Copy link
Member

RMeli commented Jun 10, 2020

My feeling is that guessing the missing symbols should be an active decision of the user, and I would be tempted to adapt the failing test.

@jbarnoud I think this sounds very reasonable, especially because the guessers can have problems with some atom names found in the wild (see CAL->Ca and NAN->Na in #2732).

If the elements attribute is present chances are that it comes from a file (especially after #2627), so I think that if such record is incomplete the writer should leave the field empty (as it is in your current implementation) instead of guessing. IMO, it is nice to have a file unchanged by a read immediately followed by a write. Guessing atom elements from the atom name should be decided/done by the user.

@haroldgrosjean
Copy link

Hello,

I am writing this message because I am experiencing similar issues. I am using MDAnalysis to separate protein-ligand pairs to, later, parametrize for MD where correct atom typing and valencies are essential. It essentially transforms some Bromides (Br) into boron (B)

my code goes as follow:

import MDAnalysis
print(MDAnalysis.__version__)

0.20.1

An excerpt from this input file is shown below.

##This is ./pdb.pdb
#more protein atoms above
ANISOU 1016  NH2 ARG A1434     2607   3648   2290   -197    323   -307       N  
TER    1017      ARG A1434                                                      
HETATM 1018  C4  HH8 A1501     -20.776  12.574  25.611  0.59 11.56           C  
ANISOU 1018  C4  HH8 A1501     1524   1603   1266    -34    163      3       C  
HETATM 1019  C5  HH8 A1501     -18.698  13.529  26.059  0.59 13.32           C  
ANISOU 1019  C5  HH8 A1501     1742   1861   1459    -49    143    -23       C  
HETATM 1020  C6  HH8 A1501     -19.171  14.104  27.289  0.59 13.28           C  
ANISOU 1020  C6  HH8 A1501     1752   1860   1436    -66    163    -39       C  
HETATM 1021  C7  HH8 A1501     -20.476  13.883  27.651  0.59 12.78           C  
ANISOU 1021  C7  HH8 A1501     1698   1776   1381    -62    184    -31       C  
HETATM 1022  C8  HH8 A1501     -21.313  13.106  26.800  0.59 11.94           C  
ANISOU 1022  C8  HH8 A1501     1585   1651   1299    -46    184     -9       C  
HETATM 1023  N1  HH8 A1501     -21.537  11.824  24.697  0.59 10.48           N  
ANISOU 1023  N1  HH8 A1501     1379   1453   1151    -26    163     18       N  
HETATM 1024  N2  HH8 A1501     -19.448  12.773  25.252  0.59 12.38           N  
ANISOU 1024  N2  HH8 A1501     1620   1722   1363    -34    142     -3       N  
HETATM 1025  C3  HH8 A1501     -22.795  11.625  24.992  0.59 11.02           C  
ANISOU 1025  C3  HH8 A1501     1449   1514   1225    -30    183     21       C  
HETATM 1026 BR1  HH8 A1501     -23.498  13.338  28.667  0.59 17.51           BR  
ANISOU 1026 BR1  HH8 A1501     2312   2347   1994    -60    237    -17       BR  
HETATM 1027  C1  HH8 A1501     -22.672  12.805  27.033  0.59 13.02           C  
ANISOU 1027  C1  HH8 A1501     1725   1776   1447    -45    205     -2       C  
HETATM 1028  C2  HH8 A1501     -23.434  12.102  26.169  0.59 11.96           C  
ANISOU 1028  C2  HH8 A1501     1579   1633   1333    -38    205     13       C  
HETATM 1040  O   HOH A1601      -6.356  19.116  22.312  1.00 38.64           O 
#more water atoms bellow
u = mda.Universe('./pdb.pdb')    
protein = u.select_atoms('protein or resname HOH')
LIG = u.select_atoms('all') - protein
LIG.write("./LIG.pdb")

The code returns for ./LIG.pdb

ATOM      1  C4  HH8 A1501     -20.776  12.574  25.611  0.59 11.56      A    C
ATOM      2  C5  HH8 A1501     -18.698  13.529  26.059  0.59 13.32      A    C
ATOM      3  C6  HH8 A1501     -19.171  14.104  27.289  0.59 13.28      A    C
ATOM      4  C7  HH8 A1501     -20.476  13.883  27.651  0.59 12.78      A    C
ATOM      5  C8  HH8 A1501     -21.313  13.106  26.800  0.59 11.94      A    C
ATOM      6  N1  HH8 A1501     -21.537  11.824  24.697  0.59 10.48      A    N
ATOM      7  N2  HH8 A1501     -19.448  12.773  25.252  0.59 12.38      A    N
ATOM      8  C3  HH8 A1501     -22.795  11.625  24.992  0.59 11.02      A    C
ATOM      9 BR1  HH8 A1501     -23.498  13.338  28.667  0.59 17.51      A    B
ATOM     10  C1  HH8 A1501     -22.672  12.805  27.033  0.59 13.02      A    C
ATOM     11  C2  HH8 A1501     -23.434  12.102  26.169  0.59 11.96      A    C
CONECT    1    5    6    7
CONECT    2    3    7
CONECT    3    2    4
CONECT    4    3    5
CONECT    5    1    4   10
CONECT    6    1    8
CONECT    7    1    2
CONECT    8    6   11
CONECT    9   10
CONECT   10    5    9   11
CONECT   11    8   10
END
``

@IAlibay
Copy link
Member

IAlibay commented Oct 19, 2020

@jbarnoud sorry to be digging up an old PR. It'd be really great to get this fixed (it's an ongoing issue for a few things I deal with downstream). Is this something you're still willing to work on?

Otherwise, given the large changes to the PDB parser since this PR started, I'd be willing to create a superseding PR if that was ok with you?

@jbarnoud
Copy link
Contributor Author

@IAlibay This is still on my radar and bumped up in my TODO recently, but I cannot guarantee a timeframe. Please go ahead if you think you can send some time on it soon. Ping me when you need your code reviewed.

@IAlibay
Copy link
Member

IAlibay commented Nov 19, 2020

Closing this now that superseding PR #3001 has been merged.

@IAlibay IAlibay closed this Nov 19, 2020
@RMeli RMeli deleted the pdb-elements branch December 23, 2023 21:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PDB parser ignores the element column

6 participants