Update tables.py by guilleaf · Pull Request #1808 · MDAnalysis/mdanalysis

guilleaf · 2018-03-07T18:10:23Z

Fixes #

Changes made in this Pull Request:

Small typo for the Mass of Zn

PR Checklist

Tests?
Docs?
CHANGELOG updated?
Issue raised/referenced?

orbeckst · 2018-03-07T18:34:59Z

Hi @guilleaf welcome to MDAnalysis and thank you for contributing!

Is it a problem for you that Zn is spelled ZN?

There is a reason that some of the elements are capitalized. This is how they appear in common forcefields. (We could probably do something about this and spell elements with correct element names and all-caps everything and make the atom-guessers do case-insensitive comparisons, but then this would involve a bit more work.)

orbeckst · 2018-03-07T18:53:56Z

Admittedly, I just ran into the situation where I am loading a PDB-type file with Zn and get

~/Projects/Methods/MDAnalysis/mdanalysis/package/MDAnalysis/topology/guessers.py:72: UserWarning: Failed to guess the mass for the following atom types: Zn
  warnings.warn("Failed to guess the mass for the following atom types: {}".format(atom_type))

So I concede the point that something ought to be done about it...

orbeckst · 2018-03-07T18:55:01Z

@guilleaf Could you please raise an issue that states the problem that you're having and then we can use the issue report to discuss what ought to be done about it?

guilleaf · 2018-03-07T19:13:35Z

Hello Oliver,

Sorry if the approach I took was not appropriate for your code. MD calculations are not exactly my area of expertise; I am helping someone else at the University with MD simulations of Metal Organic Frameworks. To summarize the story:

The researcher is using the package https://github.com/a-anik/zif-8_md. It is not an actual code but rather a set of Python scripts created a few months ago. It lists MDAnalysis among its dependencies, and after reading a bit about your project I decided to install it centrally on our cluster. I am attaching the relevant files for reference.

The researcher approached me asking for help with some error messages from that script:

    $ ./pdb2top_ZIF-8.py conf.pdb > zif8_1x1x1_periodic.itp
    /shared/software/languages/python/2.7.13/lib/python2.7/site-packages/MDAnalysis/topology/guessers.py:56: UserWarning: Failed to guess the mass for the following atom types: Zn
      "".format(', '.join(misses)))
    Traceback (most recent call last):
      File "./pdb2top_ZIF-8.py", line 170, in <module>
        u = build_zif8_top(args.fname)
      File "./pdb2top_ZIF-8.py", line 50, in build_zif8_top
        tgb = topobj.TopologyGroup.from_indices(all_bonds, u.atoms, bondclass=topobj.Bond, guessed=False)
    AttributeError: type object 'TopologyGroup' has no attribute 'from_indices'

So I dug into both pdb2top_ZIF-8.py and your library, trying to understand the origin of the warning and of the actual error. The error is clearly due to a change in the API of your TopologyGroup object, so that issue lies with the script, which was probably written for a previous version of your code.

The warning, however, comes from the guesser that assigns masses based on the atoms found in conf.pdb. Following the warning message, it comes from this code in MDAnalysis/topology/guessers.py:

    def guess_masses(atom_types):
        """Guess the mass of many atoms based upon their type

        Parameters
        ----------
        atom_types
          Type of each atom

        Returns
        -------
        atom_masses : np.ndarray dtype float64
        """
        masses = np.array([get_atom_mass(atom_t) for atom_t in atom_types], dtype=np.float64)
        if np.any(masses == 0.0):
            # figure out where the misses were and report
            misses = np.unique(np.asarray(atom_types)[np.where(masses == 0.0)])
            warnings.warn("Failed to guess the mass for the following atom types: {}"
                          "".format(', '.join(misses)))
        return masses

Probably the best approach is to avoid non-standard names for the atoms: standardize the names before the list comprehension in that function, and keep the table of masses keyed by canonical atomic symbols. My approach to this kind of problem usually is: "The world outside can be dirty, but I keep my code inside clean." In practice, that means converting everything coming from outside, such as "ZN", "zn", "zN" or "Zn", to "Zn", and using only the correct symbol "Zn" inside the code.

I understand your stance on the issue and leave it to you to choose the best direction for your code. I am glad to help in any case.

Thank you very much,
Guillermo Avendano-Franco
PhD Computational Physics
Research Computing Software Developer
Information Technology Services, West Virginia University

guilleaf · 2018-03-07T19:21:55Z

Oliver,

You write emails faster than me! ;-) I suggest something like

    masses = np.array([get_atom_mass(atom_t.capitalize()) for atom_t in atom_types], dtype=np.float64)

and using the correct capitalization inside your tables:

    In [1]: zn='ZN'
    In [2]: zn.capitalize()
    Out[2]: 'Zn'

I do not know how prevalent the issue is in your area. My research area is condensed matter, where the usual DFT codes are very canonical in their symbols for atomic species.

Best,
Guillermo Avendano-Franco
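As a rough illustration of the capitalization idea, here is a minimal sketch. The `MASSES` dictionary and `lookup_mass` function are hypothetical stand-ins, not the MDAnalysis API, and the masses are rounded standard atomic weights. The sketch also shows the limitation discussed further down the thread: capitalizing the atom name "CA" gives "Ca", so a C-alpha atom would be read as calcium.

```python
# Hypothetical table keyed by canonical element symbols (values in atomic mass units).
MASSES = {"C": 12.011, "Ca": 40.08, "Zn": 65.38}


def lookup_mass(atom_type):
    """Return the mass for atom_type after normalizing its case, or 0.0 if unknown."""
    return MASSES.get(atom_type.capitalize(), 0.0)


if __name__ == "__main__":
    # Every spelling of zinc resolves to the same entry; "CA" resolves to calcium.
    for name in ("ZN", "zn", "zN", "Zn", "CA", "XX"):
        print(name, "->", name.capitalize(), lookup_mass(name))
```

Returning 0.0 for an unknown type mirrors the `masses == 0.0` check in the `guess_masses` snippet quoted earlier in the thread; whether that convention is desirable is a separate question.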

orbeckst · 2018-03-07T22:05:35Z

@guilleaf many thanks for the detailed feedback. We'll have to think about how to handle this. In principle I agree with your clean/dirty approach.

richardjgowers · 2018-03-08T09:47:44Z

The fun corner case is @orbeckst et al work in a world of Ca being Carbon alpha :). I think we can just have both Zn and ZN in the table at no risk.

orbeckst · 2018-03-08T16:47:48Z

It's actually CA being carbon alpha and Ca calcium, at least most of the time. Just putting ZN and Zn in the table (and doing the same for the other ones) seems an ok solution (not pretty but should get the job done).


orbeckst · 2018-03-08T22:22:56Z

@guilleaf would you like to update your PR and duplicate the upper-case elements (except CA) to lower case?

You'll also have to add an entry to CHANGELOG and add yourself to AUTHORS.
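The duplication requested here could be sketched as follows, assuming a plain dictionary in place of the real table. The function name `add_capitalized_aliases` and the sample values are hypothetical and only illustrate the "both ZN and Zn, except CA" idea.

```python
def add_capitalized_aliases(masses, skip=("CA",)):
    """Return a copy of masses with 'Zn'-style aliases for all-caps two-letter keys.

    Keys listed in skip are left alone, so "CA" does not gain a "Ca" alias here.
    """
    extended = dict(masses)
    for key, value in masses.items():
        if len(key) == 2 and key.isupper() and key not in skip:
            # setdefault keeps an existing properly cased entry untouched.
            extended.setdefault(key.capitalize(), value)
    return extended


if __name__ == "__main__":
    table = {"C": 12.011, "CA": 40.08, "ZN": 65.38, "MG": 24.305}
    print(sorted(add_capitalized_aliases(table)))
```

The trade-off raised in the thread still applies: every aliased mass is stored twice, so the two spellings have to be kept in sync by construction, as here, rather than by hand.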

guilleaf · 2018-03-09T20:26:24Z

If Ca can be interpreted as carbon alpha, that is a strong indication of how polluted the external world can be. The solution is not to fill your code with redundancies and exceptions. That is bad design, and you will pay the price in maintainability down the road. I understand that doing the "easy thing" is sometimes a temptation over doing the "right thing".

You should treat external files, even those in standard formats, as potentially polluted. They must be filtered, checked for internal consistency, and have all units converted to a clear standard of internal units. I learned that on my own, the hard way, writing first-principles codes, and I am still paying the price for my mistakes.

For your case, if you want to keep a dictionary, and speaking specifically about package/MDAnalysis/topology/tables.py, I recommend:

1. Avoid the use of kv2dict and set up those tables as Python dictionaries directly in the code. If you use "masses" frequently in your code, you will be converting and reconverting the string into a Python dictionary. I know that the point of writing code in Python is not performance, but there is no reason to make it worse. Another solution could be lazy evaluation, but again, that dictionary will not change, so create it once and use it forever.
2. TABLE_ATOMELEMENTS is halfway to being the actual "conversion table" that you need to purify the often ambiguous notations of your research area. With some tweaking it can be converted into a function that normalizes entries like "CAL", "CA" and "Ca" into just "Ca". A neat implementation could take either one string or a list of strings and return the corresponding list of correct atomic symbols. The pathological case of carbon alpha must be filtered somewhere else.

Your code is a nice and needed effort in your area. I am not into MD simulations, but some of the people I support and collaborate with use CMD or ab initio MD one way or another. Your code is valuable for the community.

I will refrain from submitting significant changes to your code base at this point. My familiarity with your sources is minimal, but I am glad to help.

Thank you,
Guillermo Avendaño-Franco
Computational Physics
Research Computing Software Developer
ITS, West Virginia University
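A normalization function of the kind proposed in this comment might look like the sketch below. The `ALIASES` table and `normalize_symbols` are hypothetical; the real TABLE_ATOMELEMENTS has different and far more entries. Following the comment, "CA" is mapped to calcium here and the C-alpha case is assumed to be filtered elsewhere.

```python
# Hypothetical alias table: upper-cased external labels -> canonical element symbols.
ALIASES = {"CAL": "Ca", "CA": "Ca", "ZN": "Zn", "FE": "Fe", "MG": "Mg", "C": "C"}


def normalize_symbols(labels):
    """Map one label, or a list of labels, to canonical element symbols.

    Unknown labels map to None so the caller decides how to handle them.
    """
    single = isinstance(labels, str)
    items = [labels] if single else list(labels)
    # Compare case-insensitively by upper-casing the external label.
    result = [ALIASES.get(label.strip().upper()) for label in items]
    return result[0] if single else result


if __name__ == "__main__":
    print(normalize_symbols("zn"))
    print(normalize_symbols(["CAL", "CA", "Ca", "Xx"]))
```

Keeping all the dirty spellings in one table means the mass table itself only ever needs canonical keys.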


guilleaf · 2018-03-09T20:50:11Z

That is an example of doing what is easy over doing what is right 😉

I was inspecting your code with PyCharm and I see that you use "masses" twice inside guessers.py: in get_atom_masses and validate_atom_masses, with the first one actually calling the second one.

Now imagine that you have a protein; let's say an XYZ file with 100k atoms. Your parser will create a list of 100k names. Now suppose that you execute guess_masses for those 100k names (by the way, you are not doing that; line 93 of XYZParser.py is your salvation). That means you would call kv2dict, converting TABLE_MASSES into a Python dictionary, 200k times, getting exactly the same result over and over, using it once to validate and once more to get the mass of every single atom name.

I kindly recommend moving away from that design and reconsidering the idea of duplicating atomic masses. Code development is all about reducing entropy. 😉

Best,
Guillermo Avendaño-Franco


guilleaf · 2018-03-09T20:58:39Z

I prefer not to do that; it is not good for my "karma", if you like. If you want to duplicate that line, go ahead and do it yourself. I will not be offended by that. Thank you for hearing me out, at least.

Best,
Guillermo Avendaño-Franco


orbeckst · 2018-03-10T00:58:31Z

@guilleaf thanks for taking the time to look at the code and share your ideas. You raise a number of good points.

CA vs Ca... and upper case "elements"

I had a look at the masses table

mdanalysis/package/MDAnalysis/topology/tables.py

Line 174 in e94b1c4

TABLE_MASSES = """

specifically the entry for "CA"

mdanalysis/package/MDAnalysis/topology/tables.py

Line 196 in e94b1c4

CA 40.08000

and we clearly treat it as calcium.

There really does not seem to exist a good reason for us to keep upper case element names, especially as we have TABLE_ATOMELEMENTS

mdanalysis/package/MDAnalysis/topology/tables.py

Line 79 in e94b1c4

TABLE_ATOMELEMENTS = """

as you pointed out in #1808 (comment)

All we should do is make the TABLE_MASSES and the TABLE_ATOMELEMENTS consistent.

make the keys in TABLE_MASSES properly cased (e.g. "Zn" rather than "ZN") so that they correspond to proper element symbols
add any upper/lower case translations to TABLE_ATOMELEMENTS
explicitly add "CA" as a "C" to table TABLE_ATOMELEMENTS to make this choice somewhat more transparent. (In biomolecular simulation CA is much more likely to mean "C-alpha" atom than calcium so I would want to go with the dominant semantics.)
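The three steps above can be sketched as a two-stage lookup: atom name to element, then element to mass. The abbreviated tables and function names below are hypothetical stand-ins for TABLE_ATOMELEMENTS and TABLE_MASSES, and the masses are rounded standard atomic weights.

```python
# Hypothetical, abbreviated stand-ins for the two tables discussed above.
ATOM_ELEMENTS = {"CA": "C", "CAL": "Ca", "ZN": "Zn", "MG": "Mg", "FE": "Fe"}
MASSES = {"C": 12.011, "Ca": 40.08, "Mg": 24.305, "Fe": 55.845, "Zn": 65.38}


def guess_element(atom_name):
    """Return an element symbol for atom_name, or None if no guess is possible."""
    name = atom_name.strip()
    if name in MASSES:
        # Already a properly cased element symbol, e.g. "Ca" or "Zn".
        return name
    # Otherwise translate via the upper-cased name; "CA" is taken as C-alpha.
    return ATOM_ELEMENTS.get(name.upper())


def guess_mass(atom_name):
    """Return the mass for atom_name, or 0.0 if the element cannot be guessed."""
    return MASSES.get(guess_element(atom_name), 0.0)


if __name__ == "__main__":
    for name in ("CA", "Ca", "ZN", "Zn", "Q9"):
        print(name, guess_element(name), guess_mass(name))
```

With this ordering, "Ca" is calcium while "CA" is carbon, which makes the domain-specific choice explicit in one table entry rather than implicit in the mass table.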

Philosophy

Guessing is a bit of a problem because inherently we are biased to a certain domain, which carries the risk that scientists from other domains do not get the same benefits or worse, get wrong results. Perhaps we should fail cleanly with a good error message if we cannot guarantee that we're making a good guess.

The real solution is for the user to provide un-ambiguous input files.

However, many users like the convenience of not having to do that...

We could use a guess=False flag by default and tell people that with guess=True they run the risk of wrong results. In an ideal world we could flag all guesses that are ambiguous. Given the amount of different data and formats that people throw into MDAnalysis, this is not really feasible. We can catch cases that we are aware of but that's not providing any certainty.
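One way to picture the opt-in flag is the sketch below. The function `masses_from_names`, its `guess` parameter, and the `AMBIGUOUS` table are all hypothetical and do not describe an existing MDAnalysis interface. The sketch fails cleanly by default, and when guessing is enabled it warns only about the ambiguous names it knows of, which is exactly the limitation noted above.

```python
import warnings

# Hypothetical tables; masses are rounded standard atomic weights.
MASSES = {"C": 12.011, "Ca": 40.08, "Zn": 65.38}
AMBIGUOUS = {"CA": ("C", "Ca")}  # C-alpha atom name vs. calcium; first entry wins


def masses_from_names(names, guess=False):
    """Return masses for atom names, guessing elements only if guess is True."""
    if not guess:
        raise ValueError(
            "No masses available; pass guess=True to guess them from atom "
            "names (guesses may be wrong for ambiguous names)."
        )
    masses = []
    for name in names:
        if name in AMBIGUOUS:
            element = AMBIGUOUS[name][0]
            warnings.warn(
                "Ambiguous atom name {!r}: guessing element {!r}".format(name, element)
            )
        else:
            element = name.capitalize()
        if element not in MASSES:
            raise ValueError("Cannot guess a mass for atom name {!r}".format(name))
        masses.append(MASSES[element])
    return masses


if __name__ == "__main__":
    print(masses_from_names(["CA", "ZN"], guess=True))
```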

using kv2dict

Avoid the use of kv2dict and setup those tables as python dictionaries directly on the code. If you use "masses" frequently on your code, you will be converting and reconverting the string into a python dictionary.

It is actually not true that we are creating masses repeatedly. The masses = kv2dict() call

mdanalysis/package/MDAnalysis/topology/tables.py

Line 296 in e94b1c4

masses = kv2dict(TABLE_MASSES, convertor=float)

is run exactly once, the very first time that topology.tables is imported somewhere. From then on, masses exists. The Python interpreter is not reloading the module or recalculating masses every time. Writing it with kv2dict is just a quick way to have the data in a human-readable form without having it reside in a separate file. There's no appreciable performance penalty.
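The import behaviour described here can be checked with a small self-contained experiment. The module name `demo_tables` and the `parse_table` helper are hypothetical; `parse_table` is only a simplified stand-in for kv2dict. The script writes a throwaway module that parses a table at module level, imports it twice, and reports how often the parser ran.

```python
import importlib
import os
import sys
import tempfile
import textwrap

# Source of a throwaway module that builds its table at import time.
MODULE_SOURCE = textwrap.dedent('''
    PARSE_CALLS = 0

    def parse_table(text, convertor=str):
        """Parse whitespace-separated 'key value' lines into a dict."""
        global PARSE_CALLS
        PARSE_CALLS += 1
        table = {}
        for line in text.splitlines():
            fields = line.split()
            if len(fields) == 2:
                table[fields[0]] = convertor(fields[1])
        return table

    TABLE_MASSES = """
    C   12.011
    Zn  65.38
    """

    # Module-level statement: executed when the module is first imported.
    masses = parse_table(TABLE_MASSES, convertor=float)
''')


def import_twice():
    """Import the demo module twice; return (same object?, parse count, Zn mass)."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "demo_tables.py"), "w") as handle:
            handle.write(MODULE_SOURCE)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            first = importlib.import_module("demo_tables")
            # The second import is served from sys.modules; no code is re-run.
            second = importlib.import_module("demo_tables")
            return first is second, first.PARSE_CALLS, first.masses["Zn"]
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("demo_tables", None)


if __name__ == "__main__":
    print(import_twice())
```

Both imports return the same module object and the parser counter stays at one, so later lookups are ordinary dictionary accesses.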

orbeckst · 2018-03-10T01:06:37Z

If Ca can be interpreted as Carbon alpha is strong indication about how polluted can be the external world.

More precisely, "CA", but many file formats do not care about case.

I respectfully disagree with labelling the choice of "CA" as C-alpha as wrong. It depends on your domain. CA is not really supposed to be an element, it's a name, and if the name is the only thing we can use to guess then the better guess in biomolecular simulations is that it is a protein C-alpha atom. If you can use more information then the guess can get better (it depends on your prior...). For instance, if you know that it is inside a known protein residue, then it is almost certainly carbon. If it is in an ATOM record in a PDB file it is carbon per PDB standard, if it is HETATM then it is calcium... only most programs couldn't care less about how they write PDB files and just use ATOM for everything and we have to deal with what's out there in the wild.
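The idea of improving the guess with context can be sketched as below. The function `guess_ca_element` and the abbreviated residue set are hypothetical, and the rules are only the ones mentioned in this comment: a known protein residue means carbon, a HETATM record means calcium, and otherwise the dominant biomolecular meaning is used.

```python
# Abbreviated, hypothetical set of protein residue names.
PROTEIN_RESIDUES = {"ALA", "GLY", "SER", "LYS"}


def guess_ca_element(record_type, resname):
    """Guess the element of an atom named 'CA' from its record type and residue."""
    if resname.strip().upper() in PROTEIN_RESIDUES:
        # Inside a known protein residue: almost certainly a C-alpha carbon.
        return "C"
    if record_type.strip().upper() == "HETATM":
        # Hetero atom outside a protein residue: treat as calcium.
        return "Ca"
    # ATOM record with an unknown residue: fall back to the dominant meaning.
    return "C"


if __name__ == "__main__":
    for record, resname in (("ATOM", "ALA"), ("HETATM", "CA"), ("ATOM", "CA")):
        print(record, resname, guess_ca_element(record, resname))
```

The last case shows the problem with files from the wild: a calcium ion written as an ATOM record is still guessed as carbon under these rules.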

Just telling users "your file format is broken, complain to <insert widely adopted program here>" is not a good strategy to keep users. But if you have ideas how to handle these cases, we're all ears. We have to deal with this situation on a constant basis.

jbarnoud · 2018-06-17T09:07:44Z

In what case do we get the element of a carbon alpha set to CA? CA is the name for a carbon alpha, but what would set it as the element? This is what we need to fix. I agree with @guilleaf here.

RMeli · 2019-05-18T10:16:27Z

Hi all, I still get similar errors with MDAnalysis 0.19.3-dev:

~/software/python/mdanalysis-develop/package/MDAnalysis/topology/guessers.py:73: UserWarning: Failed to guess the mass for the following atom types: Mg
  warnings.warn("Failed to guess the mass for the following atom types: {}".format(atom_type))
~/software/python/mdanalysis-develop/package/MDAnalysis/topology/guessers.py:73: UserWarning: Failed to guess the mass for the following atom types: Zn
  warnings.warn("Failed to guess the mass for the following atom types: {}".format(atom_type))
~/software/python/mdanalysis-develop/package/MDAnalysis/topology/guessers.py:73: UserWarning: Failed to guess the mass for the following atom types: Fe
  warnings.warn("Failed to guess the mass for the following atom types: {}".format(atom_type))
~/software/python/mdanalysis-develop/package/MDAnalysis/topology/guessers.py:73: UserWarning: Failed to guess the mass for the following atom types: Ca
  warnings.warn("Failed to guess the mass for the following atom types: {}".format(atom_type))

Is there any update on this issue?

orbeckst · 2019-05-20T18:08:53Z

Hi @RMeli , could you please raise an issue for your specific problem? You can refer to this PR but I'd prefer to have initial discussions/questions on a proper issue. The discussion here touched on various aspects of element guessing with no clear conclusion yet and if we can just solve a sub-problem then that would be a good step forward. Thanks!

IAlibay · 2022-02-08T18:56:05Z

So a lot of this has changed, especially when dealing with PDBs (e.g. #3001).

I'm going to go ahead and close this and ask folks to open a separate issue to tackle any remaining issues if that's ok?

Update tables.py

f08b35c

jbarnoud mentioned this pull request Mar 19, 2018

Regarding printing the values of C-alpha and all the oxygen atoms of water molecules to a file #1823

Closed

RMeli mentioned this pull request May 21, 2019

Metal Atoms in PDB not properly recognized #2265

Closed

RMeli mentioned this pull request Aug 21, 2019

Try uppercase atom names when guessing the mass #2331

Merged

4 tasks

RMeli mentioned this pull request Sep 11, 2019

make guessers more consistent and transparent #2348

Closed

lilyminium mentioned this pull request Mar 16, 2020

Add contexts to guess attributes better (especially elements and masses) #2630

Closed

RMeli mentioned this pull request Jun 10, 2020

PDBWriter writes some C as Ca and some N as Na #2732

Closed

IAlibay added the "close?" label (Evaluate if issue/PR is stale and can be closed.) May 18, 2021

IAlibay closed this Feb 8, 2022
