Conversation
|
Hi @guilleaf welcome to MDAnalysis and thank you for contributing! Is it a problem for you that Zn is spelled ZN? There is a reason that some of the elements are capitalized. This is how they appear in common forcefields. (We could probably do something about this and spell elements with correct element names and all-caps everything and make the atom-guessers do case-insensitive comparisons, but then this would involve a bit more work.) |
|
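Admittedly, I just ran into the situation where I am loading a PDB-type file with Zn and get
~/Projects/Methods/MDAnalysis/mdanalysis/package/MDAnalysis/topology/guessers.py:72: UserWarning: Failed to guess the mass for the following atom types: Zn
  warnings.warn("Failed to guess the mass for the following atom types: {}".format(atom_type))
So I concede the point that something ought to be done about it... |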
|
@guilleaf Could you please raise an issue that states the problem that you're having and then we can use the issue report to discuss what ought to be done about it? |
|
Hello Oliver,
Sorry if the approach that I took was not appropriate for your code. MD calculations are not exactly my area of expertise.
I am helping someone else at the University with MD simulations of Metal Organic Frameworks.
Let me summarize the story.
1. The researcher is using a package
https://github.com/a-anik/zif-8_md
Not an actual code, but rather a set of Python scripts created a few months ago.
That code has MDAnalysis in its list of dependencies and, after reading a bit about your project, I decided to install it centrally on our cluster.
I am attaching the relevant files for reference:
The researcher approached me asking for help with some error messages from that script.
$ ./pdb2top_ZIF-8.py conf.pdb > zif8_1x1x1_periodic.itp
/shared/software/languages/python/2.7.13/lib/python2.7/site-packages/MDAnalysis/topology/guessers.py:56: UserWarning: Failed to guess the mass for the following atom types: Zn
  "".format(', '.join(misses)))
Traceback (most recent call last):
  File "./pdb2top_ZIF-8.py", line 170, in <module>
    u = build_zif8_top(args.fname)
  File "./pdb2top_ZIF-8.py", line 50, in build_zif8_top
    tgb = topobj.TopologyGroup.from_indices(all_bonds, u.atoms, bondclass=topobj.Bond, guessed=False)
AttributeError: type object 'TopologyGroup' has no attribute 'from_indices'
So I was digging through both the pdb2top_ZIF-8.py code and your library, trying to understand the origin of the "Warning" and the actual error.
The error is clearly due to a change in the API of your TopologyGroup object, so that issue lies with the script, which was probably written for a previous version of your code.
The warning, however, comes from the guesser that assigns masses based on the atoms found in conf.pdb.
Following the warning message, it comes from this code:
def guess_masses(atom_types):
    """Guess the mass of many atoms based upon their type
    Parameters
    ----------
    atom_types
      Type of each atom
    Returns
    -------
    atom_masses : np.ndarray dtype float64
    """
    masses = np.array([get_atom_mass(atom_t) for atom_t in atom_types], dtype=np.float64)
    if np.any(masses == 0.0):
        # figure out where the misses were and report
        misses = np.unique(np.asarray(atom_types)[np.where(masses == 0.0)])
        warnings.warn("Failed to guess the mass for the following atom types: {}"
                      "".format(', '.join(misses)))
    return masses
From:
MDAnalysis/topology/guessers.py
Probably the best approach is not to use non-standard names for the atoms, and to standardize the names before the list comprehension in that function, keeping the table of masses keyed by canonical atom names.
My approach to this kind of problem usually is:
"The world outside can be dirty, but I keep my code inside clean"
In practice, the best approach is probably to convert everything from outside, like "ZN", "zn", "zN" or "Zn", to Zn, and to stay clean inside the code by using the correct name of the atom, "Zn".
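A minimal sketch of that "clean at the boundary" idea, assuming a mass table keyed by canonical symbols. The table `MASSES`, its values, and the helpers `canonical_symbol` and `lookup_mass` are hypothetical and are not part of MDAnalysis; a miss returns 0.0 only to mirror how the guesser above detects failures.

```python
# Hypothetical mass table keyed by canonical element symbols.
# The values are illustrative, not copied from MDAnalysis.
MASSES = {"H": 1.008, "C": 12.011, "Zn": 65.38}


def canonical_symbol(raw):
    """Return the symbol with canonical capitalization, e.g. 'ZN' -> 'Zn'."""
    return raw.strip().capitalize()


def lookup_mass(raw, table=MASSES):
    """Look up a mass after normalizing the external spelling.

    Returns 0.0 for unknown symbols, mirroring the miss convention
    that guess_masses checks for.
    """
    return table.get(canonical_symbol(raw), 0.0)


for spelling in ("ZN", "zn", "zN", "Zn"):
    print(spelling, "->", canonical_symbol(spelling), lookup_mass(spelling))
```

Every external spelling resolves to the same table entry, so the table itself needs only one key per element.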
I understand your stance on the issue.
I will let you choose the best direction for your code. I am glad to help in any case.
Thank you very much,
Guillermo Avendano-Franco PhD
Computational Physics
Research Computing Software Developer
Information Technology Services
West Virginia University
|
|
Oliver,
You write emails faster than me! ;-)
I suggest something like
masses = np.array([get_atom_mass(atom_t.capitalize()) for atom_t in atom_types], dtype=np.float64)
And use the right capitalization inside your tables.
In [1]: zn='ZN'
In [2]: zn.capitalize()
Out[2]: 'Zn'
I do not know how prevalent the issue is in your area. My research area is condensed matter, so the usual DFT codes are very canonical in their symbols for atomic species.
Best,
|
|
@guilleaf many thanks for the detailed feedback. We'll have to think about how to handle this. In principle I agree with your clean/dirty approach. |
|
The fun corner case is that @orbeckst et al. work in a world where Ca is carbon alpha :). I think we can just have both Zn and ZN in the table at no risk. |
|
It's actually CA being carbon alpha and Ca calcium, at least most of the time.
Just putting ZN and Zn in the table (and doing the same for the other ones) seems an ok solution (not pretty but should get the job done).
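One way to read the "both spellings in the table" suggestion is sketched below. It assumes a table keyed by canonical symbols and derives the upper-case aliases from it, skipping CA so that the C-alpha atom name is not silently mapped to calcium. `ELEMENT_MASSES`, `AMBIGUOUS_UPPER` and `with_upper_aliases` are hypothetical names, and the masses are illustrative.

```python
# Hypothetical masses keyed by canonical element symbols (illustrative values).
ELEMENT_MASSES = {"C": 12.011, "Ca": 40.08, "Zn": 65.38}

# Upper-case spellings that are not aliased because they commonly mean
# something else as atom names (CA is the protein C-alpha atom).
AMBIGUOUS_UPPER = {"CA"}


def with_upper_aliases(table, skip=AMBIGUOUS_UPPER):
    """Return a copy of table that also contains upper-case spellings."""
    aliased = dict(table)
    for symbol, mass in table.items():
        upper = symbol.upper()
        # Single-letter symbols are already upper case; skip ambiguous ones.
        if upper != symbol and upper not in skip:
            aliased[upper] = mass
    return aliased


masses = with_upper_aliases(ELEMENT_MASSES)
print(sorted(masses))
```

Generating the aliases keeps a single source of truth for each mass, which avoids hand-maintained duplicate rows.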
--
Oliver Beckstein
email: orbeckst@gmail.com
|
|
@guilleaf would you like to update your PR and duplicate the upper-case elements (except CA) to lower case? You'll also have to add an entry to CHANGELOG and add yourself to AUTHORS. |
|
That Ca can be interpreted as carbon alpha is a strong indication of how polluted the external world can be. The solution is not to fill your code with redundancies and exceptions. That is bad design, and you will pay the price in maintainability down the road.
I understand that sometimes doing the "easy thing" is more tempting than doing the "right thing". You should treat external files, even those in standard formats, as potentially polluted. They must be filtered, checked for internal consistency, and have all units converted to a clear standard of internal units. I learned that on my own, the hard way, writing first-principles codes, and I am still paying the price for my mistakes.
For your case, if you want to keep a dictionary, and speaking specifically of
package/MDAnalysis/topology/tables.py
I recommend:
1. Avoid the use of kv2dict and set up those tables as Python dictionaries directly in the code. If you use "masses" frequently in your code, you will be converting and reconverting the string into a Python dictionary. I know that the point of writing code in Python is not performance, but there is no reason to make it worse. Another solution could be lazy evaluation, but again, that dictionary will not change, so create it once and use it forever.
2. TABLE_ATOMELEMENTS is halfway to being the actual "conversion table" that you need to purify the often ambiguous notations of your research area. With some tweaking, it could be converted into a function that normalizes entries like "CAL", "CA" and "Ca" into just "Ca". A neat implementation could take either one string or a list of strings and return the corresponding list of correct atomic symbols. The pathological case of carbon alpha must be filtered somewhere else.
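The two recommendations above could be combined roughly as follows. This is a sketch under stated assumptions: `ALIASES` is a hypothetical dictionary literal whose entries are illustrative and not copied from TABLE_ATOMELEMENTS, and `normalize_symbols` is a hypothetical helper. It assumes that C-alpha atoms have already been filtered out before the call.

```python
# Hypothetical alias table written directly as a dict literal:
# upper-cased external spelling -> canonical element symbol.
ALIASES = {"CAL": "Ca", "CA": "Ca", "ZN": "Zn", "C": "C"}


def normalize_symbols(names, aliases=ALIASES):
    """Map one name or a list of names to canonical element symbols.

    Always returns a list. Unknown names raise KeyError instead of
    being guessed.
    """
    if isinstance(names, str):
        names = [names]
    return [aliases[name.strip().upper()] for name in names]


print(normalize_symbols("Ca"))
print(normalize_symbols(["CAL", "CA", "zn"]))
```

Raising on unknown names is a design choice made for the sketch; a guesser that prefers warnings could return a sentinel instead.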
Your code is a nice and needed effort in your area. I am not into MD simulations, but some of the people that I support and collaborate with use CMD or ab-initio MD one way or another. Your code is valuable for the community. I will refrain from submitting significant changes to your code base at this point. My familiarity with your sources is minimal, but I am glad to help.
Thank you,
|
|
That is an example of doing what is easy over doing what is right 😉
I was inspecting your code with PyCharm and I see that you use "masses" twice inside your source file "guessers.py":
in get_atom_masses and validate_atom_masses, with the first one actually calling the second.
Now imagine that you have a protein; let's say you have an XYZ file with 100k atoms.
Your parser will create a list of 100k names.
Now let's suppose that you are executing "guess_masses" for those 100k names (BTW, you are not doing that [line 93 of XYZParser.py is your salvation]).
That means that you will be calling "kv2dict", converting TABLE_MASSES into a Python dictionary, 200k times, getting exactly the same result over and over, and using it once to validate and once more to get the mass of every single atom name.
As a friendly recommendation, move away from that design and reconsider the idea of duplicating atomic masses. Code development is all about reducing entropy. 😉
Best,
|
|
I prefer not to do that, not good for my "karma" if you like.
If you want to duplicate that line, go ahead and do it yourself.
I do not feel offended by that.
Thank you for hearing me at least.
Best,
|
|
@guilleaf thanks for taking the time to look at the code and share your ideas. You raise a number of good points.
CA vs Ca... and upper case "elements"
I had a look at the masses table, specifically the entry for "CA", and we clearly treat it as calcium.
There really does not seem to exist a good reason for us to keep upper case element names, especially as we have TABLE_ATOMELEMENTS as you pointed out in #1808 (comment)
All we should do is make the
Philosophy
Guessing is a bit of a problem because inherently we are biased to a certain domain, which carries the risk that scientists from other domains do not get the same benefits or, worse, get wrong results. Perhaps we should fail cleanly with a good error message if we cannot guarantee that we're making a good guess. The real solution is for the user to provide unambiguous input files. However, many users like the convenience of not having to do that... We could use a
using kv2dict
It is actually not true that we are creating is run exactly once, the very first time that topology.table is imported somewhere. From then on, masses exists. The Python interpreter is not reloading the module or recalculating masses every time. Writing it with kv2dict is just a quick way to have the data in a human-readable form without having it reside in a separate file. There's no appreciable performance penalty.
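The "runs exactly once" behaviour can be illustrated with a small stand-in. Everything here is hypothetical: `TABLE_TEXT` and `parse_table` only imitate the idea of a human-readable text table parsed into a dict and are not the MDAnalysis kv2dict implementation. The counter shows that a module-level assignment is evaluated once, after which every lookup is a plain dict access. Python also caches imported modules in `sys.modules`, so a second import does not re-run the module body.

```python
# Hypothetical human-readable table (illustrative values).
TABLE_TEXT = """
C   12.011
Ca  40.08
Zn  65.38
"""

parse_calls = 0


def parse_table(text):
    """Parse 'key value' lines into a dict and count how often we are called."""
    global parse_calls
    parse_calls += 1
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            table[fields[0]] = float(fields[1])
    return table


# Module-level statement: executed once, when the module body first runs.
masses = parse_table(TABLE_TEXT)


def get_mass(symbol):
    # Only a dict lookup happens here; the table is not parsed again.
    return masses.get(symbol, 0.0)


for _ in range(100_000):
    get_mass("Zn")

print(parse_calls)
```

The parse count stays at one however many lookups follow, which is the point being made about there being no appreciable performance penalty.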
|
More precisely, "CA", but many file formats do not care about case. I respectfully disagree with labelling the choice of "CA" as C-alpha as wrong. It depends on your domain. CA is not really supposed to be an element, it's a name, and if the name is the only thing we can use to guess, then the better guess in biomolecular simulations is that it is a protein C-alpha atom. If you can use more information, then the guess can get better (it depends on your prior...). For instance, if you know that it is inside a known protein residue, then it is almost certainly carbon. If it is in an ATOM record in a PDB file it is carbon per PDB standard, if it is HETATM then it is calcium... only most programs couldn't care less about how they write PDB files and just use ATOM for everything, and we have to deal with what's out there in the wild. Just telling users "your file format is broken, complain to |
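The record-type rule of thumb described in this comment could be sketched as below. `guess_element_for_ca` is a hypothetical helper, not an MDAnalysis function, and it only encodes the ATOM/HETATM distinction stated above. As the comment notes, files that write ATOM for everything would defeat it, so it is no better than the input it is given.

```python
def guess_element_for_ca(record_type):
    """Guess the element of an atom named 'CA' from its PDB record type."""
    record = record_type.strip().upper()
    if record == "ATOM":
        return "C"   # C-alpha of a standard residue
    if record == "HETATM":
        return "Ca"  # heteroatom, taken to be a calcium ion
    # Refuse to guess when the record type gives no usable prior.
    raise ValueError("cannot guess element for record type {!r}".format(record_type))


print(guess_element_for_ca("ATOM"))
print(guess_element_for_ca("HETATM"))
```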
|
In what case do we get the element of a carbon alpha set to CA? CA is the name for a carbon alpha, but what would set it as the element? This is what we need to fix. I agree with @guilleaf here. |
|
Hi all, I still get similar errors with MDAnalysis Is there any update on this issue? |
|
Hi @RMeli , could you please raise an issue for your specific problem? You can refer to this PR but I'd prefer to have initial discussions/questions on a proper issue. The discussion here touched on various aspects of element guessing with no clear conclusion yet and if we can just solve a sub-problem then that would be a good step forward. Thanks! |
|
So a lot of this has changed, especially when dealing with PDBs (e.g. #3001). I'm going to go ahead and close this and ask folks to open a separate issue to tackle any remaining issues if that's ok? |
Fixes #
Changes made in this Pull Request:
Small typo for the Mass of Zn
PR Checklist