Skip to content

Update tables.py#1808

Closed
guilleaf wants to merge 1 commit intoMDAnalysis:developfrom
guilleaf:patch-1
Closed

Update tables.py#1808
guilleaf wants to merge 1 commit intoMDAnalysis:developfrom
guilleaf:patch-1

Conversation

@guilleaf
Copy link

@guilleaf guilleaf commented Mar 7, 2018

Fixes #

Changes made in this Pull Request:

Small typo for the Mass of Zn

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@orbeckst
Copy link
Member

orbeckst commented Mar 7, 2018

Hi @guilleaf welcome to MDAnalysis and thank you for contributing!

Is it a problem for you that Zn is spelled ZN?

There is a reason that some of the elements are capitalized. This is how they appear in common forcefields. (We could probably do something about this and spell elements with correct element names and all-caps everything and make the atom-guessers do case-insensitive comparisons, but then this would involve a bit more work.)

@orbeckst
Copy link
Member

orbeckst commented Mar 7, 2018

Admittedly, I just ran into the situation where I am loading a PDB-type file with Zn and get

~/Projects/Methods/MDAnalysis/mdanalysis/package/MDAnalysis/topology/guessers.py:72: UserWarning: Failed to guess the mass for the following atom types: Zn
  warnings.warn("Failed to guess the mass for the following atom types: {}".format(atom_type))

So I concede the point that something ought to be done about it...

@orbeckst
Copy link
Member

orbeckst commented Mar 7, 2018

@guilleaf Could you please raise an issue that states the problem that you're having and then we can use the issue report to discuss what ought to be done about it?

@guilleaf
Copy link
Author

guilleaf commented Mar 7, 2018 via email

@guilleaf
Copy link
Author

guilleaf commented Mar 7, 2018 via email

@orbeckst
Copy link
Member

orbeckst commented Mar 7, 2018

@guilleaf many thanks for the detailed feedback. We'll have to think about how to handle this. In principle I agree with your clean/dirty approach.

@richardjgowers
Copy link
Member

The fun corner case is @orbeckst et al work in a world of Ca being Carbon alpha :). I think we can just have both Zn and ZN in the table at no risk.

@orbeckst
Copy link
Member

orbeckst commented Mar 8, 2018 via email

@orbeckst
Copy link
Member

orbeckst commented Mar 8, 2018

@guilleaf would you like to update your PR and duplicate the upper-case elements (except CA) to lower case?

You'll also have to add an entry to CHANGELOG and add yourself to AUTHORS.

@guilleaf
Copy link
Author

guilleaf commented Mar 9, 2018 via email

@guilleaf
Copy link
Author

guilleaf commented Mar 9, 2018 via email

@guilleaf
Copy link
Author

guilleaf commented Mar 9, 2018 via email

@orbeckst
Copy link
Member

@guilleaf thanks for taking the time to look at the code and share your ideas. You raise a number of good points.

CA vs Ca... and upper case "elements"

I had a look at the masses table

specifically the entry for "CA" and we clearly treat it as calcium.

There really does not seem to exist a good reason for us to keep upper case element names, especially as we have TABLE_ATOMELEMENTS

TABLE_ATOMELEMENTS = """
as you pointed out in #1808 (comment)

All we should do is make the TABLE_MASSES and the TABLE_ATOMELEMENTS consistent.

  1. make TABLE_MASSES lower case so that they correspond to proper element symbols
  2. add any upper/lower case translations to TABLE_ATOMELEMENTS
  3. explicitly add "CA" as a "C" to table TABLE_ATOMELEMENTS to make this choice somewhat more transparent. (In biomolecular simulation CA is much more likely to mean "C-alpha" atom than calcium so I would want to go with the dominant semantics.)

Philosopy

Guessing is a bit of a problem because inherently we are biased to a certain domain, which carries the risk that scientists from other domains do not get the same benefits or worse, get wrong results. Perhaps we should fail cleanly with a good error message if we cannot guarantee that we're making a good guess.

The real solution is for the user to provide un-ambiguous input files.

However, many users like the convenience of not having to do that...

We could use a guess=False flag by default and tell people that with guess=True they run the risk of wrong results. In an ideal world we could flag all guesses that are ambiguous. Given the amount of different data and formats that people throw into MDAnalysis, this is not really feasible. We can catch cases that we are aware of but that's not providing any certainty.

using kv2dict

Avoid the use of kv2dict and setup those tables as python dictionaries directly on the code. If you use "masses" frequently on your code, you will be converting and reconverting the string into a python dictionary.

It is actually not true that we are creating masses repeatedly. The masses = kv2dic()

masses = kv2dict(TABLE_MASSES, convertor=float)

is run exactly once, the very first time that topology.table is imported somewhere. From then on, masses exists. The Python interpreter is not reloading the module or recalculating masses every time. Writing it with kv2dict is just a quick way to have the data in a human-readable form without having it reside in a separate file. There's no appreciable performance penalty.

@orbeckst
Copy link
Member

orbeckst commented Mar 10, 2018

If Ca can be interpreted as Carbon alpha is strong indication about how polluted can be the external world.

More precisely, "CA", but many file formats do not care about case.

I respectfully disagree with labelling the choice of "CA" as C-alpha as wrong. It depends on your domain. CA is not really supposed to be an element, it's a name, and if the name is the only thing we can use to guess then the better guess in biomolecular simulations is that it is a protein C-alpha atom. If you can use more information then the guess can get better (it depends on your prior...). For instance, if you know that it is inside a known protein residue, than it is almost certainly carbon. If it is in an ATOM record in a PDB file it is carbon per PDB standard, if it is HETATM then it is calcium... only most programs couldn't care less about how they write PDB files and just use ATOM for everything and we have to deal with what's out there in the wild.

Just telling users "your file format is broken, complain to <insert widely adopted program here>" is not a good strategy to keep users. But if you have ideas how to handle these cases, we're all ears. We have to deal with this situation on a constant basis.

@jbarnoud
Copy link
Contributor

In what case do we get the element of a calbon alpha set to CA? CA is the name for a carbon alpha, but what would set it as the element? This is what we need to fix. I agree with @guilleaf here.

@RMeli
Copy link
Member

RMeli commented May 18, 2019

Hi all, I still get similar errors with MDAnalysis 0.19.3-dev:

~/software/python/mdanalysis-develop/package/MDAnalysis/topology/guessers.py:73: UserWarning: Failed to guess the mass for the following atom types: Mg
  warnings.warn("Failed to guess the mass for the following atom types: {}".format(atom_type))
~/software/python/mdanalysis-develop/package/MDAnalysis/topology/guessers.py:73: UserWarning: Failed to guess the mass for the following atom types: Zn
  warnings.warn("Failed to guess the mass for the following atom types: {}".format(atom_type))
~/software/python/mdanalysis-develop/package/MDAnalysis/topology/guessers.py:73: UserWarning: Failed to guess the mass for the following atom types: Fe
  warnings.warn("Failed to guess the mass for the following atom types: {}".format(atom_type))
~/software/python/mdanalysis-develop/package/MDAnalysis/topology/guessers.py:73: UserWarning: Failed to guess the mass for the following atom types: Ca
  warnings.warn("Failed to guess the mass for the following atom types: {}".format(atom_type))

Is there any update on this issue?

@orbeckst
Copy link
Member

Hi @RMeli , could you please raise an issue for your specific problem? You can refer to this PR but I'd prefer to have initial discussions/questions on a proper issue. The discussion here touched on various aspects of element guessing with no clear conclusion yet and if we can just solve a sub-problem then that would be a good step forward. Thanks!

@IAlibay
Copy link
Member

IAlibay commented Feb 8, 2022

So a lot of this has changed, especially when dealing with PDBs (e.g. #3001).

I'm going to go ahead and close this and ask folks to open a separate issue to tackle any remaining issues if that's ok?

@IAlibay IAlibay closed this Feb 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

close? Evaluate if issue/PR is stale and can be closed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants