Skip to content

Add CD benchmarks to manuscript#47

Merged
davidlmobley merged 40 commits intomasterfrom
nieldev
Aug 26, 2017
Merged

Add CD benchmarks to manuscript#47
davidlmobley merged 40 commits intomasterfrom
nieldev

Conversation

@nhenriksen
Copy link
Collaborator

No description provided.

@jchodera
Copy link
Contributor

jchodera commented Aug 8, 2017

It looks like a lot of those mol2 files are post-antechamber with GAFF atom types. Since these are non-compliant mol2 files that cannot be used as standard Tripos mol2 files, don't you want to put those in a special directory that isn't named something like mol2/ and put lots of warnings in the README? It would also be great to make sure the Tripos-format mol2 files are included for each of those.

@jchodera
Copy link
Contributor

jchodera commented Aug 8, 2017

Also, if those antechamber-format mol2 files are intended for use with a very specific version of gaff.dat, you may want to include that version here.

500+ files is also a lot. Many of these formats (mol2, SDF) support multi-molecule files where all the ligands can be compiled into a single file. That may be helpful here!

@nhenriksen
Copy link
Collaborator Author

nhenriksen commented Aug 8, 2017

The original CB7 and OA files were added by @Janeyin600, so I followed her format for the CDs. I agree that the current mol2 format is not ideal for universal usage. I like the idea of putting them all into a single file.

The GAFF version is mentioned in the README (GAFF v1.7).

@davidlmobley
Copy link
Member

Thanks, @nhenriksen ! I'm super swamped right now because of a couple of deadlines early next week, but hopefully can review this in detail next week.

In terms of mol2 files -- I think I agree it would be good if we provided GAFF and "standard" mol2 files of everything. OpenEye tools can do this conversion easily. If you're on board with that, I could add a script for it to the repo and probably handle conversions of everything (or you can do it if you guys have your OE license renewed yet). Thoughts?

@nhenriksen
Copy link
Collaborator Author

@davidlmobley I think I can get this done today.

@nhenriksen
Copy link
Collaborator Author

I've redone all the CD benchmark files. All the mol2 files use SYBYL atom types now. They are all generated from the original bound structures (secondary orientation), which means that all guest coordinates fit into the host coordinates from a single file (that way the aligning procedure that @andrrizzi used is no longer necessary). I've also generated multi-molecule versions for the .mol2, .sdf, and .pdb formats.

Before making commits, I just want to follow-up on some items:

  • How do we feel about using multi-molecule files instead of the one-guest-per-file approach? On reflection, although it is cleaner for archival purposes, the multi-molecule format might be more annoying for users to parse. I currently am in favor of one-guest-per-file approach, but don't feel super strongly about it.
  • Do we need copies of the .mol2 files with GAFF atom types? The new files I've generated have SYBYL. If someone really wants GAFF types they could just run the .mol2 files through antechamber.
  • If you recall, the guests in these benchmarks have effectively two binding orientations that don't really interchange on the simulation timescale. Now that the host and guest .mol2 coordinates all represent the secondary (wider end) conformation, it might bias a users expectation about the orientation. I plan to highlight this in the README.
  • I've placed CONECT lines in the host PDB files. It's necessary to get some viewers to connect the first and last residues into a ring. Let me know if there are any foreseeable issues with CONECT lines.

@davidlmobley
Copy link
Member

@nhenriksen

How do we feel about using multi-molecule files instead of the one-guest-per-file approach? On reflection, although it is cleaner for archival purposes, the multi-molecule format might be more annoying for users to parse. I currently am in favor of one-guest-per-file approach, but don't feel super strongly about it.

I think @jchodera has a bit of an aversion to "lots of files in the repo", but I agree with you that for most "normal simulation" users, multi-molecule mol2 files are not useful -- they basically mean you either need to rely on a chemistry toolkit for parsing them (OpenEye software, RDKit, etc.) or write your own tool to separate them out into individual files. Thus, my preference is actually to have the individual files, OR provide a tool which can break them into these, otherwise we're just asking other people to do the work so that our repo can stay tidy. Sorry, I should have answered that aspect before.

Do we need copies of the .mol2 files with GAFF atom types? The new files I've generated have SYBYL. If someone really wants GAFF types they could just run the .mol2 files through antechamber.

I don't think we need GAFF types, no.

If you recall, the guests in these benchmarks have effectively two binding orientations that don't really interchange on the simulation timescale. Now that the host and guest .mol2 coordinates all represent the secondary (wider end) conformation, it might bias a users expectation about the orientation. I plan to highlight this in the README.

Sounds great.

I've placed CONECT lines in the host PDB files. It's necessary to get some viewers to connect the first and last residues into a ring. Let me know if there are any foreseeable issues with CONECT lines.

They should be there! So I'm all for having them. :) I don't think our tools will correctly process the PDB files without them.

@nhenriksen
Copy link
Collaborator Author

I've updated the input files and READMEs for the host-guest benchmark files. (For some reason I screwed up the commit process and it did it twice ... not sure how that was possible.)
Updates include:

  • New .mol2, .pdb, and .sdf files for the CD benchmarks. These are now in a proper bound state conformation to a single host molecule. I've added header/remark lines in these files identifying them.
  • Updated the README files in cd-set1 and cd-set2 to include host and guest names along with a spot for reference computational studies. There are few more papers I can add besides mine, but will do in the future.
  • Changed the .sd extension to be .sdf as I think that is more standard ...?

Things to be addressed in the future:

  • The CB7 and GDCC benchmarks do not have a README providing data or references to other computational works.
  • The CB7 and GDCC guest input files do not have coordinates which correspond to a bound state in the host. They are close, but clearly not a plausible bound state. Jane made these files, and I don't see a way to fix this without manually setting them up or extracting conformations from the equilibrated prmtop/rst7 files.
  • The BRD4 benchmark .sd files do not appear to be valid .mol or .sdf format. I'm actually not sure who uses this format, but at least Chimera won't open it. These guidelines seem to indicate that something bad happened when Germano converted with Babel. But I'm not familiar enough to really know.

@davidlmobley
Copy link
Member

@GHeinzelmann - can you see Niel's note above? Specifically:

The BRD4 benchmark .sd files do not appear to be valid .mol or .sdf format. I'm actually not sure who uses this format, but at least Chimera won't open it. These guidelines seem to indicate that something bad happened when Germano converted with Babel. But I'm not familiar enough to really know.

@nhenriksen - thanks for all the work. I'll review as soon as I can. At the moment I'm slightly swamped dealing with SAMPL so I'm not going to get to it as soon as I hoped to (I'd hoped to have reviewed it already!). It looks like you're being quite thorough though.

I can create issues for these:

Things to be addressed in the future:

  • The CB7 and GDCC benchmarks do not have a README providing data or references to other computational works.
  • The CB7 and GDCC guest input files do not have coordinates which correspond to a bound state in the host. They are close, but clearly not a plausible bound state. Jane made these files, and I don't see a way to fix this without manually setting them up or extracting conformations from the equilibrated prmtop/rst7 files.

I think I can fix the second of these, at least for the GDCC case, by docking into the host using the code I just figured out. Alternatively your strategy of extracting from the equilibrated files should work. Do you have thoughts on which is preferable? @Janeyin600 might also want to weigh in?

@nhenriksen
Copy link
Collaborator Author

@davidlmobley Whenever you get to reviewing is fine, no urgency on my part.

If you want to get the CB7/GDCC guests docked with your code, that would be great!

Copy link
Member

@davidlmobley davidlmobley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nhenriksen - overall this looks great! Thanks so much for all your work. I'll put some feedback here for you to address (sorry if it's not formatted that nicely but I'm in an airport between flights and only have a moment to info dump) and also will submit a PR to your branch that includes some changes that address some minor issues.

Changes suggested:

  • The updated Figure 1 removes TEMOA and adds beta-CD, though the benchmark includes both beta-CD and gamma-CD. I think my preference would be to just expand the figure (making it two rows perhaps) to keep TEMOA and add beta and gamma-CD
  • The Figure 1 caption is not correct since it still refers to TEMOA, and doesn't say which cyclodextrin is pictured (but see point just prior)
  • I'm thinking it would be convenient for Fig. 1 to say which is the "wider end" vs the "narrower end"; I didn't notice they were asymmetric at first from just the text.
  • "The behavior of phosphate ions, particularly the HPO_4^{2-} species, can be sensitive to the particular choice of water model..." seems to need a reference.
  • The SMILES for the aromatic guests use different kekulization than those in the other tables; e.g. benzoic acid shows up in Table IV and Table VI with different SMILES. If you guys have the OpenEye tools again I can easily generate SMILES for you, or you can do it, but I'd prefer to treat them consistently. (To say it another way: The aromatic rings get "aromatic carbon" (small c1ccc(cc1)) notation in all my SMILES strings, whereas yours use alternating singles and doubles (C1=CC=C(C=C1)) notation.

Questions:

  • "Despite the flexibility of CD hosts, the small size of the guests combined with long timescale simulations enable by GPUs should allow adequate convergence of these thermodynamic values." Would it be accurate to say, "appears to allow adequate convergence..." and give a reference to your study? This would be a little stronger of a statement.
  • "Since the cavity is both narrow and hydrophobic, interchange between the two orientations is usually slow and requires release into the solvent and subsequent rebinding"; can you say anything about the timescales for this from your experience or prior work, e.g. "These events may take as long as XYZ microseconds for some guests" or similar?
  • Your tables use "mol/kg" units for buffer. I realize this is probably what the experiments were reported with, but it's inconvenient for computational folks and could introduce a source of error in setup (if everyone is converting). Maybe we want to put "(approximately XYZ mM)" in parens to help people out?

I took the liberty of going ahead and adding you as an author (will be in my PR)! Thanks for all your work on this. It's much appreciated.

@davidlmobley
Copy link
Member

Also, I forgot this comment:

  • Your README.md files in the cd-set1 and cd-set2 directories are beautiful and sure to be very helpful. Do you want to also provide isomeric SMILES in these? I'm OK if not, but I'm thinking that will increase the likelihood that people will actually do something with them (makes it easier to generate input files)

@davidlmobley
Copy link
Member

BTW I think we are almost ready to merge, so once you get these edits in (plus my PR) we should go ahead and send the LaTeX/PDF to Mike so he can edit.

@GHeinzelmann - please also review. :)

Incorporate minor edits of new cyclodextrin material into nieldev branch
@nhenriksen
Copy link
Collaborator Author

A few clarifications ...

Changes suggested:

  • The updated Figure 1 removes TEMOA and adds beta-CD, though the benchmark includes both beta-CD and gamma-CD. I think my preference would be to just expand the figure (making it two rows perhaps) to keep TEMOA and add beta and gamma-CD

The figure is already 3 rows due to the various views and takes up slightly more than half the page. My thinking was that due to the similarity of OA/TEMOA and α-CD/β-CD, we could just use a single representative. But assuming we do want to display all 5 hosts, shall I aim for a full page figure then?

  • I'm thinking it would be convenient for Fig. 1 to say which is the "wider end" vs the "narrower end"; I didn't notice they were asymmetric at first from just the text.

Good idea.

  • "The behavior of phosphate ions, particularly the HPO_4^{2-} species, can be sensitive to the particular choice of water model..." seems to need a reference.

I don't have a reference for that because it's unpublished work. We could add an "(unpublished)" to it?

  • The SMILES for the aromatic guests use different kekulization than those in the other tables; e.g. benzoic acid shows up in Table IV and Table VI with different SMILES. If you guys have the OpenEye tools again I can easily generate SMILES for you, or you can do it, but I'd prefer to treat them consistently. (To say it another way: The aromatic rings get "aromatic carbon" (small c1ccc(cc1)) notation in all my SMILES strings, whereas yours use alternating singles and doubles (C1=CC=C(C=C1)) notation.

Good point, I'll redo those.

Questions:

  • "Despite the flexibility of CD hosts, the small size of the guests combined with long timescale simulations enable by GPUs should allow adequate convergence of these thermodynamic values." Would it be accurate to say, "appears to allow adequate convergence..." and give a reference to your study? This would be a little stronger of a statement.

Sure, that works.

  • "Since the cavity is both narrow and hydrophobic, interchange between the two orientations is usually slow and requires release into the solvent and subsequent rebinding"; can you say anything about the timescales for this from your experience or prior work, e.g. "These events may take as long as XYZ microseconds for some guests" or similar?

It's hard to say, since I haven't done any long time-scale on/off type simulations. For strong binders, I don't see any unbinding on the microsecond timescale, so interchange would definitely exceed that due to all the additional time it takes to diffuse around the box once it does successfully unbind. I could say something like "Sampling these events is expected to require microsecond-scale simulations or longer."

  • Your tables use "mol/kg" units for buffer. I realize this is probably what the experiments were reported with, but it's inconvenient for computational folks and could introduce a source of error in setup (if everyone is converting). Maybe we want to put "(approximately XYZ mM)" in parens to help people out?

Actually I think molality is nicer to deal with than molarity, since it won't depend on volume/temp./pressure. But I agree that it is less common. At lower concentrations, it will have essentially the same molarity as molality since 1 kg of the solution will be about 1 Liter. I couldn't find any exact data for density of 50 mM sodium phosphate, but I did find it for potassium phosphate, which shows that it is very close to 1 g/cm^3. So yes, I think we could say "(approximately 50 mM)" in the table footnotes.

  • Your README.md files in the cd-set1 and cd-set2 directories are beautiful and sure to be very helpful. Do you want to also provide isomeric SMILES in these? I'm OK if not, but I'm thinking that will increase the likelihood that people will actually do something with them (makes it easier to generate input files)

Yes, I can add them ... although the table might get squeezed for space. I'll try it out.

- added smiles strings
- updated naming to reflect expected protonation state.
- added smiles
- edited names to reflect protonation state
@davidlmobley
Copy link
Member

@nhenriksen

The figure is already 3 rows due to the various views and takes up slightly more than half the page. My thinking was that due to the similarity of OA/TEMOA and α-CD/β-CD, we could just use a single representative. But assuming we do want to display all 5 hosts, shall I aim for a full page figure then?

True, it was getting back. But, we're electronic-only now and don't have to worry about page limits, so, why not full-page? Or we can split it into three separate figures, one for each category of host, if having a full-page figure bothers you. :) I'd rather be able to see all three of them, myself -- otherwise first thing I'll do wonder what the other ones look like and wish for a figure. Ha.

I don't have a reference for that because it's unpublished work. We could add an "(unpublished)" to it?

Sure, cite yourself...? I think we just need to say how we know.

I could say something like "Sampling these events is expected to require microsecond-scale simulations or longer."

That sounds good to me; you could even do better and say "... because previous work has not observed binding and unbinding events in microsecond-length simulations [ref your paper]."

Otherwise, all sounds good.

@nhenriksen
Copy link
Collaborator Author

@davidlmobley I've addressed your requested changes in the latest commit. I had to shrink the images in Fig 1 to get them to fit. The figure might work better on a landscape page rather than portrait, but that might be more trouble than it's worth to do. All the .png's are in the figure_source_files directory if you have ideas for different formats.

@davidlmobley
Copy link
Member

@nhenriksen - Thanks! Hopefully I'll review late today. I'll go ahead and send TeX/PDF to Mike, CC you.

The Zhang 2016 paper has a bunch more of the same guests, but they were all treated as neutral, whereas the experiment had the guests charged.  I've omitted them to avoid confusion.

Also, I just realized that names like "4-methylphenylacetate" may have caused confusion in the literature.  Rekharsky 1997 used that name to refer to the charged version of 4-methylphenylacetic acid, but both Wickstrom 2013 or Zhang 2016 interpreted it as the structure that is more commonly known as "p-tolyl acetate" which is totally different.  This occurred for several different related compounds.  I've left those off the list as well, because clearly the numbers won't be comparable.
Copy link
Member

@davidlmobley davidlmobley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nhenriksen - this is looking great. README files look excellent, the new Figure 1 looks good to me, and the changes appear to have addressed my issues.

I'm marking this as approved. I think we just need Mike to take a look, but I've sent it to him and CCd him, so hopefully that doesn't take long.

Again, thank you for all your work on this. I hope it feels worthwhile -- I'm convinced it will be (people are going to use your data/files/etc.!).

@davidlmobley
Copy link
Member

Also, random comments on your last commit:

The Zhang 2016 paper has a bunch more of the same guests, but they were all treated as neutral, whereas the experiment had the guests charged. I've omitted them to avoid confusion.

Gosh, that's a huge issue. Yes, definitely want to avoid confusion.

Also, I just realized that names like "4-methylphenylacetate" may have caused confusion in the literature. Rekharsky 1997 used that name to refer to the charged version of 4-methylphenylacetic acid, but both Wickstrom 2013 or Zhang 2016 interpreted it as the structure that is more commonly known as "p-tolyl acetate" which is totally different. This occurred for several different related compounds. I've left those off the list as well, because clearly the numbers won't be comparable.

This is part of why I don't like to use compound names as identifiers ever -- using a compound name is begging a human to interpret it (partly because automated name parsers sometimes make mistakes, and/or can't interpret names that humans can), and once you get a human involved they will misinterpret things. Hence, SMILES and/or PubChem compound IDs...

I'm starting to work with some people on some best practices documents for MD simulations. I should probably mention the "please don't use compound names" thing in those. :)

Copy link
Member

@davidlmobley davidlmobley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @nhenriksen !!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants