Add CD benchmarks to manuscript by nhenriksen · Pull Request #47 · MobleyLab/benchmarksets

nhenriksen · 2017-08-08T16:22:07Z

No description provided.

…ration is still in progress and will be updated)

…cket

CD mol2/pdb/sd files coordinates starting in the binding pocket

jchodera · 2017-08-08T21:41:34Z

It looks like a lot of those mol2 files are post-antechamber with GAFF atom types. Since these are non-compliant mol2 files that cannot be used as standard Tripos mol2 files, don't you want to put those in a special directory that isn't named something like mol2/ and put lots of warnings in the README? It would also be great to make sure the Tripos-format mol2 files are included for each of those.

jchodera · 2017-08-08T21:42:47Z

Also, if those antechamber-format mol2 files are intended for use with a very specific version of gaff.dat, you may want to include that version here.

500+ files is also a lot. Many of these formats (mol2, SDF) support multi-molecule files where all the ligands can be compiled into a single file. That may be helpful here!

nhenriksen · 2017-08-08T21:51:28Z

The original CB7 and OA files were added by @Janeyin600, so I followed her format for the CDs. I agree that the current mol2 format is not ideal for universal usage. I like the idea of putting them all into a single file.

The GAFF version is mentioned in the README (GAFF v1.7).

davidlmobley · 2017-08-10T16:20:17Z

Thanks, @nhenriksen ! I'm super swamped right now because of a couple of deadlines early next week, but hopefully can review this in detail next week.

In terms of mol2 files -- I think I agree it would be good if we provided GAFF and "standard" mol2 files of everything. OpenEye tools can do this conversion easily. If you're on board with that, I could add a script for it to the repo and probably handle conversions of everything (or you can do it if you guys have your OE license renewed yet). Thoughts?

nhenriksen · 2017-08-10T16:58:57Z

@davidlmobley I think I can get this done today.

nhenriksen · 2017-08-14T23:23:52Z

I've redone all the CD benchmark files. All the mol2 files use SYBYL atom types now. They are all generated from the original bound structures (secondary orientation), which means that all guest coordinates fit into the host coordinates from a single file (that way the aligning procedure that @andrrizzi used is no longer necessary). I've also generated multi-molecule versions for the .mol2, .sdf, and .pdb formats.

Before making commits, I just want to follow-up on some items:

How do we feel about using multi-molecule files instead of the one-guest-per-file approach? On reflection, although it is cleaner for archival purposes, the multi-molecule format might be more annoying for users to parse. I currently am in favor of one-guest-per-file approach, but don't feel super strongly about it.
Do we need copies of the .mol2 files with GAFF atom types? The new files I've generated have SYBYL. If someone really wants GAFF types they could just run the .mol2 files through antechamber.
If you recall, the guests in these benchmarks have effectively two binding orientations that don't really interchange on the simulation timescale. Now that the host and guest .mol2 coordinates all represent the secondary (wider end) conformation, it might bias a users expectation about the orientation. I plan to highlight this in the README.
I've placed CONECT lines in the host PDB files. It's necessary to get some viewers to connect the first and last residues into a ring. Let me know if there are any foreseeable issues with CONECT lines.

davidlmobley · 2017-08-14T23:29:00Z

@nhenriksen

How do we feel about using multi-molecule files instead of the one-guest-per-file approach? On reflection, although it is cleaner for archival purposes, the multi-molecule format might be more annoying for users to parse. I currently am in favor of one-guest-per-file approach, but don't feel super strongly about it.

I think @jchodera has a bit of an aversion to "lots of files in the repo", but I agree with you that for most "normal simulation" users, multi-molecule mol2 files are not useful -- they basically mean you either need to rely on a chemistry toolkit for parsing them (OpenEye software, RDKit, etc.) or write your own tool to separate them out into individual files. Thus, my preference is actually to have the individual files, OR provide a tool which can break them into these, otherwise we're just asking other people to do the work so that our repo can stay tidy. Sorry, I should have answered that aspect before.

Do we need copies of the .mol2 files with GAFF atom types? The new files I've generated have SYBYL. If someone really wants GAFF types they could just run the .mol2 files through antechamber.

I don't think we need GAFF types, no.

If you recall, the guests in these benchmarks have effectively two binding orientations that don't really interchange on the simulation timescale. Now that the host and guest .mol2 coordinates all represent the secondary (wider end) conformation, it might bias a users expectation about the orientation. I plan to highlight this in the README.

Sounds great.

I've placed CONECT lines in the host PDB files. It's necessary to get some viewers to connect the first and last residues into a ring. Let me know if there are any foreseeable issues with CONECT lines.

They should be there! So I'm all for having them. :) I don't think our tools will correctly process the PDB files without them.

nhenriksen · 2017-08-16T00:30:26Z

I've updated the input files and READMEs for the host-guest benchmark files. (For some reason I screwed up the commit process and it did it twice ... not sure how that was possible.)
Updates include:

New .mol2, .pdb, and .sdf files for the CD benchmarks. These are now in a proper bound state conformation to a single host molecule. I've added header/remark lines in these files identifying them.
Updated the README files in cd-set1 and cd-set2 to include host and guest names along with a spot for reference computational studies. There are few more papers I can add besides mine, but will do in the future.
Changed the .sd extension to be .sdf as I think that is more standard ...?

Things to be addressed in the future:

The CB7 and GDCC benchmarks do not have a README providing data or references to other computational works.
The CB7 and GDCC guest input files do not have coordinates which correspond to a bound state in the host. They are close, but clearly not a plausible bound state. Jane made these files, and I don't see a way to fix this without manually setting them up or extracting conformations from the equilibrated prmtop/rst7 files.
The BRD4 benchmark .sd files do not appear to be valid .mol or .sdf format. I'm actually not sure who uses this format, but at least Chimera won't open it. These guidelines seem to indicate that something bad happened when Germano converted with Babel. But I'm not familiar enough to really know.

davidlmobley · 2017-08-16T16:40:32Z

@GHeinzelmann - can you see Niel's note above? Specifically:

The BRD4 benchmark .sd files do not appear to be valid .mol or .sdf format. I'm actually not sure who uses this format, but at least Chimera won't open it. These guidelines seem to indicate that something bad happened when Germano converted with Babel. But I'm not familiar enough to really know.

@nhenriksen - thanks for all the work. I'll review as soon as I can. At the moment I'm slightly swamped dealing with SAMPL so I'm not going to get to it as soon as I hoped to (I'd hoped to have reviewed it already!). It looks like you're being quite thorough though.

I can create issues for these:

Things to be addressed in the future:

The CB7 and GDCC benchmarks do not have a README providing data or references to other computational works.

The CB7 and GDCC guest input files do not have coordinates which correspond to a bound state in the host. They are close, but clearly not a plausible bound state. Jane made these files, and I don't see a way to fix this without manually setting them up or extracting conformations from the equilibrated prmtop/rst7 files.

I think I can fix the second of these, at least for the GDCC case, by docking into the host using the code I just figured out. Alternatively your strategy of extracting from the equilibrated files should work. Do you have thoughts on which is preferable? @Janeyin600 might also want to weigh in?

nhenriksen · 2017-08-16T16:46:07Z

@davidlmobley Whenever you get to reviewing is fine, no urgency on my part.

If you want to get the CB7/GDCC guests docked with your code, that would be great!

… one-sentence-per-line for better version control

davidlmobley

@nhenriksen - overall this looks great! Thanks so much for all your work. I'll put some feedback here for you to address (sorry if it's not formatted that nicely but I'm in an airport between flights and only have a moment to info dump) and also will submit a PR to your branch that includes some changes that address some minor issues.

Changes suggested:

The updated Figure 1 removes TEMOA and adds beta-CD, though the benchmark includes both beta-CD and gamma-CD. I think my preference would be to just expand the figure (making it two rows perhaps) to keep TEMOA and add beta and gamma-CD
The Figure 1 caption is not correct since it still refers to TEMOA, and doesn't say which cyclodextrin is pictured (but see point just prior)
I'm thinking it would be convenient for Fig. 1 to say which is the "wider end" vs the "narrower end"; I didn't notice they were asymmetric at first from just the text.
"The behavior of phosphate ions, particularly the HPO_4^{2-} species, can be sensitive to the particular choice of water model..." seems to need a reference.
The SMILES for the aromatic guests use different kekulization than those in the other tables; e.g. benzoic acid shows up in Table IV and Table VI with different SMILES. If you guys have the OpenEye tools again I can easily generate SMILES for you, or you can do it, but I'd prefer to treat them consistently. (To say it another way: The aromatic rings get "aromatic carbon" (small c1ccc(cc1)) notation in all my SMILES strings, whereas yours use alternating singles and doubles (C1=CC=C(C=C1)) notation.

Questions:

"Despite the flexibility of CD hosts, the small size of the guests combined with long timescale simulations enable by GPUs should allow adequate convergence of these thermodynamic values." Would it be accurate to say, "appears to allow adequate convergence..." and give a reference to your study? This would be a little stronger of a statement.
"Since the cavity is both narrow and hydrophobic, interchange between the two orientations is usually slow and requires release into the solvent and subsequent rebinding"; can you say anything about the timescales for this from your experience or prior work, e.g. "These events may take as long as XYZ microseconds for some guests" or similar?
Your tables use "mol/kg" units for buffer. I realize this is probably what the experiments were reported with, but it's inconvenient for computational folks and could introduce a source of error in setup (if everyone is converting). Maybe we want to put "(approximately XYZ mM)" in parens to help people out?

I took the liberty of going ahead and adding you as an author (will be in my PR)! Thanks for all your work on this. It's much appreciated.

davidlmobley · 2017-08-23T18:42:21Z

Also, I forgot this comment:

Your README.md files in the cd-set1 and cd-set2 directories are beautiful and sure to be very helpful. Do you want to also provide isomeric SMILES in these? I'm OK if not, but I'm thinking that will increase the likelihood that people will actually do something with them (makes it easier to generate input files)

davidlmobley · 2017-08-23T18:46:01Z

BTW I think we are almost ready to merge, so once you get these edits in (plus my PR) we should go ahead and send the LaTeX/PDF to Mike so he can edit.

@GHeinzelmann - please also review. :)

Incorporate minor edits of new cyclodextrin material into nieldev branch

nhenriksen · 2017-08-23T22:06:30Z

A few clarifications ...

Changes suggested:

The updated Figure 1 removes TEMOA and adds beta-CD, though the benchmark includes both beta-CD and gamma-CD. I think my preference would be to just expand the figure (making it two rows perhaps) to keep TEMOA and add beta and gamma-CD

The figure is already 3 rows due to the various views and takes up slightly more than half the page. My thinking was that due to the similarity of OA/TEMOA and α-CD/β-CD, we could just use a single representative. But assuming we do want to display all 5 hosts, shall I aim for a full page figure then?

I'm thinking it would be convenient for Fig. 1 to say which is the "wider end" vs the "narrower end"; I didn't notice they were asymmetric at first from just the text.

Good idea.

"The behavior of phosphate ions, particularly the HPO_4^{2-} species, can be sensitive to the particular choice of water model..." seems to need a reference.

I don't have a reference for that because it's unpublished work. We could add an "(unpublished)" to it?

The SMILES for the aromatic guests use different kekulization than those in the other tables; e.g. benzoic acid shows up in Table IV and Table VI with different SMILES. If you guys have the OpenEye tools again I can easily generate SMILES for you, or you can do it, but I'd prefer to treat them consistently. (To say it another way: The aromatic rings get "aromatic carbon" (small c1ccc(cc1)) notation in all my SMILES strings, whereas yours use alternating singles and doubles (C1=CC=C(C=C1)) notation.

Good point, I'll redo those.

Questions:

"Despite the flexibility of CD hosts, the small size of the guests combined with long timescale simulations enable by GPUs should allow adequate convergence of these thermodynamic values." Would it be accurate to say, "appears to allow adequate convergence..." and give a reference to your study? This would be a little stronger of a statement.

Sure, that works.

"Since the cavity is both narrow and hydrophobic, interchange between the two orientations is usually slow and requires release into the solvent and subsequent rebinding"; can you say anything about the timescales for this from your experience or prior work, e.g. "These events may take as long as XYZ microseconds for some guests" or similar?

It's hard to say, since I haven't done any long time-scale on/off type simulations. For strong binders, I don't see any unbinding on the microsecond timescale, so interchange would definitely exceed that due to all the additional time it takes to diffuse around the box once it does successfully unbind. I could say something like "Sampling these events is expected to require microsecond-scale simulations or longer."

Your tables use "mol/kg" units for buffer. I realize this is probably what the experiments were reported with, but it's inconvenient for computational folks and could introduce a source of error in setup (if everyone is converting). Maybe we want to put "(approximately XYZ mM)" in parens to help people out?

Actually I think molality is nicer to deal with than molarity, since it won't depend on volume/temp./pressure. But I agree that it is less common. At lower concentrations, it will have essentially the same molarity as molality since 1 kg of the solution will be about 1 Liter. I couldn't find any exact data for density of 50 mM sodium phosphate, but I did find it for potassium phosphate, which shows that it is very close to 1 g/cm^3. So yes, I think we could say "(approximately 50 mM)" in the table footnotes.

Your README.md files in the cd-set1 and cd-set2 directories are beautiful and sure to be very helpful. Do you want to also provide isomeric SMILES in these? I'm OK if not, but I'm thinking that will increase the likelihood that people will actually do something with them (makes it easier to generate input files)

Yes, I can add them ... although the table might get squeezed for space. I'll try it out.

- added smiles strings - updated naming to reflect expected protonation state.

- added smiles - edited names to reflect protonation state

davidlmobley · 2017-08-24T01:14:14Z

@nhenriksen

The figure is already 3 rows due to the various views and takes up slightly more than half the page. My thinking was that due to the similarity of OA/TEMOA and α-CD/β-CD, we could just use a single representative. But assuming we do want to display all 5 hosts, shall I aim for a full page figure then?

True, it was getting back. But, we're electronic-only now and don't have to worry about page limits, so, why not full-page? Or we can split it into three separate figures, one for each category of host, if having a full-page figure bothers you. :) I'd rather be able to see all three of them, myself -- otherwise first thing I'll do wonder what the other ones look like and wish for a figure. Ha.

I don't have a reference for that because it's unpublished work. We could add an "(unpublished)" to it?

Sure, cite yourself...? I think we just need to say how we know.

I could say something like "Sampling these events is expected to require microsecond-scale simulations or longer."

That sounds good to me; you could even do better and say "... because previous work has not observed binding and unbinding events in microsecond-length simulations [ref your paper]."

Otherwise, all sounds good.

…ldev

nhenriksen · 2017-08-24T23:00:03Z

@davidlmobley I've addressed your requested changes in the latest commit. I had to shrink the images in Fig 1 to get them to fit. The figure might work better on a landscape page rather than portrait, but that might be more trouble than it's worth to do. All the .png's are in the figure_source_files directory if you have ideas for different formats.

davidlmobley · 2017-08-25T12:52:30Z

@nhenriksen - Thanks! Hopefully I'll review late today. I'll go ahead and send TeX/PDF to Mike, CC you.

The Zhang 2016 paper has a bunch more of the same guests, but they were all treated as neutral, whereas the experiment had the guests charged. I've omitted them to avoid confusion. Also, I just realized that names like "4-methylphenylacetate" may have caused confusion in the literature. Rekharsky 1997 used that name to refer to the charged version of 4-methylphenylacetic acid, but both Wickstrom 2013 or Zhang 2016 interpreted it as the structure that is more commonly known as "p-tolyl acetate" which is totally different. This occurred for several different related compounds. I've left those off the list as well, because clearly the numbers won't be comparable.

davidlmobley

@nhenriksen - this is looking great. README files look excellent, the new Figure 1 looks good to me, and the changes appear to have addressed my issues.

I'm marking this as approved. I think we just need Mike to take a look, but I've sent it to him and CCd him, so hopefully that doesn't take long.

Again, thank you for all your work on this. I hope it feels worthwhile -- I'm convinced it will be (people are going to use your data/files/etc.!).

davidlmobley · 2017-08-25T20:29:40Z

Also, random comments on your last commit:

The Zhang 2016 paper has a bunch more of the same guests, but they were all treated as neutral, whereas the experiment had the guests charged. I've omitted them to avoid confusion.

Gosh, that's a huge issue. Yes, definitely want to avoid confusion.

Also, I just realized that names like "4-methylphenylacetate" may have caused confusion in the literature. Rekharsky 1997 used that name to refer to the charged version of 4-methylphenylacetic acid, but both Wickstrom 2013 or Zhang 2016 interpreted it as the structure that is more commonly known as "p-tolyl acetate" which is totally different. This occurred for several different related compounds. I've left those off the list as well, because clearly the numbers won't be comparable.

This is part of why I don't like to use compound names as identifiers ever -- using a compound name is begging a human to interpret it (partly because automated name parsers sometimes make mistakes, and/or can't interpret names that humans can), and once you get a human involved they will misinterpret things. Hence, SMILES and/or PubChem compound IDs...

I'm starting to work with some people on some best practices documents for MD simulations. I should probably mention the "please don't use compound names" thing in those. :)

davidlmobley

Thanks, @nhenriksen !!

nhenriksen and others added 16 commits May 4, 2017 17:18

Updated input_files README file, starting adding input files (equilib…

85aa097

…ration is still in progress and will be updated)

added CD sets and updated file naming schemes

10b5175

removed old directories

25ab907

Updated CD experimental data

428ebb1

Merge branch 'master' into nieldev

d3d5faf

Updated Fig 1 to include bCD

1b88fb1

Test out making structures with openeye

238e24a

added some more guest SMILES, stereochemistry works

c040dce

Merge branch 'master' of github.com:MobleyLab/benchmarksets

27d6017

Added tables and partial text for CDs section

c369804

Found and created a few rst7 files which were not created in cd-set1

daa8b2e

Change mol2 pdb and sd CD guests positions to start in the binding po…

99464e3

…cket

Merge pull request #44 from andrrizzi/mol-positions

401827c

CD mol2/pdb/sd files coordinates starting in the binding pocket

Added prior studies text to CDs section

9fd1945

added additional CD citations to bib and manuscript

f4c890d

Merged with master

f85a8ae

nhenriksen added 3 commits August 15, 2017 17:09

Updated the cd-set1 and cd-set1 input files and READMEs

94b7378

Updated the cd-set1 and cd-set1 input files and READMEs

24707b9

Updated filename extensions from .sd --> .sdf

8479f44

GHeinzelmann closed this Aug 16, 2017

davidlmobley added 8 commits August 23, 2017 09:30

Finish editing manuscript

ee78776

Remove extraneous comment

c45f614

Condense discussion of prior CD computational work

0c11d90

Make main README.md current

04a466d

Update authors in license.

f12e5d8

Update README.md to provide a manifest/background info, and format as…

012771b

… one-sentence-per-line for better version control

Further clarifications in input_files/README.md

f434ffb

Forgot to commit final PDF?

7aee827

davidlmobley requested changes Aug 23, 2017

View reviewed changes

nhenriksen added 2 commits August 23, 2017 12:25

get pdf merged

8304ee2

Merge pull request #51 from MobleyLab/nieldev_mobley

a3d2f62

Incorporate minor edits of new cyclodextrin material into nieldev branch

nhenriksen added 3 commits August 23, 2017 17:26

Update README.md

bfd9f76

- added smiles strings - updated naming to reflect expected protonation state.

Update README.md

443c519

- added smiles - edited names to reflect protonation state

Update README.md

4861a60

nhenriksen added 2 commits August 24, 2017 15:55

Addressed @davidlmobley requested changes for PR #47

5554d21

Merge branch 'nieldev' of github.com:MobleyLab/benchmarksets into nie…

4d02046

…ldev

davidlmobley approved these changes Aug 25, 2017

View reviewed changes

Added Mike's changes, changed table uncertainty to %.2f

2d462ed

davidlmobley approved these changes Aug 26, 2017

View reviewed changes

davidlmobley merged commit 4ba062a into master Aug 26, 2017

davidlmobley mentioned this pull request Aug 26, 2017

Update/reorganize input files for BRD4(1); rename ligands in tables for consistency with rest of table #48

Merged

davidlmobley mentioned this pull request Sep 23, 2020

Include additional data, references #72

Open

20 tasks

Conversation

nhenriksen commented Aug 8, 2017

Uh oh!

jchodera commented Aug 8, 2017

Uh oh!

jchodera commented Aug 8, 2017

Uh oh!

nhenriksen commented Aug 8, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

davidlmobley commented Aug 10, 2017

Uh oh!

nhenriksen commented Aug 10, 2017

Uh oh!

nhenriksen commented Aug 14, 2017

Uh oh!

davidlmobley commented Aug 14, 2017

Uh oh!

nhenriksen commented Aug 16, 2017

Uh oh!

davidlmobley commented Aug 16, 2017

Uh oh!

nhenriksen commented Aug 16, 2017

Uh oh!

davidlmobley left a comment

Choose a reason for hiding this comment

Uh oh!

davidlmobley commented Aug 23, 2017

Uh oh!

davidlmobley commented Aug 23, 2017

Uh oh!

nhenriksen commented Aug 23, 2017

Uh oh!

davidlmobley commented Aug 24, 2017

Uh oh!

nhenriksen commented Aug 24, 2017

Uh oh!

davidlmobley commented Aug 25, 2017

Uh oh!

davidlmobley left a comment

Choose a reason for hiding this comment

Uh oh!

davidlmobley commented Aug 25, 2017

Uh oh!

davidlmobley left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

nhenriksen commented Aug 8, 2017 •

edited

Loading