Compute selection: deviceIndex & enforce 1 thread in vacuum #752
Conversation
Hello @IAlibay! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2024-07-04 00:06:38 UTC
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #752      +/-   ##
==========================================
- Coverage   94.59%   92.83%    -1.77%
==========================================
  Files         134      134
  Lines        9940     9961       +21
==========================================
- Hits         9403     9247      -156
- Misses        537      714      +177

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
@mikemhenry when you get a chance, please do have a look at this - I suspect it'll make life a bit easier in some cases.
    String with the platform name. If None, it will use the fastest
    platform supporting mixed precision.
    Default ``None``.
gpu_device_index : Optional[list[str]]
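To make the new parameter concrete, here is a minimal sketch (the helper name and wiring are hypothetical, not the PR's code) of how a `gpu_device_index` list could map onto OpenMM's `DeviceIndex` platform property:

```python
from openmm import Platform

def _apply_device_index(platform: Platform, gpu_device_index: list[str] | None) -> None:
    """Hypothetical helper: apply a device-index selection to a GPU platform."""
    if gpu_device_index is not None and platform.getName() in ("CUDA", "OpenCL"):
        # OpenMM expects a comma-separated string of device indices, e.g. "0,1"
        platform.setPropertyDefaultValue("DeviceIndex", ",".join(gpu_device_index))
```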
Actually we should probably have a chat about how we handle this long term - this is a bit like MPI settings, where technically we shouldn't make this immutable but maybe something we pick up at run time?
How can we go about handling this properly?
Something I don't think we abstracted well is "run time arguments". We have the split for settings that change the thermodynamics, but we didn't consider a category of non-thermo settings that make the most sense to pick at runtime. I haven't looked at the code yet and will update this comment, but I suspect what we should do is (see the sketch after this list):
- have some default
- read this setting
- read an environment variable

If we do things in that order, we don't break anything old; when configuring your system you can make some choices, but when running on HPC you can still set things if needed and override the settings.
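A minimal sketch of that precedence order, assuming a hypothetical `OPENFE_DEVICE_INDEX` environment variable (none of these names come from the PR):

```python
import os

def resolve_device_index(setting_value: str | None) -> str:
    """Resolve a value as: default -> stored setting -> environment variable."""
    value = "0"                    # 1. have some default
    if setting_value is not None:  # 2. read this setting
        value = setting_value
    # 3. an environment variable set at run time (e.g. on HPC) overrides both
    return os.environ.get("OPENFE_DEVICE_INDEX", value)
```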
**Changed:**

* `openfe.protocols.openmm_rfe._rfe_utils.compute` has been moved
Good example of a nice changelog entry, but since this was a private API, no semver major bump is needed.
**Changed:**

* `openfe.protocols.openmm_rfe._rfe_utils.compute` has been moved
  to `openfe.protocols.openmm_utils.omm_compute`.
Do we want a private namespace or is this now in our public API?
Suggested change:
- to `openfe.protocols.openmm_utils.omm_compute`.
+ to `openfe.protocols.openmm_utils._omm_compute`.
I'd say public developer API is fine; private was because we were directly vendoring from perses.
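For reference, a sketch of the public import under the agreed layout (the `get_openmm_platform` name is taken from the diff further down; the call itself is illustrative):

```python
# Import from the new public developer API location
from openfe.protocols.openmm_utils import omm_compute

platform = omm_compute.get_openmm_platform("CUDA")
```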
mikemhenry left a comment:
This one is good, have a few notes but nothing blocking.
platform = compute.get_openmm_platform(
    settings['engine_settings'].compute_platform
# Restrict CPU count if running vacuum simulation
restrict_cpu = settings['forcefield_settings'].nonbonded_method.lower() == 'nocutoff'
Another argument that we really should have an explicit way of saying "this is a vacuum simulation". I propose we add a setting somewhere for that: #904
In the meantime, this seems like a pretty good heuristic.
We could do more logging; it would be nice to do a hackathon on it, but in the meantime I will just suggest things as I see them. It would be good to log what is going on here. It could be more verbose than what I suggest, but this seems like a spot where, if someone asks "why is this running on the CPU and not the GPU?", a log message could help.
Suggested change (add a log line after the heuristic):
restrict_cpu = settings['forcefield_settings'].nonbonded_method.lower() == 'nocutoff'
logging.info(f"{restrict_cpu=}")
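Putting the heuristic and the suggested log line together, a sketch (the wrapping function and settings shape are illustrative, not the PR's exact code):

```python
import logging

logger = logging.getLogger(__name__)

def should_restrict_cpu(settings: dict) -> bool:
    """Decide whether to restrict CPU threads for a vacuum simulation."""
    # a 'nocutoff' nonbonded method is the heuristic for a vacuum simulation
    restrict_cpu = (
        settings['forcefield_settings'].nonbonded_method.lower() == 'nocutoff'
    )
    # answers "why is this running on the CPU and not the GPU?" from the logs
    logger.info(f"{restrict_cpu=}")
    return restrict_cpu
```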
mikemhenry left a comment:
Okay, so I think that the device index selection is good, but I am on the fence about whether we should be doing anything other than warning users when they are running a vacuum simulation on a GPU or with more than 1 thread.
I think there is an argument for this being a "sane default" that we provide, but we need some user docs explaining that, by default, this is how a vacuum transformation works for these protocol(s).
Co-authored-by: Mike Henry <11765982+mikemhenry@users.noreply.github.com>
Are you saying that you don't want to go with this enforced 1 thread approach? I'm happy to reconsider this change (and just stick to the deviceIndex stuff) - what do you think?
mikemhenry left a comment:
Only one bit needed -- we should respect whatever the user has set for OPENMM_CPU_THREADS, if they have set it.
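A sketch of the requested behaviour, assuming the thread restriction is passed as CPU-platform properties (the helper name and return shape are illustrative):

```python
import os

def cpu_platform_properties(restrict_cpu: bool) -> dict[str, str]:
    """Only force a single CPU thread when the user hasn't chosen a count."""
    if restrict_cpu and 'OPENMM_CPU_THREADS' not in os.environ:
        # OpenMM's CPU platform reads the "Threads" property
        return {'Threads': '1'}
    # respect the user's OPENMM_CPU_THREADS (or OpenMM's own default)
    return {}
```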
@mikemhenry could you have a look and check that the latest change is what you meant?
mikemhenry left a comment:
Yes that is exactly what I was thinking!
      contained in MultiState reporter generated NetCDF file.
      """
-     ncfile = nc.Dataset(filename, 'w', format='NETCDF3_64BIT')
+     ncfile = nc.Dataset(filename, 'w', format='NETCDF3_64BIT_OFFSET')
This should fix the mypy error, see here: https://github.com/Unidata/netcdf4-python/blob/9426a684d3bd7b6c0c9b4d3c3f8531f30583d1e2/README.md?plain=1#L202C37-L202C57
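A minimal demonstration of the fix (the filename here is illustrative): `NETCDF3_64BIT_OFFSET` is the format literal listed in the netcdf4-python docs linked above, while `NETCDF3_64BIT` is an older alias that the type stubs reject.

```python
import netCDF4 as nc

# open with the documented format literal that satisfies mypy
ncfile = nc.Dataset('reporter.nc', 'w', format='NETCDF3_64BIT_OFFSET')
ncfile.close()
```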
Running CI on main here https://github.com/OpenFreeEnergy/openfe/actions/runs/12162626721 to see if we are seeing these errors. I kinda remember this happening before and it was a weird thing...
Cycling, but it looks like we're hitting an rdkit or networkx error.
Environment diff: Possibilities are either:
Have opened #1033 - this should be a simple fix if it's what I think it is.
@mikemhenry looks like all the failures are related to #1033. Are you ok with merging this as-is (knowing CI is failing due to another issue), or do you want to wait until we fix #1033?
Happy to merge in now that we have #1033 triaged.
Fixes #739, #704
Checklist

- news entry
- Developers certificate of origin