fix(models): drop persistent energies buffer in Ewald + PME #82
Ryan-Reese wants to merge 1 commit into NVIDIA:main
Conversation
EwaldModelWrapper and PMEModelWrapper kept a persistent `self._energies_buf` and reused it each forward via `zero_()` + `scatter_add_`. Because `per_atom_energies` carries a `grad_fn` from the Warp backward tape, the in-place ops chain each step's autograd graph onto the buffer via PyTorch's version-counter mechanism, causing linear per-step slowdown and unbounded GPU-memory growth in long MD runs. Allocate a fresh (B,) buffer per forward and clone on return so callers continue to receive a tensor with independent storage (matches the prior output contract; some downstream consumers mutate batch.energy in place). Adds two regression tests per wrapper: one asserting `_energies_buf` is absent after a forward, one asserting successive forwards return energy tensors whose storage is independent of prior outputs.
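For readers skimming the conversation, a minimal sketch of the before/after pattern described here (simplified, outside the actual wrapper code; function and argument names are illustrative):

```python
import torch

def persistent_buffer_step(buf, batch_idx, per_atom_energies):
    # Old pattern (simplified): in-place ops on a module-owned buffer. Once
    # per_atom_energies carries a grad_fn, buf acquires one too, and every
    # later zero_()/scatter_add_ chains the new step's graph onto the old one.
    buf.zero_()
    buf.scatter_add_(0, batch_idx, per_atom_energies)
    return buf

def fresh_buffer_step(num_systems, batch_idx, per_atom_energies):
    # New pattern (simplified): fresh (B,) buffer per forward, so nothing owned
    # by the module retains a grad_fn and each step's graph can be freed.
    energies = per_atom_energies.new_zeros(num_systems)
    energies.scatter_add_(0, batch_idx, per_atom_energies)
    return energies.clone()  # callers keep getting independent storage, as before
```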
ALCHEMI Toolkit Pull Request
Description
`EwaldModelWrapper` and `PMEModelWrapper` kept a persistent `self._energies_buf` tensor and reused it each forward via `zero_()` + `scatter_add_(0, batch_idx, per_atom_energies)`. Because `per_atom_energies` carries a `grad_fn` from the Warp backward tape, the in-place ops chain each step's autograd graph onto the buffer through PyTorch's version-counter mechanism. The buffer is owned by `self`, so the chain cannot be GC'd — per-step cost grows linearly and GPU memory climbs unboundedly in long MD runs. This PR drops the persistent buffer and allocates a fresh one per forward.
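A hypothetical standalone repro of the failure mode (plain PyTorch, not nvalchemi code; all names here are illustrative). It shows the autograd graph reachable from a persistent buffer growing on every step:

```python
import torch

def graph_size(t):
    # Count autograd nodes reachable from t.grad_fn.
    seen, stack = set(), [t.grad_fn]
    while stack:
        fn = stack.pop()
        if fn is None or fn in seen:
            continue
        seen.add(fn)
        stack.extend(f for f, _ in fn.next_functions)
    return len(seen)

buf = torch.zeros(4)                          # stand-in for the persistent self._energies_buf
batch_idx = torch.tensor([0, 1, 1, 2, 3, 3])

for step in range(5):
    per_atom = torch.randn(6, requires_grad=True) * 1.0  # carries a grad_fn, like the Warp output
    buf.zero_()                               # recorded by autograd once buf has a grad_fn
    buf.scatter_add_(0, batch_idx, per_atom)  # chains this step's graph onto the buffer
    print(step, graph_size(buf))              # node count (and retained memory) keeps climbing
```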
Type of Change

Related Issues
N/A — bug surfaced during downstream MD integration work, not tracked in an existing issue.
Changes Made
- `nvalchemi/models/ewald.py`, `nvalchemi/models/pme.py`: drop `self._energies_buf`; allocate a fresh `(B,)` zero buffer per `forward()`, scatter into it, clone on return so callers continue to receive a tensor with independent storage (matches the prior output contract; some downstream consumers mutate `batch.energy` in place).
- `test/models/test_ewald.py`, `test/models/test_pme.py`: add two regression tests per wrapper — one asserts `_energies_buf` is absent after a forward, one asserts consecutive forwards return energy tensors whose storage is independent of prior outputs (see the sketch after this list).
- `CHANGELOG.md`: add entry under Unreleased.
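A rough sketch of what the two new assertions check, under assumed names (the wrapper construction, the `batch` fixture, and the `.energy` output attribute are placeholders; the real tests live in the files above):

```python
import torch

def check_energy_output_contract(wrapper: torch.nn.Module, batch) -> None:
    # 1) No persistent buffer survives a forward pass.
    out1 = wrapper(batch)
    assert not hasattr(wrapper, "_energies_buf")
    # 2) Successive forwards return energy tensors with independent storage.
    out2 = wrapper(batch)
    e1, e2 = out1.energy, out2.energy  # assumed output attribute (cf. batch.energy above)
    assert e1.data_ptr() != e2.data_ptr()
```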
Testing

- `make pytest` — `test/models/test_ewald.py`, `test/models/test_pme.py`: 134 passed (130 existing + 4 new)
- `make lint` passes

Empirical verification
1000 NVT Langevin steps, naphthalene (3,6,3) = 1944 atoms, PBC, AIMNet2 + electrostatics (`accuracy=1e-6`, `hybrid_forces=False`, `compile_model=True`), dt = 1.5 fs, T = 200 K, friction = 0.01, 50-step timing blocks.

Tested on: NVIDIA H100 NVL (95 GB, driver 570.195.03), ARM (Linux 6.8.0), PyTorch 2.11.0+cu130 (CUDA 13.0), Python 3.12.13.
Figure (per-step runtime + GPU memory vs step, all four configs) attached below.
Numerical equivalence. GPU `scatter_add_` is non-deterministic (atomic-add ordering), so neither variant is bit-exact against itself. Measured relative L2 deltas of per-atom positions and forces at step 20 (seed = 42): unfixed-vs-fixed deltas are within the same-code run-to-run envelope for both models. No systematic shift.
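For reference, a minimal sketch of the metric quoted above, assuming the usual definition of a relative L2 delta (the exact measurement script is not part of this PR):

```python
import torch

def rel_l2_delta(a: torch.Tensor, b: torch.Tensor) -> float:
    # Relative L2 delta: ||a - b||_2 / ||b||_2, computed over all atoms.
    return (torch.linalg.norm(a - b) / torch.linalg.norm(b)).item()
```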
Checklist

Additional Notes
`nvalchemi/models/lj.py` has a superficially similar pattern, but with `self._atomic_energies_buf` (a kernel output buffer) on the scatter RHS — the autograd story is different there. Out of scope for this PR; flagged as a separate investigation / follow-up if it turns out to leak.
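One cheap check for that follow-up (generic PyTorch, not tied to lj.py internals; the buffer name is taken from the note above) is whether the flagged buffer ever acquires a `grad_fn` after a forward:

```python
import torch

def buffer_stays_graph_free(module: torch.nn.Module, name: str = "_atomic_energies_buf") -> bool:
    # A buffer that is not supposed to participate in autograd should have no
    # grad_fn after a forward pass; if it does, it is retaining graph history.
    buf = getattr(module, name, None)
    return buf is None or buf.grad_fn is None
```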