Improve memory logging #2839

Merged
deepakn94 merged 4 commits into NVIDIA:main from deepakn94:dnarayanan/memory_reporting_cleanup
Jan 21, 2026
Conversation

@deepakn94
Contributor

@deepakn94 deepakn94 commented Jan 7, 2026

  • Add torch.cuda.device_memory_used() to the memory reporting function, and also report memory around checkpoint saves
  • Make sure memory is logged after the optimizer state is allocated
  • Add an option to report memory periodically
  • Report memory around the first 3 checkpoint saves
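
For readers who want a feel for the behaviour described in the list above, here is a minimal sketch. It is not the code from this PR. The names `cuda_memory_stats`, `MemoryReporter`, `on_iteration`, and `around_checkpoint_save` are hypothetical and chosen only for illustration. The only call taken from the description is `torch.cuda.device_memory_used()`, which is treated as optional because it is assumed to be missing on older PyTorch builds.

```python
from typing import Callable, Dict, Optional

MIB = 1024 * 1024
Stats = Dict[str, float]


def cuda_memory_stats() -> Optional[Stats]:
    """Return a few CUDA memory figures in MiB, or None without torch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    stats = {
        "allocated": torch.cuda.memory_allocated() / MIB,
        "max allocated": torch.cuda.max_memory_allocated() / MIB,
        "reserved": torch.cuda.memory_reserved() / MIB,
    }
    # Assumption: device_memory_used() may be absent or may fail if the
    # vendor management bindings are not installed, so it is optional here.
    device_memory_used = getattr(torch.cuda, "device_memory_used", None)
    if device_memory_used is not None:
        try:
            stats["device used"] = device_memory_used() / MIB
        except Exception:
            pass
    return stats


class MemoryReporter:
    """Hypothetical helper: periodic reports plus reports around early saves."""

    def __init__(
        self,
        probe: Callable[[], Optional[Stats]] = cuda_memory_stats,
        interval: Optional[int] = None,
        max_checkpoint_reports: int = 3,
        sink: Callable[[str], None] = print,
    ) -> None:
        self.probe = probe
        self.interval = interval
        self.max_checkpoint_reports = max_checkpoint_reports
        self.sink = sink
        # Runtime counter kept on the reporter, not in any saved config.
        self._checkpoint_saves_seen = 0

    def report(self, label: str) -> None:
        """Emit one line of memory figures, or nothing if none are available."""
        stats = self.probe()
        if stats is None:
            return
        body = " | ".join(f"{name}: {value:.1f} MiB" for name, value in stats.items())
        self.sink(f"[{label}] {body}")

    def on_iteration(self, iteration: int) -> None:
        """Report every `interval` iterations when an interval is configured."""
        if self.interval and iteration % self.interval == 0:
            self.report(f"iteration {iteration}")

    def around_checkpoint_save(self, save_fn: Callable[[], None]) -> None:
        """Run save_fn, reporting before and after for the first few saves only."""
        should_report = self._checkpoint_saves_seen < self.max_checkpoint_reports
        self._checkpoint_saves_seen += 1
        if should_report:
            self.report("before checkpoint save")
        save_fn()
        if should_report:
            self.report("after checkpoint save")
```

In a training loop, one would call `reporter.on_iteration(step)` once per step and wrap each save as `reporter.around_checkpoint_save(lambda: save(...))`, where `save` stands in for whatever checkpoint function is in use. Calling `reporter.report("after first optimizer step")` at a suitable point would capture memory once the optimizer state exists.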

@github-actions github-actions bot requested a review from Phlip79 January 7, 2026 03:17
@ko3n1g ko3n1g added this to the Core 0.16 milestone Jan 7, 2026
@deepakn94 deepakn94 force-pushed the dnarayanan/memory_reporting_cleanup branch from f667baf to f356daa Compare January 12, 2026 21:45
@deepakn94 deepakn94 force-pushed the dnarayanan/memory_reporting_cleanup branch from f356daa to bbab637 Compare January 13, 2026 21:12
- Add torch.cuda.device_memory_used() to memory reporting function, and also report memory around checkpoint saves
- Make sure memory is logged after optimizer state is allocated
- Add option to report memory periodically
- Report memory around first 3 checkpoint saves

Signed-off-by: Deepak Narayanan <dnarayanan@nvidia.com>
…l variable instead of args to avoid polluting checkpoint with runtime state

Signed-off-by: Deepak Narayanan <dnarayanan@nvidia.com>
…ptimizer state is created

Signed-off-by: Deepak Narayanan <dnarayanan@nvidia.com>
Signed-off-by: Deepak Narayanan <dnarayanan@nvidia.com>
@deepakn94 deepakn94 added this pull request to the merge queue Jan 21, 2026
Merged via the queue into NVIDIA:main with commit 82ea022 Jan 21, 2026
71 of 73 checks passed
@deepakn94 deepakn94 deleted the dnarayanan/memory_reporting_cleanup branch January 21, 2026 07:54
daiyaanarfeen pushed a commit to daiyaanarfeen/Megatron-LM that referenced this pull request Feb 23, 2026
Signed-off-by: Deepak Narayanan <dnarayanan@nvidia.com>
3 participants