Pull in argonne-lcf/Megatron-DeepSpeed @ main#15
Conversation
feat: Initial logic to prevent NaNs from crashing training
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
if not unwrap_input_tensor_grad:
    for idx, x in enumerate(input_tensor_grad):
        if torch.isnan(x).any():
            logger.critical(
                " ".join(
```
Guard gradient NaN check against None entries
The new NaN sanitization iterates over input_tensor_grad and calls torch.isnan(x) for each element. When a stage uses skip connections, this list deliberately contains None placeholders (added a few lines above), so the first None will raise `TypeError: isnan(): argument 'input' (position 1) must be Tensor` before any logging happens. Training with encoder/decoder splits will therefore crash as soon as backward_step executes. Skip None values before invoking torch.isnan.
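A minimal sketch of the suggested guard, skipping None placeholders before the NaN check. The function name and the zero-fill recovery via torch.nan_to_num are assumptions for illustration, not the PR's actual code:

```python
import torch

def sanitize_input_tensor_grads(input_tensor_grad):
    """Replace NaN gradients with zeros, skipping None placeholders.

    Stages with skip connections deliberately insert None entries into
    input_tensor_grad, so each element must be checked before calling
    torch.isnan, which raises TypeError on non-Tensor inputs.
    """
    nan_found = False
    for idx, x in enumerate(input_tensor_grad):
        if x is None:  # skip-connection placeholder; torch.isnan(None) raises TypeError
            continue
        if torch.isnan(x).any():
            nan_found = True
            # Zero out NaNs so training can continue instead of crashing.
            input_tensor_grad[idx] = torch.nan_to_num(x, nan=0.0)
    return nan_found
```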
```python
if input_tensor_grad is not None:
    ezpz.breakpoint(0)
if not unwrap_input_tensor_grad:
```
Remove breakpoint from production backward pass
backward_step now unconditionally calls ezpz.breakpoint(0) whenever input_tensor_grad exists. ezpz.breakpoint enters an interactive debugger and blocks the process unless a developer explicitly disables it, which will stall every microbatch in normal distributed training and make the code unusable in production. This debugging hook should be removed or gated behind a configuration flag.
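One way to gate the hook, sketched here with a hypothetical environment variable; the flag name `MEGATRON_DEBUG_NAN` and the `maybe_breakpoint` helper are assumptions, not part of the PR:

```python
import os

def maybe_breakpoint(rank: int = 0) -> bool:
    """Enter the interactive debugger only when explicitly enabled.

    Gated behind a hypothetical MEGATRON_DEBUG_NAN environment variable so
    that normal distributed training runs are never blocked on a debugger.
    Returns True if the breakpoint was entered, False otherwise.
    """
    if os.environ.get("MEGATRON_DEBUG_NAN", "0") == "1":
        import ezpz  # imported lazily; only needed when debugging is requested
        ezpz.breakpoint(rank)
        return True
    return False
```

With the flag unset, every microbatch proceeds normally; a developer reproducing a NaN can opt in with `MEGATRON_DEBUG_NAN=1`.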
Copilot Review
This pull request introduces several improvements and additions to both the training utilities and documentation, as well as enhanced debugging and robustness in the pipeline parallel schedules. The main highlights are the addition of a new model architecture, improved gradient NaN detection and handling, and expanded documentation for optimizer and CPT strategies.
Key changes:
Model Architecture & Training Utilities
- Added the `AuroraGPT-2B` model architecture in the `ALCF/helpers.sh` script, including a new function to set its hyperparameters and updated parameter selection logic to recognize various aliases for this model. [1] [2]
- Increased `--rotary-position-embeddings-theta` from `50000` to `5000000` to support models with longer sequence lengths.
- Added the `--blend-sample-in-corpus` training argument, likely to change the default data blending behavior.

Robustness & Debugging in Pipeline Parallel Schedules
- Imported the `ezpz` utility library and replaced `print_rank_0` with a logger for improved logging. [1] [2] [3]

Documentation
- Added a new document (`ALCF/notes/lb_optimizers_settings.md`) on supported optimizers, schedulers, and hyperparameter tuning in Megatron-DeepSpeed, including instructions for adding custom optimizers and schedulers.
- Added a new guide (`ALCF/notes/cpt.md`) on performing Continued Pretraining (CPT) with various strategies and data mixing recipes, including a copy of the `mix_datasets.py` script for dataset blending.

Dependency Installation
These changes collectively improve the flexibility, robustness, and usability of the training scripts and pipeline, while also providing valuable documentation for users and developers.