Pull in argonne-lcf/Megatron-DeepSpeed @ main #15

Merged: saforem2 merged 21 commits into saforem2:main from argonne-lcf:main on Oct 10, 2025

Conversation

@saforem2 (Owner)

Copilot Review

This pull request introduces several improvements and additions to both the training utilities and documentation, as well as enhanced debugging and robustness in the pipeline parallel schedules. The main highlights are the addition of a new model architecture, improved gradient NaN detection and handling, and expanded documentation for optimizer and CPT strategies.

Key changes:

Model Architecture & Training Utilities

  • Added support for the AuroraGPT-2B model architecture in the ALCF/helpers.sh script, including a new function to set its hyperparameters and updated parameter-selection logic to recognize several aliases for this model.
  • Increased the default value of --rotary-position-embeddings-theta from 50,000 to 5,000,000 to support models with longer sequence lengths.
  • Commented out the --blend-sample-in-corpus training argument, likely changing the default data-blending behavior.
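The actual implementation is bash in ALCF/helpers.sh; the sketch below shows the alias-aware model selection logic in Python, with illustrative hyperparameter values rather than the repository's real numbers.

```python
# Python sketch of alias-aware model selection like that added to
# ALCF/helpers.sh for AuroraGPT-2B. Alias spellings and hyperparameter
# values here are assumptions for illustration only.
AURORA_GPT_2B_ALIASES = {"auroragpt-2b", "agpt-2b", "aurora-2b"}

def model_params(name: str) -> dict:
    """Map a model name or alias to a set of hyperparameters."""
    if name.lower() in AURORA_GPT_2B_ALIASES:
        return {"num_layers": 24, "hidden_size": 2048, "num_attention_heads": 16}
    raise ValueError(f"unknown model: {name}")
```

Matching on a set of lowercased aliases keeps the selection logic tolerant of the different spellings users pass on the command line.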

Robustness & Debugging in Pipeline Parallel Schedules

  • Integrated the ezpz utility library and replaced print_rank_0 with a logger for improved logging.
  • Added logic to detect NaNs in gradients during backpropagation, log critical information (including rank and host), and zero the NaN values to prevent training crashes.
  • Improved assertion messages and added type checks in the pipeline-schedule code for better debugging and clarity.
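The NaN handling summarized above can be sketched as follows; the function name, logger, and message wording are illustrative, not the exact code in the pipeline schedules.

```python
import socket
import torch

# Sketch of the gradient NaN handling: detect NaNs in a gradient tensor,
# log rank and host for diagnosis, and zero the offending entries so
# training can continue instead of crashing.
def zero_nan_grads(grad: torch.Tensor, rank: int = 0, log=print) -> torch.Tensor:
    if torch.isnan(grad).any():
        log(f"NaN gradient detected on rank={rank} host={socket.gethostname()}")
        grad = torch.nan_to_num(grad, nan=0.0)
    return grad
```

Note that zeroing NaNs silently changes the optimization step, so logging the rank and host is essential for tracing the root cause later.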

Documentation

  • Added a comprehensive guide (ALCF/notes/lb_optimizers_settings.md) on supported optimizers, schedulers, and hyperparameter tuning in Megatron-DeepSpeed, including instructions for adding custom optimizers and schedulers.
  • Added a detailed note (ALCF/notes/cpt.md) on performing Continued Pretraining (CPT) with various strategies and data mixing recipes, including a copy of the mix_datasets.py script for dataset blending.
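A minimal sketch of the kind of blending a dataset-mixing script performs is shown below: turning per-corpus token counts into normalized sampling weights. The function name and weighting rule are assumptions, not the repository's mix_datasets.py code.

```python
# Hypothetical sketch of dataset blending: normalize raw per-corpus
# token counts into sampling weights that sum to 1.
def blend_weights(token_counts: dict) -> dict:
    """Return sampling weights proportional to each corpus's token count."""
    total = sum(token_counts.values())
    if total <= 0:
        raise ValueError("token counts must sum to a positive number")
    return {name: count / total for name, count in token_counts.items()}
```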

Dependency Installation

  • Relaxed the requirement for a virtual environment in the dependency installation step, making it easier to install dependencies in various environments.
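The relaxed check can be sketched as below: warn and continue when no virtual environment is active, instead of aborting. The env-var names follow common virtualenv/conda convention; the real check lives in a shell script.

```python
import os

# Sketch of a relaxed environment check: detect a virtualenv or conda
# environment via conventional variables, but only warn when absent.
def venv_active(env=None) -> bool:
    env = os.environ if env is None else env
    return bool(env.get("VIRTUAL_ENV") or env.get("CONDA_PREFIX"))

def check_env(env=None, log=print) -> bool:
    if not venv_active(env):
        log("WARNING: no virtual environment detected; continuing anyway")
        return False
    return True
```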

These changes collectively improve the flexibility, robustness, and usability of the training scripts and pipeline, while also providing valuable documentation for users and developers.


@sourcery-ai (bot) left a comment


Sorry @saforem2, you have reached your weekly rate limit of 500,000 diff characters.

Please try again later or upgrade to continue using Sourcery


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +329 to +333
```python
if not unwrap_input_tensor_grad:
    for idx, x in enumerate(input_tensor_grad):
        if torch.isnan(x).any():
            logger.critical(
                " ".join(
```

P1: Guard gradient NaN check against None entries

The new NaN sanitization iterates over input_tensor_grad and calls torch.isnan(x) for each element. When a stage uses skip connections, this list deliberately contains None placeholders (added a few lines above), so the first None will raise TypeError: isnan(): argument 'input' (position 1) must be Tensor before any logging happens. Training with encoder/decoder splits will therefore crash as soon as backward_step executes. Skip None values before invoking torch.isnan.
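A corrected form of the flagged loop could look like the sketch below: skip None placeholder entries before calling torch.isnan, so the check no longer raises. The helper name is illustrative, not the repository's code.

```python
import torch

# Sketch of the guarded NaN check: ignore None placeholders (used for
# skip connections) before applying torch.isnan to each gradient entry.
def nan_grad_indices(input_tensor_grad) -> list:
    """Indices of gradient entries containing NaNs, skipping None placeholders."""
    bad = []
    for idx, x in enumerate(input_tensor_grad):
        if x is None:
            continue
        if torch.isnan(x).any():
            bad.append(idx)
    return bad
```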


Comment on lines +327 to +329
```python
if input_tensor_grad is not None:
    ezpz.breakpoint(0)
if not unwrap_input_tensor_grad:
```

P1: Remove breakpoint from production backward pass

backward_step now unconditionally calls ezpz.breakpoint(0) whenever input_tensor_grad exists. ezpz.breakpoint enters an interactive debugger and blocks the process unless a developer explicitly disables it, which will stall every microbatch in normal distributed training and make the code unusable in production. This debugging hook should be removed or gated behind a configuration flag.
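One way to gate the hook, as the review suggests, is an opt-in flag; the sketch below uses a hypothetical environment variable (MDS_DEBUG_BREAKPOINTS is not a real variable in the repository).

```python
import os

# Illustrative pattern for gating an interactive debug hook behind an
# opt-in environment variable instead of calling it unconditionally.
def maybe_breakpoint(hook, rank: int = 0, env=None) -> bool:
    """Call the debug hook only when explicitly enabled; return whether it ran."""
    env = os.environ if env is None else env
    if env.get("MDS_DEBUG_BREAKPOINTS", "0") == "1":
        hook(rank)
        return True
    return False
```

With the default off, every microbatch proceeds normally in distributed training, while a developer can still opt in locally for debugging.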


@saforem2 saforem2 merged commit 76bcc88 into saforem2:main Oct 10, 2025