Pull in argonne-lcf/Megatron-DeepSpeed @ main#15
Conversation
feat: Initial logic to prevent NaNs from crashing training
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
if not unwrap_input_tensor_grad:
    for idx, x in enumerate(input_tensor_grad):
        if torch.isnan(x).any():
            logger.critical(
                " ".join(
```
Guard gradient NaN check against None entries
The new NaN sanitization iterates over input_tensor_grad and calls torch.isnan(x) for each element. When a stage uses skip connections, this list deliberately contains None placeholders (added a few lines above), so the first None will raise `TypeError: isnan(): argument 'input' (position 1) must be Tensor` before any logging happens. Training with encoder/decoder splits will therefore crash as soon as backward_step executes. Skip None values before invoking torch.isnan.
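A minimal sketch of the suggested guard, skipping None placeholders before the NaN check. The function name and the zero-fill recovery via torch.nan_to_num are assumptions for illustration, not the PR's actual code:

```python
import torch

def sanitize_input_tensor_grads(input_tensor_grad):
    """Replace NaN gradients with zeros, skipping None placeholders.

    Stages with skip connections deliberately insert None entries into
    input_tensor_grad, so each element must be checked before calling
    torch.isnan, which raises TypeError on non-Tensor inputs.
    """
    nan_found = False
    for idx, x in enumerate(input_tensor_grad):
        if x is None:  # skip-connection placeholder; torch.isnan(None) raises TypeError
            continue
        if torch.isnan(x).any():
            nan_found = True
            # Zero out NaNs so training can continue instead of crashing.
            input_tensor_grad[idx] = torch.nan_to_num(x, nan=0.0)
    return nan_found
```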
```python
if input_tensor_grad is not None:
    ezpz.breakpoint(0)
if not unwrap_input_tensor_grad:
```
Remove breakpoint from production backward pass
backward_step now unconditionally calls ezpz.breakpoint(0) whenever input_tensor_grad exists. ezpz.breakpoint enters an interactive debugger and blocks the process unless a developer explicitly disables it, which will stall every microbatch in normal distributed training and make the code unusable in production. This debugging hook should be removed or gated behind a configuration flag.
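One way to gate the hook, sketched here with a hypothetical environment variable; the flag name `MEGATRON_DEBUG_NAN` and the `maybe_breakpoint` helper are assumptions, not part of the PR:

```python
import os

def maybe_breakpoint(rank: int = 0) -> bool:
    """Enter the interactive debugger only when explicitly enabled.

    Gated behind a hypothetical MEGATRON_DEBUG_NAN environment variable so
    that normal distributed training runs are never blocked on a debugger.
    Returns True if the breakpoint was entered, False otherwise.
    """
    if os.environ.get("MEGATRON_DEBUG_NAN", "0") == "1":
        import ezpz  # imported lazily; only needed when debugging is requested
        ezpz.breakpoint(rank)
        return True
    return False
```

With the flag unset, every microbatch proceeds normally; a developer reproducing a NaN can opt in with `MEGATRON_DEBUG_NAN=1`.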
Copilot Review
This pull request introduces several improvements and additions to both the training utilities and documentation, as well as enhanced debugging and robustness in the pipeline parallel schedules. The main highlights are the addition of a new model architecture, improved gradient NaN detection and handling, and expanded documentation for optimizer and CPT strategies.
Key changes:
Model Architecture & Training Utilities
- Added the `AuroraGPT-2B` model architecture in the `ALCF/helpers.sh` script, including a new function to set its hyperparameters and updated parameter selection logic to recognize various aliases for this model. [1] [2]
- Increased `--rotary-position-embeddings-theta` from `50000` to `5000000` to support models with longer sequence lengths.
- Added the `--blend-sample-in-corpus` training argument, likely to change the default data blending behavior.

Robustness & Debugging in Pipeline Parallel Schedules
- Imported the `ezpz` utility library and replaced `print_rank_0` with a logger for improved logging. [1] [2] [3]

Documentation
- Added a new document (`ALCF/notes/lb_optimizers_settings.md`) on supported optimizers, schedulers, and hyperparameter tuning in Megatron-DeepSpeed, including instructions for adding custom optimizers and schedulers.
- Added a new guide (`ALCF/notes/cpt.md`) on performing Continued Pretraining (CPT) with various strategies and data mixing recipes, including a copy of the `mix_datasets.py` script for dataset blending.

Dependency Installation
These changes collectively improve the flexibility, robustness, and usability of the training scripts and pipeline, while also providing valuable documentation for users and developers.