Pull in upstream changes by saforem2 · Pull Request #18 · saforem2/Megatron-DeepSpeed

saforem2 · 2025-11-18T17:29:58Z

Copilot Summary

This pull request introduces improvements for running and documenting large-scale AuroraGPT-2B training experiments. The most significant changes are new documentation for the "cool down" experiment on AuroraGPT-2B checkpoints, a fix to the rotary position embeddings hyperparameter, and an update to the launcher setup for better Python compatibility.

Experiment documentation and reproducibility:

Added a new markdown note ALCF/notes/cooldown.md that documents the "cool down" experiment for AuroraGPT-2B checkpoints, including validation loss curves, explicit training commands, environment details, and W&B links for reproducibility.

Training configuration fixes:

Updated the default value for --rotary-position-embeddings-theta in ALCF/helpers.sh from 5000000 to 50000 to correct the hyperparameter for rotary position embeddings.
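For context, in the standard rotary position embedding (RoPE) formulation, theta is the base of the per-pair rotation frequencies theta^(-2i/d), so lowering it from 5000000 to 50000 shortens the longest positional wavelength. The sketch below assumes that standard formulation and a hypothetical head dimension of 128 (neither is read from ALCF/helpers.sh) and compares the two defaults.

```python
import math


def rope_inv_freqs(theta: float, head_dim: int) -> list:
    """Per-pair rotation frequencies of standard RoPE: theta ** (-2i / head_dim)."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def max_wavelength(theta: float, head_dim: int) -> float:
    """Wavelength (in tokens) of the slowest-rotating dimension pair."""
    return 2 * math.pi / min(rope_inv_freqs(theta, head_dim))


HEAD_DIM = 128  # hypothetical head dimension, chosen only for illustration

for theta in (5_000_000, 50_000):  # old vs. new default from this PR
    print(f"theta={theta:>9,}: longest wavelength ~ {max_wavelength(theta, HEAD_DIM):,.0f} tokens")
```

Because the longest wavelength grows as 2π·theta^((d-2)/d), the 100x reduction in theta shrinks it by roughly two orders of magnitude for any realistic head dimension.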

Launcher compatibility and reliability:

Modified the launcher setup in ALCF/helpers.sh to explicitly use the python3 executable when launching jobs with ezpz-launch, improving compatibility with Python environments.
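The launcher change can be sketched in Python as follows. This is a hypothetical illustration, not the code in ALCF/helpers.sh (which is a shell script): `build_launch_cmd` is an invented name, and the sketch assumes ezpz-launch accepts the command to execute as its trailing arguments.

```python
import shutil
from typing import List, Optional


def build_launch_cmd(script: str, *script_args: str, python: Optional[str] = None) -> List[str]:
    """Hypothetical helper: prefix a training command with ezpz-launch and python3.

    Assumes ezpz-launch takes the command to run as its trailing arguments.
    Pass `python` to override the interpreter; otherwise python3 is resolved from PATH.
    """
    if python is None:
        # Resolve python3 from PATH so the job uses the active environment's interpreter
        python = shutil.which("python3")
        if python is None:
            raise RuntimeError("python3 not found on PATH")
    # The explicit interpreter sits between the launcher and the training script
    return ["ezpz-launch", python, script, *script_args]
```

For example, `build_launch_cmd("pretrain_gpt.py", "--rotary-position-embeddings-theta", "50000")` returns a list starting with `"ezpz-launch"` followed by the resolved python3 path; the script name here is a placeholder. Building an argument list rather than a single string avoids shell-quoting problems if the interpreter path contains spaces.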

Summary by Sourcery

Add documentation for the AuroraGPT-2B cooldown experiment, correct the rotary position embeddings hyperparameter, and ensure the launcher uses python3 for compatibility.

Bug Fixes:

Fix default rotary position embeddings theta from 5000000 to 50000.

Enhancements:

Update ezpz-launch commands to explicitly invoke python3 for better environment compatibility.

Documentation:

Add ALCF/notes/cooldown.md detailing the “cool down” experiment with commands, loss curves, and W&B links.


sourcery-ai · 2025-11-18T17:30:05Z

Reviewer's Guide

This PR imports upstream updates: it adds a detailed markdown note for the “cool down” experiment (complete with commands, environment context, and W&B links) and refines the ALCF helper script to correct a rotary embeddings default value and to ensure the launcher invokes Python 3 explicitly for better compatibility.

Sequence diagram for launching jobs with explicit Python 3 in ALCF helpers

sequenceDiagram
    participant User
    participant "ALCF/helpers.sh"
    participant "ezpz-launch"
    participant "python3"
    User->>"ALCF/helpers.sh": Initiate job launch
    "ALCF/helpers.sh"->>"ezpz-launch": Call ezpz-launch with python3 executable
    "ezpz-launch"->>"python3": Execute training script
    "python3"-->>"ezpz-launch": Return execution result
    "ezpz-launch"-->>"ALCF/helpers.sh": Job completion status
    "ALCF/helpers.sh"-->>User: Notify job completion

Class diagram for rotary position embeddings hyperparameter update

classDiagram
    class HelpersSh {
      +setup_run_cmd()
      +setupLauncher()
      -ROPE_THETA : int = 50000
    }

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Added comprehensive cooldown experiment documentation | Create ALCF/notes/cooldown.md with an experiment overview and a validation loss grid; embed explicit training commands and environment details; link to W&B reports for reproducibility | `ALCF/notes/cooldown.md` |
| Refined training helper script for hyperparameter accuracy and launcher reliability | Adjust the default rotary-position-embeddings-theta from 5000000 to 50000; update the ezpz-launch invocation to prefix it with the python3 executable | `ALCF/helpers.sh` |


sourcery-ai

Hey there - I've reviewed your changes and they look great!


saforem2 added 11 commits October 12, 2025 13:43

fix: Update ALCF/helpers.sh

07a7bbd

docs: Add ALCF/notes/cooldown.md

889ff8d

docs: Add ALCF/notes/assets/

d16272e

Add files via upload

710246a

Rename ScreenShot-2025-11-10-125411@2x.png to cooldownHD.png

0eb4e65

Replace cooldown image with high-definition version

41d5c07

Updated image reference for cooldown documentation.

Enhance cooldown.md with new title and example details

8da9b0d

Updated the section title and added details to the example.

Fix markdown formatting in cooldown.md

b654baa

docs: Update ALCF/notes/cooldown.md

76f5f9d

docs: Update ALCF/notes/cooldown.md

6c20809

fix: Update default ROPE_THETA in ALCF/helpers.sh

6b36b3e

sourcery-ai Bot reviewed Nov 18, 2025


saforem2 merged commit 366ce9c into saforem2:main Dec 10, 2025
1 check passed

saforem2 merged 11 commits into saforem2:main from argonne-lcf:main
