Merged
Updated image reference for cooldown documentation.
Updated the section title and added details to the example.
Reviewer's Guide
This PR imports upstream updates by adding a detailed markdown note for the "cool down" experiment (complete with commands, environment context, and W&B links) and refines the ALCF helper script to correct a rotary embeddings default value and ensure the launcher invokes Python 3 explicitly for better compatibility.
Sequence diagram for launching jobs with explicit Python 3 in ALCF helpers
sequenceDiagram
participant User
participant "ALCF/helpers.sh"
participant "ezpz-launch"
participant "python3"
User->>"ALCF/helpers.sh": Initiate job launch
"ALCF/helpers.sh"->>"ezpz-launch": Call ezpz-launch with python3 executable
"ezpz-launch"->>"python3": Execute training script
"python3"-->>"ezpz-launch": Return execution result
"ezpz-launch"-->>"ALCF/helpers.sh": Job completion status
"ALCF/helpers.sh"-->>User: Notify job completion
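The flow in the sequence diagram above can be sketched in shell. This is a minimal illustration, not the actual contents of ALCF/helpers.sh: the variable PYTHON_EXEC, the function build_launch_cmd, and the script name train.py with its flag are hypothetical, and only the ezpz-launch and python3 names come from this PR.

```shell
# Hypothetical sketch; not the actual contents of ALCF/helpers.sh.
# PYTHON_EXEC and build_launch_cmd are illustrative names.

# Name the interpreter explicitly instead of relying on a bare "python".
PYTHON_EXEC="${PYTHON_EXEC:-python3}"

build_launch_cmd() {
    # $1: launcher; remaining arguments: training script and its flags.
    launcher="$1"
    shift
    # Place the interpreter between the launcher and the script.
    printf '%s %s %s\n' "$launcher" "$PYTHON_EXEC" "$*"
}

# train.py and --example-flag are placeholders, not files or flags from the PR.
cmd="$(build_launch_cmd ezpz-launch train.py --example-flag 1)"
echo "$cmd"
```

Keeping the interpreter in one overridable variable means a different environment can point at another Python 3 binary without editing the launch logic.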
Class diagram for rotary position embeddings hyperparameter update
classDiagram
class HelpersSh {
+setup_run_cmd()
+setupLauncher()
-ROPE_THETA : int = 50000
}
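As a rough illustration of how a default like the one in the class diagram could feed the training flag, the sketch below assumes a shell variable named ROPE_THETA (the name is taken from the diagram, not confirmed from ALCF/helpers.sh) and a hypothetical helper rope_flag. The flag name --rotary-position-embeddings-theta is the one quoted in this PR.

```shell
# Hypothetical sketch; not the actual contents of ALCF/helpers.sh.
# Default matches the corrected value in this PR; callers may override it.
ROPE_THETA="${ROPE_THETA:-50000}"

# Reject anything that is not a non-negative integer before it reaches the trainer.
case "$ROPE_THETA" in
    *[!0-9]*)
        echo "ROPE_THETA must be a non-negative integer, got: $ROPE_THETA" >&2
        exit 1
        ;;
esac

rope_flag() {
    # Emit the flag and its value as they would appear on the command line.
    printf '%s\n' "--rotary-position-embeddings-theta ${ROPE_THETA}"
}

rope_flag
```

Validating the value at the point where the default is set makes a mistyped magnitude (for example, extra zeros) easier to notice in review, although it cannot by itself tell 50000 from 5000000.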
Copilot Summary
This pull request introduces improvements for running and documenting large-scale AuroraGPT-2B training experiments. The most significant changes are new documentation for the "cool down" checkpoint experiment, a fix to the rotary position embeddings hyperparameter, and an update to the launcher setup for better Python compatibility.
Experiment documentation and reproducibility:
Added ALCF/notes/cooldown.md, which documents the "cool down" experiment for AuroraGPT-2B checkpoints, including validation loss curves, explicit training commands, environment details, and W&B links for reproducibility.
Training configuration fixes:
Changed --rotary-position-embeddings-theta in ALCF/helpers.sh from 5000000 to 50000 to correct the hyperparameter for rotary position embeddings.
Launcher compatibility and reliability:
Updated ALCF/helpers.sh to explicitly use the python3 executable when launching jobs with ezpz-launch, improving compatibility with Python environments.
Summary by Sourcery
Add documentation for the AuroraGPT-2B cooldown experiment, correct the rotary position embeddings hyperparameter, and ensure the launcher uses python3 for compatibility.
Bug Fixes:
Enhancements:
Documentation: