
Initial commit with updates to cpt.md and new README #19

Merged
saforem2 merged 79 commits into saforem2:main from argonne-lcf:main
Mar 11, 2026
Conversation

saforem2 (Owner) commented Mar 11, 2026

Copilot Summary

This pull request makes several improvements and updates to the CPT (Continual Pre-Training) documentation and related scripts for AuroraGPT, focusing on providing a comprehensive strategy guide, cleaning up helper scripts, and improving documentation organization. The main changes include a major rewrite and expansion of the cpt.md strategy cookbook, removal of an embedded script, minor script refactoring, and the addition of documentation assets.

Major documentation update:

  • The ALCF/notes/cpt.md file has been rewritten and significantly expanded into a detailed CPT strategy cookbook for AuroraGPT V1. It now covers data-centric and optimization strategies, recommended mixing weights, stage-by-stage guidance, and practical tips for handling distribution shifts during continual pre-training. The update also includes new diagrams and links to relevant assets. An embedded script (mix_datasets.py) was removed from the documentation, with references to its GitHub location instead.

Script and helper improvements:

  • In ALCF/helpers.sh, the call to install_dependencies in the setup() function is now commented out, likely to avoid redundant installations during repeated runs.
  • The setupLauncher() function in ALCF/helpers.sh has been simplified: the launcher command is updated to use ezpz launch instead of ezpz-launch, and commented-out legacy logic has been removed for clarity.
  • Minor whitespace cleanup in set_ccl_vars_on_aurora() in ALCF/helpers.sh.

Documentation assets:

  • Added placeholder readme files to ALCF/notes/assets/cpt_images and ALCF/notes/assets/lb_optimizers to clarify the purpose of images used in the CPT and optimizer documentation.

These changes collectively improve the clarity, maintainability, and usability of the CPT workflow and documentation for AuroraGPT.



Summary by Sourcery

Document continual pre-training strategies and large-batch optimizer guidance for AuroraGPT, adjust training and helper scripts to use ezpz utilities and new optimizer/scheduler settings, and add tooling to generate LR cooldown commands and stage-specific training jobs.

New Features:

  • Introduce a cooldown command generation toolkit under tools/cooldown_generator to compute checkpoint/cooldown schedules from token budgets and emit ready-to-run Megatron-DeepSpeed commands and per-ID sweep scripts.
  • Add stage-2 and stage-3 AuroraGPT-2B training PBS scripts configured for sophiag optimizer, constant LR with cooldown, and appropriate data lists and tokenizer settings.
  • Add new Aurora-specific fused data-list files for Dolmino and math/code mixtures used in later training stages.
  • Add a CPT image and large-batch optimizer image asset structure to support the new documentation.
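
The token-budget-to-step mapping that build_checkpoints_from_tokens.py presumably relies on follows the standard relation below (a sketch under the assumption that global batch size and sequence length are the only factors; the function name here is illustrative, not the tool's actual API):

```python
def tokens_to_iters(train_tokens: int, global_batch_size: int, sequence_length: int) -> int:
    """Convert a total token budget into a training iteration count.

    Each optimizer step consumes global_batch_size * sequence_length tokens,
    so the step count is the token budget divided by tokens per step.
    """
    tokens_per_step = global_batch_size * sequence_length
    return train_tokens // tokens_per_step

# e.g. a 1-trillion-token milestone at GBS=1024, seq_len=4096:
# tokens_to_iters(1_000_000_000_000, 1024, 4096) -> 238418
```

Note that this divides by tokens-per-step; multiplying instead is exactly the bug flagged in the review comments below on make_cooldown_cmds.py.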

Enhancements:

  • Refine Megatron training utilities to use the ezpz API for rank, device, profiling, and timing, and improve assertion formatting and logging behavior for timers and wandb integration.
  • Simplify ALCF helper scripts by disabling automatic dependency installation, updating the launcher to use ezpz launch, and minor cleanup in model selection and environment configuration.
  • Retune the 2B large-batch training script to use the sophiag optimizer by default instead of muonclip.
  • Guard wandb usage in pretrain_gpt_alcf.py to avoid errors when wandb is unavailable.

Documentation:

  • Rewrite and significantly expand the CPT strategy cookbook in ALCF/notes/cpt.md, including stage-wise data-mixing guidance, learning-rate strategies, and handling distribution shifts for AuroraGPT V1 and legacy agpt-7B.
  • Add a new large_batch_optimizers_settings guide documenting optimizer options, scheduler variants, LR finder usage, initialization schemes, and hyperparameter tuning recommendations for large-batch training.
  • Document usage of cooldown command generation tooling and workflows in tools/cooldown_generator/README.md.
  • Add placeholder readme files describing assets used in CPT and large-batch optimizer documentation image directories.

Chores:

  • Ignore additional generated or local files via .gitignore if applicable.

Copilot AI review requested due to automatic review settings March 11, 2026 14:45

sourcery-ai Bot commented Mar 11, 2026

Reviewer's Guide

This PR significantly extends AuroraGPT continual pre‑training documentation, introduces a new large‑batch optimizer guide, adds cooldown command-generation tooling and training scripts for specific stages, and makes small runtime/launcher cleanups (switching to ezpz APIs, safer logging, and making ezpz usage less intrusive).

Flow diagram for AuroraGPT CPT data-mixing strategy across stages

```mermaid
graph TD
    Start["Start CPT planning for AuroraGPT V1"]

    Start --> S1["Stage1 D0 pretraining complete"]
    S1 --> S1_S2["Stage1 → Stage2\nWeak distribution shift"]

    S1_S2 --> S1S2_Strategy1["Strategy1 No replay\nTrain only on D1"]
    S1_S2 --> S1S2_Strategy2["Strategy2 Replay from D0\nMix D0 + D1"]

    S1S2_Strategy2 --> S1S2_Mix["Set mixing weights\nalpha0=0.05–0.10, alphaD=1-alpha0, alphaB=0"]
    S1S2_Mix --> S1S2_Buffer["Build buffer B with samples from D1"]
    S1S2_Buffer --> S2_Complete["End of Stage2\nHave DCPT_1"]

    S1S2_Strategy1 --> S1S2_Buffer

    S2_Complete --> S2_S3["Stage2 → Stage3\nStrong shift to math/code D2"]

    S2_S3 --> S2S3_Naive["Naive strategy\nTrain mostly on D2"]
    S2_S3 --> S2S3_Mix["Strategy3 Mix D0 + D2 + B"]
    S2_S3 --> S2S3_Decay["Strategy4 LR decay recipe\nif data mixing fails"]

    S2S3_Mix --> S2S3_MixA["MixA D0=0.33, D2=0.33, B=0.34"]
    S2S3_Mix --> S2S3_MixB["MixB D0=0.05, D2=0.48, B=0.47"]
    S2S3_Mix --> S2S3_MixC["MixC D0=0.0, D2=0.10, B=0.90"]

    S2S3_MixA --> S2S3_BufferUpdate["Add D2 to buffer B at end of stage"]
    S2S3_MixB --> S2S3_BufferUpdate
    S2S3_MixC --> S2S3_BufferUpdate

    S2S3_BufferUpdate --> S3_Complete["End of Stage3\nHave DCPT_2"]

    S3_Complete --> S3_S4["Stage3 → Stage4\nShift to reasoning traces D3"]

    S3_S4 --> S3S4_NoPrevDecay["If Strategy4 was not used"]
    S3_S4 --> S3S4_PrevDecay["If Strategy4 was used"]

    S3S4_NoPrevDecay --> S3S4_MixA["MixA D0=0.33, D3=0.33, B=0.34"]
    S3S4_NoPrevDecay --> S3S4_MixB["MixB D0=0.50, D3=0.25, B=0.25"]

    S3S4_MixA --> S3S4_Cooldown["Cool down LR to convergence"]
    S3S4_MixB --> S3S4_Cooldown

    S3S4_PrevDecay --> S3S4_ContinueDecay["Continue decaying with DCPT_2\nuntil LR3/100"]
    S3S4_ContinueDecay --> S3S4_IntroduceD3["Introduce D3 mix at LR3/5"]
    S3S4_IntroduceD3 --> S3S4_Cooldown

    S3S4_Cooldown --> Final["Final CPT model converged"]
```
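
The mixing weights shown in the diagram can be turned into per-source token budgets with a small helper (an illustrative sketch only; this is not the interface of the repository's mix_datasets.py, which operates on file lists rather than token counts):

```python
def plan_mix(weights: dict, total_tokens: int) -> dict:
    """Allocate a total token budget across data sources by normalized weight."""
    total_weight = sum(weights.values())
    return {
        name: round(total_tokens * w / total_weight)
        for name, w in weights.items()
    }

# Stage 2 -> 3 "MixA" from the diagram, applied to a hypothetical 1.5T budget:
budget = plan_mix({"D0": 0.33, "D2": 0.33, "B": 0.34}, total_tokens=1_500_000_000_000)
```

Normalizing by the weight sum keeps the allocation correct even when the weights do not add up to exactly 1.0.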

Flow diagram for generating cooldown commands with make_cooldown_cmds.py

```mermaid
graph TD
    A["Start cooldown planning"] --> B["Provide checkpoint resume step S\n(--checkpoint-iters or --pairs)"]
    B --> C["Provide cooldown length R in steps\n(--cooldown-steps or in --pairs)"]
    C --> D["Optionally specify ID(s)\n(--checkpoint-ids or id:S:R in --pairs)"]

    D --> E["Compute total train iters T = S + R"]
    E --> F["Compute fraction f = S / T\n(for lr_constant_plus_cooldown_frac)"]

    F --> G["Build env block\nMODEL_ARCH, OPT, GRAD_ACC_STEPS, TRAIN_ITERS=T, ..."]
    G --> H["Construct train_alcf_sh command\nwith lr_constant_plus_cooldown and frac=f"]

    H --> I{Emit mode}

    I -->|--emit-sh provided| J["Write cooldown_idX_sS_rR_tT_sh script"]
    I -->|no --emit-sh| K["Print command block to stdout"]

    J --> L["Optionally submit script via PBS"]
    K --> L

    L --> M["Megatron_DeepSpeed resumes at S\nLR cooldown runs for R steps"]
```
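
The arithmetic in this flow reduces to two lines (a minimal sketch; the helper name is hypothetical, but the T = S + R and f = S / T relations match the diagram above):

```python
def cooldown_schedule(resume_step: int, cooldown_steps: int):
    """Return (total_train_iters, constant-phase fraction) for a resume-and-cooldown run.

    Training runs at constant LR up to resume step S, then cools down for
    R steps, so the total is T = S + R and the constant fraction is f = S / T.
    """
    total_iters = resume_step + cooldown_steps
    return total_iters, resume_step / total_iters

# Resuming at S=100_000 with an R=25_000-step cooldown:
# cooldown_schedule(100_000, 25_000) -> (125000, 0.8)
```

Keeping this in one helper avoids recomputing T and f inline, which is essentially the simplification the Sourcery review suggests for make_cooldown_cmds.py.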

File-Level Changes

Change Details Files
Rewrite and expand CPT strategy documentation for AuroraGPT V1 and adjust legacy CPT notes.
  • Replace the brief CPT note with a detailed multi‑stage data‑mixing and LR‑strategy cookbook for AuroraGPT V1, including tables, figures, and concrete recipes
  • Remove the inlined mix_datasets.py script in favor of linking to the upstream GitHub version
  • Tidy legacy agpt‑7B CPT guidance and add a short "things to keep in mind" section with batch‑size/LR scaling tips
ALCF/notes/cpt.md
Add a new guide for large‑batch optimizers, LR schedulers, and LR finder usage.
  • Document supported optimizers and schedulers, including configuration examples and how to register custom implementations
  • Describe hyperparameter tuning strategies (init variance formula, LR‑finder workflow, and MuP/CompleteP notes)
  • Provide example plotting code to analyze LR‑finder output and select candidate learning rates
ALCF/notes/large_batch_optimizers_settings.md
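
As a rough illustration of the LR-finder workflow the guide describes, one common heuristic is to pick a learning rate one order of magnitude below the sweep's loss minimum (a sketch under that assumption; the guide's actual recipe and plotting code may differ):

```python
def pick_lr(lrs: list, losses: list, safety_factor: float = 10.0) -> float:
    """Pick a candidate LR from an LR-finder sweep.

    Take the LR at the minimum recorded loss, then back off by safety_factor,
    since the loss minimum typically sits close to the divergence point.
    """
    best_idx = min(range(len(losses)), key=losses.__getitem__)
    return lrs[best_idx] / safety_factor

# Sweep where loss bottoms out at lr=1e-3 before diverging:
candidate = pick_lr([1e-5, 1e-4, 1e-3, 1e-2], [3.0, 2.5, 2.0, 4.0])
```

In practice the loss curve should be smoothed before taking the argmin, since per-step loss from a finder sweep is noisy.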
Introduce cooldown command-generation tooling and wrappers for Megatron‑DeepSpeed runs.
  • Add make_cooldown_cmds.py to generate resume‑and‑cooldown commands and optionally update latest checkpoint markers
  • Add build_checkpoints_from_tokens.py to map trillion‑token milestones to step counts and rollback points
  • Add gen_cooldown_sweep.sh to generate per‑checkpoint cooldown job scripts, with phase‑specific data lists and PBS headers
  • Document usage patterns and examples for the cooldown tooling in a dedicated README
tools/cooldown_generator/make_cooldown_cmds.py
tools/cooldown_generator/build_checkpoints_from_tokens.py
tools/cooldown_generator/gen_cooldown_sweep.sh
tools/cooldown_generator/README.md
Add stage‑specific AuroraGPT‑2B training launcher scripts using the SophiaG optimizer and new data lists.
  • Introduce train_aGPT_2B_sophiag_stage2.sh to run 2B training on Dolmino mix with constant LR and SophiaG
  • Introduce train_aGPT_2B_sophiag_stage3.sh to extend training to math/code data with updated TRAIN_TOKENS and data list
  • Add corresponding fused data‑list files for Dolmino mix and math/code datasets
train_aGPT_2B_sophiag_stage2.sh
train_aGPT_2B_sophiag_stage3.sh
ALCF/data-lists/aurora/dolmino-mix-1124-fused-file-list.txt
ALCF/data-lists/aurora/nvidia-math1-code2.txt
Make small runtime and launcher cleanups (switch to ezpz helpers, safer logging, and less aggressive setup).
  • Switch megatron.training to use ezpz.* rank/device helpers and decorators, and reformat some long asserts
  • Adjust LR scheduler construction logic to compute constant+cooldown steps in a more readable style
  • Route Megatron timer logging through an ezpz logger instead of bare prints
  • Guard final wandb access on pretrain exit to avoid errors when wandb is absent
  • Simplify ALCF helpers: comment out install_dependencies in setup, change launcher to "ezpz launch", normalize case labels, and small whitespace fixes
  • Make train_alcf.sh stop auto‑installing ezpz and stop unconditionally killing MPI via ezpz_kill_mpi
megatron/training.py
megatron/timers.py
pretrain_gpt_alcf.py
ALCF/helpers.sh
train_alcf.sh
train_aGPT_2B_large_batch.sh
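
The wandb guard mentioned above can look something like this (a sketch of the general pattern, not the exact code in pretrain_gpt_alcf.py):

```python
try:
    import wandb  # optional dependency; may be absent on compute nodes
except ImportError:
    wandb = None


def finish_wandb_run() -> None:
    """Close the active wandb run, if wandb is installed and a run exists."""
    if wandb is not None and getattr(wandb, "run", None) is not None:
        wandb.finish()


finish_wandb_run()  # safe no-op when wandb is absent or no run is active
```

Guarding both the import and the active-run attribute keeps the exit path from raising on clusters where wandb is unavailable or logging was never initialized.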
Add small documentation asset placeholders for images used by CPT and optimizer docs.
  • Add readme placeholder marking CPT images directory purpose
  • Add readme placeholder marking large‑batch optimizer figures directory purpose
ALCF/notes/assets/cpt_images/readme.md
ALCF/notes/assets/lb_optimizers/readme.md


@saforem2 saforem2 merged commit 83fd580 into saforem2:main Mar 11, 2026

sourcery-ai (Bot) left a comment


Hey - I've found 12 issues and left some high-level feedback:

  • In tools/cooldown_generator/make_cooldown_cmds.py, the computation of train_iters from train_tokens is inverted (it currently multiplies by global_batch_size and sequence_length instead of dividing by them), which will massively overshoot the intended number of iterations; consider fixing this before relying on token-based configuration.
  • The helper get_total_iters_from_cooldown_percent in make_cooldown_cmds.py has a suspicious expression cooldown_percent = (train_iters - cooldown_steps / train_iters) that likely needs parentheses ((train_iters - cooldown_steps) / train_iters) or a different formula; it would be good to correct or remove this before someone uses it.
  • In ALCF/helpers.sh and train_alcf.sh you’ve commented out install_dependencies and ezpz_kill_mpi/ezpz auto-install; if these side effects are still required in common workflows, consider adding a clear alternative invocation path or a guarded flag so users don’t silently skip setup/cleanup.
## Individual Comments

### Comment 1
<location path="tools/cooldown_generator/make_cooldown_cmds.py" line_range="162-163" />
<code_context>
+        assert train_iters is None, (
+            f"Only one of {train_tokens, train_iters} should be specified."
+        )
+        assert global_batch_size is not None and sequence_length is not None
+        train_iters = train_tokens * global_batch_size * sequence_length
+
+    assert train_iters is not None
</code_context>
<issue_to_address>
**issue (bug_risk):** Train iterations from train_tokens should divide by tokens-per-step, not multiply.

In the `train_tokens` branch you currently compute:

```python
train_iters = train_tokens * global_batch_size * sequence_length
```
If `train_tokens` is the total desired tokens, the correct step count is:

```python
tokens_per_step = global_batch_size * sequence_length
train_iters = train_tokens // tokens_per_step  # or float division + rounding
```
Multiplying by `global_batch_size * sequence_length` yields an incorrect (and massively inflated) number of steps.
</issue_to_address>

### Comment 2
<location path="tools/cooldown_generator/make_cooldown_cmds.py" line_range="271-276" />
<code_context>
+
+    lines = []
+    # header = "# Auto-generated cooldown commands\nset -euo pipefail\n\n"
+    if args.include_header:
+        if (hfp := Path(args.include_header)).is_file():
+            with hfp.open("r") as f:
+                lines.extend(f.readlines())
+        else:
+            lines.extend("\n".join(args.include_header.split("\n")))
+    # if args.emit_sh:
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Using list.extend on a joined header string appends characters instead of lines.

In the non-file case you currently do:

```python
lines.extend("\n".join(args.include_header.split("\n")))
```

This extends `lines` with individual characters, not header lines, so the output script will be malformed.

Use either:

```python
lines.append(args.include_header)
```
if `include_header` is a full block, or:

```python
lines.extend(args.include_header.splitlines(keepends=True))
```
if you need to preserve line boundaries and newlines.
</issue_to_address>

### Comment 3
<location path="tools/cooldown_generator/gen_cooldown_sweep.sh" line_range="71" />
<code_context>
+#     die "--cool-R (cooldown steps) or --cooldown-percent required"
+# fi
+
+# [[ -n "${COOL_R}" ]] || die "--cool-R (cooldown steps) is required"
+[[ -n "${LOAD_PATH}" ]] || die "--load path is required"
+[[ -n "${PHASE1_LIST}" ]] || die "--phase1-list is required (IDs 1..4)"
</code_context>
<issue_to_address>
**issue (bug_risk):** Cooldown steps (COOL_R) can be left empty but are still passed to make_cooldown_cmds.

The guard requiring `COOL_R` was commented out:

```bash
# [[ -n "${COOL_R}" ]] || die "--cool-R (cooldown steps) is required"
```

but `--cooldown-steps "${COOL_R}"` is still passed to `make_cooldown_cmds.py` in both exact and rollback scripts. If `COOL_R` is unset/empty, the Python script receives an empty value for a required integer argument and will fail or misbehave.

Please either restore the guard or define a safe default for `COOL_R` at the top of the script.
</issue_to_address>

### Comment 4
<location path="ALCF/notes/cpt.md" line_range="33" />
<code_context>
+
+| CPT Stages | Distribution shift | Primary strategy | Fallbacks | Notes |
+|------|--------------------|------------------|-------------|-------|
+| Stage 1 → 2 | Weak | Naive CPT wih $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ |
+| Stage 2 → 3 | Strong | 5-30% replay of $D^{CPT}_{2}$ and monitor loss  | Use buffer: $0.33D_0 + 0.33D_2 + 0.34B$| Add $D_2$ to buffer $B$, you might need to switch to LR centric strategy, see below |
+| Stage 3 → 4 | Strong | Cooldown with mix  $0.33D_0 + 0.33D_3 + 0.34B$ |  Cooldown with mix  $0.05D_0 + 0.47D_3 + 0.48B$  | You will need to continue decay if used in previous stage. Play with decay function,stages, and final LR value |
</code_context>
<issue_to_address>
**issue (typo):** Fix typo in "wih" → "with" in the Stage 1→2 row.

"wih" in this row is a typo; please change it to "with".

```suggestion
| Stage 1 → 2 | Weak | Naive CPT with $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ |
```
</issue_to_address>

### Comment 5
<location path="ALCF/notes/cpt.md" line_range="47" />
<code_context>
+
+As a result, we primarily adopt a **data-centric strategy** throughout these stages, resorting to learning-rate adjustments only when necessary.
+
+The dataset $D_0$ for pretraining is Olmo-mix and has 4 Trillion tokens, then $D_1$ has 2 Trillion tokens from Dolmino and fineweb Edu meaning the data distribution between these two stages is weak. We then have $D_2$ for stage 3 that has 1.5 trillion tokens from math, code, ans science papers. Finally, we have $D_3$ stage 4 made of 0.5 trillion tokens from reasoning traces. 
+
+| Stage | Dataset Symbol | Size | Source / Path | Notes |
</code_context>
<issue_to_address>
**issue (typo):** Correct "ans science papers" to "and science papers".

In the $D_2$ description, "ans science papers" is a typo and should be "and science papers".

```suggestion
The dataset $D_0$ for pretraining is Olmo-mix and has 4 Trillion tokens, then $D_1$ has 2 Trillion tokens from Dolmino and fineweb Edu meaning the data distribution between these two stages is weak. We then have $D_2$ for stage 3 that has 1.5 trillion tokens from math, code, and science papers. Finally, we have $D_3$ stage 4 made of 0.5 trillion tokens from reasoning traces. 
```
</issue_to_address>

### Comment 6
<location path="ALCF/notes/cpt.md" line_range="103" />
<code_context>
+ ```bash
+python3 mix_datasets.py --input 0.9 /flare/Aurora_deployment/AuroraGPT/datasets/papers/papers.txt 0.1 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/dolma_v1_7_file_list_v2.txt > ${debug_dir}/Megatron-DeepSpeed/ALCF/data-lists/aurora/mix_lucid_papers09_dolma01.txt
+```
+2. **Start building the buffer $B$** in prevision of the next stages.
+3. **Run CPT**: Load your checkpoint and run CPT with the --finetube flag.
+Note that you might need to convert your checkpoints following [these instructions](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/main/ALCF/notes/universal_checkpoint_bug.md) to a universal checkpoint.
</code_context>
<issue_to_address>
**suggestion (typo):** Replace "in prevision of" with a more natural phrase like "in preparation for".

The phrase "in prevision of the next stages" is non-idiomatic. Please use a more natural alternative, e.g. "in preparation for the next stages" or simply "for the next stages".

```suggestion
2. **Start building the buffer $B$** in preparation for the next stages.
```
</issue_to_address>

### Comment 7
<location path="ALCF/notes/cpt.md" line_range="104" />
<code_context>
+python3 mix_datasets.py --input 0.9 /flare/Aurora_deployment/AuroraGPT/datasets/papers/papers.txt 0.1 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/dolma_v1_7_file_list_v2.txt > ${debug_dir}/Megatron-DeepSpeed/ALCF/data-lists/aurora/mix_lucid_papers09_dolma01.txt
+```
+2. **Start building the buffer $B$** in prevision of the next stages.
+3. **Run CPT**: Load your checkpoint and run CPT with the --finetube flag.
+Note that you might need to convert your checkpoints following [these instructions](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/main/ALCF/notes/universal_checkpoint_bug.md) to a universal checkpoint.
+
</code_context>
<issue_to_address>
**issue (typo):** Likely typo: "--finetube" should be "--finetune".

If other commands use a `--finetune` flag, update this reference to match for consistency and to avoid confusion.

```suggestion
3. **Run CPT**: Load your checkpoint and run CPT with the --finetune flag.
```
</issue_to_address>

### Comment 8
<location path="ALCF/notes/cpt.md" line_range="79" />
<code_context>
+##### Strategy 1: No replay
+`Important: USE A CHECKPOINT AT LR=LR_max i.e. BEFORE COOLING DOWN`.
+
+Naively continue training with $D_1$, no replay data. 
+- Continue training using only the current dataset $D_1$
+- No replay from $D_0$ or buffer data
+This may be sufficient under weak distribution shift but there is potential risks of forgetting
+
+##### Strategy 2: Replay from pretraining dataset
</code_context>
<issue_to_address>
**suggestion (typo):** Fix subject–verb agreement in "there is potential risks".

Change to either "there are potential risks of forgetting" or "there is a potential risk of forgetting" for correct subject–verb agreement.

```suggestion
This may be sufficient under weak distribution shift but there are potential risks of forgetting
```
</issue_to_address>

### Comment 9
<location path="ALCF/notes/cpt.md" line_range="165" />
<code_context>
+  - adjusting the data-mixing strategy by **increasing the weight of pretraining data**.
+
+At the end of this stage, we have $D^{CPT}_2$.
+#### Stage 3 to stage 4 (shift to reasoning tracex)
+![stage 3 to 4](./assets/cpt_images/strategy_cpt_stage3tostage4-1.png)
+At this point, we only have ~6% of training left and one should start the final decay.
</code_context>
<issue_to_address>
**issue (typo):** Fix typo "tracex" → "traces".

```suggestion
#### Stage 3 to stage 4 (shift to reasoning traces)
```
</issue_to_address>

### Comment 10
<location path="ALCF/notes/cpt.md" line_range="184" />
<code_context>
+
+***If we did use Strategy 4 above:***
+We should keep decaying with $D^{CPT}_2$ until $LR_3/100$ then introduce the new mix at $LR_3/5$
+![stage 3 to 4 previous devay](./assets/cpt_images/strategy3_cpt_stage3tostage4ifprevdecay-1.png)
+
+
</code_context>
<issue_to_address>
**issue (typo):** Fix typo "devay" → "decay" in the image alt-text.

For clarity, update the alt text phrase to "previous decay".

```suggestion
![stage 3 to 4 previous decay](./assets/cpt_images/strategy3_cpt_stage3tostage4ifprevdecay-1.png)
```
</issue_to_address>

### Comment 11
<location path="ALCF/notes/large_batch_optimizers_settings.md" line_range="27" />
<code_context>
+
+### Adding custom optimizers
+To add a custom optimizer, you have to modify the following files:
+- `megatron/optimizer/__init__.py`: [muon example](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/994f2a129d465cc50e6c35af075eb3292874effe/megatron/optimizer/__init__.py#L434), note that you either heve to import the optimizer from a pre-installed package or add it in the `megatron/optimizer/` folder.
+- `megatron/arguments.py`: [optimizer arguments](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/994f2a129d465cc50e6c35af075eb3292874effe/megatron/arguments.py#L1070), to add the optimizer arguments
+- `megatron/arguments.py`: [list of valid optimizers](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/994f2a129d465cc50e6c35af075eb3292874effe/megatron/arguments.py#L1485), to add the new optimizer to the list of valid optimizers
</code_context>
<issue_to_address>
**issue (typo):** Correct "heve" → "have".

```suggestion
- `megatron/optimizer/__init__.py`: [muon example](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/994f2a129d465cc50e6c35af075eb3292874effe/megatron/optimizer/__init__.py#L434), note that you either have to import the optimizer from a pre-installed package or add it in the `megatron/optimizer/` folder.
```
</issue_to_address>

### Comment 12
<location path="tools/cooldown_generator/make_cooldown_cmds.py" line_range="42" />
<code_context>
+    return f"{x:.8f}".rstrip("0").rstrip(".")
+
+
+def get_total_iters_from_cooldown_percent(
+    checkpoint_iter: Optional[int] = None,
+    cooldown_percent: Optional[float] = None,
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying this script by removing unused helpers, fixing and streamlining header handling, extracting checkpoint file updates into a helper, and adding a thin wrapper around `build_command` for the common cooldown case to make the main flow easier to follow.

1. `get_total_iters_from_cooldown_percent` is unused and adds complexity  
If you don’t have a concrete caller yet, it’s just mental overhead. You can safely drop it now and reintroduce a focused version when needed:

```python
# Remove this entire function for now:
# def get_total_iters_from_cooldown_percent(...):
#     ...
```

If you plan to use it soon, consider a much narrower helper that matches how this script actually computes things now, e.g.:

```python
def compute_total_iters_and_frac(S: int, R: int) -> tuple[int, float]:
    T = S + R
    return T, S / T
```

and then use that in `main()` instead of recomputing `T` and `f` inline.


2. Header handling has a subtle bug and can be simplified  
`lines.extend("\n".join(args.include_header.split("\n")))` extends by characters, not lines. You can fix that and also normalize to a single header string:

```python
# Build a header string once
if args.include_header is None:
    header_str = get_header_template(
        queue=args.queue,
        project=args.project,
        walltime=args.walltime,
        filesystems=args.filesystems,
        nodes=args.nodes,
    )
else:
    hfp = Path(args.include_header)
    if hfp.is_file():
        header_str = hfp.read_text()
    else:
        header_str = args.include_header

lines = []
if header_str:
    # Ensure trailing newline
    if not header_str.endswith("\n"):
        header_str += "\n"
    lines.append(header_str)
```

Then when writing scripts:

```python
with open(outfile, "w") as f:
    f.writelines(lines)
    f.write(block + "\n")
```

This fixes the bug and keeps header behavior explicit and simple.


3. Extract checkpoint metadata sync out of the main loop  
The per-record loop mixes command generation and checkpoint file mutation. Extracting the latter into a helper reduces cognitive load without changing behavior:

```python
def ensure_checkpoint_pointers(load_path: str, S: int) -> None:
    ckpt_parent = Path(load_path).parent
    latest_fp = ckpt_parent / "latest"
    latest_ckpt_iter = ckpt_parent / "latest_checkpointed_iteration.txt"

    if latest_fp.is_file():
        logger.info(f"Found 'latest' in {ckpt_parent}!")
        with latest_fp.open("r") as f:
            _latest = f.read().rstrip("\n").lstrip("global_step")
        assert int(_latest) == int(S), f"{_latest=} != {S=}"
    else:
        logger.info(f"No 'latest' in {ckpt_parent}!")
        logger.info(f"Writing global_step{S} to {latest_fp}")
        with latest_fp.open("w") as f:
            f.write(f"global_step{S}")

    if latest_ckpt_iter.is_file():
        logger.info(f"Found 'latest_checkpointed_iteration.txt' in {ckpt_parent}!")
        with latest_ckpt_iter.open("r") as f:
            _latest = f.read().rstrip("\n")
        assert int(_latest) == int(S), f"{_latest=} != {S=}"
    else:
        logger.info(f"No 'latest_checkpointed_iteration.txt' in {ckpt_parent}!")
        logger.info(f"Writing {S} to {latest_ckpt_iter}")
        with latest_ckpt_iter.open("w") as f:
            f.write(f"{S}")
```

and in the loop:

```python
for rec in records:
    cid, S, R = rec["id"], rec["S"], rec["R"]
    T = S + R
    f = S / T
    tag = f"# id={cid} resume_step={S} cooldown_steps={R} total_iters={T} frac={fmt_float(f)}"

    ensure_checkpoint_pointers(args.load, S)

    cmd = build_command(
        load_path=args.load,
        data_file_list=args.data_file_list,
        train_script=args.train_script,
        train_iters=T,
        lr_cooldown_frac=f,
        grad_acc_steps=args.grad_acc_steps,
        opt=args.opt,
        min_lr=args.min_lr,
        override_ckpt_opt_param=override_flag,
        extra_args=args.extra_args.strip(),
        # other options unchanged...
    )
    ...
```

The behavior is identical, but the main loop reads as a higher-level sequence of steps.


4. `build_command` is very general compared to how it’s used  
Right now `main()` only drives the `train_iters` path and a small subset of the parameters. To keep the generality without making the core harder to follow, you can factor the “cooldown sweep common case” into a thin wrapper and keep the existing `build_command` as a lower-level primitive:

```python
def build_cooldown_command(
    load_path: str,
    data_file_list: str,
    train_script: str,
    T: int,
    f: float,
    grad_acc_steps: int,
    opt: str,
    min_lr: float,
    override_ckpt_opt_param: bool,
    extra_args: str,
) -> str:
    return build_command(
        load_path=load_path,
        data_file_list=data_file_list,
        train_script=train_script,
        train_iters=T,
        lr_cooldown_frac=f,
        grad_acc_steps=grad_acc_steps,
        opt=opt,
        min_lr=min_lr,
        override_ckpt_opt_param=override_ckpt_opt_param,
        extra_args=extra_args,
        # leave other parameters at their defaults
    )
```

and in `main()`:

```python
cmd = build_cooldown_command(
    load_path=args.load,
    data_file_list=args.data_file_list,
    train_script=args.train_script,
    T=T,
    f=f,
    grad_acc_steps=args.grad_acc_steps,
    opt=args.opt,
    min_lr=args.min_lr,
    override_ckpt_opt_param=override_flag,
    extra_args=args.extra_args.strip(),
)
```

This keeps all existing knobs intact for future use, but makes the primary flow much easier to understand.
</issue_to_address>


Comment on lines +162 to +163
assert global_batch_size is not None and sequence_length is not None
train_iters = train_tokens * global_batch_size * sequence_length

issue (bug_risk): Train iterations from train_tokens should divide by tokens-per-step, not multiply.

In the `train_tokens` branch you currently compute:

```python
train_iters = train_tokens * global_batch_size * sequence_length
```

If `train_tokens` is the total desired tokens, the correct step count is:

```python
tokens_per_step = global_batch_size * sequence_length
train_iters = train_tokens // tokens_per_step  # or float division + rounding
```

Multiplying by `global_batch_size * sequence_length` yields an incorrect (and massively inflated) number of steps.
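The corrected conversion can be sketched as a small helper (names follow the snippet above; the rounding choice is up to the caller):

```python
def iters_from_tokens(train_tokens: int, global_batch_size: int, sequence_length: int) -> int:
    """Convert a total token budget into a number of training steps."""
    tokens_per_step = global_batch_size * sequence_length
    # Floor division slightly undershoots the budget; use a ceiling instead
    # if the full token count must be consumed.
    return train_tokens // tokens_per_step

# 2T tokens at GBS=1024, seq_len=4096 -> 476_837 steps
print(iters_from_tokens(2_000_000_000_000, 1024, 4096))
```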

Comment on lines +271 to +276
if args.include_header:
if (hfp := Path(args.include_header)).is_file():
with hfp.open("r") as f:
lines.extend(f.readlines())
else:
lines.extend("\n".join(args.include_header.split("\n")))

issue (bug_risk): Using list.extend on a joined header string appends characters instead of lines.

In the non-file case you currently do:

```python
lines.extend("\n".join(args.include_header.split("\n")))
```

This extends `lines` with individual characters, not header lines, so the output script will be malformed.

Use either:

```python
lines.append(args.include_header)
```

if `include_header` is a full block, or:

```python
lines.extend(args.include_header.splitlines(keepends=True))
```

if you need to preserve line boundaries and newlines.
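The difference between the three patterns is easy to demonstrate in isolation (the header string here is illustrative):

```python
header = "#!/bin/bash\n#PBS -q debug"

# Buggy pattern: list.extend iterates the string character by character.
buggy: list = []
buggy.extend("\n".join(header.split("\n")))

# Fix 1: append the whole block as a single element.
appended: list = []
appended.append(header)

# Fix 2: extend with real lines, preserving newlines.
lined: list = []
lined.extend(header.splitlines(keepends=True))

print(len(buggy), len(appended), len(lined))  # 25 1 2
```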

# die "--cool-R (cooldown steps) or --cooldown-percent required"
# fi

# [[ -n "${COOL_R}" ]] || die "--cool-R (cooldown steps) is required"

issue (bug_risk): Cooldown steps (`COOL_R`) can be left empty but are still passed to `make_cooldown_cmds`.

The guard requiring `COOL_R` was commented out:

```bash
# [[ -n "${COOL_R}" ]] || die "--cool-R (cooldown steps) is required"
```

but `--cooldown-steps "${COOL_R}"` is still passed to `make_cooldown_cmds.py` in both exact and rollback scripts. If `COOL_R` is unset/empty, the Python script receives an empty value for a required integer argument and will fail or misbehave.

Please either restore the guard or define a safe default for `COOL_R` at the top of the script.
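A minimal sketch of the restored guard, assuming the script's existing `die` helper and `COOL_R` variable (the fallback default of 1000 steps is purely illustrative):

```shell
# die() mirrors the helper already defined in gen_cooldown_sweep.sh
die() { echo "ERROR: $*" >&2; exit 1; }

COOL_R="${COOL_R:-}"
if [[ -z "${COOL_R}" ]]; then
    # Option A: fail fast, as the original guard did
    # die "--cool-R (cooldown steps) is required"
    # Option B: fall back to an explicit, documented default
    COOL_R=1000
fi
echo "cooldown steps: ${COOL_R}"
```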

Comment thread ALCF/notes/cpt.md

| CPT Stages | Distribution shift | Primary strategy | Fallbacks | Notes |
|------|--------------------|------------------|-------------|-------|
| Stage 1 → 2 | Weak | Naive CPT wih $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ |

issue (typo): Fix typo in "wih" → "with" in the Stage 1→2 row.

"wih" in this row is a typo; please change it to "with".

Suggested change
| Stage 1 → 2 | Weak | Naive CPT wih $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ |
| Stage 1 → 2 | Weak | Naive CPT with $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ |

Comment thread ALCF/notes/cpt.md

As a result, we primarily adopt a **data-centric strategy** throughout these stages, resorting to learning-rate adjustments only when necessary.

The dataset $D_0$ for pretraining is Olmo-mix and has 4 Trillion tokens, then $D_1$ has 2 Trillion tokens from Dolmino and fineweb Edu meaning the data distribution between these two stages is weak. We then have $D_2$ for stage 3 that has 1.5 trillion tokens from math, code, ans science papers. Finally, we have $D_3$ stage 4 made of 0.5 trillion tokens from reasoning traces.

issue (typo): Correct "ans science papers" to "and science papers".

In the $D_2$ description, "ans science papers" is a typo and should be "and science papers".

Suggested change
The dataset $D_0$ for pretraining is Olmo-mix and has 4 Trillion tokens, then $D_1$ has 2 Trillion tokens from Dolmino and fineweb Edu meaning the data distribution between these two stages is weak. We then have $D_2$ for stage 3 that has 1.5 trillion tokens from math, code, ans science papers. Finally, we have $D_3$ stage 4 made of 0.5 trillion tokens from reasoning traces.
The dataset $D_0$ for pretraining is Olmo-mix and has 4 Trillion tokens, then $D_1$ has 2 Trillion tokens from Dolmino and fineweb Edu meaning the data distribution between these two stages is weak. We then have $D_2$ for stage 3 that has 1.5 trillion tokens from math, code, and science papers. Finally, we have $D_3$ stage 4 made of 0.5 trillion tokens from reasoning traces.

Comment thread ALCF/notes/cpt.md
Naively continue training with $D_1$, no replay data.
- Continue training using only the current dataset $D_1$
- No replay from $D_0$ or buffer data
This may be sufficient under weak distribution shift but there is potential risks of forgetting

suggestion (typo): Fix subject–verb agreement in "there is potential risks".

Change to either "there are potential risks of forgetting" or "there is a potential risk of forgetting" for correct subject–verb agreement.

Suggested change
This may be sufficient under weak distribution shift but there is potential risks of forgetting
This may be sufficient under weak distribution shift but there are potential risks of forgetting

Comment thread ALCF/notes/cpt.md
- adjusting the data-mixing strategy by **increasing the weight of pretraining data**.

At the end of this stage, we have $D^{CPT}_2$.
#### Stage 3 to stage 4 (shift to reasoning tracex)

issue (typo): Fix typo "tracex" → "traces".

Suggested change
#### Stage 3 to stage 4 (shift to reasoning tracex)
#### Stage 3 to stage 4 (shift to reasoning traces)

Comment thread ALCF/notes/cpt.md

***If we did use Strategy 4 above:***
We should keep decaying with $D^{CPT}_2$ until $LR_3/100$ then introduce the new mix at $LR_3/5$
![stage 3 to 4 previous devay](./assets/cpt_images/strategy3_cpt_stage3tostage4ifprevdecay-1.png)

issue (typo): Fix typo "devay" → "decay" in the image alt-text.

For clarity, update the alt text phrase to "previous decay".

Suggested change
![stage 3 to 4 previous devay](./assets/cpt_images/strategy3_cpt_stage3tostage4ifprevdecay-1.png)
![stage 3 to 4 previous decay](./assets/cpt_images/strategy3_cpt_stage3tostage4ifprevdecay-1.png)


### Adding custom optimizers
To add a custom optimizer, you have to modify the following files:
- `megatron/optimizer/__init__.py`: [muon example](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/994f2a129d465cc50e6c35af075eb3292874effe/megatron/optimizer/__init__.py#L434), note that you either heve to import the optimizer from a pre-installed package or add it in the `megatron/optimizer/` folder.

issue (typo): Correct "heve" → "have".

Suggested change
- `megatron/optimizer/__init__.py`: [muon example](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/994f2a129d465cc50e6c35af075eb3292874effe/megatron/optimizer/__init__.py#L434), note that you either heve to import the optimizer from a pre-installed package or add it in the `megatron/optimizer/` folder.
- `megatron/optimizer/__init__.py`: [muon example](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/994f2a129d465cc50e6c35af075eb3292874effe/megatron/optimizer/__init__.py#L434), note that you either have to import the optimizer from a pre-installed package or add it in the `megatron/optimizer/` folder.

return f"{x:.8f}".rstrip("0").rstrip(".")


def get_total_iters_from_cooldown_percent(

issue (complexity): Consider simplifying this script by removing unused helpers, fixing and streamlining header handling, extracting checkpoint file updates into a helper, and adding a thin wrapper around build_command for the common cooldown case to make the main flow easier to follow.

1. `get_total_iters_from_cooldown_percent` is unused and adds complexity.
   If you don't have a concrete caller yet, it's just mental overhead. You can safely drop it now and reintroduce a focused version when needed:

```python
# Remove this entire function for now:
# def get_total_iters_from_cooldown_percent(...):
#     ...
```

If you plan to use it soon, consider a much narrower helper that matches how this script actually computes things now, e.g.:

```python
def compute_total_iters_and_frac(S: int, R: int) -> tuple[int, float]:
    T = S + R
    return T, S / T
```

and then use that in `main()` instead of recomputing `T` and `f` inline.

2. Header handling has a subtle bug and can be simplified.
   `lines.extend("\n".join(args.include_header.split("\n")))` extends by characters, not lines. You can fix that and also normalize to a single header string:

```python
# Build a header string once
if args.include_header is None:
    header_str = get_header_template(
        queue=args.queue,
        project=args.project,
        walltime=args.walltime,
        filesystems=args.filesystems,
        nodes=args.nodes,
    )
else:
    hfp = Path(args.include_header)
    if hfp.is_file():
        header_str = hfp.read_text()
    else:
        header_str = args.include_header

lines = []
if header_str:
    # Ensure trailing newline
    if not header_str.endswith("\n"):
        header_str += "\n"
    lines.append(header_str)
```

Then when writing scripts:

```python
with open(outfile, "w") as f:
    f.writelines(lines)
    f.write(block + "\n")
```

This fixes the bug and keeps header behavior explicit and simple.



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ff45fa8ed9


p.add_argument("--min-lr", type=float, default=2e-5)
p.add_argument("--no-override-ckpt-opt", action="store_true")
p.add_argument("--extra-args", default="")
p.add_argument("--emit-sh", action="store_true", default=None)

P1: Accept file argument for `--emit-sh`

--emit-sh is declared with action="store_true", so argparse treats it as a boolean flag and does not accept a value, but both the README examples and gen_cooldown_sweep.sh call it as --emit-sh <path>. In that common usage, the path is parsed as an unexpected positional argument and command generation aborts, which breaks the documented per-file cooldown workflow.

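One way to reconcile the flag with the documented `--emit-sh <path>` usage is `nargs="?"` with a `const` fallback. This is a sketch, not the repo's actual fix, and the default file name is made up:

```python
import argparse

p = argparse.ArgumentParser()
# `--emit-sh out.sh` writes to out.sh; a bare `--emit-sh` uses the const;
# omitting the flag entirely leaves it None.
p.add_argument("--emit-sh", nargs="?", const="cooldown.sh", default=None)

print(p.parse_args(["--emit-sh", "sweep_01.sh"]).emit_sh)  # sweep_01.sh
print(p.parse_args(["--emit-sh"]).emit_sh)                 # cooldown.sh
print(p.parse_args([]).emit_sh)                            # None
```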

override_ckpt_opt_param=override_flag,
extra_args=args.extra_args.strip(),
)
latest_fp = Path(args.load).parent.joinpath("latest")

P1: Write tracker files inside the load directory

The script computes tracker paths with Path(args.load).parent, which moves one level above the --load directory. For the documented --load /.../checkpoints_parent usage, this reads/writes latest in the parent directory instead of the checkpoint directory, so generated runs can resume from the wrong checkpoint and the script may modify unrelated files.

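The off-by-one level is easy to see with `pathlib` directly (the directory name below is illustrative):

```python
from pathlib import Path

load = Path("/flare/AuroraGPT/checkpoints_parent")

# Current code: resolves tracker files one level above --load.
wrong = load.parent / "latest"
print(wrong.as_posix())  # /flare/AuroraGPT/latest

# Treating --load itself as the checkpoint root keeps them inside it.
right = load / "latest"
print(right.as_posix())  # /flare/AuroraGPT/checkpoints_parent/latest
```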

logger.info(f"Found 'latest' in {ckpt_parent}!")
with latest_fp.open("r") as f:
_latest = f.read().rstrip("\n").lstrip("global_step")
assert int(_latest) == int(S), f"{_latest=} != {S=}"

P1: Allow sweeps over multiple checkpoint iterations

This assertion requires the existing latest tracker to equal each requested S, so generating commands for multiple checkpoint iterations (or any non-current checkpoint) fails immediately. The script is documented to emit grids/sweeps across many -S values, but this check makes that workflow unusable unless all records share the same checkpoint.

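A hedged sketch of a non-fatal variant: warn on mismatch instead of asserting, so sweeps over several `-S` values can still be generated. The `latest` file layout and `logger` follow the snippet above; the `strict` flag is an invention for illustration:

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

def check_latest(ckpt_parent: Path, S: int, strict: bool = False) -> bool:
    """Compare the 'latest' tracker with the requested resume step S."""
    latest_fp = ckpt_parent / "latest"
    if not latest_fp.is_file():
        return True
    recorded = int(latest_fp.read_text().strip().removeprefix("global_step"))
    if recorded == S:
        return True
    msg = f"'latest' points at step {recorded}, but {S} was requested"
    if strict:
        raise AssertionError(msg)
    logger.warning(msg)
    return False
```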


Copilot AI left a comment


Pull request overview

This PR updates AuroraGPT’s ALCF documentation and training utilities by expanding CPT guidance, adding a cooldown-command generator workflow, and adjusting launch/setup scripts to better support ongoing large-scale training and checkpoint cooldown experiments.

Changes:

  • Major rewrite/expansion of ALCF/notes/cpt.md into a stage-by-stage CPT strategy cookbook (and removal of embedded helper script content).
  • Added a new tools/cooldown_generator/ workflow (Python + shell) to generate Megatron-DeepSpeed cooldown commands/scripts from checkpoint steps or token milestones.
  • Updated various ALCF training/launcher scripts and notes (new stage scripts, launcher command tweak, dependency installation behavior, and logging tweaks).

Reviewed changes

Copilot reviewed 19 out of 28 changed files in this pull request and generated 20 comments.

Summary per file:

| File | Description |
|------|-------------|
| `train_alcf.sh` | Disables in-script ezpz install block and MPI cleanup call. |
| `train_aGPT_2B_sophiag_stage2.sh` | New PBS launch script for 2B SophiaG stage 2. |
| `train_aGPT_2B_sophiag_stage3.sh` | New PBS launch script for 2B SophiaG stage 3. |
| `train_aGPT_2B_large_batch.sh` | Updates optimizer selection and removes old optimizer-plans comment block. |
| `tools/cooldown_generator/make_cooldown_cmds.py` | New Python generator for cooldown commands/scripts. |
| `tools/cooldown_generator/gen_cooldown_sweep.sh` | New shell wrapper to generate per-ID cooldown scripts from a TSV. |
| `tools/cooldown_generator/build_checkpoints_from_tokens.py` | New utility to map token milestones → step checkpoints (+ rollback offsets). |
| `tools/cooldown_generator/README.md` | New documentation for cooldown generator tools and workflows. |
| `pretrain_gpt_alcf.py` | Makes W&B finish conditional on wandb being non-None. |
| `megatron/training.py` | Normalizes ezpz usage (`ezpz.*` / `ezpz.dist.*`) and minor formatting cleanups. |
| `megatron/timers.py` | Switches timer output from `print(..., flush=True)` to ezpz logger. |
| `ALCF/notes/lb_optimizers_settings.md` | Removes older optimizer/settings note doc (replaced by new doc). |
| `ALCF/notes/large_batch_optimizers_settings.md` | Adds updated large-batch optimizer + LR-finder documentation and assets reference. |
| `ALCF/notes/cpt.md` | Expanded CPT "strategy cookbook" for AuroraGPT V1, includes stage guidance and mixing strategies. |
| `ALCF/notes/assets/lb_optimizers/readme.md` | Adds placeholder/readme for optimizer-doc images. |
| `ALCF/notes/assets/cpt_images/readme.md` | Adds placeholder/readme for CPT-doc images. |
| `ALCF/helpers.sh` | Comments out dependency install step; simplifies launcher command; whitespace/indent cleanup. |
| `ALCF/data-lists/aurora/nvidia-math1-code2.txt` | Adds a new dataset file list for math/code stage data. |
| `ALCF/data-lists/aurora/dolmino-mix-1124-fused-file-list.txt` | Adds a new fused Dolmino mix file list. |
| `.gitignore` | Adds un-ignore rules for `tools/cooldown_generator/`. |


with hfp.open("r") as f:
lines.extend(f.readlines())
else:
lines.extend("\n".join(args.include_header.split("\n")))

Copilot AI Mar 11, 2026


When args.include_header is a literal header string, lines.extend("\n".join(...)) extends the list with individual characters (because extend iterates the string). This bloats memory and is easy to misuse later. Use lines.append(args.include_header) or lines.extend(args.include_header.splitlines(keepends=True)) instead.

Suggested change
lines.extend("\n".join(args.include_header.split("\n")))
lines.append("\n".join(args.include_header.split("\n")))

MODEL_ARCH=AuroraGPT-2B \
OPT=muonclip \
OPT=sophiag \
LR=2.28e-5 \

Copilot AI Mar 11, 2026


OPT is changed to sophiag but LR remains 2.28e-5 (which appears to be the previous muonclip setting). Other sophiag launch scripts in this repo use LR=2.17e-5, so this looks like a copy/paste mismatch. Please confirm the intended LR for sophiag and update accordingly.

Suggested change
LR=2.28e-5 \
LR=2.17e-5 \

@@ -0,0 +1,200 @@
# Megatron-DeepSpeed, optimizers, hyperparameters
`**Important** For large batch training with infinite schedulers, It is crucial to tune the learning rate as these schedulers benefit from larger learning rate values. See below for the `lr_finder` routine implemented in MDS to do so.`

Copilot AI Mar 11, 2026


The opening “Important” note is wrapped in backticks, but it also contains an inline-code span for lr_finder, which breaks Markdown rendering (nested backticks). Consider using bold/blockquote formatting for the note and reserving backticks only for the lr_finder identifier.

Suggested change
`**Important** For large batch training with infinite schedulers, It is crucial to tune the learning rate as these schedulers benefit from larger learning rate values. See below for the `lr_finder` routine implemented in MDS to do so.`
> **Important** For large batch training with infinite schedulers, it is crucial to tune the learning rate as these schedulers benefit from larger learning rate values. See below for the `lr_finder` routine implemented in MDS to do so.

Comment thread ALCF/notes/cpt.md

| CPT Stages | Distribution shift | Primary strategy | Fallbacks | Notes |
|------|--------------------|------------------|-------------|-------|
| Stage 1 → 2 | Weak | Naive CPT wih $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ |

Copilot AI Mar 11, 2026


Typo in table: “Naive CPT wih $D_1$” → “with”.

Suggested change
| Stage 1 → 2 | Weak | Naive CPT wih $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ |
| Stage 1 → 2 | Weak | Naive CPT with $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ |

Comment on lines +296 to +298
latest_fp = Path(args.load).parent.joinpath("latest")
latest_ckpt_iter = Path(args.load).parent.joinpath("latest_checkpointed_iteration.txt")
ckpt_parent = Path(args.load).parent

Copilot AI Mar 11, 2026


--load is documented/used as the checkpoint parent directory, but the code uses Path(args.load).parent as the checkpoint directory for latest files. This will write/read latest one directory above the intended checkpoint root (and could corrupt unrelated directories). Use ckpt_parent = Path(args.load) (and derive latest* paths from that).

Suggested change
latest_fp = Path(args.load).parent.joinpath("latest")
latest_ckpt_iter = Path(args.load).parent.joinpath("latest_checkpointed_iteration.txt")
ckpt_parent = Path(args.load).parent
ckpt_parent = Path(args.load)
latest_fp = ckpt_parent.joinpath("latest")
latest_ckpt_iter = ckpt_parent.joinpath("latest_checkpointed_iteration.txt")

Comment on lines +71 to +75
# [[ -n "${COOL_R}" ]] || die "--cool-R (cooldown steps) is required"
[[ -n "${LOAD_PATH}" ]] || die "--load path is required"
[[ -n "${PHASE1_LIST}" ]] || die "--phase1-list is required (IDs 1..4)"
[[ -n "${PHASE2_LIST}" ]] || die "--phase2-list is required (IDs 5..7)"


Copilot AI Mar 11, 2026


COOL_R is used to populate --cooldown-steps later, but validation is commented out, so running without --cool-R will emit broken make_cooldown_cmds.py invocations. Reinstate a hard check that --cool-R is provided (or add alternate logic to compute cooldown steps from --cooldown-percent).


## Hyperparameter tuning
#### Init variance
Weight initialization is key to training LLMs,and to avoid spikes in losses. Here, we initialize the weights following this [paper](https://arxiv.org/pdf/2312.16903). The default variance value at initialization is 0.02. To add custom variances, one can use `--init-method-std, `--adjust-word-embedding-init`, and `--word-embedding-init-std`. For our runs, we do

Copilot AI Mar 11, 2026


This sentence has broken inline-code formatting: one can use `--init-method-std, `--adjust-word-embedding-init`, .... The backticks are mismatched, so Markdown will render incorrectly. Use separate properly-closed inline-code spans for each flag.

Suggested change
Weight initialization is key to training LLMs,and to avoid spikes in losses. Here, we initialize the weights following this [paper](https://arxiv.org/pdf/2312.16903). The default variance value at initialization is 0.02. To add custom variances, one can use `--init-method-std, `--adjust-word-embedding-init`, and `--word-embedding-init-std`. For our runs, we do
Weight initialization is key to training LLMs,and to avoid spikes in losses. Here, we initialize the weights following this [paper](https://arxiv.org/pdf/2312.16903). The default variance value at initialization is 0.02. To add custom variances, one can use `--init-method-std`, `--adjust-word-embedding-init`, and `--word-embedding-init-std`. For our runs, we do

Comment on lines +13 to +15
# shellcheck disable=SC1090
source <(curl -L https://bit.ly/ezpz-utils)
ezpz_setup_env

Copilot AI Mar 11, 2026


Using source <(curl -L https://bit.ly/ezpz-utils) here downloads and executes a remote shell script at job runtime with no version pinning or integrity verification. If the bit.ly link or its upstream hosting is ever compromised, an attacker can run arbitrary code in your training environment with access to cluster credentials and data. Vendor this helper script into the repo or fetch it via a pinned, checksummed artifact instead of sourcing an unpinned URL at runtime.

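One mitigation the comments suggest can be sketched as a fetch-once, verify-then-source pattern. Everything below is illustrative: the helper file is a stand-in, and in practice the pinned SHA-256 would be hardcoded for a specific released version of the ezpz utilities rather than computed on the fly:

```shell
helper=/tmp/ezpz-utils-demo.sh
# Stand-in for the vendored/downloaded helper script.
printf 'ezpz_setup_env() { echo "env ready"; }\n' > "${helper}"

# In real use this hash is pinned in the repo, not derived at runtime.
pinned_sha=$(sha256sum "${helper}" | awk '{print $1}')

echo "${pinned_sha}  ${helper}" | sha256sum -c - >/dev/null \
    || { echo "checksum mismatch; refusing to source" >&2; exit 1; }

# shellcheck disable=SC1090
source "${helper}"
ezpz_setup_env
```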
Comment on lines +13 to +15
# shellcheck disable=SC1090
source <(curl -L https://bit.ly/ezpz-utils)
ezpz_setup_env

Copilot AI Mar 11, 2026


This script also uses source <(curl -L https://bit.ly/ezpz-utils) to pull and execute remote shell code at runtime without any integrity or version pinning. A compromise of that short URL or its backing content would give an attacker arbitrary code execution inside AuroraGPT training jobs. Check in a local copy of the helper and/or use a pinned, checksummed distribution rather than sourcing from a mutable URL.

Single command to test and run Megatron-DeepSpeed:

```bash
now=$(date +'%Y-%m-%d-%H%M%S') && debug_dir="${now}" && mkdir -p "${debug_dir}"&& cd "${debug_dir}"&& git clone https://github.com/argonne-lcf/Megatron-DeepSpeed && cd Megatron-DeepSpeed && source <(curl -L https://bit.ly/ezpz-utils) && ezpz_setup_env && python3 -m pip install --require-virtualenv "git+https://github.com/saforem2/ezpz" "numpy<2" deepspeed tensorboard && ezpz-test && DATA_FILE_LIST=ALCF/data-lists/aurora/books.txt bash train_alcf.sh

Copilot AI Mar 11, 2026


The "single command" snippet both sources https://bit.ly/ezpz-utils and installs ezpz via python3 -m pip install "git+https://github.com/saforem2/ezpz" without pinning or integrity checks. Anyone who controls the bit.ly target or the GitHub repo (or can intercept those HTTPS connections) can ship malicious code that runs with the user's environment and cluster credentials. Prefer vendoring the helper script and pinning the Python package to a specific version/commit (and optionally verifying a hash) instead of installing from mutable URLs in a recommended one-liner.
