
Initial commit with updates to cpt.md and new README #19

Merged
saforem2 merged 79 commits into saforem2:main from argonne-lcf:main
Mar 11, 2026
Conversation

saforem2 (Owner) commented Mar 11, 2026

Copilot Summary

This pull request makes several improvements and updates to the CPT (Continual Pre-Training) documentation and related scripts for AuroraGPT, focusing on providing a comprehensive strategy guide, cleaning up helper scripts, and improving documentation organization. The main changes include a major rewrite and expansion of the cpt.md strategy cookbook, removal of an embedded script, minor script refactoring, and the addition of documentation assets.

Major documentation update:

  • The ALCF/notes/cpt.md file has been rewritten and significantly expanded into a detailed CPT strategy cookbook for AuroraGPT V1. It now covers data-centric and optimization strategies, recommended mixing weights, stage-by-stage guidance, and practical tips for handling distribution shifts during continual pre-training. The update also includes new diagrams and links to relevant assets. An embedded script (mix_datasets.py) was removed from the documentation, with references to its GitHub location instead.

Script and helper improvements:

  • In ALCF/helpers.sh, the call to install_dependencies in the setup() function is now commented out, likely to avoid redundant installations during repeated runs.
  • The setupLauncher() function in ALCF/helpers.sh has been simplified: the launcher command is updated to use ezpz launch instead of ezpz-launch, and commented-out legacy logic has been removed for clarity.
  • Minor whitespace cleanup in set_ccl_vars_on_aurora() in ALCF/helpers.sh.

Documentation assets:

  • Added placeholder readme files to ALCF/notes/assets/cpt_images and ALCF/notes/assets/lb_optimizers to clarify the purpose of images used in the CPT and optimizer documentation.

These changes collectively improve the clarity, maintainability, and usability of the CPT workflow and documentation for AuroraGPT.



Summary by Sourcery

Document continual pre-training strategies and large-batch optimizer guidance for AuroraGPT, adjust training and helper scripts to use ezpz utilities and new optimizer/scheduler settings, and add tooling to generate LR cooldown commands and stage-specific training jobs.

New Features:

  • Introduce a cooldown command generation toolkit under tools/cooldown_generator to compute checkpoint/cooldown schedules from token budgets and emit ready-to-run Megatron-DeepSpeed commands and per-ID sweep scripts.
  • Add stage-2 and stage-3 AuroraGPT-2B training PBS scripts configured for sophiag optimizer, constant LR with cooldown, and appropriate data lists and tokenizer settings.
  • Add new Aurora-specific fused data-list files for Dolmino and math/code mixtures used in later training stages.
  • Add a CPT image and large-batch optimizer image asset structure to support the new documentation.
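
The token-budget-to-step mapping that build_checkpoints_from_tokens.py presumably relies on follows the standard relation below (a sketch under the assumption that global batch size and sequence length are the only factors; the function name here is illustrative, not the tool's actual API):

```python
def tokens_to_iters(train_tokens: int, global_batch_size: int, sequence_length: int) -> int:
    """Convert a total token budget into a training iteration count.

    Each optimizer step consumes global_batch_size * sequence_length tokens,
    so the step count is the token budget divided by tokens per step.
    """
    tokens_per_step = global_batch_size * sequence_length
    return train_tokens // tokens_per_step

# e.g. a 1-trillion-token milestone at GBS=1024, seq_len=4096:
# tokens_to_iters(1_000_000_000_000, 1024, 4096) -> 238418
```

Note that this divides by tokens-per-step; multiplying instead is exactly the bug flagged in the review comments below on make_cooldown_cmds.py.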

Enhancements:

  • Refine Megatron training utilities to use the ezpz API for rank, device, profiling, and timing, and improve assertion formatting and logging behavior for timers and wandb integration.
  • Simplify ALCF helper scripts by disabling automatic dependency installation, updating the launcher to use ezpz launch, and minor cleanup in model selection and environment configuration.
  • Retune the 2B large-batch training script to use the sophiag optimizer by default instead of muonclip.
  • Guard wandb usage in pretrain_gpt_alcf.py to avoid errors when wandb is unavailable.

Documentation:

  • Rewrite and significantly expand the CPT strategy cookbook in ALCF/notes/cpt.md, including stage-wise data-mixing guidance, learning-rate strategies, and handling distribution shifts for AuroraGPT V1 and legacy agpt-7B.
  • Add a new large_batch_optimizers_settings guide documenting optimizer options, scheduler variants, LR finder usage, initialization schemes, and hyperparameter tuning recommendations for large-batch training.
  • Document usage of cooldown command generation tooling and workflows in tools/cooldown_generator/README.md.
  • Add placeholder readme files describing assets used in CPT and large-batch optimizer documentation image directories.

Chores:

  • Ignore additional generated or local files via .gitignore if applicable.

Copilot AI review requested due to automatic review settings March 11, 2026 14:45

sourcery-ai Bot commented Mar 11, 2026

Reviewer's Guide

This PR significantly extends AuroraGPT continual pre‑training documentation, introduces a new large‑batch optimizer guide, adds cooldown command-generation tooling and training scripts for specific stages, and makes small runtime/launcher cleanups (switching to ezpz APIs, safer logging, and making ezpz usage less intrusive).

Flow diagram for AuroraGPT CPT data-mixing strategy across stages

```mermaid
graph TD
    Start["Start CPT planning for AuroraGPT V1"]

    Start --> S1["Stage1 D0 pretraining complete"]
    S1 --> S1_S2["Stage1 → Stage2\nWeak distribution shift"]

    S1_S2 --> S1S2_Strategy1["Strategy1 No replay\nTrain only on D1"]
    S1_S2 --> S1S2_Strategy2["Strategy2 Replay from D0\nMix D0 + D1"]

    S1S2_Strategy2 --> S1S2_Mix["Set mixing weights\nalpha0=0.05–0.10, alphaD=1-alpha0, alphaB=0"]
    S1S2_Mix --> S1S2_Buffer["Build buffer B with samples from D1"]
    S1S2_Buffer --> S2_Complete["End of Stage2\nHave DCPT_1"]

    S1S2_Strategy1 --> S1S2_Buffer

    S2_Complete --> S2_S3["Stage2 → Stage3\nStrong shift to math/code D2"]

    S2_S3 --> S2S3_Naive["Naive strategy\nTrain mostly on D2"]
    S2_S3 --> S2S3_Mix["Strategy3 Mix D0 + D2 + B"]
    S2_S3 --> S2S3_Decay["Strategy4 LR decay recipe\nif data mixing fails"]

    S2S3_Mix --> S2S3_MixA["MixA D0=0.33, D2=0.33, B=0.34"]
    S2S3_Mix --> S2S3_MixB["MixB D0=0.05, D2=0.48, B=0.47"]
    S2S3_Mix --> S2S3_MixC["MixC D0=0.0, D2=0.10, B=0.90"]

    S2S3_MixA --> S2S3_BufferUpdate["Add D2 to buffer B at end of stage"]
    S2S3_MixB --> S2S3_BufferUpdate
    S2S3_MixC --> S2S3_BufferUpdate

    S2S3_BufferUpdate --> S3_Complete["End of Stage3\nHave DCPT_2"]

    S3_Complete --> S3_S4["Stage3 → Stage4\nShift to reasoning traces D3"]

    S3_S4 --> S3S4_NoPrevDecay["If Strategy4 was not used"]
    S3_S4 --> S3S4_PrevDecay["If Strategy4 was used"]

    S3S4_NoPrevDecay --> S3S4_MixA["MixA D0=0.33, D3=0.33, B=0.34"]
    S3S4_NoPrevDecay --> S3S4_MixB["MixB D0=0.50, D3=0.25, B=0.25"]

    S3S4_MixA --> S3S4_Cooldown["Cool down LR to convergence"]
    S3S4_MixB --> S3S4_Cooldown

    S3S4_PrevDecay --> S3S4_ContinueDecay["Continue decaying with DCPT_2\nuntil LR3/100"]
    S3S4_ContinueDecay --> S3S4_IntroduceD3["Introduce D3 mix at LR3/5"]
    S3S4_IntroduceD3 --> S3S4_Cooldown

    S3S4_Cooldown --> Final["Final CPT model converged"]
```
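
The mixing weights shown in the diagram can be turned into per-source token budgets with a small helper (an illustrative sketch only; this is not the interface of the repository's mix_datasets.py, which operates on file lists rather than token counts):

```python
def plan_mix(weights: dict, total_tokens: int) -> dict:
    """Allocate a total token budget across data sources by normalized weight."""
    total_weight = sum(weights.values())
    return {
        name: round(total_tokens * w / total_weight)
        for name, w in weights.items()
    }

# Stage 2 -> 3 "MixA" from the diagram, applied to a hypothetical 1.5T budget:
budget = plan_mix({"D0": 0.33, "D2": 0.33, "B": 0.34}, total_tokens=1_500_000_000_000)
```

Normalizing by the weight sum keeps the allocation correct even when the weights do not add up to exactly 1.0.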

Flow diagram for generating cooldown commands with make_cooldown_cmds.py

```mermaid
graph TD
    A["Start cooldown planning"] --> B["Provide checkpoint resume step S\n(--checkpoint-iters or --pairs)"]
    B --> C["Provide cooldown length R in steps\n(--cooldown-steps or in --pairs)"]
    C --> D["Optionally specify ID(s)\n(--checkpoint-ids or id:S:R in --pairs)"]

    D --> E["Compute total train iters T = S + R"]
    E --> F["Compute fraction f = S / T\n(for lr_constant_plus_cooldown_frac)"]

    F --> G["Build env block\nMODEL_ARCH, OPT, GRAD_ACC_STEPS, TRAIN_ITERS=T, ..."]
    G --> H["Construct train_alcf_sh command\nwith lr_constant_plus_cooldown and frac=f"]

    H --> I{Emit mode}

    I -->|--emit-sh provided| J["Write cooldown_idX_sS_rR_tT_sh script"]
    I -->|no --emit-sh| K["Print command block to stdout"]

    J --> L["Optionally submit script via PBS"]
    K --> L

    L --> M["Megatron_DeepSpeed resumes at S\nLR cooldown runs for R steps"]
```
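
The arithmetic in this flow reduces to two lines (a minimal sketch; the helper name is hypothetical, but the T = S + R and f = S / T relations match the diagram above):

```python
def cooldown_schedule(resume_step: int, cooldown_steps: int):
    """Return (total_train_iters, constant-phase fraction) for a resume-and-cooldown run.

    Training runs at constant LR up to resume step S, then cools down for
    R steps, so the total is T = S + R and the constant fraction is f = S / T.
    """
    total_iters = resume_step + cooldown_steps
    return total_iters, resume_step / total_iters

# Resuming at S=100_000 with an R=25_000-step cooldown:
# cooldown_schedule(100_000, 25_000) -> (125000, 0.8)
```

Keeping this in one helper avoids recomputing T and f inline, which is essentially the simplification the Sourcery review suggests for make_cooldown_cmds.py.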

File-Level Changes

Change Details Files
Rewrite and expand CPT strategy documentation for AuroraGPT V1 and adjust legacy CPT notes.
  • Replace the brief CPT note with a detailed multi‑stage data‑mixing and LR‑strategy cookbook for AuroraGPT V1, including tables, figures, and concrete recipes
  • Remove the inlined mix_datasets.py script in favor of linking to the upstream GitHub version
  • Tidy legacy agpt‑7B CPT guidance and add a short "things to keep in mind" section with batch‑size/LR scaling tips
ALCF/notes/cpt.md
Add a new guide for large‑batch optimizers, LR schedulers, and LR finder usage.
  • Document supported optimizers and schedulers, including configuration examples and how to register custom implementations
  • Describe hyperparameter tuning strategies (init variance formula, LR‑finder workflow, and MuP/CompleteP notes)
  • Provide example plotting code to analyze LR‑finder output and select candidate learning rates
ALCF/notes/large_batch_optimizers_settings.md
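
As a rough illustration of the LR-finder workflow the guide describes, one common heuristic is to pick a learning rate one order of magnitude below the sweep's loss minimum (a sketch under that assumption; the guide's actual recipe and plotting code may differ):

```python
def pick_lr(lrs: list, losses: list, safety_factor: float = 10.0) -> float:
    """Pick a candidate LR from an LR-finder sweep.

    Take the LR at the minimum recorded loss, then back off by safety_factor,
    since the loss minimum typically sits close to the divergence point.
    """
    best_idx = min(range(len(losses)), key=losses.__getitem__)
    return lrs[best_idx] / safety_factor

# Sweep where loss bottoms out at lr=1e-3 before diverging:
candidate = pick_lr([1e-5, 1e-4, 1e-3, 1e-2], [3.0, 2.5, 2.0, 4.0])
```

In practice the loss curve should be smoothed before taking the argmin, since per-step loss from a finder sweep is noisy.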
Introduce cooldown command-generation tooling and wrappers for Megatron‑DeepSpeed runs.
  • Add make_cooldown_cmds.py to generate resume‑and‑cooldown commands and optionally update latest checkpoint markers
  • Add build_checkpoints_from_tokens.py to map trillion‑token milestones to step counts and rollback points
  • Add gen_cooldown_sweep.sh to generate per‑checkpoint cooldown job scripts, with phase‑specific data lists and PBS headers
  • Document usage patterns and examples for the cooldown tooling in a dedicated README
tools/cooldown_generator/make_cooldown_cmds.py
tools/cooldown_generator/build_checkpoints_from_tokens.py
tools/cooldown_generator/gen_cooldown_sweep.sh
tools/cooldown_generator/README.md
Add stage‑specific AuroraGPT‑2B training launcher scripts using the SophiaG optimizer and new data lists.
  • Introduce train_aGPT_2B_sophiag_stage2.sh to run 2B training on Dolmino mix with constant LR and SophiaG
  • Introduce train_aGPT_2B_sophiag_stage3.sh to extend training to math/code data with updated TRAIN_TOKENS and data list
  • Add corresponding fused data‑list files for Dolmino mix and math/code datasets
train_aGPT_2B_sophiag_stage2.sh
train_aGPT_2B_sophiag_stage3.sh
ALCF/data-lists/aurora/dolmino-mix-1124-fused-file-list.txt
ALCF/data-lists/aurora/nvidia-math1-code2.txt
Make small runtime and launcher cleanups (switch to ezpz helpers, safer logging, and less aggressive setup).
  • Switch megatron.training to use ezpz.* rank/device helpers and decorators, and reformat some long asserts
  • Adjust LR scheduler construction logic to compute constant+cooldown steps in a more readable style
  • Route Megatron timer logging through an ezpz logger instead of bare prints
  • Guard final wandb access on pretrain exit to avoid errors when wandb is absent
  • Simplify ALCF helpers: comment out install_dependencies in setup, change launcher to "ezpz launch", normalize case labels, and small whitespace fixes
  • Make train_alcf.sh stop auto‑installing ezpz and stop unconditionally killing MPI via ezpz_kill_mpi
megatron/training.py
megatron/timers.py
pretrain_gpt_alcf.py
ALCF/helpers.sh
train_alcf.sh
train_aGPT_2B_large_batch.sh
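
The wandb guard mentioned above can look something like this (a sketch of the general pattern, not the exact code in pretrain_gpt_alcf.py):

```python
try:
    import wandb  # optional dependency; may be absent on compute nodes
except ImportError:
    wandb = None


def finish_wandb_run() -> None:
    """Close the active wandb run, if wandb is installed and a run exists."""
    if wandb is not None and getattr(wandb, "run", None) is not None:
        wandb.finish()


finish_wandb_run()  # safe no-op when wandb is absent or no run is active
```

Guarding both the import and the active-run attribute keeps the exit path from raising on clusters where wandb is unavailable or logging was never initialized.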
Add small documentation asset placeholders for images used by CPT and optimizer docs.
  • Add readme placeholder marking CPT images directory purpose
  • Add readme placeholder marking large‑batch optimizer figures directory purpose
ALCF/notes/assets/cpt_images/readme.md
ALCF/notes/assets/lb_optimizers/readme.md


@saforem2 saforem2 merged commit 83fd580 into saforem2:main Mar 11, 2026

sourcery-ai (Bot) left a comment


Hey - I've found 12 issues and left some high-level feedback:

  • In tools/cooldown_generator/make_cooldown_cmds.py, the computation of train_iters from train_tokens is inverted (it currently multiplies by global_batch_size and sequence_length instead of dividing by them), which will massively overshoot the intended number of iterations; consider fixing this before relying on token-based configuration.
  • The helper get_total_iters_from_cooldown_percent in make_cooldown_cmds.py has a suspicious expression cooldown_percent = (train_iters - cooldown_steps / train_iters) that likely needs parentheses ((train_iters - cooldown_steps) / train_iters) or a different formula; it would be good to correct or remove this before someone uses it.
  • In ALCF/helpers.sh and train_alcf.sh you’ve commented out install_dependencies and ezpz_kill_mpi/ezpz auto-install; if these side effects are still required in common workflows, consider adding a clear alternative invocation path or a guarded flag so users don’t silently skip setup/cleanup.
## Individual Comments

### Comment 1
<location path="tools/cooldown_generator/make_cooldown_cmds.py" line_range="162-163" />
<code_context>
+        assert train_iters is None, (
+            f"Only one of {train_tokens, train_iters} should be specified."
+        )
+        assert global_batch_size is not None and sequence_length is not None
+        train_iters = train_tokens * global_batch_size * sequence_length
+
+    assert train_iters is not None
</code_context>
<issue_to_address>
**issue (bug_risk):** Train iterations from train_tokens should divide by tokens-per-step, not multiply.

In the `train_tokens` branch you currently compute:

```python
train_iters = train_tokens * global_batch_size * sequence_length
```
If `train_tokens` is the total desired tokens, the correct step count is:

```python
tokens_per_step = global_batch_size * sequence_length
train_iters = train_tokens // tokens_per_step  # or float division + rounding
```
Multiplying by `global_batch_size * sequence_length` yields an incorrect (and massively inflated) number of steps.
</issue_to_address>

### Comment 2
<location path="tools/cooldown_generator/make_cooldown_cmds.py" line_range="271-276" />
<code_context>
+
+    lines = []
+    # header = "# Auto-generated cooldown commands\nset -euo pipefail\n\n"
+    if args.include_header:
+        if (hfp := Path(args.include_header)).is_file():
+            with hfp.open("r") as f:
+                lines.extend(f.readlines())
+        else:
+            lines.extend("\n".join(args.include_header.split("\n")))
+    # if args.emit_sh:
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Using list.extend on a joined header string appends characters instead of lines.

In the non-file case you currently do:

```python
lines.extend("\n".join(args.include_header.split("\n")))
```

This extends `lines` with individual characters, not header lines, so the output script will be malformed.

Use either:

```python
lines.append(args.include_header)
```
if `include_header` is a full block, or:

```python
lines.extend(args.include_header.splitlines(keepends=True))
```
if you need to preserve line boundaries and newlines.
</issue_to_address>

### Comment 3
<location path="tools/cooldown_generator/gen_cooldown_sweep.sh" line_range="71" />
<code_context>
+#     die "--cool-R (cooldown steps) or --cooldown-percent required"
+# fi
+
+# [[ -n "${COOL_R}" ]] || die "--cool-R (cooldown steps) is required"
+[[ -n "${LOAD_PATH}" ]] || die "--load path is required"
+[[ -n "${PHASE1_LIST}" ]] || die "--phase1-list is required (IDs 1..4)"
</code_context>
<issue_to_address>
**issue (bug_risk):** Cooldown steps (COOL_R) can be left empty but are still passed to make_cooldown_cmds.

The guard requiring `COOL_R` was commented out:

```bash
# [[ -n "${COOL_R}" ]] || die "--cool-R (cooldown steps) is required"
```

but `--cooldown-steps "${COOL_R}"` is still passed to `make_cooldown_cmds.py` in both exact and rollback scripts. If `COOL_R` is unset/empty, the Python script receives an empty value for a required integer argument and will fail or misbehave.

Please either restore the guard or define a safe default for `COOL_R` at the top of the script.
</issue_to_address>

### Comment 4
<location path="ALCF/notes/cpt.md" line_range="33" />
<code_context>
+
+| CPT Stages | Distribution shift | Primary strategy | Fallbacks | Notes |
+|------|--------------------|------------------|-------------|-------|
+| Stage 1 → 2 | Weak | Naive CPT wih $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ |
+| Stage 2 → 3 | Strong | 5-30% replay of $D^{CPT}_{2}$ and monitor loss  | Use buffer: $0.33D_0 + 0.33D_2 + 0.34B$| Add $D_2$ to buffer $B$, you might need to switch to LR centric strategy, see below |
+| Stage 3 → 4 | Strong | Cooldown with mix  $0.33D_0 + 0.33D_3 + 0.34B$ |  Cooldown with mix  $0.05D_0 + 0.47D_3 + 0.48B$  | You will need to continue decay if used in previous stage. Play with decay function,stages, and final LR value |
</code_context>
<issue_to_address>
**issue (typo):** Fix typo in "wih" → "with" in the Stage 1→2 row.

"wih" in this row is a typo; please change it to "with".

```suggestion
| Stage 1 → 2 | Weak | Naive CPT with $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ |
```
</issue_to_address>

### Comment 5
<location path="ALCF/notes/cpt.md" line_range="47" />
<code_context>
+
+As a result, we primarily adopt a **data-centric strategy** throughout these stages, resorting to learning-rate adjustments only when necessary.
+
+The dataset $D_0$ for pretraining is Olmo-mix and has 4 Trillion tokens, then $D_1$ has 2 Trillion tokens from Dolmino and fineweb Edu meaning the data distribution between these two stages is weak. We then have $D_2$ for stage 3 that has 1.5 trillion tokens from math, code, ans science papers. Finally, we have $D_3$ stage 4 made of 0.5 trillion tokens from reasoning traces. 
+
+| Stage | Dataset Symbol | Size | Source / Path | Notes |
</code_context>
<issue_to_address>
**issue (typo):** Correct "ans science papers" to "and science papers".

In the $D_2$ description, "ans science papers" is a typo and should be "and science papers".

```suggestion
The dataset $D_0$ for pretraining is Olmo-mix and has 4 Trillion tokens, then $D_1$ has 2 Trillion tokens from Dolmino and fineweb Edu meaning the data distribution between these two stages is weak. We then have $D_2$ for stage 3 that has 1.5 trillion tokens from math, code, and science papers. Finally, we have $D_3$ stage 4 made of 0.5 trillion tokens from reasoning traces. 
```
</issue_to_address>

### Comment 6
<location path="ALCF/notes/cpt.md" line_range="103" />
<code_context>
+ ```bash
+python3 mix_datasets.py --input 0.9 /flare/Aurora_deployment/AuroraGPT/datasets/papers/papers.txt 0.1 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/dolma_v1_7_file_list_v2.txt > ${debug_dir}/Megatron-DeepSpeed/ALCF/data-lists/aurora/mix_lucid_papers09_dolma01.txt
+```
+2. **Start building the buffer $B$** in prevision of the next stages.
+3. **Run CPT**: Load your checkpoint and run CPT with the --finetube flag.
+Note that you might need to convert your checkpoints following [these instructions](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/main/ALCF/notes/universal_checkpoint_bug.md) to a universal checkpoint.
</code_context>
<issue_to_address>
**suggestion (typo):** Replace "in prevision of" with a more natural phrase like "in preparation for".

The phrase "in prevision of the next stages" is non-idiomatic. Please use a more natural alternative, e.g. "in preparation for the next stages" or simply "for the next stages".

```suggestion
2. **Start building the buffer $B$** in preparation for the next stages.
```
</issue_to_address>

### Comment 7
<location path="ALCF/notes/cpt.md" line_range="104" />
<code_context>
+python3 mix_datasets.py --input 0.9 /flare/Aurora_deployment/AuroraGPT/datasets/papers/papers.txt 0.1 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/dolma_v1_7_file_list_v2.txt > ${debug_dir}/Megatron-DeepSpeed/ALCF/data-lists/aurora/mix_lucid_papers09_dolma01.txt
+```
+2. **Start building the buffer $B$** in prevision of the next stages.
+3. **Run CPT**: Load your checkpoint and run CPT with the --finetube flag.
+Note that you might need to convert your checkpoints following [these instructions](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/main/ALCF/notes/universal_checkpoint_bug.md) to a universal checkpoint.
+
</code_context>
<issue_to_address>
**issue (typo):** Likely typo: "--finetube" should be "--finetune".

If other commands use a `--finetune` flag, update this reference to match for consistency and to avoid confusion.

```suggestion
3. **Run CPT**: Load your checkpoint and run CPT with the --finetune flag.
```
</issue_to_address>

### Comment 8
<location path="ALCF/notes/cpt.md" line_range="79" />
<code_context>
+##### Strategy 1: No replay
+`Important: USE A CHECKPOINT AT LR=LR_max i.e. BEFORE COOLING DOWN`.
+
+Naively continue training with $D_1$, no replay data. 
+- Continue training using only the current dataset $D_1$
+- No replay from $D_0$ or buffer data
+This may be sufficient under weak distribution shift but there is potential risks of forgetting
+
+##### Strategy 2: Replay from pretraining dataset
</code_context>
<issue_to_address>
**suggestion (typo):** Fix subject–verb agreement in "there is potential risks".

Change to either "there are potential risks of forgetting" or "there is a potential risk of forgetting" for correct subject–verb agreement.

```suggestion
This may be sufficient under weak distribution shift but there are potential risks of forgetting
```
</issue_to_address>

### Comment 9
<location path="ALCF/notes/cpt.md" line_range="165" />
<code_context>
+  - adjusting the data-mixing strategy by **increasing the weight of pretraining data**.
+
+At the end of this stage, we have $D^{CPT}_2$.
+#### Stage 3 to stage 4 (shift to reasoning tracex)
+![stage 3 to 4](./assets/cpt_images/strategy_cpt_stage3tostage4-1.png)
+At this point, we only have ~6% of training left and one should start the final decay.
</code_context>
<issue_to_address>
**issue (typo):** Fix typo "tracex" → "traces".

```suggestion
#### Stage 3 to stage 4 (shift to reasoning traces)
```
</issue_to_address>

### Comment 10
<location path="ALCF/notes/cpt.md" line_range="184" />
<code_context>
+
+***If we did use Strategy 4 above:***
+We should keep decaying with $D^{CPT}_2$ until $LR_3/100$ then introduce the new mix at $LR_3/5$
+![stage 3 to 4 previous devay](./assets/cpt_images/strategy3_cpt_stage3tostage4ifprevdecay-1.png)
+
+
</code_context>
<issue_to_address>
**issue (typo):** Fix typo "devay" → "decay" in the image alt-text.

For clarity, update the alt text phrase to "previous decay".

```suggestion
![stage 3 to 4 previous decay](./assets/cpt_images/strategy3_cpt_stage3tostage4ifprevdecay-1.png)
```
</issue_to_address>

### Comment 11
<location path="ALCF/notes/large_batch_optimizers_settings.md" line_range="27" />
<code_context>
+
+### Adding custom optimizers
+To add a custom optimizer, you have to modify the following files:
+- `megatron/optimizer/__init__.py`: [muon example](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/994f2a129d465cc50e6c35af075eb3292874effe/megatron/optimizer/__init__.py#L434), note that you either heve to import the optimizer from a pre-installed package or add it in the `megatron/optimizer/` folder.
+- `megatron/arguments.py`: [optimizer arguments](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/994f2a129d465cc50e6c35af075eb3292874effe/megatron/arguments.py#L1070), to add the optimizer arguments
+- `megatron/arguments.py`: [list of valid optimizers](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/994f2a129d465cc50e6c35af075eb3292874effe/megatron/arguments.py#L1485), to add the new optimizer to the list of valid optimizers
</code_context>
<issue_to_address>
**issue (typo):** Correct "heve" → "have".

```suggestion
- `megatron/optimizer/__init__.py`: [muon example](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/994f2a129d465cc50e6c35af075eb3292874effe/megatron/optimizer/__init__.py#L434), note that you either have to import the optimizer from a pre-installed package or add it in the `megatron/optimizer/` folder.
```
</issue_to_address>

### Comment 12
<location path="tools/cooldown_generator/make_cooldown_cmds.py" line_range="42" />
<code_context>
+    return f"{x:.8f}".rstrip("0").rstrip(".")
+
+
+def get_total_iters_from_cooldown_percent(
+    checkpoint_iter: Optional[int] = None,
+    cooldown_percent: Optional[float] = None,
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying this script by removing unused helpers, fixing and streamlining header handling, extracting checkpoint file updates into a helper, and adding a thin wrapper around `build_command` for the common cooldown case to make the main flow easier to follow.

1. `get_total_iters_from_cooldown_percent` is unused and adds complexity  
If you don’t have a concrete caller yet, it’s just mental overhead. You can safely drop it now and reintroduce a focused version when needed:

```python
# Remove this entire function for now:
# def get_total_iters_from_cooldown_percent(...):
#     ...
```

If you plan to use it soon, consider a much narrower helper that matches how this script actually computes things now, e.g.:

```python
def compute_total_iters_and_frac(S: int, R: int) -> tuple[int, float]:
    T = S + R
    return T, S / T
```

and then use that in `main()` instead of recomputing `T` and `f` inline.


2. Header handling has a subtle bug and can be simplified  
`lines.extend("\n".join(args.include_header.split("\n")))` extends by characters, not lines. You can fix that and also normalize to a single header string:

```python
# Build a header string once
if args.include_header is None:
    header_str = get_header_template(
        queue=args.queue,
        project=args.project,
        walltime=args.walltime,
        filesystems=args.filesystems,
        nodes=args.nodes,
    )
else:
    hfp = Path(args.include_header)
    if hfp.is_file():
        header_str = hfp.read_text()
    else:
        header_str = args.include_header

lines = []
if header_str:
    # Ensure trailing newline
    if not header_str.endswith("\n"):
        header_str += "\n"
    lines.append(header_str)
```

Then when writing scripts:

```python
with open(outfile, "w") as f:
    f.writelines(lines)
    f.write(block + "\n")
```

This fixes the bug and keeps header behavior explicit and simple.


3. Extract checkpoint metadata sync out of the main loop  
The per-record loop mixes command generation and checkpoint file mutation. Extracting the latter into a helper reduces cognitive load without changing behavior:

```python
def ensure_checkpoint_pointers(load_path: str, S: int) -> None:
    ckpt_parent = Path(load_path).parent
    latest_fp = ckpt_parent / "latest"
    latest_ckpt_iter = ckpt_parent / "latest_checkpointed_iteration.txt"

    if latest_fp.is_file():
        logger.info(f"Found 'latest' in {ckpt_parent}!")
        with latest_fp.open("r") as f:
            _latest = f.read().rstrip("\n").lstrip("global_step")
        assert int(_latest) == int(S), f"{_latest=} != {S=}"
    else:
        logger.info(f"No 'latest' in {ckpt_parent}!")
        logger.info(f"Writing global_step{S} to {latest_fp}")
        with latest_fp.open("w") as f:
            f.write(f"global_step{S}")

    if latest_ckpt_iter.is_file():
        logger.info(f"Found 'latest_checkpointed_iteration.txt' in {ckpt_parent}!")
        with latest_ckpt_iter.open("r") as f:
            _latest = f.read().rstrip("\n")
        assert int(_latest) == int(S), f"{_latest=} != {S=}"
    else:
        logger.info(f"No 'latest_checkpointed_iteration.txt' in {ckpt_parent}!")
        logger.info(f"Writing {S} to {latest_ckpt_iter}")
        with latest_ckpt_iter.open("w") as f:
            f.write(f"{S}")
```

and in the loop:

```python
for rec in records:
    cid, S, R = rec["id"], rec["S"], rec["R"]
    T = S + R
    f = S / T
    tag = f"# id={cid} resume_step={S} cooldown_steps={R} total_iters={T} frac={fmt_float(f)}"

    ensure_checkpoint_pointers(args.load, S)

    cmd = build_command(
        load_path=args.load,
        data_file_list=args.data_file_list,
        train_script=args.train_script,
        train_iters=T,
        lr_cooldown_frac=f,
        grad_acc_steps=args.grad_acc_steps,
        opt=args.opt,
        min_lr=args.min_lr,
        override_ckpt_opt_param=override_flag,
        extra_args=args.extra_args.strip(),
        # other options unchanged...
    )
    ...
```

The behavior is identical, but the main loop reads as a higher-level sequence of steps.


4. `build_command` is very general compared to how it’s used  
Right now `main()` only drives the `train_iters` path and a small subset of the parameters. To keep the generality without making the core harder to follow, you can factor the “cooldown sweep common case” into a thin wrapper and keep the existing `build_command` as a lower-level primitive:

```python
def build_cooldown_command(
    load_path: str,
    data_file_list: str,
    train_script: str,
    T: int,
    f: float,
    grad_acc_steps: int,
    opt: str,
    min_lr: float,
    override_ckpt_opt_param: bool,
    extra_args: str,
) -> str:
    return build_command(
        load_path=load_path,
        data_file_list=data_file_list,
        train_script=train_script,
        train_iters=T,
        lr_cooldown_frac=f,
        grad_acc_steps=grad_acc_steps,
        opt=opt,
        min_lr=min_lr,
        override_ckpt_opt_param=override_ckpt_opt_param,
        extra_args=extra_args,
        # leave other parameters at their defaults
    )
```

and in `main()`:

```python
cmd = build_cooldown_command(
    load_path=args.load,
    data_file_list=args.data_file_list,
    train_script=args.train_script,
    T=T,
    f=f,
    grad_acc_steps=args.grad_acc_steps,
    opt=args.opt,
    min_lr=args.min_lr,
    override_ckpt_opt_param=override_flag,
    extra_args=args.extra_args.strip(),
)
```

This keeps all existing knobs intact for future use, but makes the primary flow much easier to understand.
</issue_to_address>


Comment on lines +162 to +163
assert global_batch_size is not None and sequence_length is not None
train_iters = train_tokens * global_batch_size * sequence_length

issue (bug_risk): Train iterations from train_tokens should divide by tokens-per-step, not multiply.

In the `train_tokens` branch you currently compute:

```python
train_iters = train_tokens * global_batch_size * sequence_length
```

If `train_tokens` is the total desired tokens, the correct step count is:

```python
tokens_per_step = global_batch_size * sequence_length
train_iters = train_tokens // tokens_per_step  # or float division + rounding
```

Multiplying by `global_batch_size * sequence_length` yields an incorrect (and massively inflated) number of steps.
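The corrected conversion can be sketched as a small helper (names follow the snippet above; the rounding choice is up to the caller):

```python
def iters_from_tokens(train_tokens: int, global_batch_size: int, sequence_length: int) -> int:
    """Convert a total token budget into a number of training steps."""
    tokens_per_step = global_batch_size * sequence_length
    # Floor division slightly undershoots the budget; use a ceiling instead
    # if the full token count must be consumed.
    return train_tokens // tokens_per_step

# 2T tokens at GBS=1024, seq_len=4096 -> 476_837 steps
print(iters_from_tokens(2_000_000_000_000, 1024, 4096))
```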

Comment on lines +271 to +276
if args.include_header:
if (hfp := Path(args.include_header)).is_file():
with hfp.open("r") as f:
lines.extend(f.readlines())
else:
lines.extend("\n".join(args.include_header.split("\n")))

issue (bug_risk): Using list.extend on a joined header string appends characters instead of lines.

In the non-file case you currently do:

```python
lines.extend("\n".join(args.include_header.split("\n")))
```

This extends `lines` with individual characters, not header lines, so the output script will be malformed.

Use either:

```python
lines.append(args.include_header)
```

if `include_header` is a full block, or:

```python
lines.extend(args.include_header.splitlines(keepends=True))
```

if you need to preserve line boundaries and newlines.
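The difference between the three patterns is easy to demonstrate in isolation (the header string here is illustrative):

```python
header = "#!/bin/bash\n#PBS -q debug"

# Buggy pattern: list.extend iterates the string character by character.
buggy: list = []
buggy.extend("\n".join(header.split("\n")))

# Fix 1: append the whole block as a single element.
appended: list = []
appended.append(header)

# Fix 2: extend with real lines, preserving newlines.
lined: list = []
lined.extend(header.splitlines(keepends=True))

print(len(buggy), len(appended), len(lined))  # 25 1 2
```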

# die "--cool-R (cooldown steps) or --cooldown-percent required"
# fi

# [[ -n "${COOL_R}" ]] || die "--cool-R (cooldown steps) is required"

issue (bug_risk): Cooldown steps (`COOL_R`) can be left empty but are still passed to `make_cooldown_cmds`.

The guard requiring `COOL_R` was commented out:

```bash
# [[ -n "${COOL_R}" ]] || die "--cool-R (cooldown steps) is required"
```

but `--cooldown-steps "${COOL_R}"` is still passed to `make_cooldown_cmds.py` in both exact and rollback scripts. If `COOL_R` is unset/empty, the Python script receives an empty value for a required integer argument and will fail or misbehave.

Please either restore the guard or define a safe default for `COOL_R` at the top of the script.
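A minimal sketch of the restored guard, assuming the script's existing `die` helper and `COOL_R` variable (the fallback default of 1000 steps is purely illustrative):

```shell
# die() mirrors the helper already defined in gen_cooldown_sweep.sh
die() { echo "ERROR: $*" >&2; exit 1; }

COOL_R="${COOL_R:-}"
if [[ -z "${COOL_R}" ]]; then
    # Option A: fail fast, as the original guard did
    # die "--cool-R (cooldown steps) is required"
    # Option B: fall back to an explicit, documented default
    COOL_R=1000
fi
echo "cooldown steps: ${COOL_R}"
```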

Comment thread ALCF/notes/cpt.md

| CPT Stages | Distribution shift | Primary strategy | Fallbacks | Notes |
|------|--------------------|------------------|-------------|-------|
| Stage 1 → 2 | Weak | Naive CPT wih $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ |

issue (typo): Fix typo in "wih" → "with" in the Stage 1→2 row.

"wih" in this row is a typo; please change it to "with".

Suggested change
| Stage 1 → 2 | Weak | Naive CPT wih $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ |
| Stage 1 → 2 | Weak | Naive CPT with $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ |

Comment thread ALCF/notes/cpt.md

As a result, we primarily adopt a **data-centric strategy** throughout these stages, resorting to learning-rate adjustments only when necessary.

The dataset $D_0$ for pretraining is Olmo-mix and has 4 Trillion tokens, then $D_1$ has 2 Trillion tokens from Dolmino and fineweb Edu meaning the data distribution between these two stages is weak. We then have $D_2$ for stage 3 that has 1.5 trillion tokens from math, code, ans science papers. Finally, we have $D_3$ stage 4 made of 0.5 trillion tokens from reasoning traces.

issue (typo): Correct "ans science papers" to "and science papers".

In the $D_2$ description, "ans science papers" is a typo and should be "and science papers".

Suggested change
The dataset $D_0$ for pretraining is Olmo-mix and has 4 Trillion tokens, then $D_1$ has 2 Trillion tokens from Dolmino and fineweb Edu meaning the data distribution between these two stages is weak. We then have $D_2$ for stage 3 that has 1.5 trillion tokens from math, code, ans science papers. Finally, we have $D_3$ stage 4 made of 0.5 trillion tokens from reasoning traces.
The dataset $D_0$ for pretraining is Olmo-mix and has 4 Trillion tokens, then $D_1$ has 2 Trillion tokens from Dolmino and fineweb Edu meaning the data distribution between these two stages is weak. We then have $D_2$ for stage 3 that has 1.5 trillion tokens from math, code, and science papers. Finally, we have $D_3$ stage 4 made of 0.5 trillion tokens from reasoning traces.

Comment thread ALCF/notes/cpt.md
Naively continue training with $D_1$, no replay data.
- Continue training using only the current dataset $D_1$
- No replay from $D_0$ or buffer data
This may be sufficient under weak distribution shift but there is potential risks of forgetting

suggestion (typo): Fix subject–verb agreement in "there is potential risks".

Change to either "there are potential risks of forgetting" or "there is a potential risk of forgetting" for correct subject–verb agreement.

Suggested change
This may be sufficient under weak distribution shift but there is potential risks of forgetting
This may be sufficient under weak distribution shift but there are potential risks of forgetting

Comment thread ALCF/notes/cpt.md
- adjusting the data-mixing strategy by **increasing the weight of pretraining data**.

At the end of this stage, we have $D^{CPT}_2$.
#### Stage 3 to stage 4 (shift to reasoning tracex)

issue (typo): Fix typo "tracex" → "traces".

Suggested change
#### Stage 3 to stage 4 (shift to reasoning tracex)
#### Stage 3 to stage 4 (shift to reasoning traces)

Comment thread ALCF/notes/cpt.md

***If we did use Strategy 4 above:***
We should keep decaying with $D^{CPT}_2$ until $LR_3/100$ then introduce the new mix at $LR_3/5$
![stage 3 to 4 previous devay](./assets/cpt_images/strategy3_cpt_stage3tostage4ifprevdecay-1.png)

issue (typo): Fix typo "devay" → "decay" in the image alt-text.

For clarity, update the alt text phrase to "previous decay".

Suggested change
![stage 3 to 4 previous devay](./assets/cpt_images/strategy3_cpt_stage3tostage4ifprevdecay-1.png)
![stage 3 to 4 previous decay](./assets/cpt_images/strategy3_cpt_stage3tostage4ifprevdecay-1.png)


### Adding custom optimizers
To add a custom optimizer, you have to modify the following files:
- `megatron/optimizer/__init__.py`: [muon example](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/994f2a129d465cc50e6c35af075eb3292874effe/megatron/optimizer/__init__.py#L434), note that you either heve to import the optimizer from a pre-installed package or add it in the `megatron/optimizer/` folder.

issue (typo): Correct "heve" → "have".

Suggested change
- `megatron/optimizer/__init__.py`: [muon example](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/994f2a129d465cc50e6c35af075eb3292874effe/megatron/optimizer/__init__.py#L434), note that you either heve to import the optimizer from a pre-installed package or add it in the `megatron/optimizer/` folder.
- `megatron/optimizer/__init__.py`: [muon example](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/994f2a129d465cc50e6c35af075eb3292874effe/megatron/optimizer/__init__.py#L434), note that you either have to import the optimizer from a pre-installed package or add it in the `megatron/optimizer/` folder.

return f"{x:.8f}".rstrip("0").rstrip(".")


def get_total_iters_from_cooldown_percent(

issue (complexity): Consider simplifying this script by removing unused helpers, fixing and streamlining header handling, extracting checkpoint file updates into a helper, and adding a thin wrapper around build_command for the common cooldown case to make the main flow easier to follow.

1. `get_total_iters_from_cooldown_percent` is unused and adds complexity.
   If you don't have a concrete caller yet, it's just mental overhead. You can safely drop it now and reintroduce a focused version when needed:

```python
# Remove this entire function for now:
# def get_total_iters_from_cooldown_percent(...):
#     ...
```

If you plan to use it soon, consider a much narrower helper that matches how this script actually computes things now, e.g.:

```python
def compute_total_iters_and_frac(S: int, R: int) -> tuple[int, float]:
    T = S + R
    return T, S / T
```

and then use that in `main()` instead of recomputing `T` and `f` inline.

2. Header handling has a subtle bug and can be simplified.
   `lines.extend("\n".join(args.include_header.split("\n")))` extends by characters, not lines. You can fix that and also normalize to a single header string:

```python
# Build a header string once
if args.include_header is None:
    header_str = get_header_template(
        queue=args.queue,
        project=args.project,
        walltime=args.walltime,
        filesystems=args.filesystems,
        nodes=args.nodes,
    )
else:
    hfp = Path(args.include_header)
    if hfp.is_file():
        header_str = hfp.read_text()
    else:
        header_str = args.include_header

lines = []
if header_str:
    # Ensure trailing newline
    if not header_str.endswith("\n"):
        header_str += "\n"
    lines.append(header_str)
```

Then when writing scripts:

```python
with open(outfile, "w") as f:
    f.writelines(lines)
    f.write(block + "\n")
```

This fixes the bug and keeps header behavior explicit and simple.



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ff45fa8ed9


p.add_argument("--min-lr", type=float, default=2e-5)
p.add_argument("--no-override-ckpt-opt", action="store_true")
p.add_argument("--extra-args", default="")
p.add_argument("--emit-sh", action="store_true", default=None)

P1: Accept file argument for `--emit-sh`

--emit-sh is declared with action="store_true", so argparse treats it as a boolean flag and does not accept a value, but both the README examples and gen_cooldown_sweep.sh call it as --emit-sh <path>. In that common usage, the path is parsed as an unexpected positional argument and command generation aborts, which breaks the documented per-file cooldown workflow.

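One way to reconcile the flag with the documented `--emit-sh <path>` usage is `nargs="?"` with a `const` fallback. This is a sketch, not the repo's actual fix, and the default file name is made up:

```python
import argparse

p = argparse.ArgumentParser()
# `--emit-sh out.sh` writes to out.sh; a bare `--emit-sh` uses the const;
# omitting the flag entirely leaves it None.
p.add_argument("--emit-sh", nargs="?", const="cooldown.sh", default=None)

print(p.parse_args(["--emit-sh", "sweep_01.sh"]).emit_sh)  # sweep_01.sh
print(p.parse_args(["--emit-sh"]).emit_sh)                 # cooldown.sh
print(p.parse_args([]).emit_sh)                            # None
```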

override_ckpt_opt_param=override_flag,
extra_args=args.extra_args.strip(),
)
latest_fp = Path(args.load).parent.joinpath("latest")

P1: Write tracker files inside the load directory

The script computes tracker paths with Path(args.load).parent, which moves one level above the --load directory. For the documented --load /.../checkpoints_parent usage, this reads/writes latest in the parent directory instead of the checkpoint directory, so generated runs can resume from the wrong checkpoint and the script may modify unrelated files.

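The off-by-one level is easy to see with `pathlib` directly (the directory name below is illustrative):

```python
from pathlib import Path

load = Path("/flare/AuroraGPT/checkpoints_parent")

# Current code: resolves tracker files one level above --load.
wrong = load.parent / "latest"
print(wrong.as_posix())  # /flare/AuroraGPT/latest

# Treating --load itself as the checkpoint root keeps them inside it.
right = load / "latest"
print(right.as_posix())  # /flare/AuroraGPT/checkpoints_parent/latest
```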

logger.info(f"Found 'latest' in {ckpt_parent}!")
with latest_fp.open("r") as f:
_latest = f.read().rstrip("\n").lstrip("global_step")
assert int(_latest) == int(S), f"{_latest=} != {S=}"

P1: Allow sweeps over multiple checkpoint iterations

This assertion requires the existing latest tracker to equal each requested S, so generating commands for multiple checkpoint iterations (or any non-current checkpoint) fails immediately. The script is documented to emit grids/sweeps across many -S values, but this check makes that workflow unusable unless all records share the same checkpoint.

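A hedged sketch of a non-fatal variant: warn on mismatch instead of asserting, so sweeps over several `-S` values can still be generated. The `latest` file layout and `logger` follow the snippet above; the `strict` flag is an invention for illustration:

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

def check_latest(ckpt_parent: Path, S: int, strict: bool = False) -> bool:
    """Compare the 'latest' tracker with the requested resume step S."""
    latest_fp = ckpt_parent / "latest"
    if not latest_fp.is_file():
        return True
    recorded = int(latest_fp.read_text().strip().removeprefix("global_step"))
    if recorded == S:
        return True
    msg = f"'latest' points at step {recorded}, but {S} was requested"
    if strict:
        raise AssertionError(msg)
    logger.warning(msg)
    return False
```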


Copilot AI left a comment


Pull request overview

This PR updates AuroraGPT’s ALCF documentation and training utilities by expanding CPT guidance, adding a cooldown-command generator workflow, and adjusting launch/setup scripts to better support ongoing large-scale training and checkpoint cooldown experiments.

Changes:

  • Major rewrite/expansion of ALCF/notes/cpt.md into a stage-by-stage CPT strategy cookbook (and removal of embedded helper script content).
  • Added a new tools/cooldown_generator/ workflow (Python + shell) to generate Megatron-DeepSpeed cooldown commands/scripts from checkpoint steps or token milestones.
  • Updated various ALCF training/launcher scripts and notes (new stage scripts, launcher command tweak, dependency installation behavior, and logging tweaks).

Reviewed changes

Copilot reviewed 19 out of 28 changed files in this pull request and generated 20 comments.

Summary per file:

| File | Description |
|------|-------------|
| `train_alcf.sh` | Disables in-script ezpz install block and MPI cleanup call. |
| `train_aGPT_2B_sophiag_stage2.sh` | New PBS launch script for 2B SophiaG stage 2. |
| `train_aGPT_2B_sophiag_stage3.sh` | New PBS launch script for 2B SophiaG stage 3. |
| `train_aGPT_2B_large_batch.sh` | Updates optimizer selection and removes old optimizer-plans comment block. |
| `tools/cooldown_generator/make_cooldown_cmds.py` | New Python generator for cooldown commands/scripts. |
| `tools/cooldown_generator/gen_cooldown_sweep.sh` | New shell wrapper to generate per-ID cooldown scripts from a TSV. |
| `tools/cooldown_generator/build_checkpoints_from_tokens.py` | New utility to map token milestones → step checkpoints (+ rollback offsets). |
| `tools/cooldown_generator/README.md` | New documentation for cooldown generator tools and workflows. |
| `pretrain_gpt_alcf.py` | Makes W&B finish conditional on wandb being non-None. |
| `megatron/training.py` | Normalizes ezpz usage (`ezpz.*` / `ezpz.dist.*`) and minor formatting cleanups. |
| `megatron/timers.py` | Switches timer output from `print(..., flush=True)` to ezpz logger. |
| `ALCF/notes/lb_optimizers_settings.md` | Removes older optimizer/settings note doc (replaced by new doc). |
| `ALCF/notes/large_batch_optimizers_settings.md` | Adds updated large-batch optimizer + LR-finder documentation and assets reference. |
| `ALCF/notes/cpt.md` | Expanded CPT "strategy cookbook" for AuroraGPT V1, includes stage guidance and mixing strategies. |
| `ALCF/notes/assets/lb_optimizers/readme.md` | Adds placeholder/readme for optimizer-doc images. |
| `ALCF/notes/assets/cpt_images/readme.md` | Adds placeholder/readme for CPT-doc images. |
| `ALCF/helpers.sh` | Comments out dependency install step; simplifies launcher command; whitespace/indent cleanup. |
| `ALCF/data-lists/aurora/nvidia-math1-code2.txt` | Adds a new dataset file list for math/code stage data. |
| `ALCF/data-lists/aurora/dolmino-mix-1124-fused-file-list.txt` | Adds a new fused Dolmino mix file list. |
| `.gitignore` | Adds un-ignore rules for `tools/cooldown_generator/`. |


with hfp.open("r") as f:
lines.extend(f.readlines())
else:
lines.extend("\n".join(args.include_header.split("\n")))

Copilot AI Mar 11, 2026


When args.include_header is a literal header string, lines.extend("\n".join(...)) extends the list with individual characters (because extend iterates the string). This bloats memory and is easy to misuse later. Use lines.append(args.include_header) or lines.extend(args.include_header.splitlines(keepends=True)) instead.

Suggested change
lines.extend("\n".join(args.include_header.split("\n")))
lines.append("\n".join(args.include_header.split("\n")))

MODEL_ARCH=AuroraGPT-2B \
OPT=muonclip \
OPT=sophiag \
LR=2.28e-5 \

Copilot AI Mar 11, 2026


OPT is changed to sophiag but LR remains 2.28e-5 (which appears to be the previous muonclip setting). Other sophiag launch scripts in this repo use LR=2.17e-5, so this looks like a copy/paste mismatch. Please confirm the intended LR for sophiag and update accordingly.

Suggested change
LR=2.28e-5 \
LR=2.17e-5 \

@@ -0,0 +1,200 @@
# Megatron-DeepSpeed, optimizers, hyperparameters
`**Important** For large batch training with infinite schedulers, It is crucial to tune the learning rate as these schedulers benefit from larger learning rate values. See below for the `lr_finder` routine implemented in MDS to do so.`

Copilot AI Mar 11, 2026


The opening “Important” note is wrapped in backticks, but it also contains an inline-code span for lr_finder, which breaks Markdown rendering (nested backticks). Consider using bold/blockquote formatting for the note and reserving backticks only for the lr_finder identifier.

Suggested change
`**Important** For large batch training with infinite schedulers, It is crucial to tune the learning rate as these schedulers benefit from larger learning rate values. See below for the `lr_finder` routine implemented in MDS to do so.`
> **Important** For large batch training with infinite schedulers, it is crucial to tune the learning rate as these schedulers benefit from larger learning rate values. See below for the `lr_finder` routine implemented in MDS to do so.

Comment thread ALCF/notes/cpt.md

| CPT Stages | Distribution shift | Primary strategy | Fallbacks | Notes |
|------|--------------------|------------------|-------------|-------|
| Stage 1 → 2 | Weak | Naive CPT wih $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ |

Copilot AI Mar 11, 2026


Typo in table: “Naive CPT wih $D_1$” → “with”.

Suggested change
| Stage 1 → 2 | Weak | Naive CPT wih $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ |
| Stage 1 → 2 | Weak | Naive CPT with $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ |

Comment on lines +296 to +298
latest_fp = Path(args.load).parent.joinpath("latest")
latest_ckpt_iter = Path(args.load).parent.joinpath("latest_checkpointed_iteration.txt")
ckpt_parent = Path(args.load).parent

Copilot AI Mar 11, 2026


--load is documented/used as the checkpoint parent directory, but the code uses Path(args.load).parent as the checkpoint directory for latest files. This will write/read latest one directory above the intended checkpoint root (and could corrupt unrelated directories). Use ckpt_parent = Path(args.load) (and derive latest* paths from that).

Suggested change
latest_fp = Path(args.load).parent.joinpath("latest")
latest_ckpt_iter = Path(args.load).parent.joinpath("latest_checkpointed_iteration.txt")
ckpt_parent = Path(args.load).parent
ckpt_parent = Path(args.load)
latest_fp = ckpt_parent.joinpath("latest")
latest_ckpt_iter = ckpt_parent.joinpath("latest_checkpointed_iteration.txt")

Comment on lines +71 to +75
# [[ -n "${COOL_R}" ]] || die "--cool-R (cooldown steps) is required"
[[ -n "${LOAD_PATH}" ]] || die "--load path is required"
[[ -n "${PHASE1_LIST}" ]] || die "--phase1-list is required (IDs 1..4)"
[[ -n "${PHASE2_LIST}" ]] || die "--phase2-list is required (IDs 5..7)"


Copilot AI Mar 11, 2026


COOL_R is used to populate --cooldown-steps later, but validation is commented out, so running without --cool-R will emit broken make_cooldown_cmds.py invocations. Reinstate a hard check that --cool-R is provided (or add alternate logic to compute cooldown steps from --cooldown-percent).


## Hyperparameter tuning
#### Init variance
Weight initialization is key to training LLMs,and to avoid spikes in losses. Here, we initialize the weights following this [paper](https://arxiv.org/pdf/2312.16903). The default variance value at initialization is 0.02. To add custom variances, one can use `--init-method-std, `--adjust-word-embedding-init`, and `--word-embedding-init-std`. For our runs, we do

Copilot AI Mar 11, 2026


This sentence has broken inline-code formatting: one can use `--init-method-std, `--adjust-word-embedding-init`, .... The backticks are mismatched, so Markdown will render incorrectly. Use separate properly-closed inline-code spans for each flag.

Suggested change
Weight initialization is key to training LLMs,and to avoid spikes in losses. Here, we initialize the weights following this [paper](https://arxiv.org/pdf/2312.16903). The default variance value at initialization is 0.02. To add custom variances, one can use `--init-method-std, `--adjust-word-embedding-init`, and `--word-embedding-init-std`. For our runs, we do
Weight initialization is key to training LLMs,and to avoid spikes in losses. Here, we initialize the weights following this [paper](https://arxiv.org/pdf/2312.16903). The default variance value at initialization is 0.02. To add custom variances, one can use `--init-method-std`, `--adjust-word-embedding-init`, and `--word-embedding-init-std`. For our runs, we do

Comment on lines +13 to +15
# shellcheck disable=SC1090
source <(curl -L https://bit.ly/ezpz-utils)
ezpz_setup_env

Copilot AI Mar 11, 2026


Using source <(curl -L https://bit.ly/ezpz-utils) here downloads and executes a remote shell script at job runtime with no version pinning or integrity verification. If the bit.ly link or its upstream hosting is ever compromised, an attacker can run arbitrary code in your training environment with access to cluster credentials and data. Vendor this helper script into the repo or fetch it via a pinned, checksummed artifact instead of sourcing an unpinned URL at runtime.

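One mitigation the comments suggest can be sketched as a fetch-once, verify-then-source pattern. Everything below is illustrative: the helper file is a stand-in, and in practice the pinned SHA-256 would be hardcoded for a specific released version of the ezpz utilities rather than computed on the fly:

```shell
helper=/tmp/ezpz-utils-demo.sh
# Stand-in for the vendored/downloaded helper script.
printf 'ezpz_setup_env() { echo "env ready"; }\n' > "${helper}"

# In real use this hash is pinned in the repo, not derived at runtime.
pinned_sha=$(sha256sum "${helper}" | awk '{print $1}')

echo "${pinned_sha}  ${helper}" | sha256sum -c - >/dev/null \
    || { echo "checksum mismatch; refusing to source" >&2; exit 1; }

# shellcheck disable=SC1090
source "${helper}"
ezpz_setup_env
```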
Comment on lines +13 to +15
# shellcheck disable=SC1090
source <(curl -L https://bit.ly/ezpz-utils)
ezpz_setup_env

Copilot AI Mar 11, 2026


This script also uses source <(curl -L https://bit.ly/ezpz-utils) to pull and execute remote shell code at runtime without any integrity or version pinning. A compromise of that short URL or its backing content would give an attacker arbitrary code execution inside AuroraGPT training jobs. Check in a local copy of the helper and/or use a pinned, checksummed distribution rather than sourcing from a mutable URL.

Single command to test and run Megatron-DeepSpeed:

```bash
now=$(date +'%Y-%m-%d-%H%M%S') && debug_dir="${now}" && mkdir -p "${debug_dir}"&& cd "${debug_dir}"&& git clone https://github.com/argonne-lcf/Megatron-DeepSpeed && cd Megatron-DeepSpeed && source <(curl -L https://bit.ly/ezpz-utils) && ezpz_setup_env && python3 -m pip install --require-virtualenv "git+https://github.com/saforem2/ezpz" "numpy<2" deepspeed tensorboard && ezpz-test && DATA_FILE_LIST=ALCF/data-lists/aurora/books.txt bash train_alcf.sh

Copilot AI Mar 11, 2026


The "single command" snippet both sources https://bit.ly/ezpz-utils and installs ezpz via python3 -m pip install "git+https://github.com/saforem2/ezpz" without pinning or integrity checks. Anyone who controls the bit.ly target or the GitHub repo (or can intercept those HTTPS connections) can ship malicious code that runs with the user's environment and cluster credentials. Prefer vendoring the helper script and pinning the Python package to a specific version/commit (and optionally verifying a hash) instead of installing from mutable URLs in a recommended one-liner.
