Conversation
Updated image reference for cooldown documentation.
Updated the section title and added details to the example.
## Reviewer's Guide

This PR enhances the AuroraGPT training workflow by adding a comprehensive cooldown pipeline—complete with documentation and helper scripts for generating and managing LR cooldown commands—and updates the job launcher to explicitly invoke the correct Python interpreter for improved compatibility.

### Sequence diagram for cooldown command generation and training launch

```mermaid
sequenceDiagram
    actor User
    participant "build_checkpoints_from_tokens.py"
    participant "gen_cooldown_sweep.sh"
    participant "make_cooldown_cmds.py"
    participant "Generated cooldown script"
    participant "AuroraGPT Training"
    User->>"build_checkpoints_from_tokens.py": Provide token milestones and params
    "build_checkpoints_from_tokens.py"->>User: Output checkpoints.tsv
    User->>"gen_cooldown_sweep.sh": Run with checkpoints.tsv and config
    "gen_cooldown_sweep.sh"->>"make_cooldown_cmds.py": For each checkpoint, call to generate command
    "make_cooldown_cmds.py"->>"gen_cooldown_sweep.sh": Output cooldown command script
    "gen_cooldown_sweep.sh"->>User: Emit per-ID cooldown scripts
    User->>"Generated cooldown script": Launch training job
    "Generated cooldown script"->>"AuroraGPT Training": Start training with LR cooldown
    "AuroraGPT Training"->>User: Training results and logs
```
### Class diagram for cooldown command generator scripts

```mermaid
classDiagram
    class build_checkpoints_from_tokens.py {
        +main()
        -ttokens: int
        -tokens_per_step: int
        -cooldown_percent: float
        -round: int
        -out: str
        +computes steps_mod and steps_rollback
    }
    class gen_cooldown_sweep.sh {
        +parses arguments
        +calls build_checkpoints_from_tokens.py
        +calls make_cooldown_cmds.py for each checkpoint
        +writes per-ID cooldown scripts
    }
    class make_cooldown_cmds.py {
        +main()
        +build_command(...)
        +parse_pairs(...)
        +generates cooldown command blocks
        +supports --emit-sh for script output
    }
    build_checkpoints_from_tokens.py --> gen_cooldown_sweep.sh : generates checkpoints.tsv
    gen_cooldown_sweep.sh --> make_cooldown_cmds.py : calls for command generation
```
### Class diagram for updated launcher setup in helpers.sh

```mermaid
classDiagram
    class helpers.sh {
        +setupLauncher()
        -LAUNCHER: string
        +uses ezpz-launch with explicit python3 path
    }
```
Hey there - I've reviewed your changes - here's some feedback:

- The generated scripts from `make_cooldown_cmds.py` don't include a shebang, so consider prepending `#!/usr/bin/env bash` when using `--emit-sh` to ensure they execute under the intended shell.
- There's duplicated CLI parsing logic between `make_cooldown_cmds.py` and `build_checkpoints_from_tokens.py`—extract common argument handling into a shared module to reduce drift.
- Hardcoding `$(which python3)` in `helpers.sh` can be brittle across environments; consider parameterizing the interpreter path or sourcing it from a common config instead.
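The shebang suggestion above could be addressed in the `--emit-sh` writer along these lines — a sketch only, with the helper name, output filename, and command blocks invented for illustration:

```python
from pathlib import Path

def write_script(path: Path, blocks: list[str]) -> None:
    """Write generated cooldown command blocks behind a shebang and strict-mode header."""
    # Prepending the shebang pins the shell the script runs under;
    # set -euo pipefail makes failures in any generated command fatal.
    header = "#!/usr/bin/env bash\n# Auto-generated cooldown commands\nset -euo pipefail\n\n"
    path.write_text(header + "\n".join(blocks) + "\n")
    path.chmod(0o755)  # allow running the emitted script directly

# Hypothetical usage with placeholder command blocks:
out = Path("cooldown_id_1.sh")
write_script(out, ["echo 'block 1'", "echo 'block 2'"])
print(out.read_text().splitlines()[0])  # #!/usr/bin/env bash
```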
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The generated scripts from make_cooldown_cmds.py don’t include a shebang, so consider prepending `#!/usr/bin/env bash` when using --emit-sh to ensure they execute under the intended shell.
- There’s duplicated CLI parsing logic between make_cooldown_cmds.py and build_checkpoints_from_tokens.py—extract common argument handling into a shared module to reduce drift.
- Hardcoding `$(which python3)` in helpers.sh can be brittle across environments; consider parameterizing the interpreter path or sourcing it from a common config instead.
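One way the interpreter path could be parameterized, as a sketch — the `PYTHON_BIN` variable name, the fallback chain, and the placeholder entrypoint are all illustrative, not part of the PR:

```shell
# Hypothetical: let the environment override the interpreter instead of
# hardcoding $(which python3); fall back to whatever python3 is on PATH.
EXEC="${EXEC:-your_train_script.py}"   # stand-in for the real training entrypoint
PYTHON_BIN="${PYTHON_BIN:-$(command -v python3 || echo python3)}"
export LAUNCHER="ezpz-launch ${PYTHON_BIN} ${EXEC}"
echo "${LAUNCHER}"
```

Exporting in both branches of `setupLauncher` would also address the consistency point raised later in the review.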
## Individual Comments
### Comment 1
<location> `tools/cooldown_generator/build_checkpoints_from_tokens.py:34` </location>
<code_context>
+ runs_mod = {k: int(round(v / r) * r) for k, v in runs.items()}
+
+ # <cooldown_percent>% of the FINAL (ttokens-1) step count, then rounded rollback
+ cooldown_iters = int(c * runs[ttokens - 1])
+ runs_rollback = {k: int(round((v - cooldown_iters) / r) * r) for k, v in runs_mod.items()}
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Rounding of cooldown_iters may introduce off-by-one errors.
Using int() truncates the value, which can lead to a lower cooldown_iters than intended. Switching to round() will help prevent off-by-one errors, particularly with small values.
```suggestion
cooldown_iters = round(c * runs[ttokens - 1])
```
</issue_to_address>
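A quick check of the truncation behavior this suggestion addresses — the numbers are illustrative, not from the PR:

```python
# int() truncates toward zero, silently dropping the fractional part of the
# product; round() goes to the nearest integer instead.
c = 0.10             # cooldown fraction (illustrative)
final_steps = 19999  # step count at the final token milestone (illustrative)

truncated = int(c * final_steps)
nearest = round(c * final_steps)
print(truncated, nearest)  # 1999 2000
```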
### Comment 2
<location> `tools/cooldown_generator/build_checkpoints_from_tokens.py:35` </location>
<code_context>
+
+ # <cooldown_percent>% of the FINAL (ttokens-1) step count, then rounded rollback
+ cooldown_iters = int(c * runs[ttokens - 1])
+ runs_rollback = {k: int(round((v - cooldown_iters) / r) * r) for k, v in runs_mod.items()}
+
+ with open(args.out, "w") as f:
</code_context>
<issue_to_address>
**issue:** Rollback steps may become negative for small step counts.
If cooldown_iters is greater than v for small k, rollback steps can be negative. Add a check to prevent negative values or clamp them to zero.
</issue_to_address>
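One way to clamp, sketched with made-up numbers (the `round_to` helper and sample step counts are illustrative):

```python
def round_to(x: float, base: int) -> int:
    """Round x to the nearest multiple of base."""
    return int(round(x / base) * base)

# Illustrative values: an early checkpoint (500 steps) and a later one (5000).
runs_mod = {1: 500, 2: 5000}
cooldown_iters = 1000
r = 100

# max(..., 0) keeps early checkpoints from rolling back past step zero.
runs_rollback = {k: max(round_to(v - cooldown_iters, r), 0) for k, v in runs_mod.items()}
print(runs_rollback)  # {1: 0, 2: 4000}
```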
### Comment 3
<location> `tools/cooldown_generator/make_cooldown_cmds.py:62` </location>
<code_context>
+"""
+import argparse
+
+def main():
+ p = argparse.ArgumentParser()
+ p.add_argument("--ttokens", type=int, default=8, help="Total token milestones (default: 8 for 0..7T).")
</code_context>
<issue_to_address>
**issue (complexity):** Consider using argparse mutually-exclusive groups, enumerate for IDs, and clearer list-joins to simplify argument parsing and command construction.
```suggestion
# Use argparse mutually‐exclusive groups, enumerate for IDs, and clearer list-joins
import argparse
def main():
p = argparse.ArgumentParser(...)
mode = p.add_mutually_exclusive_group(required=True)
mode.add_argument('--pairs', nargs='+', metavar='[CID:]S:R')
mode.add_argument('--checkpoint-iters', '-S', nargs='+', type=int)
p.add_argument('--cooldown-steps', '-R', nargs='+', type=int)
p.add_argument('--checkpoint-ids', nargs='+', type=int)
# ... other args ...
args = p.parse_args()
if args.pairs:
records = parse_pairs(args.pairs)
else:
if not args.cooldown_steps:
p.error('--cooldown-steps/-R is required with --checkpoint-iters/-S')
ids = args.checkpoint_ids or list(range(1, len(args.checkpoint_iters)+1))
if len(ids) != len(args.checkpoint_iters):
p.error('--checkpoint-ids must match length of --checkpoint-iters')
records = [
{"id": cid, "S": S, "R": R}
for cid, S in zip(ids, args.checkpoint_iters)
for R in args.cooldown_steps
]
# ...
def parse_pairs(pairs_args):
records = []
for idx, item in enumerate(pairs_args, 1):
parts = list(map(int, item.split(':')))
if len(parts) == 2:
cid, S, R = idx, *parts
elif len(parts) == 3:
cid, S, R = parts
else:
raise SystemExit(f"Bad --pairs entry: {item}")
if S <= 0 or R <= 0:
raise SystemExit(f"Non-positive S/R in --pairs entry: {item}")
records.append({"id": cid, "S": S, "R": R})
return records
def build_command(
load_path, data_file_list, train_script, train_iters,
lr_cooldown_frac, grad_acc_steps, opt, min_lr,
override_ckpt_opt_param, extra_args
):
env = {
"LR_DECAY_STYLE": "constant",
"OPT": opt,
"OVERRIDE_CKPT_OPT_PARAM": int(override_ckpt_opt_param),
"TRAIN_ITERS": train_iters,
"GRAD_ACC_STEPS": grad_acc_steps,
"LOAD": load_path,
"DATA_FILE_LIST": data_file_list,
}
env_block = " \\\n".join(f"{k}={v}" for k,v in env.items() if v)
cmd_parts = [
f"bash {train_script}",
"--override-opt_param-scheduler",
f"--min-lr={min_lr}",
"--lr_constant_plus_cooldown",
f"--lr_constant_plus_cooldown_frac={fmt_float(lr_cooldown_frac)}",
]
if extra_args:
cmd_parts.append(extra_args.strip())
return env_block + " \\\n " + " \\\n ".join(cmd_parts)
```
– Use a mutually‐exclusive group to enforce “pairs” _or_ “-S/-R” modes.
– Switch from manual `next_id` to `enumerate()` in `parse_pairs`.
– Build the `env_block` and command parts with dict/list comprehensions and simple joins (no `dedent` or filtering).
</issue_to_address>
### Comment 4
<location> `tools/cooldown_generator/build_checkpoints_from_tokens.py:30` </location>
<code_context>
+ c = args.cooldown_percent
+
+ # Steps at each i*T (i in 0..ttokens-1)
+ runs = {i: (i * 10**12) / tps for i in range(ttokens)}
+ runs_mod = {k: int(round(v / r) * r) for k, v in runs.items()}
+
</code_context>
<issue_to_address>
**issue (complexity):** Consider replacing the three dictionary comprehensions with a single loop and a rounding helper function to simplify the code.
You can collapse all three dict‐comprehensions into one simple loop and pull out the “round to nearest N” logic into a tiny helper—no functionality is lost, and you avoid keeping three intermediate mappings around:
```python
#!/usr/bin/env python3
import argparse
def _round_to(x: float, base: int) -> int:
"""Round x to nearest multiple of base."""
return int(round(x / base) * base)
def main():
p = argparse.ArgumentParser()
# … your existing argparse setup …
args = p.parse_args()
ttokens = args.ttokens
tps = args.tokens_per_step
r = args.round
c = args.cooldown_percent
out_path = args.out
# compute the cooldown offset from the *raw* final run
final_raw = ( (ttokens - 1) * 10**12 ) / tps
cooldown_iters = int(c * final_raw)
with open(out_path, "w") as f:
f.write("id\tsteps_mod\tsteps_rollback\n")
for k in range(1, ttokens):
raw = (k * 10**12) / tps
steps = _round_to(raw, r)
rollback = _round_to(steps - cooldown_iters, r)
f.write(f"{k}\t{steps}\t{rollback}\n")
print(f"Wrote TSV to {out_path}")
if __name__ == "__main__":
main()
```
- No more `runs`, `runs_mod`, `runs_rollback` dicts.
- All rounding is in a single helper.
- Keeps identical output and logic.
</issue_to_address>
### Comment 5
<location> `tools/cooldown_generator/make_cooldown_cmds.py:10` </location>
<code_context>
def build_command(
load_path: str,
data_file_list: str,
train_script: str,
train_iters: int,
lr_cooldown_frac: float,
grad_acc_steps: int,
opt: str,
min_lr: float,
override_ckpt_opt_param: bool,
extra_args: str,
) -> str:
env_lines = [
"LR_DECAY_STYLE=constant",
f"OPT={opt}",
"OVERRIDE_CKPT_OPT_PARAM=1" if override_ckpt_opt_param else "",
f"TRAIN_ITERS={train_iters}",
f"GRAD_ACC_STEPS={grad_acc_steps}",
f"LOAD={load_path}",
f"DATA_FILE_LIST={data_file_list}",
]
env_block = " \\\n".join([l for l in env_lines if l])
extra_line = ""
if extra_args:
extra_line = f" \\\n {extra_args}"
cmd = dedent(f"""\
{env_block} \\
bash {train_script} \\
--override-opt_param-scheduler \\
--min-lr={min_lr} \\
--lr_constant_plus_cooldown \\
--lr_constant_plus_cooldown_frac={fmt_float(lr_cooldown_frac)}{extra_line}
""").strip()
return cmd
</code_context>
<issue_to_address>
**issue (code-quality):** We've found these issues:
- Inline variable that is immediately returned ([`inline-immediately-returned-variable`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/inline-immediately-returned-variable/))
- Move setting of default value for variable into `else` branch ([`introduce-default-else`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/introduce-default-else/))
- Replace if statement with if expression ([`assign-if-exp`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/assign-if-exp/))
</issue_to_address>
### Comment 6
<location> `tools/cooldown_generator/make_cooldown_cmds.py:99` </location>
<code_context>
def main():
p = argparse.ArgumentParser(
description="Emit Megatron-DeepSpeed cooldown commands so LR cooldown starts at resume.\n"
"Provide checkpoint iteration(s) S and cooldown step(s) R.\n"
"For each pair, sets TRAIN_ITERS T=S+R and lr_constant_plus_cooldown_frac f=S/T."
)
p.add_argument("--load", required=True)
p.add_argument("--data-file-list", required=True)
p.add_argument("--train-script", default="train_alcf.sh")
p.add_argument("--grad-acc-steps", type=int, default=2)
p.add_argument("--opt", default="ipex.fusedlamb")
p.add_argument("--min-lr", type=float, default=2e-5)
p.add_argument("--no-override-ckpt-opt", action="store_true")
p.add_argument("--extra-args", default="")
p.add_argument("--emit-sh", type=Path, default=None)
p.add_argument("--checkpoint-iters", "-S", type=int, nargs="+")
p.add_argument("--cooldown-steps", "-R", type=int, nargs="+")
p.add_argument("--checkpoint-ids", type=int, nargs="+")
p.add_argument("--pairs", type=str, nargs="*")
args = p.parse_args()
override_flag = not args.no_override_ckpt_opt
if args.pairs:
records = parse_pairs(args.pairs)
else:
if not args.checkpoint_iters or not args.cooldown_steps:
raise SystemExit("Provide either --pairs OR both --checkpoint-iters and --cooldown-steps.")
ids = args.checkpoint_ids or list(range(1, len(args.checkpoint_iters) + 1))
if len(ids) != len(args.checkpoint_iters):
raise SystemExit("--checkpoint-ids must match length of --checkpoint-iters.")
records = [{"id": cid, "S": int(S), "R": int(R)}
for cid, S in zip(ids, args.checkpoint_iters)
for R in args.cooldown_steps]
lines = []
header = "# Auto-generated cooldown commands\nset -euo pipefail\n\n"
if args.emit_sh:
lines.append(header)
for rec in records:
cid, S, R = rec["id"], rec["S"], rec["R"]
T = S + R
f = S / T
tag = f"# id={cid} resume_step={S} cooldown_steps={R} total_iters={T} frac={fmt_float(f)}"
cmd = build_command(
load_path=args.load,
data_file_list=args.data_file_list,
train_script=args.train_script,
train_iters=T,
lr_cooldown_frac=f,
grad_acc_steps=args.grad_acc_steps,
opt=args.opt,
min_lr=args.min_lr,
override_ckpt_opt_param=override_flag,
extra_args=args.extra_args.strip(),
)
block = f"{tag}\n{cmd}\n"
print(block)
if args.emit_sh:
lines.append(block + "\n")
if args.emit_sh:
args.emit_sh.write_text("\n".join(lines))
print(f"# Wrote script to: {args.emit_sh}")
</code_context>
<issue_to_address>
**suggestion (code-quality):** Move assignments closer to their usage ([`move-assign`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/move-assign/))
```suggestion
header = "# Auto-generated cooldown commands\nset -euo pipefail\n\n"
```
</issue_to_address>
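For reference, the scheduler arithmetic used in the loop above (`T = S + R`, `f = S / T`) works out as follows for one illustrative pair:

```python
# Resume at checkpoint step S and cool down over R further steps:
S, R = 9000, 1000  # illustrative checkpoint iteration and cooldown steps
T = S + R          # TRAIN_ITERS handed to the training script
f = S / T          # lr_constant_plus_cooldown_frac: LR stays constant for the
                   # first f of training, then cools down over the remainder
print(T, f)  # 10000 0.9
```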
## Pull Request Overview
This pull request enhances the AuroraGPT training workflow by adding automated tools for checkpoint cooldown management and improving the launcher script compatibility. The changes introduce a comprehensive toolset for generating cooldown commands at various training milestones and updates the job launcher to explicitly specify the Python interpreter.
Key changes:
- Added three cooldown generator scripts that automate the creation of training commands with learning rate cooldown schedules
- Updated the launcher setup to explicitly use the full path to python3 for better reliability
- Provided comprehensive documentation including examples and detailed usage instructions for the new cooldown tools
Reviewed Changes
Copilot reviewed 6 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file

| File | Description |
|---|---|
| `tools/cooldown_generator/make_cooldown_cmds.py` | Core script that generates Megatron-DeepSpeed cooldown commands with configurable checkpoint iterations and cooldown steps |
| `tools/cooldown_generator/gen_cooldown_sweep.sh` | Wrapper script that automates generation of per-checkpoint cooldown scripts with phase-based data list selection |
| `tools/cooldown_generator/build_checkpoints_from_tokens.py` | Utility to compute training iterations from token milestones and calculate rollback checkpoints |
| `tools/cooldown_generator/README.md` | Comprehensive documentation covering usage, examples, and workflow for all cooldown generation tools |
| `ALCF/notes/cooldown.md` | User guide documenting the cooldown process with real examples from a 256-node training run |
| `ALCF/notes/assets/cooldown.png` | Visualization asset for cooldown documentation |
| `ALCF/helpers.sh` | Updated launcher command to explicitly include python3 path for ezpz-launch |
| `.gitignore` | Added exceptions to ensure cooldown generator tools are tracked in version control |
```python
# <cooldown_percent>% of the FINAL (ttokens-1) step count, then rounded rollback
cooldown_iters = int(c * runs[ttokens - 1])
runs_rollback = {k: int(round((v - cooldown_iters) / r) * r) for k, v in runs_mod.items()}
```

Potential for negative rollback steps. Line 35 calculates `runs_rollback` by subtracting `cooldown_iters` from `runs_mod[k]`. For early checkpoints (small `k` values), this subtraction could result in negative values, which may not be meaningful for checkpoint steps. Consider adding validation or clamping to ensure `runs_rollback[k] >= 0`.

```suggestion
runs_rollback = {k: max(int(round((v - cooldown_iters) / r) * r), 0) for k, v in runs_mod.items()}
```
```bash
	export LAUNCHER="deepspeed --hostfile $hfds --launcher MPICH ${EXEC}"
else
    LAUNCHER="ezpz-launch $(which python3) ${EXEC}"
```

Inconsistent indentation in the else block. Line 364 uses spaces for indentation while surrounding lines (363, 365-384) use tabs. This should use tabs to maintain consistency with the rest of the function.

```suggestion
	LAUNCHER="ezpz-launch $(which python3) ${EXEC}"
```
```bash
	export LAUNCHER="deepspeed --hostfile $hfds --launcher MPICH ${EXEC}"
else
	LAUNCHER="ezpz-launch $(which python3) ${EXEC}"
```

Missing `export` keyword for the `LAUNCHER` variable assignment. Line 362 in the if branch uses `export LAUNCHER=...`, but line 364 in the else branch does not. This inconsistency could lead to the variable not being available in child processes. Consider adding `export` for consistency.

```suggestion
	export LAUNCHER="ezpz-launch $(which python3) ${EXEC}"
```
```bash
# IDs 5..7 -> aurora/dolmino-mix-1124-fused-file-list.txt
#
# Example:
#   ./run_cooldown_per_id_split.sh \
```

Incorrect script name in example comment. The comment on line 13 references `./run_cooldown_per_id_split.sh` but the actual script is named `gen_cooldown_sweep.sh`. This could confuse users trying to follow the example.

```suggestion
#   ./gen_cooldown_sweep.sh \
```
```markdown
You must supply `(S, R)` pairs. Use `--pairs` *or* `-S ... -R ...`.

* **Nothing about the data list changes across IDs.**
  This script is intentionally **generic**. If your data list changes by phase/ID, handle that in a wrapper (e.g., `run_cooldown_per_id_split.sh`) and pass the correct `--data-file-list` for each call.
```

Incorrect script name in documentation. Line 180 references `run_cooldown_per_id_split.sh` as an example wrapper, but the actual script in this PR is named `gen_cooldown_sweep.sh`. This should be updated to match the actual filename.

```suggestion
  This script is intentionally **generic**. If your data list changes by phase/ID, handle that in a wrapper (e.g., `gen_cooldown_sweep.sh`) and pass the correct `--data-file-list` for each call.
```
```markdown
- Cooled down over last 10\%:
  - [volcanic-blaze-4312](https://wandb.ai/aurora_gpt/AuroraGPT/runs/7bjj8vgu/overview?nw=nwuserforemans)
    ![](./assets/cooldownHD.png)
```

Broken image reference. The path `./assets/cooldownHD.png` references an image file that doesn't appear to exist in the PR. The actual image file added is `cooldown.png` (not `cooldownHD.png`). This should be `./assets/cooldown.png`.

```suggestion
    ![](./assets/cooldown.png)
```
## Copilot Summary

This pull request improves the AuroraGPT training workflow documentation and updates the job launcher setup for better compatibility. The most significant changes are the addition of a detailed guide for cooling down checkpoints and converting them to a universal format, as well as an update to the launcher script to ensure the correct Python interpreter is used.

Documentation and workflow improvements:

- Added `ALCF/notes/cooldown.md` for cooling down AuroraGPT-2B checkpoints, including step-by-step instructions, example commands, and logs from a large-scale (256 nodes) training run. This documentation covers both the cooldown process and conversion to a universal checkpoint format.

Launcher setup enhancement:

- Updated `ALCF/helpers.sh` to explicitly use the path to the `python3` interpreter when invoking `ezpz-launch`, improving reliability and compatibility with Python-based workflows.

## Summary by Sourcery
Add tools and documentation for automating and documenting the cooldown process and checkpoint conversion for AuroraGPT training runs.
New Features:
Enhancements:
Documentation: