
Auto cooldown script #17

Merged
83 commits merged into saforem2:main from argonne-lcf:auto_cooldown_script
Mar 11, 2026

Conversation

@saforem2 (Owner) commented Nov 11, 2025

Copilot Summary

This pull request improves the AuroraGPT training workflow documentation and updates the job launcher setup for better compatibility. The most significant changes are the addition of a detailed guide for cooling down checkpoints and converting them to a universal format, as well as an update to the launcher script to ensure the correct Python interpreter is used.

Documentation and workflow improvements:

  • Added a comprehensive guide in ALCF/notes/cooldown.md for cooling down AuroraGPT-2B checkpoints, including step-by-step instructions, example commands, and logs from a large-scale (256 nodes) training run. This documentation covers both the cooldown process and conversion to a universal checkpoint format.

Launcher setup enhancement:

  • Updated the launcher command in ALCF/helpers.sh to explicitly use the path to the python3 interpreter when invoking ezpz-launch, improving reliability and compatibility with Python-based workflows.

Summary by Sourcery

Add tools and documentation for automating and documenting the cooldown process and checkpoint conversion for AuroraGPT training runs.

New Features:

  • Introduce scripts to generate cooldown commands, enumerate checkpoints, and automate cooldown job creation for Megatron-DeepSpeed training.

Enhancements:

  • Update launcher script to explicitly use the correct Python interpreter for improved compatibility.

Documentation:

  • Add detailed guide for cooling down AuroraGPT-2B checkpoints and converting them to a universal format.

@saforem2 saforem2 requested a review from Copilot November 11, 2025 00:28

sourcery-ai Bot commented Nov 11, 2025

Reviewer's Guide

This PR enhances the AuroraGPT training workflow by adding a comprehensive cooldown pipeline—complete with documentation and helper scripts for generating and managing LR cooldown commands—and updates the job launcher to explicitly invoke the correct Python interpreter for improved compatibility.

Sequence diagram for cooldown command generation and training launch

```mermaid
sequenceDiagram
  actor User
  participant "build_checkpoints_from_tokens.py"
  participant "gen_cooldown_sweep.sh"
  participant "make_cooldown_cmds.py"
  participant "Generated cooldown script"
  participant "AuroraGPT Training"
  User->>"build_checkpoints_from_tokens.py": Provide token milestones and params
  "build_checkpoints_from_tokens.py"->>User: Output checkpoints.tsv
  User->>"gen_cooldown_sweep.sh": Run with checkpoints.tsv and config
  "gen_cooldown_sweep.sh"->>"make_cooldown_cmds.py": For each checkpoint, call to generate command
  "make_cooldown_cmds.py"->>"gen_cooldown_sweep.sh": Output cooldown command script
  "gen_cooldown_sweep.sh"->>User: Emit per-ID cooldown scripts
  User->>"Generated cooldown script": Launch training job
  "Generated cooldown script"->>"AuroraGPT Training": Start training with LR cooldown
  "AuroraGPT Training"->>User: Training results and logs
```

Class diagram for cooldown command generator scripts

```mermaid
classDiagram
  class build_checkpoints_from_tokens.py {
    +main()
    -ttokens: int
    -tokens_per_step: int
    -cooldown_percent: float
    -round: int
    -out: str
    +computes steps_mod and steps_rollback
  }
  class gen_cooldown_sweep.sh {
    +parses arguments
    +calls build_checkpoints_from_tokens.py
    +calls make_cooldown_cmds.py for each checkpoint
    +writes per-ID cooldown scripts
  }
  class make_cooldown_cmds.py {
    +main()
    +build_command(...)
    +parse_pairs(...)
    +generates cooldown command blocks
    +supports --emit-sh for script output
  }
  build_checkpoints_from_tokens.py --> gen_cooldown_sweep.sh : generates checkpoints.tsv
  gen_cooldown_sweep.sh --> make_cooldown_cmds.py : calls for command generation
```

Class diagram for updated launcher setup in helpers.sh

```mermaid
classDiagram
  class helpers.sh {
    +setupLauncher()
    -LAUNCHER: string
    +uses ezpz-launch with explicit python3 path
  }
```

File-Level Changes

Add detailed cooldown workflow documentation and helper scripts

  • Introduce `ALCF/notes/cooldown.md` with step-by-step cooldown and universal conversion guide
  • Add `tools/cooldown_generator/README.md` explaining usage of the generator scripts
  • Implement `make_cooldown_cmds.py` to emit Megatron-DeepSpeed cooldown command blocks
  • Create `gen_cooldown_sweep.sh` to automate per-checkpoint script generation (exact and rollback)
  • Add `build_checkpoints_from_tokens.py` to compute token-to-iteration milestones and rollback offsets

  Files: `ALCF/notes/cooldown.md`, `tools/cooldown_generator/README.md`, `tools/cooldown_generator/make_cooldown_cmds.py`, `tools/cooldown_generator/gen_cooldown_sweep.sh`, `tools/cooldown_generator/build_checkpoints_from_tokens.py`

Ensure launcher uses explicit python3 interpreter

  • Modify `setupLauncher` in `helpers.sh` to prepend `$(which python3)` to `ezpz-launch`
  • Improve reliability of the launcher command under Python-based environments

  Files: `ALCF/helpers.sh`
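The token-to-iteration arithmetic that `build_checkpoints_from_tokens.py` performs can be sketched as follows. This is a minimal illustration of the idea only; the milestone size, tokens-per-step value, and rounding base below are made-up numbers, not the script's actual defaults.

```python
# Sketch of the token-milestone math: how many optimizer steps are needed to
# consume a given token budget, rounded to a convenient checkpoint boundary.
# All concrete values here are illustrative assumptions.
def steps_for_tokens(tokens: int, tokens_per_step: int, base: int) -> int:
    """Iterations needed to consume `tokens`, rounded to a multiple of `base`."""
    raw = tokens / tokens_per_step
    return int(round(raw / base) * base)

# e.g. a 1T-token milestone at a hypothetical 2M tokens/step, rounded to the
# nearest 1000 steps
milestone = steps_for_tokens(10**12, 2_000_000, 1000)
print(milestone)  # 500000
```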



@sourcery-ai sourcery-ai Bot left a comment


Hey there - I've reviewed your changes - here's some feedback:

  • The generated scripts from make_cooldown_cmds.py don’t include a shebang, so consider prepending #!/usr/bin/env bash when using --emit-sh to ensure they execute under the intended shell.
  • There’s duplicated CLI parsing logic between make_cooldown_cmds.py and build_checkpoints_from_tokens.py—extract common argument handling into a shared module to reduce drift.
  • Hardcoding $(which python3) in helpers.sh can be brittle across environments; consider parameterizing the interpreter path or sourcing it from a common config instead.

## Individual Comments

### Comment 1
<location> `tools/cooldown_generator/build_checkpoints_from_tokens.py:34` </location>
<code_context>
+    runs_mod = {k: int(round(v / r) * r) for k, v in runs.items()}
+
+    # <cooldown_percent>% of the FINAL (ttokens-1) step count, then rounded rollback
+    cooldown_iters = int(c * runs[ttokens - 1])
+    runs_rollback = {k: int(round((v - cooldown_iters) / r) * r) for k, v in runs_mod.items()}
+
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Rounding of cooldown_iters may introduce off-by-one errors.

Using int() truncates the value, which can lead to a lower cooldown_iters than intended. Switching to round() will help prevent off-by-one errors, particularly with small values.

```suggestion
    cooldown_iters = round(c * runs[ttokens - 1])
```
</issue_to_address>
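To see why `int()` versus `round()` matters here, a quick standalone check (the fraction and step count are illustrative numbers, not the script's defaults):

```python
# int() truncates toward zero, so a fractional product silently loses a step;
# round() goes to the nearest integer. Values below are illustrative.
c = 0.1            # cooldown fraction
final_steps = 1999

truncated = int(c * final_steps)    # 0.1 * 1999 is roughly 199.9
rounded = round(c * final_steps)

print(truncated, rounded)  # 199 200
```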

### Comment 2
<location> `tools/cooldown_generator/build_checkpoints_from_tokens.py:35` </location>
<code_context>
+
+    # <cooldown_percent>% of the FINAL (ttokens-1) step count, then rounded rollback
+    cooldown_iters = int(c * runs[ttokens - 1])
+    runs_rollback = {k: int(round((v - cooldown_iters) / r) * r) for k, v in runs_mod.items()}
+
+    with open(args.out, "w") as f:
</code_context>

<issue_to_address>
**issue:** Rollback steps may become negative for small step counts.

If cooldown_iters is greater than v for small k, rollback steps can be negative. Add a check to prevent negative values or clamp them to zero.
</issue_to_address>
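A minimal sketch of the clamping fix suggested above, with made-up step counts chosen so that one checkpoint falls below the cooldown window:

```python
# Illustrative values: the early checkpoint (id=1) has fewer steps than
# cooldown_iters, so its unclamped rollback target goes negative.
cooldown_iters = 5000
r = 1000                       # rounding base
runs_mod = {1: 2000, 2: 8000}  # hypothetical rounded step counts per id

# Unclamped: id=1 gets a negative (meaningless) rollback step count
unclamped = {k: int(round((v - cooldown_iters) / r) * r) for k, v in runs_mod.items()}

# Clamped: floor at zero so rollback targets stay valid checkpoint steps
clamped = {k: max(int(round((v - cooldown_iters) / r) * r), 0) for k, v in runs_mod.items()}

print(unclamped)  # {1: -3000, 2: 3000}
print(clamped)    # {1: 0, 2: 3000}
```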

### Comment 3
<location> `tools/cooldown_generator/make_cooldown_cmds.py:62` </location>
<code_context>
+"""
+import argparse
+
+def main():
+    p = argparse.ArgumentParser()
+    p.add_argument("--ttokens", type=int, default=8, help="Total token milestones (default: 8 for 0..7T).")
</code_context>

<issue_to_address>
**issue (complexity):** Consider using argparse mutually-exclusive groups, enumerate for IDs, and clearer list-joins to simplify argument parsing and command construction.

```suggestion
# Use argparse mutually‐exclusive groups, enumerate for IDs, and clearer list-joins

import argparse

def main():
    p = argparse.ArgumentParser(...)
    mode = p.add_mutually_exclusive_group(required=True)
    mode.add_argument('--pairs', nargs='+', metavar='[CID:]S:R')
    mode.add_argument('--checkpoint-iters', '-S', nargs='+', type=int)
    p.add_argument('--cooldown-steps', '-R', nargs='+', type=int)
    p.add_argument('--checkpoint-ids', nargs='+', type=int)
    # ... other args ...
    args = p.parse_args()

    if args.pairs:
        records = parse_pairs(args.pairs)
    else:
        if not args.cooldown_steps:
            p.error('--cooldown-steps/-R is required with --checkpoint-iters/-S')
        ids = args.checkpoint_ids or list(range(1, len(args.checkpoint_iters)+1))
        if len(ids) != len(args.checkpoint_iters):
            p.error('--checkpoint-ids must match length of --checkpoint-iters')
        records = [
            {"id": cid, "S": S, "R": R}
            for cid, S in zip(ids, args.checkpoint_iters)
            for R in args.cooldown_steps
        ]

    # ...

def parse_pairs(pairs_args):
    records = []
    for idx, item in enumerate(pairs_args, 1):
        parts = list(map(int, item.split(':')))
        if len(parts) == 2:
            cid, S, R = idx, *parts
        elif len(parts) == 3:
            cid, S, R = parts
        else:
            raise SystemExit(f"Bad --pairs entry: {item}")
        if S <= 0 or R <= 0:
            raise SystemExit(f"Non-positive S/R in --pairs entry: {item}")
        records.append({"id": cid, "S": S, "R": R})
    return records

def build_command(
    load_path, data_file_list, train_script, train_iters,
    lr_cooldown_frac, grad_acc_steps, opt, min_lr,
    override_ckpt_opt_param, extra_args
):
    env = {
        "LR_DECAY_STYLE": "constant",
        "OPT": opt,
        "OVERRIDE_CKPT_OPT_PARAM": int(override_ckpt_opt_param),
        "TRAIN_ITERS": train_iters,
        "GRAD_ACC_STEPS": grad_acc_steps,
        "LOAD": load_path,
        "DATA_FILE_LIST": data_file_list,
    }
    env_block = " \\\n".join(f"{k}={v}" for k,v in env.items() if v)
    cmd_parts = [
        f"bash {train_script}",
        "--override-opt_param-scheduler",
        f"--min-lr={min_lr}",
        "--lr_constant_plus_cooldown",
        f"--lr_constant_plus_cooldown_frac={fmt_float(lr_cooldown_frac)}",
    ]
    if extra_args:
        cmd_parts.append(extra_args.strip())
    return env_block + " \\\n  " + " \\\n  ".join(cmd_parts)
```

– Use a mutually‐exclusive group to enforce “pairs” _or_ “-S/-R” modes.  
– Switch from manual `next_id` to `enumerate()` in `parse_pairs`.  
– Build the `env_block` and command parts with dict/list comprehensions and simple joins (no `dedent` or filtering).
</issue_to_address>

### Comment 4
<location> `tools/cooldown_generator/build_checkpoints_from_tokens.py:30` </location>
<code_context>
+    c = args.cooldown_percent
+
+    # Steps at each i*T (i in 0..ttokens-1)
+    runs = {i: (i * 10**12) / tps for i in range(ttokens)}
+    runs_mod = {k: int(round(v / r) * r) for k, v in runs.items()}
+
</code_context>

<issue_to_address>
**issue (complexity):** Consider replacing the three dictionary comprehensions with a single loop and a rounding helper function to simplify the code.

You can collapse all three dict‐comprehensions into one simple loop and pull out the “round to nearest N” logic into a tiny helper—no functionality is lost, and you avoid keeping three intermediate mappings around:

```python
#!/usr/bin/env python3
import argparse

def _round_to(x: float, base: int) -> int:
    """Round x to nearest multiple of base."""
    return int(round(x / base) * base)

def main():
    p = argparse.ArgumentParser()
    # … your existing argparse setup …
    args = p.parse_args()

    ttokens = args.ttokens
    tps      = args.tokens_per_step
    r        = args.round
    c        = args.cooldown_percent
    out_path = args.out

    # compute the cooldown offset from the *raw* final run
    final_raw     = ( (ttokens - 1) * 10**12 ) / tps
    cooldown_iters = int(c * final_raw)

    with open(out_path, "w") as f:
        f.write("id\tsteps_mod\tsteps_rollback\n")
        for k in range(1, ttokens):
            raw   = (k * 10**12) / tps
            steps = _round_to(raw, r)
            rollback = _round_to(steps - cooldown_iters, r)
            f.write(f"{k}\t{steps}\t{rollback}\n")

    print(f"Wrote TSV to {out_path}")

if __name__ == "__main__":
    main()
```

- No more `runs`, `runs_mod`, `runs_rollback` dicts.  
- All rounding is in a single helper.  
- Keeps identical output and logic.
</issue_to_address>

### Comment 5
<location> `tools/cooldown_generator/make_cooldown_cmds.py:10` </location>
<code_context>
def build_command(
    load_path: str,
    data_file_list: str,
    train_script: str,
    train_iters: int,
    lr_cooldown_frac: float,
    grad_acc_steps: int,
    opt: str,
    min_lr: float,
    override_ckpt_opt_param: bool,
    extra_args: str,
) -> str:
    env_lines = [
        "LR_DECAY_STYLE=constant",
        f"OPT={opt}",
        "OVERRIDE_CKPT_OPT_PARAM=1" if override_ckpt_opt_param else "",
        f"TRAIN_ITERS={train_iters}",
        f"GRAD_ACC_STEPS={grad_acc_steps}",
        f"LOAD={load_path}",
        f"DATA_FILE_LIST={data_file_list}",
    ]
    env_block = " \\\n".join([l for l in env_lines if l])

    extra_line = ""
    if extra_args:
        extra_line = f" \\\n      {extra_args}"

    cmd = dedent(f"""\
    {env_block} \\
    bash {train_script} \\
      --override-opt_param-scheduler \\
      --min-lr={min_lr} \\
      --lr_constant_plus_cooldown \\
      --lr_constant_plus_cooldown_frac={fmt_float(lr_cooldown_frac)}{extra_line}
    """).strip()
    return cmd

</code_context>

<issue_to_address>
**issue (code-quality):** We've found these issues:

- Inline variable that is immediately returned ([`inline-immediately-returned-variable`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/inline-immediately-returned-variable/))
- Move setting of default value for variable into `else` branch ([`introduce-default-else`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/introduce-default-else/))
- Replace if statement with if expression ([`assign-if-exp`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/assign-if-exp/))
</issue_to_address>

### Comment 6
<location> `tools/cooldown_generator/make_cooldown_cmds.py:99` </location>
<code_context>
def main():
    p = argparse.ArgumentParser(
        description="Emit Megatron-DeepSpeed cooldown commands so LR cooldown starts at resume.\n"
                    "Provide checkpoint iteration(s) S and cooldown step(s) R.\n"
                    "For each pair, sets TRAIN_ITERS T=S+R and lr_constant_plus_cooldown_frac f=S/T."
    )
    p.add_argument("--load", required=True)
    p.add_argument("--data-file-list", required=True)
    p.add_argument("--train-script", default="train_alcf.sh")
    p.add_argument("--grad-acc-steps", type=int, default=2)
    p.add_argument("--opt", default="ipex.fusedlamb")
    p.add_argument("--min-lr", type=float, default=2e-5)
    p.add_argument("--no-override-ckpt-opt", action="store_true")
    p.add_argument("--extra-args", default="")
    p.add_argument("--emit-sh", type=Path, default=None)

    p.add_argument("--checkpoint-iters", "-S", type=int, nargs="+")
    p.add_argument("--cooldown-steps", "-R", type=int, nargs="+")
    p.add_argument("--checkpoint-ids", type=int, nargs="+")
    p.add_argument("--pairs", type=str, nargs="*")

    args = p.parse_args()
    override_flag = not args.no_override_ckpt_opt

    if args.pairs:
        records = parse_pairs(args.pairs)
    else:
        if not args.checkpoint_iters or not args.cooldown_steps:
            raise SystemExit("Provide either --pairs OR both --checkpoint-iters and --cooldown-steps.")
        ids = args.checkpoint_ids or list(range(1, len(args.checkpoint_iters) + 1))
        if len(ids) != len(args.checkpoint_iters):
            raise SystemExit("--checkpoint-ids must match length of --checkpoint-iters.")
        records = [{"id": cid, "S": int(S), "R": int(R)}
                   for cid, S in zip(ids, args.checkpoint_iters)
                   for R in args.cooldown_steps]

    lines = []
    header = "# Auto-generated cooldown commands\nset -euo pipefail\n\n"
    if args.emit_sh:
        lines.append(header)

    for rec in records:
        cid, S, R = rec["id"], rec["S"], rec["R"]
        T = S + R
        f = S / T
        tag = f"# id={cid} resume_step={S} cooldown_steps={R} total_iters={T} frac={fmt_float(f)}"
        cmd = build_command(
            load_path=args.load,
            data_file_list=args.data_file_list,
            train_script=args.train_script,
            train_iters=T,
            lr_cooldown_frac=f,
            grad_acc_steps=args.grad_acc_steps,
            opt=args.opt,
            min_lr=args.min_lr,
            override_ckpt_opt_param=override_flag,
            extra_args=args.extra_args.strip(),
        )
        block = f"{tag}\n{cmd}\n"
        print(block)
        if args.emit_sh:
            lines.append(block + "\n")

    if args.emit_sh:
        args.emit_sh.write_text("\n".join(lines))
        print(f"# Wrote script to: {args.emit_sh}")

</code_context>

<issue_to_address>
**suggestion (code-quality):** Move assignments closer to their usage ([`move-assign`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/move-assign/))

```suggestion
        header = "# Auto-generated cooldown commands\nset -euo pipefail\n\n"
```
</issue_to_address>
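The `TRAIN_ITERS` / cooldown-fraction arithmetic described in the parser help (resume at checkpoint step S, cool down for R more steps, so total T = S + R and constant-LR fraction f = S / T) is just two lines; the S and R values below are illustrative:

```python
# Resume step S and cooldown length R are made-up example values.
S, R = 10_000, 2_000
T = S + R   # TRAIN_ITERS passed to the run
f = S / T   # lr_constant_plus_cooldown_frac: LR is constant for the first f of T

print(T)            # 12000
print(round(f, 4))  # 0.8333
```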




Copilot AI left a comment


Pull Request Overview

This pull request enhances the AuroraGPT training workflow by adding automated tools for checkpoint cooldown management and improving the launcher script compatibility. The changes introduce a comprehensive toolset for generating cooldown commands at various training milestones and updates the job launcher to explicitly specify the Python interpreter.

Key changes:

  • Added three cooldown generator scripts that automate the creation of training commands with learning rate cooldown schedules
  • Updated the launcher setup to explicitly use the full path to python3 for better reliability
  • Provided comprehensive documentation including examples and detailed usage instructions for the new cooldown tools

Reviewed Changes

Copilot reviewed 6 out of 9 changed files in this pull request and generated 6 comments.

Summary per file:

  • `tools/cooldown_generator/make_cooldown_cmds.py`: Core script that generates Megatron-DeepSpeed cooldown commands with configurable checkpoint iterations and cooldown steps
  • `tools/cooldown_generator/gen_cooldown_sweep.sh`: Wrapper script that automates generation of per-checkpoint cooldown scripts with phase-based data list selection
  • `tools/cooldown_generator/build_checkpoints_from_tokens.py`: Utility to compute training iterations from token milestones and calculate rollback checkpoints
  • `tools/cooldown_generator/README.md`: Comprehensive documentation covering usage, examples, and workflow for all cooldown generation tools
  • `ALCF/notes/cooldown.md`: User guide documenting the cooldown process with real examples from a 256-node training run
  • `ALCF/notes/assets/cooldown.png`: Visualization asset for cooldown documentation
  • `ALCF/helpers.sh`: Updated launcher command to explicitly include the python3 path for ezpz-launch
  • `.gitignore`: Added exceptions to ensure cooldown generator tools are tracked in version control



```python
# <cooldown_percent>% of the FINAL (ttokens-1) step count, then rounded rollback
cooldown_iters = int(c * runs[ttokens - 1])
runs_rollback = {k: int(round((v - cooldown_iters) / r) * r) for k, v in runs_mod.items()}
```

Copilot AI Nov 11, 2025


Potential for negative rollback steps. Line 35 calculates runs_rollback by subtracting cooldown_iters from runs_mod[k]. For early checkpoints (small k values), this subtraction could result in negative values, which may not be meaningful for checkpoint steps. Consider adding validation or clamping to ensure runs_rollback[k] >= 0.

Suggested change:

```diff
- runs_rollback = {k: int(round((v - cooldown_iters) / r) * r) for k, v in runs_mod.items()}
+ runs_rollback = {k: max(int(round((v - cooldown_iters) / r) * r), 0) for k, v in runs_mod.items()}
```

Comment thread: `ALCF/helpers.sh` (outdated)

```diff
  export LAUNCHER="deepspeed --hostfile $hfds --launcher MPICH ${EXEC}"
  else
-   LAUNCHER="ezpz-launch ${EXEC}"
+   LAUNCHER="ezpz-launch $(which python3) ${EXEC}"
```

Copilot AI Nov 11, 2025


Inconsistent indentation in the else block. Line 364 uses spaces for indentation while surrounding lines (363, 365-384) use tabs. This should use tabs to maintain consistency with the rest of the function.

Suggested change (whitespace-only: replace the leading spaces with a tab):

```suggestion
	LAUNCHER="ezpz-launch $(which python3) ${EXEC}"
```

Comment thread: `ALCF/helpers.sh` (outdated)

```diff
  export LAUNCHER="deepspeed --hostfile $hfds --launcher MPICH ${EXEC}"
  else
-   LAUNCHER="ezpz-launch ${EXEC}"
+   LAUNCHER="ezpz-launch $(which python3) ${EXEC}"
```

Copilot AI Nov 11, 2025


Missing export keyword for the LAUNCHER variable assignment. Line 362 in the if branch uses export LAUNCHER=..., but line 364 in the else branch does not. This inconsistency could lead to the variable not being available in child processes. Consider adding export for consistency.

Suggested change:

```diff
- LAUNCHER="ezpz-launch $(which python3) ${EXEC}"
+ export LAUNCHER="ezpz-launch $(which python3) ${EXEC}"
```

```bash
# IDs 5..7 -> aurora/dolmino-mix-1124-fused-file-list.txt
#
# Example:
# ./run_cooldown_per_id_split.sh \
```

Copilot AI Nov 11, 2025


Incorrect script name in example comment. The comment on line 13 references ./run_cooldown_per_id_split.sh but the actual script is named gen_cooldown_sweep.sh. This could confuse users trying to follow the example.

Suggested change:

```diff
- # ./run_cooldown_per_id_split.sh \
+ # ./gen_cooldown_sweep.sh \
```

You must supply `(S, R)` pairs. Use `--pairs` *or* `-S ... -R ...`.

* **Nothing about the data list changes across IDs.**
This script is intentionally **generic**. If your data list changes by phase/ID, handle that in a wrapper (e.g., `run_cooldown_per_id_split.sh`) and pass the correct `--data-file-list` for each call.

Copilot AI Nov 11, 2025


Incorrect script name in documentation. Line 180 references run_cooldown_per_id_split.sh as an example wrapper, but the actual script in this PR is named gen_cooldown_sweep.sh. This should be updated to match the actual filename.

Suggested change:

```diff
- This script is intentionally **generic**. If your data list changes by phase/ID, handle that in a wrapper (e.g., `run_cooldown_per_id_split.sh`) and pass the correct `--data-file-list` for each call.
+ This script is intentionally **generic**. If your data list changes by phase/ID, handle that in a wrapper (e.g., `gen_cooldown_sweep.sh`) and pass the correct `--data-file-list` for each call.
```

Comment thread ALCF/notes/cooldown.md Outdated

- Cooled down over last 10\%:
- [volcanic-blaze-4312](https://wandb.ai/aurora_gpt/AuroraGPT/runs/7bjj8vgu/overview?nw=nwuserforemans)
![](./assets/cooldownHD.png)

Copilot AI Nov 11, 2025


Broken image reference. The path ./assets/cooldownHD.png references an image file that doesn't appear to exist in the PR. The actual image file added is cooldown.png (not cooldownHD.png). This should be ./assets/cooldown.png.

Suggested change:

```diff
- ![](./assets/cooldownHD.png)
+ ![](./assets/cooldown.png)
```

@saforem2 saforem2 closed this pull request by merging all changes into saforem2:main in 83fd580 Mar 11, 2026

Labels: None yet
Projects: None yet
4 participants