Conversation
Updated image reference for cooldown documentation.
Updated the section title and added details to the example.
## Reviewer's Guide

This PR enhances the AuroraGPT training workflow by adding a comprehensive cooldown pipeline—complete with documentation and helper scripts for generating and managing LR cooldown commands—and updates the job launcher to explicitly invoke the correct Python interpreter for improved compatibility.

### Sequence diagram for cooldown command generation and training launch

```mermaid
sequenceDiagram
    actor User
    participant "build_checkpoints_from_tokens.py"
    participant "gen_cooldown_sweep.sh"
    participant "make_cooldown_cmds.py"
    participant "Generated cooldown script"
    participant "AuroraGPT Training"
    User->>"build_checkpoints_from_tokens.py": Provide token milestones and params
    "build_checkpoints_from_tokens.py"->>User: Output checkpoints.tsv
    User->>"gen_cooldown_sweep.sh": Run with checkpoints.tsv and config
    "gen_cooldown_sweep.sh"->>"make_cooldown_cmds.py": For each checkpoint, call to generate command
    "make_cooldown_cmds.py"->>"gen_cooldown_sweep.sh": Output cooldown command script
    "gen_cooldown_sweep.sh"->>User: Emit per-ID cooldown scripts
    User->>"Generated cooldown script": Launch training job
    "Generated cooldown script"->>"AuroraGPT Training": Start training with LR cooldown
    "AuroraGPT Training"->>User: Training results and logs
```
### Class diagram for cooldown command generator scripts

```mermaid
classDiagram
    class build_checkpoints_from_tokens.py {
        +main()
        -ttokens: int
        -tokens_per_step: int
        -cooldown_percent: float
        -round: int
        -out: str
        +computes steps_mod and steps_rollback
    }
    class gen_cooldown_sweep.sh {
        +parses arguments
        +calls build_checkpoints_from_tokens.py
        +calls make_cooldown_cmds.py for each checkpoint
        +writes per-ID cooldown scripts
    }
    class make_cooldown_cmds.py {
        +main()
        +build_command(...)
        +parse_pairs(...)
        +generates cooldown command blocks
        +supports --emit-sh for script output
    }
    build_checkpoints_from_tokens.py --> gen_cooldown_sweep.sh : generates checkpoints.tsv
    gen_cooldown_sweep.sh --> make_cooldown_cmds.py : calls for command generation
```
### Class diagram for updated launcher setup in helpers.sh

```mermaid
classDiagram
    class helpers.sh {
        +setupLauncher()
        -LAUNCHER: string
        +uses ezpz-launch with explicit python3 path
    }
```
Hey there - I've reviewed your changes - here's some feedback:

- The generated scripts from `make_cooldown_cmds.py` don't include a shebang, so consider prepending `#!/usr/bin/env bash` when using `--emit-sh` to ensure they execute under the intended shell.
- There's duplicated CLI parsing logic between `make_cooldown_cmds.py` and `build_checkpoints_from_tokens.py`—extract common argument handling into a shared module to reduce drift.
- Hardcoding `$(which python3)` in `helpers.sh` can be brittle across environments; consider parameterizing the interpreter path or sourcing it from a common config instead.
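The shebang suggestion above could be addressed in the `--emit-sh` writer along these lines — a sketch only, with the helper name, output filename, and command blocks invented for illustration:

```python
from pathlib import Path

def write_script(path: Path, blocks: list[str]) -> None:
    """Write generated cooldown command blocks behind a shebang and strict-mode header."""
    # Prepending the shebang pins the shell the script runs under;
    # set -euo pipefail makes failures in any generated command fatal.
    header = "#!/usr/bin/env bash\n# Auto-generated cooldown commands\nset -euo pipefail\n\n"
    path.write_text(header + "\n".join(blocks) + "\n")
    path.chmod(0o755)  # allow running the emitted script directly

# Hypothetical usage with placeholder command blocks:
out = Path("cooldown_id_1.sh")
write_script(out, ["echo 'block 1'", "echo 'block 2'"])
print(out.read_text().splitlines()[0])  # #!/usr/bin/env bash
```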
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The generated scripts from make_cooldown_cmds.py don’t include a shebang, so consider prepending `#!/usr/bin/env bash` when using --emit-sh to ensure they execute under the intended shell.
- There’s duplicated CLI parsing logic between make_cooldown_cmds.py and build_checkpoints_from_tokens.py—extract common argument handling into a shared module to reduce drift.
- Hardcoding `$(which python3)` in helpers.sh can be brittle across environments; consider parameterizing the interpreter path or sourcing it from a common config instead.
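One way the interpreter path could be parameterized, as a sketch — the `PYTHON_BIN` variable name, the fallback chain, and the placeholder entrypoint are all illustrative, not part of the PR:

```shell
# Hypothetical: let the environment override the interpreter instead of
# hardcoding $(which python3); fall back to whatever python3 is on PATH.
EXEC="${EXEC:-your_train_script.py}"   # stand-in for the real training entrypoint
PYTHON_BIN="${PYTHON_BIN:-$(command -v python3 || echo python3)}"
export LAUNCHER="ezpz-launch ${PYTHON_BIN} ${EXEC}"
echo "${LAUNCHER}"
```

Exporting in both branches of `setupLauncher` would also address the consistency point raised later in the review.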
## Individual Comments
### Comment 1
<location> `tools/cooldown_generator/build_checkpoints_from_tokens.py:34` </location>
<code_context>
+ runs_mod = {k: int(round(v / r) * r) for k, v in runs.items()}
+
+ # <cooldown_percent>% of the FINAL (ttokens-1) step count, then rounded rollback
+ cooldown_iters = int(c * runs[ttokens - 1])
+ runs_rollback = {k: int(round((v - cooldown_iters) / r) * r) for k, v in runs_mod.items()}
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Rounding of cooldown_iters may introduce off-by-one errors.
Using int() truncates the value, which can lead to a lower cooldown_iters than intended. Switching to round() will help prevent off-by-one errors, particularly with small values.
```suggestion
cooldown_iters = round(c * runs[ttokens - 1])
```
</issue_to_address>
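A quick check of the truncation behavior this suggestion addresses — the numbers are illustrative, not from the PR:

```python
# int() truncates toward zero, silently dropping the fractional part of the
# product; round() goes to the nearest integer instead.
c = 0.10             # cooldown fraction (illustrative)
final_steps = 19999  # step count at the final token milestone (illustrative)

truncated = int(c * final_steps)
nearest = round(c * final_steps)
print(truncated, nearest)  # 1999 2000
```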
### Comment 2
<location> `tools/cooldown_generator/build_checkpoints_from_tokens.py:35` </location>
<code_context>
+
+ # <cooldown_percent>% of the FINAL (ttokens-1) step count, then rounded rollback
+ cooldown_iters = int(c * runs[ttokens - 1])
+ runs_rollback = {k: int(round((v - cooldown_iters) / r) * r) for k, v in runs_mod.items()}
+
+ with open(args.out, "w") as f:
</code_context>
<issue_to_address>
**issue:** Rollback steps may become negative for small step counts.
If cooldown_iters is greater than v for small k, rollback steps can be negative. Add a check to prevent negative values or clamp them to zero.
</issue_to_address>
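One way to clamp, sketched with made-up numbers (the `round_to` helper and sample step counts are illustrative):

```python
def round_to(x: float, base: int) -> int:
    """Round x to the nearest multiple of base."""
    return int(round(x / base) * base)

# Illustrative values: an early checkpoint (500 steps) and a later one (5000).
runs_mod = {1: 500, 2: 5000}
cooldown_iters = 1000
r = 100

# max(..., 0) keeps early checkpoints from rolling back past step zero.
runs_rollback = {k: max(round_to(v - cooldown_iters, r), 0) for k, v in runs_mod.items()}
print(runs_rollback)  # {1: 0, 2: 4000}
```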
### Comment 3
<location> `tools/cooldown_generator/make_cooldown_cmds.py:62` </location>
<code_context>
+"""
+import argparse
+
+def main():
+ p = argparse.ArgumentParser()
+ p.add_argument("--ttokens", type=int, default=8, help="Total token milestones (default: 8 for 0..7T).")
</code_context>
<issue_to_address>
**issue (complexity):** Consider using argparse mutually-exclusive groups, enumerate for IDs, and clearer list-joins to simplify argument parsing and command construction.
```suggestion
# Use argparse mutually‐exclusive groups, enumerate for IDs, and clearer list-joins
import argparse
def main():
p = argparse.ArgumentParser(...)
mode = p.add_mutually_exclusive_group(required=True)
mode.add_argument('--pairs', nargs='+', metavar='[CID:]S:R')
mode.add_argument('--checkpoint-iters', '-S', nargs='+', type=int)
p.add_argument('--cooldown-steps', '-R', nargs='+', type=int)
p.add_argument('--checkpoint-ids', nargs='+', type=int)
# ... other args ...
args = p.parse_args()
if args.pairs:
records = parse_pairs(args.pairs)
else:
if not args.cooldown_steps:
p.error('--cooldown-steps/-R is required with --checkpoint-iters/-S')
ids = args.checkpoint_ids or list(range(1, len(args.checkpoint_iters)+1))
if len(ids) != len(args.checkpoint_iters):
p.error('--checkpoint-ids must match length of --checkpoint-iters')
records = [
{"id": cid, "S": S, "R": R}
for cid, S in zip(ids, args.checkpoint_iters)
for R in args.cooldown_steps
]
# ...
def parse_pairs(pairs_args):
records = []
for idx, item in enumerate(pairs_args, 1):
parts = list(map(int, item.split(':')))
if len(parts) == 2:
cid, S, R = idx, *parts
elif len(parts) == 3:
cid, S, R = parts
else:
raise SystemExit(f"Bad --pairs entry: {item}")
if S <= 0 or R <= 0:
raise SystemExit(f"Non-positive S/R in --pairs entry: {item}")
records.append({"id": cid, "S": S, "R": R})
return records
def build_command(
load_path, data_file_list, train_script, train_iters,
lr_cooldown_frac, grad_acc_steps, opt, min_lr,
override_ckpt_opt_param, extra_args
):
env = {
"LR_DECAY_STYLE": "constant",
"OPT": opt,
"OVERRIDE_CKPT_OPT_PARAM": int(override_ckpt_opt_param),
"TRAIN_ITERS": train_iters,
"GRAD_ACC_STEPS": grad_acc_steps,
"LOAD": load_path,
"DATA_FILE_LIST": data_file_list,
}
env_block = " \\\n".join(f"{k}={v}" for k,v in env.items() if v)
cmd_parts = [
f"bash {train_script}",
"--override-opt_param-scheduler",
f"--min-lr={min_lr}",
"--lr_constant_plus_cooldown",
f"--lr_constant_plus_cooldown_frac={fmt_float(lr_cooldown_frac)}",
]
if extra_args:
cmd_parts.append(extra_args.strip())
return env_block + " \\\n " + " \\\n ".join(cmd_parts)
```
– Use a mutually‐exclusive group to enforce “pairs” _or_ “-S/-R” modes.
– Switch from manual `next_id` to `enumerate()` in `parse_pairs`.
– Build the `env_block` and command parts with dict/list comprehensions and simple joins (no `dedent` or filtering).
</issue_to_address>
### Comment 4
<location> `tools/cooldown_generator/build_checkpoints_from_tokens.py:30` </location>
<code_context>
+ c = args.cooldown_percent
+
+ # Steps at each i*T (i in 0..ttokens-1)
+ runs = {i: (i * 10**12) / tps for i in range(ttokens)}
+ runs_mod = {k: int(round(v / r) * r) for k, v in runs.items()}
+
</code_context>
<issue_to_address>
**issue (complexity):** Consider replacing the three dictionary comprehensions with a single loop and a rounding helper function to simplify the code.
You can collapse all three dict‐comprehensions into one simple loop and pull out the “round to nearest N” logic into a tiny helper—no functionality is lost, and you avoid keeping three intermediate mappings around:
```python
#!/usr/bin/env python3
import argparse
def _round_to(x: float, base: int) -> int:
"""Round x to nearest multiple of base."""
return int(round(x / base) * base)
def main():
p = argparse.ArgumentParser()
# … your existing argparse setup …
args = p.parse_args()
ttokens = args.ttokens
tps = args.tokens_per_step
r = args.round
c = args.cooldown_percent
out_path = args.out
# compute the cooldown offset from the *raw* final run
final_raw = ( (ttokens - 1) * 10**12 ) / tps
cooldown_iters = int(c * final_raw)
with open(out_path, "w") as f:
f.write("id\tsteps_mod\tsteps_rollback\n")
for k in range(1, ttokens):
raw = (k * 10**12) / tps
steps = _round_to(raw, r)
rollback = _round_to(steps - cooldown_iters, r)
f.write(f"{k}\t{steps}\t{rollback}\n")
print(f"Wrote TSV to {out_path}")
if __name__ == "__main__":
main()
```
- No more `runs`, `runs_mod`, `runs_rollback` dicts.
- All rounding is in a single helper.
- Keeps identical output and logic.
</issue_to_address>
### Comment 5
<location> `tools/cooldown_generator/make_cooldown_cmds.py:10` </location>
<code_context>
def build_command(
load_path: str,
data_file_list: str,
train_script: str,
train_iters: int,
lr_cooldown_frac: float,
grad_acc_steps: int,
opt: str,
min_lr: float,
override_ckpt_opt_param: bool,
extra_args: str,
) -> str:
env_lines = [
"LR_DECAY_STYLE=constant",
f"OPT={opt}",
"OVERRIDE_CKPT_OPT_PARAM=1" if override_ckpt_opt_param else "",
f"TRAIN_ITERS={train_iters}",
f"GRAD_ACC_STEPS={grad_acc_steps}",
f"LOAD={load_path}",
f"DATA_FILE_LIST={data_file_list}",
]
env_block = " \\\n".join([l for l in env_lines if l])
extra_line = ""
if extra_args:
extra_line = f" \\\n {extra_args}"
cmd = dedent(f"""\
{env_block} \\
bash {train_script} \\
--override-opt_param-scheduler \\
--min-lr={min_lr} \\
--lr_constant_plus_cooldown \\
--lr_constant_plus_cooldown_frac={fmt_float(lr_cooldown_frac)}{extra_line}
""").strip()
return cmd
</code_context>
<issue_to_address>
**issue (code-quality):** We've found these issues:
- Inline variable that is immediately returned ([`inline-immediately-returned-variable`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/inline-immediately-returned-variable/))
- Move setting of default value for variable into `else` branch ([`introduce-default-else`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/introduce-default-else/))
- Replace if statement with if expression ([`assign-if-exp`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/assign-if-exp/))
</issue_to_address>
### Comment 6
<location> `tools/cooldown_generator/make_cooldown_cmds.py:99` </location>
<code_context>
def main():
p = argparse.ArgumentParser(
description="Emit Megatron-DeepSpeed cooldown commands so LR cooldown starts at resume.\n"
"Provide checkpoint iteration(s) S and cooldown step(s) R.\n"
"For each pair, sets TRAIN_ITERS T=S+R and lr_constant_plus_cooldown_frac f=S/T."
)
p.add_argument("--load", required=True)
p.add_argument("--data-file-list", required=True)
p.add_argument("--train-script", default="train_alcf.sh")
p.add_argument("--grad-acc-steps", type=int, default=2)
p.add_argument("--opt", default="ipex.fusedlamb")
p.add_argument("--min-lr", type=float, default=2e-5)
p.add_argument("--no-override-ckpt-opt", action="store_true")
p.add_argument("--extra-args", default="")
p.add_argument("--emit-sh", type=Path, default=None)
p.add_argument("--checkpoint-iters", "-S", type=int, nargs="+")
p.add_argument("--cooldown-steps", "-R", type=int, nargs="+")
p.add_argument("--checkpoint-ids", type=int, nargs="+")
p.add_argument("--pairs", type=str, nargs="*")
args = p.parse_args()
override_flag = not args.no_override_ckpt_opt
if args.pairs:
records = parse_pairs(args.pairs)
else:
if not args.checkpoint_iters or not args.cooldown_steps:
raise SystemExit("Provide either --pairs OR both --checkpoint-iters and --cooldown-steps.")
ids = args.checkpoint_ids or list(range(1, len(args.checkpoint_iters) + 1))
if len(ids) != len(args.checkpoint_iters):
raise SystemExit("--checkpoint-ids must match length of --checkpoint-iters.")
records = [{"id": cid, "S": int(S), "R": int(R)}
for cid, S in zip(ids, args.checkpoint_iters)
for R in args.cooldown_steps]
lines = []
header = "# Auto-generated cooldown commands\nset -euo pipefail\n\n"
if args.emit_sh:
lines.append(header)
for rec in records:
cid, S, R = rec["id"], rec["S"], rec["R"]
T = S + R
f = S / T
tag = f"# id={cid} resume_step={S} cooldown_steps={R} total_iters={T} frac={fmt_float(f)}"
cmd = build_command(
load_path=args.load,
data_file_list=args.data_file_list,
train_script=args.train_script,
train_iters=T,
lr_cooldown_frac=f,
grad_acc_steps=args.grad_acc_steps,
opt=args.opt,
min_lr=args.min_lr,
override_ckpt_opt_param=override_flag,
extra_args=args.extra_args.strip(),
)
block = f"{tag}\n{cmd}\n"
print(block)
if args.emit_sh:
lines.append(block + "\n")
if args.emit_sh:
args.emit_sh.write_text("\n".join(lines))
print(f"# Wrote script to: {args.emit_sh}")
</code_context>
<issue_to_address>
**suggestion (code-quality):** Move assignments closer to their usage ([`move-assign`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/move-assign/))
```suggestion
header = "# Auto-generated cooldown commands\nset -euo pipefail\n\n"
```
</issue_to_address>
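For reference, the scheduler arithmetic used in the loop above (`T = S + R`, `f = S / T`) works out as follows for one illustrative pair:

```python
# Resume at checkpoint step S and cool down over R further steps:
S, R = 9000, 1000  # illustrative checkpoint iteration and cooldown steps
T = S + R          # TRAIN_ITERS handed to the training script
f = S / T          # lr_constant_plus_cooldown_frac: LR stays constant for the
                   # first f of training, then cools down over the remainder
print(T, f)  # 10000 0.9
```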
## Pull Request Overview
This pull request enhances the AuroraGPT training workflow by adding automated tools for checkpoint cooldown management and improving the launcher script compatibility. The changes introduce a comprehensive toolset for generating cooldown commands at various training milestones and updates the job launcher to explicitly specify the Python interpreter.
Key changes:
- Added three cooldown generator scripts that automate the creation of training commands with learning rate cooldown schedules
- Updated the launcher setup to explicitly use the full path to python3 for better reliability
- Provided comprehensive documentation including examples and detailed usage instructions for the new cooldown tools
Reviewed Changes
Copilot reviewed 6 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file

| File | Description |
|---|---|
| `tools/cooldown_generator/make_cooldown_cmds.py` | Core script that generates Megatron-DeepSpeed cooldown commands with configurable checkpoint iterations and cooldown steps |
| `tools/cooldown_generator/gen_cooldown_sweep.sh` | Wrapper script that automates generation of per-checkpoint cooldown scripts with phase-based data list selection |
| `tools/cooldown_generator/build_checkpoints_from_tokens.py` | Utility to compute training iterations from token milestones and calculate rollback checkpoints |
| `tools/cooldown_generator/README.md` | Comprehensive documentation covering usage, examples, and workflow for all cooldown generation tools |
| `ALCF/notes/cooldown.md` | User guide documenting the cooldown process with real examples from a 256-node training run |
| `ALCF/notes/assets/cooldown.png` | Visualization asset for cooldown documentation |
| `ALCF/helpers.sh` | Updated launcher command to explicitly include python3 path for ezpz-launch |
| `.gitignore` | Added exceptions to ensure cooldown generator tools are tracked in version control |
```python
# <cooldown_percent>% of the FINAL (ttokens-1) step count, then rounded rollback
cooldown_iters = int(c * runs[ttokens - 1])
runs_rollback = {k: int(round((v - cooldown_iters) / r) * r) for k, v in runs_mod.items()}
```

Potential for negative rollback steps. Line 35 calculates `runs_rollback` by subtracting `cooldown_iters` from `runs_mod[k]`. For early checkpoints (small `k` values), this subtraction could result in negative values, which may not be meaningful for checkpoint steps. Consider adding validation or clamping to ensure `runs_rollback[k] >= 0`.

```suggestion
runs_rollback = {k: max(int(round((v - cooldown_iters) / r) * r), 0) for k, v in runs_mod.items()}
```
```bash
	export LAUNCHER="deepspeed --hostfile $hfds --launcher MPICH ${EXEC}"
else
    LAUNCHER="ezpz-launch $(which python3) ${EXEC}"
```

Inconsistent indentation in the else block. Line 364 uses spaces for indentation while surrounding lines (363, 365-384) use tabs. This should use tabs to maintain consistency with the rest of the function.

```suggestion
	LAUNCHER="ezpz-launch $(which python3) ${EXEC}"
```
```bash
	export LAUNCHER="deepspeed --hostfile $hfds --launcher MPICH ${EXEC}"
else
	LAUNCHER="ezpz-launch $(which python3) ${EXEC}"
```

Missing `export` keyword for the `LAUNCHER` variable assignment. Line 362 in the if branch uses `export LAUNCHER=...`, but line 364 in the else branch does not. This inconsistency could lead to the variable not being available in child processes. Consider adding `export` for consistency.

```suggestion
	export LAUNCHER="ezpz-launch $(which python3) ${EXEC}"
```
```bash
# IDs 5..7 -> aurora/dolmino-mix-1124-fused-file-list.txt
#
# Example:
#   ./run_cooldown_per_id_split.sh \
```

Incorrect script name in example comment. The comment on line 13 references `./run_cooldown_per_id_split.sh` but the actual script is named `gen_cooldown_sweep.sh`. This could confuse users trying to follow the example.

```suggestion
#   ./gen_cooldown_sweep.sh \
```
```markdown
You must supply `(S, R)` pairs. Use `--pairs` *or* `-S ... -R ...`.

* **Nothing about the data list changes across IDs.**
  This script is intentionally **generic**. If your data list changes by phase/ID, handle that in a wrapper (e.g., `run_cooldown_per_id_split.sh`) and pass the correct `--data-file-list` for each call.
```

Incorrect script name in documentation. Line 180 references `run_cooldown_per_id_split.sh` as an example wrapper, but the actual script in this PR is named `gen_cooldown_sweep.sh`. This should be updated to match the actual filename.

```suggestion
  This script is intentionally **generic**. If your data list changes by phase/ID, handle that in a wrapper (e.g., `gen_cooldown_sweep.sh`) and pass the correct `--data-file-list` for each call.
```
```markdown
- Cooled down over last 10\%:
  - [volcanic-blaze-4312](https://wandb.ai/aurora_gpt/AuroraGPT/runs/7bjj8vgu/overview?nw=nwuserforemans)
    ![](./assets/cooldownHD.png)
```

Broken image reference. The path `./assets/cooldownHD.png` references an image file that doesn't appear to exist in the PR. The actual image file added is `cooldown.png` (not `cooldownHD.png`). This should be `./assets/cooldown.png`.

```suggestion
    ![](./assets/cooldown.png)
```
## Copilot Summary

This pull request improves the AuroraGPT training workflow documentation and updates the job launcher setup for better compatibility. The most significant changes are the addition of a detailed guide for cooling down checkpoints and converting them to a universal format, as well as an update to the launcher script to ensure the correct Python interpreter is used.

Documentation and workflow improvements:

- Added `ALCF/notes/cooldown.md` for cooling down AuroraGPT-2B checkpoints, including step-by-step instructions, example commands, and logs from a large-scale (256 nodes) training run. This documentation covers both the cooldown process and conversion to a universal checkpoint format.

Launcher setup enhancement:

- Updated `ALCF/helpers.sh` to explicitly use the path to the `python3` interpreter when invoking `ezpz-launch`, improving reliability and compatibility with Python-based workflows.

## Summary by Sourcery
Add tools and documentation for automating and documenting the cooldown process and checkpoint conversion for AuroraGPT training runs.
New Features:
Enhancements:
Documentation: