# Initial commit with updates to cpt.md and new README #19

saforem2 merged 79 commits into saforem2:main
## Conversation
Checkpoint Cooldown Helper Scripts
## Reviewer's Guide

This PR significantly extends AuroraGPT continual pre-training documentation, introduces a new large-batch optimizer guide, adds cooldown command-generation tooling and training scripts for specific stages, and makes small runtime/launcher cleanups (switching to ezpz APIs, safer logging, and making ezpz usage less intrusive).

Flow diagram for the AuroraGPT CPT data-mixing strategy across stages:

```mermaid
graph TD
Start["Start CPT planning for AuroraGPT V1"]
Start --> S1["Stage1 D0 pretraining complete"]
S1 --> S1_S2["Stage1 → Stage2\nWeak distribution shift"]
S1_S2 --> S1S2_Strategy1["Strategy1 No replay\nTrain only on D1"]
S1_S2 --> S1S2_Strategy2["Strategy2 Replay from D0\nMix D0 + D1"]
S1S2_Strategy2 --> S1S2_Mix["Set mixing weights\nalpha0=0.05–0.10, alphaD=1-alpha0, alphaB=0"]
S1S2_Mix --> S1S2_Buffer["Build buffer B with samples from D1"]
S1S2_Buffer --> S2_Complete["End of Stage2\nHave DCPT_1"]
S1S2_Strategy1 --> S1S2_Buffer
S2_Complete --> S2_S3["Stage2 → Stage3\nStrong shift to math/code D2"]
S2_S3 --> S2S3_Naive["Naive strategy\nTrain mostly on D2"]
S2_S3 --> S2S3_Mix["Strategy3 Mix D0 + D2 + B"]
S2_S3 --> S2S3_Decay["Strategy4 LR decay recipe\nif data mixing fails"]
S2S3_Mix --> S2S3_MixA["MixA D0=0.33, D2=0.33, B=0.34"]
S2S3_Mix --> S2S3_MixB["MixB D0=0.05, D2=0.48, B=0.47"]
S2S3_Mix --> S2S3_MixC["MixC D0=0.0, D2=0.10, B=0.90"]
S2S3_MixA --> S2S3_BufferUpdate["Add D2 to buffer B at end of stage"]
S2S3_MixB --> S2S3_BufferUpdate
S2S3_MixC --> S2S3_BufferUpdate
S2S3_BufferUpdate --> S3_Complete["End of Stage3\nHave DCPT_2"]
S3_Complete --> S3_S4["Stage3 → Stage4\nShift to reasoning traces D3"]
S3_S4 --> S3S4_NoPrevDecay["If Strategy4 was not used"]
S3_S4 --> S3S4_PrevDecay["If Strategy4 was used"]
S3S4_NoPrevDecay --> S3S4_MixA["MixA D0=0.33, D3=0.33, B=0.34"]
S3S4_NoPrevDecay --> S3S4_MixB["MixB D0=0.50, D3=0.25, B=0.25"]
S3S4_MixA --> S3S4_Cooldown["Cool down LR to convergence"]
S3S4_MixB --> S3S4_Cooldown
S3S4_PrevDecay --> S3S4_ContinueDecay["Continue decaying with DCPT_2\nuntil LR3/100"]
S3S4_ContinueDecay --> S3S4_IntroduceD3["Introduce D3 mix at LR3/5"]
S3S4_IntroduceD3 --> S3S4_Cooldown
S3S4_Cooldown --> Final["Final CPT model converged"]
```
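The Stage 1 → 2 replay-weight arithmetic in the diagram (alpha0 in the 0.05–0.10 range, alphaD = 1 - alpha0, alphaB = 0) can be sketched as follows; the function name is illustrative, not part of the PR:

```python
def stage2_mix_weights(alpha0: float) -> dict[str, float]:
    """Stage 1 -> Stage 2 replay mixing weights.

    alpha0: fraction of pretraining data D0 to replay (diagram suggests 0.05-0.10).
    alphaD: weight of the new dataset D1; alphaB: buffer weight (zero at this stage).
    """
    assert 0.0 <= alpha0 <= 1.0
    weights = {"alpha0": alpha0, "alphaD": 1.0 - alpha0, "alphaB": 0.0}
    # The weights form a probability distribution over data sources.
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return weights

print(stage2_mix_weights(0.05))
```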
Flow diagram for generating cooldown commands with make_cooldown_cmds.py:

```mermaid
graph TD
A["Start cooldown planning"] --> B["Provide checkpoint resume step S\n(--checkpoint-iters or --pairs)"]
B --> C["Provide cooldown length R in steps\n(--cooldown-steps or in --pairs)"]
C --> D["Optionally specify ID(s)\n(--checkpoint-ids or id:S:R in --pairs)"]
D --> E["Compute total train iters T = S + R"]
E --> F["Compute fraction f = S / T\n(for lr_constant_plus_cooldown_frac)"]
F --> G["Build env block\nMODEL_ARCH, OPT, GRAD_ACC_STEPS, TRAIN_ITERS=T, ..."]
G --> H["Construct train_alcf.sh command\nwith lr_constant_plus_cooldown and frac=f"]
H --> I{Emit mode}
I -->|--emit-sh provided| J["Write cooldown_idX_sS_rR_tT.sh script"]
I -->|no --emit-sh| K["Print command block to stdout"]
J --> L["Optionally submit script via PBS"]
K --> L
L --> M["Megatron-DeepSpeed resumes at S\nLR cooldown runs for R steps"]
```
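The arithmetic in steps E and F of the diagram (total iters T = S + R, constant-LR fraction f = S / T) can be sketched directly; the function name below is illustrative:

```python
def cooldown_schedule(resume_step: int, cooldown_steps: int) -> tuple[int, float]:
    """Given a checkpoint resume step S and a cooldown length R (in steps),
    return total train iters T = S + R and the constant-LR fraction f = S / T."""
    total_iters = resume_step + cooldown_steps
    frac = resume_step / total_iters
    return total_iters, frac

# e.g. resume at step 90k and cool down for 10k steps
T, f = cooldown_schedule(resume_step=90_000, cooldown_steps=10_000)
assert T == 100_000 and f == 0.9
```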
Hey - I've found 12 issues and left some high-level feedback:
- In tools/cooldown_generator/make_cooldown_cmds.py, the computation of train_iters from train_tokens is inverted (it currently multiplies by global_batch_size and sequence_length instead of dividing by them), which will massively overshoot the intended number of iterations; consider fixing this before relying on token-based configuration.
- The helper get_total_iters_from_cooldown_percent in make_cooldown_cmds.py has a suspicious expression `cooldown_percent = (train_iters - cooldown_steps / train_iters)` that likely needs parentheses (`(train_iters - cooldown_steps) / train_iters`) or a different formula; it would be good to correct or remove this before someone uses it.
- In ALCF/helpers.sh and train_alcf.sh you’ve commented out install_dependencies and ezpz_kill_mpi/ezpz auto-install; if these side effects are still required in common workflows, consider adding a clear alternative invocation path or a guarded flag so users don’t silently skip setup/cleanup.
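The precedence issue can be seen with concrete numbers (values are illustrative):

```python
train_iters, cooldown_steps = 100_000, 10_000

# Buggy: '/' binds tighter than '-', so this computes T - (R / T), not a fraction.
buggy = train_iters - cooldown_steps / train_iters
# Fixed: parenthesize the subtraction to get the fraction of non-cooldown steps.
fixed = (train_iters - cooldown_steps) / train_iters

assert abs(buggy - 99_999.9) < 1e-6  # nowhere near a valid fraction
assert fixed == 0.9                  # 90% of training before cooldown
```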
## Individual Comments
### Comment 1
<location path="tools/cooldown_generator/make_cooldown_cmds.py" line_range="162-163" />
<code_context>
+ assert train_iters is None, (
+ f"Only one of {train_tokens, train_iters} should be specified."
+ )
+ assert global_batch_size is not None and sequence_length is not None
+ train_iters = train_tokens * global_batch_size * sequence_length
+
+ assert train_iters is not None
</code_context>
<issue_to_address>
**issue (bug_risk):** Train iterations from train_tokens should divide by tokens-per-step, not multiply.
In the `train_tokens` branch you currently compute:
```python
train_iters = train_tokens * global_batch_size * sequence_length
```
If `train_tokens` is the total desired tokens, the correct step count is:
```python
tokens_per_step = global_batch_size * sequence_length
train_iters = train_tokens // tokens_per_step # or float division + rounding
```
Multiplying by `global_batch_size * sequence_length` yields an incorrect (and massively inflated) number of steps.
</issue_to_address>
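A hedged sketch of the corrected conversion (the function name is illustrative, not taken from the PR):

```python
def iters_from_tokens(train_tokens: int, global_batch_size: int, sequence_length: int) -> int:
    """Convert a total token budget into a training step count.

    Each optimizer step consumes global_batch_size * sequence_length tokens,
    so we divide (not multiply) by tokens-per-step.
    """
    tokens_per_step = global_batch_size * sequence_length
    return train_tokens // tokens_per_step

# e.g. a 2T-token budget at global batch 1024 and sequence length 4096
steps = iters_from_tokens(2_000_000_000_000, 1024, 4096)
assert steps == 476_837  # ~477k steps, not ~8.4e24 as the multiplied version gives
```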
### Comment 2
<location path="tools/cooldown_generator/make_cooldown_cmds.py" line_range="271-276" />
<code_context>
+
+ lines = []
+ # header = "# Auto-generated cooldown commands\nset -euo pipefail\n\n"
+ if args.include_header:
+ if (hfp := Path(args.include_header)).is_file():
+ with hfp.open("r") as f:
+ lines.extend(f.readlines())
+ else:
+ lines.extend("\n".join(args.include_header.split("\n")))
+ # if args.emit_sh:
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Using list.extend on a joined header string appends characters instead of lines.
In the non-file case you currently do:
```python
lines.extend("\n".join(args.include_header.split("\n")))
```
This extends `lines` with individual characters, not header lines, so the output script will be malformed.
Use either:
```python
lines.append(args.include_header)
```
if `include_header` is a full block, or:
```python
lines.extend(args.include_header.splitlines(keepends=True))
```
if you need to preserve line boundaries and newlines.
</issue_to_address>
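The pitfall can be demonstrated directly; this is generic Python behavior, independent of the PR (the header text below is a made-up example):

```python
lines = []
# BUG: list.extend() iterates its argument, and iterating a str yields characters.
lines.extend("\n".join("#PBS -q debug\n#PBS -l walltime=01:00:00".split("\n")))
assert lines[0] == "#"  # first element is a single character, not a line

# Fix 1: append the whole header as one string element.
fixed = []
fixed.append("#PBS -q debug\n#PBS -l walltime=01:00:00")
assert len(fixed) == 1

# Fix 2: one element per line, preserving trailing newlines.
fixed2 = "#PBS -q debug\n#PBS -l walltime=01:00:00".splitlines(keepends=True)
assert fixed2 == ["#PBS -q debug\n", "#PBS -l walltime=01:00:00"]
```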
### Comment 3
<location path="tools/cooldown_generator/gen_cooldown_sweep.sh" line_range="71" />
<code_context>
+# die "--cool-R (cooldown steps) or --cooldown-percent required"
+# fi
+
+# [[ -n "${COOL_R}" ]] || die "--cool-R (cooldown steps) is required"
+[[ -n "${LOAD_PATH}" ]] || die "--load path is required"
+[[ -n "${PHASE1_LIST}" ]] || die "--phase1-list is required (IDs 1..4)"
</code_context>
<issue_to_address>
**issue (bug_risk):** Cooldown steps (COOL_R) can be left empty but are still passed to make_cooldown_cmds.
The guard requiring `COOL_R` was commented out:
```bash
# [[ -n "${COOL_R}" ]] || die "--cool-R (cooldown steps) is required"
```
but `--cooldown-steps "${COOL_R}"` is still passed to `make_cooldown_cmds.py` in both exact and rollback scripts. If `COOL_R` is unset/empty, the Python script receives an empty value for a required integer argument and will fail or misbehave.
Please either restore the guard or define a safe default for `COOL_R` at the top of the script.
</issue_to_address>
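A hedged sketch of the two options (variable and function names follow the review; the default value of 1000 is purely illustrative):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Option A: define a safe default so an unset --cool-R still yields an integer.
COOL_R="${COOL_R:-1000}"

# Option B: restore the guard and fail fast when the value is missing.
die() { echo "ERROR: $*" >&2; exit 1; }
[[ -n "${COOL_R}" ]] || die "--cool-R (cooldown steps) is required"

echo "Using COOL_R=${COOL_R}"
```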
### Comment 4
<location path="ALCF/notes/cpt.md" line_range="33" />
<code_context>
+
+| CPT Stages | Distribution shift | Primary strategy | Fallbacks | Notes |
+|------|--------------------|------------------|-------------|-------|
+| Stage 1 → 2 | Weak | Naive CPT wih $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ |
+| Stage 2 → 3 | Strong | 5-30% replay of $D^{CPT}_{2}$ and monitor loss | Use buffer: $0.33D_0 + 0.33D_2 + 0.34B$| Add $D_2$ to buffer $B$, you might need to switch to LR centric strategy, see below |
+| Stage 3 → 4 | Strong | Cooldown with mix $0.33D_0 + 0.33D_3 + 0.34B$ | Cooldown with mix $0.05D_0 + 0.47D_3 + 0.48B$ | You will need to continue decay if used in previous stage. Play with decay function,stages, and final LR value |
</code_context>
<issue_to_address>
**issue (typo):** Fix typo in "wih" → "with" in the Stage 1→2 row.
"wih" in this row is a typo; please change it to "with".
```suggestion
| Stage 1 → 2 | Weak | Naive CPT with $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ |
```
</issue_to_address>
### Comment 5
<location path="ALCF/notes/cpt.md" line_range="47" />
<code_context>
+
+As a result, we primarily adopt a **data-centric strategy** throughout these stages, resorting to learning-rate adjustments only when necessary.
+
+The dataset $D_0$ for pretraining is Olmo-mix and has 4 Trillion tokens, then $D_1$ has 2 Trillion tokens from Dolmino and fineweb Edu meaning the data distribution between these two stages is weak. We then have $D_2$ for stage 3 that has 1.5 trillion tokens from math, code, ans science papers. Finally, we have $D_3$ stage 4 made of 0.5 trillion tokens from reasoning traces.
+
+| Stage | Dataset Symbol | Size | Source / Path | Notes |
</code_context>
<issue_to_address>
**issue (typo):** Correct "ans science papers" to "and science papers".
In the $D_2$ description, "ans science papers" is a typo and should be "and science papers".
```suggestion
The dataset $D_0$ for pretraining is Olmo-mix and has 4 Trillion tokens, then $D_1$ has 2 Trillion tokens from Dolmino and fineweb Edu meaning the data distribution between these two stages is weak. We then have $D_2$ for stage 3 that has 1.5 trillion tokens from math, code, and science papers. Finally, we have $D_3$ stage 4 made of 0.5 trillion tokens from reasoning traces.
```
</issue_to_address>
### Comment 6
<location path="ALCF/notes/cpt.md" line_range="103" />
<code_context>
+ ```bash
+python3 mix_datasets.py --input 0.9 /flare/Aurora_deployment/AuroraGPT/datasets/papers/papers.txt 0.1 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/dolma_v1_7_file_list_v2.txt > ${debug_dir}/Megatron-DeepSpeed/ALCF/data-lists/aurora/mix_lucid_papers09_dolma01.txt
+```
+2. **Start building the buffer $B$** in prevision of the next stages.
+3. **Run CPT**: Load your checkpoint and run CPT with the --finetube flag.
+Note that you might need to convert your checkpoints following [these instructions](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/main/ALCF/notes/universal_checkpoint_bug.md) to a universal checkpoint.
</code_context>
<issue_to_address>
**suggestion (typo):** Replace "in prevision of" with a more natural phrase like "in preparation for".
The phrase "in prevision of the next stages" is non-idiomatic. Please use a more natural alternative, e.g. "in preparation for the next stages" or simply "for the next stages".
```suggestion
2. **Start building the buffer $B$** in preparation for the next stages.
```
</issue_to_address>
### Comment 7
<location path="ALCF/notes/cpt.md" line_range="104" />
<code_context>
+python3 mix_datasets.py --input 0.9 /flare/Aurora_deployment/AuroraGPT/datasets/papers/papers.txt 0.1 /flare/Aurora_deployment/AuroraGPT/datasets/dolma/dolma_v1_7_file_list_v2.txt > ${debug_dir}/Megatron-DeepSpeed/ALCF/data-lists/aurora/mix_lucid_papers09_dolma01.txt
+```
+2. **Start building the buffer $B$** in prevision of the next stages.
+3. **Run CPT**: Load your checkpoint and run CPT with the --finetube flag.
+Note that you might need to convert your checkpoints following [these instructions](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/main/ALCF/notes/universal_checkpoint_bug.md) to a universal checkpoint.
+
</code_context>
<issue_to_address>
**issue (typo):** Likely typo: "--finetube" should be "--finetune".
If other commands use a `--finetune` flag, update this reference to match for consistency and to avoid confusion.
```suggestion
3. **Run CPT**: Load your checkpoint and run CPT with the --finetune flag.
```
</issue_to_address>
### Comment 8
<location path="ALCF/notes/cpt.md" line_range="79" />
<code_context>
+##### Strategy 1: No replay
+`Important: USE A CHECKPOINT AT LR=LR_max i.e. BEFORE COOLING DOWN`.
+
+Naively continue training with $D_1$, no replay data.
+- Continue training using only the current dataset $D_1$
+- No replay from $D_0$ or buffer data
+This may be sufficient under weak distribution shift but there is potential risks of forgetting
+
+##### Strategy 2: Replay from pretraining dataset
</code_context>
<issue_to_address>
**suggestion (typo):** Fix subject–verb agreement in "there is potential risks".
Change to either "there are potential risks of forgetting" or "there is a potential risk of forgetting" for correct subject–verb agreement.
```suggestion
This may be sufficient under weak distribution shift but there are potential risks of forgetting
```
</issue_to_address>
### Comment 9
<location path="ALCF/notes/cpt.md" line_range="165" />
<code_context>
+ - adjusting the data-mixing strategy by **increasing the weight of pretraining data**.
+
+At the end of this stage, we have $D^{CPT}_2$.
+#### Stage 3 to stage 4 (shift to reasoning tracex)
+
+At this point, we only have ~6% of training left and one should start the final decay.
</code_context>
<issue_to_address>
**issue (typo):** Fix typo "tracex" → "traces".
```suggestion
#### Stage 3 to stage 4 (shift to reasoning traces)
```
</issue_to_address>
### Comment 10
<location path="ALCF/notes/cpt.md" line_range="184" />
<code_context>
+
+***If we did use Strategy 4 above:***
+We should keep decaying with $D^{CPT}_2$ until $LR_3/100$ then introduce the new mix at $LR_3/5$
+
+
+
</code_context>
<issue_to_address>
**issue (typo):** Fix typo "devay" → "decay" in the image alt-text.
For clarity, update the alt text phrase to "previous decay".
```suggestion

```
</issue_to_address>
### Comment 11
<location path="ALCF/notes/large_batch_optimizers_settings.md" line_range="27" />
<code_context>
+
+### Adding custom optimizers
+To add a custom optimizer, you have to modify the following files:
+- `megatron/optimizer/__init__.py`: [muon example](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/994f2a129d465cc50e6c35af075eb3292874effe/megatron/optimizer/__init__.py#L434), note that you either heve to import the optimizer from a pre-installed package or add it in the `megatron/optimizer/` folder.
+- `megatron/arguments.py`: [optimizer arguments](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/994f2a129d465cc50e6c35af075eb3292874effe/megatron/arguments.py#L1070), to add the optimizer arguments
+- `megatron/arguments.py`: [list of valid optimizers](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/994f2a129d465cc50e6c35af075eb3292874effe/megatron/arguments.py#L1485), to add the new optimizer to the list of valid optimizers
</code_context>
<issue_to_address>
**issue (typo):** Correct "heve" → "have".
```suggestion
- `megatron/optimizer/__init__.py`: [muon example](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/994f2a129d465cc50e6c35af075eb3292874effe/megatron/optimizer/__init__.py#L434), note that you either have to import the optimizer from a pre-installed package or add it in the `megatron/optimizer/` folder.
```
</issue_to_address>
### Comment 12
<location path="tools/cooldown_generator/make_cooldown_cmds.py" line_range="42" />
<code_context>
+ return f"{x:.8f}".rstrip("0").rstrip(".")
+
+
+def get_total_iters_from_cooldown_percent(
+ checkpoint_iter: Optional[int] = None,
+ cooldown_percent: Optional[float] = None,
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying this script by removing unused helpers, fixing and streamlining header handling, extracting checkpoint file updates into a helper, and adding a thin wrapper around `build_command` for the common cooldown case to make the main flow easier to follow.
1. `get_total_iters_from_cooldown_percent` is unused and adds complexity
If you don’t have a concrete caller yet, it’s just mental overhead. You can safely drop it now and reintroduce a focused version when needed:
```python
# Remove this entire function for now:
# def get_total_iters_from_cooldown_percent(...):
# ...
```
If you plan to use it soon, consider a much narrower helper that matches how this script actually computes things now, e.g.:
```python
def compute_total_iters_and_frac(S: int, R: int) -> tuple[int, float]:
T = S + R
return T, S / T
```
and then use that in `main()` instead of recomputing `T` and `f` inline.
2. Header handling has a subtle bug and can be simplified
`lines.extend("\n".join(args.include_header.split("\n")))` extends by characters, not lines. You can fix that and also normalize to a single header string:
```python
# Build a header string once
if args.include_header is None:
header_str = get_header_template(
queue=args.queue,
project=args.project,
walltime=args.walltime,
filesystems=args.filesystems,
nodes=args.nodes,
)
else:
hfp = Path(args.include_header)
if hfp.is_file():
header_str = hfp.read_text()
else:
header_str = args.include_header
lines = []
if header_str:
# Ensure trailing newline
if not header_str.endswith("\n"):
header_str += "\n"
lines.append(header_str)
```
Then when writing scripts:
```python
with open(outfile, "w") as f:
f.writelines(lines)
f.write(block + "\n")
```
This fixes the bug and keeps header behavior explicit and simple.
3. Extract checkpoint metadata sync out of the main loop
The per-record loop mixes command generation and checkpoint file mutation. Extracting the latter into a helper reduces cognitive load without changing behavior:
```python
def ensure_checkpoint_pointers(load_path: str, S: int) -> None:
ckpt_parent = Path(load_path).parent
latest_fp = ckpt_parent / "latest"
latest_ckpt_iter = ckpt_parent / "latest_checkpointed_iteration.txt"
if latest_fp.is_file():
logger.info(f"Found 'latest' in {ckpt_parent}!")
with latest_fp.open("r") as f:
_latest = f.read().rstrip("\n").lstrip("global_step")
assert int(_latest) == int(S), f"{_latest=} != {S=}"
else:
logger.info(f"No 'latest' in {ckpt_parent}!")
logger.info(f"Writing global_step{S} to {latest_fp}")
with latest_fp.open("w") as f:
f.write(f"global_step{S}")
if latest_ckpt_iter.is_file():
logger.info(f"Found 'latest_checkpointed_iteration.txt' in {ckpt_parent}!")
with latest_ckpt_iter.open("r") as f:
_latest = f.read().rstrip("\n")
assert int(_latest) == int(S), f"{_latest=} != {S=}"
else:
logger.info(f"No 'latest_checkpointed_iteration.txt' in {ckpt_parent}!")
logger.info(f"Writing {S} to {latest_ckpt_iter}")
with latest_ckpt_iter.open("w") as f:
f.write(f"{S}")
```
and in the loop:
```python
for rec in records:
cid, S, R = rec["id"], rec["S"], rec["R"]
T = S + R
f = S / T
tag = f"# id={cid} resume_step={S} cooldown_steps={R} total_iters={T} frac={fmt_float(f)}"
ensure_checkpoint_pointers(args.load, S)
cmd = build_command(
load_path=args.load,
data_file_list=args.data_file_list,
train_script=args.train_script,
train_iters=T,
lr_cooldown_frac=f,
grad_acc_steps=args.grad_acc_steps,
opt=args.opt,
min_lr=args.min_lr,
override_ckpt_opt_param=override_flag,
extra_args=args.extra_args.strip(),
# other options unchanged...
)
...
```
The behavior is identical, but the main loop reads as a higher-level sequence of steps.
4. `build_command` is very general compared to how it’s used
Right now `main()` only drives the `train_iters` path and a small subset of the parameters. To keep the generality without making the core harder to follow, you can factor the “cooldown sweep common case” into a thin wrapper and keep the existing `build_command` as a lower-level primitive:
```python
def build_cooldown_command(
load_path: str,
data_file_list: str,
train_script: str,
T: int,
f: float,
grad_acc_steps: int,
opt: str,
min_lr: float,
override_ckpt_opt_param: bool,
extra_args: str,
) -> str:
return build_command(
load_path=load_path,
data_file_list=data_file_list,
train_script=train_script,
train_iters=T,
lr_cooldown_frac=f,
grad_acc_steps=grad_acc_steps,
opt=opt,
min_lr=min_lr,
override_ckpt_opt_param=override_ckpt_opt_param,
extra_args=extra_args,
# leave other parameters at their defaults
)
```
and in `main()`:
```python
cmd = build_cooldown_command(
load_path=args.load,
data_file_list=args.data_file_list,
train_script=args.train_script,
T=T,
f=f,
grad_acc_steps=args.grad_acc_steps,
opt=args.opt,
min_lr=args.min_lr,
override_ckpt_opt_param=override_flag,
extra_args=args.extra_args.strip(),
)
```
This keeps all existing knobs intact for future use, but makes the primary flow much easier to understand.
</issue_to_address>
| assert global_batch_size is not None and sequence_length is not None | ||
| train_iters = train_tokens * global_batch_size * sequence_length |
There was a problem hiding this comment.
issue (bug_risk): Train iterations from train_tokens should divide by tokens-per-step, not multiply.
In the train_tokens branch you currently compute:
train_iters = train_tokens * global_batch_size * sequence_lengthIf train_tokens is the total desired tokens, the correct step count is:
tokens_per_step = global_batch_size * sequence_length
train_iters = train_tokens // tokens_per_step # or float division + roundingMultiplying by global_batch_size * sequence_length yields an incorrect (and massively inflated) number of steps.
| if args.include_header: | ||
| if (hfp := Path(args.include_header)).is_file(): | ||
| with hfp.open("r") as f: | ||
| lines.extend(f.readlines()) | ||
| else: | ||
| lines.extend("\n".join(args.include_header.split("\n"))) |
There was a problem hiding this comment.
issue (bug_risk): Using list.extend on a joined header string appends characters instead of lines.
In the non-file case you currently do:
lines.extend("\n".join(args.include_header.split("\n")))This extends lines with individual characters, not header lines, so the output script will be malformed.
Use either:
lines.append(args.include_header)if include_header is a full block, or:
lines.extend(args.include_header.splitlines(keepends=True))if you need to preserve line boundaries and newlines.
| # die "--cool-R (cooldown steps) or --cooldown-percent required" | ||
| # fi | ||
|
|
||
| # [[ -n "${COOL_R}" ]] || die "--cool-R (cooldown steps) is required" |
There was a problem hiding this comment.
issue (bug_risk): Cooldown steps (COOL_R) can be left empty but are still passed to make_cooldown_cmds.
The guard requiring COOL_R was commented out:
# [[ -n "${COOL_R}" ]] || die "--cool-R (cooldown steps) is required"but --cooldown-steps "${COOL_R}" is still passed to make_cooldown_cmds.py in both exact and rollback scripts. If COOL_R is unset/empty, the Python script receives an empty value for a required integer argument and will fail or misbehave.
Please either restore the guard or define a safe default for COOL_R at the top of the script.
|
|
||
| | CPT Stages | Distribution shift | Primary strategy | Fallbacks | Notes | | ||
| |------|--------------------|------------------|-------------|-------| | ||
| | Stage 1 → 2 | Weak | Naive CPT wih $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ | |
There was a problem hiding this comment.
issue (typo): Fix typo in "wih" → "with" in the Stage 1→2 row.
"wih" in this row is a typo; please change it to "with".
| | Stage 1 → 2 | Weak | Naive CPT wih $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ | | |
| | Stage 1 → 2 | Weak | Naive CPT with $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ | |
|
|
||
| As a result, we primarily adopt a **data-centric strategy** throughout these stages, resorting to learning-rate adjustments only when necessary. | ||
|
|
||
| The dataset $D_0$ for pretraining is Olmo-mix and has 4 Trillion tokens, then $D_1$ has 2 Trillion tokens from Dolmino and fineweb Edu meaning the data distribution between these two stages is weak. We then have $D_2$ for stage 3 that has 1.5 trillion tokens from math, code, ans science papers. Finally, we have $D_3$ stage 4 made of 0.5 trillion tokens from reasoning traces. |
There was a problem hiding this comment.
issue (typo): Correct "ans science papers" to "and science papers".
In the
| The dataset $D_0$ for pretraining is Olmo-mix and has 4 Trillion tokens, then $D_1$ has 2 Trillion tokens from Dolmino and fineweb Edu meaning the data distribution between these two stages is weak. We then have $D_2$ for stage 3 that has 1.5 trillion tokens from math, code, ans science papers. Finally, we have $D_3$ stage 4 made of 0.5 trillion tokens from reasoning traces. | |
| The dataset $D_0$ for pretraining is Olmo-mix and has 4 Trillion tokens, then $D_1$ has 2 Trillion tokens from Dolmino and fineweb Edu meaning the data distribution between these two stages is weak. We then have $D_2$ for stage 3 that has 1.5 trillion tokens from math, code, and science papers. Finally, we have $D_3$ stage 4 made of 0.5 trillion tokens from reasoning traces. |
| Naively continue training with $D_1$, no replay data. | ||
| - Continue training using only the current dataset $D_1$ | ||
| - No replay from $D_0$ or buffer data | ||
| This may be sufficient under weak distribution shift but there is potential risks of forgetting |
There was a problem hiding this comment.
suggestion (typo): Fix subject–verb agreement in "there is potential risks".
Change to either "there are potential risks of forgetting" or "there is a potential risk of forgetting" for correct subject–verb agreement.
| This may be sufficient under weak distribution shift but there is potential risks of forgetting | |
| This may be sufficient under weak distribution shift but there are potential risks of forgetting |
| - adjusting the data-mixing strategy by **increasing the weight of pretraining data**. | ||
|
|
||
| At the end of this stage, we have $D^{CPT}_2$. | ||
| #### Stage 3 to stage 4 (shift to reasoning tracex) |
There was a problem hiding this comment.
issue (typo): Fix typo "tracex" → "traces".
| #### Stage 3 to stage 4 (shift to reasoning tracex) | |
| #### Stage 3 to stage 4 (shift to reasoning traces) |
|
|
||
| ***If we did use Strategy 4 above:*** | ||
| We should keep decaying with $D^{CPT}_2$ until $LR_3/100$ then introduce the new mix at $LR_3/5$ | ||
|  |
There was a problem hiding this comment.
issue (typo): Fix typo "devay" → "decay" in the image alt-text.
For clarity, update the alt text phrase to "previous decay".
|  | |
|  |
|
|
||
| ### Adding custom optimizers | ||
| To add a custom optimizer, you have to modify the following files: | ||
| - `megatron/optimizer/__init__.py`: [muon example](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/994f2a129d465cc50e6c35af075eb3292874effe/megatron/optimizer/__init__.py#L434), note that you either heve to import the optimizer from a pre-installed package or add it in the `megatron/optimizer/` folder. |
There was a problem hiding this comment.
issue (typo): Correct "heve" → "have".
| - `megatron/optimizer/__init__.py`: [muon example](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/994f2a129d465cc50e6c35af075eb3292874effe/megatron/optimizer/__init__.py#L434), note that you either heve to import the optimizer from a pre-installed package or add it in the `megatron/optimizer/` folder. | |
| - `megatron/optimizer/__init__.py`: [muon example](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/994f2a129d465cc50e6c35af075eb3292874effe/megatron/optimizer/__init__.py#L434), note that you either have to import the optimizer from a pre-installed package or add it in the `megatron/optimizer/` folder. |
| return f"{x:.8f}".rstrip("0").rstrip(".") | ||
|
|
||
|
|
||
| def get_total_iters_from_cooldown_percent( |
There was a problem hiding this comment.
issue (complexity): Consider simplifying this script by removing unused helpers, fixing and streamlining header handling, extracting checkpoint file updates into a helper, and adding a thin wrapper around build_command for the common cooldown case to make the main flow easier to follow.
get_total_iters_from_cooldown_percentis unused and adds complexity
If you don’t have a concrete caller yet, it’s just mental overhead. You can safely drop it now and reintroduce a focused version when needed:
# Remove this entire function for now:
# def get_total_iters_from_cooldown_percent(...):
# ...If you plan to use it soon, consider a much narrower helper that matches how this script actually computes things now, e.g.:
def compute_total_iters_and_frac(S: int, R: int) -> tuple[int, float]:
T = S + R
return T, S / Tand then use that in main() instead of recomputing T and f inline.
- Header handling has a subtle bug and can be simplified
lines.extend("\n".join(args.include_header.split("\n")))extends by characters, not lines. You can fix that and also normalize to a single header string:
# Build a header string once
if args.include_header is None:
header_str = get_header_template(
queue=args.queue,
project=args.project,
walltime=args.walltime,
filesystems=args.filesystems,
nodes=args.nodes,
)
else:
hfp = Path(args.include_header)
if hfp.is_file():
header_str = hfp.read_text()
else:
header_str = args.include_header
lines = []
if header_str:
# Ensure trailing newline
if not header_str.endswith("\n"):
header_str += "\n"
lines.append(header_str)Then when writing scripts:
with open(outfile, "w") as f:
f.writelines(lines)
f.write(block + "\n")This fixes the bug and keeps header behavior explicit and simple.
- Extract checkpoint metadata sync out of the main loop
The per-record loop mixes command generation and checkpoint file mutation. Extracting the latter into a helper reduces cognitive load without changing behavior:
```python
def ensure_checkpoint_pointers(load_path: str, S: int) -> None:
    ckpt_parent = Path(load_path).parent
    latest_fp = ckpt_parent / "latest"
    latest_ckpt_iter = ckpt_parent / "latest_checkpointed_iteration.txt"
    if latest_fp.is_file():
        logger.info(f"Found 'latest' in {ckpt_parent}!")
        with latest_fp.open("r") as f:
            _latest = f.read().rstrip("\n").lstrip("global_step")
        assert int(_latest) == int(S), f"{_latest=} != {S=}"
    else:
        logger.info(f"No 'latest' in {ckpt_parent}!")
        logger.info(f"Writing global_step{S} to {latest_fp}")
        with latest_fp.open("w") as f:
            f.write(f"global_step{S}")
    if latest_ckpt_iter.is_file():
        logger.info(f"Found 'latest_checkpointed_iteration.txt' in {ckpt_parent}!")
        with latest_ckpt_iter.open("r") as f:
            _latest = f.read().rstrip("\n")
        assert int(_latest) == int(S), f"{_latest=} != {S=}"
    else:
        logger.info(f"No 'latest_checkpointed_iteration.txt' in {ckpt_parent}!")
        logger.info(f"Writing {S} to {latest_ckpt_iter}")
        with latest_ckpt_iter.open("w") as f:
            f.write(f"{S}")
```

and in the loop:
```python
for rec in records:
    cid, S, R = rec["id"], rec["S"], rec["R"]
    T = S + R
    f = S / T
    tag = f"# id={cid} resume_step={S} cooldown_steps={R} total_iters={T} frac={fmt_float(f)}"
    ensure_checkpoint_pointers(args.load, S)
    cmd = build_command(
        load_path=args.load,
        data_file_list=args.data_file_list,
        train_script=args.train_script,
        train_iters=T,
        lr_cooldown_frac=f,
        grad_acc_steps=args.grad_acc_steps,
        opt=args.opt,
        min_lr=args.min_lr,
        override_ckpt_opt_param=override_flag,
        extra_args=args.extra_args.strip(),
        # other options unchanged...
    )
    ...
```

The behavior is identical, but the main loop reads as a higher-level sequence of steps.
- `build_command` is very general compared to how it's used

Right now `main()` only drives the `train_iters` path and a small subset of the parameters. To keep the generality without making the core harder to follow, you can factor the "cooldown sweep" common case into a thin wrapper and keep the existing `build_command` as a lower-level primitive:
```python
def build_cooldown_command(
    load_path: str,
    data_file_list: str,
    train_script: str,
    T: int,
    f: float,
    grad_acc_steps: int,
    opt: str,
    min_lr: float,
    override_ckpt_opt_param: bool,
    extra_args: str,
) -> str:
    return build_command(
        load_path=load_path,
        data_file_list=data_file_list,
        train_script=train_script,
        train_iters=T,
        lr_cooldown_frac=f,
        grad_acc_steps=grad_acc_steps,
        opt=opt,
        min_lr=min_lr,
        override_ckpt_opt_param=override_ckpt_opt_param,
        extra_args=extra_args,
        # leave other parameters at their defaults
    )
```

and in `main()`:
```python
cmd = build_cooldown_command(
    load_path=args.load,
    data_file_list=args.data_file_list,
    train_script=args.train_script,
    T=T,
    f=f,
    grad_acc_steps=args.grad_acc_steps,
    opt=args.opt,
    min_lr=args.min_lr,
    override_ckpt_opt_param=override_flag,
    extra_args=args.extra_args.strip(),
)
```

This keeps all existing knobs intact for future use, but makes the primary flow much easier to understand.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ff45fa8ed9
```python
p.add_argument("--min-lr", type=float, default=2e-5)
p.add_argument("--no-override-ckpt-opt", action="store_true")
p.add_argument("--extra-args", default="")
p.add_argument("--emit-sh", action="store_true", default=None)
```
Accept file argument for --emit-sh
`--emit-sh` is declared with `action="store_true"`, so argparse treats it as a boolean flag and does not accept a value, but both the README examples and `gen_cooldown_sweep.sh` call it as `--emit-sh <path>`. In that common usage, the path is parsed as an unexpected positional argument and command generation aborts, which breaks the documented per-file cooldown workflow.
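One possible fix, assuming the flag should accept an optional output path (the default filename `cooldown.sh` below is illustrative, not from the script):

```python
import argparse

p = argparse.ArgumentParser()
# nargs="?" with const lets both forms work:
#   bare `--emit-sh`            → const value ("cooldown.sh" here)
#   `--emit-sh scripts/run.sh`  → the given path
p.add_argument("--emit-sh", nargs="?", const="cooldown.sh", default=None)

print(p.parse_args([]).emit_sh)                      # → None
print(p.parse_args(["--emit-sh"]).emit_sh)           # → cooldown.sh
print(p.parse_args(["--emit-sh", "run.sh"]).emit_sh) # → run.sh
```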
```python
        override_ckpt_opt_param=override_flag,
        extra_args=args.extra_args.strip(),
    )
    latest_fp = Path(args.load).parent.joinpath("latest")
```
Write tracker files inside the load directory
The script computes tracker paths with `Path(args.load).parent`, which moves one level above the `--load` directory. For the documented `--load /.../checkpoints_parent` usage, this reads/writes `latest` in the parent directory instead of the checkpoint directory, so generated runs can resume from the wrong checkpoint and the script may modify unrelated files.
```python
        logger.info(f"Found 'latest' in {ckpt_parent}!")
        with latest_fp.open("r") as f:
            _latest = f.read().rstrip("\n").lstrip("global_step")
        assert int(_latest) == int(S), f"{_latest=} != {S=}"
```
Allow sweeps over multiple checkpoint iterations
This assertion requires the existing `latest` tracker to equal each requested `S`, so generating commands for multiple checkpoint iterations (or any non-current checkpoint) fails immediately. The script is documented to emit grids/sweeps across many `-S` values, but this check makes that workflow unusable unless all records share the same checkpoint.
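A sketch of one way to relax the check so sweeps work: instead of asserting the tracker already equals each `S`, verify that the requested checkpoint actually exists under the load directory (the function name and `global_step{S}` layout are the Megatron-style convention the surrounding code assumes):

```python
import tempfile
from pathlib import Path


def checkpoint_available(load_path: str, S: int) -> bool:
    """Return True if a global_step{S} checkpoint directory exists,
    without requiring the 'latest' tracker to already point at S."""
    return (Path(load_path) / f"global_step{S}").is_dir()


# Example with a throwaway directory layout:
root = Path(tempfile.mkdtemp())
(root / "global_step1000").mkdir()
print(checkpoint_available(root, 1000))  # → True
print(checkpoint_available(root, 2000))  # → False
```

The tracker files could then be rewritten per record (or per emitted script) rather than asserted against, letting a single invocation target many different `-S` values.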
Pull request overview
This PR updates AuroraGPT’s ALCF documentation and training utilities by expanding CPT guidance, adding a cooldown-command generator workflow, and adjusting launch/setup scripts to better support ongoing large-scale training and checkpoint cooldown experiments.
Changes:
- Major rewrite/expansion of `ALCF/notes/cpt.md` into a stage-by-stage CPT strategy cookbook (and removal of embedded helper script content).
- Added a new `tools/cooldown_generator/` workflow (Python + shell) to generate Megatron-DeepSpeed cooldown commands/scripts from checkpoint steps or token milestones.
- Updated various ALCF training/launcher scripts and notes (new stage scripts, launcher command tweak, dependency installation behavior, and logging tweaks).
Reviewed changes
Copilot reviewed 19 out of 28 changed files in this pull request and generated 20 comments.
Show a summary per file

| File | Description |
|---|---|
| `train_alcf.sh` | Disables in-script ezpz install block and MPI cleanup call. |
| `train_aGPT_2B_sophiag_stage2.sh` | New PBS launch script for 2B SophiaG stage2. |
| `train_aGPT_2B_sophiag_stage3.sh` | New PBS launch script for 2B SophiaG stage3. |
| `train_aGPT_2B_large_batch.sh` | Updates optimizer selection and removes old optimizer-plans comment block. |
| `tools/cooldown_generator/make_cooldown_cmds.py` | New Python generator for cooldown commands/scripts. |
| `tools/cooldown_generator/gen_cooldown_sweep.sh` | New shell wrapper to generate per-ID cooldown scripts from a TSV. |
| `tools/cooldown_generator/build_checkpoints_from_tokens.py` | New utility to map token milestones → step checkpoints (+ rollback offsets). |
| `tools/cooldown_generator/README.md` | New documentation for cooldown generator tools and workflows. |
| `pretrain_gpt_alcf.py` | Makes W&B finish conditional on wandb being non-None. |
| `megatron/training.py` | Normalizes ezpz usage (`ezpz.*` / `ezpz.dist.*`) and minor formatting cleanups. |
| `megatron/timers.py` | Switches timer output from `print(..., flush=True)` to ezpz logger. |
| `ALCF/notes/lb_optimizers_settings.md` | Removes older optimizer/settings note doc (replaced by new doc). |
| `ALCF/notes/large_batch_optimizers_settings.md` | Adds updated large-batch optimizer + LR-finder documentation and assets reference. |
| `ALCF/notes/cpt.md` | Expanded CPT "strategy cookbook" for AuroraGPT V1, includes stage guidance and mixing strategies. |
| `ALCF/notes/assets/lb_optimizers/readme.md` | Adds placeholder/readme for optimizer-doc images. |
| `ALCF/notes/assets/cpt_images/readme.md` | Adds placeholder/readme for CPT-doc images. |
| `ALCF/helpers.sh` | Comments out dependency install step; simplifies launcher command; whitespace/indent cleanup. |
| `ALCF/data-lists/aurora/nvidia-math1-code2.txt` | Adds a new dataset file list for math/code stage data. |
| `ALCF/data-lists/aurora/dolmino-mix-1124-fused-file-list.txt` | Adds a new fused Dolmino mix file list. |
| `.gitignore` | Adds un-ignore rules for `tools/cooldown_generator/`. |
```python
    with hfp.open("r") as f:
        lines.extend(f.readlines())
else:
    lines.extend("\n".join(args.include_header.split("\n")))
```

When `args.include_header` is a literal header string, `lines.extend("\n".join(...))` extends the list with individual characters (because `extend` iterates the string). This bloats memory and is easy to misuse later. Use `lines.append(args.include_header)` or `lines.extend(args.include_header.splitlines(keepends=True))` instead.

Suggested change:

```diff
-    lines.extend("\n".join(args.include_header.split("\n")))
+    lines.append("\n".join(args.include_header.split("\n")))
```
```diff
 MODEL_ARCH=AuroraGPT-2B \
-OPT=muonclip \
+OPT=sophiag \
 LR=2.28e-5 \
```

OPT is changed to `sophiag` but LR remains `2.28e-5` (which appears to be the previous muonclip setting). Other sophiag launch scripts in this repo use `LR=2.17e-5`, so this looks like a copy/paste mismatch. Please confirm the intended LR for sophiag and update accordingly.

Suggested change:

```diff
-LR=2.28e-5 \
+LR=2.17e-5 \
```
```markdown
# Megatron-DeepSpeed, optimizers, hyperparameters
`**Important** For large batch training with infinite schedulers, It is crucial to tune the learning rate as these schedulers benefit from larger learning rate values. See below for the `lr_finder` routine implemented in MDS to do so.`
```

The opening "Important" note is wrapped in backticks, but it also contains an inline-code span for `lr_finder`, which breaks Markdown rendering (nested backticks). Consider using bold/blockquote formatting for the note and reserving backticks only for the `lr_finder` identifier.

Suggested change:

```markdown
> **Important** For large batch training with infinite schedulers, it is crucial to tune the learning rate as these schedulers benefit from larger learning rate values. See below for the `lr_finder` routine implemented in MDS to do so.
```
```markdown
| CPT Stages | Distribution shift | Primary strategy | Fallbacks | Notes |
|------|--------------------|------------------|-------------|-------|
| Stage 1 → 2 | Weak | Naive CPT wih $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ |
```

Typo in table: "Naive CPT wih $D_1$" should read "Naive CPT with $D_1$".

Suggested change:

```diff
-| Stage 1 → 2 | Weak | Naive CPT wih $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ |
+| Stage 1 → 2 | Weak | Naive CPT with $D_1$ (no replay, no mixing) | 5% Replay: $D^{CPT}_{2} = 0.05D_0 + 0.95D_1$| Add $D_1$ to buffer $B$ |
```
```python
latest_fp = Path(args.load).parent.joinpath("latest")
latest_ckpt_iter = Path(args.load).parent.joinpath("latest_checkpointed_iteration.txt")
ckpt_parent = Path(args.load).parent
```

`--load` is documented/used as the checkpoint parent directory, but the code uses `Path(args.load).parent` as the checkpoint directory for `latest` files. This will write/read `latest` one directory above the intended checkpoint root (and could corrupt unrelated directories). Use `ckpt_parent = Path(args.load)` (and derive the `latest*` paths from that).

Suggested change:

```diff
-latest_fp = Path(args.load).parent.joinpath("latest")
-latest_ckpt_iter = Path(args.load).parent.joinpath("latest_checkpointed_iteration.txt")
-ckpt_parent = Path(args.load).parent
+ckpt_parent = Path(args.load)
+latest_fp = ckpt_parent.joinpath("latest")
+latest_ckpt_iter = ckpt_parent.joinpath("latest_checkpointed_iteration.txt")
```
```bash
# [[ -n "${COOL_R}" ]] || die "--cool-R (cooldown steps) is required"
[[ -n "${LOAD_PATH}" ]] || die "--load path is required"
[[ -n "${PHASE1_LIST}" ]] || die "--phase1-list is required (IDs 1..4)"
[[ -n "${PHASE2_LIST}" ]] || die "--phase2-list is required (IDs 5..7)"
```

`COOL_R` is used to populate `--cooldown-steps` later, but validation is commented out, so running without `--cool-R` will emit broken `make_cooldown_cmds.py` invocations. Reinstate a hard check that `--cool-R` is provided (or add alternate logic to compute cooldown steps from `--cooldown-percent`).
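A minimal sketch of the reinstated guard, assuming the script's existing `die` helper and variable names (`COOLDOWN_PERCENT` and `RESUME_STEP` and their defaults below are illustrative, not taken from the script):

```shell
#!/usr/bin/env bash
set -euo pipefail

die() { echo "ERROR: $*" >&2; exit 1; }

# Illustrative defaults; in the real script these come from flag parsing.
COOL_R="${COOL_R:-}"
COOLDOWN_PERCENT="${COOLDOWN_PERCENT:-5}"
RESUME_STEP="${RESUME_STEP:-10000}"

# Reinstated guard: require --cool-R, or derive it from --cooldown-percent.
if [[ -z "${COOL_R}" ]]; then
  [[ -n "${COOLDOWN_PERCENT}" ]] || die "--cool-R (or --cooldown-percent) is required"
  COOL_R=$(( RESUME_STEP * COOLDOWN_PERCENT / 100 ))
fi
echo "cooldown steps: ${COOL_R}"
```

With neither variable set, the illustrative defaults derive 5% of 10,000 steps, so the script prints `cooldown steps: 500` instead of silently emitting an empty `--cooldown-steps` value.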
```markdown
## Hyperparameter tuning
#### Init variance
Weight initialization is key to training LLMs,and to avoid spikes in losses. Here, we initialize the weights following this [paper](https://arxiv.org/pdf/2312.16903). The default variance value at initialization is 0.02. To add custom variances, one can use `--init-method-std, `--adjust-word-embedding-init`, and `--word-embedding-init-std`. For our runs, we do
```

This sentence has broken inline-code formatting: the opening backtick before `--init-method-std` is never closed, so the backticks around the flag list are mismatched and Markdown will render incorrectly. Use separate, properly-closed inline-code spans for each flag.

Suggested change:

```diff
-Weight initialization is key to training LLMs,and to avoid spikes in losses. Here, we initialize the weights following this [paper](https://arxiv.org/pdf/2312.16903). The default variance value at initialization is 0.02. To add custom variances, one can use `--init-method-std, `--adjust-word-embedding-init`, and `--word-embedding-init-std`. For our runs, we do
+Weight initialization is key to training LLMs, and to avoid spikes in losses. Here, we initialize the weights following this [paper](https://arxiv.org/pdf/2312.16903). The default variance value at initialization is 0.02. To add custom variances, one can use `--init-method-std`, `--adjust-word-embedding-init`, and `--word-embedding-init-std`. For our runs, we do
```
```bash
# shellcheck disable=SC1090
source <(curl -L https://bit.ly/ezpz-utils)
ezpz_setup_env
```

Using `source <(curl -L https://bit.ly/ezpz-utils)` here downloads and executes a remote shell script at job runtime with no version pinning or integrity verification. If the bit.ly link or its upstream hosting is ever compromised, an attacker can run arbitrary code in your training environment with access to cluster credentials and data. Vendor this helper script into the repo or fetch it via a pinned, checksummed artifact instead of sourcing an unpinned URL at runtime.
```bash
# shellcheck disable=SC1090
source <(curl -L https://bit.ly/ezpz-utils)
ezpz_setup_env
```

This script also uses `source <(curl -L https://bit.ly/ezpz-utils)` to pull and execute remote shell code at runtime without any integrity or version pinning. A compromise of that short URL or its backing content would give an attacker arbitrary code execution inside AuroraGPT training jobs. Check in a local copy of the helper and/or use a pinned, checksummed distribution rather than sourcing from a mutable URL.
Single command to test and run Megatron-DeepSpeed:

```bash
now=$(date +'%Y-%m-%d-%H%M%S') && debug_dir="${now}" && mkdir -p "${debug_dir}" && cd "${debug_dir}" && git clone https://github.com/argonne-lcf/Megatron-DeepSpeed && cd Megatron-DeepSpeed && source <(curl -L https://bit.ly/ezpz-utils) && ezpz_setup_env && python3 -m pip install --require-virtualenv "git+https://github.com/saforem2/ezpz" "numpy<2" deepspeed tensorboard && ezpz-test && DATA_FILE_LIST=ALCF/data-lists/aurora/books.txt bash train_alcf.sh
```

The "single command" snippet both sources `https://bit.ly/ezpz-utils` and installs ezpz via `python3 -m pip install "git+https://github.com/saforem2/ezpz"` without pinning or integrity checks. Anyone who controls the bit.ly target or the GitHub repo (or can intercept those HTTPS connections) can ship malicious code that runs with the user's environment and cluster credentials. Prefer vendoring the helper script and pinning the Python package to a specific version/commit (and optionally verifying a hash) instead of installing from mutable URLs in a recommended one-liner.
Copilot Summary

This pull request makes several improvements and updates to the CPT (Continual Pre-Training) documentation and related scripts for AuroraGPT, focusing on providing a comprehensive strategy guide, cleaning up helper scripts, and improving documentation organization. The main changes include a major rewrite and expansion of the `cpt.md` strategy cookbook, removal of an embedded script, minor script refactoring, and the addition of documentation assets.

Major documentation update:
- `ALCF/notes/cpt.md` has been rewritten and significantly expanded into a detailed CPT strategy cookbook for AuroraGPT V1. It now covers data-centric and optimization strategies, recommended mixing weights, stage-by-stage guidance, and practical tips for handling distribution shifts during continual pre-training. The update also includes new diagrams and links to relevant assets. An embedded script (`mix_datasets.py`) was removed from the documentation, with references to its GitHub location instead. [1] [2]

Script and helper improvements:
- In `ALCF/helpers.sh`, the call to `install_dependencies` in the `setup()` function is now commented out, likely to avoid redundant installations during repeated runs.
- The `setupLauncher()` function in `ALCF/helpers.sh` has been simplified: the launcher command is updated to use `ezpz launch` instead of `ezpz-launch`, and commented-out legacy logic has been removed for clarity.
- `set_ccl_vars_on_aurora()` in `ALCF/helpers.sh`.

Documentation assets:
- Readme files under `ALCF/notes/assets/cpt_images` and `ALCF/notes/assets/lb_optimizers` clarify the purpose of images used in the CPT and optimizer documentation. [1] [2]

These changes collectively improve the clarity, maintainability, and usability of the CPT workflow and documentation for AuroraGPT.

References:
Summary by Sourcery

Document continual pre-training strategies and large-batch optimizer guidance for AuroraGPT, adjust training and helper scripts to use ezpz utilities and new optimizer/scheduler settings, and add tooling to generate LR cooldown commands and stage-specific training jobs.

New Features:
- Add `tools/cooldown_generator` to compute checkpoint/cooldown schedules from token budgets and emit ready-to-run Megatron-DeepSpeed commands and per-ID sweep scripts.
- Stage-specific training scripts with the `sophiag` optimizer, constant LR with cooldown, and appropriate data lists and tokenizer settings.

Enhancements:
- Use the `ezpz` API for rank, device, profiling, and timing, and improve assertion formatting and logging behavior for timers and wandb integration.
- Switch the launcher to `ezpz launch`, and minor cleanup in model selection and environment configuration.
- Use the `sophiag` optimizer by default instead of `muonclip`.
- Guard W&B finish in `pretrain_gpt_alcf.py` to avoid errors when wandb is unavailable.

Documentation:
- Expand the CPT guide in `ALCF/notes/cpt.md`, including stage-wise data-mixing guidance, learning-rate strategies, and handling distribution shifts for AuroraGPT V1 and legacy agpt-7B.
- Add a `large_batch_optimizers_settings` guide documenting optimizer options, scheduler variants, LR finder usage, initialization schemes, and hyperparameter tuning recommendations for large-batch training.
- Document the cooldown workflow in `tools/cooldown_generator/README.md`.

Chores:
- Update `.gitignore`, if applicable.