trigger H100 multinode evals #1119
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
```yaml
- config-keys:
  - dsr1-fp8-h100-dynamo-trt
  - dsr1-fp8-h100-dynamo-sglang
  description:
  - "Trigger H100 multinode evals after dist-timeout and health-check timeout fixes"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD
```
🔴 The new entry is missing `evals-only: true`. The PR title and description explicitly state this is to trigger H100 multinode evals, but without the flag `utils/process_changelog.py` will generate full benchmark sweeps AND eval runs for `dsr1-fp8-h100-dynamo-trt` and `dsr1-fp8-h100-dynamo-sglang`. Add `evals-only: true` to match the analogous H200 entry (PR #1094 at perf-changelog.yaml:1681-1687).
Extended reasoning:
Bug: The new perf-changelog entry at lines 1-7 omits the `evals-only: true` flag, but the PR title ("trigger H100 multinode evals") and description ("Trigger H100 multinode evals after dist-timeout and health-check timeout fixes") make clear the intent is an evals-only run. This flag omission will cause unintended full benchmark sweeps on H100 multinode configurations.
Code path: In `utils/process_changelog.py` (lines 101-141), each changelog entry is processed as follows:

```python
for entry_data in changelog_data:
    entry = ChangelogEntry.model_validate(entry_data)
    ...
    if not entry.evals_only:
        # Generate benchmark entries (no evals)  <-- runs full benchmark sweep
        ...
        base_cmd = [..., "--no-evals"]
        ...
        all_benchmark_results.extend(...)
    # Generate eval entries separately  <-- ALWAYS runs
    ...
```

The eval-generation branch always runs; only the benchmark branch is gated by `evals_only`. So entries without `evals-only: true` produce BOTH benchmark and eval runs, while entries with `evals-only: true` produce eval runs only.
Why existing code doesn't prevent it: The field defaults to `False` in the Pydantic `ChangelogEntry` model (see `utils/validation.py`), so an omitted field silently falls into the "run everything" branch. There is no validator that cross-checks description wording (e.g. "evals") against the flag.
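A minimal sketch of how that silent default behaves, assuming the model's field aliases match the YAML keys; the real `ChangelogEntry` lives in `utils/validation.py` and may differ in detail:

```python
from pydantic import BaseModel, Field

class ChangelogEntry(BaseModel):
    """Sketch of the changelog entry model; the real one is in utils/validation.py."""
    config_keys: list[str] = Field(alias="config-keys")
    description: list[str]
    pr_link: str = Field(alias="pr-link")
    # The silent default: omitting `evals-only` in the YAML yields False,
    # which routes the entry into the full benchmark-sweep branch.
    evals_only: bool = Field(default=False, alias="evals-only")

# An entry without the flag parses successfully -- no warning, no error:
entry = ChangelogEntry.model_validate({
    "config-keys": ["dsr1-fp8-h100-dynamo-trt", "dsr1-fp8-h100-dynamo-sglang"],
    "description": ["Trigger H100 multinode evals after dist-timeout and health-check timeout fixes"],
    "pr-link": "https://github.com/SemiAnalysisAI/InferenceX/pull/TBD",
})
assert entry.evals_only is False  # silently defaults into "run everything"
```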
Impact: Running full benchmark sweeps for `dsr1-fp8-h100-dynamo-trt` and `dsr1-fp8-h100-dynamo-sglang` on H100 multinode will consume significant CI resources on runs the author did not intend. It also contradicts the stated purpose ("trigger evals after timeout fixes") and may produce benchmark data that skews perf tracking if these configs were not ready for a full sweep.
Analogous precedent: The directly parallel entry for H200 multinode (PR #1094, at perf-changelog.yaml:1681-1687) uses `evals-only: true`:

```yaml
- config-keys:
  - dsr1-fp8-h200-dynamo-trt
  - dsr1-fp8-h200-dynamo-sglang
  description:
  - "Add H200 multinode evals-only runs"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1094
  evals-only: true
```

Every other entry in this file whose description mentions evals (PRs #558, #892, #911, #1000, #1094) also sets `evals-only: true`.
Fix: Add `evals-only: true` to the new entry:

```yaml
- config-keys:
  - dsr1-fp8-h100-dynamo-trt
  - dsr1-fp8-h100-dynamo-sglang
  description:
  - "Trigger H100 multinode evals after dist-timeout and health-check timeout fixes"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD
  evals-only: true
```

Step-by-step proof:

- The new entry at perf-changelog.yaml:1-7 is parsed by `process_changelog.py` into a `ChangelogEntry` with `evals_only=False` (the default, since the flag is absent).
- At line 107, `if not entry.evals_only` is True, so execution enters the benchmark-generation block (lines 108-132).
- `generate_sweeps.py` is invoked with `--no-evals` and the config keys `dsr1-fp8-h100-dynamo-trt` and `dsr1-fp8-h100-dynamo-sglang`, producing the full multinode benchmark sweep.
- Execution then falls through to line 134, where eval entries are ALSO generated for the same config keys.
- Result: both benchmark and eval runs are triggered, instead of the intended evals-only run. Contrast this with the H200 entry (PR #1094, "Trigger H200 multinode evals & revert MI355X image to mori-0227-3"), where `evals_only=True` causes the `if not entry.evals_only` check to skip the benchmark block, producing evals only.
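To make the gate concrete, here is a runnable illustration reusing the sketched `ChangelogEntry` model above. The function name `plan_runs` and the run labels are hypothetical; the real branching lives in `utils/process_changelog.py`:

```python
def plan_runs(entry: ChangelogEntry) -> list[str]:
    """Mirror of the gating logic: benchmarks are conditional, evals always run."""
    runs = []
    if not entry.evals_only:
        runs.append("benchmark sweep (--no-evals)")  # full-sweep branch
    runs.append("eval run")                          # unconditional branch
    return runs

h100 = {"config-keys": ["dsr1-fp8-h100-dynamo-trt"], "description": ["..."],
        "pr-link": "https://github.com/SemiAnalysisAI/InferenceX/pull/TBD"}
h200 = {**h100, "evals-only": True}

print(plan_runs(ChangelogEntry.model_validate(h100)))  # ['benchmark sweep (--no-evals)', 'eval run']
print(plan_runs(ChangelogEntry.model_validate(h200)))  # ['eval run']
```

Adding the one-line flag flips the first case into the second, which is exactly what the fix above does.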