Merged
7 changes: 7 additions & 0 deletions perf-changelog.yaml
@@ -1,3 +1,10 @@
- config-keys:
    - dsr1-fp8-h100-dynamo-trt
    - dsr1-fp8-h100-dynamo-sglang
  description:
    - "Trigger H100 multinode evals after dist-timeout and health-check timeout fixes"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD

Check failure on line 7 in perf-changelog.yaml


Claude / Claude Code Review

Missing `evals-only: true` flag

The new entry is missing `evals-only: true`. The PR title and description explicitly state this is to trigger H100 multinode evals, but without the flag `utils/process_changelog.py` will generate full benchmark sweeps AND eval runs for `dsr1-fp8-h100-dynamo-trt` and `dsr1-fp8-h100-dynamo-sglang`. Add `evals-only: true` to match the analogous H200 entry (PR #1094 at perf-changelog.yaml:1681-1687).
Comment on lines +1 to +7
Contributor


Extended reasoning...

Bug: The new perf-changelog entry at lines 1-7 omits the evals-only: true flag, but the PR title ("trigger H100 multinode evals") and description ("Trigger H100 multinode evals after dist-timeout and health-check timeout fixes") make clear the intent is an evals-only run. This flag omission will cause unintended full benchmark sweeps on H100 multinode configurations.

Code path: In utils/process_changelog.py (lines 101-141), each changelog entry is processed as follows:

for entry_data in changelog_data:
    entry = ChangelogEntry.model_validate(entry_data)
    ...
    if not entry.evals_only:
        # Generate benchmark entries (no evals)   <-- runs full benchmark sweep
        ...
        base_cmd = [..., "--no-evals"]
        ...
        all_benchmark_results.extend(...)

    # Generate eval entries separately           <-- ALWAYS runs
    ...

The eval-generation branch always runs; only the benchmark branch is gated by evals_only. So entries without evals_only: true produce BOTH benchmark and eval runs, while entries with evals_only: true produce eval runs only.
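The gating described above can be condensed into a minimal, runnable sketch. This is a stand-in, not the repo's code: the real `ChangelogEntry` is a Pydantic model in `utils/validation.py`, and `plan_runs` is a hypothetical reduction of the loop in `utils/process_changelog.py`:

```python
from dataclasses import dataclass, field


@dataclass
class ChangelogEntry:
    """Stand-in for the Pydantic model; only the fields relevant here."""
    config_keys: list
    evals_only: bool = False  # mirrors the Pydantic default for an omitted flag


def plan_runs(entry: ChangelogEntry) -> list:
    """Hypothetical condensation of the loop in utils/process_changelog.py."""
    runs = []
    if not entry.evals_only:
        # Benchmark branch: generate_sweeps.py invoked with --no-evals
        runs.append("benchmark-sweep")
    # Eval branch is unconditional: it runs for every entry
    runs.append("evals")
    return runs


# Flag omitted -> defaults to False -> BOTH branches fire:
print(plan_runs(ChangelogEntry(["dsr1-fp8-h100-dynamo-trt"])))
# ['benchmark-sweep', 'evals']

# Flag set -> benchmark branch skipped, evals only:
print(plan_runs(ChangelogEntry(["dsr1-fp8-h200-dynamo-trt"], evals_only=True)))
# ['evals']
```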

Why existing code doesn't prevent it: The field defaults to False in the Pydantic ChangelogEntry model (see utils/validation.py), so an omitted field silently falls into the "run everything" branch. There is no validator that cross-checks description wording (e.g. "evals") against the flag.
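The missing cross-check could look something like the following. This is a hypothetical lint, not code from the repo; it flags raw YAML entries (as dicts, keyed by the hyphenated names used in `perf-changelog.yaml`) whose description mentions evals but which omit the flag:

```python
def lint_entry(entry: dict) -> list:
    """Hypothetical lint: warn when a description mentions evals
    but 'evals-only: true' is absent from the entry."""
    warnings = []
    description = " ".join(entry.get("description", []))
    if "eval" in description.lower() and not entry.get("evals-only", False):
        warnings.append(
            f"{entry.get('pr-link', '?')}: description mentions evals "
            "but 'evals-only: true' is not set"
        )
    return warnings


new_entry = {
    "config-keys": ["dsr1-fp8-h100-dynamo-trt", "dsr1-fp8-h100-dynamo-sglang"],
    "description": ["Trigger H100 multinode evals after dist-timeout and "
                    "health-check timeout fixes"],
    "pr-link": "https://github.com/SemiAnalysisAI/InferenceX/pull/TBD",
}
print(lint_entry(new_entry))        # one warning about the missing flag
print(lint_entry({**new_entry, "evals-only": True}))  # no warnings
```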

Impact: Running full benchmark sweeps for dsr1-fp8-h100-dynamo-trt and dsr1-fp8-h100-dynamo-sglang on H100 multinode will consume significant CI resources on runs the author did not intend. It also contradicts the stated purpose ("trigger evals after timeout fixes") and may produce benchmark data that skews perf tracking if these configs were not ready for a full sweep.

Analogous precedent: The directly parallel entry for H200 multinode (PR #1094, at perf-changelog.yaml:1681-1687) uses evals-only: true:

- config-keys:
    - dsr1-fp8-h200-dynamo-trt
    - dsr1-fp8-h200-dynamo-sglang
  description:
    - "Add H200 multinode evals-only runs"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1094
  evals-only: true

Every other entry whose description mentions evals in this file (PRs #558, #892, #911, #1000, #1094) also sets evals-only: true.

Fix: Add evals-only: true to the new entry:

- config-keys:
    - dsr1-fp8-h100-dynamo-trt
    - dsr1-fp8-h100-dynamo-sglang
  description:
    - "Trigger H100 multinode evals after dist-timeout and health-check timeout fixes"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD
  evals-only: true

Step-by-step proof:

  1. The new entry at perf-changelog.yaml:1-7 is parsed by process_changelog.py into a ChangelogEntry with evals_only=False (default since the flag is absent).
  2. At line 107, if not entry.evals_only is True, so execution enters the benchmark-generation block (lines 108-132).
  3. generate_sweeps.py is invoked with --no-evals and config keys dsr1-fp8-h100-dynamo-trt and dsr1-fp8-h100-dynamo-sglang, producing the full multinode benchmark sweep.
  4. Execution then falls through to line 134, where eval entries are ALSO generated for the same config keys.
  5. Result: both benchmark and eval runs are triggered, instead of the intended evals-only run. Contrast this with the H200 entry (PR #1094, "Trigger H200 multinode evals & revert MI355X image to mori-0227-3"), where evals_only=True causes the if not entry.evals_only check to skip the benchmark block, producing evals only.

- config-keys:
    - glm5.1-fp4-mi355x-sglang
  description: