Merged
7 changes: 7 additions & 0 deletions perf-changelog.yaml
@@ -1,3 +1,10 @@
- config-keys:
    - dsr1-fp8-h100-dynamo-trt
    - dsr1-fp8-h100-dynamo-sglang
  description:
    - "Trigger H100 multinode evals after dist-timeout and health-check timeout fixes"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD

Check failure on line 7 in perf-changelog.yaml


Claude / Claude Code Review

Missing `evals-only: true` flag

The new entry is missing `evals-only: true`. The PR title and description explicitly state this is to trigger H100 multinode evals, but without the flag `utils/process_changelog.py` will generate full benchmark sweeps AND eval runs for `dsr1-fp8-h100-dynamo-trt` and `dsr1-fp8-h100-dynamo-sglang`. Add `evals-only: true` to match the analogous H200 entry (PR #1094 at perf-changelog.yaml:1681-1687).
Comment on lines +1 to +7
Contributor


Extended reasoning...

Bug: The new perf-changelog entry at lines 1-7 omits the evals-only: true flag, but the PR title ("trigger H100 multinode evals") and description ("Trigger H100 multinode evals after dist-timeout and health-check timeout fixes") make clear the intent is an evals-only run. This flag omission will cause unintended full benchmark sweeps on H100 multinode configurations.

Code path: In utils/process_changelog.py (lines 101-141), each changelog entry is processed as follows:

for entry_data in changelog_data:
    entry = ChangelogEntry.model_validate(entry_data)
    ...
    if not entry.evals_only:
        # Generate benchmark entries (no evals)   <-- runs full benchmark sweep
        ...
        base_cmd = [..., "--no-evals"]
        ...
        all_benchmark_results.extend(...)

    # Generate eval entries separately           <-- ALWAYS runs
    ...

The eval-generation branch always runs; only the benchmark branch is gated by evals_only. So entries without evals_only: true produce BOTH benchmark and eval runs, while entries with evals_only: true produce eval runs only.
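The gating described above can be condensed into a minimal, runnable sketch. This is a stand-in, not the repo's code: the real `ChangelogEntry` is a Pydantic model in `utils/validation.py`, and `plan_runs` is a hypothetical reduction of the loop in `utils/process_changelog.py`:

```python
from dataclasses import dataclass, field


@dataclass
class ChangelogEntry:
    """Stand-in for the Pydantic model; only the fields relevant here."""
    config_keys: list
    evals_only: bool = False  # mirrors the Pydantic default for an omitted flag


def plan_runs(entry: ChangelogEntry) -> list:
    """Hypothetical condensation of the loop in utils/process_changelog.py."""
    runs = []
    if not entry.evals_only:
        # Benchmark branch: generate_sweeps.py invoked with --no-evals
        runs.append("benchmark-sweep")
    # Eval branch is unconditional: it runs for every entry
    runs.append("evals")
    return runs


# Flag omitted -> defaults to False -> BOTH branches fire:
print(plan_runs(ChangelogEntry(["dsr1-fp8-h100-dynamo-trt"])))
# ['benchmark-sweep', 'evals']

# Flag set -> benchmark branch skipped, evals only:
print(plan_runs(ChangelogEntry(["dsr1-fp8-h200-dynamo-trt"], evals_only=True)))
# ['evals']
```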

Why existing code doesn't prevent it: The field defaults to False in the Pydantic ChangelogEntry model (see utils/validation.py), so an omitted field silently falls into the "run everything" branch. There is no validator that cross-checks description wording (e.g. "evals") against the flag.
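The missing cross-check could look something like the following. This is a hypothetical lint, not code from the repo; it flags raw YAML entries (as dicts, keyed by the hyphenated names used in `perf-changelog.yaml`) whose description mentions evals but which omit the flag:

```python
def lint_entry(entry: dict) -> list:
    """Hypothetical lint: warn when a description mentions evals
    but 'evals-only: true' is absent from the entry."""
    warnings = []
    description = " ".join(entry.get("description", []))
    if "eval" in description.lower() and not entry.get("evals-only", False):
        warnings.append(
            f"{entry.get('pr-link', '?')}: description mentions evals "
            "but 'evals-only: true' is not set"
        )
    return warnings


new_entry = {
    "config-keys": ["dsr1-fp8-h100-dynamo-trt", "dsr1-fp8-h100-dynamo-sglang"],
    "description": ["Trigger H100 multinode evals after dist-timeout and "
                    "health-check timeout fixes"],
    "pr-link": "https://github.com/SemiAnalysisAI/InferenceX/pull/TBD",
}
print(lint_entry(new_entry))        # one warning about the missing flag
print(lint_entry({**new_entry, "evals-only": True}))  # no warnings
```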

Impact: Running full benchmark sweeps for dsr1-fp8-h100-dynamo-trt and dsr1-fp8-h100-dynamo-sglang on H100 multinode will consume significant CI resources on runs the author did not intend. It also contradicts the stated purpose ("trigger evals after timeout fixes") and may produce benchmark data that skews perf tracking if these configs were not ready for a full sweep.

Analogous precedent: The directly parallel entry for H200 multinode (PR #1094, at perf-changelog.yaml:1681-1687) uses evals-only: true:

- config-keys:
    - dsr1-fp8-h200-dynamo-trt
    - dsr1-fp8-h200-dynamo-sglang
  description:
    - "Add H200 multinode evals-only runs"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1094
  evals-only: true

Every other entry whose description mentions evals in this file (PRs #558, #892, #911, #1000, #1094) also sets evals-only: true.

Fix: Add evals-only: true to the new entry:

- config-keys:
    - dsr1-fp8-h100-dynamo-trt
    - dsr1-fp8-h100-dynamo-sglang
  description:
    - "Trigger H100 multinode evals after dist-timeout and health-check timeout fixes"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD
  evals-only: true

Step-by-step proof:

  1. The new entry at perf-changelog.yaml:1-7 is parsed by process_changelog.py into a ChangelogEntry with evals_only=False (default since the flag is absent).
  2. At line 107, if not entry.evals_only is True, so execution enters the benchmark-generation block (lines 108-132).
  3. generate_sweeps.py is invoked with --no-evals and config keys dsr1-fp8-h100-dynamo-trt and dsr1-fp8-h100-dynamo-sglang, producing the full multinode benchmark sweep.
  4. Execution then falls through to line 134, where eval entries are ALSO generated for the same config keys.
  5. Result: both benchmark and eval runs are triggered, instead of the intended evals-only run. Contrast this with the H200 entry (PR #1094, "Trigger H200 multinode evals & revert MI355X image to mori-0227-3"), where evals_only=True causes the if not entry.evals_only check to skip the benchmark block, producing evals only.

- config-keys:
    - glm5.1-fp4-mi355x-sglang
  description: