trigger H100 multinode evals #1119
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
```yaml
- config-keys:
  - dsr1-fp8-h100-dynamo-trt
  - dsr1-fp8-h100-dynamo-sglang
  description:
  - "Trigger H100 multinode evals after dist-timeout and health-check timeout fixes"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD
```
🔴 The new entry is missing `evals-only: true`. The PR title and description explicitly state this is to trigger H100 multinode evals, but without the flag `utils/process_changelog.py` will generate full benchmark sweeps AND eval runs for `dsr1-fp8-h100-dynamo-trt` and `dsr1-fp8-h100-dynamo-sglang`. Add `evals-only: true` to match the analogous H200 entry (PR #1094 at perf-changelog.yaml:1681-1687).
Extended reasoning:
Bug: The new perf-changelog entry at lines 1-7 omits the `evals-only: true` flag, but the PR title ("trigger H100 multinode evals") and description ("Trigger H100 multinode evals after dist-timeout and health-check timeout fixes") make clear the intent is an evals-only run. This flag omission will cause unintended full benchmark sweeps on H100 multinode configurations.
Code path: In `utils/process_changelog.py` (lines 101-141), each changelog entry is processed as follows:

```python
for entry_data in changelog_data:
    entry = ChangelogEntry.model_validate(entry_data)
    ...
    if not entry.evals_only:
        # Generate benchmark entries (no evals)  <-- runs full benchmark sweep
        ...
        base_cmd = [..., "--no-evals"]
        ...
        all_benchmark_results.extend(...)
    # Generate eval entries separately  <-- ALWAYS runs
    ...
```

The eval-generation branch always runs; only the benchmark branch is gated by `evals_only`. So entries without `evals-only: true` produce BOTH benchmark and eval runs, while entries with `evals-only: true` produce eval runs only.
Why existing code doesn't prevent it: The field defaults to `False` in the Pydantic `ChangelogEntry` model (see `utils/validation.py`), so an omitted field silently falls into the "run everything" branch. There is no validator that cross-checks description wording (e.g. "evals") against the flag.
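A minimal sketch of how that silent default behaves, assuming the model's field aliases match the YAML keys; the real `ChangelogEntry` lives in `utils/validation.py` and may differ in detail:

```python
from pydantic import BaseModel, Field

class ChangelogEntry(BaseModel):
    """Sketch of the changelog entry model; the real one is in utils/validation.py."""
    config_keys: list[str] = Field(alias="config-keys")
    description: list[str]
    pr_link: str = Field(alias="pr-link")
    # The silent default: omitting `evals-only` in the YAML yields False,
    # which routes the entry into the full benchmark-sweep branch.
    evals_only: bool = Field(default=False, alias="evals-only")

# An entry without the flag parses successfully -- no warning, no error:
entry = ChangelogEntry.model_validate({
    "config-keys": ["dsr1-fp8-h100-dynamo-trt", "dsr1-fp8-h100-dynamo-sglang"],
    "description": ["Trigger H100 multinode evals after dist-timeout and health-check timeout fixes"],
    "pr-link": "https://github.com/SemiAnalysisAI/InferenceX/pull/TBD",
})
assert entry.evals_only is False  # silently defaults into "run everything"
```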
Impact: Running full benchmark sweeps for `dsr1-fp8-h100-dynamo-trt` and `dsr1-fp8-h100-dynamo-sglang` on H100 multinode will consume significant CI resources on runs the author did not intend. It also contradicts the stated purpose ("trigger evals after timeout fixes") and may produce benchmark data that skews perf tracking if these configs were not ready for a full sweep.
Analogous precedent: The directly parallel entry for H200 multinode (PR #1094, at perf-changelog.yaml:1681-1687) uses `evals-only: true`:

```yaml
- config-keys:
  - dsr1-fp8-h200-dynamo-trt
  - dsr1-fp8-h200-dynamo-sglang
  description:
  - "Add H200 multinode evals-only runs"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1094
  evals-only: true
```

Every other entry in this file whose description mentions evals (PRs #558, #892, #911, #1000, #1094) also sets `evals-only: true`.
Fix: Add `evals-only: true` to the new entry:

```yaml
- config-keys:
  - dsr1-fp8-h100-dynamo-trt
  - dsr1-fp8-h100-dynamo-sglang
  description:
  - "Trigger H100 multinode evals after dist-timeout and health-check timeout fixes"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD
  evals-only: true
```

Step-by-step proof:

- The new entry at perf-changelog.yaml:1-7 is parsed by `process_changelog.py` into a `ChangelogEntry` with `evals_only=False` (the default, since the flag is absent).
- At line 107, `if not entry.evals_only` is True, so execution enters the benchmark-generation block (lines 108-132).
- `generate_sweeps.py` is invoked with `--no-evals` and the config keys `dsr1-fp8-h100-dynamo-trt` and `dsr1-fp8-h100-dynamo-sglang`, producing the full multinode benchmark sweep.
- Execution then falls through to line 134, where eval entries are ALSO generated for the same config keys.
- Result: both benchmark and eval runs are triggered, instead of the intended evals-only run. Contrast this with the H200 entry (PR #1094, "Trigger H200 multinode evals & revert MI355X image to mori-0227-3"), where `evals_only=True` causes the `if not entry.evals_only` check to skip the benchmark block, producing evals only.
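To make the gate concrete, here is a runnable illustration reusing the sketched `ChangelogEntry` model above. The function name `plan_runs` and the run labels are hypothetical; the real branching lives in `utils/process_changelog.py`:

```python
def plan_runs(entry: ChangelogEntry) -> list[str]:
    """Mirror of the gating logic: benchmarks are conditional, evals always run."""
    runs = []
    if not entry.evals_only:
        runs.append("benchmark sweep (--no-evals)")  # full-sweep branch
    runs.append("eval run")                          # unconditional branch
    return runs

h100 = {"config-keys": ["dsr1-fp8-h100-dynamo-trt"], "description": ["..."],
        "pr-link": "https://github.com/SemiAnalysisAI/InferenceX/pull/TBD"}
h200 = {**h100, "evals-only": True}

print(plan_runs(ChangelogEntry.model_validate(h100)))  # ['benchmark sweep (--no-evals)', 'eval run']
print(plan_runs(ChangelogEntry.model_validate(h200)))  # ['eval run']
```

Adding the one-line flag flips the first case into the second, which is exactly what the fix above does.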