Multinode evals #1000

Merged

Oseltamivir merged 46 commits into main from multinode_eval on Apr 19, 2026

Conversation

Oseltamivir (Collaborator) commented Apr 3, 2026

Summary

Add eval-only support for multi-node benchmarks and wire those eval results into CI collection + summary reporting.

This covers:

  • eval matrix selection for multi-node configs
  • eval-only workflow jobs for multi-node sweeps
  • AMD MI355X eval execution in server.sh
  • NVIDIA Slurm eval execution through Oseltamivir's srt-slurm fork
  • eval artifact upload, score validation, and multi-node-aware summary tables

How evals are run

Single-node evals are selected from the 8k1k sweep at both the max and the median concurrency for each (model, runner, framework, precision, spec-decoding, dp-attn) group.

Multi-node evals are selected on 8k1k by taking the entry with the highest max concurrency for each (model, runner, framework, precision, spec-decoding, prefill-dp-attn, decode-dp-attn) group, then running eval at the median concurrency from that config via eval-conc.
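The multi-node selection above can be sketched roughly as follows; the record fields and the function name are illustrative assumptions, not the actual matrix-logic schema:

```python
from collections import defaultdict
from statistics import median_low

def select_multinode_evals(configs):
    """Pick one eval per (model, runner, framework, precision,
    spec-decoding, prefill-dp-attn, decode-dp-attn) group: the entry
    whose sweep reaches the highest max concurrency, evaluated at that
    entry's median concurrency (becomes eval-conc)."""
    groups = defaultdict(list)
    for cfg in configs:
        key = (cfg["model"], cfg["runner"], cfg["framework"],
               cfg["precision"], cfg["spec_decoding"],
               cfg["prefill_dp_attn"], cfg["decode_dp_attn"])
        groups[key].append(cfg)

    selected = []
    for entries in groups.values():
        # Entry with the highest max concurrency wins the group.
        best = max(entries, key=lambda c: max(c["concurrencies"]))
        # Run the eval at the median concurrency of that config's sweep.
        eval_conc = median_low(sorted(best["concurrencies"]))
        selected.append({**best, "eval_conc": eval_conc})
    return selected
```
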

EVAL_ONLY=true starts the server with an expanded eval context, skips throughput benchmarking, runs lm-eval,
writes meta_env.json + results*.json + sample*.jsonl, uploads those artifacts, then validates the scores
against thresholds.
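A minimal sketch of the final score-validation step, assuming an lm-eval-style results*.json layout and a hypothetical threshold table (the real workflow's task names, metric keys, and thresholds may differ):

```python
import json
from pathlib import Path

# Hypothetical threshold table; the CI's actual values are not shown in this PR.
THRESHOLDS = {"gsm8k": 0.90}

def validate_scores(results_path, thresholds=THRESHOLDS):
    """Return (task, score, threshold) tuples for every task that is
    missing or below its threshold in an lm-eval results*.json file."""
    results = json.loads(Path(results_path).read_text())["results"]
    failures = []
    for task, limit in thresholds.items():
        # "exact_match,strict-match" matches lm-eval's gsm8k output key;
        # other tasks report different metric keys.
        score = results.get(task, {}).get("exact_match,strict-match")
        if score is None or score < limit:
            failures.append((task, score, limit))
    return failures
```
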

srt-slurm fork delta vs upstream

NVIDIA multinode eval uses Oseltamivir/srt-slurm@sa-submission-q1-2026 instead of ishandhanani/srt-slurm.

Compared with current upstream/main, that fork adds the eval path InferenceX needs:

  • a new lm-eval benchmark runner
  • /infmax-workspace mounting via INFMAX_WORKSPACE
  • EVAL_ONLY support in do_sweep.py to skip benchmark stage and run post-eval directly
  • full wait_for_model() health checking before eval in eval-only mode
  • pass-through of framework/model/topology/env metadata into the eval container
  • MODEL_NAME=self.config.served_model_name so eval queries the served alias, not the HF repo id
  • pass-through of EVAL_CONC from the workflow into EVAL_CONCURRENT_REQUESTS
  • copying eval outputs into /logs/eval_results/ for launcher-side artifact pickup
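The EVAL_ONLY branching described above could look roughly like this in do_sweep.py; the class, method, and stage names here are illustrative stand-ins, not the fork's actual API:

```python
import os

class EvalOnlySweep:
    """Sketch of eval-only control flow: health-check first, then either
    skip straight to eval (EVAL_ONLY=true) or run the benchmark stage."""

    def __init__(self, served_model_name):
        # Eval queries the served alias, not the HF repo id.
        self.served_model_name = served_model_name
        self.stages = []  # recorded for illustration only

    def wait_for_model(self):
        self.stages.append("health_check")

    def run_benchmark(self):
        self.stages.append("benchmark")

    def run_eval(self, model_name, concurrency):
        self.stages.append(f"eval:{model_name}:conc={concurrency}")

    def run(self):
        self.wait_for_model()  # full health check even in eval-only mode
        if os.environ.get("EVAL_ONLY", "").lower() == "true":
            # Skip the benchmark stage and go straight to post-eval;
            # EVAL_CONC from the workflow sets the eval concurrency.
            self.run_eval(self.served_model_name,
                          int(os.environ.get("EVAL_CONC", "32")))
            return
        self.run_benchmark()
        self.run_eval(self.served_model_name, 32)
```
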

Validation

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24059388771

InferenceX PR

NVIDIA/srt-slurm#12

Oseltamivir and others added 29 commits February 19, 2026 16:12
The sglang 0.5.8 Docker image ships a newer lm-eval 0.4.9.2 commit
that defaults fewshot_as_multiturn=True for chat-completion models.
Since the version string matches the pinned commit, pip silently
skips the install. Adding --force-reinstall ensures the pinned
commit is always used regardless of what's pre-installed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds dsr1-fp8-mi355x-sglang-disagg-nodpa-eval: same image/model/precision
as the DPA config but with dp-attn=false and ep=1. Running evals on this
will tell us if DPA is the cause of the 0% GSM8K score or if it's
something else about the fp8 disagg setup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Oseltamivir Oseltamivir requested a review from a team April 3, 2026 04:01
Comment thread on utils/matrix_logic/test_generate_sweep_configs.py: Fixed
# Conflicts:
#	perf-changelog.yaml
#	runners/launch_gb200-nv.sh
/raid/tmp is per-node local storage, so pyxis mount failed on B200
multinode decode workers that landed on nodes without the pre-staged
copy. Use the shared /lustre/fsw mirror instead.
Recipe default is max_attempts=360 × interval=10s = 3600s, which is
not enough for DSR1-FP8 (~680GB) to load across 5 workers off the
shared FS — the prior run timed out at ~50% weights loaded. Patch the
recipe in-place after clone; uses ${CONFIG_FILE%%:*} so the :override
suffix on sglang configs doesn't break sed.
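The timeout bump and the :override-suffix handling can be illustrated with a hedged Python equivalent; the raised max_attempts value below is hypothetical, and the shell version uses ${CONFIG_FILE%%:*}, which keeps everything before the first colon:

```python
import re

def recipe_path(config_file):
    """Python analogue of ${CONFIG_FILE%%:*}: strip the ':override'
    suffix sglang configs carry so path operations (and sed) don't
    break on it."""
    return config_file.split(":", 1)[0]

def patch_load_timeout(recipe_text, max_attempts=1080):
    """Raise the recipe's weight-load retry budget in place. The
    default 360 attempts x 10s interval = 3600s was too short for
    ~680GB of DSR1-FP8 weights loading across 5 workers off shared FS."""
    return re.sub(r"max_attempts\s*=\s*\d+",
                  f"max_attempts = {max_attempts}", recipe_text)
```
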
@Oseltamivir Oseltamivir merged commit 2e9cecf into main Apr 19, 2026
18 checks passed
@Oseltamivir Oseltamivir deleted the multinode_eval branch April 19, 2026 19:41
Oseltamivir added a commit that referenced this pull request Apr 19, 2026
Missed staging this change before merging #1000.
OCWC22 added a commit to OCWC22/InferenceX that referenced this pull request Apr 21, 2026
Pulls 55 upstream commits published on SemiAnalysisAI/InferenceX:main
since PR SemiAnalysisAI#1032 was opened. Zero conflicts; none touch tools/ or
datasets/isb1/. Purpose: modernize PR base before Cam review and
absorb upstream fork-drift reductions.

Notable upstream work picked up:
- MiniMax M2.5 MXFP4 MI355X + B300 configs
- GLM5.1 FP4 MI355X support
- GPT-OSS FP4 TP=8 conc=1 extension (SemiAnalysisAI#1096)
- H200 multinode evals (SemiAnalysisAI#1000)
- B300 configs for Kimi K2.5, DSR1, Qwen3.5
- Parallel random data generation (SemiAnalysisAI#1094)
- KNOWN_LIMITATION.md updates

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>