Multinode evals #1000

Merged

Oseltamivir merged 46 commits into main from multinode_eval on Apr 19, 2026

Conversation

Oseltamivir (Collaborator) commented Apr 3, 2026

Summary

Add eval-only support for multi-node benchmarks and wire those eval results into CI collection + summary reporting.

This covers:

  • eval matrix selection for multi-node configs
  • eval-only workflow jobs for multi-node sweeps
  • AMD MI355X eval execution in server.sh
  • NVIDIA Slurm eval execution through Oseltamivir's srt-slurm fork
  • eval artifact upload, score validation, and multi-node-aware summary tables

How evals are run

Single-node evals are selected from the 8k1k sweep at both the max and the median concurrency for each (model, runner, framework, precision, spec-decoding, dp-attn) group.

Multi-node evals are selected on 8k1k by taking the entry with the highest max concurrency for each (model, runner, framework, precision, spec-decoding, prefill-dp-attn, decode-dp-attn) group, then running eval at the median concurrency from that config via eval-conc.
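The multi-node selection above can be sketched roughly as follows; the record fields and the function name are illustrative assumptions, not the actual matrix-logic schema:

```python
from collections import defaultdict
from statistics import median_low

def select_multinode_evals(configs):
    """Pick one eval per (model, runner, framework, precision,
    spec-decoding, prefill-dp-attn, decode-dp-attn) group: the entry
    whose sweep reaches the highest max concurrency, evaluated at that
    entry's median concurrency (becomes eval-conc)."""
    groups = defaultdict(list)
    for cfg in configs:
        key = (cfg["model"], cfg["runner"], cfg["framework"],
               cfg["precision"], cfg["spec_decoding"],
               cfg["prefill_dp_attn"], cfg["decode_dp_attn"])
        groups[key].append(cfg)

    selected = []
    for entries in groups.values():
        # Entry with the highest max concurrency wins the group.
        best = max(entries, key=lambda c: max(c["concurrencies"]))
        # Run the eval at the median concurrency of that config's sweep.
        eval_conc = median_low(sorted(best["concurrencies"]))
        selected.append({**best, "eval_conc": eval_conc})
    return selected
```
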

EVAL_ONLY=true starts the server with an expanded eval context, skips throughput benchmarking, runs lm-eval,
writes meta_env.json + results*.json + sample*.jsonl, uploads those artifacts, then validates the scores
against thresholds.
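A minimal sketch of the final score-validation step, assuming an lm-eval-style results*.json layout and a hypothetical threshold table (the real workflow's task names, metric keys, and thresholds may differ):

```python
import json
from pathlib import Path

# Hypothetical threshold table; the CI's actual values are not shown in this PR.
THRESHOLDS = {"gsm8k": 0.90}

def validate_scores(results_path, thresholds=THRESHOLDS):
    """Return (task, score, threshold) tuples for every task that is
    missing or below its threshold in an lm-eval results*.json file."""
    results = json.loads(Path(results_path).read_text())["results"]
    failures = []
    for task, limit in thresholds.items():
        # "exact_match,strict-match" matches lm-eval's gsm8k output key;
        # other tasks report different metric keys.
        score = results.get(task, {}).get("exact_match,strict-match")
        if score is None or score < limit:
            failures.append((task, score, limit))
    return failures
```
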

srt-slurm fork delta vs upstream

NVIDIA multinode eval uses Oseltamivir/srt-slurm@sa-submission-q1-2026 instead of ishandhanani/srt-slurm.

Compared with current upstream/main, that fork adds the eval path InferenceX needs:

  • a new lm-eval benchmark runner
  • /infmax-workspace mounting via INFMAX_WORKSPACE
  • EVAL_ONLY support in do_sweep.py to skip benchmark stage and run post-eval directly
  • full wait_for_model() health checking before eval in eval-only mode
  • pass-through of framework/model/topology/env metadata into the eval container
  • MODEL_NAME=self.config.served_model_name so eval queries the served alias, not the HF repo id
  • pass-through of EVAL_CONC from the workflow into EVAL_CONCURRENT_REQUESTS
  • copying eval outputs into /logs/eval_results/ for launcher-side artifact pickup
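The EVAL_ONLY branching described above could look roughly like this in do_sweep.py; the class, method, and stage names here are illustrative stand-ins, not the fork's actual API:

```python
import os

class EvalOnlySweep:
    """Sketch of eval-only control flow: health-check first, then either
    skip straight to eval (EVAL_ONLY=true) or run the benchmark stage."""

    def __init__(self, served_model_name):
        # Eval queries the served alias, not the HF repo id.
        self.served_model_name = served_model_name
        self.stages = []  # recorded for illustration only

    def wait_for_model(self):
        self.stages.append("health_check")

    def run_benchmark(self):
        self.stages.append("benchmark")

    def run_eval(self, model_name, concurrency):
        self.stages.append(f"eval:{model_name}:conc={concurrency}")

    def run(self):
        self.wait_for_model()  # full health check even in eval-only mode
        if os.environ.get("EVAL_ONLY", "").lower() == "true":
            # Skip the benchmark stage and go straight to post-eval;
            # EVAL_CONC from the workflow sets the eval concurrency.
            self.run_eval(self.served_model_name,
                          int(os.environ.get("EVAL_CONC", "32")))
            return
        self.run_benchmark()
        self.run_eval(self.served_model_name, 32)
```
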

Validation

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24059388771

InferenceX PR

NVIDIA/srt-slurm#12

Oseltamivir and others added 29 commits February 19, 2026 16:12
The sglang 0.5.8 Docker image ships a newer lm-eval 0.4.9.2 commit
that defaults fewshot_as_multiturn=True for chat-completion models.
Since the version string matches the pinned commit, pip silently
skips the install. Adding --force-reinstall ensures the pinned
commit is always used regardless of what's pre-installed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds dsr1-fp8-mi355x-sglang-disagg-nodpa-eval: same image/model/precision
as the DPA config but with dp-attn=false and ep=1. Running evals on this
will tell us if DPA is the cause of the 0% GSM8K score or if it's
something else about the fp8 disagg setup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Oseltamivir Oseltamivir requested a review from a team April 3, 2026 04:01
Comment thread on utils/matrix_logic/test_generate_sweep_configs.py: Fixed
# Conflicts:
#	perf-changelog.yaml
#	runners/launch_gb200-nv.sh
/raid/tmp is per-node local storage, so pyxis mount failed on B200
multinode decode workers that landed on nodes without the pre-staged
copy. Use the shared /lustre/fsw mirror instead.
Recipe default is max_attempts=360 × interval=10s = 3600s, which is
not enough for DSR1-FP8 (~680GB) to load across 5 workers off the
shared FS — the prior run timed out at ~50% weights loaded. Patch the
recipe in-place after clone; uses ${CONFIG_FILE%%:*} so the :override
suffix on sglang configs doesn't break sed.
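The timeout bump and the :override-suffix handling can be illustrated with a hedged Python equivalent; the raised max_attempts value below is hypothetical, and the shell version uses ${CONFIG_FILE%%:*}, which keeps everything before the first colon:

```python
import re

def recipe_path(config_file):
    """Python analogue of ${CONFIG_FILE%%:*}: strip the ':override'
    suffix sglang configs carry so path operations (and sed) don't
    break on it."""
    return config_file.split(":", 1)[0]

def patch_load_timeout(recipe_text, max_attempts=1080):
    """Raise the recipe's weight-load retry budget in place. The
    default 360 attempts x 10s interval = 3600s was too short for
    ~680GB of DSR1-FP8 weights loading across 5 workers off shared FS."""
    return re.sub(r"max_attempts\s*=\s*\d+",
                  f"max_attempts = {max_attempts}", recipe_text)
```
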
@Oseltamivir Oseltamivir merged commit 2e9cecf into main Apr 19, 2026
18 checks passed
@Oseltamivir Oseltamivir deleted the multinode_eval branch April 19, 2026 19:41
Oseltamivir added a commit that referenced this pull request Apr 19, 2026
Missed staging this change before merging #1000.
OCWC22 added a commit to OCWC22/InferenceX that referenced this pull request Apr 21, 2026
Pulls 55 upstream commits published on SemiAnalysisAI/InferenceX:main
since PR SemiAnalysisAI#1032 was opened. Zero conflicts; none touch tools/ or
datasets/isb1/. Purpose: modernize PR base before Cam review and
absorb upstream fork-drift reductions.

Notable upstream work picked up:
- MiniMax M2.5 MXFP4 MI355X + B300 configs
- GLM5.1 FP4 MI355X support
- GPT-OSS FP4 TP=8 conc=1 extension (SemiAnalysisAI#1096)
- H200 multinode evals (SemiAnalysisAI#1000)
- B300 configs for Kimi K2.5, DSR1, Qwen3.5
- Parallel random data generation (SemiAnalysisAI#1094)
- KNOWN_LIMITATION.md updates

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>