19 changes: 19 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -1863,6 +1863,25 @@ qwen3.5-fp8-b200-sglang-mtp:
      search-space:
        - { tp: 4, ep: 1, conc-start: 4, conc-end: 256, spec-decoding: mtp }


qwen3.5-fp8-b300-sglang-mtp:
  image: lmsysorg/sglang:v0.5.10.post1-cu130
  model: Qwen/Qwen3.5-397B-A17B-FP8
  model-prefix: qwen3.5
  runner: b300
  precision: fp8
  framework: sglang
  multinode: false
  seq-len-configs:
    - isl: 1024
      osl: 1024
      search-space:
        - { tp: 4, ep: 1, conc-start: 4, conc-end: 256, spec-decoding: mtp }
    - isl: 8192
      osl: 1024
      search-space:
        - { tp: 4, ep: 1, conc-start: 4, conc-end: 256, spec-decoding: mtp }

kimik2.5-int4-b200-vllm:
  image: vllm/vllm-openai:v0.15.1
  model: moonshotai/Kimi-K2.5
2 changes: 1 addition & 1 deletion .github/workflows/benchmark-tmpl.yml
@@ -116,7 +116,7 @@ jobs:

# Cleanup SLURM resources
if command -v squeue >/dev/null 2>&1; then
if [[ "${{ runner.name }}" == mi355x-amds* || "${{ runner.name }}" == mi325x-amd* || "${{ runner.name }}" == mi300x-amds* || "${{ runner.name }}" == gb200-nv* || "${{ runner.name }}" == gb300-nv* || "${{ runner.name }}" == h100-cw* || "${{ runner.name }}" == h200-cw* || "${{ runner.name }}" == b200-nb* || "${{ runner.name }}" == h200-nb* || "${{ runner.name }}" == h100-dgxc-slurm* || "${{ runner.name }}" == h200-dgxc-slurm* || "${{ runner.name }}" == b200-dgxc-slurm* ]]; then
if [[ "${{ runner.name }}" == mi355x-amds* || "${{ runner.name }}" == mi325x-amd* || "${{ runner.name }}" == mi300x-amds* || "${{ runner.name }}" == gb200-nv* || "${{ runner.name }}" == gb300-nv* || "${{ runner.name }}" == h100-cw* || "${{ runner.name }}" == h200-cw* || "${{ runner.name }}" == b200-nb* || "${{ runner.name }}" == h200-nb* || "${{ runner.name }}" == h100-dgxc-slurm* || "${{ runner.name }}" == h200-dgxc-slurm* || "${{ runner.name }}" == b200-dgxc-slurm* || "${{ runner.name }}" == b300-nv* ]]; then

Contributor:
@Ankur-singh did you set up multiple runners on the b300 cluster now too?

+viz @SemiAnalysisAI/core

echo "[Slurm] Cleaning up jobs with name: ${{ runner.name }} ..."
scancel --name="${{ runner.name }}" || true
while [ -n "$(squeue --name='${{ runner.name }}' --noheader --format='%i')" ]; do
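
The runner-name check above is one long glob chain that must be hand-edited every time a cluster is added (as this PR does for b300-nv*). For illustration only, here is an equivalent prefix-array check; this is a sketch, not part of the PR, and it assumes the runner name has been exported into a shell variable such as RUNNER_NAME before this step runs:

# Hypothetical refactor: keep the SLURM-backed runner prefixes in one array
# and loop over it, instead of growing the [[ ... ]] chain.
SLURM_RUNNER_PREFIXES=(mi355x-amds mi325x-amd mi300x-amds gb200-nv gb300-nv
                       h100-cw h200-cw b200-nb h200-nb
                       h100-dgxc-slurm h200-dgxc-slurm b200-dgxc-slurm b300-nv)

is_slurm_runner() {
    local name="$1" prefix
    for prefix in "${SLURM_RUNNER_PREFIXES[@]}"; do
        # Prefix match: true if $name starts with $prefix.
        [[ "$name" == "$prefix"* ]] && return 0
    done
    return 1
}

if is_slurm_runner "$RUNNER_NAME"; then
    echo "[Slurm] Cleaning up jobs with name: $RUNNER_NAME ..."
fi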
87 changes: 87 additions & 0 deletions benchmarks/single_node/qwen3.5_fp8_b300_mtp.sh
@@ -0,0 +1,87 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

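# Context length must cover the full request (input + output) plus a small token buffer.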
CONTEXT_LENGTH=$((ISL + OSL + 20))
if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
CONTEXT_LENGTH="$EVAL_MAX_MODEL_LEN"
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
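# Launch the SGLang server with EAGLE speculative decoding (the MTP config):
# 3 draft steps, top-1 draft per step, 4 draft tokens per verification pass.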
SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
    --trust-remote-code \
    --tensor-parallel-size=$TP --data-parallel-size=1 --expert-parallel-size=$EP_SIZE \
    --enable-symm-mem \
    --disable-radix-cache \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --mamba-ssm-dtype bfloat16 \
    --attention-backend trtllm_mha \
    --moe-runner-backend flashinfer_trtllm \
    --cuda-graph-max-bs $CONC \
    --max-running-requests $CONC \
    --max-prefill-tokens 16384 \
    --chunked-prefill-size 16384 \
    --mem-fraction-static 0.8 \
    --stream-interval 50 \
    --scheduler-recv-interval 10 \
    --tokenizer-worker-num 6 \
    --tokenizer-path $MODEL \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --context-length $CONTEXT_LENGTH > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --use-chat-template

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
run_eval --framework lm-eval --port "$PORT"
append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
10 changes: 9 additions & 1 deletion perf-changelog.yaml
@@ -1374,9 +1374,17 @@
    - "Use lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260414 for FP8 benchmark"
    - "Image includes upstream SGLang PRs: https://github.com/sgl-project/sglang/pull/21188, https://github.com/sgl-project/sglang/pull/21421, https://github.com/sgl-project/sglang/pull/20736"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1036

- config-keys:
    - glm5-fp4-b200-sglang
  description:
    - "Update SGLang image from nightly-dev-cu13-20260328-a27651d5 to v0.5.10.post1-cu130"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1031

- config-keys:
    - qwen3.5-fp8-b300-sglang-mtp
  description:
    - "Add Qwen3.5-397B-A17B-FP8 B300 SGLang MTP benchmark"
    - "Image: lmsysorg/sglang:v0.5.10.post1-cu130"
    - "EAGLE speculative decoding with MTP, TP=4, concurrency 4-256 for 1k1k and 8k1k"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1035
25 changes: 25 additions & 0 deletions runners/launch_b300-nv.sh
@@ -6,6 +6,8 @@ SLURM_ACCOUNT="benchmark"

set -x

if [[ "$IS_MULTINODE" == "true" ]]; then

# Validate framework
if [[ $FRAMEWORK != "dynamo-sglang" && $FRAMEWORK != "dynamo-trt" ]]; then
    echo "Unsupported framework: $FRAMEWORK. Supported frameworks are: dynamo-trt, dynamo-sglang"
@@ -211,3 +213,26 @@ for i in 1 2 3 4 5; do
    sleep 10
done
find . -name '.nfs*' -delete 2>/dev/null || true

else

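# Single-node path: read the model from the shared scratch cache, import the
# container image to a squashfs file, and derive the benchmark-script suffixes.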
HF_HUB_CACHE_MOUNT="/scratch/models"
export MODEL="/scratch/models/${MODEL#*/}"
SQUASH_FILE="/data/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')

salloc --partition=$SLURM_PARTITION --account=$SLURM_ACCOUNT --gres=gpu:$TP --exclusive --time=180 --no-shell --job-name="$RUNNER_NAME"
JOB_ID=$(squeue --name="$RUNNER_NAME" -u "$USER" -h -o %A | head -n1)

srun --jobid=$JOB_ID bash -c "enroot import -o $SQUASH_FILE docker://$IMAGE"
Comment on lines +220 to +228
Contributor:
🟡 The newly added single-node else branch in runners/launch_b300-nv.sh allocates a SLURM job with salloc and then queries squeue to obtain JOB_ID, but never validates that JOB_ID is non-empty before passing it to srun --jobid=$JOB_ID. If salloc fails (quota exceeded, no available nodes), squeue returns nothing and the subsequent srun calls fail with a cryptic SLURM error rather than a clear diagnostic message. Consider adding the same guard that the multinode branch of this same script already uses: if [ -z "$JOB_ID" ]; then echo "Error: Failed to extract JOB_ID"; exit 1; fi.

Extended reasoning...

What the bug is and how it manifests

The single-node else branch added in this PR (runners/launch_b300-nv.sh lines ~220-228) follows a two-step pattern to obtain a SLURM job ID: (1) salloc --no-shell to allocate the job, and (2) JOB_ID=$(squeue --name="$RUNNER_NAME" -u "$USER" -h -o %A | head -n1) to retrieve the job ID. If salloc fails for any reason, squeue will return an empty string and JOB_ID will be empty. The script then proceeds to call srun --jobid=$JOB_ID ... with an empty value, causing SLURM to emit a cryptic error like srun: error: Invalid job id specified or similar, which is harder to diagnose than a clear failure message.

The specific code path that triggers it

The script has no set -e, so a failed salloc does not automatically abort execution. The failing scenario is: quota exceeded on the partition → salloc exits non-zero → squeue finds no job with the given name → JOB_ID is empty string → srun --jobid= fails with a SLURM error that does not explain the root cause.

Why existing code doesn't prevent it

The multinode branch (earlier in the same file, around lines 115-118) handles a similar situation—extracting a job ID from srtctl text output—and does include an explicit guard: if [ -z "$JOB_ID" ]; then echo "Error: Failed to extract JOB_ID from srtctl output"; exit 1; fi. The single-node branch lacks this guard entirely.

Addressing the refutation

One verifier correctly notes that this exact pattern (no empty-JOB_ID check after squeue) exists across other scripts: launch_b200-dgxc-slurm.sh:222-223, launch_h200-dgxc-slurm.sh:237-238, launch_mi355x-amds.sh:162-163, launch_h100-dgxc-slurm.sh:234-235. This is a valid observation and weakens the severity. However, (a) the fact that a pattern is widespread does not make it correct, (b) the code in question is new code added by this PR (not pre-existing in this file), and (c) the multinode branch of this same script demonstrates that the codebase does apply explicit guards in some contexts. The refuter's claim that the multinode comparison is a "false equivalence" is partially correct—the extraction methods differ—but the underlying issue (empty JOB_ID proceeding to srun) applies equally.

What the impact would be

A failed salloc results in srun failing with a SLURM error that does not explain the root cause. The benchmark CI job will eventually fail (the result file won't be produced), but the error will appear to be an srun problem rather than an allocation problem, making it harder and slower to diagnose. No data is corrupted; the impact is purely diagnostic and operational.

How to fix it

After the squeue line, add:

if [ -z "$JOB_ID" ]; then
    echo "Error: Failed to extract JOB_ID. salloc may have failed."
    exit 1
fi

Step-by-step proof

  1. CI triggers the B300 single-node benchmark job.
  2. The runner is at quota: salloc --partition=batch_1 ... --no-shell fails with exit code 1; no job is created.
  3. Since there is no set -e, execution continues.
  4. JOB_ID=$(squeue --name="$RUNNER_NAME" -u "$USER" -h -o %A | head -n1) — squeue finds no matching job; JOB_ID is now "".
  5. srun --jobid= bash -c "enroot import ..." — SLURM rejects the empty jobid with a non-obvious error.
  6. Again, with no set -e, the srun failure does not abort the script, and execution continues to the second srun call.
  7. The CI step eventually fails at the result-file check, not at the allocation step, making root-cause analysis harder.
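
Applied in context, the guarded single-node allocation would read roughly as follows (a sketch only, reusing the lines from the diff above; the container-flagged srun calls then follow unchanged):

salloc --partition=$SLURM_PARTITION --account=$SLURM_ACCOUNT --gres=gpu:$TP --exclusive --time=180 --no-shell --job-name="$RUNNER_NAME"
JOB_ID=$(squeue --name="$RUNNER_NAME" -u "$USER" -h -o %A | head -n1)

# Fail fast with a clear diagnostic instead of letting srun fail cryptically.
if [ -z "$JOB_ID" ]; then
    echo "Error: Failed to extract JOB_ID. salloc may have failed."
    exit 1
fi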


srun --jobid=$JOB_ID \
    --container-image=$SQUASH_FILE \
    --container-mounts=$GITHUB_WORKSPACE:/workspace/,$HF_HUB_CACHE_MOUNT:$HF_HUB_CACHE_MOUNT \
    --no-container-mount-home \
    --container-workdir=/workspace/ \
    --no-container-entrypoint --export=ALL,PORT=8888 \
    bash benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_b300${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh

fi