Merged
18 changes: 18 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -1882,6 +1882,24 @@ qwen3.5-fp8-b300-sglang-mtp:
    search-space:
    - { tp: 4, ep: 1, conc-start: 4, conc-end: 256, spec-decoding: mtp }

qwen3.5-fp8-b300-sglang:
  image: lmsysorg/sglang:v0.5.10.post1-cu130
  model: Qwen/Qwen3.5-397B-A17B-FP8
  model-prefix: qwen3.5
  runner: b300
  precision: fp8
  framework: sglang
  multinode: false
  seq-len-configs:
  - isl: 1024
    osl: 1024
    search-space:
    - { tp: 4, ep: 1, conc-start: 4, conc-end: 256 }
  - isl: 8192
    osl: 1024
    search-space:
    - { tp: 4, ep: 1, conc-start: 4, conc-end: 256 }

kimik2.5-int4-b200-vllm:
  image: vllm/vllm-openai:v0.15.1
  model: moonshotai/Kimi-K2.5
83 changes: 83 additions & 0 deletions benchmarks/single_node/qwen3.5_fp8_b300.sh
@@ -0,0 +1,83 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

CONTEXT_LENGTH=$((ISL + OSL + 20))
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    CONTEXT_LENGTH="$EVAL_MAX_MODEL_LEN"
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
    --trust-remote-code \
    --tensor-parallel-size=$TP --data-parallel-size=1 --expert-parallel-size=$EP_SIZE \
    --enable-symm-mem \
    --disable-radix-cache \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --mamba-ssm-dtype bfloat16 \
    --attention-backend trtllm_mha \
    --moe-runner-backend flashinfer_trtllm \
    --cuda-graph-max-bs $CONC \
    --max-running-requests $CONC \
    --max-prefill-tokens 16384 \
    --chunked-prefill-size 16384 \
    --mem-fraction-static 0.8 \
    --stream-interval 50 \
    --scheduler-recv-interval 10 \
    --tokenizer-worker-num 6 \
    --tokenizer-path $MODEL \
    --context-length $CONTEXT_LENGTH > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --use-chat-template

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
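
The sizing arithmetic in this script is easy to check by hand. The sketch below recomputes `CONTEXT_LENGTH` (`ISL + OSL + 20`) and the `--num-prompts` value (`CONC * 10`) for the two sequence-length configs and the concurrency bounds listed in the YAML. The helper names `context_length` and `num_prompts` are hypothetical and are not part of `benchmark_lib.sh`; the sketch also ignores the `EVAL_ONLY` path, which overrides the context length.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that mirror the arithmetic used in the benchmark script.

# Value passed to --context-length: ISL + OSL + 20.
context_length() {
  local isl="$1" osl="$2"
  echo $(( isl + osl + 20 ))
}

# Value passed to --num-prompts: 10 prompts per concurrent request.
num_prompts() {
  local conc="$1"
  echo $(( conc * 10 ))
}

echo "1k1k context length: $(context_length 1024 1024)"   # 2068
echo "8k1k context length: $(context_length 8192 1024)"   # 9236
echo "prompts at conc=4:   $(num_prompts 4)"               # 40
echo "prompts at conc=256: $(num_prompts 256)"             # 2560
```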
8 changes: 8 additions & 0 deletions perf-changelog.yaml
@@ -1388,3 +1388,11 @@
    - "Image: lmsysorg/sglang:v0.5.10.post1-cu130"
    - "EAGLE speculative decoding with MTP, TP=4, concurrency 4-256 for 1k1k and 8k1k"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1035

- config-keys:
    - qwen3.5-fp8-b300-sglang
  description:
    - "Add Qwen3.5-397B-A17B-FP8 B300 SGLang benchmark (non-MTP)"
    - "Image: lmsysorg/sglang:v0.5.10.post1-cu130"
    - "TP=4, concurrency 4-256 for 1k1k and 8k1k"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1048
Comment on lines +1392 to +1398
Contributor


🟡 The perf-changelog.yaml entry for qwen3.5-fp8-b300-sglang uses the placeholder pr-link 'pull/XXXX' instead of the actual PR number 1048. Update line 1398 to read 'pull/1048' to restore changelog traceability — every other recent entry uses the real PR number.

Extended reasoning...

What the bug is: The new perf-changelog.yaml entry for the qwen3.5-fp8-b300-sglang config (lines 1392-1398) has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX instead of the actual PR number 1048. The pr-link field is the primary mechanism used to trace a changelog entry back to the pull request that introduced or updated a benchmark configuration.

How it manifests: Any tool, script, or reviewer that tries to cross-reference this changelog entry with the originating PR will follow a nonexistent URL. Unlike the other recent entries, which carry real PR numbers (e.g., the sibling qwen3.5-fp8-b300-sglang-mtp entry at line 1390 correctly points to /pull/1035), this entry cannot be traced by PR number.

The specific code path: The last entry in perf-changelog.yaml reads:

- config-keys:
    - qwen3.5-fp8-b300-sglang
  description:
    - "Add Qwen3.5-397B-A17B-FP8 B300 SGLang benchmark (non-MTP)"
    - "Image: lmsysorg/sglang:v0.5.10.post1-cu130"
    - "TP=4, concurrency 4-256 for 1k1k and 8k1k"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX

The PR diff shows /pull/1048 was intended, but what was actually committed (confirmed via git show e76cbda) is XXXX.

Why existing code doesn't prevent it: There is no validation in CI that checks pr-link fields in perf-changelog.yaml for placeholder values. The placeholder was likely present before the PR number was assigned and was never replaced before merging.
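
A check along the following lines could close that gap. This is a minimal sketch, not existing repository tooling: the function name `check_pr_links` is hypothetical, and it assumes every `pr-link` value should be a GitHub pull-request URL ending in a numeric ID.

```shell
#!/usr/bin/env bash
# Hypothetical CI helper: fail if any pr-link is not a numeric GitHub PR URL.
check_pr_links() {
  local file="$1"
  local bad
  # Collect all pr-link lines (with line numbers), then drop the well-formed ones.
  # "|| true" keeps the function usable under "set -e" when nothing is left over.
  bad=$(grep -nE '^[[:space:]]*pr-link:' "$file" \
    | grep -vE 'pr-link:[[:space:]]*https://github\.com/[^/]+/[^/]+/pull/[0-9]+[[:space:]]*$' \
    || true)
  if [[ -n "$bad" ]]; then
    echo "Invalid pr-link entries in $file:" >&2
    echo "$bad" >&2
    return 1
  fi
  return 0
}
```

A CI step could then run `check_pr_links perf-changelog.yaml` and let the nonzero exit status fail the job when a placeholder such as `pull/XXXX` slips through.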

Impact: Changelog traceability is broken for this entry. Anyone auditing which PR introduced the qwen3.5-fp8-b300-sglang config will be unable to follow the link. This is a metadata-only issue with no effect on benchmark execution.

Fix: Change line 1398 from pull/XXXX to pull/1048.

Step-by-step proof:

  1. Commit e76cbda is the merge commit for PR #1048 ("Add B300 config: qwen3.5-fp8-sglang (non-MTP)").
  2. Running git show e76cbda -- perf-changelog.yaml confirms the committed diff adds pr-link: .../pull/XXXX (not /pull/1048).
  3. The current file on disk at perf-changelog.yaml line 1398 reads pull/XXXX.
  4. The adjacent MTP entry (PR #1035, "Add B300 config: qwen3.5-fp8-sglang-mtp") at line 1390 correctly reads pull/1035, establishing the expected pattern.
  5. Therefore the XXXX placeholder was committed as-is and needs to be corrected to pull/1048.

4 changes: 3 additions & 1 deletion runners/launch_b300-nv.sh
@@ -222,7 +222,9 @@ else
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')

salloc --partition=$SLURM_PARTITION --account=$SLURM_ACCOUNT --gres=gpu:$TP --exclusive --time=180 --no-shell --job-name="$RUNNER_NAME"
# Pin to one of the known-good B300 nodes; others have hardware/network
# issues that cause benchmarks to hang or fail to start.
salloc --partition=$SLURM_PARTITION --account=$SLURM_ACCOUNT --nodelist=b300-[001-006,008-012,017-020] -N 1 --gres=gpu:$TP --exclusive --time=180 --no-shell --job-name="$RUNNER_NAME"
JOB_ID=$(squeue --name="$RUNNER_NAME" -u "$USER" -h -o %A | head -n1)

srun --jobid=$JOB_ID bash -c "enroot import -o $SQUASH_FILE docker://$IMAGE"
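
To see which nodes the new `--nodelist` expression covers, the bracketed ranges can be expanded. On a Slurm host, `scontrol show hostnames 'b300-[001-006,008-012,017-020]'` does this directly. The sketch below is a pure-bash stand-in for machines without Slurm; `expand_nodelist` is a hypothetical helper that assumes a single bracket group and three-digit, zero-padded node numbers.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expand "prefix[a-b,c-d,...]" into one hostname per line.
expand_nodelist() {
  local expr="$1"
  local prefix="${expr%%\[*}"   # text before the "[", e.g. "b300-"
  local ranges="${expr#*\[}"    # text after the "["
  ranges="${ranges%\]}"         # drop the trailing "]"
  local IFS=','                 # split the range list on commas
  local part start end i
  for part in $ranges; do
    start="${part%-*}"          # a lone number yields start == end
    end="${part#*-}"
    # 10# forces base 10 so that zero-padded values are not read as octal.
    for (( i = 10#$start; i <= 10#$end; i++ )); do
      printf '%s%03d\n' "$prefix" "$i"
    done
  done
}

expand_nodelist 'b300-[001-006,008-012,017-020]'
```

The expression expands to 15 hostnames; within the 001-020 span it skips `b300-007` and `b300-013` through `b300-016`.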