32 changes: 32 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -3337,6 +3337,38 @@ minimaxm2.5-fp4-b200-vllm:
        - { tp: 4, conc-start: 4, conc-end: 512 }
        - { tp: 8, conc-start: 4, conc-end: 4 }

# NOTE: At the time of submission, https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html
# does not have a B300-specific recipe, so this config reuses the existing
# MiniMax-M2.5 FP4 B200 vLLM recipe as-is until B300-specific tuning is available.
minimaxm2.5-fp4-b300-vllm:
  image: vllm/vllm-openai:v0.19.0-cu130
  model: nvidia/MiniMax-M2.5-NVFP4
  model-prefix: minimaxm2.5
  runner: b300
  precision: fp4
  framework: vllm
  multinode: false
  seq-len-configs:
    - isl: 1024
      osl: 1024
      search-space:
        - { tp: 1, conc-start: 4, conc-end: 4 }
        - { tp: 2, conc-start: 4, conc-end: 512 }
        - { tp: 2, ep: 2, conc-start: 128, conc-end: 256 }
        - { tp: 2, ep: 2, dp-attn: true, conc-start: 512, conc-end: 512 }
        - { tp: 4, conc-start: 4, conc-end: 512 }
        - { tp: 4, ep: 4, conc-start: 32, conc-end: 128 }
        - { tp: 8, conc-start: 4, conc-end: 4 }
    - isl: 8192
      osl: 1024
      search-space:
        - { tp: 1, conc-start: 4, conc-end: 32 }
        - { tp: 1, conc-start: 256, conc-end: 512 }
        - { tp: 2, conc-start: 4, conc-end: 512 }
        - { tp: 2, ep: 2, conc-start: 128, conc-end: 512 }
        - { tp: 4, conc-start: 4, conc-end: 512 }
        - { tp: 8, conc-start: 4, conc-end: 4 }

gptoss-fp4-h100-vllm:
  image: vllm/vllm-openai:v0.18.0
  model: openai/gpt-oss-120b
84 changes: 84 additions & 0 deletions benchmarks/single_node/minimaxm2.5_fp4_b300.sh
@@ -0,0 +1,84 @@
#!/usr/bin/env bash

# NOTE: At the time of submission, https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html
# does not have a B300-specific recipe, so this script reuses the existing
# MiniMax-M2.5 FP4 B200 vLLM recipe as-is until B300-specific tuning is available.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    EP_SIZE \
    DP_ATTENTION \
    CONC \
    ISL \
    OSL \
    MAX_MODEL_LEN \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

hf download "$MODEL"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

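# Map the config's search-space knobs onto vLLM flags: dp-attn runs attention
# data-parallel (tensor-parallel-size=1, data-parallel-size=$TP) with expert
# parallelism enabled; ep > 1 keeps tensor parallelism and additionally shards
# the MoE experts across the ranks.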
if [ "${DP_ATTENTION}" = "true" ]; then
    PARALLEL_ARGS="--tensor-parallel-size=1 --data-parallel-size=$TP --enable-expert-parallel"
elif [ "$EP_SIZE" -gt 1 ]; then
    PARALLEL_ARGS="--tensor-parallel-size=$TP --enable-expert-parallel"
else
    PARALLEL_ARGS="--tensor-parallel-size=$TP"
fi

if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
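# Launch the server in the background: FP8 KV cache, prefix caching disabled so
# every request pays full prefill, and a batched-token budget of twice the ISL.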
vllm serve "$MODEL" --port "$PORT" \
    $PARALLEL_ARGS \
    --gpu-memory-utilization 0.90 \
    --max-model-len "$MAX_MODEL_LEN" \
    --kv-cache-dtype fp8 \
    --max-cudagraph-capture-size 2048 \
    --max-num-batched-tokens "$((ISL * 2))" \
    --stream-interval 20 --no-enable-prefix-caching \
    --trust-remote-code > "$SERVER_LOG" 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --trust-remote-code

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
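
# For reference, a single sweep point from the 1k/1k search-space above maps onto
# this script roughly as follows. This is an illustrative invocation, not part of
# the harness: the variable names come from the check_env_vars list, while the
# MAX_MODEL_LEN, RANDOM_RANGE_RATIO, and RESULT_FILENAME values are assumed.
#
#   MODEL=nvidia/MiniMax-M2.5-NVFP4 TP=2 EP_SIZE=2 DP_ATTENTION=true \
#   CONC=512 ISL=1024 OSL=1024 MAX_MODEL_LEN=4096 RANDOM_RANGE_RATIO=0.8 \
#   RESULT_FILENAME=minimaxm2.5_tp2_ep2_dpattn_c512.json \
#   bash benchmarks/single_node/minimaxm2.5_fp4_b300.sh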
8 changes: 8 additions & 0 deletions perf-changelog.yaml
@@ -1437,3 +1437,11 @@
    - "EAGLE speculative decoding with MTP, TP=8, concurrency 4-512 for 1k1k and 8k1k"
    - "At the time of submission, https://cookbook.sglang.io/autoregressive/DeepSeek/DeepSeek-R1 does not have a B300-specific recipe, so this reuses the existing DSR1 FP8 B200 SGLang MTP recipe as-is"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1059

- config-keys:
    - minimaxm2.5-fp4-b300-vllm
  description:
    - "Add MiniMax-M2.5 FP4 (NVFP4) B300 vLLM benchmark"
    - "Image: vllm/vllm-openai:v0.19.0-cu130"
    - "At the time of submission, https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html does not have a B300-specific recipe, so this reuses the existing MiniMax-M2.5 FP4 B200 vLLM recipe as-is"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1055
Contributor comment on lines +1446 to +1447:
🟡 The new minimaxm2.5-fp4-b300-vllm entry in perf-changelog.yaml has a placeholder pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX that was never replaced with the actual PR number. Since this PR is #1055, the link should be updated to pull/1055.

Extended reasoning...

Bug description: The perf-changelog.yaml entry added by this PR for the minimaxm2.5-fp4-b300-vllm config (lines 1413-1414) contains a template placeholder XXXX in the pr-link field rather than the actual PR number, resulting in a broken link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX.

Code path: The diff clearly shows the new entry at the bottom of perf-changelog.yaml includes pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX. The PR number (1055) is known at submission time — it is literally the number of this pull request.

Why existing code doesn't prevent it: There is no automated validation step that checks perf-changelog.yaml entries for placeholder patterns like XXXX. The surrounding entries demonstrate the correct pattern: the immediately preceding entry for dsr1-fp4-b300-sglang (PR #1049) correctly uses pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1049, and similarly for qwen3.5-fp8-b300-sglang (PR #1048).
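
One cheap guard, sketched here under the assumption that perf-changelog.yaml sits at the repo root and CI can run a bash step (the pattern and error message are illustrative, not an existing check), would be to grep for placeholder links before merge:

    # Fail the build if any pr-link still carries an XXX/XXXX placeholder.
    if grep -nE 'pr-link:.*/pull/X+$' perf-changelog.yaml; then
        echo "perf-changelog.yaml contains placeholder pr-link entries" >&2
        exit 1
    fi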

Impact: The broken link makes it impossible to trace the changelog entry back to the originating PR. Anyone reading the changelog to understand when or why the minimaxm2.5-fp4-b300-vllm config was added will be unable to follow the reference. There are already 7 pre-existing XXX placeholder entries in the file (lines 12, 19, 315, 790, 818, 855, 872), and this PR introduces another instance, continuing a pattern that reduces changelog utility.

Fix: Replace pull/XXXX with pull/1055 on the pr-link line of the new minimaxm2.5-fp4-b300-vllm entry.

Step-by-step proof: (1) This PR is numbered #1055, as shown in the PR metadata. (2) The diff adds a new perf-changelog.yaml entry with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX. (3) At merge time, clicking that link leads to a 404, since XXXX is not a valid PR number. (4) The correct value https://github.com/SemiAnalysisAI/InferenceX/pull/1055 would link directly to this PR. The fix is a one-word substitution.

Addressing the duplicate refutation: a separate report (bug_003) describes the identical issue. Whether or not bug_003 is processed, this report independently identifies the same defect, and the fix is identical regardless of which report is acted upon.
