@claude look at benchmarks/dsr1_fp8_b200.sh and nvidia-master.yaml for the DeepSeek SGLang FP8 B200 setup, then create a new benchmarks/.sh for Qwen and update nvidia-master.yaml accordingly.
Example command (the original mixed `--model-path=$MODEL` with a separate `--model` flag; consolidated here into a single `--model-path`):

```shell
PYTHONNOUSERSITE=1 python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B \
  --host 0.0.0.0 --port "$PORT" \
  --tp 8 \
  --mem-fraction-static 0.8
```
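The new script's server-launch section could mirror the command above. Everything in this sketch is an assumption until dsr1_fp8_b200.sh is actually inspected: the file name `benchmarks/qwen_fp8_b200.sh` is hypothetical, and the default values just echo the example command.

```shell
# Hypothetical sketch for a benchmarks/qwen_fp8_b200.sh launch section.
# Names and defaults are assumptions; mirror dsr1_fp8_b200.sh conventions.
MODEL="Qwen/Qwen3.5-397B-A17B"
PORT="${PORT:-30000}"
TP="${TP:-8}"
MEM_FRAC_STATIC="${MEM_FRAC_STATIC:-0.8}"

LAUNCH_CMD="PYTHONNOUSERSITE=1 python -m sglang.launch_server \
  --model-path $MODEL --host 0.0.0.0 --port $PORT \
  --tp $TP --mem-fraction-static $MEM_FRAC_STATIC"

# Print the assembled command instead of launching, so it can be reviewed.
echo "$LAUNCH_CMD"
```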
Here is the recipe documentation link: https://cookbook.sglang.io/autoregressive/Qwen/Qwen3.5#52-speed-benchmark. It says the container should be `lmsysorg/sglang:nightly-dev-20260216-d3bae71e`.
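Since the recipe pins that image, the benchmark presumably runs inside it. A hedged sketch of the container invocation follows; only the image tag comes from the recipe, while the GPU, IPC, network, and cache-mount flags are assumptions to be checked against the cookbook:

```shell
# Image tag taken from the cookbook recipe; all other flags are assumptions.
IMAGE="lmsysorg/sglang:nightly-dev-20260216-d3bae71e"
RUN_CMD="docker run --rm --gpus all --ipc=host --network host \
  -v \$HOME/.cache/huggingface:/root/.cache/huggingface \
  $IMAGE"

# Print the assembled command for review rather than executing it here.
echo "$RUN_CMD"
```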
Don't do stuff like this:
```shell
# Default: recv every ~10 requests; if CONC ≥ 16, relax to ~30 requests
# between scheduler recv polls.
if [[ $TP -eq 8 ]]; then
  if [[ $CONC -ge 16 ]]; then
    SCHEDULER_RECV_INTERVAL=30
  else
    SCHEDULER_RECV_INTERVAL=10
  fi
  # Setting these values (passed to --cuda-graph-max-bs and
  # --max-running-requests) to the maximum concurrency keeps memory
  # from being unnecessarily used.
  MAX_RUNNING_REQUESTS=128
  CUDA_GRAPH_MAX_BATCH_SIZE=128
  MEM_FRAC_STATIC=0.82
  CHUNKED_PREFILL_SIZE=32768
  MAX_PREFILL_TOKENS=32768
elif [[ $TP -eq 4 ]]; then
  if [[ $ISL -ne 8192 ]] || [[ $OSL -ne 1024 ]]; then
    echo "TP=4 not yet supported for ISL=$ISL OSL=$OSL!"
    exit 1
  fi
  # Same reasoning: cap these at the maximum concurrency to avoid
  # wasting memory.
  MAX_RUNNING_REQUESTS=32
  CUDA_GRAPH_MAX_BATCH_SIZE=32
  MEM_FRAC_STATIC=0.95
  CHUNKED_PREFILL_SIZE=8192
  MAX_PREFILL_TOKENS=8192
  SCHEDULER_RECV_INTERVAL=10
else
  echo "Unrecognized TP size $TP!"
  exit 1
fi
echo "SCHEDULER_RECV_INTERVAL: $SCHEDULER_RECV_INTERVAL, CONC: $CONC, ISL: $ISL, OSL: $OSL"
```
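If the objection is to the branching ladder itself, one flatter shape is to take the tunables as environment overrides with plain defaults. This is purely a sketch; the variable names just mirror the snippet above and the defaults are the TP=8 values, not verified recommendations:

```shell
# Sketch only: accept overrides from the environment, fall back to defaults,
# instead of deriving values from TP/CONC ladders inside the script.
MAX_RUNNING_REQUESTS="${MAX_RUNNING_REQUESTS:-128}"
CUDA_GRAPH_MAX_BATCH_SIZE="${CUDA_GRAPH_MAX_BATCH_SIZE:-$MAX_RUNNING_REQUESTS}"
MEM_FRAC_STATIC="${MEM_FRAC_STATIC:-0.8}"
CHUNKED_PREFILL_SIZE="${CHUNKED_PREFILL_SIZE:-32768}"

echo "MAX_RUNNING_REQUESTS=$MAX_RUNNING_REQUESTS CUDA_GRAPH_MAX_BS=$CUDA_GRAPH_MAX_BATCH_SIZE"
```

This keeps the script a thin wrapper: callers that need different values set them at invocation time rather than editing per-TP branches.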