@claude look at benchmarks/dsr1_fp8_b200.sh and nvidia-master.yaml for the DeepSeek SGLang FP8 B200 setup, then create a new benchmarks/.sh for Qwen and update nvidia-master.yaml accordingly.
Example command (the original mixed `--model-path=$MODEL` with a separate `--model` flag; consolidated here into a single `--model-path`):

```shell
PYTHONNOUSERSITE=1 python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B \
  --host 0.0.0.0 --port "$PORT" \
  --tp 8 \
  --mem-fraction-static 0.8
```
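The new script's server-launch section could mirror the command above. Everything in this sketch is an assumption until dsr1_fp8_b200.sh is actually inspected: the file name `benchmarks/qwen_fp8_b200.sh` is hypothetical, and the default values just echo the example command.

```shell
# Hypothetical sketch for a benchmarks/qwen_fp8_b200.sh launch section.
# Names and defaults are assumptions; mirror dsr1_fp8_b200.sh conventions.
MODEL="Qwen/Qwen3.5-397B-A17B"
PORT="${PORT:-30000}"
TP="${TP:-8}"
MEM_FRAC_STATIC="${MEM_FRAC_STATIC:-0.8}"

LAUNCH_CMD="PYTHONNOUSERSITE=1 python -m sglang.launch_server \
  --model-path $MODEL --host 0.0.0.0 --port $PORT \
  --tp $TP --mem-fraction-static $MEM_FRAC_STATIC"

# Print the assembled command instead of launching, so it can be reviewed.
echo "$LAUNCH_CMD"
```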
Here is the recipe documentation link: https://cookbook.sglang.io/autoregressive/Qwen/Qwen3.5#52-speed-benchmark. It says the container should be `lmsysorg/sglang:nightly-dev-20260216-d3bae71e`.
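Since the recipe pins that image, the benchmark presumably runs inside it. A hedged sketch of the container invocation follows; only the image tag comes from the recipe, while the GPU, IPC, network, and cache-mount flags are assumptions to be checked against the cookbook:

```shell
# Image tag taken from the cookbook recipe; all other flags are assumptions.
IMAGE="lmsysorg/sglang:nightly-dev-20260216-d3bae71e"
RUN_CMD="docker run --rm --gpus all --ipc=host --network host \
  -v \$HOME/.cache/huggingface:/root/.cache/huggingface \
  $IMAGE"

# Print the assembled command for review rather than executing it here.
echo "$RUN_CMD"
```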
Don't do stuff like this:
```shell
# Default: recv every ~10 requests; if CONC ≥ 16, relax to ~30 requests
# between scheduler recv polls.
if [[ $TP -eq 8 ]]; then
  if [[ $CONC -ge 16 ]]; then
    SCHEDULER_RECV_INTERVAL=30
  else
    SCHEDULER_RECV_INTERVAL=10
  fi
  # Setting these values (passed to --cuda-graph-max-bs and
  # --max-running-requests) to the maximum concurrency keeps memory
  # from being unnecessarily used.
  MAX_RUNNING_REQUESTS=128
  CUDA_GRAPH_MAX_BATCH_SIZE=128
  MEM_FRAC_STATIC=0.82
  CHUNKED_PREFILL_SIZE=32768
  MAX_PREFILL_TOKENS=32768
elif [[ $TP -eq 4 ]]; then
  if [[ $ISL -ne 8192 ]] || [[ $OSL -ne 1024 ]]; then
    echo "TP=4 not yet supported for ISL=$ISL OSL=$OSL!"
    exit 1
  fi
  # Same reasoning: cap these at the maximum concurrency to avoid
  # wasting memory.
  MAX_RUNNING_REQUESTS=32
  CUDA_GRAPH_MAX_BATCH_SIZE=32
  MEM_FRAC_STATIC=0.95
  CHUNKED_PREFILL_SIZE=8192
  MAX_PREFILL_TOKENS=8192
  SCHEDULER_RECV_INTERVAL=10
else
  echo "Unrecognized TP size $TP!"
  exit 1
fi
echo "SCHEDULER_RECV_INTERVAL: $SCHEDULER_RECV_INTERVAL, CONC: $CONC, ISL: $ISL, OSL: $OSL"
```
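If the objection is to the branching ladder itself, one flatter shape is to take the tunables as environment overrides with plain defaults. This is purely a sketch; the variable names just mirror the snippet above and the defaults are the TP=8 values, not verified recommendations:

```shell
# Sketch only: accept overrides from the environment, fall back to defaults,
# instead of deriving values from TP/CONC ladders inside the script.
MAX_RUNNING_REQUESTS="${MAX_RUNNING_REQUESTS:-128}"
CUDA_GRAPH_MAX_BATCH_SIZE="${CUDA_GRAPH_MAX_BATCH_SIZE:-$MAX_RUNNING_REQUESTS}"
MEM_FRAC_STATIC="${MEM_FRAC_STATIC:-0.8}"
CHUNKED_PREFILL_SIZE="${CHUNKED_PREFILL_SIZE:-32768}"

echo "MAX_RUNNING_REQUESTS=$MAX_RUNNING_REQUESTS CUDA_GRAPH_MAX_BS=$CUDA_GRAPH_MAX_BATCH_SIZE"
```

This keeps the script a thin wrapper: callers that need different values set them at invocation time rather than editing per-TP branches.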