Conversation
@claude please add a PR description

Claude finished @kedarpotdar-nv's task in 40s —— View job Adding PR Description
PR description has been added with a summary of all changes: new

@claude please review the PR
Claude finished @kedarpotdar-nv's task in 2m 41s —— View job Reviewing PR
Review:

| Area | Verdict |
|---|---|
| Config YAML | ✅ Good |
| Launch script | 🟡 Minor — missing --tokenizer-worker-num, hardcoded scheduler interval |
| Perf changelog | ✅ Good |

The script is functional as-is. The two items flagged (tokenizer workers and a dynamic scheduler interval) are nice-to-haves that would align with the BF16 sibling script and could improve low-concurrency latency. Up to you whether to address them now or in a follow-up.
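For concreteness, the two flagged items could be folded in roughly like this. A minimal sketch assuming bash, with illustrative values: the thread does not show the BF16 sibling's actual interval logic, and `--stream-interval` is used here only as a stand-in for the hardcoded scheduler-interval flag.

```shell
# Sketch only: flag values are illustrative, not taken from the actual PR.
CONCURRENCY="${CONCURRENCY:-32}"

# Derive the interval from concurrency instead of hardcoding it,
# so low-concurrency runs are not stuck with a long interval.
if [ "$CONCURRENCY" -le 8 ]; then
  SCHED_INTERVAL=1
else
  SCHED_INTERVAL=10
fi

# Extra server args the review suggests aligning with the BF16 sibling script.
EXTRA_ARGS="--tokenizer-worker-num 4 --stream-interval $SCHED_INTERVAL"
echo "extra args: $EXTRA_ARGS"
```

Whether the interval should scale with concurrency, and at what thresholds, would come from the BF16 sibling script rather than this sketch.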
@functionstackx collaborating on this existing PR to get our FP8 and FP4 configs merged upstream in sgl-cookbook: sgl-project/sgl-cookbook#168. Thanks!

Thanks! Excited to see continued focus on ensuring that end users in the ML community can benefit from these optimized flag recipes for selecting the correct MoE kernel backend and attention kernel backend.

/sweep run only evals

@kedarpotdar-nv Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/22507799181

@kedarpotdar-nv some merge conflicts, then looks ready to go?

@ankursingh-nv will take over
Force-pushed from 86e3eec to 8d9f7c9 (compare)
@claude resolve the merge conflicts & update the PRs to follow the updated conventions. Remember that qwen3.5-fp4-b200-sglang should be at the bottom of perf-changelog.yaml, remember to delete 1k/8k from nvidia-master, and remember to do the EVAL_CONTEXT_ARGS thing.

Claude finished @functionstackx's task in 3m 16s —— View job Resolving Merge Conflicts & Updating Conventions
Summary: Resolved merge conflicts by rebasing cleanly on nvidia-master.yaml.
- Added benchmarks/single_node/qwen3.5_fp4_b200.sh, rewritten to follow the updated conventions from the FP8 sibling script.
- perf-changelog.yaml — entry added at the bottom.
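The "entry at the bottom of perf-changelog.yaml" convention is easy to sanity-check mechanically. A sketch against a stand-in changelog, since the file's real schema is not shown in this thread (a `- config:` key is assumed):

```shell
# Stand-in changelog; the real perf-changelog.yaml schema may differ.
cat > /tmp/perf-changelog-sample.yaml <<'EOF'
- config: qwen3.5-fp8-b200-sglang
  note: existing entry
- config: qwen3.5-fp4-b200-sglang
  note: new entry, appended at the bottom per convention
EOF

# The last config entry should be the newly added one.
last_entry="$(grep '^- config:' /tmp/perf-changelog-sample.yaml | tail -n 1)"
echo "$last_entry"
```

A check like this could run in CI so the ordering convention doesn't rely on reviewers remembering it.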
Force-pushed from e5c143d to e1e8cb2 (compare)
@kedarpotdar-nv @ankursingh-nv the eval score is not good; it doesn't even pass a relaxed floor model eval threshold: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23698363526/job/69037542640?pr=820
+viz @Oseltamivir
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you
Based on SemiAnalysisAI/InferenceX#820:
- Set mem-fraction-static to 0.85 for B200 FP4 (benchmark uses 0.85)
- Add --quantization modelopt_fp4 (required flag, was missing)
- Add --chunked-prefill-size 32768, --max-prefill-tokens 32768
- Add --max-running-requests 128, --stream-interval 30
- Add --disable-radix-cache (always required for FP4)
- Skip --enable-flashinfer-allreduce-fusion for FP4 (TP=4, not used per benchmark)

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
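Taken together, the commit's flag list implies a launch command along these lines. This is a sketch only: the model path and TP value are assumptions based on the PR's config naming, and the command is assembled but not executed here.

```shell
# Assumed model name (from the PR's config naming); TP=4 per the commit note.
MODEL="nvidia/Qwen3.5-397B-A17B-NVFP4"

# Flags as listed in the commit message above.
SERVER_ARGS="--model-path $MODEL --tp 4 --quantization modelopt_fp4"
SERVER_ARGS="$SERVER_ARGS --mem-fraction-static 0.85"
SERVER_ARGS="$SERVER_ARGS --chunked-prefill-size 32768 --max-prefill-tokens 32768"
SERVER_ARGS="$SERVER_ARGS --max-running-requests 128 --stream-interval 30"
SERVER_ARGS="$SERVER_ARGS --disable-radix-cache"
# Note: --enable-flashinfer-allreduce-fusion is deliberately omitted for FP4 at TP=4.

CMD="python -m sglang.launch_server $SERVER_ARGS"
echo "$CMD"
```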
https://github.com/sgl-project/sgl-cookbook/pull/230/files

@functionstackx - could you please help review this?
cquil11 left a comment:
nightly image fine for new arch
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24107321633 — Evals look good, throughput looks good
* MiniMax-M2.5 B200: add EP, FP8 KV cache, disable radix cache

  Based on validated benchmark configs in SemiAnalysisAI/InferenceX#1010, tp:4/ep:4 and tp:2/ep:2 are now confirmed for B200. Also enables 2-GPU selection for B200, adds --kv-cache-dtype fp8_e4m3 and --disable-radix-cache as B200-specific flags per the benchmark script.

* Update Qwen35ConfigGenerator for B200 FP4 (NVFP4)

  Based on SemiAnalysisAI/InferenceX#820.
  - Set mem-fraction-static to 0.85 for B200 FP4 (benchmark uses 0.85)
  - Add --quantization modelopt_fp4 (required flag, was missing)
  - Add --chunked-prefill-size 32768, --max-prefill-tokens 32768
  - Add --max-running-requests 128, --stream-interval 30
  - Add --disable-radix-cache (always required for FP4)
  - Skip --enable-flashinfer-allreduce-fusion for FP4 (TP=4, not used per benchmark)

* Remove --disable-radix-cache flag for B200 in MiniMaxM25ConfigGenerator

* revert: remove accidental MiniMax B200 changes from Qwen3.5 PR

  PR #230 should only touch Qwen35ConfigGenerator. Revert all changes to MiniMaxM25ConfigGenerator (B200 2-GPU support, B200 EP, B200 kv-cache-dtype) that were accidentally included on this branch.

* revert: restore MiniMax comment order to match main

  Undo accidental comment/variable reorder in MiniMaxM25ConfigGenerator that was not part of the intended Qwen3.5 B200 FP4 changes.

* Update Qwen3.5 config to conditionally enable allreduce fusion based on quantization

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Co-authored-by: zijiexia <37504505+zijiexia@users.noreply.github.com>
Co-authored-by: Zijie Xia <zijie_xia@icloud.com>
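The last commit in the list ("conditionally enable allreduce fusion based on quantization") boils down to a small branch. A shell sketch of the logic only; the real change lives in the Python Qwen35ConfigGenerator, and the variable names here are illustrative:

```shell
QUANT="${QUANT:-modelopt_fp4}"   # quantization selected for this config

# Per the commits above: the FP4 path does not use the fusion;
# other quantizations keep --enable-flashinfer-allreduce-fusion.
FUSION_ARGS=""
if [ "$QUANT" != "modelopt_fp4" ]; then
  FUSION_ARGS="--enable-flashinfer-allreduce-fusion"
fi

echo "quant=$QUANT fusion_args=${FUSION_ARGS:-none}"
```

Gating on the quantization value keeps one generator for FP8/BF16 and FP4 instead of forking the config per precision.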

Summary
Add FP4 benchmark configuration and launch script for Qwen3.5-397B-A17B on NVIDIA B200 GPUs using SGLang.

Changes
New Benchmark Config (nvidia-master.yaml)
- Config: qwen3.5-fp4-b200-sglang
- Model: nvidia/Qwen3.5-397B-A17B-NVFP4
- Image: lmsysorg/sglang:v0.5.9-cu129-amd64
- 1k1k — TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–64), TP8/EP8 (conc 128)
- 1k8k — TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–128)
- 8k1k — TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–128)

New Launch Script (benchmarks/single_node/qwen3.5_fp4_b200.sh)
SGLang server configuration with:
- --quantization modelopt_fp4 with --fp4-gemm-backend flashinfer_cutlass
- --kv-cache-dtype fp8_e4m3
- --attention-backend trtllm_mha / --moe-runner-backend flashinfer_trtllm
- --enable-flashinfer-allreduce-fusion
- --chunked-prefill-size 32768 / --max-prefill-tokens 32768
- --disable-radix-cache
- --mem-fraction-static 0.85

Perf Changelog
- Added an entry for the qwen3.5-fp4-b200-sglang config.