[AMD/Hyperloom] Tune dsr1-fp8-mi355x-sglang: --num-continuous-decode-steps 4 → 8 (#1243)
lishuoshuo-amd wants to merge 8 commits into main
Conversation
Thanks for the contribution! For vLLM and SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure the documentation is first class so that the entire ML community can benefit from your hard work!

PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Much of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get an approval from the respective company's CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
See the unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25204331730
The branch was force-pushed from 00859b5 to c709a29.
See the unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25204505898
Description
Tune --num-continuous-decode-steps from 4 to 8 for DeepSeek-R1-0528 FP8 on MI355X (SGLang). Increasing the number of continuous decode steps reduces prefill/decode scheduling overhead, lowering per-token latency (TPOT) and improving overall throughput.
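A minimal toy model (illustrative constants, not measured values from this PR) of why running more decode steps between scheduler passes lowers mean per-token latency: the fixed scheduling cost is amortized over more decode steps.

```python
def tpot_ms(decode_steps_per_sched, step_ms=10.0, sched_ms=2.0, total_steps=1000):
    """Toy TPOT model. The scheduler runs once every
    `decode_steps_per_sched` decode steps, so its fixed cost is
    amortized across them. step_ms and sched_ms are assumed numbers
    for illustration only."""
    sched_passes = total_steps / decode_steps_per_sched
    total_ms = total_steps * step_ms + sched_passes * sched_ms
    return total_ms / total_steps  # mean time per output token

print(tpot_ms(4))  # 10.5
print(tpot_ms(8))  # 10.25
```

The gain shrinks as the per-step decode time dominates, which is consistent with the modest (single-digit percent) improvements reported in the sweep below.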
Changes
benchmarks/single_node/dsr1_fp8_mi355x.sh: --num-continuous-decode-steps 4 → 8
perf-changelog.yaml: added changelog entry
Performance Results
Hyperloom CI Optimization Report (conc=64, 1k/1k)
Full Parameter Sweep (12 points, 0 failures)
Verified across the complete (tp, conc, isl, osl) search space from amd-master.yaml. Average gain: +4.7%, with a positive improvement across all parameter combinations and no regressions.
Baseline Validation Against InferenceX Official
Baseline aligns within <1% of official InferenceX data, confirming test environment reliability.
Note: All throughput numbers in this PR refer to output (decode) token throughput, never total. The "Optimization Report" and "Baseline Validation" tables show per-GPU values; the "Full Parameter Sweep" table shows aggregate (TP-summed) values from raw SGLang output_throughput. Per-GPU = aggregate / TP. Gain percentages are unit-invariant.
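The per-GPU conversion and the unit-invariance claim above can be sketched as follows (the throughput numbers here are made-up examples, not figures from this PR's tables):

```python
def per_gpu(aggregate_tps, tp):
    # Per-GPU output throughput from the TP-summed aggregate
    # (raw SGLang output_throughput): per-GPU = aggregate / TP.
    return aggregate_tps / tp

def gain_pct(new, base):
    # Relative gain in percent.
    return 100.0 * (new - base) / base

# Hypothetical example numbers:
base_agg, new_agg, tp = 16000.0, 16752.0, 8

agg_gain = gain_pct(new_agg, base_agg)
pergpu_gain = gain_pct(per_gpu(new_agg, tp), per_gpu(base_agg, tp))

# Dividing both numerator and denominator by TP cancels out,
# so the gain percentage is the same in either unit.
print(agg_gain == pergpu_gain)  # True
```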
Related Issue
Automated optimization by Hyperloom CI.
Type of Change
Checklist
perf-changelog.yaml entry added