
[AMD/Hyperloom] Tune dsr1-fp8-mi355x-sglang: --num-continuous-decode-steps 4 → 8 #1243

Open
lishuoshuo-amd wants to merge 8 commits into main from amd/hyperloom/mi355x-tune-dsr1

Conversation

@lishuoshuo-amd
Collaborator

Description

Tune --num-continuous-decode-steps from 4 to 8 for DeepSeek-R1-0528 FP8 on MI355X (SGLang).
Increasing continuous decode steps reduces prefill/decode scheduling overhead, lowering per-token latency (TPOT) and improving overall throughput.

Changes

  • benchmarks/single_node/dsr1_fp8_mi355x.sh: --num-continuous-decode-steps 4 → 8
  • perf-changelog.yaml: Added changelog entry
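For illustration, the tuned flag sits in a SGLang launch command along these lines. Only `--num-continuous-decode-steps` is taken from this PR; the model path, quantization flag, and tensor-parallel size shown here are assumed placeholders, not the actual contents of dsr1_fp8_mi355x.sh:

```shell
# Hypothetical launch sketch: every flag except --num-continuous-decode-steps
# is an assumed placeholder, not copied from the benchmark script.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-0528 \
  --quantization fp8 \
  --tp 8 \
  --num-continuous-decode-steps 8   # tuned from 4 in this PR
```

Larger values let the scheduler run more consecutive decode iterations before re-checking for new prefill work, which is the scheduling-overhead reduction described above.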

Performance Results

Hyperloom CI Optimization Report (conc=64, 1k/1k)

| Metric | Baseline | Optimized | Change |
| --- | --- | --- | --- |
| Output Throughput (per GPU) | 311.65 tok/s | 331.22 tok/s | +6.28% |
| TPOT | 24.46 ms | 23.01 ms | -5.93% |
| TTFT | 581.20 ms | 528.52 ms | -9.07% |
| vs InferenceX Official | +0.50% | +6.81% | |

Full Parameter Sweep (12 points, 0 failures)

Verified across the complete (tp, conc, isl, osl) search-space from amd-master.yaml:

| ISL/OSL | TP | Conc | Baseline (tok/s) | Optimized (tok/s) | Gain |
| --- | --- | --- | --- | --- | --- |
| 1k/1k | 8 | 4 | 399.90 | 417.60 | +4.43% |
| 1k/1k | 8 | 8 | 729.10 | 750.26 | +2.90% |
| 1k/1k | 8 | 16 | 1140.92 | 1173.48 | +2.85% |
| 1k/1k | 8 | 32 | 1683.81 | 1739.49 | +3.31% |
| 1k/1k | 8 | 64 | 2614.64 | 2654.50 | +1.52% |
| 8k/1k | 4 | 32 | 770.98 | 821.53 | +6.56% |
| 8k/1k | 4 | 64 | 991.61 | 1031.77 | +4.05% |
| 8k/1k | 8 | 4 | 310.30 | 366.00 | +17.95% |
| 8k/1k | 8 | 8 | 625.61 | 636.92 | +1.81% |
| 8k/1k | 8 | 16 | 902.98 | 925.49 | +2.49% |
| 8k/1k | 8 | 32 | 1213.94 | 1279.19 | +5.38% |
| 8k/1k | 8 | 64 | 1664.40 | 1709.37 | +2.70% |

Average gain: +4.7%, with a positive improvement across all 12 parameter combinations and no regressions.

Baseline Validation Against InferenceX Official

| Conc | Official (tok/s/GPU) | Our Baseline (tok/s/GPU) | Diff |
| --- | --- | --- | --- |
| 4 | 49.82 | 49.99 | +0.3% |

Baseline aligns within <1% of official InferenceX data, confirming test environment reliability.

Note: All throughput numbers in this PR refer to output (decode) token throughput, never total. The "Optimization Report" and "Baseline Validation" tables show per-GPU values; the "Full Parameter Sweep" table shows aggregate (TP-summed) values from raw SGLang output_throughput. Per-GPU = aggregate / TP. Gain percentages are unit-invariant.
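The unit conventions in the note above can be checked with a short Python sketch. The numbers are taken from the 1k/1k, TP=8, conc=64 row of the sweep table; everything else is plain arithmetic:

```python
# Per-GPU vs aggregate throughput, and why gain percentages are unit-invariant.
tp = 8
baseline_agg, optimized_agg = 2614.64, 2654.50  # aggregate (TP-summed) tok/s

# Per-GPU = aggregate / TP
baseline_per_gpu = baseline_agg / tp    # ~326.8 tok/s/GPU
optimized_per_gpu = optimized_agg / tp  # ~331.8 tok/s/GPU

# Gain is identical whether computed on aggregate or per-GPU values,
# because dividing both sides by TP cancels out of the ratio.
gain_agg = (optimized_agg - baseline_agg) / baseline_agg * 100
gain_per_gpu = (optimized_per_gpu - baseline_per_gpu) / baseline_per_gpu * 100
print(f"{gain_agg:.2f}% vs {gain_per_gpu:.2f}%")  # both ~+1.52%
```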

Related Issue

Automated optimization by Hyperloom CI.

Type of Change

  • Configuration change

Checklist

  • I have tested my changes locally
  • I have updated documentation if necessary
  • If I changed a container image or config, I have already updated perf-changelog.yaml

@github-actions
Contributor

github-actions Bot commented May 1, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first-class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Collaborator

@chunfangamd chunfangamd left a comment


lgtm

@chunfangamd force-pushed the amd/hyperloom/mi355x-tune-dsr1 branch from 00859b5 to c709a29 on May 1, 2026 06:04
