
[AMD/Hyperloom] Tune dsr1-fp8-mi355x-sglang: --num-continuous-decode-steps 4 → 8 #1109

Closed

lishuoshuo-amd wants to merge 2 commits into SemiAnalysisAI:main from lishuoshuo-amd:amd/hyperloom/mi355x-tune-dsr1

Conversation

lishuoshuo-amd (Collaborator) commented Apr 21, 2026

Description

Tune --num-continuous-decode-steps from 4 to 8 for DeepSeek-R1-0528 FP8 on MI355X (SGLang).
Increasing continuous decode steps reduces prefill/decode scheduling overhead, lowering per-token latency (TPOT) and improving overall throughput.

Changes

  • benchmarks/single_node/dsr1_fp8_mi355x.sh: --num-continuous-decode-steps 4 → 8 (see the sketch below)
  • perf-changelog.yaml: added a changelog entry
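
For context, a minimal sketch of the kind of SGLang launch line this flag sits on. Only `--num-continuous-decode-steps` reflects the actual change; the model path and surrounding flags are illustrative assumptions, not the real contents of dsr1_fp8_mi355x.sh:

```sh
# Illustrative only: model path and other flags are assumed, not taken from the script.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-0528 \
  --tp 8 \
  --num-continuous-decode-steps 8   # was 4; run 8 decode steps per scheduler pass
```

Batching more decode steps into each scheduler pass amortizes scheduling overhead over more tokens, which is the mechanism behind the TPOT and throughput gains reported below.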

Performance Results

Hyperloom CI Optimization Report (conc=64, 1k/1k)

| Metric                      | Baseline     | Optimized    | Change |
|-----------------------------|--------------|--------------|--------|
| Output Throughput (per GPU) | 311.65 tok/s | 331.22 tok/s | +6.28% |
| TPOT                        | 24.46 ms     | 23.01 ms     | -5.93% |
| TTFT                        | 581.20 ms    | 528.52 ms    | -9.07% |
| vs InferenceX Official      | +0.50%       | +6.81%       |        |

Full Parameter Sweep (12 points, 0 failures)

Verified across the complete (tp, conc, isl, osl) search space from amd-master.yaml:

| ISL/OSL | TP | Conc | Baseline (tok/s) | Optimized (tok/s) | Gain    |
|---------|----|------|------------------|-------------------|---------|
| 1k/1k   | 8  | 4    | 399.90           | 417.60            | +4.43%  |
| 1k/1k   | 8  | 8    | 729.10           | 750.26            | +2.90%  |
| 1k/1k   | 8  | 16   | 1140.92          | 1173.48           | +2.85%  |
| 1k/1k   | 8  | 32   | 1683.81          | 1739.49           | +3.31%  |
| 1k/1k   | 8  | 64   | 2614.64          | 2654.50           | +1.52%  |
| 8k/1k   | 4  | 32   | 770.98           | 821.53            | +6.56%  |
| 8k/1k   | 4  | 64   | 991.61           | 1031.77           | +4.05%  |
| 8k/1k   | 8  | 4    | 310.30           | 366.00            | +17.95% |
| 8k/1k   | 8  | 8    | 625.61           | 636.92            | +1.81%  |
| 8k/1k   | 8  | 16   | 902.98           | 925.49            | +2.49%  |
| 8k/1k   | 8  | 32   | 1213.94          | 1279.19           | +5.38%  |
| 8k/1k   | 8  | 64   | 1664.40          | 1709.37           | +2.70%  |

Average gain: +4.7%, with an improvement on every parameter combination and no regressions.

Baseline Validation Against InferenceX Official

| Conc | Official (tok/s/GPU) | Our Baseline (tok/s/GPU) | Diff  |
|------|----------------------|--------------------------|-------|
| 4    | 49.82                | 49.99                    | +0.3% |

The baseline matches official InferenceX data to within 1%, confirming that the test environment is reliable.

Note: All throughput numbers in this PR refer to output (decode) token throughput, never total. The "Optimization Report" and "Baseline Validation" tables show per-GPU values; the "Full Parameter Sweep" table shows aggregate (TP-summed) values from raw SGLang output_throughput. Per-GPU = aggregate / TP. Gain percentages are unit-invariant.
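
As a quick worked example of that conversion (a sanity check only), dividing the optimized 1k/1k, TP=8, conc=64 aggregate from the sweep by TP lands close to, though not exactly on, the 331.22 tok/s per-GPU figure in the optimization report, which presumably comes from a separate run:

```sh
# Per-GPU throughput = aggregate / TP, using the optimized 1k/1k TP=8 conc=64 row.
awk 'BEGIN { printf "per-GPU: %.2f tok/s\n", 2654.50 / 8 }'
# prints: per-GPU: 331.81 tok/s
```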

Related Issue

Automated optimization by Hyperloom CI.

Type of Change

  • Configuration change

Checklist

  • I have tested my changes locally
  • I have updated documentation if necessary
  • If I changed a container image or config, I have already updated perf-changelog.yaml

claude (Bot) left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

lishuoshuo-amd (Collaborator, Author) commented:

@claude review

billishyahao (Collaborator) commented:

cc @Duyi-Wang

lishuoshuo-amd (Collaborator, Author) commented:

Per maintainer guidance, I moved this change from a fork-based PR to a branch directly under SemiAnalysisAI/InferenceX to avoid workflow restrictions on fork PRs.

Continuing review here: #1243

Closing this PR to avoid duplicate review.
