
[AMD/Hyperloom] Tune dsr1-fp8-mi355x-sglang: --num-continuous-decode-steps 4 → 8 #1109

Closed

lishuoshuo-amd wants to merge 2 commits into SemiAnalysisAI:main from lishuoshuo-amd:amd/hyperloom/mi355x-tune-dsr1

Conversation

lishuoshuo-amd (Collaborator) commented Apr 21, 2026

Description

Tune --num-continuous-decode-steps from 4 to 8 for DeepSeek-R1-0528 FP8 on MI355X (SGLang).
Increasing continuous decode steps reduces prefill/decode scheduling overhead, lowering per-token latency (TPOT) and improving overall throughput.

Changes

  • benchmarks/single_node/dsr1_fp8_mi355x.sh: --num-continuous-decode-steps 4 → 8 (see the sketch below)
  • perf-changelog.yaml: added a changelog entry
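
For context, a minimal sketch of the kind of SGLang launch line this flag sits on. Only `--num-continuous-decode-steps` reflects the actual change; the model path and surrounding flags are illustrative assumptions, not the real contents of dsr1_fp8_mi355x.sh:

```sh
# Illustrative only: model path and other flags are assumed, not taken from the script.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-0528 \
  --tp 8 \
  --num-continuous-decode-steps 8   # was 4; run 8 decode steps per scheduler pass
```

Batching more decode steps into each scheduler pass amortizes scheduling overhead over more tokens, which is the mechanism behind the TPOT and throughput gains reported below.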

Performance Results

Hyperloom CI Optimization Report (conc=64, 1k/1k)

| Metric                      | Baseline     | Optimized    | Change |
|-----------------------------|--------------|--------------|--------|
| Output Throughput (per GPU) | 311.65 tok/s | 331.22 tok/s | +6.28% |
| TPOT                        | 24.46 ms     | 23.01 ms     | -5.93% |
| TTFT                        | 581.20 ms    | 528.52 ms    | -9.07% |
| vs InferenceX Official      | +0.50%       | +6.81%       |        |

Full Parameter Sweep (12 points, 0 failures)

Verified across the complete (tp, conc, isl, osl) search space from amd-master.yaml:

| ISL/OSL | TP | Conc | Baseline (tok/s) | Optimized (tok/s) | Gain    |
|---------|----|------|------------------|-------------------|---------|
| 1k/1k   | 8  | 4    | 399.90           | 417.60            | +4.43%  |
| 1k/1k   | 8  | 8    | 729.10           | 750.26            | +2.90%  |
| 1k/1k   | 8  | 16   | 1140.92          | 1173.48           | +2.85%  |
| 1k/1k   | 8  | 32   | 1683.81          | 1739.49           | +3.31%  |
| 1k/1k   | 8  | 64   | 2614.64          | 2654.50           | +1.52%  |
| 8k/1k   | 4  | 32   | 770.98           | 821.53            | +6.56%  |
| 8k/1k   | 4  | 64   | 991.61           | 1031.77           | +4.05%  |
| 8k/1k   | 8  | 4    | 310.30           | 366.00            | +17.95% |
| 8k/1k   | 8  | 8    | 625.61           | 636.92            | +1.81%  |
| 8k/1k   | 8  | 16   | 902.98           | 925.49            | +2.49%  |
| 8k/1k   | 8  | 32   | 1213.94          | 1279.19           | +5.38%  |
| 8k/1k   | 8  | 64   | 1664.40          | 1709.37           | +2.70%  |

Average gain: +4.7%, with an improvement on every parameter combination and no regressions.

Baseline Validation Against InferenceX Official

| Conc | Official (tok/s/GPU) | Our Baseline (tok/s/GPU) | Diff  |
|------|----------------------|--------------------------|-------|
| 4    | 49.82                | 49.99                    | +0.3% |

The baseline matches official InferenceX data to within 1%, confirming that the test environment is reliable.

Note: All throughput numbers in this PR refer to output (decode) token throughput, never total. The "Optimization Report" and "Baseline Validation" tables show per-GPU values; the "Full Parameter Sweep" table shows aggregate (TP-summed) values from raw SGLang output_throughput. Per-GPU = aggregate / TP. Gain percentages are unit-invariant.
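
As a quick worked example of that conversion (a sanity check only), dividing the optimized 1k/1k, TP=8, conc=64 aggregate from the sweep by TP lands close to, though not exactly on, the 331.22 tok/s per-GPU figure in the optimization report, which presumably comes from a separate run:

```sh
# Per-GPU throughput = aggregate / TP, using the optimized 1k/1k TP=8 conc=64 row.
awk 'BEGIN { printf "per-GPU: %.2f tok/s\n", 2654.50 / 8 }'
# prints: per-GPU: 331.81 tok/s
```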

Related Issue

Automated optimization by Hyperloom CI.

Type of Change

  • Configuration change

Checklist

  • I have tested my changes locally
  • I have updated documentation if necessary
  • If I changed a container image or config, I have already updated perf-changelog.yaml

claude (Bot) left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

lishuoshuo-amd (Collaborator, Author) commented:

@claude review

billishyahao (Collaborator) commented:

cc @Duyi-Wang

lishuoshuo-amd (Collaborator, Author) commented:

Per maintainer guidance, I moved this change from a fork-based PR to a branch directly under SemiAnalysisAI/InferenceX to avoid workflow restrictions on fork PRs.

Continuing review here: #1243

Closing this PR to avoid duplicate review.
