[Perf Regression] 24 config(s) regressed @ 56d16766 #673

## Performance Regression Detected

**Commit:** `56d16766`
**Run:** https://github.com/ROCm/ATOM/actions/runs/25124114536
**Date:** 2026-04-30T00:47:01.855594+00:00

### Regressed Configurations

| Model | ISL/OSL | Conc | Tput (cur) | Tput (base) | Tput Δ% | TPOT (cur) | TPOT (base) | TPOT Δ% |
|-------|---------|------|-----------|------------|---------|-----------|------------|---------|
| DeepSeek-R1-0528 | 8192/1024 | 8 | 501.3 | 576.6 | -13.1% | 15.36 | 13.25 | 15.9% |
| DeepSeek-R1-0528 MTP3 | 1024/1024 | 8 | 828.8 | 966.7 | -14.3% | 9.17 | 8.02 | 14.3% |
| DeepSeek-R1-0528 MTP3 | 8192/1024 | 4 | 568.0 | 553.6 | 2.6% | 6.26 | 6.58 | -4.8% |
| DeepSeek-R1-0528 MTP3 | 8192/1024 | 8 | 822.2 | 881.8 | -6.8% | 9.02 | 8.41 | 7.2% |
| DeepSeek-R1-0528 MTP3 | 8192/1024 | 16 | 1251.9 | 1249.7 | 0.2% | 11.91 | 11.98 | -0.6% |
| DeepSeek-R1-0528 MTP3 | 8192/1024 | 32 | 1683.3 | 1820.2 | -7.5% | 17.20 | 16.48 | 4.3% |
| DeepSeek-R1-0528-MXFP4 | 1024/1024 | 32 | 1717.0 | 1799.4 | -4.6% | 17.78 | 17.21 | 3.3% |
| DeepSeek-R1-0528-MXFP4 MTP3 | 1024/1024 | 8 | 973.8 | 1038.6 | -6.2% | 7.71 | 7.29 | 5.8% |
| DeepSeek-R1-0528-MXFP4 MTP3 | 1024/1024 | 256 | 6467.0 | 6586.7 | -1.8% | 37.63 | 37.13 | 1.3% |
| GLM-5-FP8 | 1024/1024 | 128 | 3038.8 | 3068.8 | -1.0% | 40.48 | 40.15 | 0.8% |
| GLM-5.1-MXFP4 | 1024/1024 | 8 | 429.8 | 435.9 | -1.4% | 17.93 | 17.84 | 0.5% |
| Kimi-K2.5-MXFP4 | 1024/1024 | 64 | 2464.0 | 2431.6 | 1.3% | 24.98 | 25.42 | -1.8% |
| Kimi-K2.5-MXFP4 | 1024/1024 | 128 | 3624.6 | 3617.5 | 0.2% | 33.94 | 34.06 | -0.3% |
| Llama-3.3-70B-Instruct-MXFP4 | 1024/1024 | 4 | 257.6 | 261.1 | -1.4% | 14.83 | 14.67 | 1.1% |
| Llama-3.3-70B-Instruct-MXFP4 | 1024/1024 | 32 | 1746.3 | 1735.9 | 0.6% | 17.44 | 17.60 | -0.9% |
| MiniMax-M2.5 | 1024/1024 | 256 | 5519.2 | 5427.8 | 1.7% | 44.72 | 45.41 | -1.5% |
| MiniMax-M2.5-MXFP4 | 1024/1024 | 128 | 3931.3 | 3943.2 | -0.3% | 31.53 | 31.44 | 0.3% |
| Qwen3.5-397B-A17B-FP8 | 8192/1024 | 4 | 385.5 | 387.8 | -0.6% | 9.82 | 9.79 | 0.3% |
| Qwen3.5-397B-A17B-FP8 MTP3 | 8192/1024 | 64 | 2699.4 | 2719.9 | -0.8% | 22.05 | 22.09 | -0.2% |
| Qwen3.5-397B-A17B-MXFP4 | 8192/1024 | 4 | 365.1 | 367.2 | -0.6% | 10.40 | 10.36 | 0.4% |
| gpt-oss-120b | 1024/1024 | 4 | 869.7 | 911.5 | -4.6% | 4.18 | 4.20 | -0.4% |
| gpt-oss-120b | 8192/1024 | 4 | 830.1 | 816.4 | 1.7% | 4.50 | 4.59 | -2.1% |
| gpt-oss-120b | 8192/1024 | 8 | 1311.7 | 1410.3 | -7.0% | 5.74 | 5.42 | 6.1% |
| gpt-oss-120b | 8192/1024 | 32 | 3049.3 | 3035.9 | 0.4% | 9.98 | 10.06 | -0.8% |
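
The Δ% columns appear to be the relative change of the current value against the baseline, `(cur - base) / base`. This formula is inferred from the numbers in the table, not stated by the report, and `pct_change` is a hypothetical helper name. A minimal Python sketch that reproduces the first row:

```python
def pct_change(cur: float, base: float) -> float:
    """Relative change of cur versus base, in percent (assumed formula)."""
    return (cur - base) / base * 100.0


# First table row: DeepSeek-R1-0528, ISL/OSL 8192/1024, concurrency 8.
tput_delta = pct_change(501.3, 576.6)   # throughput: lower is worse
tpot_delta = pct_change(15.36, 13.25)   # TPOT: higher is worse

print(f"{tput_delta:.1f}% {tpot_delta:.1f}%")  # prints "-13.1% 15.9%"
```

Under this reading, a negative throughput Δ% or a positive TPOT Δ% indicates a slowdown. The threshold the workflow uses to flag a configuration as regressed is not given in the report.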

### Performance Summary

```
# Trace Performance Summary

**File:** `DeepSeek-R1-0528_ts_20260430_005754_536.pt.trace.json.gz`

## Prefill

| # | Label | Duration |
|---|-------|----------|
| 0 | `prefill[bs=1 tok=7237 ctx=7237]` | 74.65 ms |
| 1 | `prefill[bs=2 tok=14698 ctx=[7112, 7586]]` | 76.80 ms |
| 2 | `prefill[bs=2 tok=15324 ctx=[7388, 7936]]` | 72.65 ms |
| 3 | `prefill[bs=2 tok=14146 ctx=[6830, 7316]]` | 72.62 ms |
| 4 | `prefill[bs=1 tok=7769 ctx=7769]` | 71.83 ms |
| 5 | `prefill[bs=1 tok=7152 ctx=7152]` | 71.71 ms |
| 6 | `prefill[bs=1 tok=7647 ctx=7647]` | 70.76 ms |
| 7 | `prefill[bs=1 tok=8049 ctx=8049]` | 70.45 ms |
| 8 | `prefill[bs=1 tok=7153 ctx=7153]` | 71.19 ms |
| 9 | `prefill[bs=1 tok=7973 ctx=7973]` | 70.57 ms |
| 10 | `prefill[bs=1 tok=6867 ctx=6867]` | 68.93 ms |
| 11 | `prefill[bs=1 tok=7258 ctx=7258]` | 71.43 ms |
| 12 | `prefill[bs=1 tok=8063 ctx=8063]` | 72.17 ms |

**Total prefill:** 935.76 ms

## Decode

- **Iterations:** 2009
- **Mean:** 951.7 us
- **Min:** 557.6 us
- **Max:** 2.47 ms
- **Total:** 1911.96 ms

```
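
To compare prefill steps of different sizes, it can help to normalize each duration by its token count. The sketch below assumes that `tok` in a prefill label is the total number of tokens processed in that step (it matches the sum of the `ctx` values in the rows above); `prefill_rate` and `LABEL_RE` are hypothetical names, not part of the profiler tooling.

```python
import re

# Matches labels such as "prefill[bs=2 tok=14698 ctx=[7112, 7586]]".
LABEL_RE = re.compile(r"prefill\[bs=(\d+) tok=(\d+) ")


def prefill_rate(label: str, duration_ms: float) -> float:
    """Return prefill throughput in tokens per millisecond for one row."""
    match = LABEL_RE.search(label)
    if match is None:
        raise ValueError(f"unrecognized prefill label: {label!r}")
    tokens = int(match.group(2))
    return tokens / duration_ms


# Two rows copied from the prefill table above.
rows = [
    ("prefill[bs=1 tok=7237 ctx=7237]", 74.65),
    ("prefill[bs=2 tok=14698 ctx=[7112, 7586]]", 76.80),
]
for label, ms in rows:
    print(f"{label}: {prefill_rate(label, ms):.1f} tok/ms")
```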

### Profiler Traces

Download the traces from the [workflow artifacts](https://github.com/ROCm/ATOM/actions/runs/25124114536).
Open them in [Perfetto UI](https://ui.perfetto.dev/) or in Chrome at `chrome://tracing` for analysis.

### Next Steps

1. Download the `profiler-analysis-25124114536` artifact.
2. Open the trace files in Perfetto UI.
3. Compare kernel durations against the traces from the previous run.
4. Identify which bottlenecks changed.
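
Step 3 can also be scripted instead of done by eye. The following is a rough sketch, assuming the `.pt.trace.json.gz` files are gzip-compressed Chrome trace JSON with a `traceEvents` list, and that GPU kernels are complete events (`"ph": "X"`) with `"cat": "kernel"` and a `dur` in microseconds. The category name is an assumption to check against the actual traces, and `kernel_totals` and `compare` are hypothetical helpers.

```python
import gzip
import json
from collections import defaultdict


def kernel_totals(path, category="kernel"):
    """Sum event durations (us) per event name in a gzipped Chrome trace.

    Only complete events ("ph" == "X") are counted. Pass category=None to
    include every category rather than only the assumed "kernel" one.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        trace = json.load(f)
    # A Chrome trace is either {"traceEvents": [...]} or a bare event list.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for ev in events:
        if ev.get("ph") != "X" or "dur" not in ev:
            continue
        if category is not None and ev.get("cat") != category:
            continue
        totals[ev["name"]] += ev["dur"]
    return dict(totals)


def compare(cur, base, top=20):
    """Return (name, cur_us, base_us, delta_us) rows, largest increase first."""
    rows = []
    for name in set(cur) | set(base):
        c = cur.get(name, 0.0)
        b = base.get(name, 0.0)
        rows.append((name, c, b, c - b))
    rows.sort(key=lambda r: r[3], reverse=True)
    return rows[:top]
```

For example, `compare(kernel_totals("current.pt.trace.json.gz"), kernel_totals("baseline.pt.trace.json.gz"))` lists the kernels whose total time grew the most; both file names are placeholders. Totals are only comparable when the two traces cover the same workload and number of iterations.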

