[Perf Regression] 24 config(s) regressed @ 56d16766 #673

@github-actions

Performance Regression Detected

Commit: 56d16766
Run: https://github.com/ROCm/ATOM/actions/runs/25124114536
Date: 2026-04-30T00:47:01.855594+00:00

Regressed Configurations

| Model | ISL/OSL | Conc | Tput (cur) | Tput (base) | Tput Δ% | TPOT (cur) | TPOT (base) | TPOT Δ% |
|-------|---------|------|------------|-------------|---------|------------|-------------|---------|
| DeepSeek-R1-0528 | 8192/1024 | 8 | 501.3 | 576.6 | -13.1% | 15.36 | 13.25 | 15.9% |
| DeepSeek-R1-0528 MTP3 | 1024/1024 | 8 | 828.8 | 966.7 | -14.3% | 9.17 | 8.02 | 14.3% |
| DeepSeek-R1-0528 MTP3 | 8192/1024 | 4 | 568.0 | 553.6 | 2.6% | 6.26 | 6.58 | -4.8% |
| DeepSeek-R1-0528 MTP3 | 8192/1024 | 8 | 822.2 | 881.8 | -6.8% | 9.02 | 8.41 | 7.2% |
| DeepSeek-R1-0528 MTP3 | 8192/1024 | 16 | 1251.9 | 1249.7 | 0.2% | 11.91 | 11.98 | -0.6% |
| DeepSeek-R1-0528 MTP3 | 8192/1024 | 32 | 1683.3 | 1820.2 | -7.5% | 17.20 | 16.48 | 4.3% |
| DeepSeek-R1-0528-MXFP4 | 1024/1024 | 32 | 1717.0 | 1799.4 | -4.6% | 17.78 | 17.21 | 3.3% |
| DeepSeek-R1-0528-MXFP4 MTP3 | 1024/1024 | 8 | 973.8 | 1038.6 | -6.2% | 7.71 | 7.29 | 5.8% |
| DeepSeek-R1-0528-MXFP4 MTP3 | 1024/1024 | 256 | 6467.0 | 6586.7 | -1.8% | 37.63 | 37.13 | 1.3% |
| GLM-5-FP8 | 1024/1024 | 128 | 3038.8 | 3068.8 | -1.0% | 40.48 | 40.15 | 0.8% |
| GLM-5.1-MXFP4 | 1024/1024 | 8 | 429.8 | 435.9 | -1.4% | 17.93 | 17.84 | 0.5% |
| Kimi-K2.5-MXFP4 | 1024/1024 | 64 | 2464.0 | 2431.6 | 1.3% | 24.98 | 25.42 | -1.8% |
| Kimi-K2.5-MXFP4 | 1024/1024 | 128 | 3624.6 | 3617.5 | 0.2% | 33.94 | 34.06 | -0.3% |
| Llama-3.3-70B-Instruct-MXFP4 | 1024/1024 | 4 | 257.6 | 261.1 | -1.4% | 14.83 | 14.67 | 1.1% |
| Llama-3.3-70B-Instruct-MXFP4 | 1024/1024 | 32 | 1746.3 | 1735.9 | 0.6% | 17.44 | 17.60 | -0.9% |
| MiniMax-M2.5 | 1024/1024 | 256 | 5519.2 | 5427.8 | 1.7% | 44.72 | 45.41 | -1.5% |
| MiniMax-M2.5-MXFP4 | 1024/1024 | 128 | 3931.3 | 3943.2 | -0.3% | 31.53 | 31.44 | 0.3% |
| Qwen3.5-397B-A17B-FP8 | 8192/1024 | 4 | 385.5 | 387.8 | -0.6% | 9.82 | 9.79 | 0.3% |
| Qwen3.5-397B-A17B-FP8 MTP3 | 8192/1024 | 64 | 2699.4 | 2719.9 | -0.8% | 22.05 | 22.09 | -0.2% |
| Qwen3.5-397B-A17B-MXFP4 | 8192/1024 | 4 | 365.1 | 367.2 | -0.6% | 10.40 | 10.36 | 0.4% |
| gpt-oss-120b | 1024/1024 | 4 | 869.7 | 911.5 | -4.6% | 4.18 | 4.20 | -0.4% |
| gpt-oss-120b | 8192/1024 | 4 | 830.1 | 816.4 | 1.7% | 4.50 | 4.59 | -2.1% |
| gpt-oss-120b | 8192/1024 | 8 | 1311.7 | 1410.3 | -7.0% | 5.74 | 5.42 | 6.1% |
| gpt-oss-120b | 8192/1024 | 32 | 3049.3 | 3035.9 | 0.4% | 9.98 | 10.06 | -0.8% |
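
The Δ% columns appear to be the relative change of the current value against the baseline. A minimal sketch of that calculation, checked against the first row; the helper name `pct_change` is hypothetical and not taken from the workflow:

```python
def pct_change(cur: float, base: float) -> float:
    """Relative change of `cur` versus `base`, in percent."""
    return (cur - base) / base * 100.0


# First row of the table: DeepSeek-R1-0528, ISL/OSL 8192/1024, concurrency 8.
tput_delta = pct_change(501.3, 576.6)   # throughput: lower is worse
tpot_delta = pct_change(15.36, 13.25)   # TPOT: higher is worse

print(f"Tput {tput_delta:+.1f}%  TPOT {tpot_delta:+.1f}%")
# prints: Tput -13.1%  TPOT +15.9%
```

Because the table shows rounded values, recomputing a Δ% from them can differ from the reported figure in the last digit.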

Performance Summary

# Trace Performance Summary

**File:** `DeepSeek-R1-0528_ts_20260430_005754_536.pt.trace.json.gz`

## Prefill

| # | Label | Duration |
|---|-------|----------|
| 0 | `prefill[bs=1 tok=7237 ctx=7237]` | 74.65 ms |
| 1 | `prefill[bs=2 tok=14698 ctx=[7112, 7586]]` | 76.80 ms |
| 2 | `prefill[bs=2 tok=15324 ctx=[7388, 7936]]` | 72.65 ms |
| 3 | `prefill[bs=2 tok=14146 ctx=[6830, 7316]]` | 72.62 ms |
| 4 | `prefill[bs=1 tok=7769 ctx=7769]` | 71.83 ms |
| 5 | `prefill[bs=1 tok=7152 ctx=7152]` | 71.71 ms |
| 6 | `prefill[bs=1 tok=7647 ctx=7647]` | 70.76 ms |
| 7 | `prefill[bs=1 tok=8049 ctx=8049]` | 70.45 ms |
| 8 | `prefill[bs=1 tok=7153 ctx=7153]` | 71.19 ms |
| 9 | `prefill[bs=1 tok=7973 ctx=7973]` | 70.57 ms |
| 10 | `prefill[bs=1 tok=6867 ctx=6867]` | 68.93 ms |
| 11 | `prefill[bs=1 tok=7258 ctx=7258]` | 71.43 ms |
| 12 | `prefill[bs=1 tok=8063 ctx=8063]` | 72.17 ms |

**Total prefill:** 935.76 ms

## Decode

- **Iterations:** 2009
- **Mean:** 951.7 us
- **Min:** 557.6 us
- **Max:** 2.47 ms
- **Total:** 1911.96 ms
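
As a quick sanity check, the reported totals can be recomputed from the figures above. This is only a sketch using the rounded numbers from this summary, so small last-digit differences are expected:

```python
# Per-step prefill durations (ms), copied from the Prefill table.
prefill_ms = [74.65, 76.80, 72.65, 72.62, 71.83, 71.71, 70.76,
              70.45, 71.19, 70.57, 68.93, 71.43, 72.17]
prefill_total_ms = sum(prefill_ms)

# Decode: iterations times mean duration (us), converted to ms.
decode_iterations = 2009
decode_mean_us = 951.7
decode_total_ms = decode_iterations * decode_mean_us / 1000.0

print(f"prefill total: {prefill_total_ms:.2f} ms")  # reported: 935.76 ms
print(f"decode total:  {decode_total_ms:.2f} ms")   # reported: 1911.96 ms
```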

Profiler Traces

Download from workflow artifacts.
Open the traces in Perfetto UI or in Chrome at chrome://tracing for analysis.

Next Steps

  1. Download profiler-analysis-25124114536 artifact
  2. Open trace files in Perfetto UI
  3. Compare kernel durations against previous traces
  4. Identify bottleneck changes
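
Steps 3 and 4 can also be scripted. The sketch below assumes the traces are in Chrome trace event format (a top-level `traceEvents` list of complete events with `ph == "X"` and a `dur` in microseconds) and that GPU kernels carry `cat == "kernel"`, as PyTorch profiler exports typically do; check both assumptions against the actual files. The function names and the paths in the usage comment are hypothetical.

```python
import gzip
import json
from collections import defaultdict


def kernel_totals(trace_path):
    """Sum durations (us) per kernel name from one trace file."""
    opener = gzip.open if trace_path.endswith(".gz") else open
    with opener(trace_path, "rt", encoding="utf-8") as fh:
        trace = json.load(fh)
    # Chrome traces are either a dict with "traceEvents" or a bare event list.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for ev in events:
        # Assumption: GPU kernels are complete events tagged cat == "kernel".
        if ev.get("ph") == "X" and ev.get("cat") == "kernel":
            totals[ev["name"]] += float(ev.get("dur", 0.0))
    return dict(totals)


def compare(cur, base, top=10):
    """Return (name, cur_us, base_us, delta_us) rows, largest slowdown first."""
    rows = []
    for name in set(cur) | set(base):
        c = cur.get(name, 0.0)
        b = base.get(name, 0.0)
        rows.append((name, c, b, c - b))
    rows.sort(key=lambda r: r[3], reverse=True)
    return rows[:top]


# Usage (hypothetical paths):
# cur = kernel_totals("current/trace.json.gz")
# base = kernel_totals("baseline/trace.json.gz")
# for name, c, b, d in compare(cur, base):
#     print(f"{d:+12.1f} us  {name}")
```

Total time per kernel name is sensitive to how many iterations each trace captured, so compare traces recorded with the same workload, or normalize by call count first.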
