Migrate tests to mi325-1gpu runner due to the slow network issues on TW cluster #1166

gyohuangxin · 2025-10-11T07:50:08Z

No description provided.

Copilot

Pull Request Overview

This PR updates the GitHub Actions workflow configuration to change the runner type for the sglang integration job from aiter-1gpu-runner to mi300x-1gpu.

Changes the runner specification for the sglang downstream workflow
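
As a rough illustration of the change described in this overview, the edit amounts to swapping the `runs-on` label of the sglang job. The sketch below is hypothetical: the job id, the step, and the `.github/workflows/` location are assumptions, and only the two runner labels and the `sglang_downstream.yaml` file name come from this PR. The final PR title names an mi325-1gpu runner, so the merged workflow may use a different label than the one shown here.

```yaml
# Hypothetical excerpt of .github/workflows/sglang_downstream.yaml.
# Job id and steps are illustrative assumptions, not copied from the repository.
jobs:
  sglang-integration:                # assumed job id
    # runs-on: aiter-1gpu-runner     # previous runner label (per the overview)
    runs-on: mi300x-1gpu             # runner label at the time of the Copilot review
    steps:
      - uses: actions/checkout@v4    # placeholder step
```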

…TW cluster (#1166)

* add asm bf16_gemm to gemm tuner
* mdf_fileName (#1160)
* Tuning Utility update (#1102)
* update
* fix
* fix2
* update
* update gemm/fmoe tune
* modify README.md
* update gen_instances.py
* fix lint error
* fix typo
* rm env AITER_CONFIG_** in ck gen_instances.py
* update a8w8_blockscale_bpreshuffle
* support to read config from multiple tuned files
* fix error in batch_gemm

---------

Co-authored-by: Ying.Zhou2 <Ying.Zhou2@amd.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* [MI35X] Fix meta mha 950 error to make guard return bool func itself (#1170)
* Make bool guard return func itself
* make all return func
* write querysol fake
* Migrate tests to mi325-1gpu runner due to the slow network issues on TW cluster (#1166)
* Temporarily using mi355 runners util issues on mi325 runners are fixed (#1176)
* Update sglang_downstream.yaml to use aiter-mi325-1gpu runner (#1181)
* CI: Add Aiter Release Package CI (#1171)
* update gemmTuner
* update tuned_gemm.py
* fix lint error
* update asm maxSplitK and tuned_gemm.csv
* Add AITER_CONFIG_GEMM_BF16 and profile_file params to save all temp results
* fix error
* add padded_m logic and update gemm_common
* refactor get tuned config in tuned_gemm rm m < 256 limitation in gemm_op_a4w4.py
* fix lint err
* wrapper the get_gemm_a16w16_config using torch_compile_guard
* fix lint error
* refactor tuned_gemm.py
* wrapper gemm_a16w16 with torch_compile_guard
* fix lint error
* add gen_fake func
* update
* updae GemmTuner --profile_file

---------

Co-authored-by: amd-ruitang3 <145657428+amd-ruitang3@users.noreply.github.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: ZhangLirong <lirzhang@amd.com>
Co-authored-by: valarLip <340077269@qq.com>

Update sglang_downstream.yaml to use mi300x-1gpu runner

f43cd7c

Copilot AI review requested due to automatic review settings October 11, 2025 07:50

Copilot AI reviewed Oct 11, 2025

gyohuangxin marked this pull request as draft October 11, 2025 08:57

gyohuangxin changed the title ~~Update sglang_downstream.yaml to use mi300x-1gpu runner~~ Update triton tests to use mi300x-1gpu runner Oct 11, 2025

gyohuangxin added 3 commits October 12, 2025 00:58

Update sglang_downstream.yaml

9c81b25

Update triton tests to use mi300x-1gpu runner

8098d15

Updates

2b6c0ee

gyohuangxin changed the title ~~Update triton tests to use mi300x-1gpu runner~~ Update triton and sglang tests to use mi300x-1gpu runner Oct 13, 2025

gyohuangxin marked this pull request as ready for review October 13, 2025 04:51

Updates

f877649

gyohuangxin changed the title ~~Update triton and sglang tests to use mi300x-1gpu runner~~ Migrate tests to mi325-1gpu runner due to the slow network issues on TW cluster Oct 13, 2025

gyohuangxin merged commit 5becf90 into main Oct 13, 2025
9 of 13 checks passed

gyohuangxin deleted the gyohuangxin-patch-1 branch October 13, 2025 05:04

yzhou103 pushed a commit that referenced this pull request Oct 13, 2025

Migrate tests to mi325-1gpu runner due to the slow network issues on …

af44cf6

…TW cluster (#1166)

eliotwang pushed a commit to eliotwang/aiter that referenced this pull request Oct 21, 2025

Migrate tests to mi325-1gpu runner due to the slow network issues on …

6e2f5d9

…TW cluster (ROCm#1166)
