
fix mhc device #2916

Merged
valarLip merged 1 commit into main from fix_mhc_device on Apr 28, 2026

Conversation

@valarLip
Collaborator

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

valarLip requested review from a team and Copilot on April 25, 2026 15:33
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label            Tests
ci:triton-300x   Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
ci:sglang        SGLang integration tests
ci:atom          ATOM benchmark (DeepSeek-R1 + GPT-OSS)
ci:vllm          vLLM benchmark
ci:all           All of the above

Add labels via the sidebar or gh pr edit 2916 --add-label <label>

Contributor

Copilot AI left a comment

Pull request overview

Fixes mhc_pre intermediate tensor allocations to be created on the same device as the input residual, preventing device-mismatch failures when the global default device is not CUDA.

Changes:

  • Reorders imports in aiter/ops/mhc.py.
  • Allocates out_pad, sqrsum, post_mix, comb_mix, and layer_input with device=residual.device.
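The allocation pattern is small enough to sketch. The snippet below is illustrative only; the helper name, shapes, and dtypes are placeholders rather than the actual mhc_pre internals:

    import torch

    def allocate_intermediates(residual: torch.Tensor, m: int, n: int):
        # Pin every intermediate to the input's device so a CPU global default
        # device (torch.set_default_device("cpu")) cannot pull allocations off the GPU.
        device = residual.device
        out_pad = torch.empty(m, n, dtype=torch.float32, device=device)
        sqrsum = torch.empty(m, dtype=torch.float32, device=device)
        layer_input = torch.empty_like(residual)  # *_like factories already follow residual.device
        return out_pad, sqrsum, layer_input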


Comment thread aiter/ops/mhc.py
Comment on lines +92 to +94
    device = residual.device
    out_pad = torch.empty(
-       selected_splitk, m, (hc_mult3 + 31) // 32 * 32, dtype=dtypes.fp32
+       selected_splitk, m, (hc_mult3 + 31) // 32 * 32, dtype=dtypes.fp32, device=device

Copilot AI Apr 25, 2026


Consider adding a regression test that exercises mhc_pre when the global default device is CPU but inputs are explicitly on CUDA (e.g., torch.set_default_device('cpu') then create residual/fn/hc_scale/hc_base on cuda). This change fixes internal tensor allocations to follow residual.device, but current tests may still pass even if allocations accidentally fall back to the default device.
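A minimal sketch of such a regression test, using the input names from this comment (residual, fn, hc_scale, hc_base); the shapes, dtypes, and exact mhc_pre call signature are placeholders rather than the real aiter API:

    import pytest
    import torch

    import aiter  # assumes an aiter build with ROCm/CUDA support

    @pytest.mark.skipif(not torch.cuda.is_available(), reason="needs a CUDA/ROCm device")
    def test_mhc_pre_ignores_cpu_default_device():
        torch.set_default_device("cpu")  # any allocation that forgets device= now lands on CPU
        m, hidden = 8, 256               # placeholder sizes
        residual = torch.randn(m, hidden, dtype=torch.bfloat16, device="cuda")
        fn = torch.randn(m, hidden, dtype=torch.bfloat16, device="cuda")
        hc_scale = torch.randn(hidden, dtype=torch.float32, device="cuda")
        hc_base = torch.randn(hidden, dtype=torch.float32, device="cuda")

        out = aiter.mhc_pre(residual, fn, hc_scale, hc_base)  # hypothetical call signature
        assert out.device == residual.device  # fails if internals fall back to the default device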

valarLip merged commit 8c27e66 into main on Apr 28, 2026
32 checks passed
valarLip deleted the fix_mhc_device branch on April 28, 2026 03:01
Oseltamivir added a commit to SemiAnalysisAI/InferenceX that referenced this pull request Apr 28, 2026
demonsan added a commit that referenced this pull request Apr 29, 2026
* CI: surface runner-config mapping in AMD CI job monitor (#2711)

* Update runner-config.yml

* CI: surface runner-config mapping in AMD CI job monitor

Load GPU architecture and count from runner-config.yml so the runner fleet summary shows the configured inventory for each label. Trigger the monitor workflow when runner mappings change and install PyYAML for the runner report job.

* CI: auto-update split test FILE_TIMES (#2709)

Co-authored-by: gyohuangxin <42127654+gyohuangxin@users.noreply.github.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* CI: add standalone OPUS test workflow (#2716)

Add a dedicated GitHub Actions workflow for op_tests/opus so OPUS validation runs independently on MI35X and MI325 runners without being mixed into the main Aiter test shards.

* add signature and refine lru cache (#2710)

Co-authored-by: solin <bingzhou@amd.com>

* CI: separate Aiter Test concurrency for push and schedule (#2718)

Split main-branch concurrency by event type so scheduled runs do not block push-triggered validation when a long queued job keeps the nightly workflow open.

* fix split module jit and retune (#2719)

* [TRITON] Prevent NUM_KSPLIT from reducing K dim below GROUP_K (#2661)

* CI: Enable Deepseek ATOM tests on MI35X (#2667)

* CI: Enable Deepseek ATOM tests on MI35X

* CI: Use /models cache for MI35X ATOM DeepSeek test

Route the MI35X DeepSeek job to the runner-local /models cache so it avoids downloading into /run, and make the output artifact name unique now that two DeepSeek variants run in the same workflow.

* CI: Mount /models into MI35X ATOM test container

Pass the runner's shared /models cache into atom_aiter_test so MI35X DeepSeek jobs can use the mounted model path.

* feat: _flash_attn_forward add out args (#2648)

* Replace CK/CK_TILE in MLA Reduce and Metadata Kernel with OPUS (#2717)

* replace ck with opus

* fix compile issue

* fix waterfall and use buffer inst for lse.

* Replace ck with opus for mla metadata.

* Add dim=512 fp32 case for wave32

* docs: add ISA-level kernel optimization guide (#2708)

* docs: add ISA-level kernel optimization guide using LLVM tools

Step-by-step guide covering the full LLVM-based workflow for
inspecting, modifying, and recompiling AITER GPU kernel ISA:
disassemble, extract reassemblable .s, round-trip recompile with
binary-identical .text verification, and profile with rocprofv3.

Includes Python extraction script handling branch label word-offset
addressing, llvm-objcopy section swap for preserving kernel metadata,
and rocprofv3 kernel-trace + ATT profiling instructions.

* docs: add ISA optimization code examples and Dockerfile

Runnable companion to the ISA kernel optimization guide:
- extract_asm.py: standalone ASM extraction with CLI interface
- analyze_kernel.py: instruction mix analysis and rocprofv3 profile parser
- roundtrip.sh: end-to-end disassemble/extract/recompile/verify script
- Dockerfile: ROCm 7.2.1 dev environment with all tools pre-installed
  including ATT trace decoder built from source

* style: fix black formatting and ruff lint in ISA optimization examples

- Rename loop var 'l' to 'ln' to fix E741 (ambiguous variable name)
- Remove extraneous f-prefix on strings without placeholders (F541)
- Apply black auto-formatting

* style: fix black formatting for CI compatibility

- Add blank line between module docstring and imports (E302)
- Collapse multiline f-string call arguments

---------

Co-authored-by: Peng Sun <pensun@Pengs-MacBook-Pro.local>

* Add run_config and compare in tuner (#2375)

* Add run_config/compare support to GemmTuner (bf16)

- Add config_env_name, _clear_op_caches(), and run_config() to GemmTuner
  so --run_config and --compare flags work for bf16 GEMM tuning
- Update bubbly-exploring-turtle.md plan doc to reflect the full
  implementation including --compare, config_env_name, cache clearing,
  and post-tune config switching architecture

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add --run_config and --compare benchmark support to all tuners

Add infrastructure in base_tuner.py for production operator benchmarking:
- --run_config: benchmark only, no tuning
- --compare: pre-tune benchmark, tune, post-tune benchmark with comparison table
- Config env switching and cache clearing for post-tune benchmarks

Implement run_config() and _clear_op_caches() in all CK-based tuners:
gemm_a8w8, gemm_a8w8_bpreshuffle, gemm_a8w8_blockscale,
gemm_a8w8_blockscale_bpreshuffle, gemm_a4w4_blockscale,
gemm_moe_2stages, batched_gemm_a8w8, batched_gemm_bf16

* Revert unintended composable_kernel submodule change

* Fix review comments and remove intermediate plan docs

- Save/restore AITER_REBUILD original value instead of hardcoding 0
- Use defensive strip() for mixed-type object columns in _read_csv
- Remove docs/bubbly-exploring-turtle.md and docs/run_config_benchmark.md
  (consolidated into csrc/.claude/add_run_config_to_tuner.md)

* update ref rtol,atol

* Fix tuner cache invalidation, run_config preshuffle, and compare workflow

- Fix _clear_op_caches for all tuners: properly clear lru_cache and
  internal dict/attribute caches (a4w4, a8w8 variants, fmoe) so
  post-tune benchmarks use freshly tuned configs instead of stale ones.
- Fix fmoe run_config: preshuffle weights before calling fused_moe to
  match production layout (tuner always tunes with bpreshuffle=True),
  preventing preshuffle_on/off module mismatch and 99%+ error.
- Add defensive warning in fused_moe get_2stage_cfgs when tuned config
  is found but is_shuffled=False.
- Fix run_config to read shapes from tuned CSV and set config env var.
- Fix --compare workflow: run post-tune benchmark before tune_summary
  to avoid summary errors blocking verification.
- Fix base_tuner _set_config_env_for_run_config return value.
- Use print instead of logger.info for benchmark tables.

* fix format

* fix format

* update readme

* fix lint error

* fix lint

* update csv only when perf improves

* format

* fix lint

* revert format for some files

* clarify compare and gated update flow

Make --compare keep a candidate csv and require --update_improved before writing back tuned results so the CLI stays explicit and easier to extend.

Made-with: Cursor

* fix flydsl GemmTuner review issues

Trigger FlyDSL package validation before importing tuning kernels and align the FlyDSL bias cast order with the runtime path to avoid unintended dtype promotion.

Made-with: Cursor

* update

* revert claude md

* update shape_grouped

* fix format

* fix bug

* fix lint error

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* Revert "fix(car): craph capture err (#2638)" (#2735)

This reverts commit 5759ee2943b5326dec23910135381b426ff65196.

* [TRITON] Swiglu and reduce refactor (#2583)

* Move swiglu to a util file + add optional residual flag

* refactor reduce and make it compatible with >65k tokens

* Update aiter/ops/triton/_triton_kernels/moe/reduce.py

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

---------

Co-authored-by: Lukasz Burzawa <lukasz.burzawa@amd.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* [aiter] type hints mismatch (#2728)

* fix: silence false warning in fmha_fwd_v3 when use_asm_v3 is disabled (#2744)

When `use_asm_v3` is `false`, `fmha_fwd_v3()` correctly returns `-1` to
fall back to the CK path, but it also emits a misleading "unsupported
condition in fwd_v3!!!" warning. This is not an unsupported condition —
the caller intentionally opted out of v3.

Separate the `use_asm_v3` check into an early return without a warning,
so the `AITER_LOG_WARNING` only fires for genuinely unsupported parameter
combinations (wrong head dims, unsupported dtypes, bias, dropout, wrong
arch).

Made-with: Cursor
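A self-contained sketch of the pattern this commit describes; the function name and the support check are hypothetical stand-ins, not the real fmha_fwd_v3 internals:

    def fmha_fwd_v3_dispatch(use_asm_v3: bool, head_dim: int, log_warning=print) -> int:
        if not use_asm_v3:
            return -1  # caller opted out of v3: fall back to the CK path silently
        if head_dim not in (64, 128):  # placeholder for the real supported-parameter checks
            log_warning("unsupported condition in fwd_v3!!!")
            return -1  # genuinely unsupported combination: warn, then fall back
        return 0  # take the ASM v3 path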

* Add FlyDSL GEMM AOT precompile support (#2741)

* feat(aot): add MoE FlyDSL AOT pre-compilation module

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Signed-off-by: zhiding512 <zhimding@amd.com>

* refactor(aot): remove --stage flag from MoE AOT module

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Signed-off-by: zhiding512 <zhimding@amd.com>

* reformat

* feat(aot): integrate FlyDSL MoE AOT precompilation into setup.py

Move moe.py into aiter/aot/flydsl/, support multiple CSV configs,
simplify compile_one_config to use COMPILE_ONLY=1 env var, and add
MoE AOT pre-compilation step to the package build in setup.py.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Signed-off-by: zhimding <zhiming.ding@amd.com>

* feat(aot): FlyDSL MoE AOT with COMPILE_ONLY dummy-tensor precompilation

Rework FlyDSL MoE AOT to use COMPILE_ONLY=1 with dummy tensors instead
of run_kernel, removing all HIP op dependencies (moe_sorting,
shuffle_weight, etc.) from the precompilation path.

Changes:
- Replace _run_kernel with _precompile_to_cache using torch.zeros
  dummy tensors and COMPILE_ONLY=1 for pkl cache generation
- Add sys.modules bridging in setup.py so aiter.jit.core reuses the
  same module instance loaded via sys.path
- Auto-detect bundled flydsl_cache in aiter/__init__.py and set
  FLYDSL_RUNTIME_CACHE_DIR
- Add KeyError to aiter/__init__.py exception handler for robustness
- Support multiple CSV configs (dsv3 + kimik2)
- Remove run_kernel parameter and test_bad_tile logic

Signed-off-by: zhimding <zhiming.ding@amd.com>

* update flydsl

* update flydsl

* adapt hgemm

* fix black

* add flydsl gemm aot precompile support

---------

Signed-off-by: zhiding512 <zhimding@amd.com>
Signed-off-by: zhimding <zhiming.ding@amd.com>
Co-authored-by: Claude Opus 4 <noreply@anthropic.com>

* gather support qk_nope_head_dim != v_head_dim (#2739)

* gather support qk_nope_head_dim != v_head_dim

* fix 192 pad

* feat: add/retune BF16 GEMM configs with FlyDSL backend for 6 models (#2733)

* feat: add/retune BF16 GEMM configs with FlyDSL backend for 6 models

Tuned on MI355X (gfx950) with all backends competing (ASM, hipBLASLt,
Triton, FlyDSL). New tuned configs for Llama 70B, Llama 405B, Qwen 32B.
Re-tuned existing configs for GPT-OSS, DSV3, Kimi-K2 to include FlyDSL.

Backend wins across 708 total shapes:
- hipBLASLt: 472 (66.7%)
- ASM: 131 (18.5%)
- FlyDSL: 70 (9.9%)
- Triton: 7 (1.0%)
- Mixed/other: 28 (4.0%)

* feat: retune BF16 GEMM without hipBLASLt, add GLM-5 and 3 new models

Re-tuned all BF16 GEMM configs on MI355X (gfx950) with --libtype
asm,triton,flydsl (no hipBLASLt). Added GLM-5 (88 shapes from CI log)
and new configs for Llama 70B, Llama 405B, Qwen 32B.

Backend wins across 796 total shapes (7 models):
- ASM: 437 (54.9%)
- FlyDSL: 224 (28.1%)
- Triton: 135 (17.0%)

Per-model breakdown:
- GPT-OSS (57): asm=54, triton=3 (bias=True, no FlyDSL support)
- DSV3 (58): flydsl=22, triton=18, asm=18
- Kimi-K2 (125): asm=77, flydsl=46, triton=2
- GLM-5 (88): asm=42, flydsl=30, triton=16
- Llama 70B (156): asm=84, flydsl=49, triton=23
- Llama 405B (156): asm=89, flydsl=43, triton=24
- Qwen 32B (156): asm=73, triton=49, flydsl=34

Tuning time without hipBLASLt: 4h total (long pole: 405B @ 4h)
vs with hipBLASLt: 10h+ total (long pole: 405B @ 8h+)

* Hoist introspection out of per-call ctype dispatch (#2742)

* Hoist inspect.signature/typing.get_type_hints out of per-call ctypes dispatch

These two introspection calls were recomputed on every invocation of
the ctypes caller closure (~79µs + ~91µs per call). Since the decorated
function's signature and type hints are immutable, compute them once at
decoration time and capture via closure.

Made-with: Cursor

* update ruff format

* update black format

---------

Co-authored-by: amd-ruitang3 <rui.tang2@amd.com>
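A sketch of the hoisting described in #2742, with hypothetical names: the signature and type hints are computed once when the function is decorated and captured by the closure, rather than on every call:

    import functools
    import inspect
    import typing

    def ctypes_dispatch(func):
        sig = inspect.signature(func)        # computed once at decoration time
        hints = typing.get_type_hints(func)  # immutable for a given function

        @functools.wraps(func)
        def caller(*args, **kwargs):
            bound = sig.bind(*args, **kwargs)  # per-call work only binds arguments
            bound.apply_defaults()
            # ... a real dispatcher would convert bound.arguments via `hints`
            # into ctypes values before invoking the C entry point.
            return func(*bound.args, **bound.kwargs)

        return caller

    @ctypes_dispatch
    def scale(x: float, factor: float = 2.0) -> float:
        return x * factor

    print(scale(3.0))  # 6.0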

* Make FlyDSL LDS checks architecture-aware and reduce tuner failure noise (#2732)

* Handle FlyDSL LDS limits and candidate failures

Use shared-memory-per-block queries to keep FlyDSL LDS checks architecture-aware, and surface candidate failures as concise runtime warnings so tuning can continue without noisy tracebacks.

Made-with: Cursor

* Keep tuner topk local per shape

Avoid mutating the shared topk value while post-processing one shape so later shape groups keep the intended candidate limit.

Made-with: Cursor

* fix lint

* Cache FlyDSL shared memory queries

Avoid repeated device property lookups while validating FlyDSL kernel configs by caching the default device selection and shared-memory-per-block queries.

Made-with: Cursor

* fix lint

* Update aiter/ops/flydsl/gemm_tune/flydsl_gemm_a8w8_bpreshuffle_common.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* parse kernel name to select flydsl kernel

* fix black format error

* refine

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: solin <bingzhou@amd.com>

* 16x256 kernel (#2722)

Co-authored-by: Sergey Solo <ssolovye@amd.com>

* rebase linear attn for new flydsl version (#2746)

* Replace unsafe uses of std::unordered_map with SynchronizedCache (#2221)

* Make AiterAsmKernel load hsaco on each GPU it is used on

* Replace unsafe uses of std::unordered_map with SynchronizedCache

* Fix Triton MoE GEMM shared memory exhaustion by reducing stage count (#2723)

* Fix Triton MoE GEMM shared memory exhaustion

- Reduce num_stages in kernel configs
- Lowered LDS usage to avoid shared memory OOR
- Fix triton.runtime.errors.OutOfResources errors in MoE GEMM kernels

* Fix: set num_stages=1 on gfx950 using get_arch() conditionally for gfx950 to ensure no bottlenecks for gfx942

* Add determinism for fused mul add test

* Format fused mul add test with black

* Annotate moe_sorting_dispatch_policy as int for fused_moe (#2639)

The type annotation bool was incorrect for moe_sorting_dispatch_policy, which
accepts int values. The @torch_compile_guard decorator uses these
annotations to generate PyTorch custom op schemas; with bool, PyTorch schema
enforcement casts any value to bool, so dispatch_policy=2 becomes bool(2)=True
(1), silently losing the intended policy. Using int allows callers to set
dispatch_policy=2 correctly.

Fixes: #2576

Signed-off-by: Tres Popp <tres.popp@amd.com>
Co-authored-by: Tres Popp <tres.popp@amd.com>

* fix moe splitk aot and jit (#2738)

* fix moe splitk aot and jit

* split moe aot into several libs based on tuned_moe configs

* update copyrights

* fix typo

* test shuffle as default and fix moe split jit

* Update quant.py: add pack_dim to per_1x32_f4_quant for tl.dot_scaled RHS compatibility (#2704)

* Update quant.py

* Refactor per_1x32_f4_quant function signature

* Fix function definition formatting in quant.py

* Refactor per_1x32_f4_quant_for_dot_scaled definition

* Restore semantic.py to match main branch

* Flydsl align if else usage to be compatible with all versions (#2740)

* Add bf16 MLA decode kernel for gqa_ratio=64, qseqlen=1 (non-persistent) (#2729)

* Add bf16 MLA decode kernel for gqa_ratio=64, qseqlen=1 (non-persistent)

* black mla.py

* fix short kv len split error

* Support final LSE output in non-persistent MLA reduce kernel

* black mla.py

* ruff error

---------

Co-authored-by: minmengdie <memin@amd.com>

* CI: scope check signal artifact downloads to matching run (#2757)

Avoid repo-wide artifact searches in downstream workflows so check-signal can fetch the pre-checks signal without hitting the GitHub Actions API rate limit.

* feat: aiter whls nightly (#2514)

* feat: aiter whls nightly

* fix:

* CI: add docker username input for aiter release workflow (#2535)

* fix:

* fix: address review comments and improve nightly workflow
- Enable promote for PRs (devreleases) after tests pass
- Fix inputs context for schedule/PR events (use github.event.inputs)
- Add set -o pipefail to prevent tee masking test failures
- Add release_type validation in promote workflow
- Guard AWS credentials against fork PRs
- Remove S3 bucket details from summaries, show only index URLs
- Simplify dispatch inputs: single docker_image, python_version, gpu_archs
- Add build_py310/build_py312 controls to aiter-release workflow

* style: fix Black formatting in generate_summary.py

* fix: add tabulate dependency for test_activation.py in wheel tests

* feat: add ci:nightly label trigger with full pipeline including promote

- PR trigger via ci:nightly label runs Build → Test → Promote to nightlies
- release_type set to nightlies for labeled PRs (same as scheduled runs)
- Promotion no longer blocked for PRs, enables end-to-end validation

* fix: keys

* fix: requested

* fix: remove testcode

* Update: move generate_summary.py

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* CI: reduce retention for workflow artifacts (#2758)

Shorten retention periods for high-volume workflow artifacts so routine logs and shard metadata expire sooner while wheel artifacts remain available long enough for downstream consumers.

* CI: upgrade sglang downstream to v0.5.10 (#2753)

* CI: upgrade sglang downstream to v0.5.10 on MI35X/MI355

* CI: keep sglang v0.5.10 upgrade on MI325

* Update sglang_downstream.yaml

* CI: use isolated sglang workspace for downstream test

* fix(car): sglang prefill launch error kernel (#2745)

Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
Co-authored-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>

* Add unified attention support to bench_models (#2724)

Add sink and window support to mha handler

* Use AITER_CONFIGS for FlyDSL AOT defaults (#2756)

Align FlyDSL GEMM and MoE AOT precompile defaults with runtime config resolution, and update the FlyDSL dependency to the required version.


Made-with: Cursor

Signed-off-by: zhimding <zhimding@amd.com>

* fix fused_dynamic_mxfp4_quant_moe_sort_hip err in EP (#2759)

* Fix `test_moe_routing_sigmoid_top1_fused` reference implementation tie-breaking (#2750)

Replace `torch.topk` with `torch.argmax` + `torch.gather` in the reference
implementation. `torch.topk` doesn't guarantee stable indices for tied elements,
and its tie-breaking behaviour changed between ROCm 7.2.0 and 7.2.2, causing 10
tests to fail for `N=128`, where integer inputs produce many tied sigmoid
scores. `torch.argmax` always returns the first (leftmost) maximum, matching
`tl.argmax(..., tie_break_left=True)` in the Triton kernel.
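A small illustration of the tie-breaking difference (values are made up):

    import torch

    scores = torch.tensor([[0.5, 0.9, 0.9, 0.1]])  # two tied maxima at indices 1 and 2

    # torch.topk gives no guarantee which tied index it returns, and its choice
    # has differed across backend versions.
    topk_vals, topk_idx = torch.topk(scores, k=1, dim=-1)

    # argmax always returns the first (leftmost) maximum, matching
    # tl.argmax(..., tie_break_left=True) in the Triton kernel.
    argmax_idx = torch.argmax(scores, dim=-1, keepdim=True)
    argmax_vals = torch.gather(scores, dim=-1, index=argmax_idx)

    print(topk_idx.item(), argmax_idx.item(), argmax_vals.item())  # argmax index is always 1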

* [fix](cache): add eps to avoid dividing by zero (#2763)

Co-authored-by: perzhang <perzhang@amd.com>

* [FlyDSL MOE] update a8w4 moe (#2726)

* update a8w4

* update gptoss csv

* update gptoss_fp4 prefill config

* update files format

* update fix with rws & update a8w4 kimi tune cfgs

* update

* update fix code style

* update code style

* fix ci

* revert fused_moe.py modifies

* revert a16w4

* add instances

* update ut & env var name

* rename files

* fix ci

* remove use_cfg

* remove tile m 16

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* MI350 mla ps mode support nhead128,1 128,2 128,3 128,4 64,4 64,2 32,4 through kernel mla_a16w16_qh32_qseqlen4_gqaratio32_ps.co (#2727)

* MI350 mla ps mode support nhead128,1 128,2 128,3 128,4 64,4 64,2 32,4 through kernel hsa/gfx950/mla/mla_a16w16_qh32_qseqlen4_gqaratio32_ps.co

* del the useless code

* black error

* fix sparse test error

* update config (#2771)

* fix(ck_gemm): fix multi-arch build targeting and kernel dispatch across all CK GEMM modules (#2645)

* chip_info: add GFX_CU_NUM_MAP and get_build_targets()

* aiter/configs: migrate tuned GEMM CSVs to add gfx as first column

* csrc: fix gen_instances.py to filter by (gfx, cu_num) build targets

* aiter/ops: add gfx to runtime GEMM dispatch lookup keys

* aiter/utility: add gfx to GemmCommonTuner key and tune result output

* csrc, gradlib: add gfx to all GEMM tuner output keys

* op_tests: fix is_shape_tuned to filter by (gfx, cu_num)

* fix(configs): resolve model_configs merge conflicts and add gfx column

* op_tests: add CSV input, output saving, and stable iter counts to a8w8 GEMM test scripts

* fix(merge): resolve conflict in gemm_op_a4w4.py after main sync

The merge commit 6a18cd6e7 accidentally preserved conflict markers in
gemm_op_a4w4.py. Apply the gfx-aware dispatch fix (same pattern as
gemm_op_a8w8.py) — use (gfx, cu_num, M, N, K) key when the CSV has a
gfx column, fall back to (cu_num, M, N, K) for old CSVs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(configs): add missing gfx column to dsv3 model_configs overrides

* op_tests: add bpreshuffle-csv entry point and skip_ck flag to test_gemm_a8w8

* op_tests: add gfx filter unit tests and repro CSVs for both GEMM modules

* fix(ck_gemm): key C++ dispatch map by (cu_num,M,N,K) to prevent multi-arch kernel collisions, share build_tune_dict helpers across all 9 CK GEMM modules

* op_tests/configs/gemm_codegen_gfx_filter.csv

* chip_info: split arch constants and env-only build targets into torch-free build_targets.py

* op_tests: fix repro CSV gfx942/304 kernels to be valid for M=1 and M=32

* chip_info: use bare import for build_targets to fix build context ModuleNotFoundError

* docs: add gfx column to tuning CSV examples and update cu_num description in all 8 GEMM READMEs

* lint: apply black formatting and fix ruff violations in modified files

* lint: fix black/ruff violations in csrc gen_instances and gradlib

* fix(gemm_op_a8w8): eliminate StopIteration risk and use AITER_CONFIGS for defaults

* fix(chip_info): guard kernelId/kernelName lookups with .get() to avoid KeyError on malformed CSVs

* fix(base_tuner): add gfx legacy fallback to if branch of get_retune_gemm_list

* docs(test_gemm_codegen): fix comment reference for GFX_CU_NUM_MAP location

* fix(gemm_dispatch_utils): check HIP return codes in get_device_cu_num()

* fix(chip_info): add get_gfx_runtime() and fix GPU_ARCHS=native in get_build_targets()

* chore(configs): sync dsv3/kimik2 bf16 tuned gemm CSVs with main and add gfx column

* fix(op_tests): use get_gfx_runtime() in GEMM test files for correct arch detection

* fix(core): self-heal CSV dedup without requiring a re-run

* fix(chip_info): add shape and arch context to kernelId/kernelName skip warnings

* fix(chip_info): use logger.warning instead of print for kernel skip warnings

* style(chip_info): fix E402 import order after logger initialization

* fix(gemm_dispatch_utils): initialize device to -1 to clarify output-parameter intent

* test(test_gemm_codegen): fix Section 3 runtime dispatch tests to use live GPU

* fix(gemm_op_a8w8): remove duplicate get_gfx_runtime import

* docs(chip_info): fix build_tune_dict docstring for kernels_by_name fallback

* fix(gemm): extend C++ dispatch key with gfx arch string — (cu_num,M,N,K) → (gfx,cu_num,M,N,K)

* style(chip_info, test_gemm_codegen): apply black/ruff formatting

* feat: add PRETUNE_MODULES build flag to auto-tune GEMM shapes on live GPU

* feat(pretune): add run_tune_direct() and CLI for standalone retuning on installed aiter

* refactor(pretune): remove run_tune_direct wrapper, add input validation and dedup to CLI

* fix(pretune): suppress ruff F841/E402 false positives on eval-scope variable and path-dependent import

* refactor(pretune): extract _parse_module_list, fix silent skip of unsupported modules in setup.py path, add deduplication

* docs(pretune, setup, test_gemm_codegen): fix stale docstrings and add missing inline comments

* fix(pretune): write tuned results to source CSV, not ephemeral /tmp; add regression test

* setup.py: import pretune directly to avoid premature aiter package init

* pretune: add warmup API — check_tuning_coverage, warn_if_undertuned, warmup

* pretune: tune only missing model shapes in warmup(), not full CSV

* fix(pretune): remove vLLM-specific env var hint from warmup() warning

* revert: remove warmup API from pretune.py

* fix(tuners): clear module-level CSV caches in _clear_op_caches

* fix(build): add _parse_gpu_archs_env()

* fix(docs/tests): docstring accuracy, test coverage, and gfx-aware dedup

* fix(tests): route aiter logger to stdout in test_pretune to fix warning ordering

* fix(gemm_dispatch_utils): cache cu_num and gfx per device ID via SynchronizedCache

* tuning: use get_gfx_runtime() in tuner imports so live GPU arch is used instead of GPU_ARCHS env

* fix(configs): add missing gfx column to bf16 model_configs CSVs introduced during main merge

* raise error when having duplicate shape entries

* fix(configs): remove duplicate shape entries from a8w8_blockscale_bpreshuffle_tuned_gemm_qwen3.5_397b.csv

* resolve duplicated shapes

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Ying.Zhou2 <Ying.Zhou2@amd.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
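A minimal sketch of the gfx-aware lookup-key fallback described above; the column and helper names are assumptions, not the exact aiter implementation:

    import pandas as pd

    def build_tune_dict(csv_path: str):
        # New tuned-GEMM CSVs carry a leading `gfx` column and are keyed by
        # (gfx, cu_num, M, N, K); old CSVs without it keep the legacy
        # (cu_num, M, N, K) key so existing configs still dispatch.
        df = pd.read_csv(csv_path)
        has_gfx = "gfx" in df.columns
        tuned = {}
        for row in df.itertuples(index=False):
            key = (
                (row.gfx, row.cu_num, row.M, row.N, row.K)
                if has_gfx
                else (row.cu_num, row.M, row.N, row.K)
            )
            tuned[key] = row.kernelName
        return tuned, has_gfx

    # The lookup side mirrors the same fallback:
    # key = (gfx, cu_num, M, N, K) if the CSV had a gfx column, else (cu_num, M, N, K)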

* add triton fallback for mi455 gptoss & dsfp4 (#2657)

* add triton fallback for mi455 gptoss & dsfp4

* rm nk config

* fix rotary

* fix recompile issue (#2777)

* Fix fused_gemm_a8w8_blockscale_a16w16 unit test failures (#2766)

* Fix fused_gemm_a8w8_blockscale_a16w16 unit tests

* fix format with black

* remove guard and keeping num_stages as is - works

* work around invalid kernels in moe tuner (#2785)

* CI: auto-update split test FILE_TIMES (#2795)

* CI: auto-update split test FILE_TIMES

* chore: trigger pull_request checks

---------

Co-authored-by: gyohuangxin <42127654+gyohuangxin@users.noreply.github.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* [fix] remove glm5 triton tuned gemm (#2803)

* remove glm5 triton tuned gemm

* tuned

* add tuned bf16 gemm for qwen3next and qwen3.5 (#2799)

* add qwen3next qwen3.5 bf16 untuned gemm

Signed-off-by: ganyi <ygan@amd.com>

* revert the bf16 template csv

Signed-off-by: ganyi <ygan@amd.com>

---------

Signed-off-by: ganyi <ygan@amd.com>

* [Perf][WIP] update qwen fmoe tuned configs (#2812)

* update qwen fmoe tuned configs

* fix typo

* CI: run SGLang downstream tests on MI355 (#2807)

* CI: run SGLang downstream tests on MI355

Switch the downstream workflow to the MI355 runner and mi35x image path so sglang exercises the gfx950 configuration instead of the MI325/mi30x path.

* Update sglang_downstream.yaml

* Update sglang_downstream.yaml

* CI: allow nightly check-signal to reuse Checks push runs (#2819)

Nightly Aiter Test runs look up pre-check signal artifacts by SHA, but the Checks workflow does not run with event=schedule. Skip the event filter for schedule-triggered workflows so nightly jobs can reuse the matching push-based Checks run instead of failing early in check-signal.

* [fix] fix compare error (#2805)

* fix compare error

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* CI: log in to Docker before building aiter image (#2824)

Authenticate before the build so pulling the base image does not hit registry rate limits. Reuse the same Docker session for the later image push.

* [FLYDSL]: aiter flydsl if const_expr (#2776)

* [FLYDSL]: aiter flydsl if const_expr

* [FLYDSL]: black

* add tuner test suite (#2734)

* add tuner test suite

* Update op_tests/tuning_tests/test_mp_tuner_logic.py

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* add run_config  test

* fix tuned csv

* add missed test

* Update op_tests/tuning_tests/test_run_config.py

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix format

* update test and  tuner

* fix format

* format

* update test

* format

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix lse address compute (#2810)

* Add vcs_versioning as transitive dependency of setuptools_scm>=10 (#2834)

* Revert kimi2.5 fp4 tp4 moe tuned result on bs 32,64 (#2836)

* remove gemm tuning (#2838)

Signed-off-by: ganyi <ygan@amd.com>

* disable import flydsl_gdr_decode (#2840)

* Xiaobing/gptoss small m (#2775)

* add small m hgemm

* update gptos config

* update small_m_hgemm

* clear code

* code format

* fix code format

* remove moe_gemm_2stage.py change

* refactor code

* recover 0 from SMALL_M_B_TO_LDS_WAVES_PER_EU_OPTIONS

* update gemm code

* refactor code

---------

Co-authored-by: zhuyuhua-v <yuhzhu@amd.com>

* [CK_TILE] Fix bpreshuffle compile failure due to stale CShuffleEpilogueProblem arg (#2811)

* [fix] Remove stale TiledMMAPermuteN arg from CShuffleEpilogueProblem

* [fix] Remove stale TiledMMAPermuteN arg from blockscale CShuffleEpilogueProblem

* add static_assert for TiledMMAPermuteN in blockscale and moe_2stages

* update subproject commit for composable_kernel

* update flydsl version (#2802)

* update flydsl version
* fix version err

* CI: fix pre-check signal artifact lookup for PR reruns (#2848)

Resolve check-signal failures when labeled or rerun pull_request jobs get a new GitHub merge SHA. Download the checks-signal artifact from the matched Checks run instead of rebuilding the artifact name from the current workflow SHA.

* [Fix] fix moe tuner (#2831)

* CI: To fix prebuild image issue (#2847)

* Fix a8w8 flydsl tune (#2809)

* use flyc.compile
* add csv-driven mode covering mode_configs/configs a8w8_bpreshuffle csvs
Co-authored-by: solin <bingzhou@amd.com>

* fix other lse addr compute issue (#2833)

* CI: align tuning test workflow with README (#2849)

* CI: add nightly tuning validation workflow

Split the PR #2734 tuning test plan into dedicated nightly CPU and GPU jobs, and make the CPU-only tuner infra tests importable without a ROCm runtime so the new validation workflow can run on ubuntu-latest.

* CI: align tuning test workflow with README

Redo the tuning test PR to match the README commands directly: keep the workflow manual-only, run the documented level01/tune_pipeline/run_config suites from a ROCm container, and drop the unrelated utility-layer changes from the PR scope.

* CI: switch tuning tests to MI35X runner

Run the manual tuning test workflow on the MI35X single-GPU runner and set GPU_ARCHS to gfx950 so the dispatched validation matches the requested hardware target.

* gqa ratio support in mha test (#2798)

* update

* update

* fix odd num head

* ci(release): add manylinux2_28+ROCm builder path with auditwheel gate (#2851)

Adds opt-in manylinux + ROCm wheel build path (use_manylinux=true) using pytorch/manylinux2_28-builder:rocm7.X images, with auditwheel repair + symbol-floor verification step. Default off so existing dispatches are unchanged. Fixes the libstdc++/glibc ABI mismatch tracked in #2843. Validated by test build #24761365935 — wheel passes manylinux_2_28 contract (GLIBCXX max 3.4.22, GLIBC max 2.28).

Co-authored-by: Peng Sun <Peng.Sun@amd.com>

* add asm load kernel log (#2800)

* add asm load kernel log

* update according to review

* update log

* Update gfx942&gfx950 PA PS kernels and write stride_scale_page in asm_pa (#2796)

* clean code and reduce num_stages to 1 for large tile configs in fused_gemm_afp4wfp4_a16w16 to prevent LDS exhaustion with ASYNC_COPY (#2784)

* [TRITON] Add Gluon-optimized MoE Int8 SmoothQuant kernel for small K (#2441)

Move Gluon kernel to _gluon_kernels/gfx942/moe/ directory and refactor
K-dimension unrolling to use gl.static_range + tuple accumulation,
removing the UNROLL_TIMES <= 3 constraint. Raise Gluon dispatch
threshold to M >= 4096 for optimal performance.

Key optimizations:
- Manual LICM: A matrix, x_scale, and gammas pre-loaded outside N loop
- K-dimension loop fully unrolled via gl.static_range at AST level
- Explicit BlockedLayout + MFMA for optimal INT8 register usage
- SUB_BLOCK_SIZE_N=64 inner loop for large BLOCK_N

Co-authored-by: Claude Opus 4 <noreply@anthropic.com>

* [TRITON] Reduce MHA UTs (#2612)

* Reduce tests for mha

* Remove single parametrized functions

* Remove single parametrized functions

* Lint

* Lint

* Address comments

* Fix UTs

* Address test failures

* Address test failures

* Address comments

* Fix CI failures

* Relax test_mha_with_sink bwd tolerance

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Lucas Santos <Lucas.Santos@amd.com>

* Add gfx942 Triton A8W8 blockscale GEMM configs for GLM-5 shapes for 70k in and 300 out (#2743)

Co-authored-by: root <root@chi-mi300x-003.ord.vultr.cpe.ice.amd.com>

* docs: polish README with ecosystem, news, and performance highlights (#2859)

* docs: polish README with ecosystem, news, and performance highlights

- Replace GitHub-hosted image with local aiter_logo.png, center-aligned
- Add CI, Release, Docs, and GitHub stars badges
- Add News section with chronological blog posts and publications
- Add Ecosystem section with framework integration table (vLLM, SGLang,
  ATOM, JAX, and customer proprietary engines)
- Add performance highlights and supported hardware tables
- Refactor operators section to link to op_tests/ directory
- Polish description and feature list

docs: add aiter logo assets for README

* docs: add Opus section, MLA tutorial link, and fix release badge

- Add Opus lightweight C++ template section alongside FlyDSL and Iris
- Add MLA decode kernel tutorial to News section
- Change release badge to show version from shields.io, link to releases page

* docs: update badges and spacing in README header

- Remove stars badge, add last commit badge
- Add spacing between logo and badges

* docs: add v0.1.12.post1 release to News section

* docs: adjust spacing between logo and badges

* docs: widen spacing between logo and badges

* make rmsnorm quant fusion support gemma (#2853)

* gemma rmsnorm quant support

Signed-off-by: ganyi <ygan@amd.com>

* change test

Signed-off-by: ganyi <ygan@amd.com>

---------

Signed-off-by: ganyi <ygan@amd.com>

* ci(release): add torch_pin + torch_index_url inputs for ABI-targeted release builds

The manylinux release path currently calls 'pip install torch' with no
version pin, picking up whatever wheel is latest on the auto-derived
download.pytorch.org/whl/rocmX.Y index at build time. That works for
nightlies but is fragile for tagged releases that need to be ABI-
compatible with a specific downstream container.

Concrete failure: AITER wheels built today against torch 2.11.0+rocm7.2
fail with an ImportError at runtime inside vllm/vllm-openai-rocm:v0.19.1,
which ships a custom torch 2.10 build. The 2.10 -> 2.11 transition changed
the c10::cuda::getCurrentCUDAStream masquerade alias surface; AITER
.so files (e.g. module_gemm_a8w8_blockscale.so) hit the missing
symbol when dlopened in the vllm container.

This change adds two optional, additive workflow inputs (workflow_dispatch
and workflow_call):

  - torch_pin: e.g. '2.10.0+rocm7.1'. Empty = latest (existing behavior).
  - torch_index_url: override the auto-derived index URL. Empty = legacy
    'https://download.pytorch.org/whl/rocmX.Y' from the builder image tag.

Default behavior is unchanged. When both inputs are empty the install
step runs exactly as today.

For v0.1.12.post2 release builds we will dispatch with:
  torch_pin=2.10.0+rocm7.1
  torch_index_url=https://download.pytorch.org/whl/rocm7.1

Validated end-to-end on a side branch: wheel built with these pins
loaded cleanly inside vllm/vllm-openai-rocm:v0.19.1 and ran DSR1
serving on both MI300X and MI355X.

Refs: #2843

* Revert "CI: run SGLang downstream tests on MI355" (#2873)

* Revert "CI: run SGLang downstream tests on MI355 (#2807)"

This reverts commit a3f54e3ce76b40177ac3f1ab53763ea8ee9d2bb9.

* Update sglang_downstream.yaml

* ci: rename aiter MI300X runners (#2872)

Replace the legacy aiter single- and multi-GPU runner labels with the linux-aiter-mi300x equivalents so CI jobs and runner metadata target the new MI300X fleet consistently.

* CI: raise atom-test accuracy step timeout to 90 minutes (#2876)

Align with longer ATOM server startup wait (45m) plus warmup and lm_eval.

* Add torch in gemm a16w16 tune (#2860)

* add torch,skinny gemm in bf16 gemm tuner

* fix format

* fix lint

* CI: run Triton workflow on MI35X only (#2871)

* CI: run Triton workflow on MI35X only

Remove the opt-in MI325 Triton path and its PR label guidance so the Triton workflow always runs on MI35X. Keep the PR welcome comment aligned with the simplified CI behavior.

* MLA PS mode support nhead8,2 in MI308 (#2852)

* MLA PS mode support nhead8,2 in MI308

* nhead8,2 through m16x1_16x4

* delete redundant kernel

* fix the comment

---------

Co-authored-by: Xu, Shengnan <117875955+shengnxu@users.noreply.github.com>

* Dev/aot fix (#2856)

---------

Co-authored-by: coderfeli <felix.li@amd.com>

* Update kimik2 FP4 tuned fMoE config (#2845)

* Update kimik2 FP4 tuned fMoE config with 256 CU tuning results

Made-with: Cursor

* Fix kernel names for kimik2 FP4 token=8 expert=385 topk=9 config entry

---------

Co-authored-by: okorzh-amd <okorzh-amd@users.noreply.github.com>

* Fix sliding window mtp (#2829)

* fix(pa): port sliding window mtp fixes to main

Reapply the sliding-window MTP decode, PS reduce, and KV_BLOCK_SIZE=1024 fixes on top of the latest main so the change can be reviewed and merged from a branch with shared history.

Made-with: Cursor

* fix fallback

* fix(pa): Use FlyDSL PS reduce for sliding-window MTP

Signed-off-by: fsx950223 <fsx950223@outlook.com>
Made-with: Cursor

* fix(pa): Move Gluon decode imports to module top

Signed-off-by: fsx950223 <fsx950223@outlook.com>
Made-with: Cursor

* fix regression

---------

Signed-off-by: fsx950223 <fsx950223@outlook.com>

* [DO NOT MERGE] CI TESTS (#2894)

* CI: Temporarily running the build jobs on linux-aiter-mi300x-1 due to network issues (#2895)

* CI: Fix Aiter installation issues in sglang downstream test (#2897)

* CI: add opt-in MI300X Triton jobs for PRs and main (#2883)

Keep MI35X as the default Triton PR path, add ci:triton-300x to start extra MI300X jobs on PRs, and run both architectures on main. Update the PR welcome comment to document the new label and main-branch behavior.

* Revert "CI: Temporarily running the aiter build on linux-aiter-mi300x-1 due to network issues." (#2899)

* Revert "[DO NOT MERGE] CI TESTS (#2894)"

This reverts commit 0f059acf89220ecbf52919e20798e7e86fb3acbb.

* Update vllm_benchmark.yaml

* add optimized prefill gdn kernels for qwen3_5 (#2686)

* add optimized prefill gdn kernels for qwen3_5

* refine code style

* add ssm_state vk_layout kernels for vllm support

* turn off Triton autotune by default

* Introduce asm fmoe kernels that do not require bf16->fp8 quantization (#2262)

* Introduce asm fmoe kernels that do not require bf16->fp8 quantization

* Update quantization division precision to be closer to IEEE correctness

* Add transpose_scale flag

* Update kernels with fixed s_waitcnt

* Add 16x128 kernel merged with quantization

* Remove legacy code

* Remove one more redundant line

* Revert changes to sub_X_cnt calulations for fmoe asm

* Update the tuner to support x_bf16 kernel

* Update pandas interface

* Fix formatting

* Generate config with the new tuner

* Update base tuner to support merging csv files without strict column matching

* Fix formatting

* Fix core.py to handle columns merge

* Fix formatting

---------

Co-authored-by: Sergey Solo <ssolovye@amd.com>

* Revert "Fix sliding window mtp (#2829)" (#2909)

This reverts commit 6b8ce4ec3423441ad0a2f072d50f498ac711714c.

Co-authored-by: perzhang <perzhang@amd.com>

* fix(fmha): support >4GB KV cache in batch prefill via runtime dispatch (#2893)

* test: expand test_batch_prefill_large_kvcache for >4GB KV cache overflow

Rewrite test_batch_prefill_large_kvcache to validate the per-tile SRD
rebase fix for >4GB KV caches across all page sizes, dtypes, and
attention configurations:

- Add page_size=1 and 16 (page_size < kN0, exercises rebase path)
- Add GQA (16, 8) in addition to MHA (8, 8)
- Add causal masking with CK-compatible attn_mask for SDPA reference
- Use full KV cache (4.5GB) with pages spanning the overflow boundary
- Use torch SDPA as reference (memory-efficient backend, no score
  matrix materialization)
- Add scatter_pages parameter (False only; True for future
  global_load_lds flat addressing)
- Add GPU memory check to skip configs that exceed HBM capacity

Test matrix: 24 cases (3 page_sizes × 2 dtypes × 2 causal × 2 GQA × 1 scatter)

* test: add GPU sync after CK kernel in large_kvcache test

Add torch.cuda.synchronize() after CK kernel launch in
test_batch_prefill_large_kvcache to ensure all async GPU work
completes before memory is freed between tests.

Without this sync, repeated allocate/free cycles of large KV cache
buffers (~20GB) with mixed dtype (bf16→fp8) can trigger GPU page
faults when the HIP memory allocator reuses virtual addresses that
are still referenced by pending async GPU work. The fault manifests
as VM_L2_PROTECTION_FAULT at address 0x0 (NULL), causing GPU reset
and kernel soft lockup.

* feat(fmha): runtime dispatch for >4GB KV cache in batch prefill

Add use_64bit_load to batch prefill traits and runtime overflow detection.
When page_block_size < 128 and max_page_byte_offset > INT32_MAX, dispatch
to the flat 64-bit load kernel variant for correctness.

Also add vectorized KV layout coverage to test_batch_prefill_large_kvcache.

* fix: remove unused k_vector_size variable in large_kvcache test

* fix(mha): improve batch_prefill TORCH_CHECK error message for >4GB KV cache

Include page_size, num_pages, and dtype in the error message when kernel
dispatch fails. Add hint about CDNA3+ GPU requirement when KV cache
exceeds 4GB with page_size < 128.

* test: update scatter_pages comment in large_kvcache test

The comment incorrectly stated scatter_pages=True was "expected to FAIL".
This is no longer true — the flat 64-bit load path handles scattered
pages correctly. Update to describe the test's purpose instead.

* fix(mha): widen batch_prefill 64-bit threshold to total KV bytes

The previous check used (num_total_pages - 1) * batch_stride * element_size
which measures the last-page base offset, missing within-page offsets and
producing an off-by-one at exactly INT32_MAX (the largest representable
SRD voffset). Switch to total KV cache footprint (num_total_pages *
batch_stride * element_size > INT32_MAX) so within-page reads on the last
page are covered, and drop the redundant num_total_pages > 1 guard since
single-page configs trivially fit in 32 bits.

Also unify wording: 4GB → 2GB (INT32_MAX byte offset for SRD voffset),
matching CK's TwoGB convention. The actual hardware bound has always been
2GB; the prior comments were imprecise.

Found during batch prefill template dispatch review.

* docs(mha): unify >2GB wording in batch_prefill error and test

The 4GB number in the TORCH_CHECK error message and the test comment was
imprecise — the actual SRD voffset bound is 2GB (INT32_MAX). Update both
to match the threshold check and CK's TwoGB convention.

Found during batch prefill template dispatch review.

* refactor(mha): drop wrapper-side use_64bit_load; let CK dispatcher decide

The wrapper hardcoded kN0_min = 128 to compute the >2GB KV cache
predicate, which leaked CK tile config into aiter and would silently
break if a new arm with bn0 != 128 were added. The CK auto-generated
dispatcher now decides per-arm using its own compile-time bn0 and
per-dtype kElementBytes, so the wrapper just forwards args.

Remove the `use_64bit_load` runtime field from `mha_batch_prefill_traits`,
the parameter from `get_mha_batch_prefill_traits()`, and the entire
predicate computation block from the dispatcher call site. Bumps CK
submodule to pull in the matching codegen change.

* chore(mha): bump CK + update wrapper wording for kUseGlobalLoad rename

Bumps 3rdparty/composable_kernel to dd8d293ea (refactor(fmha): batch
prefill review polish — assert helper + setter guards) which builds on
the prior 99a3ca9af kUseGlobalLoad rename.

Wrapper-side updates to match:

* csrc/cpp_itfs/mha_fwd_batch_prefill.cu: rename "64-bit-load" wording
  in the per-arm dispatcher comment to "kUseGlobalLoad" so the wrapper
  comment matches the CK-side identifier. Also drops the trailing
  `false /* skip_min_seqlen_q */` argument from the
  get_mha_batch_prefill_traits call to match the upstream CK API
  signature change.

* csrc/py_itfs_ck/mha_batch_prefill_kernels.cu: change the >2GB error
  message from "page_size < 128" to "page_size < kN0" so the diagnostic
  tracks the tile-size constant rather than a magic number.

* op_tests/test_batch_prefill.py (test_batch_prefill_large_kvcache):
  three documentation enhancements with no behavior change —
  - explain why qo_len caps at 128 (causal) / 1024 (non-causal): the
    causal cap is a math-backend cliff for the SDPA reference, not a
    kernel limit;
  - explain that the +256 padding on kv_page_indices is a batch_prefill
    ABI requirement (kernel may speculatively read up to bn0=256
    entries past the last valid page index);
  - expand the torch.cuda.synchronize comment to call out the
    misattribution failure mode and GPU-reset cascade risk.

* test(fmha): parametrize test_batch_prefill_large_kvcache over batch_size {1, 4}

Adds multi-batch coverage to the >2GB KV cache regression test.

The previous single-batch coverage left the kernel's per-sequence SRD
rebase path unexercised: with cu_seqlens_q=[0, qo_len] and kv_indptr=
[0, num_blocks], the kernel never walks the indptr to reposition K/V
SRDs across batch boundaries. After the kUseGlobalLoad rename and the
new positive static_assert(kUseGlobalLoad_) calls in update_physical_pages
and set_page_stride_elements, we want a regression that catches any
boundary-crossing SRD bug -- the failure mode no single-batch test can
detect (one batch correct, others wrong).

batch_size=4 partitions the >2GB page pool across 4 sequences (last
sequence absorbs the remainder), exercising 3 cross-batch SRD transitions.
The SDPA reference is computed per-batch and concatenated; per-iteration
free + empty_cache keeps peak memory at one batch's worth.

Verified on:
  - gfx950 (smci355-gfx950, MI355X): 160 passed, 32 skipped
  - gfx942 (smc300x-clt, MI308X):    160 passed, 32 skipped

Skips are the existing vectorized + page_size=1 incompatibility
(3D tensor layout), now 16 per batch_size value.

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* Fix top_k_per_row_prefill err when batched_token_numm > 4096 (#2901)

* retune a8w8_blockscale_bpreshuffle_gemm for MI308 (#2896)

* revert gptoss tuned config (#2904)

---------

Co-authored-by: zhuyuhua-v <yuhzhu@amd.com>

* mHC: Optimize mhc_pre performance in small M (#2915)

* mHC: Optimize mhc_pre performance in small M (add tileN=16 on mhc_pre_gemm_sqrsum  && add splitk on mhc_pre_big_fuse)

* update test

* fix(mha_bwd): pass independent strides for do in _bwd_preprocess (#2808)

* fix(mha_bwd): pass independent strides for do in _bwd_preprocess

- adding independent `stride_do_b/h/m/k` parameters to `_bwd_preprocess`
  and addressing dO with them;
- passing `*do_strides` (already computed in the wrapper) alongside
  `*o_strides` when launching the preprocess kernel.

* pass independent strides for flash_attn_fused_backward kernel

* fix typo

* Fix batched_gemm_a8w8 correctness for ASYNC_COPY + gfx950 UT (#2867)

* [Triton]: Add MoE a16w4 (#2770)

* [Triton]: Add MoE a16w4

* Fix Black issue

* Removed x scales and improved arch checks in unit test

* Added benchmark

* Fixed PR review issues

* PR review comments fixes

---------

Co-authored-by: Rahul Batra <rahbatra@amd.com>

* test(moe_2stage): add csv-driven mode covering model_configs fmoe csvs (#2825)

---------

Signed-off-by: zhimding <zhiming.ding@amd.com>
Signed-off-by: zhimding <zhimding@amd.com>

* fix mhc device (#2916)

* CI: add standalone OPUS test workflow (#2928)

* CI: log in before starting OPUS test container

Authenticate with the Docker registry before `docker run` so OPUS test jobs can pull the image reliably across runners.

* CI: add docker login before ATOM base image pull (#2931)

* CI: add docker login before ATOM base image pull

Authenticate with the Docker registry before pulling the ATOM base image so the workflow can fetch the container reliably across runners.

* CI: Use build-only-aiter runner to build Triton wheels (#2934)

* fea(car): support custom group device (#2703)

* fea(car): support custom group device

Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>

* [fix]: test script code format

Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>

* fea(car): support multi communication groups entity

Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>

* fea(ag): support multi group allgather

Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>

* fix: test script format

Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>

* fea(car): add reduce_scatter interface

Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>

* fix(car): custom group config

Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>

* fix: custom comm group set method

Signed-off-by: root <root@smci355-ccs-aus-n01-17.cs-aus.dcgpu>

---------

Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
Signed-off-by: root <root@smci355-ccs-aus-n01-17.cs-aus.dcgpu>
Co-authored-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@smci355-ccs-aus-n01-17.cs-aus.dcgpu>

* [FLYDSL]: update version to 0.1.5.dev504, fix bug in if-else dynamic grammar (#2935)

* [CK] Add StreamLLM sink token support for batch_prefill pipeline (#2794)

* [CK] Add StreamLLM sink token support for batch_prefill pipeline

Update CK submodule to ROCm/rocm-libraries#6479 (commit feea6be) which
makes kHasSink_ a real template parameter in fmha_fwd_batch_prefill_traits_,
enabling _sink/_nsink kernel variant codegen for batch_prefill.

AITER-side changes thread sink_size through the full call chain:

Python (aiter/ops/mha.py):
- Add sink_size: int = 0 to mha_batch_prefill_func, _mha_batch_prefill,
  mha_batch_prefill stub, and cmdGenFunc_mha_batch_prefill
- Use has_effective_sink = sink_size > 0 and (causal or has_window_mask)
  for _sink/_nsink module name selection, matching C++ mask logic

C++ interface (csrc/):
- mha_batch_prefill_traits: expose has_sink parameter (was hardcoded false)
- mha_fwd_batch_prefill.cu: derive has_sink = args.sink_size > 0
- mha_batch_prefill_kernels.cu: add int sink_size param, include it in
  mask_identify string, set args.sink_size = mask.sink, zero-initialize
  fmha_batch_prefill_args{} to avoid UB
- PyBind (rocm_ops.hpp) and declaration (torch/mha_batch_prefill.h):
  add sink_size positional parameter after window_size_right

Semantics clarified:
- sink_size: number of first KV tokens always attended (sink phase loop)
- sink_ptr[nhead]: fixed logit for a virtual sink token in softmax
  (null -> -inf, non-null -> user value); independent of sink_size
- window_size_left=L means k in [abs_q-L, abs_q] (L+1 tokens), verified
  via code derivation (block_masking.hpp) and discriminating tests

Verified:
- No-sink paths numerically correct vs torch reference (max_diff<0.004)
- sink_ptr virtual token semantics correct (max_diff<0.004)
- window=1024 + sink_size=4 + sink_ptr: max_diff=0.000488

* [test] Add StreamLLM sink token tests for batch_prefill

Add test_batch_prefill_sink pytest function and supporting helpers
to op_tests/test_batch_prefill.py:

- ref_masked_attention_with_sink: torch reference implementing
  sink semantics (first sink_size KV positions always valid,
  sink_ptr virtual token appended to softmax attention matrix)

- run_batch_prefill_sink: runs both reference and CK kernel,
  compares with get_tolerances() thresholds

- test_batch_prefill_sink: parametrized over:
    batch_size=[1,2], page_size=16, head_dim=128
    num_qo_heads/num_kv_heads=[(8,1),(4,4)] (GQA + MHA)
    qo_len=[32,128]
    (window_left, kv_len)=[(128,512),(1024,2048)]  <- real gap in both
    sink_size=[4,16]
    sink_ptr_value=[None, 0.0, 2.0]
    dtype=[bfloat16, float16]

Verified manually with batch=1 GQA 8/1 window=128 sink=4 ptr=2.0,
batch=2 MHA 4/4 window=1024 sink=16 ptr=None, and sink_ptr=0.0.

* [style] Black format op_tests/test_batch_prefill.py

* [fix] Address Copilot review on batch_prefill sink PR

- aiter/ops/mha.py: add missing sink_size parameter to
  mha_batch_prefill_fake_tensors so its signature matches the real
  mha_batch_prefill op; without this torch.compile / fake-tensor mode
  raises TypeError when invoking the op.

- op_tests/test_batch_prefill.py: vectorize the StreamLLM reference
  mask construction. The previous double Python loop did per-element
  assignment into a CUDA tensor, triggering O(seqlen_q * seqlen_k)
  GPU sync points (262144 at seqlen_k=2048) and would time out the
  parametrized tests. Replace with broadcasted index tensors and a
  single masked_fill_.

* [test] Wire StreamLLM sink scenarios into CI __main__ block

CI invokes test files via `python3 "$file"`, which only runs the
__main__ block; pytest functions like test_batch_prefill_sink were
therefore never exercised in CI.

Add a second __main__ block at the end of the file that runs
run_batch_prefill_sink over a small representative parameter sweep
(2 window/kv combos x 2 sink_ptr settings, sink_size=4, bf16). The
helpers are defined after the original __main__, so a separate
trailing __main__ block is the minimal-blast-radius way to keep
both invocations independent.

* [CK] Bump submodule to include StreamLLM sink for batch_prefill

Update CK submodule from 08792e0b3 to d22aafb48 to pull in
[CK][fmha] Add StreamLLM sink support to batch_prefill pipeline (#6479).
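A vectorized sketch of the StreamLLM reference mask described above, built from broadcasted index tensors and a single masked_fill_; the helper name and the causal alignment are illustrative assumptions, not the exact test code:

    import torch

    def streamllm_mask(seqlen_q, seqlen_k, window_left, sink_size, device="cpu"):
        # A KV position k is attendable by query q if it falls inside the sliding
        # window [q_abs - window_left, q_abs] (window_left + 1 tokens) or is one of
        # the first sink_size "sink" tokens that are always attended.
        q_idx = torch.arange(seqlen_q, device=device).unsqueeze(1)  # (seqlen_q, 1)
        k_idx = torch.arange(seqlen_k, device=device).unsqueeze(0)  # (1, seqlen_k)
        q_abs = q_idx + (seqlen_k - seqlen_q)                       # align queries to absolute KV positions
        in_window = (k_idx >= q_abs - window_left) & (k_idx <= q_abs)
        is_sink = k_idx < sink_size
        mask = torch.zeros(seqlen_q, seqlen_k, device=device)
        mask.masked_fill_(~(in_window | is_sink), float("-inf"))    # one masked_fill_, no Python loops
        return mask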

* Set RNG seed in Triton tests (#2921)

* [quant] refactor hip kernels (#2932)

---------

Signed-off-by: zhiding512 <zhimding@amd.com>
Signed-off-by: zhimding <zhiming.ding@amd.com>
Signed-off-by: Tres Popp <tres.popp@amd.com>
Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
Signed-off-by: zhimding <zhimding@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: fsx950223 <fsx950223@outlook.com>
Signed-off-by: root <root@smci355-ccs-aus-n01-17.cs-aus.dcgpu>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: gyohuangxin <42127654+gyohuangxin@users.noreply.github.com>
Co-authored-by: BingYuan.Zhou <BingYuan.Zhou@amd.com>
Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: Vinayak Gokhale <vinayak.gokhale@amd.com>
Co-authored-by: RuibinCheung <ruibzhan@amd.com>
Co-authored-by: ruanjm <jiming.ruan@amd.com>
Co-authored-by: Peng <peng.sun@amd.com>
Co-authored-by: Peng Sun <pensun@Pengs-MacBook-Pro.local>
Co-authored-by: yzhou103 <Ying.Zhou2@amd.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: TennyWang1223 <Tenny.Wang@amd.com>
Co-authored-by: Nicholas Susanto <nicholas.susanto@amd.com>
Co-authored-by: Lukasz Burzawa <lukasz.burzawa@amd.com>
Co-authored-by: amd-ruitang3 <145657428+amd-ruitang3@users.noreply.github.com>
Co-authored-by: Su Ann Chong <suachong@amd.com>
Co-authored-by: Elton <zhimding@amd.com>
Co-authored-by: Jiayun <jiayyu@amd.com>
Co-authored-by: Stig-Arne Grönroos <sgronroo@amd.com>
Co-authored-by: amd-ruitang3 <rui.tang2@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Sergey Solovyev <sergey.solovyev@amd.com>
Co-authored-by: Sergey Solo <ssolovye@amd.com>
Co-authored-by: Yutao Xu <xytpai@foxmail.com>
Co-authored-by: Dragan Mladjenovic <dragan.mladjenovic@amd.com>
Co-authored-by: Nidal Danial <81209936+nidal567@users.noreply.github.com>
Co-authored-by: Nico Holmberg <nico.holmberg@amd.com>
Co-authored-by: Tres Popp <tres.popp@amd.com>
Co-authored-by: GeisYaO <q1179671016@gmail.com>
Co-authored-by: Xudong Yuan <xudoyuan@amd.com>
Co-authored-by: fangche123 <Fang.Che@amd.com>
Co-authored-by: minmengdie <memin@amd.com>
Co-authored-by: Kiran Thumma <167153338+kiran-thumma@users.noreply.github.com>
Co-authored-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
Co-authored-by: lucas-santos-amd <Lucas.Santos@amd.com>
Co-authored-by: la <46212055+junhaha666@users.noreply.github.com>
Co-authored-by: Bruno Mazzotti <bruno.mazzotti@amd.com>
Co-authored-by: PerryZhang01 <Perry.Zhang@amd.com>
Co-authored-by: perzhang <perzhang@amd.com>
Co-authored-by: Zzz9990 <zanzhang@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: eppaneamd <rleppane@amd.com>
Co-authored-by: HaonanWang98 <hwang@amd.com>
Co-authored-by: Pleaplusone <ygan@amd.com>
Co-authored-by: JaxChen29 <jichen@amd.com>
Co-authored-by: Petr Kurapov <petr.kurapov@gmail.com>
Co-authored-by: XiaobingZhang <xiaobingzhangupc@gmail.com>
Co-authored-by: zhuyuhua-v <yuhzhu@amd.com>
Co-authored-by: Yi DING <andy-ding@outlook.com>
Co-authored-by: Felix Li <felix.li@amd.com>
Co-authored-by: slippedJim <jim.guo@amd.com>
Co-authored-by: jian.wu <jwu10003@amd.com>
Co-authored-by: Pedram Ghazi <pedghazi@amd.com>
Co-authored-by: root <root@chi-mi300x-003.ord.vultr.cpe.ice.amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: Xu, Shengnan <117875955+shengnxu@users.noreply.github.com>
Co-authored-by: Oxana Korzh <oxana.korzh@amd.com>
Co-authored-by: okorzh-amd <okorzh-amd@users.noreply.github.com>
Co-authored-by: who who who <fsx950223@outlook.com>
Co-authored-by: yijin <yijin@amd.com>
Co-authored-by: Jeff Huang <jiaji.huang73@gmail.com>
Co-authored-by: chuanbowang2026 <chuanwan@amd.com>
Co-authored-by: rahulbatra85 <rahulbatra85@gmail.com>
Co-authored-by: Rahul Batra <rahbatra@amd.com>
Co-authored-by: root <root@smci355-ccs-aus-n01-17.cs-aus.dcgpu>
Co-authored-by: Linjun-AMD <Jun.Lin@amd.com>