
fix mhc device #2916

Merged
valarLip merged 1 commit into main from fix_mhc_device on Apr 28, 2026

Conversation

@valarLip
Collaborator

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

valarLip requested review from a team and Copilot on April 25, 2026 15:33
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label            Tests
ci:triton-300x   Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
ci:sglang        SGLang integration tests
ci:atom          ATOM benchmark (DeepSeek-R1 + GPT-OSS)
ci:vllm          vLLM benchmark
ci:all           All of the above

Add labels via the sidebar or gh pr edit 2916 --add-label <label>

Contributor

Copilot AI left a comment

Pull request overview

Fixes mhc_pre intermediate tensor allocations to be created on the same device as the input residual, preventing device-mismatch failures when the global default device is not CUDA.

Changes:

  • Reorders imports in aiter/ops/mhc.py.
  • Allocates out_pad, sqrsum, post_mix, comb_mix, and layer_input with device=residual.device.
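The allocation pattern is small enough to sketch. The snippet below is illustrative only; the helper name, shapes, and dtypes are placeholders rather than the actual mhc_pre internals:

    import torch

    def allocate_intermediates(residual: torch.Tensor, m: int, n: int):
        # Pin every intermediate to the input's device so a CPU global default
        # device (torch.set_default_device("cpu")) cannot pull allocations off the GPU.
        device = residual.device
        out_pad = torch.empty(m, n, dtype=torch.float32, device=device)
        sqrsum = torch.empty(m, dtype=torch.float32, device=device)
        layer_input = torch.empty_like(residual)  # *_like factories already follow residual.device
        return out_pad, sqrsum, layer_input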


Comment thread aiter/ops/mhc.py
Comment on lines +92 to +94
    device = residual.device
    out_pad = torch.empty(
-       selected_splitk, m, (hc_mult3 + 31) // 32 * 32, dtype=dtypes.fp32
+       selected_splitk, m, (hc_mult3 + 31) // 32 * 32, dtype=dtypes.fp32, device=device

Copilot AI Apr 25, 2026


Consider adding a regression test that exercises mhc_pre when the global default device is CPU but inputs are explicitly on CUDA (e.g., torch.set_default_device('cpu') then create residual/fn/hc_scale/hc_base on cuda). This change fixes internal tensor allocations to follow residual.device, but current tests may still pass even if allocations accidentally fall back to the default device.
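A minimal sketch of such a regression test, using the input names from this comment (residual, fn, hc_scale, hc_base); the shapes, dtypes, and exact mhc_pre call signature are placeholders rather than the real aiter API:

    import pytest
    import torch

    import aiter  # assumes an aiter build with ROCm/CUDA support

    @pytest.mark.skipif(not torch.cuda.is_available(), reason="needs a CUDA/ROCm device")
    def test_mhc_pre_ignores_cpu_default_device():
        torch.set_default_device("cpu")  # any allocation that forgets device= now lands on CPU
        m, hidden = 8, 256               # placeholder sizes
        residual = torch.randn(m, hidden, dtype=torch.bfloat16, device="cuda")
        fn = torch.randn(m, hidden, dtype=torch.bfloat16, device="cuda")
        hc_scale = torch.randn(hidden, dtype=torch.float32, device="cuda")
        hc_base = torch.randn(hidden, dtype=torch.float32, device="cuda")

        out = aiter.mhc_pre(residual, fn, hc_scale, hc_base)  # hypothetical call signature
        assert out.device == residual.device  # fails if internals fall back to the default device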

valarLip merged commit 8c27e66 into main on Apr 28, 2026
32 checks passed
valarLip deleted the fix_mhc_device branch on April 28, 2026 03:01
Oseltamivir added a commit to SemiAnalysisAI/InferenceX that referenced this pull request Apr 28, 2026
demonsan added a commit that referenced this pull request Apr 29, 2026
* CI: surface runner-config mapping in AMD CI job monitor (#2711)

* Update runner-config.yml

* CI: surface runner-config mapping in AMD CI job monitor

Load GPU architecture and count from runner-config.yml so the runner fleet summary shows the configured inventory for each label. Trigger the monitor workflow when runner mappings change and install PyYAML for the runner report job.

* CI: auto-update split test FILE_TIMES (#2709)

Co-authored-by: gyohuangxin <42127654+gyohuangxin@users.noreply.github.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* CI: add standalone OPUS test workflow (#2716)

Add a dedicated GitHub Actions workflow for op_tests/opus so OPUS validation runs independently on MI35X and MI325 runners without being mixed into the main Aiter test shards.

* add signature and refine lru cache (#2710)

Co-authored-by: solin <bingzhou@amd.com>

* CI: separate Aiter Test concurrency for push and schedule (#2718)

Split main-branch concurrency by event type so scheduled runs do not block push-triggered validation when a long queued job keeps the nightly workflow open.

* fix split module jit and retune (#2719)

* [TRITON] Prevent NUM_KSPLIT from reducing K dim below GROUP_K (#2661)

* CI: Enable Deepseek ATOM tests on MI35X (#2667)

* CI: Enable Deepseek ATOM tests on MI35X

* CI: Use /models cache for MI35X ATOM DeepSeek test

Route the MI35X DeepSeek job to the runner-local /models cache so it avoids downloading into /run, and make the output artifact name unique now that two DeepSeek variants run in the same workflow.

* CI: Mount /models into MI35X ATOM test container

Pass the runner's shared /models cache into atom_aiter_test so MI35X DeepSeek jobs can use the mounted model path.

* feat: _flash_attn_forward add out args (#2648)

* Replace CK/CK_TILE in MLA Reduce and Metadata Kernel with OPUS (#2717)

* replace ck with opus

* fix compile issue

* fix waterfall and use buffer inst for lse.

* Replace ck with opus for mla metadata.

* Add dim=512 fp32 case for wave32

* docs: add ISA-level kernel optimization guide (#2708)

* docs: add ISA-level kernel optimization guide using LLVM tools

Step-by-step guide covering the full LLVM-based workflow for
inspecting, modifying, and recompiling AITER GPU kernel ISA:
disassemble, extract reassemblable .s, round-trip recompile with
binary-identical .text verification, and profile with rocprofv3.

Includes Python extraction script handling branch label word-offset
addressing, llvm-objcopy section swap for preserving kernel metadata,
and rocprofv3 kernel-trace + ATT profiling instructions.

* docs: add ISA optimization code examples and Dockerfile

Runnable companion to the ISA kernel optimization guide:
- extract_asm.py: standalone ASM extraction with CLI interface
- analyze_kernel.py: instruction mix analysis and rocprofv3 profile parser
- roundtrip.sh: end-to-end disassemble/extract/recompile/verify script
- Dockerfile: ROCm 7.2.1 dev environment with all tools pre-installed
  including ATT trace decoder built from source

* style: fix black formatting and ruff lint in ISA optimization examples

- Rename loop var 'l' to 'ln' to fix E741 (ambiguous variable name)
- Remove extraneous f-prefix on strings without placeholders (F541)
- Apply black auto-formatting

* style: fix black formatting for CI compatibility

- Add blank line between module docstring and imports (E302)
- Collapse multiline f-string call arguments

---------

Co-authored-by: Peng Sun <pensun@Pengs-MacBook-Pro.local>

* Add run_config and compare in tuner (#2375)

* Add run_config/compare support to GemmTuner (bf16)

- Add config_env_name, _clear_op_caches(), and run_config() to GemmTuner
  so --run_config and --compare flags work for bf16 GEMM tuning
- Update bubbly-exploring-turtle.md plan doc to reflect the full
  implementation including --compare, config_env_name, cache clearing,
  and post-tune config switching architecture

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add --run_config and --compare benchmark support to all tuners

Add infrastructure in base_tuner.py for production operator benchmarking:
- --run_config: benchmark only, no tuning
- --compare: pre-tune benchmark, tune, post-tune benchmark with comparison table
- Config env switching and cache clearing for post-tune benchmarks

Implement run_config() and _clear_op_caches() in all CK-based tuners:
gemm_a8w8, gemm_a8w8_bpreshuffle, gemm_a8w8_blockscale,
gemm_a8w8_blockscale_bpreshuffle, gemm_a4w4_blockscale,
gemm_moe_2stages, batched_gemm_a8w8, batched_gemm_bf16

* Revert unintended composable_kernel submodule change

* Fix review comments and remove intermediate plan docs

- Save/restore AITER_REBUILD original value instead of hardcoding 0
- Use defensive strip() for mixed-type object columns in _read_csv
- Remove docs/bubbly-exploring-turtle.md and docs/run_config_benchmark.md
  (consolidated into csrc/.claude/add_run_config_to_tuner.md)

* update ref rtol,atol

* Fix tuner cache invalidation, run_config preshuffle, and compare workflow

- Fix _clear_op_caches for all tuners: properly clear lru_cache and
  internal dict/attribute caches (a4w4, a8w8 variants, fmoe) so
  post-tune benchmarks use freshly tuned configs instead of stale ones.
- Fix fmoe run_config: preshuffle weights before calling fused_moe to
  match production layout (tuner always tunes with bpreshuffle=True),
  preventing preshuffle_on/off module mismatch and 99%+ error.
- Add defensive warning in fused_moe get_2stage_cfgs when tuned config
  is found but is_shuffled=False.
- Fix run_config to read shapes from tuned CSV and set config env var.
- Fix --compare workflow: run post-tune benchmark before tune_summary
  to avoid summary errors blocking verification.
- Fix base_tuner _set_config_env_for_run_config return value.
- Use print instead of logger.info for benchmark tables.

* fix format

* fix format

* update readme

* fix lint error

* fix lint

* update csv only when perf improves

* format

* fix lint

* revert format for some files

* clarify compare and gated update flow

Make --compare keep a candidate csv and require --update_improved before writing back tuned results so the CLI stays explicit and easier to extend.

Made-with: Cursor

* fix flydsl GemmTuner review issues

Trigger FlyDSL package validation before importing tuning kernels and align the FlyDSL bias cast order with the runtime path to avoid unintended dtype promotion.

Made-with: Cursor

* update

* revert claude md

* update shape_grouped

* fix format

* fix bug

* fix lint error

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* Revert "fix(car): craph capture err (#2638)" (#2735)

This reverts commit 5759ee2943b5326dec23910135381b426ff65196.

* [TRITON] Swiglu and reduce refactor (#2583)

* Move swiglu to a util file + add optional residual flag

* refactor reduce and make it compatible with >65k tokens

* Update aiter/ops/triton/_triton_kernels/moe/reduce.py

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

---------

Co-authored-by: Lukasz Burzawa <lukasz.burzawa@amd.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* [aiter] type hints mismatch (#2728)

* fix: silence false warning in fmha_fwd_v3 when use_asm_v3 is disabled (#2744)

When `use_asm_v3` is `false`, `fmha_fwd_v3()` correctly returns `-1` to
fall back to the CK path, but it also emits a misleading "unsupported
condition in fwd_v3!!!" warning. This is not an unsupported condition —
the caller intentionally opted out of v3.

Separate the `use_asm_v3` check into an early return without a warning,
so the `AITER_LOG_WARNING` only fires for genuinely unsupported parameter
combinations (wrong head dims, unsupported dtypes, bias, dropout, wrong
arch).

Made-with: Cursor
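A self-contained sketch of the pattern this commit describes; the function name and the support check are hypothetical stand-ins, not the real fmha_fwd_v3 internals:

    def fmha_fwd_v3_dispatch(use_asm_v3: bool, head_dim: int, log_warning=print) -> int:
        if not use_asm_v3:
            return -1  # caller opted out of v3: fall back to the CK path silently
        if head_dim not in (64, 128):  # placeholder for the real supported-parameter checks
            log_warning("unsupported condition in fwd_v3!!!")
            return -1  # genuinely unsupported combination: warn, then fall back
        return 0  # take the ASM v3 path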

* Add FlyDSL GEMM AOT precompile support (#2741)

* feat(aot): add MoE FlyDSL AOT pre-compilation module

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Signed-off-by: zhiding512 <zhimding@amd.com>

* refactor(aot): remove --stage flag from MoE AOT module

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Signed-off-by: zhiding512 <zhimding@amd.com>

* reformat

* feat(aot): integrate FlyDSL MoE AOT precompilation into setup.py

Move moe.py into aiter/aot/flydsl/, support multiple CSV configs,
simplify compile_one_config to use COMPILE_ONLY=1 env var, and add
MoE AOT pre-compilation step to the package build in setup.py.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Signed-off-by: zhimding <zhiming.ding@amd.com>

* feat(aot): FlyDSL MoE AOT with COMPILE_ONLY dummy-tensor precompilation

Rework FlyDSL MoE AOT to use COMPILE_ONLY=1 with dummy tensors instead
of run_kernel, removing all HIP op dependencies (moe_sorting,
shuffle_weight, etc.) from the precompilation path.

Changes:
- Replace _run_kernel with _precompile_to_cache using torch.zeros
  dummy tensors and COMPILE_ONLY=1 for pkl cache generation
- Add sys.modules bridging in setup.py so aiter.jit.core reuses the
  same module instance loaded via sys.path
- Auto-detect bundled flydsl_cache in aiter/__init__.py and set
  FLYDSL_RUNTIME_CACHE_DIR
- Add KeyError to aiter/__init__.py exception handler for robustness
- Support multiple CSV configs (dsv3 + kimik2)
- Remove run_kernel parameter and test_bad_tile logic

Signed-off-by: zhimding <zhiming.ding@amd.com>

* update flydsl

* update flydsl

* adapt hgemm

* fix black

* add flydsl gemm aot precompile support

---------

Signed-off-by: zhiding512 <zhimding@amd.com>
Signed-off-by: zhimding <zhiming.ding@amd.com>
Co-authored-by: Claude Opus 4 <noreply@anthropic.com>

* gather support qk_nope_head_dim != v_head_dim (#2739)

* gather support qk_nope_head_dim != v_head_dim

* fix 192 pad

* feat: add/retune BF16 GEMM configs with FlyDSL backend for 6 models (#2733)

* feat: add/retune BF16 GEMM configs with FlyDSL backend for 6 models

Tuned on MI355X (gfx950) with all backends competing (ASM, hipBLASLt,
Triton, FlyDSL). New tuned configs for Llama 70B, Llama 405B, Qwen 32B.
Re-tuned existing configs for GPT-OSS, DSV3, Kimi-K2 to include FlyDSL.

Backend wins across 708 total shapes:
- hipBLASLt: 472 (66.7%)
- ASM: 131 (18.5%)
- FlyDSL: 70 (9.9%)
- Triton: 7 (1.0%)
- Mixed/other: 28 (4.0%)

* feat: retune BF16 GEMM without hipBLASLt, add GLM-5 and 3 new models

Re-tuned all BF16 GEMM configs on MI355X (gfx950) with --libtype
asm,triton,flydsl (no hipBLASLt). Added GLM-5 (88 shapes from CI log)
and new configs for Llama 70B, Llama 405B, Qwen 32B.

Backend wins across 796 total shapes (7 models):
- ASM: 437 (54.9%)
- FlyDSL: 224 (28.1%)
- Triton: 135 (17.0%)

Per-model breakdown:
- GPT-OSS (57): asm=54, triton=3 (bias=True, no FlyDSL support)
- DSV3 (58): flydsl=22, triton=18, asm=18
- Kimi-K2 (125): asm=77, flydsl=46, triton=2
- GLM-5 (88): asm=42, flydsl=30, triton=16
- Llama 70B (156): asm=84, flydsl=49, triton=23
- Llama 405B (156): asm=89, flydsl=43, triton=24
- Qwen 32B (156): asm=73, triton=49, flydsl=34

Tuning time without hipBLASLt: 4h total (long pole: 405B @ 4h)
vs with hipBLASLt: 10h+ total (long pole: 405B @ 8h+)

* Hoist introspection out of per-call ctype dispatch (#2742)

* Hoist inspect.signature/typing.get_type_hints out of per-call ctypes dispatch

These two introspection calls were recomputed on every invocation of
the ctypes caller closure (~79µs + ~91µs per call). Since the decorated
function's signature and type hints are immutable, compute them once at
decoration time and capture via closure.

Made-with: Cursor

* update ruff format

* update black format

---------

Co-authored-by: amd-ruitang3 <rui.tang2@amd.com>
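A sketch of the hoisting described in #2742, with hypothetical names: the signature and type hints are computed once when the function is decorated and captured by the closure, rather than on every call:

    import functools
    import inspect
    import typing

    def ctypes_dispatch(func):
        sig = inspect.signature(func)        # computed once at decoration time
        hints = typing.get_type_hints(func)  # immutable for a given function

        @functools.wraps(func)
        def caller(*args, **kwargs):
            bound = sig.bind(*args, **kwargs)  # per-call work only binds arguments
            bound.apply_defaults()
            # ... a real dispatcher would convert bound.arguments via `hints`
            # into ctypes values before invoking the C entry point.
            return func(*bound.args, **bound.kwargs)

        return caller

    @ctypes_dispatch
    def scale(x: float, factor: float = 2.0) -> float:
        return x * factor

    print(scale(3.0))  # 6.0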

* Make FlyDSL LDS checks architecture-aware and reduce tuner failure noise (#2732)

* Handle FlyDSL LDS limits and candidate failures

Use shared-memory-per-block queries to keep FlyDSL LDS checks architecture-aware, and surface candidate failures as concise runtime warnings so tuning can continue without noisy tracebacks.

Made-with: Cursor

* Keep tuner topk local per shape

Avoid mutating the shared topk value while post-processing one shape so later shape groups keep the intended candidate limit.

Made-with: Cursor

* fix lint

* Cache FlyDSL shared memory queries

Avoid repeated device property lookups while validating FlyDSL kernel configs by caching the default device selection and shared-memory-per-block queries.

Made-with: Cursor

* fix lint

* Update aiter/ops/flydsl/gemm_tune/flydsl_gemm_a8w8_bpreshuffle_common.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* parse kernel name to select flydsl kernel

* fix black format error

* refine

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: solin <bingzhou@amd.com>

* 16x256 kernel (#2722)

Co-authored-by: Sergey Solo <ssolovye@amd.com>

* rebase linear attn for new flydsl version (#2746)

* Replace unsafe uses of std::unordered_map with SynchronizedCache (#2221)

* Make AiterAsmKernel load hsaco on each GPU it is used on

* Replace unsafe uses of std::unordered_map with SynchronizedCache

* Fix Triton MoE GEMM shared memory exhaustion by reducing stage count (#2723)

* Fix Triton MoE GEMM shared memory exhaustion

- Reduce num_stages in kernel configs
- Lowered LDS usage to avoid shared memory OOR
- Fix triton.runtime.errors.OutOfResources errors in MoE GEMM kernels

* Fix: set num_stages=1 on gfx950 using get_arch() conditionally for gfx950 to ensure no bottlenecks for gfx942

* Add determinism for fused mul add test

* Format fused mul add test with black

* Annotate moe_sorting_dispatch_policy as int for fused_moe (#2639)

The type annotation bool was incorrect for moe_sorting_dispatch_policy, which
accepts int values. The @torch_compile_guard decorator uses these
annotations to generate PyTorch custom op schemas; with bool, PyTorch schema
enforcement casts any value to bool, so dispatch_policy=2 becomes bool(2)=True
(1), silently losing the intended policy. Using int allows callers to set
dispatch_policy=2 correctly.

Fixes: #2576

Signed-off-by: Tres Popp <tres.popp@amd.com>
Co-authored-by: Tres Popp <tres.popp@amd.com>

* fix moe splitk aot and jit (#2738)

* fix moe splitk aot and jit

* split moe aot into several libs based on tuned_moe configs

* update copyrights

* fix typo

* test shuffle as default and fix moe split jit

* Update quant.py: add pack_dim to per_1x32_f4_quant for tl.dot_scaled RHS compatibility (#2704)

* Update quant.py

* Refactor per_1x32_f4_quant function signature

* Fix function definition formatting in quant.py

* Refactor per_1x32_f4_quant_for_dot_scaled definition

* Restore semantic.py to match main branch

* Flydsl align if else usage to be compatible with all versions (#2740)

* Add bf16 MLA decode kernel for gqa_ratio=64, qseqlen=1 (non-persistent) (#2729)

* Add bf16 MLA decode kernel for gqa_ratio=64, qseqlen=1 (non-persistent)

* black mla.py

* fix short kv len split error

* Support final LSE output in non-persistent MLA reduce kernel

* black mla.py

* ruff error

---------

Co-authored-by: minmengdie <memin@amd.com>

* CI: scope check signal artifact downloads to matching run (#2757)

Avoid repo-wide artifact searches in downstream workflows so check-signal can fetch the pre-checks signal without hitting the GitHub Actions API rate limit.

* feat: aiter whls nightly (#2514)

* feat: aiter whls nightly

* fix:

* CI: add docker username input for aiter release workflow (#2535)

* fix:

* fix: address review comments and improve nightly workflow
- Enable promote for PRs (devreleases) after tests pass
- Fix inputs context for schedule/PR events (use github.event.inputs)
- Add set -o pipefail to prevent tee masking test failures
- Add release_type validation in promote workflow
- Guard AWS credentials against fork PRs
- Remove S3 bucket details from summaries, show only index URLs
- Simplify dispatch inputs: single docker_image, python_version, gpu_archs
- Add build_py310/build_py312 controls to aiter-release workflow

* style: fix Black formatting in generate_summary.py

* fix: add tabulate dependency for test_activation.py in wheel tests

* feat: add ci:nightly label trigger with full pipeline including promote

- PR trigger via ci:nightly label runs Build → Test → Promote to nightlies
- release_type set to nightlies for labeled PRs (same as scheduled runs)
- Promotion no longer blocked for PRs, enables end-to-end validation

* fix: keys

* fix: requested

* fix: remove testcode

* Update: move generate_summary.py

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* CI: reduce retention for workflow artifacts (#2758)

Shorten retention periods for high-volume workflow artifacts so routine logs and shard metadata expire sooner while wheel artifacts remain available long enough for downstream consumers.

* CI: upgrade sglang downstream to v0.5.10 (#2753)

* CI: upgrade sglang downstream to v0.5.10 on MI35X/MI355

* CI: keep sglang v0.5.10 upgrade on MI325

* Update sglang_downstream.yaml

* CI: use isolated sglang workspace for downstream test

* fix(car): sglang prefill launch error kernel (#2745)

Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
Co-authored-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>

* Add unified attention support to bench_models (#2724)

Add sink and window support to mha handler

* Use AITER_CONFIGS for FlyDSL AOT defaults (#2756)

Align FlyDSL GEMM and MoE AOT precompile defaults with runtime config resolution, and update the FlyDSL dependency to the required version.


Made-with: Cursor

Signed-off-by: zhimding <zhimding@amd.com>

* fix fused_dynamic_mxfp4_quant_moe_sort_hip err in EP (#2759)

* Fix `test_moe_routing_sigmoid_top1_fused` reference implementation tie-breaking (#2750)

Replace `torch.topk` with `torch.argmax` + `torch.gather` in the reference
implementation. `torch.topk` doesn't guarantee stable indices for tied elements,
and its tie-breaking behaviour changed between ROCm 7.2.0 and 7.2.2, causing 10
tests to fail for `N=128`, where integer inputs produce many tied sigmoid
scores. `torch.argmax` always returns the first (leftmost) maximum, matching
`tl.argmax(..., tie_break_left=True)` in the Triton kernel.
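A small illustration of the tie-breaking difference (values are made up):

    import torch

    scores = torch.tensor([[0.5, 0.9, 0.9, 0.1]])  # two tied maxima at indices 1 and 2

    # torch.topk gives no guarantee which tied index it returns, and its choice
    # has differed across backend versions.
    topk_vals, topk_idx = torch.topk(scores, k=1, dim=-1)

    # argmax always returns the first (leftmost) maximum, matching
    # tl.argmax(..., tie_break_left=True) in the Triton kernel.
    argmax_idx = torch.argmax(scores, dim=-1, keepdim=True)
    argmax_vals = torch.gather(scores, dim=-1, index=argmax_idx)

    print(topk_idx.item(), argmax_idx.item(), argmax_vals.item())  # argmax index is always 1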

* [fix](cache): add eps to avoid dividing by zero (#2763)

Co-authored-by: perzhang <perzhang@amd.com>

* [FlyDSL MOE] update a8w4 moe (#2726)

* update a8w4

* update gptoss csv

* update gptoss_fp4 prefill config

* update files format

* update fix with rws & update a8w4 kimi tune cfgs

* update

* update fix code style

* update code style

* fix ci

* revert fused_moe.py modifies

* revert a16w4

* add instances

* update ut & env var name

* rename files

* fix ci

* remove use_cfg

* remove tile m 16

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* MI350 mla ps mode support nhead128,1 128,2 128,3 128,4 64,4 64,2 32,4 through kernel mla_a16w16_qh32_qseqlen4_gqaratio32_ps.co (#2727)

* MI350 mla ps mode support nhead128,1 128,2 128,3 128,4 64,4 64,2 32,4 through kernel hsa/gfx950/mla/mla_a16w16_qh32_qseqlen4_gqaratio32_ps.co

* del the useless code

* black error

* fix sparse test error

* update config (#2771)

* fix(ck_gemm): fix multi-arch build targeting and kernel dispatch across all CK GEMM modules (#2645)

* chip_info: add GFX_CU_NUM_MAP and get_build_targets()

* aiter/configs: migrate tuned GEMM CSVs to add gfx as first column

* csrc: fix gen_instances.py to filter by (gfx, cu_num) build targets

* aiter/ops: add gfx to runtime GEMM dispatch lookup keys

* aiter/utility: add gfx to GemmCommonTuner key and tune result output

* csrc, gradlib: add gfx to all GEMM tuner output keys

* op_tests: fix is_shape_tuned to filter by (gfx, cu_num)

* fix(configs): resolve model_configs merge conflicts and add gfx column

* op_tests: add CSV input, output saving, and stable iter counts to a8w8 GEMM test scripts

* fix(merge): resolve conflict in gemm_op_a4w4.py after main sync

The merge commit 6a18cd6e7 accidentally preserved conflict markers in
gemm_op_a4w4.py. Apply the gfx-aware dispatch fix (same pattern as
gemm_op_a8w8.py) — use (gfx, cu_num, M, N, K) key when the CSV has a
gfx column, fall back to (cu_num, M, N, K) for old CSVs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(configs): add missing gfx column to dsv3 model_configs overrides

* op_tests: add bpreshuffle-csv entry point and skip_ck flag to test_gemm_a8w8

* op_tests: add gfx filter unit tests and repro CSVs for both GEMM modules

* fix(ck_gemm): key C++ dispatch map by (cu_num,M,N,K) to prevent multi-arch kernel collisions, share build_tune_dict helpers across all 9 CK GEMM modules

* op_tests/configs/gemm_codegen_gfx_filter.csv

* chip_info: split arch constants and env-only build targets into torch-free build_targets.py

* op_tests: fix repro CSV gfx942/304 kernels to be valid for M=1 and M=32

* chip_info: use bare import for build_targets to fix build context ModuleNotFoundError

* docs: add gfx column to tuning CSV examples and update cu_num description in all 8 GEMM READMEs

* lint: apply black formatting and fix ruff violations in modified files

* lint: fix black/ruff violations in csrc gen_instances and gradlib

* fix(gemm_op_a8w8): eliminate StopIteration risk and use AITER_CONFIGS for defaults

* fix(chip_info): guard kernelId/kernelName lookups with .get() to avoid KeyError on malformed CSVs

* fix(base_tuner): add gfx legacy fallback to if branch of get_retune_gemm_list

* docs(test_gemm_codegen): fix comment reference for GFX_CU_NUM_MAP location

* fix(gemm_dispatch_utils): check HIP return codes in get_device_cu_num()

* fix(chip_info): add get_gfx_runtime() and fix GPU_ARCHS=native in get_build_targets()

* chore(configs): sync dsv3/kimik2 bf16 tuned gemm CSVs with main and add gfx column

* fix(op_tests): use get_gfx_runtime() in GEMM test files for correct arch detection

* fix(core): self-heal CSV dedup without requiring a re-run

* fix(chip_info): add shape and arch context to kernelId/kernelName skip warnings

* fix(chip_info): use logger.warning instead of print for kernel skip warnings

* style(chip_info): fix E402 import order after logger initialization

* fix(gemm_dispatch_utils): initialize device to -1 to clarify output-parameter intent

* test(test_gemm_codegen): fix Section 3 runtime dispatch tests to use live GPU

* fix(gemm_op_a8w8): remove duplicate get_gfx_runtime import

* docs(chip_info): fix build_tune_dict docstring for kernels_by_name fallback

* fix(gemm): extend C++ dispatch key with gfx arch string — (cu_num,M,N,K) → (gfx,cu_num,M,N,K)

* style(chip_info, test_gemm_codegen): apply black/ruff formatting

* feat: add PRETUNE_MODULES build flag to auto-tune GEMM shapes on live GPU

* feat(pretune): add run_tune_direct() and CLI for standalone retuning on installed aiter

* refactor(pretune): remove run_tune_direct wrapper, add input validation and dedup to CLI

* fix(pretune): suppress ruff F841/E402 false positives on eval-scope variable and path-dependent import

* refactor(pretune): extract _parse_module_list, fix silent skip of unsupported modules in setup.py path, add deduplication

* docs(pretune, setup, test_gemm_codegen): fix stale docstrings and add missing inline comments

* fix(pretune): write tuned results to source CSV, not ephemeral /tmp; add regression test

* setup.py: import pretune directly to avoid premature aiter package init

* pretune: add warmup API — check_tuning_coverage, warn_if_undertuned, warmup

* pretune: tune only missing model shapes in warmup(), not full CSV

* fix(pretune): remove vLLM-specific env var hint from warmup() warning

* revert: remove warmup API from pretune.py

* fix(tuners): clear module-level CSV caches in _clear_op_caches

* fix(build): add _parse_gpu_archs_env()

* fix(docs/tests): docstring accuracy, test coverage, and gfx-aware dedup

* fix(tests): route aiter logger to stdout in test_pretune to fix warning ordering

* fix(gemm_dispatch_utils): cache cu_num and gfx per device ID via SynchronizedCache

* tuning: use get_gfx_runtime() in tuner imports so live GPU arch is used instead of GPU_ARCHS env

* fix(configs): add missing gfx column to bf16 model_configs CSVs introduced during main merge

* raise error when having duplicate shape entries

* fix(configs): remove duplicate shape entries from a8w8_blockscale_bpreshuffle_tuned_gemm_qwen3.5_397b.csv

* resolve duplicated shapes

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Ying.Zhou2 <Ying.Zhou2@amd.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
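A minimal sketch of the gfx-aware lookup-key fallback described above; the column and helper names are assumptions, not the exact aiter implementation:

    import pandas as pd

    def build_tune_dict(csv_path: str):
        # New tuned-GEMM CSVs carry a leading `gfx` column and are keyed by
        # (gfx, cu_num, M, N, K); old CSVs without it keep the legacy
        # (cu_num, M, N, K) key so existing configs still dispatch.
        df = pd.read_csv(csv_path)
        has_gfx = "gfx" in df.columns
        tuned = {}
        for row in df.itertuples(index=False):
            key = (
                (row.gfx, row.cu_num, row.M, row.N, row.K)
                if has_gfx
                else (row.cu_num, row.M, row.N, row.K)
            )
            tuned[key] = row.kernelName
        return tuned, has_gfx

    # The lookup side mirrors the same fallback:
    # key = (gfx, cu_num, M, N, K) if the CSV had a gfx column, else (cu_num, M, N, K)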

* add triton fallback for mi455 gptoss & dsfp4 (#2657)

* add triton fallback for mi455 gptoss & dsfp4

* rm nk config

* fix rotary

* fix recompile issue (#2777)

* Fix fused_gemm_a8w8_blockscale_a16w16 unit test failures (#2766)

* Fix fused_gemm_a8w8_blockscale_a16w16 unit tests

* fix format with black

* remove guard and keeping num_stages as is - works

* work around invalid kernels in moe tuner (#2785)

* CI: auto-update split test FILE_TIMES (#2795)

* CI: auto-update split test FILE_TIMES

* chore: trigger pull_request checks

---------

Co-authored-by: gyohuangxin <42127654+gyohuangxin@users.noreply.github.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* [fix] remove glm5 triton tuned gemm (#2803)

* remove glm5 triton tuned gemm

* tuned

* add tuned bf16 gemm for qwen3next and qwen3.5 (#2799)

* add qwen3next qwen3.5 bf16 untuned gemm

Signed-off-by: ganyi <ygan@amd.com>

* revert the bf16 template csv

Signed-off-by: ganyi <ygan@amd.com>

---------

Signed-off-by: ganyi <ygan@amd.com>

* [Perf][WIP] update qwen fmoe tuned configs (#2812)

* update qwen fmoe tuned configs

* fix typo

* CI: run SGLang downstream tests on MI355 (#2807)

* CI: run SGLang downstream tests on MI355

Switch the downstream workflow to the MI355 runner and mi35x image path so sglang exercises the gfx950 configuration instead of the MI325/mi30x path.

* Update sglang_downstream.yaml

* Update sglang_downstream.yaml

* CI: allow nightly check-signal to reuse Checks push runs (#2819)

Nightly Aiter Test runs look up pre-check signal artifacts by SHA, but the Checks workflow does not run with event=schedule. Skip the event filter for schedule-triggered workflows so nightly jobs can reuse the matching push-based Checks run instead of failing early in check-signal.

* [fix] fix compare error (#2805)

* fix compare error

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* CI: log in to Docker before building aiter image (#2824)

Authenticate before the build so pulling the base image does not hit registry rate limits. Reuse the same Docker session for the later image push.

* [FLYDSL]: aiter flydsl if const_expr (#2776)

* [FLYDSL]: aiter flydsl if const_expr

* [FLYDSL]: black

* add tuner test suite (#2734)

* add tuner test suite

* Update op_tests/tuning_tests/test_mp_tuner_logic.py

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* add run_config  test

* fix tuned csv

* add missed test

* Update op_tests/tuning_tests/test_run_config.py

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix format

* update test and  tuner

* fix format

* format

* update test

* format

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix lse address compute (#2810)

* Add vcs_versioning as transitive dependency of setuptools_scm>=10 (#2834)

* Revert kimi2.5 fp4 tp4 moe tuned result on bs 32,64 (#2836)

* remove gemm tuning (#2838)

Signed-off-by: ganyi <ygan@amd.com>

* disable import flydsl_gdr_decode (#2840)

* Xiaobing/gptoss small m (#2775)

* add small m hgemm

* update gptos config

* update small_m_hgemm

* clear code

* code format

* fix code format

* remove moe_gemm_2stage.py change

* refactor code

* recover 0 from SMALL_M_B_TO_LDS_WAVES_PER_EU_OPTIONS

* update gemm code

* refactor code

---------

Co-authored-by: zhuyuhua-v <yuhzhu@amd.com>

* [CK_TILE] Fix bpreshuffle compile failure due to stale CShuffleEpilogueProblem arg (#2811)

* [fix] Remove stale TiledMMAPermuteN arg from CShuffleEpilogueProblem

* [fix] Remove stale TiledMMAPermuteN arg from blockscale CShuffleEpilogueProblem

* add static_assert for TiledMMAPermuteN in blockscale and moe_2stages

* update subproject commit for composable_kernel

* update flydsl version (#2802)

* update flydsl version
* fix version err

* CI: fix pre-check signal artifact lookup for PR reruns (#2848)

Resolve check-signal failures when labeled or rerun pull_request jobs get a new GitHub merge SHA. Download the checks-signal artifact from the matched Checks run instead of rebuilding the artifact name from the current workflow SHA.

* [Fix] fix moe tuner (#2831)

* CI: To fix prebuild image issue (#2847)

* Fix a8w8 flydsl tune (#2809)

* use flyc.compile
* add csv-driven mode covering mode_configs/configs a8w8_bpreshuffle csvs
Co-authored-by: solin <bingzhou@amd.com>

* fix other lse addr compute issue (#2833)

* CI: align tuning test workflow with README (#2849)

* CI: add nightly tuning validation workflow

Split the PR #2734 tuning test plan into dedicated nightly CPU and GPU jobs, and make the CPU-only tuner infra tests importable without a ROCm runtime so the new validation workflow can run on ubuntu-latest.

* CI: align tuning test workflow with README

Redo the tuning test PR to match the README commands directly: keep the workflow manual-only, run the documented level01/tune_pipeline/run_config suites from a ROCm container, and drop the unrelated utility-layer changes from the PR scope.

* CI: switch tuning tests to MI35X runner

Run the manual tuning test workflow on the MI35X single-GPU runner and set GPU_ARCHS to gfx950 so the dispatched validation matches the requested hardware target.

* gqa ratio support in mha test (#2798)

* update

* update

* fix odd num head

* ci(release): add manylinux2_28+ROCm builder path with auditwheel gate (#2851)

Adds opt-in manylinux + ROCm wheel build path (use_manylinux=true) using pytorch/manylinux2_28-builder:rocm7.X images, with auditwheel repair + symbol-floor verification step. Default off so existing dispatches are unchanged. Fixes the libstdc++/glibc ABI mismatch tracked in #2843. Validated by test build #24761365935 — wheel passes manylinux_2_28 contract (GLIBCXX max 3.4.22, GLIBC max 2.28).

Co-authored-by: Peng Sun <Peng.Sun@amd.com>

* add asm load kernel log (#2800)

* add asm load kernel log

* update according to review

* update log

* Update gfx942&gfx950 PA PS kernels and write stride_scale_page in asm_pa (#2796)

* clean code and reduce num_stages to 1 for large tile configs in fused_gemm_afp4wfp4_a16w16 to prevent LDS exhaustion with ASYNC_COPY (#2784)

* [TRITON] Add Gluon-optimized MoE Int8 SmoothQuant kernel for small K (#2441)

Move Gluon kernel to _gluon_kernels/gfx942/moe/ directory and refactor
K-dimension unrolling to use gl.static_range + tuple accumulation,
removing the UNROLL_TIMES <= 3 constraint. Raise Gluon dispatch
threshold to M >= 4096 for optimal performance.

Key optimizations:
- Manual LICM: A matrix, x_scale, and gammas pre-loaded outside N loop
- K-dimension loop fully unrolled via gl.static_range at AST level
- Explicit BlockedLayout + MFMA for optimal INT8 register usage
- SUB_BLOCK_SIZE_N=64 inner loop for large BLOCK_N

Co-authored-by: Claude Opus 4 <noreply@anthropic.com>

* [TRITON] Reduce MHA UTs (#2612)

* Reduce tests for mha

* Remove single parametrized functions

* Remove single parametrized functions

* Lint

* Lint

* Address comments

* Fix UTs

* Address test failures

* Address test failures

* Address comments

* Fix CI failures

* Relax test_mha_with_sink bwd tolerance

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Lucas Santos <Lucas.Santos@amd.com>

* Add gfx942 Triton A8W8 blockscale GEMM configs for GLM-5 shapes for 70k in and 300 out (#2743)

Co-authored-by: root <root@chi-mi300x-003.ord.vultr.cpe.ice.amd.com>

* docs: polish README with ecosystem, news, and performance highlights (#2859)

* docs: polish README with ecosystem, news, and performance highlights

- Replace GitHub-hosted image with local aiter_logo.png, center-aligned
- Add CI, Release, Docs, and GitHub stars badges
- Add News section with chronological blog posts and publications
- Add Ecosystem section with framework integration table (vLLM, SGLang,
  ATOM, JAX, and customer proprietary engines)
- Add performance highlights and supported hardware tables
- Refactor operators section to link to op_tests/ directory
- Polish description and feature list

docs: add aiter logo assets for README

* docs: add Opus section, MLA tutorial link, and fix release badge

- Add Opus lightweight C++ template section alongside FlyDSL and Iris
- Add MLA decode kernel tutorial to News section
- Change release badge to show version from shields.io, link to releases page

* docs: update badges and spacing in README header

- Remove stars badge, add last commit badge
- Add spacing between logo and badges

* docs: add v0.1.12.post1 release to News section

* docs: adjust spacing between logo and badges

* docs: widen spacing between logo and badges

* make rmsnorm quant fusion support gemma (#2853)

* gemma rmsnorm quant support

Signed-off-by: ganyi <ygan@amd.com>

* change test

Signed-off-by: ganyi <ygan@amd.com>

---------

Signed-off-by: ganyi <ygan@amd.com>

* ci(release): add torch_pin + torch_index_url inputs for ABI-targeted release builds

The manylinux release path currently calls 'pip install torch' with no
version pin, picking up whatever wheel is latest on the auto-derived
download.pytorch.org/whl/rocmX.Y index at build time. That works for
nightlies but is fragile for tagged releases that need to be ABI-
compatible with a specific downstream container.

Concrete failure: AITER wheels built today against torch 2.11.0+rocm7.2
fail with an ImportError at runtime inside vllm/vllm-openai-rocm:v0.19.1,
which ships a custom torch 2.10 build. The 2.10 -> 2.11 transition changed
the c10::cuda::getCurrentCUDAStream masquerade alias surface; AITER
.so files (e.g. module_gemm_a8w8_blockscale.so) hit the missing
symbol when dlopened in the vllm container.

This change adds two optional, additive workflow inputs (workflow_dispatch
and workflow_call):

  - torch_pin: e.g. '2.10.0+rocm7.1'. Empty = latest (existing behavior).
  - torch_index_url: override the auto-derived index URL. Empty = legacy
    'https://download.pytorch.org/whl/rocmX.Y' from the builder image tag.

Default behavior is unchanged. When both inputs are empty the install
step runs exactly as today.

For v0.1.12.post2 release builds we will dispatch with:
  torch_pin=2.10.0+rocm7.1
  torch_index_url=https://download.pytorch.org/whl/rocm7.1

Validated end-to-end on a side branch: wheel built with these pins
loaded cleanly inside vllm/vllm-openai-rocm:v0.19.1 and ran DSR1
serving on both MI300X and MI355X.

Refs: #2843

* Revert "CI: run SGLang downstream tests on MI355" (#2873)

* Revert "CI: run SGLang downstream tests on MI355 (#2807)"

This reverts commit a3f54e3ce76b40177ac3f1ab53763ea8ee9d2bb9.

* Update sglang_downstream.yaml

* ci: rename aiter MI300X runners (#2872)

Replace the legacy aiter single- and multi-GPU runner labels with the linux-aiter-mi300x equivalents so CI jobs and runner metadata target the new MI300X fleet consistently.

* CI: raise atom-test accuracy step timeout to 90 minutes (#2876)

Align with longer ATOM server startup wait (45m) plus warmup and lm_eval.

* Add torch in gemm a16w16 tune (#2860)

* add torch,skinny gemm in bf16 gemm tuner

* fix format

* fix lint

* CI: run Triton workflow on MI35X only (#2871)

* CI: run Triton workflow on MI35X only

Remove the opt-in MI325 Triton path and its PR label guidance so the Triton workflow always runs on MI35X. Keep the PR welcome comment aligned with the simplified CI behavior.

* MLA PS mode support nhead8,2 in MI308 (#2852)

* MLA PS mode support nhead8,2 in MI308

* nhead8,2 through m16x1_16x4

* delete redundant kernel

* fix the comment

---------

Co-authored-by: Xu, Shengnan <117875955+shengnxu@users.noreply.github.com>

* Dev/aot fix (#2856)

---------

Co-authored-by: coderfeli <felix.li@amd.com>

* Update kimik2 FP4 tuned fMoE config (#2845)

* Update kimik2 FP4 tuned fMoE config with 256 CU tuning results

Made-with: Cursor

* Fix kernel names for kimik2 FP4 token=8 expert=385 topk=9 config entry

---------

Co-authored-by: okorzh-amd <okorzh-amd@users.noreply.github.com>

* Fix sliding window mtp (#2829)

* fix(pa): port sliding window mtp fixes to main

Reapply the sliding-window MTP decode, PS reduce, and KV_BLOCK_SIZE=1024 fixes on top of the latest main so the change can be reviewed and merged from a branch with shared history.

Made-with: Cursor

* fix fallback

* fix(pa): Use FlyDSL PS reduce for sliding-window MTP

Signed-off-by: fsx950223 <fsx950223@outlook.com>
Made-with: Cursor

* fix(pa): Move Gluon decode imports to module top

Signed-off-by: fsx950223 <fsx950223@outlook.com>
Made-with: Cursor

* fix regression

---------

Signed-off-by: fsx950223 <fsx950223@outlook.com>

* [DO NOT MERGE] CI TESTS (#2894)

* CI: Temporarily running the build jobs on linux-aiter-mi300x-1 due to network issues (#2895)

* CI: Fix Aiter installation issues in sglang downstream test (#2897)

* CI: add opt-in MI300X Triton jobs for PRs and main (#2883)

Keep MI35X as the default Triton PR path, add ci:triton-300x to start extra MI300X jobs on PRs, and run both architectures on main. Update the PR welcome comment to document the new label and main-branch behavior.

* Revert "CI: Temporarily running the aiter build on linux-aiter-mi300x-1 due to network issues." (#2899)

* Revert "[DO NOT MERGE] CI TESTS (#2894)"

This reverts commit 0f059acf89220ecbf52919e20798e7e86fb3acbb.

* Update vllm_benchmark.yaml

* add optimized prefill gdn kernels for qwen3_5 (#2686)

* add optimized prefill gdn kernels for qwen3_5

* refine code style

* add ssm_state vk_layout kernels for vllm support

* turn off Triton autotune by default

* Introduce asm fmoe kernels that do not require bf16->fp8 quantization (#2262)

* Introduce asm fmoe kernels that do not require bf16->fp8 quantization

* Update quantization division precision to be closer to IEEE correctness

* Add transpose_scale flag

* Update kernels with fixed s_waitcnt

* Add 16x128 kernel merged with quantization

* Remove legacy code

* Remove one more redundant line

* Revert changes to sub_X_cnt calulations for fmoe asm

* Update the tuner to support x_bf16 kernel

* Update pandas interface

* Fix formatting

* Generate config with the new tuner

* Update base tuner to support merging csv files without strict column matching

* Fix formatting

* Fix core.py to handle columns merge

* Fix formatting

---------

Co-authored-by: Sergey Solo <ssolovye@amd.com>

* Revert "Fix sliding window mtp (#2829)" (#2909)

This reverts commit 6b8ce4ec3423441ad0a2f072d50f498ac711714c.

Co-authored-by: perzhang <perzhang@amd.com>

* fix(fmha): support >4GB KV cache in batch prefill via runtime dispatch (#2893)

* test: expand test_batch_prefill_large_kvcache for >4GB KV cache overflow

Rewrite test_batch_prefill_large_kvcache to validate the per-tile SRD
rebase fix for >4GB KV caches across all page sizes, dtypes, and
attention configurations:

- Add page_size=1 and 16 (page_size < kN0, exercises rebase path)
- Add GQA (16, 8) in addition to MHA (8, 8)
- Add causal masking with CK-compatible attn_mask for SDPA reference
- Use full KV cache (4.5GB) with pages spanning the overflow boundary
- Use torch SDPA as reference (memory-efficient backend, no score
  matrix materialization)
- Add scatter_pages parameter (False only; True for future
  global_load_lds flat addressing)
- Add GPU memory check to skip configs that exceed HBM capacity

Test matrix: 24 cases (3 page_sizes × 2 dtypes × 2 causal × 2 GQA × 1 scatter)

* test: add GPU sync after CK kernel in large_kvcache test

Add torch.cuda.synchronize() after CK kernel launch in
test_batch_prefill_large_kvcache to ensure all async GPU work
completes before memory is freed between tests.

Without this sync, repeated allocate/free cycles of large KV cache
buffers (~20GB) with mixed dtype (bf16→fp8) can trigger GPU page
faults when the HIP memory allocator reuses virtual addresses that
are still referenced by pending async GPU work. The fault manifests
as VM_L2_PROTECTION_FAULT at address 0x0 (NULL), causing GPU reset
and kernel soft lockup.

* feat(fmha): runtime dispatch for >4GB KV cache in batch prefill

Add use_64bit_load to batch prefill traits and runtime overflow detection.
When page_block_size < 128 and max_page_byte_offset > INT32_MAX, dispatch
to the flat 64-bit load kernel variant for correctness.

Also add vectorized KV layout coverage to test_batch_prefill_large_kvcache.

* fix: remove unused k_vector_size variable in large_kvcache test

* fix(mha): improve batch_prefill TORCH_CHECK error message for >4GB KV cache

Include page_size, num_pages, and dtype in the error message when kernel
dispatch fails. Add hint about CDNA3+ GPU requirement when KV cache
exceeds 4GB with page_size < 128.

* test: update scatter_pages comment in large_kvcache test

The comment incorrectly stated scatter_pages=True was "expected to FAIL".
This is no longer true — the flat 64-bit load path handles scattered
pages correctly. Update to describe the test's purpose instead.

* fix(mha): widen batch_prefill 64-bit threshold to total KV bytes

The previous check used (num_total_pages - 1) * batch_stride * element_size
which measures the last-page base offset, missing within-page offsets and
producing an off-by-one at exactly INT32_MAX (the largest representable
SRD voffset). Switch to total KV cache footprint (num_total_pages *
batch_stride * element_size > INT32_MAX) so within-page reads on the last
page are covered, and drop the redundant num_total_pages > 1 guard since
single-page configs trivially fit in 32 bits.

Also unify wording: 4GB → 2GB (INT32_MAX byte offset for SRD voffset),
matching CK's TwoGB convention. The actual hardware bound has always been
2GB; the prior comments were imprecise.

Found during batch prefill template dispatch review.

* docs(mha): unify >2GB wording in batch_prefill error and test

The 4GB number in the TORCH_CHECK error message and the test comment was
imprecise — the actual SRD voffset bound is 2GB (INT32_MAX). Update both
to match the threshold check and CK's TwoGB convention.

Found during batch prefill template dispatch review.

* refactor(mha): drop wrapper-side use_64bit_load; let CK dispatcher decide

The wrapper hardcoded kN0_min = 128 to compute the >2GB KV cache
predicate, which leaked CK tile config into aiter and would silently
break if a new arm with bn0 != 128 were added. The CK auto-generated
dispatcher now decides per-arm using its own compile-time bn0 and
per-dtype kElementBytes, so the wrapper just forwards args.

Remove the `use_64bit_load` runtime field from `mha_batch_prefill_traits`,
the parameter from `get_mha_batch_prefill_traits()`, and the entire
predicate computation block from the dispatcher call site. Bumps CK
submodule to pull in the matching codegen change.

* chore(mha): bump CK + update wrapper wording for kUseGlobalLoad rename

Bumps 3rdparty/composable_kernel to dd8d293ea (refactor(fmha): batch
prefill review polish — assert helper + setter guards) which builds on
the prior 99a3ca9af kUseGlobalLoad rename.

Wrapper-side updates to match:

* csrc/cpp_itfs/mha_fwd_batch_prefill.cu: rename "64-bit-load" wording
  in the per-arm dispatcher comment to "kUseGlobalLoad" so the wrapper
  comment matches the CK-side identifier. Also drops the trailing
  `false /* skip_min_seqlen_q */` argument from the
  get_mha_batch_prefill_traits call to match the upstream CK API
  signature change.

* csrc/py_itfs_ck/mha_batch_prefill_kernels.cu: change the >2GB error
  message from "page_size < 128" to "page_size < kN0" so the diagnostic
  tracks the tile-size constant rather than a magic number.

* op_tests/test_batch_prefill.py (test_batch_prefill_large_kvcache):
  three documentation enhancements with no behavior change —
  - explain why qo_len caps at 128 (causal) / 1024 (non-causal): the
    causal cap is a math-backend cliff for the SDPA reference, not a
    kernel limit;
  - explain that the +256 padding on kv_page_indices is a batch_prefill
    ABI requirement (kernel may speculatively read up to bn0=256
    entries past the last valid page index);
  - expand the torch.cuda.synchronize comment to call out the
    misattribution failure mode and GPU-reset cascade risk.

* test(fmha): parametrize test_batch_prefill_large_kvcache over batch_size {1, 4}

Adds multi-batch coverage to the >2GB KV cache regression test.

The previous single-batch coverage left the kernel's per-sequence SRD
rebase path unexercised: with cu_seqlens_q=[0, qo_len] and kv_indptr=
[0, num_blocks], the kernel never walks the indptr to reposition K/V
SRDs across batch boundaries. After the kUseGlobalLoad rename and the
new positive static_assert(kUseGlobalLoad_) calls in update_physical_pages
and set_page_stride_elements, we want a regression that catches any
boundary-crossing SRD bug -- the failure mode no single-batch test can
detect (one batch correct, others wrong).

batch_size=4 partitions the >2GB page pool across 4 sequences (last
sequence absorbs the remainder), exercising 3 cross-batch SRD transitions.
The SDPA reference is computed per-batch and concatenated; per-iteration
free + empty_cache keeps peak memory at one batch's worth.

Verified on:
  - gfx950 (smci355-gfx950, MI355X): 160 passed, 32 skipped
  - gfx942 (smc300x-clt, MI308X):    160 passed, 32 skipped

Skips are the existing vectorized + page_size=1 incompatibility
(3D tensor layout), now 16 per batch_size value.

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* Fix top_k_per_row_prefill err when batched_token_numm > 4096 (#2901)

* retune a8w8_blockscale_bpreshuffle_gemm for MI308 (#2896)

* revert gptoss tuned config (#2904)

---------

Co-authored-by: zhuyuhua-v <yuhzhu@amd.com>

* mHC: Optimize mhc_pre performance in small M (#2915)

* mHC: Optimize mhc_pre performance in small M (add tileN=16 on mhc_pre_gemm_sqrsum  && add splitk on mhc_pre_big_fuse)

* update test

* fix(mha_bwd): pass independent strides for do in _bwd_preprocess (#2808)

* fix(mha_bwd): pass independent strides for do in _bwd_preprocess

- adding independent `stride_do_b/h/m/k` parameters to `_bwd_preprocess`
  and addressing dO with them;
- passing `*do_strides` (already computed in the wrapper) alongside
  `*o_strides` when launching the preprocess kernel.

* pass independent strides for flash_attn_fused_backward kernel

* fix typo

* Fix batched_gemm_a8w8 correctness for ASYNC_COPY + gfx950 UT (#2867)

* [Triton]: Add MoE a16w4 (#2770)

* [Triton]: Add MoE a16w4

* Fix Black issue

* Removed x scales and improved arch checks in unit test

* Added benchmark

* Fixed PR review issues

* PR review comments fixes

---------

Co-authored-by: Rahul Batra <rahbatra@amd.com>

* test(moe_2stage): add csv-driven mode covering model_configs fmoe csvs (#2825)

---------

Signed-off-by: zhimding <zhiming.ding@amd.com>
Signed-off-by: zhimding <zhimding@amd.com>

* fix mhc device (#2916)

* CI: add standalone OPUS test workflow (#2928)

* CI: log in before starting OPUS test container

Authenticate with the Docker registry before `docker run` so OPUS test jobs can pull the image reliably across runners.

* CI: add docker login before ATOM base image pull (#2931)

* CI: add docker login before ATOM base image pull

Authenticate with the Docker registry before pulling the ATOM base image so the workflow can fetch the container reliably across runners.

* CI: Use build-only-aiter runner to build Triton wheels (#2934)

* fea(car): support custom group device (#2703)

* fea(car): support custom group device

Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>

* [fix]: test script code format

Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>

* fea(car): support multi communication groups entity

Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>

* fea(ag): support multi group allgather

Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>

* fix: test script format

Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>

* fea(car): add reduce_scatter interface

Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>

* fix(car): custom group config

Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>

* fix: custom comm group set method

Signed-off-by: root <root@smci355-ccs-aus-n01-17.cs-aus.dcgpu>

---------

Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
Signed-off-by: root <root@smci355-ccs-aus-n01-17.cs-aus.dcgpu>
Co-authored-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@smci355-ccs-aus-n01-17.cs-aus.dcgpu>

* [FLYDSL]: update version to 0.1.5.dev504, fix bug in if-else dynamic grammar (#2935)

* [CK] Add StreamLLM sink token support for batch_prefill pipeline (#2794)

* [CK] Add StreamLLM sink token support for batch_prefill pipeline

Update CK submodule to ROCm/rocm-libraries#6479 (commit feea6be) which
makes kHasSink_ a real template parameter in fmha_fwd_batch_prefill_traits_,
enabling _sink/_nsink kernel variant codegen for batch_prefill.

AITER-side changes thread sink_size through the full call chain:

Python (aiter/ops/mha.py):
- Add sink_size: int = 0 to mha_batch_prefill_func, _mha_batch_prefill,
  mha_batch_prefill stub, and cmdGenFunc_mha_batch_prefill
- Use has_effective_sink = sink_size > 0 and (causal or has_window_mask)
  for _sink/_nsink module name selection, matching C++ mask logic

C++ interface (csrc/):
- mha_batch_prefill_traits: expose has_sink parameter (was hardcoded false)
- mha_fwd_batch_prefill.cu: derive has_sink = args.sink_size > 0
- mha_batch_prefill_kernels.cu: add int sink_size param, include it in
  mask_identify string, set args.sink_size = mask.sink, zero-initialize
  fmha_batch_prefill_args{} to avoid UB
- PyBind (rocm_ops.hpp) and declaration (torch/mha_batch_prefill.h):
  add sink_size positional parameter after window_size_right

Semantics clarified:
- sink_size: number of first KV tokens always attended (sink phase loop)
- sink_ptr[nhead]: fixed logit for a virtual sink token in softmax
  (null -> -inf, non-null -> user value); independent of sink_size
- window_size_left=L means k in [abs_q-L, abs_q] (L+1 tokens), verified
  via code derivation (block_masking.hpp) and discriminating tests

Verified:
- No-sink paths numerically correct vs torch reference (max_diff<0.004)
- sink_ptr virtual token semantics correct (max_diff<0.004)
- window=1024 + sink_size=4 + sink_ptr: max_diff=0.000488

* [test] Add StreamLLM sink token tests for batch_prefill

Add test_batch_prefill_sink pytest function and supporting helpers
to op_tests/test_batch_prefill.py:

- ref_masked_attention_with_sink: torch reference implementing
  sink semantics (first sink_size KV positions always valid,
  sink_ptr virtual token appended to softmax attention matrix)

- run_batch_prefill_sink: runs both reference and CK kernel,
  compares with get_tolerances() thresholds

- test_batch_prefill_sink: parametrized over:
    batch_size=[1,2], page_size=16, head_dim=128
    num_qo_heads/num_kv_heads=[(8,1),(4,4)] (GQA + MHA)
    qo_len=[32,128]
    (window_left, kv_len)=[(128,512),(1024,2048)]  <- real gap in both
    sink_size=[4,16]
    sink_ptr_value=[None, 0.0, 2.0]
    dtype=[bfloat16, float16]

Verified manually with batch=1 GQA 8/1 window=128 sink=4 ptr=2.0,
batch=2 MHA 4/4 window=1024 sink=16 ptr=None, and sink_ptr=0.0.

* [style] Black format op_tests/test_batch_prefill.py

* [fix] Address Copilot review on batch_prefill sink PR

- aiter/ops/mha.py: add missing sink_size parameter to
  mha_batch_prefill_fake_tensors so its signature matches the real
  mha_batch_prefill op; without this torch.compile / fake-tensor mode
  raises TypeError when invoking the op.

- op_tests/test_batch_prefill.py: vectorize the StreamLLM reference
  mask construction. The previous double Python loop did per-element
  assignment into a CUDA tensor, triggering O(seqlen_q * seqlen_k)
  GPU sync points (262144 at seqlen_k=2048) and would time out the
  parametrized tests. Replace with broadcasted index tensors and a
  single masked_fill_.

* [test] Wire StreamLLM sink scenarios into CI __main__ block

CI invokes test files via `python3 "$file"`, which only runs the
__main__ block; pytest functions like test_batch_prefill_sink were
therefore never exercised in CI.

Add a second __main__ block at the end of the file that runs
run_batch_prefill_sink over a small representative parameter sweep
(2 window/kv combos x 2 sink_ptr settings, sink_size=4, bf16). The
helpers are defined after the original __main__, so a separate
trailing __main__ block is the minimal-blast-radius way to keep
both invocations independent.

* [CK] Bump submodule to include StreamLLM sink for batch_prefill

Update CK submodule from 08792e0b3 to d22aafb48 to pull in
[CK][fmha] Add StreamLLM sink support to batch_prefill pipeline (#6479).
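A vectorized sketch of the StreamLLM reference mask described above, built from broadcasted index tensors and a single masked_fill_; the helper name and the causal alignment are illustrative assumptions, not the exact test code:

    import torch

    def streamllm_mask(seqlen_q, seqlen_k, window_left, sink_size, device="cpu"):
        # A KV position k is attendable by query q if it falls inside the sliding
        # window [q_abs - window_left, q_abs] (window_left + 1 tokens) or is one of
        # the first sink_size "sink" tokens that are always attended.
        q_idx = torch.arange(seqlen_q, device=device).unsqueeze(1)  # (seqlen_q, 1)
        k_idx = torch.arange(seqlen_k, device=device).unsqueeze(0)  # (1, seqlen_k)
        q_abs = q_idx + (seqlen_k - seqlen_q)                       # align queries to absolute KV positions
        in_window = (k_idx >= q_abs - window_left) & (k_idx <= q_abs)
        is_sink = k_idx < sink_size
        mask = torch.zeros(seqlen_q, seqlen_k, device=device)
        mask.masked_fill_(~(in_window | is_sink), float("-inf"))    # one masked_fill_, no Python loops
        return mask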

* Set RNG seed in Triton tests (#2921)

* [quant] refactor hip kernels (#2932)

---------

Signed-off-by: zhiding512 <zhimding@amd.com>
Signed-off-by: zhimding <zhiming.ding@amd.com>
Signed-off-by: Tres Popp <tres.popp@amd.com>
Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
Signed-off-by: zhimding <zhimding@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: fsx950223 <fsx950223@outlook.com>
Signed-off-by: root <root@smci355-ccs-aus-n01-17.cs-aus.dcgpu>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: gyohuangxin <42127654+gyohuangxin@users.noreply.github.com>
Co-authored-by: BingYuan.Zhou <BingYuan.Zhou@amd.com>
Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: Vinayak Gokhale <vinayak.gokhale@amd.com>
Co-authored-by: RuibinCheung <ruibzhan@amd.com>
Co-authored-by: ruanjm <jiming.ruan@amd.com>
Co-authored-by: Peng <peng.sun@amd.com>
Co-authored-by: Peng Sun <pensun@Pengs-MacBook-Pro.local>
Co-authored-by: yzhou103 <Ying.Zhou2@amd.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: TennyWang1223 <Tenny.Wang@amd.com>
Co-authored-by: Nicholas Susanto <nicholas.susanto@amd.com>
Co-authored-by: Lukasz Burzawa <lukasz.burzawa@amd.com>
Co-authored-by: amd-ruitang3 <145657428+amd-ruitang3@users.noreply.github.com>
Co-authored-by: Su Ann Chong <suachong@amd.com>
Co-authored-by: Elton <zhimding@amd.com>
Co-authored-by: Jiayun <jiayyu@amd.com>
Co-authored-by: Stig-Arne Grönroos <sgronroo@amd.com>
Co-authored-by: amd-ruitang3 <rui.tang2@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Sergey Solovyev <sergey.solovyev@amd.com>
Co-authored-by: Sergey Solo <ssolovye@amd.com>
Co-authored-by: Yutao Xu <xytpai@foxmail.com>
Co-authored-by: Dragan Mladjenovic <dragan.mladjenovic@amd.com>
Co-authored-by: Nidal Danial <81209936+nidal567@users.noreply.github.com>
Co-authored-by: Nico Holmberg <nico.holmberg@amd.com>
Co-authored-by: Tres Popp <tres.popp@amd.com>
Co-authored-by: GeisYaO <q1179671016@gmail.com>
Co-authored-by: Xudong Yuan <xudoyuan@amd.com>
Co-authored-by: fangche123 <Fang.Che@amd.com>
Co-authored-by: minmengdie <memin@amd.com>
Co-authored-by: Kiran Thumma <167153338+kiran-thumma@users.noreply.github.com>
Co-authored-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
Co-authored-by: lucas-santos-amd <Lucas.Santos@amd.com>
Co-authored-by: la <46212055+junhaha666@users.noreply.github.com>
Co-authored-by: Bruno Mazzotti <bruno.mazzotti@amd.com>
Co-authored-by: PerryZhang01 <Perry.Zhang@amd.com>
Co-authored-by: perzhang <perzhang@amd.com>
Co-authored-by: Zzz9990 <zanzhang@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: eppaneamd <rleppane@amd.com>
Co-authored-by: HaonanWang98 <hwang@amd.com>
Co-authored-by: Pleaplusone <ygan@amd.com>
Co-authored-by: JaxChen29 <jichen@amd.com>
Co-authored-by: Petr Kurapov <petr.kurapov@gmail.com>
Co-authored-by: XiaobingZhang <xiaobingzhangupc@gmail.com>
Co-authored-by: zhuyuhua-v <yuhzhu@amd.com>
Co-authored-by: Yi DING <andy-ding@outlook.com>
Co-authored-by: Felix Li <felix.li@amd.com>
Co-authored-by: slippedJim <jim.guo@amd.com>
Co-authored-by: jian.wu <jwu10003@amd.com>
Co-authored-by: Pedram Ghazi <pedghazi@amd.com>
Co-authored-by: root <root@chi-mi300x-003.ord.vultr.cpe.ice.amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: Xu, Shengnan <117875955+shengnxu@users.noreply.github.com>
Co-authored-by: Oxana Korzh <oxana.korzh@amd.com>
Co-authored-by: okorzh-amd <okorzh-amd@users.noreply.github.com>
Co-authored-by: who who who <fsx950223@outlook.com>
Co-authored-by: yijin <yijin@amd.com>
Co-authored-by: Jeff Huang <jiaji.huang73@gmail.com>
Co-authored-by: chuanbowang2026 <chuanwan@amd.com>
Co-authored-by: rahulbatra85 <rahulbatra85@gmail.com>
Co-authored-by: Rahul Batra <rahbatra@amd.com>
Co-authored-by: root <root@smci355-ccs-aus-n01-17.cs-aus.dcgpu>
Co-authored-by: Linjun-AMD <Jun.Lin@amd.com>