Pull request overview
Fixes mhc_pre intermediate tensor allocations to be created on the same device as the input residual, preventing device-mismatch failures when the global default device is not CUDA.
Changes:
- Reorders imports in aiter/ops/mhc.py.
- Allocates out_pad, sqrsum, post_mix, comb_mix, and layer_input with device=residual.device.
Diff excerpt from aiter/ops/mhc.py:

    device = residual.device
    out_pad = torch.empty(
-       selected_splitk, m, (hc_mult3 + 31) // 32 * 32, dtype=dtypes.fp32
+       selected_splitk, m, (hc_mult3 + 31) // 32 * 32, dtype=dtypes.fp32, device=device
Consider adding a regression test that exercises mhc_pre when the global default device is CPU but inputs are explicitly on CUDA (e.g., torch.set_default_device('cpu') then create residual/fn/hc_scale/hc_base on cuda). This change fixes internal tensor allocations to follow residual.device, but current tests may still pass even if allocations accidentally fall back to the default device.
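A minimal sketch (not part of the PR) of the failure mode such a test would guard against: with a CPU default device, any intermediate allocated without an explicit device lands on CPU even though the inputs live on CUDA. A real regression test would call mhc_pre with all inputs on cuda and assert the output device, as the comment suggests.

    import torch

    # Illustrative only; assumes a CUDA/ROCm device is available.
    torch.set_default_device("cpu")
    residual = torch.randn(8, 1024, device="cuda")

    wrong = torch.empty(residual.shape)  # follows the CPU default -> device mismatch downstream
    right = torch.empty(residual.shape, dtype=residual.dtype, device=residual.device)

    assert wrong.device.type == "cpu"
    assert right.device == residual.device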
* CI: surface runner-config mapping in AMD CI job monitor (#2711)
* Update runner-config.yml
* CI: surface runner-config mapping in AMD CI job monitor
Load GPU architecture and count from runner-config.yml so the runner fleet summary shows the configured inventory for each label. Trigger the monitor workflow when runner mappings change and install PyYAML for the runner report job.
* CI: auto-update split test FILE_TIMES (#2709)
Co-authored-by: gyohuangxin <42127654+gyohuangxin@users.noreply.github.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
* CI: add standalone OPUS test workflow (#2716)
Add a dedicated GitHub Actions workflow for op_tests/opus so OPUS validation runs independently on MI35X and MI325 runners without being mixed into the main Aiter test shards.
* add signature and refine lru cache (#2710)
Co-authored-by: solin <bingzhou@amd.com>
* CI: separate Aiter Test concurrency for push and schedule (#2718)
Split main-branch concurrency by event type so scheduled runs do not block push-triggered validation when a long queued job keeps the nightly workflow open.
* fix split module jit and retune (#2719)
* [TRITON] Prevent NUM_KSPLIT from reducing K dim below GROUP_K (#2661)
* CI: Enable Deepseek ATOM tests on MI35X (#2667)
* CI: Enable Deepseek ATOM tests on MI35X
* CI: Use /models cache for MI35X ATOM DeepSeek test
Route the MI35X DeepSeek job to the runner-local /models cache so it avoids downloading into /run, and make the output artifact name unique now that two DeepSeek variants run in the same workflow.
* CI: Mount /models into MI35X ATOM test container
Pass the runner's shared /models cache into atom_aiter_test so MI35X DeepSeek jobs can use the mounted model path.
* feat: _flash_attn_forward add out args (#2648)
* Replace CK/CK_TILE in MLA Reduce and Metadata Kernel with OPUS (#2717)
* replace ck with opus
* fix compile issue
* fix waterfall and use buffer inst for lse.
* Replace ck with opus for mla metadata.
* Add dim=512 fp32 case for wave32
* docs: add ISA-level kernel optimization guide (#2708)
* docs: add ISA-level kernel optimization guide using LLVM tools
Step-by-step guide covering the full LLVM-based workflow for
inspecting, modifying, and recompiling AITER GPU kernel ISA:
disassemble, extract reassemblable .s, round-trip recompile with
binary-identical .text verification, and profile with rocprofv3.
Includes Python extraction script handling branch label word-offset
addressing, llvm-objcopy section swap for preserving kernel metadata,
and rocprofv3 kernel-trace + ATT profiling instructions.
* docs: add ISA optimization code examples and Dockerfile
Runnable companion to the ISA kernel optimization guide:
- extract_asm.py: standalone ASM extraction with CLI interface
- analyze_kernel.py: instruction mix analysis and rocprofv3 profile parser
- roundtrip.sh: end-to-end disassemble/extract/recompile/verify script
- Dockerfile: ROCm 7.2.1 dev environment with all tools pre-installed
including ATT trace decoder built from source
* style: fix black formatting and ruff lint in ISA optimization examples
- Rename loop var 'l' to 'ln' to fix E741 (ambiguous variable name)
- Remove extraneous f-prefix on strings without placeholders (F541)
- Apply black auto-formatting
* style: fix black formatting for CI compatibility
- Add blank line between module docstring and imports (E302)
- Collapse multiline f-string call arguments
---------
Co-authored-by: Peng Sun <pensun@Pengs-MacBook-Pro.local>
* Add run_config and compare in tuner (#2375)
* Add run_config/compare support to GemmTuner (bf16)
- Add config_env_name, _clear_op_caches(), and run_config() to GemmTuner
so --run_config and --compare flags work for bf16 GEMM tuning
- Update bubbly-exploring-turtle.md plan doc to reflect the full
implementation including --compare, config_env_name, cache clearing,
and post-tune config switching architecture
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add --run_config and --compare benchmark support to all tuners
Add infrastructure in base_tuner.py for production operator benchmarking:
- --run_config: benchmark only, no tuning
- --compare: pre-tune benchmark, tune, post-tune benchmark with comparison table
- Config env switching and cache clearing for post-tune benchmarks
Implement run_config() and _clear_op_caches() in all CK-based tuners:
gemm_a8w8, gemm_a8w8_bpreshuffle, gemm_a8w8_blockscale,
gemm_a8w8_blockscale_bpreshuffle, gemm_a4w4_blockscale,
gemm_moe_2stages, batched_gemm_a8w8, batched_gemm_bf16
* Revert unintended composable_kernel submodule change
* Fix review comments and remove intermediate plan docs
- Save/restore AITER_REBUILD original value instead of hardcoding 0
- Use defensive strip() for mixed-type object columns in _read_csv
- Remove docs/bubbly-exploring-turtle.md and docs/run_config_benchmark.md
(consolidated into csrc/.claude/add_run_config_to_tuner.md)
* update ref rtol,atol
* Fix tuner cache invalidation, run_config preshuffle, and compare workflow
- Fix _clear_op_caches for all tuners: properly clear lru_cache and
internal dict/attribute caches (a4w4, a8w8 variants, fmoe) so
post-tune benchmarks use freshly tuned configs instead of stale ones.
- Fix fmoe run_config: preshuffle weights before calling fused_moe to
match production layout (tuner always tunes with bpreshuffle=True),
preventing preshuffle_on/off module mismatch and 99%+ error.
- Add defensive warning in fused_moe get_2stage_cfgs when tuned config
is found but is_shuffled=False.
- Fix run_config to read shapes from tuned CSV and set config env var.
- Fix --compare workflow: run post-tune benchmark before tune_summary
to avoid summary errors blocking verification.
- Fix base_tuner _set_config_env_for_run_config return value.
- Use print instead of logger.info for benchmark tables.
* fix format
* fix format
* update readme
* fix lint error
* fix lint
* update csv only when perf improves
* format
* fix lint
* revert format for some files
* clarify compare and gated update flow
Make --compare keep a candidate csv and require --update_improved before writing back tuned results so the CLI stays explicit and easier to extend.
Made-with: Cursor
* fix flydsl GemmTuner review issues
Trigger FlyDSL package validation before importing tuning kernels and align the FlyDSL bias cast order with the runtime path to avoid unintended dtype promotion.
Made-with: Cursor
* update
* revert claude md
* update shape_grouped
* fix format
* fix bug
* fix lint error
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* Revert "fix(car): craph capture err (#2638)" (#2735)
This reverts commit 5759ee2943b5326dec23910135381b426ff65196.
* [TRITON] Swiglu and reduce refactor (#2583)
* Move swiglu to a util file + add optional residual flag
* refactor reduce and make it compatible with >65k tokens
* Update aiter/ops/triton/_triton_kernels/moe/reduce.py
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
---------
Co-authored-by: Lukasz Burzawa <lukasz.burzawa@amd.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* [aiter] type hints mismatch (#2728)
* fix: silence false warning in fmha_fwd_v3 when use_asm_v3 is disabled (#2744)
When `use_asm_v3` is `false`, `fmha_fwd_v3()` correctly returns `-1` to
fall back to the CK path, but it also emits a misleading "unsupported
condition in fwd_v3!!!" warning. This is not an unsupported condition —
the caller intentionally opted out of v3.
Separate the `use_asm_v3` check into an early return without a warning,
so the `AITER_LOG_WARNING` only fires for genuinely unsupported parameter
combinations (wrong head dims, unsupported dtypes, bias, dropout, wrong
arch).
Made-with: Cursor
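A simplified Python sketch of the control-flow change described above (helper names are placeholders, not the actual aiter implementation):

    def _v3_supports(params) -> bool:
        # Placeholder for the real head-dim / dtype / bias / dropout / arch checks.
        return False

    def aiter_log_warning(msg: str) -> None:
        print(f"[aiter] warning: {msg}")  # stand-in for AITER_LOG_WARNING

    def fmha_fwd_v3(params, use_asm_v3: bool) -> int:
        if not use_asm_v3:
            return -1  # caller opted out of v3: fall back to CK without warning
        if not _v3_supports(params):
            aiter_log_warning("unsupported condition in fwd_v3!!!")  # genuinely unsupported
            return -1
        return 0  # would launch the ASM v3 path here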
* Add FlyDSL GEMM AOT precompile support (#2741)
* feat(aot): add MoE FlyDSL AOT pre-compilation module
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Signed-off-by: zhiding512 <zhimding@amd.com>
* refactor(aot): remove --stage flag from MoE AOT module
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Signed-off-by: zhiding512 <zhimding@amd.com>
* reformat
* feat(aot): integrate FlyDSL MoE AOT precompilation into setup.py
Move moe.py into aiter/aot/flydsl/, support multiple CSV configs,
simplify compile_one_config to use COMPILE_ONLY=1 env var, and add
MoE AOT pre-compilation step to the package build in setup.py.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Signed-off-by: zhimding <zhiming.ding@amd.com>
* feat(aot): FlyDSL MoE AOT with COMPILE_ONLY dummy-tensor precompilation
Rework FlyDSL MoE AOT to use COMPILE_ONLY=1 with dummy tensors instead
of run_kernel, removing all HIP op dependencies (moe_sorting,
shuffle_weight, etc.) from the precompilation path.
Changes:
- Replace _run_kernel with _precompile_to_cache using torch.zeros
dummy tensors and COMPILE_ONLY=1 for pkl cache generation
- Add sys.modules bridging in setup.py so aiter.jit.core reuses the
same module instance loaded via sys.path
- Auto-detect bundled flydsl_cache in aiter/__init__.py and set
FLYDSL_RUNTIME_CACHE_DIR
- Add KeyError to aiter/__init__.py exception handler for robustness
- Support multiple CSV configs (dsv3 + kimik2)
- Remove run_kernel parameter and test_bad_tile logic
Signed-off-by: zhimding <zhiming.ding@amd.com>
* update flydsl
* update flydsl
* adapt hgemm
* fix black
* add flydsl gemm aot precompile support
---------
Signed-off-by: zhiding512 <zhimding@amd.com>
Signed-off-by: zhimding <zhiming.ding@amd.com>
Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
* gather support qk_nope_head_dim != v_head_dim (#2739)
* gather support qk_nope_head_dim != v_head_dim
* fix 192 pad
* feat: add/retune BF16 GEMM configs with FlyDSL backend for 6 models (#2733)
* feat: add/retune BF16 GEMM configs with FlyDSL backend for 6 models
Tuned on MI355X (gfx950) with all backends competing (ASM, hipBLASLt,
Triton, FlyDSL). New tuned configs for Llama 70B, Llama 405B, Qwen 32B.
Re-tuned existing configs for GPT-OSS, DSV3, Kimi-K2 to include FlyDSL.
Backend wins across 708 total shapes:
- hipBLASLt: 472 (66.7%)
- ASM: 131 (18.5%)
- FlyDSL: 70 (9.9%)
- Triton: 7 (1.0%)
- Mixed/other: 28 (4.0%)
* feat: retune BF16 GEMM without hipBLASLt, add GLM-5 and 3 new models
Re-tuned all BF16 GEMM configs on MI355X (gfx950) with --libtype
asm,triton,flydsl (no hipBLASLt). Added GLM-5 (88 shapes from CI log)
and new configs for Llama 70B, Llama 405B, Qwen 32B.
Backend wins across 796 total shapes (7 models):
- ASM: 437 (54.9%)
- FlyDSL: 224 (28.1%)
- Triton: 135 (17.0%)
Per-model breakdown:
- GPT-OSS (57): asm=54, triton=3 (bias=True, no FlyDSL support)
- DSV3 (58): flydsl=22, triton=18, asm=18
- Kimi-K2 (125): asm=77, flydsl=46, triton=2
- GLM-5 (88): asm=42, flydsl=30, triton=16
- Llama 70B (156): asm=84, flydsl=49, triton=23
- Llama 405B (156): asm=89, flydsl=43, triton=24
- Qwen 32B (156): asm=73, triton=49, flydsl=34
Tuning time without hipBLASLt: 4h total (long pole: 405B @ 4h)
vs with hipBLASLt: 10h+ total (long pole: 405B @ 8h+)
* Hoist introspection out of per-call ctype dispatch (#2742)
* Hoist inspect.signature/typing.get_type_hints out of per-call ctypes dispatch
These two introspection calls were recomputed on every invocation of
the ctypes caller closure (~79µs + ~91µs per call). Since the decorated
function's signature and type hints are immutable, compute them once at
decoration time and capture via closure.
Made-with: Cursor
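A minimal sketch of the hoisting pattern described above (decorator and variable names are illustrative, not the actual aiter code):

    import inspect
    import typing
    from functools import wraps

    def make_ctypes_caller(func):
        # Computed once at decoration time and captured by the closure; previously
        # both lookups ran on every call (~79us + ~91us per invocation).
        sig = inspect.signature(func)
        hints = typing.get_type_hints(func)

        @wraps(func)
        def caller(*args, **kwargs):
            bound = sig.bind(*args, **kwargs)
            bound.apply_defaults()
            # ... map bound.arguments to ctypes values using `hints`, then dispatch ...
            return func(*args, **kwargs)

        return caller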
* update ruff format
* update black format
---------
Co-authored-by: amd-ruitang3 <rui.tang2@amd.com>
* Make FlyDSL LDS checks architecture-aware and reduce tuner failure noise (#2732)
* Handle FlyDSL LDS limits and candidate failures
Use shared-memory-per-block queries to keep FlyDSL LDS checks architecture-aware, and surface candidate failures as concise runtime warnings so tuning can continue without noisy tracebacks.
Made-with: Cursor
* Keep tuner topk local per shape
Avoid mutating the shared topk value while post-processing one shape so later shape groups keep the intended candidate limit.
Made-with: Cursor
* fix lint
* Cache FlyDSL shared memory queries
Avoid repeated device property lookups while validating FlyDSL kernel configs by caching the default device selection and shared-memory-per-block queries.
Made-with: Cursor
* fix lint
* Update aiter/ops/flydsl/gemm_tune/flydsl_gemm_a8w8_bpreshuffle_common.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* parse kernel name to select flydsl kernel
* fix black format error
* refine
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: solin <bingzhou@amd.com>
* 16x256 kernel (#2722)
Co-authored-by: Sergey Solo <ssolovye@amd.com>
* rebase linear attn for new flydsl version (#2746)
* Replace unsafe uses of std::unordered_map with SynchronizedCache (#2221)
* Make AiterAsmKernel load hsaco on each GPU it is used on
* Replace unsafe uses of std::unordered_map with SynchronizedCache
* Fix Triton MoE GEMM shared memory exhaustion by reducing stage count (#2723)
* Fix Triton MoE GEMM shared memory exhaustion
- Reduce num_stages in kernel configs
- Lowered LDS usage to avoid shared memory OOR
- Fix triton.runtime.errors.OutOfResources errors in MoE GEMM kernels
* Fix: set num_stages=1 only on gfx950 (detected via get_arch()) so gfx942 is unaffected
* Add determinism for fused mul add test
* Format fused mul add test with black
* Annotate moe_sorting_dispatch_policy as int for fused_moe (#2639)
The type annotation bool was incorrect for moe_sorting_dispatch_policy, which
accepts int values. The @torch_compile_guard decorator uses these
annotations to generate PyTorch custom op schemas; with bool, PyTorch schema
enforcement casts any value to bool, so dispatch_policy=2 becomes bool(2)=True
(1), silently losing the intended policy. Using int allows callers to set
dispatch_policy=2 correctly.
Fixes: #2576
Signed-off-by: Tres Popp <tres.popp@amd.com>
Co-authored-by: Tres Popp <tres.popp@amd.com>
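A tiny illustration (not aiter code) of why the annotation matters once schema enforcement kicks in; the actual decorator is aiter's @torch_compile_guard:

    # With a bool-typed schema every nonzero policy collapses to True == 1,
    # so 2 and 3 become indistinguishable from 1; an int-typed schema keeps them.
    for policy in (0, 1, 2, 3):
        as_bool_schema = int(bool(policy))  # what the old annotation produced
        as_int_schema = int(policy)         # what the fixed annotation preserves
        print(policy, as_bool_schema, as_int_schema)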
* fix moe splitk aot and jit (#2738)
* fix moe splitk aot and jit
* split moe aot to several libs based on tuned_moe configs
* update copyrights
* fix typo
* test shuffle as default and fix moe split jit
* Update quant.py; fix: add pack_dim to per_1x32_f4_quant for tl.dot_scaled RHS compatibility (#2704)
* Update quant.py
* Refactor per_1x32_f4_quant function signature
* Fix function definition formatting in quant.py
* Refactor per_1x32_f4_quant_for_dot_scaled definition
* Restore semantic.py to match main branch
* Flydsl align if else usage to be compatible with all versions (#2740)
* Add bf16 MLA decode kernel for gqa_ratio=64, qseqlen=1 (non-persistent) (#2729)
* Add bf16 MLA decode kernel for gqa_ratio=64, qseqlen=1 (non-persistent)
* black mla.py
* fix short kv len split error
* Support final LSE output in non-persistent MLA reduce kernel
* black mla.py
* ruff error
---------
Co-authored-by: minmengdie <memin@amd.com>
* CI: scope check signal artifact downloads to matching run (#2757)
Avoid repo-wide artifact searches in downstream workflows so check-signal can fetch the pre-checks signal without hitting the GitHub Actions API rate limit.
* feat: aiter whls nightly (#2514)
* feat: aiter whls nightly
* fix:
* CI: add docker username input for aiter release workflow (#2535)
* fix:
* fix: address review comments and improve nightly workflow
- Enable promote for PRs (devreleases) after tests pass
- Fix inputs context for schedule/PR events (use github.event.inputs)
- Add set -o pipefail to prevent tee masking test failures
- Add release_type validation in promote workflow
- Guard AWS credentials against fork PRs
- Remove S3 bucket details from summaries, show only index URLs
- Simplify dispatch inputs: single docker_image, python_version, gpu_archs
- Add build_py310/build_py312 controls to aiter-release workflow
* style: fix Black formatting in generate_summary.py
* fix: add tabulate dependency for test_activation.py in wheel tests
* feat: add ci:nightly label trigger with full pipeline including promote
- PR trigger via ci:nightly label runs Build → Test → Promote to nightlies
- release_type set to nightlies for labeled PRs (same as scheduled runs)
- Promotion no longer blocked for PRs, enables end-to-end validation
* fix: keys
* fix: requested
* fix: remove testcode
* Update: move generate_summary.py
---------
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
* CI: reduce retention for workflow artifacts (#2758)
Shorten retention periods for high-volume workflow artifacts so routine logs and shard metadata expire sooner while wheel artifacts remain available long enough for downstream consumers.
* CI: upgrade sglang downstream to v0.5.10 (#2753)
* CI: upgrade sglang downstream to v0.5.10 on MI35X/MI355
* CI: keep sglang v0.5.10 upgrade on MI325
* Update sglang_downstream.yaml
* CI: use isolated sglang workspace for downstream test
* fix(car): sglang prefill launch error kernel (#2745)
Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
Co-authored-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
* Add unified attention support to bench_models (#2724)
Add sink and window support to mha handler
* Use AITER_CONFIGS for FlyDSL AOT defaults (#2756)
Align FlyDSL GEMM and MoE AOT precompile defaults with runtime config resolution, and update the FlyDSL dependency to the required version.
Made-with: Cursor
Signed-off-by: zhimding <zhimding@amd.com>
* fix fused_dynamic_mxfp4_quant_moe_sort_hip err in EP (#2759)
* Fix `test_moe_routing_sigmoid_top1_fused` reference implementation tie-breaking (#2750)
Replace `torch.topk` with `torch.argmax` + `torch.gather` in the reference
implementation. `torch.topk` doesn't guarantee stable indices for tied elements,
and its tie-breaking behaviour changed between ROCm 7.2.0 and 7.2.2, causing 10
tests to fail for `N=128`, where integer inputs produce many tied sigmoid
scores. `torch.argmax` always returns the first (leftmost) maximum, matching
`tl.argmax(..., tie_break_left=True)` in the Triton kernel.
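A small sketch of the difference (illustrative values, not the test's actual inputs):

    import torch

    scores = torch.tensor([[0.5, 0.9, 0.9, 0.1]])  # tied maxima at indices 1 and 2

    # torch.topk does not guarantee which tied index comes back, and its
    # tie-breaking changed between ROCm 7.2.0 and 7.2.2.
    _, topk_ids = torch.topk(scores, k=1, dim=-1)

    # torch.argmax returns the first (leftmost) maximum, matching
    # tl.argmax(..., tie_break_left=True) in the Triton kernel.
    ids = torch.argmax(scores, dim=-1, keepdim=True)
    vals = torch.gather(scores, -1, ids)
    assert ids.item() == 1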
* [fix](cache): add eps to avoid dividing by zero (#2763)
Co-authored-by: perzhang <perzhang@amd.com>
* [FlyDSL MOE] update a8w4 moe (#2726)
* update a8w4
* update gptoss csv
* update gptoss_fp4 prefill config
* update files format
* update fix with rws & update a8w4 kimi tune cfgs
* update
* update fix code style
* update code style
* fix ci
* revert fused_moe.py modifies
* revert a16w4
* add instances
* update ut & env var name
* rename files
* fix ci
* remove use_cfg
* remove tile m 16
---------
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* MI350 mla ps mode support nhead128,1 128,2 128,3 128,4 64,4 64,2 32,4 through kernel mla_a16w16_qh32_qseqlen4_gqaratio32_ps.co (#2727)
* MI350 mla ps mode support nhead128,1 128,2 128,3 128,4 64,4 64,2 32,4 through kernel hsa/gfx950/mla/mla_a16w16_qh32_qseqlen4_gqaratio32_ps.co
* del the useless code
* black error
* fix sparse test error
* update config (#2771)
* fix(ck_gemm): fix multi-arch build targeting and kernel dispatch across all CK GEMM modules (#2645)
* chip_info: add GFX_CU_NUM_MAP and get_build_targets()
* aiter/configs: migrate tuned GEMM CSVs to add gfx as first column
* csrc: fix gen_instances.py to filter by (gfx, cu_num) build targets
* aiter/ops: add gfx to runtime GEMM dispatch lookup keys
* aiter/utility: add gfx to GemmCommonTuner key and tune result output
* csrc, gradlib: add gfx to all GEMM tuner output keys
* op_tests: fix is_shape_tuned to filter by (gfx, cu_num)
* fix(configs): resolve model_configs merge conflicts and add gfx column
* op_tests: add CSV input, output saving, and stable iter counts to a8w8 GEMM test scripts
* fix(merge): resolve conflict in gemm_op_a4w4.py after main sync
The merge commit 6a18cd6e7 accidentally preserved conflict markers in
gemm_op_a4w4.py. Apply the gfx-aware dispatch fix (same pattern as
gemm_op_a8w8.py) — use (gfx, cu_num, M, N, K) key when the CSV has a
gfx column, fall back to (cu_num, M, N, K) for old CSVs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
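An illustrative sketch of the gfx-aware lookup-key fallback described above (helper and column names are hypothetical, not aiter's actual code):

    def tuned_gemm_key(row: dict, gfx: str, cu_num: int):
        M, N, K = int(row["M"]), int(row["N"]), int(row["K"])
        if "gfx" in row:  # new CSVs carry the arch as the first column
            return (gfx, cu_num, M, N, K)
        return (cu_num, M, N, K)  # legacy CSVs without a gfx column

    key = tuned_gemm_key({"gfx": "gfx942", "M": 1, "N": 7168, "K": 8192}, "gfx942", 304)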
* fix(configs): add missing gfx column to dsv3 model_configs overrides
* op_tests: add bpreshuffle-csv entry point and skip_ck flag to test_gemm_a8w8
* op_tests: add gfx filter unit tests and repro CSVs for both GEMM modules
* fix(ck_gemm): key C++ dispatch map by (cu_num,M,N,K) to prevent multi-arch kernel collisions, share build_tune_dict helpers across all 9 CK GEMM modules
* op_tests/configs/gemm_codegen_gfx_filter.csv
* chip_info: split arch constants and env-only build targets into torch-free build_targets.py
* op_tests: fix repro CSV gfx942/304 kernels to be valid for M=1 and M=32
* chip_info: use bare import for build_targets to fix build context ModuleNotFoundError
* docs: add gfx column to tuning CSV examples and update cu_num description in all 8 GEMM READMEs
* lint: apply black formatting and fix ruff violations in modified files
* lint: fix black/ruff violations in csrc gen_instances and gradlib
* fix(gemm_op_a8w8): eliminate StopIteration risk and use AITER_CONFIGS for defaults
* fix(chip_info): guard kernelId/kernelName lookups with .get() to avoid KeyError on malformed CSVs
* fix(base_tuner): add gfx legacy fallback to if branch of get_retune_gemm_list
* docs(test_gemm_codegen): fix comment reference for GFX_CU_NUM_MAP location
* fix(gemm_dispatch_utils): check HIP return codes in get_device_cu_num()
* fix(chip_info): add get_gfx_runtime() and fix GPU_ARCHS=native in get_build_targets()
* chore(configs): sync dsv3/kimik2 bf16 tuned gemm CSVs with main and add gfx column
* fix(op_tests): use get_gfx_runtime() in GEMM test files for correct arch detection
* fix(core): self-heal CSV dedup without requiring a re-run
* fix(chip_info): add shape and arch context to kernelId/kernelName skip warnings
* fix(chip_info): use logger.warning instead of print for kernel skip warnings
* style(chip_info): fix E402 import order after logger initialization
* fix(gemm_dispatch_utils): initialize device to -1 to clarify output-parameter intent
* test(test_gemm_codegen): fix Section 3 runtime dispatch tests to use live GPU
* fix(gemm_op_a8w8): remove duplicate get_gfx_runtime import
* docs(chip_info): fix build_tune_dict docstring for kernels_by_name fallback
* fix(gemm): extend C++ dispatch key with gfx arch string — (cu_num,M,N,K) → (gfx,cu_num,M,N,K)
* style(chip_info, test_gemm_codegen): apply black/ruff formatting
* feat: add PRETUNE_MODULES build flag to auto-tune GEMM shapes on live GPU
* feat(pretune): add run_tune_direct() and CLI for standalone retuning on installed aiter
* refactor(pretune): remove run_tune_direct wrapper, add input validation and dedup to CLI
* fix(pretune): suppress ruff F841/E402 false positives on eval-scope variable and path-dependent import
* refactor(pretune): extract _parse_module_list, fix silent skip of unsupported modules in setup.py path, add deduplication
* docs(pretune, setup, test_gemm_codegen): fix stale docstrings and add missing inline comments
* fix(pretune): write tuned results to source CSV, not ephemeral /tmp; add regression test
* setup.py: import pretune directly to avoid premature aiter package init
* pretune: add warmup API — check_tuning_coverage, warn_if_undertuned, warmup
* pretune: tune only missing model shapes in warmup(), not full CSV
* fix(pretune): remove vLLM-specific env var hint from warmup() warning
* revert: remove warmup API from pretune.py
* fix(tuners): clear module-level CSV caches in _clear_op_caches
* fix(build): add _parse_gpu_archs_env()
* fix(docs/tests): docstring accuracy, test coverage, and gfx-aware dedup
* fix(tests): route aiter logger to stdout in test_pretune to fix warning ordering
* fix(gemm_dispatch_utils): cache cu_num and gfx per device ID via SynchronizedCache
* tuning: use get_gfx_runtime() in tuner imports so live GPU arch is used instead of GPU_ARCHS env
* fix(configs): add missing gfx column to bf16 model_configs CSVs introduced during main merge
* raise an error when duplicate shape entries are present
* fix(configs): remove duplicate shape entries from a8w8_blockscale_bpreshuffle_tuned_gemm_qwen3.5_397b.csv
* resolve duplicated shapes
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Ying.Zhou2 <Ying.Zhou2@amd.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
* add triton fallback for mi455 gptoss & dsfp4 (#2657)
* add triton fallback for mi455 gptoss & dsfp4
* rm nk config
* fix rotary
* fix recompile issue (#2777)
* Fix fused_gemm_a8w8_blockscale_a16w16 unit test failures (#2766)
* Fix fused_gemm_a8w8_blockscale_a16w16 unit tests
* fix format with black
* remove guard and keeping num_stages as is - works
* work around invalid kernels in moe tuner (#2785)
* CI: auto-update split test FILE_TIMES (#2795)
* CI: auto-update split test FILE_TIMES
* chore: trigger pull_request checks
---------
Co-authored-by: gyohuangxin <42127654+gyohuangxin@users.noreply.github.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
* [fix] remove glm5 triton tuned gemm (#2803)
* remove glm5 triton tuned gemm
* tuned
* add tuned bf16 gemm for qwen3next and qwen3.5 (#2799)
* add qwen3next qwen3.5 bf16 untuned gemm
Signed-off-by: ganyi <ygan@amd.com>
* revert the bf16 template csv
Signed-off-by: ganyi <ygan@amd.com>
---------
Signed-off-by: ganyi <ygan@amd.com>
* [Perf][WIP] update qwen fmoe tuned configs (#2812)
* update qwen fmoe tuned configs
* fix typo
* CI: run SGLang downstream tests on MI355 (#2807)
* CI: run SGLang downstream tests on MI355
Switch the downstream workflow to the MI355 runner and mi35x image path so sglang exercises the gfx950 configuration instead of the MI325/mi30x path.
* Update sglang_downstream.yaml
* Update sglang_downstream.yaml
* CI: allow nightly check-signal to reuse Checks push runs (#2819)
Nightly Aiter Test runs look up pre-check signal artifacts by SHA, but the Checks workflow does not run with event=schedule. Skip the event filter for schedule-triggered workflows so nightly jobs can reuse the matching push-based Checks run instead of failing early in check-signal.
* [fix] fix compare error (#2805)
* fix compare error
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* CI: log in to Docker before building aiter image (#2824)
Authenticate before the build so pulling the base image does not hit registry rate limits. Reuse the same Docker session for the later image push.
* [FLYDSL]: aiter flydsl if const_expr (#2776)
* [FLYDSL]: aiter flydsl if const_expr
* [FLYDSL]: black
* add tuner test suite (#2734)
* add tuner test suite
* Update op_tests/tuning_tests/test_mp_tuner_logic.py
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* add run_config test
* fix tuned csv
* add missed test
* Update op_tests/tuning_tests/test_run_config.py
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix format
* update test and tuner
* fix format
* format
* update test
* format
---------
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix lse address compute (#2810)
* Add vcs_versioning as transitive dependency of setuptools_scm>=10 (#2834)
* Revert kimi2.5 fp4 tp4 moe tuned result on bs 32,64 (#2836)
* remove gemm tuning (#2838)
Signed-off-by: ganyi <ygan@amd.com>
* disable import flydsl_gdr_decode (#2840)
* Xiaobing/gptoss small m (#2775)
* add small m hgemm
* update gptoss config
* update small_m_hgemm
* clear code
* code format
* fix code format
* remove moe_gemm_2stage.py change
* refactor code
* recover 0 from SMALL_M_B_TO_LDS_WAVES_PER_EU_OPTIONS
* update gemm code
* refactor code
---------
Co-authored-by: zhuyuhua-v <yuhzhu@amd.com>
* [CK_TILE] Fix bpreshuffle compile failure due to stale CShuffleEpilogueProblem arg (#2811)
* [fix] Remove stale TiledMMAPermuteN arg from CShuffleEpilogueProblem
* [fix] Remove stale TiledMMAPermuteN arg from blockscale CShuffleEpilogueProblem
* add static_assert for TiledMMAPermuteN in blockscale and moe_2stages
* update subproject commit for composable_kernel
* update flydsl version (#2802)
* update flydsl version
* fix version err
* CI: fix pre-check signal artifact lookup for PR reruns (#2848)
Resolve check-signal failures when labeled or rerun pull_request jobs get a new GitHub merge SHA. Download the checks-signal artifact from the matched Checks run instead of rebuilding the artifact name from the current workflow SHA.
* [Fix] fix moe tuner (#2831)
* CI: To fix prebuild image issue (#2847)
* Fix a8w8 flydsl tune (#2809)
* use flyc.compile
* add csv-driven mode covering mode_configs/configs a8w8_bpreshuffle csvs
Co-authored-by: solin <bingzhou@amd.com>
* fix other lse addr compute issue (#2833)
* CI: align tuning test workflow with README (#2849)
* CI: add nightly tuning validation workflow
Split the PR #2734 tuning test plan into dedicated nightly CPU and GPU jobs, and make the CPU-only tuner infra tests importable without a ROCm runtime so the new validation workflow can run on ubuntu-latest.
* CI: align tuning test workflow with README
Redo the tuning test PR to match the README commands directly: keep the workflow manual-only, run the documented level01/tune_pipeline/run_config suites from a ROCm container, and drop the unrelated utility-layer changes from the PR scope.
* CI: switch tuning tests to MI35X runner
Run the manual tuning test workflow on the MI35X single-GPU runner and set GPU_ARCHS to gfx950 so the dispatched validation matches the requested hardware target.
* gqa ratio support in mha test (#2798)
* update
* update
* fix odd num head
* ci(release): add manylinux2_28+ROCm builder path with auditwheel gate (#2851)
Adds opt-in manylinux + ROCm wheel build path (use_manylinux=true) using pytorch/manylinux2_28-builder:rocm7.X images, with auditwheel repair + symbol-floor verification step. Default off so existing dispatches are unchanged. Fixes the libstdc++/glibc ABI mismatch tracked in #2843. Validated by test build #24761365935 — wheel passes manylinux_2_28 contract (GLIBCXX max 3.4.22, GLIBC max 2.28).
Co-authored-by: Peng Sun <Peng.Sun@amd.com>
* add asm load kernel log (#2800)
* add asm load kernel log
* update according to review
* update log
* Update gfx942&gfx950 PA PS kernels and write stride_scale_page in asm_pa (#2796)
* clean code and reduce num_stages to 1 for large tile configs in fused_gemm_afp4wfp4_a16w16 to prevent LDS exhaustion with ASYNC_COPY (#2784)
* [TRITON] Add Gluon-optimized MoE Int8 SmoothQuant kernel for small K (#2441)
Move Gluon kernel to _gluon_kernels/gfx942/moe/ directory and refactor
K-dimension unrolling to use gl.static_range + tuple accumulation,
removing the UNROLL_TIMES <= 3 constraint. Raise Gluon dispatch
threshold to M >= 4096 for optimal performance.
Key optimizations:
- Manual LICM: A matrix, x_scale, and gammas pre-loaded outside N loop
- K-dimension loop fully unrolled via gl.static_range at AST level
- Explicit BlockedLayout + MFMA for optimal INT8 register usage
- SUB_BLOCK_SIZE_N=64 inner loop for large BLOCK_N
Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
* [TRITON] Reduce MHA UTs (#2612)
* Reduce tests for mha
* Remove single parametrized functions
* Remove single parametrized functions
* Lint
* Lint
* Address comments
* Fix UTs
* Address test failures
* Address test failures
* Address comments
* Fix CI failures
* Relax test_mha_with_sink bwd tolerance
---------
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Lucas Santos <Lucas.Santos@amd.com>
* Add gfx942 Triton A8W8 blockscale GEMM configs for GLM-5 shapes (70k input / 300 output tokens) (#2743)
Co-authored-by: root <root@chi-mi300x-003.ord.vultr.cpe.ice.amd.com>
* docs: polish README with ecosystem, news, and performance highlights (#2859)
* docs: polish README with ecosystem, news, and performance highlights
- Replace GitHub-hosted image with local aiter_logo.png, center-aligned
- Add CI, Release, Docs, and GitHub stars badges
- Add News section with chronological blog posts and publications
- Add Ecosystem section with framework integration table (vLLM, SGLang,
ATOM, JAX, and customer proprietary engines)
- Add performance highlights and supported hardware tables
- Refactor operators section to link to op_tests/ directory
- Polish description and feature list
docs: add aiter logo assets for README
* docs: add Opus section, MLA tutorial link, and fix release badge
- Add Opus lightweight C++ template section alongside FlyDSL and Iris
- Add MLA decode kernel tutorial to News section
- Change release badge to show version from shields.io, link to releases page
* docs: update badges and spacing in README header
- Remove stars badge, add last commit badge
- Add spacing between logo and badges
* docs: add v0.1.12.post1 release to News section
* docs: adjust spacing between logo and badges
* docs: widen spacing between logo and badges
* make rmsnorm quant fusion support gemma (#2853)
* gemma rmsnorm quant support
Signed-off-by: ganyi <ygan@amd.com>
* change test
Signed-off-by: ganyi <ygan@amd.com>
---------
Signed-off-by: ganyi <ygan@amd.com>
* ci(release): add torch_pin + torch_index_url inputs for ABI-targeted release builds
The manylinux release path currently calls 'pip install torch' with no
version pin, picking up whatever wheel is latest on the auto-derived
download.pytorch.org/whl/rocmX.Y index at build time. That works for
nightlies but is fragile for tagged releases that need to be ABI-
compatible with a specific downstream container.
Concrete failure: AITER wheels built today against torch 2.11.0+rocm7.2
ImportError at runtime inside vllm/vllm-openai-rocm:v0.19.1, which
ships a custom torch 2.10 build. The 2.10 -> 2.11 transition changed
the c10::cuda::getCurrentCUDAStream masquerade alias surface; AITER
.so files (e.g. module_gemm_a8w8_blockscale.so) hit the missing
symbol when dlopened in the vllm container.
This change adds two optional, additive workflow inputs (workflow_dispatch
and workflow_call):
- torch_pin: e.g. '2.10.0+rocm7.1'. Empty = latest (existing behavior).
- torch_index_url: override the auto-derived index URL. Empty = legacy
'https://download.pytorch.org/whl/rocmX.Y' from the builder image tag.
Default behavior is unchanged. When both inputs are empty the install
step runs exactly as today.
For v0.1.12.post2 release builds we will dispatch with:
torch_pin=2.10.0+rocm7.1
torch_index_url=https://download.pytorch.org/whl/rocm7.1
Validated end-to-end on a side branch: wheel built with these pins
loaded cleanly inside vllm/vllm-openai-rocm:v0.19.1 and ran DSR1
serving on both MI300X and MI355X.
Refs: #2843
* Revert "CI: run SGLang downstream tests on MI355" (#2873)
* Revert "CI: run SGLang downstream tests on MI355 (#2807)"
This reverts commit a3f54e3ce76b40177ac3f1ab53763ea8ee9d2bb9.
* Update sglang_downstream.yaml
* ci: rename aiter MI300X runners (#2872)
Replace the legacy aiter single- and multi-GPU runner labels with the linux-aiter-mi300x equivalents so CI jobs and runner metadata target the new MI300X fleet consistently.
* CI: raise atom-test accuracy step timeout to 90 minutes (#2876)
Align with longer ATOM server startup wait (45m) plus warmup and lm_eval.
* Add torch in gemm a16w16 tune (#2860)
* add torch,skinny gemm in bf16 gemm tuner
* fix format
* fix lint
* CI: run Triton workflow on MI35X only (#2871)
* CI: run Triton workflow on MI35X only
Remove the opt-in MI325 Triton path and its PR label guidance so the Triton workflow always runs on MI35X. Keep the PR welcome comment aligned with the simplified CI behavior.
* MLA PS mode support nhead8,2 in MI308 (#2852)
* MLA PS mode support nhead8,2 in MI308
* nhead8,2 through m16x1_16x4
* delete redundant kernel
* fix the comment
---------
Co-authored-by: Xu, Shengnan <117875955+shengnxu@users.noreply.github.com>
* Dev/aot fix (#2856)
---------
Co-authored-by: coderfeli <felix.li@amd.com>
* Update kimik2 FP4 tuned fMoE config (#2845)
* Update kimik2 FP4 tuned fMoE config with 256 CU tuning results
Made-with: Cursor
* Fix kernel names for kimik2 FP4 token=8 expert=385 topk=9 config entry
---------
Co-authored-by: okorzh-amd <okorzh-amd@users.noreply.github.com>
* Fix sliding window mtp (#2829)
* fix(pa): port sliding window mtp fixes to main
Reapply the sliding-window MTP decode, PS reduce, and KV_BLOCK_SIZE=1024 fixes on top of the latest main so the change can be reviewed and merged from a branch with shared history.
Made-with: Cursor
* fix fallback
* fix(pa): Use FlyDSL PS reduce for sliding-window MTP
Signed-off-by: fsx950223 <fsx950223@outlook.com>
Made-with: Cursor
* fix(pa): Move Gluon decode imports to module top
Signed-off-by: fsx950223 <fsx950223@outlook.com>
Made-with: Cursor
* fix regression
---------
Signed-off-by: fsx950223 <fsx950223@outlook.com>
* [DO NOT MERGE] CI TESTS (#2894)
* CI: Temporarily running the build jobs on linux-aiter-mi300x-1 due to network issues (#2895)
* CI: Fix Aiter installation issues in sglang downstream test (#2897)
* CI: add opt-in MI300X Triton jobs for PRs and main (#2883)
Keep MI35X as the default Triton PR path, add ci:triton-300x to start extra MI300X jobs on PRs, and run both architectures on main. Update the PR welcome comment to document the new label and main-branch behavior.
* Revert "CI: Temporarily running the aiter build on linux-aiter-mi300x-1 due to network issues." (#2899)
* Revert "[DO NOT MERGE] CI TESTS (#2894)"
This reverts commit 0f059acf89220ecbf52919e20798e7e86fb3acbb.
* Update vllm_benchmark.yaml
* add optimized prefill gdn kernels for qwen3_5 (#2686)
* add optimized prefill gdn kernels for qwen3_5
* refine code style
* add ssm_state vk_layout kernels for vllm support
* add default turnoff for triton autotune
* Introduce asm fmoe kernels that do not require bf16->fp8 quantization (#2262)
* Introduce asm fmoe kernels that do not require bf16->fp8 quantization
* Update quantization division precision to be closer to IEEE correctness
* Add transpose_scale flag
* Update kernels with fixed s_waitcnt
* Add 16x128 kernel merged with quantization
* Remove legacy code
* Remove one more redundant line
* Revert changes to sub_X_cnt calculations for fmoe asm
* Update the tuner to support x_bf16 kernel
* Update pandas interface
* Fix formatting
* Generate config with the new tuner
* Update base tuner to support merging csv files without strict column matching
* Fix formatting
* Fix core.py to handle columns merge
* Fix formatting
---------
Co-authored-by: Sergey Solo <ssolovye@amd.com>
* Revert "Fix sliding window mtp (#2829)" (#2909)
This reverts commit 6b8ce4ec3423441ad0a2f072d50f498ac711714c.
Co-authored-by: perzhang <perzhang@amd.com>
* fix(fmha): support >4GB KV cache in batch prefill via runtime dispatch (#2893)
* test: expand test_batch_prefill_large_kvcache for >4GB KV cache overflow
Rewrite test_batch_prefill_large_kvcache to validate the per-tile SRD
rebase fix for >4GB KV caches across all page sizes, dtypes, and
attention configurations:
- Add page_size=1 and 16 (page_size < kN0, exercises rebase path)
- Add GQA (16, 8) in addition to MHA (8, 8)
- Add causal masking with CK-compatible attn_mask for SDPA reference
- Use full KV cache (4.5GB) with pages spanning the overflow boundary
- Use torch SDPA as reference (memory-efficient backend, no score
matrix materialization)
- Add scatter_pages parameter (False only; True for future
global_load_lds flat addressing)
- Add GPU memory check to skip configs that exceed HBM capacity
Test matrix: 24 cases (3 page_sizes × 2 dtypes × 2 causal × 2 GQA × 1 scatter)
* test: add GPU sync after CK kernel in large_kvcache test
Add torch.cuda.synchronize() after CK kernel launch in
test_batch_prefill_large_kvcache to ensure all async GPU work
completes before memory is freed between tests.
Without this sync, repeated allocate/free cycles of large KV cache
buffers (~20GB) with mixed dtype (bf16→fp8) can trigger GPU page
faults when the HIP memory allocator reuses virtual addresses that
are still referenced by pending async GPU work. The fault manifests
as VM_L2_PROTECTION_FAULT at address 0x0 (NULL), causing GPU reset
and kernel soft lockup.
* feat(fmha): runtime dispatch for >4GB KV cache in batch prefill
Add use_64bit_load to batch prefill traits and runtime overflow detection.
When page_block_size < 128 and max_page_byte_offset > INT32_MAX, dispatch
to the flat 64-bit load kernel variant for correctness.
Also add vectorized KV layout coverage to test_batch_prefill_large_kvcache.
* fix: remove unused k_vector_size variable in large_kvcache test
* fix(mha): improve batch_prefill TORCH_CHECK error message for >4GB KV cache
Include page_size, num_pages, and dtype in the error message when kernel
dispatch fails. Add hint about CDNA3+ GPU requirement when KV cache
exceeds 4GB with page_size < 128.
* test: update scatter_pages comment in large_kvcache test
The comment incorrectly stated scatter_pages=True was "expected to FAIL".
This is no longer true — the flat 64-bit load path handles scattered
pages correctly. Update to describe the test's purpose instead.
* fix(mha): widen batch_prefill 64-bit threshold to total KV bytes
The previous check used (num_total_pages - 1) * batch_stride * element_size
which measures the last-page base offset, missing within-page offsets and
producing an off-by-one at exactly INT32_MAX (the largest representable
SRD voffset). Switch to total KV cache footprint (num_total_pages *
batch_stride * element_size > INT32_MAX) so within-page reads on the last
page are covered, and drop the redundant num_total_pages > 1 guard since
single-page configs trivially fit in 32 bits.
Also unify wording: 4GB → 2GB (INT32_MAX byte offset for SRD voffset),
matching CK's TwoGB convention. The actual hardware bound has always been
2GB; the prior comments were imprecise.
Found during batch prefill template dispatch review.
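The widened predicate, as a standalone sketch (the real check lives in the CK dispatcher, not in Python):

    INT32_MAX = 2**31 - 1

    def needs_global_load(num_total_pages: int, batch_stride: int, element_size: int) -> bool:
        # Total KV-cache footprint in bytes; once it exceeds the largest SRD
        # voffset (INT32_MAX, the ~2GB bound described above), dispatch the
        # 64-bit / kUseGlobalLoad kernel variant instead of the 32-bit-offset one.
        return num_total_pages * batch_stride * element_size > INT32_MAX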
* docs(mha): unify >2GB wording in batch_prefill error and test
The 4GB number in the TORCH_CHECK error message and the test comment was
imprecise — the actual SRD voffset bound is 2GB (INT32_MAX). Update both
to match the threshold check and CK's TwoGB convention.
Found during batch prefill template dispatch review.
* refactor(mha): drop wrapper-side use_64bit_load; let CK dispatcher decide
The wrapper hardcoded kN0_min = 128 to compute the >2GB KV cache
predicate, which leaked CK tile config into aiter and would silently
break if a new arm with bn0 != 128 were added. The CK auto-generated
dispatcher now decides per-arm using its own compile-time bn0 and
per-dtype kElementBytes, so the wrapper just forwards args.
Remove the `use_64bit_load` runtime field from `mha_batch_prefill_traits`,
the parameter from `get_mha_batch_prefill_traits()`, and the entire
predicate computation block from the dispatcher call site. Bumps CK
submodule to pull in the matching codegen change.
* chore(mha): bump CK + update wrapper wording for kUseGlobalLoad rename
Bumps 3rdparty/composable_kernel to dd8d293ea (refactor(fmha): batch
prefill review polish — assert helper + setter guards) which builds on
the prior 99a3ca9af kUseGlobalLoad rename.
Wrapper-side updates to match:
* csrc/cpp_itfs/mha_fwd_batch_prefill.cu: rename "64-bit-load" wording
in the per-arm dispatcher comment to "kUseGlobalLoad" so the wrapper
comment matches the CK-side identifier. Also drops the trailing
`false /* skip_min_seqlen_q */` argument from the
get_mha_batch_prefill_traits call to match the upstream CK API
signature change.
* csrc/py_itfs_ck/mha_batch_prefill_kernels.cu: change the >2GB error
message from "page_size < 128" to "page_size < kN0" so the diagnostic
tracks the tile-size constant rather than a magic number.
* op_tests/test_batch_prefill.py (test_batch_prefill_large_kvcache):
three documentation enhancements with no behavior change —
- explain why qo_len caps at 128 (causal) / 1024 (non-causal): the
causal cap is a math-backend cliff for the SDPA reference, not a
kernel limit;
- explain that the +256 padding on kv_page_indices is a batch_prefill
ABI requirement (kernel may speculatively read up to bn0=256
entries past the last valid page index);
- expand the torch.cuda.synchronize comment to call out the
misattribution failure mode and GPU-reset cascade risk.
* test(fmha): parametrize test_batch_prefill_large_kvcache over batch_size {1, 4}
Adds multi-batch coverage to the >2GB KV cache regression test.
The previous single-batch coverage left the kernel's per-sequence SRD
rebase path unexercised: with cu_seqlens_q=[0, qo_len] and kv_indptr=
[0, num_blocks], the kernel never walks the indptr to reposition K/V
SRDs across batch boundaries. After the kUseGlobalLoad rename and the
new positive static_assert(kUseGlobalLoad_) calls in update_physical_pages
and set_page_stride_elements, we want a regression that catches any
boundary-crossing SRD bug -- the failure mode no single-batch test can
detect (one batch correct, others wrong).
batch_size=4 partitions the >2GB page pool across 4 sequences (last
sequence absorbs the remainder), exercising 3 cross-batch SRD transitions.
The SDPA reference is computed per-batch and concatenated; per-iteration
free + empty_cache keeps peak memory at one batch's worth.
Verified on:
- gfx950 (smci355-gfx950, MI355X): 160 passed, 32 skipped
- gfx942 (smc300x-clt, MI308X): 160 passed, 32 skipped
Skips are the existing vectorized + page_size=1 incompatibility
(3D tensor layout), now 16 per batch_size value.
---------
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
* Fix top_k_per_row_prefill err when batched_token_num > 4096 (#2901)
* retune a8w8_blockscale_bpreshuffle_gemm for MI308 (#2896)
* revert gptoss tuned config (#2904)
---------
Co-authored-by: zhuyuhua-v <yuhzhu@amd.com>
* mHC: Optimize mhc_pre performance in small M (#2915)
* mHC: Optimize mhc_pre performance in small M (add tileN=16 on mhc_pre_gemm_sqrsum && add splitk on mhc_pre_big_fuse)
* update test
* fix(mha_bwd): pass independent strides for do in _bwd_preprocess (#2808)
* fix(mha_bwd): pass independent strides for do in _bwd_preprocess
- adding independent `stride_do_b/h/m/k` parameters to `_bwd_preprocess`
and addressing dO with them;
- passing `*do_strides` (already computed in the wrapper) alongside
`*o_strides` when launching the preprocess kernel.
* pass independent strides for flash_attn_fused_backward kernel
* fix typo
* Fix batched_gemm_a8w8 correctness for ASYNC_COPY + gfx950 UT (#2867)
* [Triton]: Add MoE a16w4 (#2770)
* [Triton]: Add MoE a16w4
* Fix Black issue
* Removed x scales and improved arch checks in unit test
* Added benchmark
* Fixed PR review issues
* PR review comments fixes
---------
Co-authored-by: Rahul Batra <rahbatra@amd.com>
* test(moe_2stage): add csv-driven mode covering model_configs fmoe csvs (#2825)
---------
Signed-off-by: zhimding <zhiming.ding@amd.com>
Signed-off-by: zhimding <zhimding@amd.com>
* fix mhc device (#2916)
* CI: add standalone OPUS test workflow (#2928)
* CI: log in before starting OPUS test container
Authenticate with the Docker registry before `docker run` so OPUS test jobs can pull the image reliably across runners.
* CI: add docker login before ATOM base image pull (#2931)
* CI: add docker login before ATOM base image pull
Authenticate with the Docker registry before pulling the ATOM base image so the workflow can fetch the container reliably across runners.
* CI: Use build-only-aiter runner to build Triton wheels (#2934)
* fea(car): support custom group device (#2703)
* fea(car): support custom group device
Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
* [fix]: test script code format
Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
* fea(car): support multi communication groups entity
Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
* fea(ag): support multi group allgather
Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
* fix: test script format
Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
* fea(car): add reduce_scatter interface
Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
* fix(car): custom group config
Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
* fix: custom comm group set method
Signed-off-by: root <root@smci355-ccs-aus-n01-17.cs-aus.dcgpu>
---------
Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
Signed-off-by: root <root@smci355-ccs-aus-n01-17.cs-aus.dcgpu>
Co-authored-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@smci355-ccs-aus-n01-17.cs-aus.dcgpu>
* [FLYDSL]: update version to 0.1.5.dev504, fix bug in if-else dynamic grammar (#2935)
* [CK] Add StreamLLM sink token support for batch_prefill pipeline (#2794)
* [CK] Add StreamLLM sink token support for batch_prefill pipeline
Update CK submodule to ROCm/rocm-libraries#6479 (commit feea6be) which
makes kHasSink_ a real template parameter in fmha_fwd_batch_prefill_traits_,
enabling _sink/_nsink kernel variant codegen for batch_prefill.
AITER-side changes thread sink_size through the full call chain:
Python (aiter/ops/mha.py):
- Add sink_size: int = 0 to mha_batch_prefill_func, _mha_batch_prefill,
mha_batch_prefill stub, and cmdGenFunc_mha_batch_prefill
- Use has_effective_sink = sink_size > 0 and (causal or has_window_mask)
for _sink/_nsink module name selection, matching C++ mask logic
C++ interface (csrc/):
- mha_batch_prefill_traits: expose has_sink parameter (was hardcoded false)
- mha_fwd_batch_prefill.cu: derive has_sink = args.sink_size > 0
- mha_batch_prefill_kernels.cu: add int sink_size param, include it in
mask_identify string, set args.sink_size = mask.sink, zero-initialize
fmha_batch_prefill_args{} to avoid UB
- PyBind (rocm_ops.hpp) and declaration (torch/mha_batch_prefill.h):
add sink_size positional parameter after window_size_right
Semantics clarified:
- sink_size: number of first KV tokens always attended (sink phase loop)
- sink_ptr[nhead]: fixed logit for a virtual sink token in softmax
(null -> -inf, non-null -> user value); independent of sink_size
- window_size_left=L means k in [abs_q-L, abs_q] (L+1 tokens), verified
via code derivation (block_masking.hpp) and discriminating tests
Verified:
- No-sink paths numerically correct vs torch reference (max_diff<0.004)
- sink_ptr virtual token semantics correct (max_diff<0.004)
- window=1024 + sink_size=4 + sink_ptr: max_diff=0.000488
* [test] Add StreamLLM sink token tests for batch_prefill
Add test_batch_prefill_sink pytest function and supporting helpers
to op_tests/test_batch_prefill.py:
- ref_masked_attention_with_sink: torch reference implementing
sink semantics (first sink_size KV positions always valid,
sink_ptr virtual token appended to softmax attention matrix)
- run_batch_prefill_sink: runs both reference and CK kernel,
compares with get_tolerances() thresholds
- test_batch_prefill_sink: parametrized over:
batch_size=[1,2], page_size=16, head_dim=128
num_qo_heads/num_kv_heads=[(8,1),(4,4)] (GQA + MHA)
qo_len=[32,128]
(window_left, kv_len)=[(128,512),(1024,2048)] <- real gap in both
sink_size=[4,16]
sink_ptr_value=[None, 0.0, 2.0]
dtype=[bfloat16, float16]
Verified manually with batch=1 GQA 8/1 window=128 sink=4 ptr=2.0,
batch=2 MHA 4/4 window=1024 sink=16 ptr=None, and sink_ptr=0.0.
* [style] Black format op_tests/test_batch_prefill.py
* [fix] Address Copilot review on batch_prefill sink PR
- aiter/ops/mha.py: add missing sink_size parameter to
mha_batch_prefill_fake_tensors so its signature matches the real
mha_batch_prefill op; without this torch.compile / fake-tensor mode
raises TypeError when invoking the op.
- op_tests/test_batch_prefill.py: vectorize the StreamLLM reference
mask construction. The previous double Python loop did per-element
assignment into a CUDA tensor, triggering O(seqlen_q * seqlen_k)
GPU sync points (262144 at seqlen_k=2048) and would time out the
parametrized tests. Replace with broadcasted index tensors and a
single masked_fill_.
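A sketch of the vectorized reference mask described above (simplified; the exact semantics live in ref_masked_attention_with_sink in op_tests/test_batch_prefill.py):

    import torch

    def streamllm_mask(seqlen_q, seqlen_k, window_left, sink_size, device="cuda"):
        q_idx = torch.arange(seqlen_q, device=device).unsqueeze(1)  # [q, 1]
        k_idx = torch.arange(seqlen_k, device=device).unsqueeze(0)  # [1, k]
        abs_q = q_idx + (seqlen_k - seqlen_q)                       # align q to absolute kv positions
        causal = k_idx <= abs_q
        in_window = k_idx >= abs_q - window_left                    # window_size_left=L keeps L+1 tokens
        is_sink = k_idx < sink_size                                 # first sink_size kv tokens stay attendable
        valid = (in_window | is_sink) & causal
        # One masked_fill_ over broadcast index tensors -- no per-element GPU syncs.
        return torch.zeros(seqlen_q, seqlen_k, device=device).masked_fill_(~valid, float("-inf"))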
* [test] Wire StreamLLM sink scenarios into CI __main__ block
CI invokes test files via `python3 "$file"`, which only runs the
__main__ block; pytest functions like test_batch_prefill_sink were
therefore never exercised in CI.
Add a second __main__ block at the end of the file that runs
run_batch_prefill_sink over a small representative parameter sweep
(2 window/kv combos x 2 sink_ptr settings, sink_size=4, bf16). The
helpers are defined after the original __main__, so a separate
trailing __main__ block is the minimal-blast-radius way to keep
both invocations independent.
* [CK] Bump submodule to include StreamLLM sink for batch_prefill
Update CK submodule from 08792e0b3 to d22aafb48 to pull in
[CK][fmha] Add StreamLLM sink support to batch_prefill pipeline (#6479).
* Set RNG seed in Triton tests (#2921)
* [quant] refactor hip kernels (#2932)
---------
Signed-off-by: zhiding512 <zhimding@amd.com>
Signed-off-by: zhimding <zhiming.ding@amd.com>
Signed-off-by: Tres Popp <tres.popp@amd.com>
Signed-off-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
Signed-off-by: zhimding <zhimding@amd.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: fsx950223 <fsx950223@outlook.com>
Signed-off-by: root <root@smci355-ccs-aus-n01-17.cs-aus.dcgpu>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: gyohuangxin <42127654+gyohuangxin@users.noreply.github.com>
Co-authored-by: BingYuan.Zhou <BingYuan.Zhou@amd.com>
Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: Vinayak Gokhale <vinayak.gokhale@amd.com>
Co-authored-by: RuibinCheung <ruibzhan@amd.com>
Co-authored-by: ruanjm <jiming.ruan@amd.com>
Co-authored-by: Peng <peng.sun@amd.com>
Co-authored-by: Peng Sun <pensun@Pengs-MacBook-Pro.local>
Co-authored-by: yzhou103 <Ying.Zhou2@amd.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: TennyWang1223 <Tenny.Wang@amd.com>
Co-authored-by: Nicholas Susanto <nicholas.susanto@amd.com>
Co-authored-by: Lukasz Burzawa <lukasz.burzawa@amd.com>
Co-authored-by: amd-ruitang3 <145657428+amd-ruitang3@users.noreply.github.com>
Co-authored-by: Su Ann Chong <suachong@amd.com>
Co-authored-by: Elton <zhimding@amd.com>
Co-authored-by: Jiayun <jiayyu@amd.com>
Co-authored-by: Stig-Arne Grönroos <sgronroo@amd.com>
Co-authored-by: amd-ruitang3 <rui.tang2@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Sergey Solovyev <sergey.solovyev@amd.com>
Co-authored-by: Sergey Solo <ssolovye@amd.com>
Co-authored-by: Yutao Xu <xytpai@foxmail.com>
Co-authored-by: Dragan Mladjenovic <dragan.mladjenovic@amd.com>
Co-authored-by: Nidal Danial <81209936+nidal567@users.noreply.github.com>
Co-authored-by: Nico Holmberg <nico.holmberg@amd.com>
Co-authored-by: Tres Popp <tres.popp@amd.com>
Co-authored-by: GeisYaO <q1179671016@gmail.com>
Co-authored-by: Xudong Yuan <xudoyuan@amd.com>
Co-authored-by: fangche123 <Fang.Che@amd.com>
Co-authored-by: minmengdie <memin@amd.com>
Co-authored-by: Kiran Thumma <167153338+kiran-thumma@users.noreply.github.com>
Co-authored-by: TennyWang1223 <root@hjbog-srdc-24.amd.com>
Co-authored-by: lucas-santos-amd <Lucas.Santos@amd.com>
Co-authored-by: la <46212055+junhaha666@users.noreply.github.com>
Co-authored-by: Bruno Mazzotti <bruno.mazzotti@amd.com>
Co-authored-by: PerryZhang01 <Perry.Zhang@amd.com>
Co-authored-by: perzhang <perzhang@amd.com>
Co-authored-by: Zzz9990 <zanzhang@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: eppaneamd <rleppane@amd.com>
Co-authored-by: HaonanWang98 <hwang@amd.com>
Co-authored-by: Pleaplusone <ygan@amd.com>
Co-authored-by: JaxChen29 <jichen@amd.com>
Co-authored-by: Petr Kurapov <petr.kurapov@gmail.com>
Co-authored-by: XiaobingZhang <xiaobingzhangupc@gmail.com>
Co-authored-by: zhuyuhua-v <yuhzhu@amd.com>
Co-authored-by: Yi DING <andy-ding@outlook.com>
Co-authored-by: Felix Li <felix.li@amd.com>
Co-authored-by: slippedJim <jim.guo@amd.com>
Co-authored-by: jian.wu <jwu10003@amd.com>
Co-authored-by: Pedram Ghazi <pedghazi@amd.com>
Co-authored-by: root <root@chi-mi300x-003.ord.vultr.cpe.ice.amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: Xu, Shengnan <117875955+shengnxu@users.noreply.github.com>
Co-authored-by: Oxana Korzh <oxana.korzh@amd.com>
Co-authored-by: okorzh-amd <okorzh-amd@users.noreply.github.com>
Co-authored-by: who who who <fsx950223@outlook.com>
Co-authored-by: yijin <yijin@amd.com>
Co-authored-by: Jeff Huang <jiaji.huang73@gmail.com>
Co-authored-by: chuanbowang2026 <chuanwan@amd.com>
Co-authored-by: rahulbatra85 <rahulbatra85@gmail.com>
Co-authored-by: Rahul Batra <rahbatra@amd.com>
Co-authored-by: root <root@smci355-ccs-aus-n01-17.cs-aus.dcgpu>
Co-authored-by: Linjun-AMD <Jun.Lin@amd.com>