Conversation

@fsx950223
Contributor

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Copilot AI review requested due to automatic review settings October 28, 2025 10:05
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds ahead-of-time (AOT) compilation support by introducing a new module to pre-compile various configurations of the PA (Paged Attention) operation and adding conditional logging guards to reduce verbosity during bulk compilation.

  • Adds aiter/aot/pa_v1.py to generate and compile all PA operation variants
  • Adds logging level guards (AITER_LOG_MORE >= 2) to reduce compilation log noise

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| csrc/cpp_itfs/utils.py | Adds conditional logging guards around build start/finish messages and template compilation logs |
| aiter/aot/pa_v1.py | New AOT compilation script that generates and compiles PA operation variants using parallel processing |
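As a rough illustration of the pattern the review describes — enumerate PA variant configurations, compile them in parallel, and gate per-variant logging behind `AITER_LOG_MORE >= 2` — here is a minimal sketch. Only the `AITER_LOG_MORE` name comes from this PR; `compile_pa_variant` and the configuration axes are hypothetical, not the actual aiter/aot/pa_v1.py code.

```python
import itertools
import os
from multiprocessing import Pool

# AITER_LOG_MORE is the environment variable named in this PR; everything else
# here (compile_pa_variant, the configuration axes and values) is illustrative.
AITER_LOG_MORE = int(os.environ.get("AITER_LOG_MORE", "0"))


def compile_pa_variant(cfg):
    """Hypothetical stand-in for compiling one PA kernel variant."""
    dtype, block_size, gqa_ratio = cfg
    if AITER_LOG_MORE >= 2:
        print(f"start building pa_v1 dtype={dtype} block={block_size} gqa={gqa_ratio}")
    # ... render the kernel template and invoke the compiler here ...
    if AITER_LOG_MORE >= 2:
        print(f"finished pa_v1 dtype={dtype} block={block_size} gqa={gqa_ratio}")
    return cfg


if __name__ == "__main__":
    # Enumerate every combination up front, then compile them in parallel.
    variants = list(itertools.product(["fp16", "bf16"], [16, 32], [1, 8, 16]))
    with Pool() as pool:
        pool.map(compile_pa_variant, variants)
```

The real script presumably drives the existing template-compilation utilities in csrc/cpp_itfs/utils.py; the sketch only shows the fan-out and the logging gate.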

Collaborator

@coderfeli left a comment

LGTM

@coderfeli merged commit 77a5427 into main Oct 30, 2025
16 checks passed
@coderfeli deleted the aot_fix branch October 30, 2025 06:49
ganyi1996ppo pushed a commit that referenced this pull request Nov 19, 2025
* fix aot

* remove other kernels path

* fix aot

* format code

---------

Co-authored-by: root <root@hjbog-srdc-24.amd.com>
zhuyuhua-v pushed a commit that referenced this pull request Nov 23, 2025
LJ-underdog pushed a commit that referenced this pull request Nov 27, 2025
valarLip added a commit that referenced this pull request Dec 16, 2025
* add sink_size parameter in mha_fwd and varlen_mha_fwd

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update mha.py

* update mha_varlen_fwd_kernels

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* ca_refactor_fix (#1268)

* ca_refactor_fix

* more update

* fix_fp4_quant_dtype (#1271)

* add sample use outer exponential (#1267)

* update sample outer exp

* update mix sample use outer exponential

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* Fix rowwise a8w8 gemm in swizzled hipb_mm (#1258)

* fix rowwise

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* guard hipb_mm output type in unit test

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* correct tuning for rowwise

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* expose b_preshuffle for tuning

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* fix lint

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

---------

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com>

* CI: Use TW cluster to run sglang tests (#1273)

* Catchall PR for all 355_wip related changes (#1148)

* Catchall PR for all 355_wip related changes

Co-authored-by: Shao-Chun Lee <Shao-Chun.Lee@amd.com>
Co-authored-by:  Mehmet Cagri <mehmet.kaymak@amd.com>

* add triton fp4 gemm preshuffle (#1187)

* Remove AOT path

* [TRITON] Add Positional Encoding (PE) support to Triton MHA kernels

Related PR: #1184

Patch applied according to Cagri's request.

* Tune more fp4 gemm shapes

* Add aot compilation option for fp4 preshuffled gemm

* Add precompiled fp4 preshuffled gemm kernels

* Add readme

* re-gen AOT binary files for LL 70B FP4 GEMM and update kernels and API helper, fix fused_mul_add logger bug (#1242)

* DS 355_wip fused_shared_expert (#1218)

documentation, fix some bugs, UT

* fused_rms_fp8_group_quant num_warps tunning

* black formatting

* remove redundent files

* unified attn. reorg., fixes, exp2 update

* add MI300 config for fused_gemm_a8w8_blockscale_a16w16

* recover test_mha to upstream

* black formatting

* update test fused_kv_cache

* bypass some UTs for MI300

* update FP8 assert for test_mha_varlen

* skip UT

---------

Co-authored-by: Shao-Chun Lee <Shao-Chun.Lee@amd.com>
Co-authored-by: Mehmet Cagri <mehmet.kaymak@amd.com>
Co-authored-by: Bruno Mazzotti <bruno.mazzotti@amd.com>

* [MI35X] fix core check (#1276)

* fix core check

* update

* Refactor gemm bf16 tuner (#1275)

* refactor GemmTuner

* update

* fix lint error

* fix lint error

* CI: Operators tuning pipelines (#1163)

* CI: Operators tunning pipelines

* Updates

* Updates

* Updates

* Updates

* Show computing unints

* Updates

* Updates

* Add op_tune.sh

* Updates

* Disable a4w4

* Updates the error handling

* Updates the error handling

* Updates

* Updates

* Updates

* Updates

* Update .github/scripts/op_tune.sh

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Updates

* Add uloading tuned CSVs

* Updates

* Add shape name

* Add shape arg

* Allows users to select the shapes they want to tune and specify the arguments they need for tuning.

* Only triggered when the untuned CSV files under the aiter configs folder are modified

* Test

* Updates

* Updates

* Update .github/workflows/operators-tuning.yaml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update csrc/ck_gemm_a8w8_blockscale_bpreshuffle/README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update a4w4_blockscale_untuned_gemm.csv

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix the problem that v3's performance is worse than ck's (#1237)

* fix fwd v3 kernel perf and opt err

* fix the python mha test error (#1277)

* Reuse custom decorator in core and torch guard (#1278)

* reuse custom decorator in core and torch guard

* remove useless

* rename all_reduce to avoid same name

* rebase

* fix bug

* use cpu device default

---------

Co-authored-by: root <root@hjbog-srdc-39.amd.com>

* integrate deep gemm (#1265)

* integrate m grouped gemm

* update ck

* add limit for 950

* rename deepgeem

* add a tuned config and insert entries in untuned config (#1243)

Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* Enable large batch size and optimization of non-Ragged batching (#1269)

* Enable large batch size and optimization of non-Ragged batching

* Add RAGGED_BATCH to test_la.py and bench_la.py

* add few more fw ds f4 untuned and tuned shapes for using asm kernel (#1298)

* CI: Optimize autotuning pipeline and inital the docs (#1286)

* CI: Optimize autotuning pipeline and inital the docs

* topk per row kernel (#1262)

* initial commit for topk per row kernel

* topk per row kernel initial commit

* Fix the typo issue

* Add the topk per row kernel

* optimizations for topk_per_row kernel

* fix overflow

* add unit test for topk_per_row_decode

* update test for decode

* apply vector dispatch from carlus

---------

Co-authored-by: root <root@smci355-ccs-aus-m06-29.cs-aus.dcgpu>
Co-authored-by: valarLip <340077269@qq.com>

* fix aot (#1279)

* fix aot

* remove other kernels path

* fix aot

* format code

---------

Co-authored-by: root <root@hjbog-srdc-24.amd.com>

* Fix ATOM fp8 model quant fail issue in torch compile (#1299)

* Fix fp8 issue in torch compile

* use less code

* feat - pa_fwd support block map with stride in num_kv_heads_dim (#1301)

* Fix how to update accumulator for dot_scaled (#1297)

* CI: Optimize autotuning pipeline docs (#1300)

* CI: Optimize autotuning pipeline docs

* Update docs/autotuning_pipeline.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix the lint issue (#1307)

* fix fwd perf calc error (#1305)

* fix fwd perf calc error

* black aiter/ops/triton/_triton_kernels/gemm_afp4wfp4.py

* add the asm kernel performance of fwd and bwd (#1270)

* add the asm kernel performance of the attention forwards and attention backwards

* modify perf data

* fix perf data

* add a16 perf data

* Fused TopK and Sigmoid kernel (#1251)

* Add topk softmax

* Add test for topk sigmoid

* register the op properly

* apply black

* don't use constexpr with std::string

* bump ck to include topk sigmoid commit

* hipify

* add argparse to the topk sigmoid test, also add pytest

* use own module instead of asm moe

* black formatting

* add missing file

* revert changes to module_moe_asm

* Ar rms (#1290)

* [fea]: add fused allreduce rmsnorm kernel

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* fix: fuse ar rms interface

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* fix: fuse ar rms interface

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* fix ck branch

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* update ar interface

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

---------

Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>

* Dsv32 cache (#1314)

* add indexer_k_quant_and_cache & cp_gather_indexer_k_quant_cache

* ndexer_k_quant_and_cache opt kernel and add test

* update

* update2

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* Fix displaying supported architectures (#1316)

Currently it will look like this:
```log
File "TransformerEngine/3rdparty/aiter/aiter/jit/utils/chip_info.py", line 77, in get_gfx_custom_op_core
  raise KeyError(
KeyError: 'Unknown GPU architecture: . Supported architectures: [0, 1, 2, 3, 4, 5, 6, 7, 8]'
```

Signed-off-by: Hollow Man <hollowman@opensuse.org>
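
The garbled list in that message ("Supported architectures: [0, 1, 2, 3, 4, 5, 6, 7, 8]") suggests the error string was built from indices rather than the architecture names. A minimal sketch of that failure mode and the obvious correction; `GFX_CORE_MAP` is a hypothetical stand-in, not the actual mapping in chip_info.py:

```python
# GFX_CORE_MAP is a made-up stand-in for whatever mapping chip_info.py keeps.
GFX_CORE_MAP = {"gfx90a": "...", "gfx942": "...", "gfx950": "..."}


def get_gfx_custom_op_core(gfx_arch: str) -> str:
    if gfx_arch not in GFX_CORE_MAP:
        # Listing indices (e.g. range(len(GFX_CORE_MAP))) yields the unhelpful
        # "[0, 1, 2, ...]" seen above; list the architecture names instead.
        raise KeyError(
            f"Unknown GPU architecture: {gfx_arch}. "
            f"Supported architectures: {sorted(GFX_CORE_MAP)}"
        )
    return GFX_CORE_MAP[gfx_arch]
```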

* using standalone pybind (#1317)

* using standalone pybind

* fix

* update

* Enable mha bwd hd192_hd128 (#1308)

* update codegen.py

* update kernels & kernel launch

* fix fa bwd dq_acc shape

* remove mask in python api

* CI: Add pre-check status check (#1252)

Creates a unified pre-checks.yaml workflow that runs Black, Ruff, and dependency checks and uploads success/failure signal artifacts.
The other heavy jobs download and verify these signal artifacts: if verification succeeds, they continue running; if it fails, they exit immediately.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [CK_TILE] fmha: Add backward pass support for padded inputs (#1212)

* [CK_TILE] fmha: Add backward pass support for padded inputs

Introduces support for padded sequence lengths in the backward pass of the variable-length flash attention (fmha_v3_varlen_bwd).
- Updated Python and C++ function signatures to accept optional `cu_seqlens_q_padded` and `cu_seqlens_k_padded` arguments.
- Modified the underlying CUDA kernels and code generation scripts to pass padding information via the new `seqlen_q_ptr` and `seqlen_k_ptr` fields in
     the CK `fmha_bwd_args` struct.
- Modified the underlying kernels and code generation scripts to correctly handle pointers for both padded and unpadded sequence data.
- Added comprehensive gradient verification to the test suite (`test_mha_varlen.py`) to ensure the correctness of the backward pass with various
     padding scenarios.

* [CK_TILE] fmha: Adapt to composable_kernel padding API changes

Refactor the FMHA forward and backward pass to align with the updated padding API in `composable_kernel`.

- Argument Simplification: Removed the manual calculation of `seqlen_q` and `seqlen_k` from `cu_seqlens` in the `mha.cu` interface. The underlying kernels now handle this logic.
- API Alignment: Updated the arguments passed to `aiter::mha_fwd` and `aiter::mha_bwd` to match the new `composable_kernel` API. This involves passing `cu_seqlen` pointers directly.
- Kernel Interface Update: Modified the `codegen.py` scripts for `gfx942` and `gfx950` to reflect the changes in the kernel's function signatures and argument handling for padded and unpadded sequence lengths.

* fix build error in op_tests/cpp/mha/benchmark_mha_*.cpp

* Mla splitkv enhance split alg inte (#1233)

* add num_kv_splits_indptr to mla for mtp<=4 case for now

* update

* update new kernel

* infrastructures

* 1st version of split kernel

* Fix issues raised by Lingpeng and fix the issue on batch_size

* update mla

* update mla_stage2

* 1st draft of v1 split program

* add kv_offset

* mla_splitkv_enhance_split_alg_inte

* splitkv debug

* 1st version of reduce kernel

* metadata & kernel finish

* add reduce

* final_lse is optional now.

* update kernel

* bug fix

* bug fix 1

* modify reduce api

* update kernel

* fix max splits

* bug fix 3

* fix s80 early return

* udpate calculation of partial_indx

* add per split test

* make lse support by ref

* test split

* fix redundant calculation of head offset in reduce kernel

* add custom test

* Add support of 128 head size

Fix how to get head count

fff

* update comments

* 1. Let large work be assigned first.
2. Add tolerance to the tile which is slightly smaller than kv_limit.

* Calculate kv_limit dynamically

* Fix bug about difference in split_kv(bool)

* add test

* fix seed

* Add global tolerance 16 in kv seqlen because main kernel cannot handle small splits (kv_seqlen<4) well.

* Fix warp=1 error

* Add a redundant mode so that the metadata output size is fixed; add new param: no_redundant. Reduce can support redundant input in reduce_indptr as well.

fix comm

* fp8 setup

* first version of device metadata

aaa

* Add work_ptrs

* Compatibility to CUDA Graph

* Refactor code. Merge 2 iterations of generate work together.

* Make sure that each batch of workload can never be splited to more than #cluster of tiles.

* Adjust metadata. Get 1% perf gain.

* Paralize most of metadata kernel

Make get_cost_top() paralized.

aaa

* add scale

* 1. Use warp-level bitonic sort to sort batch idx based on their cost for reducing #splits. 2. Use CK's warp ops.

* fp8 function pass

* Fix issues:
1. avg_workload cannot handle any batch!
2. split_kv(bool) is not correct when all the clusters are full.

* fp8 ready

* fix

* persistent ready

* add nv acc test

* rename

* updata metashape

* update reduce cu num

* update optest for mla

* fix cu num

* Update metadata and reduce kernels.

* rename kernels

* Add new param kv_granularity to metadata kernel.

* Introduce cal_workload_limit_global_v2

* Support qhead=128 cases.

* Change get_mla_metadata() api. Make some not important parameters be optional through a dict.

* fix potential problem on calculating tot_qo_tiles

typo

* refactor metadata files

aaa

* update metadata v1_2

* update gqa_128 mla_ps & fix metadata v1_2

* Optimize mla metadata v1.2

* Optimize mla metadata v1.2 Part.2

* Optimize mla metadata v1.2 Part.3

* update qlen <=4

* fix mla qlen1

* Optimize mla metadata v1.2 Part.4

* Make reduce_final_map be optional in mla_reduce_v1

* Slightly increase reduce perf

* Add persistent mode for mla reduce kernel

* add mla_a16w8_qh16_m16x4_n16x1_coex0_mask1_ps.co

* update deepseekv32 sparse mla metadata

* update mla_a16w8_qh16_m16x4_n16x1_coex0_mask1_ps.co

* Adjust code for sparse attn

* Optimize the a16w8 kernel

* Improve metadata v1.1 perf

* Make metadata v1.1 support sparse attn

bug fix

tiny fix

* Remove redundant code in mla_reduce

* futile struggle

* Fix issue after merge. aiter main branch is using torch.library.infer_schema which doesn't allow dict as parameter. Thus, change the API for metadata.

* Adjust metadata v1.1 and make this branch be ready to be merged to main branch.

* remove invalid co kernel

* Fix issue brought from f794ae4 which disabled hipify by default.

* support qolen>1 for sparse mla

* make code become prettier

* Fix issue in metadata v1.1

* Fix issue in test_mla.py

* Fix lint fails

* Fix sub-test fails in op_test/test_mla.py

* Fix regression in test_mla.py where mtp>1

* Add head_dim=128 support to reduce

* Add nhead=8 for pa and add assert to make sure the input tensors are in
float32.

* fix issue in vllm benchmark for deepseek: remove metadata v0 because it's not compatible with hip graph

* fix lint

* Revert all the change about mi350 gemm.

* add a8w8 and a16w8 kernel in mla mi350

* add A8W8 Non-persistent mode kernel

* Fix issue reported by Copilot

* add mla non-persistent test

* script: update a16w8 kernel

* rm test_mla_persistent_mi350.py and support mi350 in test_mla_persistent.py

* add mla_a16w16_qh16_m16x4_n16x1_coex0_mask1_ps.co

* fix a8w8 num_kv_split=1

* Fix issue in metadata v1.2 on qo_tiles > 1

* fix ut bandwidth

* Use nhead=16 simulate cases that nhead=16*N where N is in range(32,16*32+1,16)

aaa

Fix regression in sparse attn from the fix in metadata v1.2 for multi qo tile issue

* Add new api get_mla_metadata_info

* fix lint format issues

* Adjust get_mla_metadata_info_v1's parameters.

* update A16W8 kernel

* update A16W8 kernel2

* update A16W8 for mi300

* fix ut and rename some kernels

* rename mla kernel name for head 128

* remove log

* fix format

* add nativly back

* change zeros into empty

* fix with comments

---------

Co-authored-by: Jiming Ruan <Jiming.Ruan@amd.com>
Co-authored-by: zanzhang <zanzhang@amd.com>
Co-authored-by: Fang.Che <Fang.Che@amd.com>
Co-authored-by: minmengdie <memin@amd.com>

* Fix gemm tuner error mi350 (#1313)

* workaround-retry tuning when encounter invalid pointer

* workaround-retry tuning when encounter invalid pointer

* fix  lint error

* Update gemm_tuner.py

em timeout

* CI: Skip triton setup in Aiter standard/multigpu tests and add retries when setting up triton (#1325)

* CI: Skip triton in Aiter standard and multigpu tests

* Add retries when building triton

* Add ninja installation

* Fix global variable torch_fp8 initialization caused issue (#1322)

`hipGetDeviceProperties` is called during the `torch_fp8` initialization, which triggers full HIP runtime initialization while global variables are still being constructed. There are two issues:

- Several other global variables are involved in the runtime initialization, and the initialization order of global variables is not guaranteed, so the runtime may be initialized from variables that are themselves not yet initialized.

- A forked child process needs to initialize its own HIP runtime to get a valid GPU driver context and handles. Because the parent process has already initialized the runtime globally, the forked process assumes the runtime is initialized and uses it directly, even though it is invalid in the child.

The fix is to ensure `hipGetDeviceProperties` is only called when actually needed, not during static initialization.

To reproduce the issue:
1. fork a child process
2. call torch.empty in the child process

It will fail with a `hipErrorInvalidValue` error.

Co-authored-by: Hui Zhou <huizhou@meta.com>
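
The actual fix lives in the C++ layer, but the pattern it describes is plain lazy initialization. A small Python sketch of the same idea (names here are illustrative; `device_properties` stands in for the `hipGetDeviceProperties` call):

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def device_properties():
    """Stand-in for the hipGetDeviceProperties call described above.

    Nothing runs at import time, so a process that forks before the first
    call no longer inherits a half-initialized runtime from its parent.
    """
    import torch

    return torch.cuda.get_device_properties(0)


# Callers query the properties only when they actually need them, e.g.:
#   arch = device_properties().gcnArchName
```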

* Add transpose scale to the triton fused_rms_fp8_group_quant (#1291)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

* [Triton] 355 wip Llama FP4 triton fusion + TP8 triton decode shape tunning (#1315)

* update AOT, always pad x_scale when generating input, add UT

* update act_mul_mxfp4_quant, fused_rms_mxfp4_quant

* add LL FP4 configs and AOT files for TP8 shapes

* fix UT bug

* add LL TP2 and TP4 shapes

* [TRITON] Kernel naming: add reusable constexpr repr helper (#1260)

* Kernel naming: add reusable constexpr repr helper for gemm a16w16

* add missing params to the repr

* Merge tuned file (#1327)

* merge tuned_file with same prefix

* fix lint

* rename tuned_gemm.csv to bf16_tuned_gemm.csv to avoid matching wrong file

* update

* update README.md of bf16 GemmTuner

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* fix graph_breaks by return tensor for bool op (#1333)

* fix_bf16gemm_asm (#1329)

* fix_bf16gemm_asm

* update

* update

* Improve Memory Usage in MLA (#1338)

* Improve mla memory

* further reduce memory usage

* Fix lint issues

* Fix issue reported by Jun Chen.

* fix tune error caused by merge tuned_file (#1342)

* fix tune error caused by merge tuned_file

* fix lint error, rm some log

* rm rocblas in tuner (#1337)

* [Triton] DS a16w8 GEMM and fused reduce_rms_fp8_group_quant (#1328)

* add gemm_a16w8_blockscale and fused_reduce_rms_fp8_group_quant

* black formatting

* add MI300 config

* fix commit

* Add block_m=16 for a8w8_ck_moe_blockscale (#1081)

* Add block_m=16 for a8w8_ck_moe_blockscale

* fix moe_blk_scale token<64 to ck2stage for ds shape

* fp8 moe bugfix tuning rebase tuned_fmoe.csv

* add L2 check

* reformat

---------

Co-authored-by: Oscar Xu <huuaiguxu@amd.com>
Co-authored-by: xudoyuan <xudoyuan@amd.com>

* Add Fused RMSNorm + FP8 Per-tensor Static Quantization Triton Kernel (#1330)

* Fused Triton RMSNorm and FP8 static quantization

* Formatted python scripts

* [TRITON] GEMM kernels nomenclature changes (#1283)

* Kernel naming: add reusable constexpr repr helper (#1260)

* Add missing API documentation

* Temporarily run aiter standard and multigpu tests on the TW cluster, will switch back once the mirror registry is ready. (#1359)

* [Triton] Disable failing lean attention tests (#1357)

* add config (#1355)

* add how_v3_bf16_cvt control to the Python API (#1351)

* add how_v3_bf16_cvt in fwd_v3

* fix the fwd compile

* [fix]: car 6 rank coredump (#1335)

* [fix]: car 6 rank coredump

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* [fix]: add residual out in ar rms

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

---------

Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>

* Wrapper_flash_attn_backward custom op to avoid functionalize fallback and fix guard logic (#1348)

* make can_mha_v3 uninplace and fix guard

* fix error

* wrapper _flash_attn_backward in custom

* make gemm a8w8/a4w4 into custom

* add some op in custom

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* [TRITON] GEMM kernels nomenclature changes (#1292)

* implemented the use of kernel repr helper to standardize kernel metadata representation

* Add Missing API documentation

* remove commented code

* remove FILL_VALUE to keep kernels name meaningful

* [TRITON] Initial implementations of sparse attention kernels (#1296)

fp8_mqa_logits: Calculates the logits (prefill stage) to be used for top-k selection.
unified_attention_sparse_mla: Sparse attention implementation for DeepSeek-like MLA using the MHA approach, where the kv cache is [seq_len_kv, 1, HEAD_SIZE + kv_lora_rank + rope_rank] and q is [seq_len, NUM_HEADS, kv_lora_rank + rope_rank].
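
A tiny sketch that just materializes the layouts quoted above (sizes are arbitrary placeholders; only the shape structure follows the description):

```python
import torch

# Arbitrary placeholder sizes; only the shape structure follows the text above.
seq_len, seq_len_kv = 128, 4096
NUM_HEADS, HEAD_SIZE = 16, 128
kv_lora_rank, rope_rank = 512, 64

# Shared latent KV cache: one entry per token, no per-head dimension.
kv_cache = torch.empty(seq_len_kv, 1, HEAD_SIZE + kv_lora_rank + rope_rank)

# Queries keep a per-head dimension over the latent + rope parts.
q = torch.empty(seq_len, NUM_HEADS, kv_lora_rank + rope_rank)
```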

* [MI35X]cktile moe a16w4 support (#1341)

* draft of cktile moe

* align the interface of main branch to make cktile moe compile pass

* refine code

* refine ck moe

* fix CI build fail about code style

* remove ck blockscale moe modification

* refine code

* fix CI build fail of unsupport block_m=16

* refine format

* fix conflict

* update

* format

* fix format

* update

* update

* update

* format

* format

* remove useless

* fix sorting

---------

Co-authored-by: solin <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu>
Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: Oscar Xu <huuaiguxu@amd.com>
Co-authored-by: zhimding <zhimding@amd.com>
Co-authored-by: felix <felix.li@amd.com>

* [TRITON] Batched GEMM kernels nomenclature changes (#1293)

* implemented the use of kernel repr helper to standardize kernel metadata representation

- batched_gemm_bf16.py
- batched_gemm_a8w8.py
- batched_gemm_afp4wfp4.py (main + reduce kernel)
- batched_gemm_afp4wfp4_pre_quant.py
- batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant.py

* Add Missing API documentation

* remove Dtype to avoid having invalid names in the repr

* [TRITON] Instruction shape fix for Gluon gemm_a8w8_blockscale kernel (#1261)

* fixed instr_shape error for mfma layout

* removed test skips for uneven K

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* moe mxfp4 block_m = 64/128 (#1266)

* moe mxfp4 block_m = 64/128

* update a4w4_gemm2_kernels_list

* add instance tile_m=32

* tuned configuration

* Update test_moe_2stage.py

* refactor

* update v1 pipeline

* update badcase

* fix fp4 moe tuner

* reformat

* revert ck update

* update ck

* Moe mxfp4 ck preshf bns (#1312)

* python code of nbs compatible

* bns compatible

* fix global

* bug fix

---------

Co-authored-by: zhimding <zhimding@amd.com>

* add AITER_MXFP4_MOE_SF switch for mxfp4 moe

* v3 n128

* 32x32 v1

* resolve ck conflict

* rm use_int4=True

* reformatted op_tests/test_moe_2stage.py

* AITER_MXFP4_MOE_SF bugfix

* revert torch.int4

---------

Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: zhimding <zhimding@amd.com>
Co-authored-by: felix <felix.li@amd.com>

* bug fix (#1370)

* [opus] enhance opus utility (#1324)

* enhance tuple and add waitcnt

* refactor vectorized issue space

* wip cached layout

* support cached layout

* add smem

* support broadcast dtype in store()

* Fix issue in metadata v1.2 where batch size is too large (#1352)

* Fix issue in metadata v1.2 where batch size is too large. V1.1 is hopeless in these cases...

* lds_partial_info is not used when there is no further tile splits on qo.

* [GEMM][Config] add a8w8 block scale tuned config for deepseek-v3 (#1310)

* add a8w8 gemm tuned config with block scale for deepseek-v3 shapes

* reorganize the config files

* revert unnecessnary changes

* add headers

* move ds configs to specific model config file

---------

Co-authored-by: guanbao <gyu@amd.com>

* support all logit values (#1323)

* support all logit values

* fix tests

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* CI: Skip triton in Aiter standard and multigpu tests (#1374)

Triton tests will only run when files under the following paths change:
 - aiter/ops/triton
 - op_tests/triton_tests

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* add the performance data bar chart in the readme (#1372)

* add the performance data bar chart in the readme

* fix the wrong remarks

* force ds ptpc moe use 1 stage moe (#1373)

* [TRITON]  MHA PA optimizations (#1245)

Optimizations:
New config.
Preloading v
Reordering pe matmul
Simplifying scaling ops (for sm_scale and log_2(e) related scaling)
Masking related changes

* Enable fa multii target build on other arch (#1318)

* enable fa multii target build on other arch

* update arch info when dispatch in python api

* update

* Support mixed V2/V3 arches

* format

---------

Co-authored-by: Ilya Panfilov <Ilya.Panfilov@amd.com>

* [Triton] DS FP4 triton fusion (#1371)

* add fused_gemm_afp4wfp4_a16w16.py

* fix bug

* add fused_reduce_act_mul_and_mxfp4_quant

* add gemm_a16wfp4.py

* fix

* fix

* fix

* clean up

* repr

* update AOT with repr

* fix bug

* add dummy heuristics

* add output_unquantized_inp1 to fused_rms_mxfp4_quant

* add configs

* fix bug, tune fused_reduce_act_mul_and_mxfp4_quant

* fix

* fix

* final clean up

* add batched_gemm_a16wfp4

* clean up

* add config

* add default config

* remove old kernels, add API redirection and deprecation warning

* add fused_gemm_afp4wfp4_mul_add

* [TRITON] Simplify and optimize triton_kernels moe code and move it into aiter (#1326)

First PR for MoE with optimized support for GPTOSS shapes and fp8 x fp4.

* Use torch.zeros_like instead of empty_like to prevent accruacy drop (#1387)

* CI: Temporarily using old vllm nightly image (#1389)

* Revert "[Triton] DS FP4 triton fusion (#1371)" (#1392)

This reverts commit 4aabf79.

* add a8w8 ptpc gemm config for dsv3 (#1382)

* add ninja to install_requires in setup.py, fix ck gemm a8w8 bshuffle heuristic dispatch not support mnk=(2048,2112,7168)

* add 085 ptpc gemm tune config

* rename csv

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* Test the CI on both MI325 and MI355 (#1364)

* Always run tests on mi355

* [Triton] change BF16 GEMM config filename (#1398)

* Support distributed_init_method and DP in init_distributed (#1353)

* support distributed_init_method in init_distributed

* support dp in aiter distribute

* fix only tp error

* FA V3(fp8) and paged Attention compressed (CI green) (#1065)

FA V3(fp8) and paged Attention compressed

FA V3(fp8) and paged Attention

FP8 Prefill work compressed

Fa V3 api

Compress fp8 work so far

pull cast out of torch function

e2e fp8 stub

emulate fa v3

ignore

remove example

clean up forward

save

fp8 backward

ignore train artifacts

just use return_attn_probs

match fa behvaior

save fa ref

add fa_ref

fix dropout bug

add link

optional fp8 p descale

rename to v3

fa v3

clean up

match backward

min diff

update varlen api

clean up FP8_P_DESCALE

update bench and test

lint

fix mha varlen bug

remove .gitignore

save

lint

remove skip

bring back skips

add fa module

update v2 interface

create mha_v3

add old v3 path

update fa module

tests passing

sync bwd changes

lint fa module

add kvcache api and test

fix lint

fp8 works

test fp8 only

add paged tests

add flash_attn_with_kvcache to v2

test varlen

move to _triton_kernels

test_mha_backward working with v3

upgrade to cleanedup modeule

get test_mha_backward_varlen working

clean up

fix lint bug

move casting functions to utils

fix lint

Update aiter/ops/triton/utils/mha_kernel_utils.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update aiter/ops/triton/utils/mha_kernel_utils.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

use Optional

update from main_perf

lint

update fp8 backward

lint

descale factor is fp32

lint

dequant backward

match dims

Sync with FA main_perf

dequant in backward

pass descale to kernel directly

better kernel naming

simple fp8 path. no transpose

clean up bwd

save

bring back split

min diff

pass descale to bwd

lint

fix bwd nans

FP8_AUTO_DESCALE

use hk for other backwards

fp8 wrapper

lint

rm matrix_instr_nonkdim

split v2 and v3 cleanly

lint

back to og

minimal change

test_mha passes

test green

* is_shuffled (#1377)

* is_shuffled

* shuffle_weight bugfix

* rm AITER_MXFP4_MOE_SF

* preshuffle bugfix

* refactor

* refactor bugfix

* add bns/preshuffle moe mxfp4 UT tests

* add L2 verification

* black op_tests/test_moe_2stage.py

---------

Co-authored-by: zhimding <zhimding@amd.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* Ar rms new interface (#1401)

* [fix]: fused_ar_rms interface

Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com>

* delete comment

Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com>

* change ut case

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* fix ut format err

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* [fix]: ar acc err

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

---------

Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com>
Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: amd-ruitang3 <Rui.Tang2@amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>

* minor fix for mi355 (#1408)

* [Fix] Add sliding window feature for paged_attention_v1 (#1362)

* add sliding window to paged_attention_v1

* Avoid use fmha_v3_varlen_fwd on unsupported architecture gfx90a

* make  `sliding_window` default to 0 for better compatibility

* fix possible used compilation problem

* fix ci failure

* fix ci failure

* add a single test to avoid increasing test time a lot

---------

Co-authored-by: Xiake Sun <xiake.sun@amd.com>

* fused_qk_rope_cat_and_cache_mla: Fix Triton compilation error and batch size constraint and output tensor sizing (#1407)

* Fix Triton compilation error by nesting OUTPUT_Q_NOPE_ZEROS condition

* Correct batch size constraint and output tensor sizing

* max mla splits perbatch (#1390)

* fix issues

* add limit for split num per batch

* fix non-ps num kv split

* fix issue for big batch size

* fix logits alloc

* fix black code stype

* fix ut

* update git ignore& remove aiter/install_mode

* update qh16 qseqlen4 kernel

* update

---------

Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: Fang.Che <Fang.Che@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* topk_per_row_opt (#1394)

* topk_per_row_opt

* Optimize topk by integrating air-topk kernel (TODO: proper workspace size computation and init) (#1409)

* update

* slightly optmize

* add back

---------

Co-authored-by: Cu Cui <cu.cui@alumni.uni-heidelberg.de>
Co-authored-by: carlushuang <carlus.huang@amd.com>

* Fix fused_rms_mxfp4_quant comment (#1369)

* leanAttn softmax fix for spurious data mismatch test failures (#1396)

* leanAttn softmax fix for spurious data mismatch test failures

* black fix

* Remove unused parameters per PR review request

* Black fix

* Add reduce_scatter api (#1413)

* add reduce_scatter api

* add reduce_scatter api

* fix error in fmoe_tuner (#1405)

* fix error in fmoe_tuner

* fix error when tuning QuantType.per_1x32

* rm redundant code

* optimize thread divergence (#1421)

* [TRITON] complex number multiplication that supports 3D ROPE triton kernel (#1061)

* complex number multiplication that supports 3D ROPE triton kernel

* Merge remote-tracking branch 'origin/feature/rope3d-fix' and resolve conflicts

* confilcts resolve

* conflicts resolve

* moved code places

* fix typo

* fix typo

---------

Co-authored-by: Zhu Jiale <root@hjbog-srdc-52.amd.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* Feat: pa_mqa_logits performance optimization & support kv_preshuffle + blocksize16/64 (#1424)

* Update gluon_pa_mqa_logits using preshuffle

* Minor update

* Add finer pipeline granularity

* Add sched_group_barrier optimization

* Minor update

* Support blocksize64 for preshuffle pa_mqa_logits

* Support logits JIT on triton 3.5

* Improve splitkv strategy

* Eliminate redundant conditon check

* Add missing oob check

* Resolve reviews

* [Config] add tuned moe and gemm config for qwen3 235b (#1378)

* add moe tuned config

* add gemm tuned config

* move tuned moe config to model specific file

---------

Co-authored-by: guanbao <gyu@amd.com>

* fix repeated unnecessary device check (#1221)

* remove device check

* 8 devices

* more

* more

---------

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>

* remove lru func in fake (#1429)

* Temporarily disable the test on mi355 (#1437)

* Enable MI355 test on main branch

* CI: Aiter tests bug fix

* [M308] tune silu&act (#1404)

* Vectorized loads and stores  combined with the packed multiply path

* format code II

---------

Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* add deepseek ep moe tune config (#1431)

* add ptpc deepseek ep moe tuned config

* add block deepseek ep moe tune config

* using 1stage moe for ptpc deepseek

* [TRITON] Moe a8w4 tuning (#1410)

* fuse routing kernels for small batches

* tune batch=1024

* [TRITON]  Apply config-aware naming (kernel_repr) to attention kernels (#1295)

* Apply kernel_repr to attention kernels

Applied make_kernel_repr helper to 4 attention kernel files:
- pa_decode.py (6 kernels)
- pa_prefill.py (2 kernels)
- chunked_pa_prefill.py (1 kernel)
- mla_decode_rope.py (2 kernels)

Each kernel now has config-aware naming with constexpr parameters
included in the kernel metadata name.

Base: amd/satya/kernel_config_to_name

* Apply kernel_repr to attention kernels

* fix indentation error and add kernel_repr to a missed kernel

* Add descriptions to missing API descriptions

* remove unused imports

* fix runtime error

* revert lean atten to main

* lean atten repr and API desc

* formatting fix

* Update aiter/ops/triton/pod_attention.py

* [fix]: prebuild gen so (#1412)

Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* [TRITON] FP8 MQA optimizations (#1422)

FP8 MQA optimizations AND bench. script

* redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning (#1366)

* redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning

* add required keys to fused_moe_1stage_dict

* add kernel descriptors and code object files

* add 32x128 file descriptors and code objects for tuning

* move code objects and kernel descriptors to correct csv

* remove unnecessary import, add quant type argument

* move fused_moe_stage1_tkw1 into fused_moe.py

* remove unnecessary kernel code object files

* Add missing comma

* saved modified tuned fmoe config for testing purposes

* apply black required formatting

* remove fused_moe_stage1_tkw1 and place aiter.fmoe_g1u1_tkw1 under fused_moe_1stage

* remove unnecesary arguments

* apply black formatting

* simplify aiter.fmoe_g1u1_tkw1 call

* add doweight_stage1 column to fused_moe_1stage_dict map and remove elif condition to select run_1stage=True

* add doweight_stage1 to query key

* modidy elif to select run_stage=True for tokens > 16

* apply black formatting

* removing csv and .co files as they will come in separate commit

* removing log logger.info(f[get_2stage_cfgs] run_1stage)

---------

Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>

* CI: Move some tests to MI355 due to the network issue of TW cluster (#1446)

* CI: Move some tests to MI355 due to the network issue of TW cluster

* Modify the GPU_ARCH of sglang tests

* CI: Move Triton tests from TW cluster to internal cluster (#1451)

* tune a8w8_blockscale&bpreshuffle for tencent (#1444)

* tune a8w8_blockscale&bpreshuffle for tencent

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update a8w8_bpreshuffle_tuned_gemm.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update a8w8_blockscale_bpreshuffle_tuned_gemm.csv

* update aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update a8w8_blockscale_tuned_gemm.csv

* updated a8w8_blockscale_tuned_gemm_ds_v3.csv&a8w8_bpreshuffle_tuned_gemm_dsv3.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update aiter/configs/model_configs/a8w8_bpreshuffle_tuned_gemm_dsv3.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

---------

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* [fix]: add ar switch (#1376)

* [fix]: add ar switch

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* [fix]: call ar naive

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* [fix]: add ar switch

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

---------

Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* CI: Test pypi connection (#1458)

* cktile weight preshuffle test and auto tuning for a8w8 (#1400)

* cktile bpreshuffle && tuning code

* refine code

* refine code

* refine code

* refine

* refine

* fix merge conflict

* fix conflict

* fix CI build fail

* refine code

* align aiter interface

* refine

* add get_padded_M

---------

Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: Ying Zhou <Ying.Zhou2@amd.com>

* fix_prebuiild_asm (#1439)

* mdf_bf16gemm

* update

* update f4gemm

* update

* update

* f4gemm bugs fix

* f4gemm fix2

* update

* update moe 2 stages

* update codegen

* update gemm_a8w8_asm

* update

* update

* update

* update

* update

* update

* update

---------

Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: Ying Zhou <Ying.Zhou2@amd.com>
Co-authored-by: amd-ruitang3 <Rui.Tang2@.com>

* fix merged tuned config error (#1460)

* fix merged tuned config error

* update

---------

Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* update triton  version check for pa_mqa_logits (#1440)

* update triton  version check for pa_mqa_logits

* Add some explanation for aot-check

* Support pa_mqa_logits aot load on triton>=3.5

* Support pa_mqa_logits aot load on triton>=3.5

---------

Co-authored-by: Feng Shijie <Shijie.Feng@amd.com>

* fix all_reduce_fake (#1465)

* CI: Use ausartifactory.amd.com in pip installation (#1469)

* update codegen (#1471)

* update codegen

* update

* update

* update

* fix

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: valarLip <340077269@qq.com>

* CI: Fix sglang CI test (#1473)

* CI: Add checkout retry in vLLM benchmark tests (#1476)

* CI: Move SGlang and Triton tests to MI300 runners (#1485)

* fix merging aiter config  (#1443)

* change to merge config when used

* fix lint

* fix error in GemmTuner

* fix lint

* fix error when runing deepseek

* fix lint error

* revert other format change

* fix gemm_op_a8w8.py

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* fix fmoe tune preshuffle error (#1430)

* set preshuffle=False default

* fix lint

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* Fix issue: Add nhead=128 support to bf16 and align restrictions (#1450)

* fix fwd_v3 output/lse is nan when kseq=0 and fix qseq >> kseq error (#1442)

* fix output/lse is nan when kseq=0

* fix gfx950 128_128 fwd_v3

* update the k_seq=0 error in MI300 and MI308

* tune a8w8_blockscale&bpreshuffle for tencent (#1444)

* tune a8w8_blockscale&bpreshuffle for tencent

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update a8w8_bpreshuffle_tuned_gemm.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update a8w8_blockscale_bpreshuffle_tuned_gemm.csv

* update aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update a8w8_blockscale_tuned_gemm.csv

* updated a8w8_blockscale_tuned_gemm_ds_v3.csv&a8w8_bpreshuffle_tuned_gemm_dsv3.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update aiter/configs/model_configs/a8w8_bpreshuffle_tuned_gemm_dsv3.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

---------

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update the smoke test

* update the smoke test

* fix MI300 and MI308 err

* fix qseq >> kseq error MI300 and MI308

* fix qseq >> kseq error in MI355

* fix the MI300 error

---------

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
Co-authored-by: Zhu Yuhua <yuhzhu@amd.com>
Co-authored-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* fix the build error of rtp (#1438)

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* Add 32x64 to tuned fmoe config (#1386)

* redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning

* add required keys to fused_moe_1stage_dict

* add kernel descriptors and code object files

* add 32x128 file descriptors and code objects for tuning

* move code objects and kernel descriptors to correct csv

* remove unnecessary import, add quant type argument

* move fused_moe_stage1_tkw1 into fused_moe.py

* remove unnecessary kernel code object files

* add kernel descriptor to tuned fmoe config, add kernel descriptors to related csvs and add related code objects

* create kernel descriptors and kernel co files with correct tags

* some fix for support gpt_oss (#1488)

* CI: Revert vllm_benchmark to use the latest nightly image (#1402)

* add sink_size parameter in mha_fwd and varlen_mha_fwd

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update ck api

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* Remove redundant assignment to sink_size

* Update mha_fwd_kernels.cu

* Update mha.py

* Add false argument to fmha_batch_prefill call

* update ck commit

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* Handle sink_size with conditional window_size length

* update fmha_api

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update ck commit

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update ck_commit

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update ck commmit

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* fix op test error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update mha.py

* Update csrc/include/torch/mha_fwd.h

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update csrc/py_itfs_ck/mha_fwd_kernels.cu

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update csrc/py_itfs_ck/mha_varlen_fwd_kernels.cu

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* add some comments

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update mha_fwd_generate.py

* Clarify sink_size parameter in asm_mha_varlen_fwd.cu

Updated the comment for the sink_size parameter.

* update ck commit

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

---------

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com>
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: la <46212055+junhaha666@users.noreply.github.com>
Co-authored-by: kliuae-amd <KuanFu.Liu@amd.com>
Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: azaidy <aliasger.zaidy@amd.com>
Co-authored-by: Shao-Chun Lee <Shao-Chun.Lee@amd.com>
Co-authored-by: Mehmet Cagri <mehmet.kaymak@amd.com>
Co-authored-by: Bruno Mazzotti <bruno.mazzotti@amd.com>
Co-authored-by: amd-ruitang3 <145657428+amd-ruitang3@users.noreply.github.com>
Co-authored-by: yzhou103 <Ying.Zhou2@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: minmengdie <memin@amd.com>
Co-authored-by: ZhangLirong <lirzhang@amd.com>
Co-authored-by: root <root@hjbog-srdc-39.amd.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: valechen <115046356+valechen@users.noreply.github.com>
Co-authored-by: ukannika <uma.kannikanti@amd.com>
Co-authored-by: root <root@smci355-ccs-aus-m06-29.cs-aus.dcgpu>
Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: who who who <fsx950223@outlook.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: baowendin <46412693+baowendin@users.noreply.github.com>
Co-authored-by: Lixun Zhang <Lixun.Zhang@amd.com>
Co-authored-by: Sami Remes <samremes@amd.com>
Co-authored-by: TennyWang1223 <Tenny.Wang@amd.com>
Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <hollowman@opensuse.org>
Co-authored-by: slippedJim <jim.guo@amd.com>
Co-authored-by: Jeff Huang <jiaji.huang73@gmail.com>
Co-authored-by: Jiming Ruan <Jiming.Ruan@amd.com>
Co-authored-by: zanzhang <zanzhang@amd.com>
Co-authored-by: Fang.Che <Fang.Che@amd.com>
Co-authored-by: Hui Zhou <zhou_hui@outlook.com>
Co-authored-by: Hui Zhou <huizhou@meta.com>
Co-authored-by: TJian <tunjian1996@gmail.com>
Co-authored-by: Satya Nikhil Kodukula <skodukul@amd.com>
Co-authored-by: huaiguxu <145733371+huaiguxu@users.noreply.github.com>
Co-authored-by: Oscar Xu <huuaiguxu@amd.com>
Co-authored-by: xudoyuan <xudoyuan@amd.com>
Co-authored-by: Farel Lukas <farlukas@amd.com>
Co-authored-by: Satya Nikhil Kodukula <nikhil.kodukula@gmail.com>
Co-authored-by: BingYuan.Zhou <BingYuan.Zhou@amd.com>
Co-authored-by: solin <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu>
Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: zhimding <zhimding@amd.com>
Co-authored-by: felix <felix.li@amd.com>
Co-authored-by: eky-amd <ethan.ky@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: gbyu-amd <Guanbao.Yu@amd.com>
Co-authored-by: guanbao <gyu@amd.com>
Co-authored-by: Ilya Panfilov <Ilya.Panfilov@amd.com>
Co-authored-by: Lukasz Burzawa <lukasz.burzawa@amd.com>
Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com>
Co-authored-by: Michael Melesse <micmelesse@gmail.com>
Co-authored-by: amd-ruitang3 <Rui.Tang2@amd.com>
Co-authored-by: luocheng25 <cheng.luo@amd.com>
Co-authored-by: Xiake Sun <xiake.sun@amd.com>
Co-authored-by: Lyu, Xudong <xudonlyu@amd.com>
Co-authored-by: Cu Cui <cu.cui@alumni.uni-heidelberg.de>
Co-authored-by: Drew Wadsworth <drew.wadsworth@gmail.com>
Co-authored-by: yinfengLiu <yinfeliu@amd.com>
Co-authored-by: Zhu Jiale <root@hjbog-srdc-52.amd.com>
Co-authored-by: Feng Shijie <Shijie.Feng@amd.com>
Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>
Co-authored-by: zufayu <zufa.yu@amd.com>
Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: Anton Saukkonen <63663359+antsaukk@users.noreply.github.com>
Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: amd-ruitang3 <Rui.Tang2@.com>
Co-authored-by: Zhu Yuhua <yuhzhu@amd.com>
zhuyuhua-v added a commit that referenced this pull request Dec 17, 2025
* add sink_size parameter in mha_fwd and varlen_mha_fwd

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update mha.py

* update mha_varlen_fwd_kernels

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* ca_refactor_fix (#1268)

* ca_refactor_fix

* more update

* fix_fp4_quant_dtype (#1271)

* add sample use outer exponential (#1267)

* update sample outer exp

* update mix sample use outer exponential

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* Fix rowwise a8w8 gemm in swizzled hipb_mm (#1258)

* fix rowwise

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* guard hipb_mm output type in unit test

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* correct tuning for rowwise

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* expose b_preshuffle for tuning

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* fix lint

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

---------

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com>

* CI: Use TW cluster to run sglang tests (#1273)

* Catchall PR for all 355_wip related changes (#1148)

* Catchall PR for all 355_wip related changes

Co-authored-by: Shao-Chun Lee <Shao-Chun.Lee@amd.com>
Co-authored-by:  Mehmet Cagri <mehmet.kaymak@amd.com>

* add triton fp4 gemm preshuffle (#1187)

* Remove AOT path

* [TRITON] Add Positional Encoding (PE) support to Triton MHA kernels

Related PR: #1184

Patch applied according to Cagri's request.

* Tune more fp4 gemm shapes

* Add aot compilation option for fp4 preshuffled gemm

* Add precompiled fp4 preshuffled gemm kernels

* Add readme

* re-gen AOT binary files for LL 70B FP4 GEMM and update kernels and API helper, fix fused_mul_add logger bug (#1242)

* DS 355_wip fused_shared_expert (#1218)

documentation, fix some bugs, UT

* fused_rms_fp8_group_quant num_warps tunning

* black formatting

* remove redundent files

* unified attn. reorg., fixes, exp2 update

* add MI300 config for fused_gemm_a8w8_blockscale_a16w16

* recover test_mha to upstream

* black formatting

* update test fused_kv_cache

* bypass some UTs for MI300

* update FP8 assert for test_mha_varlen

* skip UT

---------

Co-authored-by: Shao-Chun Lee <Shao-Chun.Lee@amd.com>
Co-authored-by: Mehmet Cagri <mehmet.kaymak@amd.com>
Co-authored-by: Bruno Mazzotti <bruno.mazzotti@amd.com>

* [MI35X] fix core check (#1276)

* fix core check

* update

* Refactor gemm bf16 tuner (#1275)

* refactor GemmTuner

* update

* fix lint error

* fix lint error

* CI: Operators tuning pipelines (#1163)

* CI: Operators tunning pipelines

* Updates

* Updates

* Updates

* Updates

* Show computing unints

* Updates

* Updates

* Add op_tune.sh

* Updates

* Disable a4w4

* Updates the error handling

* Updates the error handling

* Updates

* Updates

* Updates

* Updates

* Update .github/scripts/op_tune.sh

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Updates

* Add uloading tuned CSVs

* Updates

* Add shape name

* Add shape arg

* Allows users to select the shapes they want to tune and specify the arguments they need for tuning.

* Only be triggered when modify the untuned csv files under aiter configs foleder

* Test

* Updates

* Updates

* Update .github/workflows/operators-tuning.yaml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update csrc/ck_gemm_a8w8_blockscale_bpreshuffle/README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update a4w4_blockscale_untuned_gemm.csv

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix the problem that v3's performance is worse than ck's (#1237)

* fix fwd v3 kernel perf and opt err

* fix the python mha test error (#1277)

* Reuse custom decorator in core and torch guard (#1278)

* reuse custom decorator in core and torch guard

* remove useless

* rename all_reduce to avoid same name

* rebase

* fix bug

* use cpu device default

---------

Co-authored-by: root <root@hjbog-srdc-39.amd.com>

* integrate deep gemm (#1265)

* integrate m grouped gemm

* update ck

* add limit for 950

* rename deepgeem

* add a tuned config and insert entries in untuned config (#1243)

Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* Enable large batch size and optimization of non-Ragged batching (#1269)

* Enable large batch size and optimization of non-Ragged batching

* Add RAGGED_BATCH to test_la.py and bench_la.py

* add few more fw ds f4 untuned and tuned shapes for using asm kernel (#1298)

* CI: Optimize autotuning pipeline and inital the docs (#1286)

* CI: Optimize autotuning pipeline and inital the docs

* topk per row kernel (#1262)

* initial commit for topk per row kernel

* topk per row kernel initial commit

* Fix the typo issue

* Add the topk per row kernel

* optimizations for topk_per_row kernel

* fix overflow

* add unit test for topk_per_row_decode

* update test for decode

* apply vector dispatch from carlus

---------

Co-authored-by: root <root@smci355-ccs-aus-m06-29.cs-aus.dcgpu>
Co-authored-by: valarLip <340077269@qq.com>

* fix aot (#1279)

* fix aot

* remove other kernels path

* fix aot

* format code

---------

Co-authored-by: root <root@hjbog-srdc-24.amd.com>

* Fix ATOM fp8 model quant fail issue in torch compile (#1299)

* Fix fp8 issue in torch compile

* use less code

* feat - pa_fwd support block map with stride in num_kv_heads_dim (#1301)

* Fix how to update accumulator for dot_scaled (#1297)

* CI: Optimize autotuning pipeline docs (#1300)

* CI: Optimize autotuning pipeline docs

* Update docs/autotuning_pipeline.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix the lint issue (#1307)

* fix fwd perf calc error (#1305)

* fix fwd perf calc error

* black aiter/ops/triton/_triton_kernels/gemm_afp4wfp4.py

* add the asm kernel performance of fwd and bwd (#1270)

* add the asm kernel performance of the attention forwards and attention backwards

* modify perf data

* fix perf data

* add a16 perf data

* Fused TopK and Sigmoid kernel (#1251)

* Add topk softmax

* Add test for topk sigmoid

* register the op properly

* apply black

* don't use constexpr with std::string

* bump ck to include topk sigmoid commit

* hipify

* add argparse to the topk sigmoid test, also add pytest

* use own module instead of asm moe

* black formatting

* add missing file

* revert changes to module_moe_asm

* Ar rms (#1290)

* [fea]: add fused allreduce rmsnorm kernel

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* fix: fuse ar rms interface

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* fix: fuse ar rms interface

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* fix ck branch

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* update ar interface

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

---------

Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>

* Dsv32 cache (#1314)

* add indexer_k_quant_and_cache & cp_gather_indexer_k_quant_cache

* ndexer_k_quant_and_cache opt kernel and add test

* update

* update2

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* Fix displaying supported architectures (#1316)

Currently it will look like this:
```log
File "TransformerEngine/3rdparty/aiter/aiter/jit/utils/chip_info.py", line 77, in get_gfx_custom_op_core
  raise KeyError(
KeyError: 'Unknown GPU architecture: . Supported architectures: [0, 1, 2, 3, 4, 5, 6, 7, 8]'
```

Signed-off-by: Hollow Man <hollowman@opensuse.org>

* using standalone pybind (#1317)

* using standalone pybind

* fix

* update

* Enable mha bwd hd192_hd128 (#1308)

* update codegen.py

* update kernels & kernel launch

* fix fa bwd dq_acc shape

* remove mask in python api

* CI: Add pre-check status check (#1252)

Creates a unified pre-checks.yaml workflow that runs Black, Ruff, and dependency checks, uploading success/failure signal artifacts.
The other heavy jobs download and verify these signal artifacts: if verification succeeds, the heavy jobs continue running; if it fails, they exit immediately.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [CK_TILE] fmha: Add backward pass support for padded inputs (#1212)

* [CK_TILE] fmha: Add backward pass support for padded inputs

Introduces support for padded sequence lengths in the backward pass of the variable-length flash attention (fmha_v3_varlen_bwd).
- Updated Python and C++ function signatures to accept optional `cu_seqlens_q_padded` and `cu_seqlens_k_padded` arguments (see the sketch after this list).
- Modified the underlying CUDA kernels and code generation scripts to pass padding information via the new `seqlen_q_ptr` and `seqlen_k_ptr` fields in the CK `fmha_bwd_args` struct.
- Modified the underlying kernels and code generation scripts to correctly handle pointers for both padded and unpadded sequence data.
- Added comprehensive gradient verification to the test suite (`test_mha_varlen.py`) to ensure the correctness of the backward pass with various padding scenarios.
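
A small illustration (my own sketch, not the aiter API) of what the optional padded cumulative sequence-length tensors describe relative to the unpadded ones; the padding granularity and lengths below are assumptions:

```python
# Hypothetical example: unpadded vs. padded cu_seqlens for three sequences.
import torch

seqlens = torch.tensor([5, 3, 7])                      # actual lengths (made up)
pad_to = 8                                             # assumed padding granularity
padded = ((seqlens + pad_to - 1) // pad_to) * pad_to   # lengths rounded up to pad_to

cu_seqlens = torch.cat(
    [torch.zeros(1, dtype=torch.int32), seqlens.cumsum(0).to(torch.int32)]
)
cu_seqlens_padded = torch.cat(
    [torch.zeros(1, dtype=torch.int32), padded.cumsum(0).to(torch.int32)]
)
print(cu_seqlens)         # tensor([0, 5, 8, 15], dtype=torch.int32)
print(cu_seqlens_padded)  # tensor([0, 8, 16, 24], dtype=torch.int32)
```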

* [CK_TILE] fmha: Adapt to composable_kernel padding API changes

Refactor the FMHA forward and backward pass to align with the updated padding API in `composable_kernel`.

- Argument Simplification: Removed the manual calculation of `seqlen_q` and `seqlen_k` from `cu_seqlens` in the `mha.cu` interface. The underlying kernels now handle this logic.
- API Alignment: Updated the arguments passed to `aiter::mha_fwd` and `aiter::mha_bwd` to match the new `composable_kernel` API. This involves passing `cu_seqlen` pointers directly.
- Kernel Interface Update: Modified the `codegen.py` scripts for `gfx942` and `gfx950` to reflect the changes in the kernel's function signatures and argument handling for padded and unpadded sequence lengths.

* fix build error in op_tests/cpp/mha/benchmark_mha_*.cpp

* Mla splitkv enhance split alg inte (#1233)

* add num_kv_splits_indptr to mla for mtp<=4 case for now

* update

* update new kernel

* infrastructures

* 1st version of split kernel

* Fix issues raised by Lingpeng and fix the issue on batch_size

* update mla

* update mla_stage2

* 1st draft of v1 split program

* add kv_offset

* mla_splitkv_enhance_split_alg_inte

* splitkv debug

* 1st version of reduce kernel

* metadata & kernel finish

* add reduce

* final_lse is optional now.

* update kernel

* bug fix

* bug fix 1

* modify reduce api

* update kernel

* fix max splits

* bug fix 3

* fix s80 early return

* update calculation of partial_indx

* add per split test

* make lse support by ref

* test split

* fix redundant calculation of head offset in reduce kernel

* add custom test

* Add support of 128 head size

Fix how to get head count

fff

* update comments

* 1. Let large work be assigned first (a small scheduling sketch follows below).
2. Add tolerance to tiles that are slightly smaller than kv_limit.
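
A hedged illustration of the "assign large work first" idea above (a classic longest-processing-time greedy); the costs and cluster count are made up, and this is not the actual metadata kernel logic:

```python
# Greedy scheduling sketch: sort batches by cost (descending) and always hand
# the next batch to the currently least-loaded cluster.
import heapq

batch_costs = [70, 10, 45, 30, 90, 25]   # per-batch kv workload (assumed)
num_clusters = 3

clusters = [(0, cid) for cid in range(num_clusters)]   # (accumulated_cost, cluster_id)
heapq.heapify(clusters)
assignment = {}
for batch_id in sorted(range(len(batch_costs)),
                       key=lambda b: batch_costs[b], reverse=True):
    load, cid = heapq.heappop(clusters)
    assignment[batch_id] = cid
    heapq.heappush(clusters, (load + batch_costs[batch_id], cid))

print(assignment)   # batch index -> cluster id
```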

* Calculate kv_limit dynamically

* Fix bug about difference in split_kv(bool)

* add test

* fix seed

* Add global tolerance 16 in kv seqlen because main kernel cannot handle small splits (kv_seqlen<4) well.

* Fix warp=1 error

* Add redundant mode to make the size of the metadata output fixed; add new param: no_redundant. Reduce can support redundant input in reduce_indptr as well.

fix comm

* fp8 setup

* first version of device metadata

aaa

* Add work_ptrs

* Compatibility to CUDA Graph

* Refactor code. Merge 2 iterations of generate work together.

* Make sure that each batch of workload can never be split into more than #cluster tiles.

* Adjust metadata. Get 1% perf gain.

* Parallelize most of the metadata kernel

Make get_cost_top() parallelized.

aaa

* add scale

* 1. Use warp-level bitonic sort to sort batch idx based on their cost for reducing #splits. 2. Use CK's warp ops.

* fp8 function pass

* Fix issues:
1. avg_workload cannot handle any batch!
2. split_kv(bool) is not correct when all the clusters are full.

* fp8 ready

* fix

* persistent ready

* add nv acc test

* rename

* update metashape

* update reduce cu num

* update optest for mla

* fix cu num

* Update metadata and reduce kernels.

* rename kernels

* Add new param kv_granularity to metadata kernel.

* Introduce cal_workload_limit_global_v2

* Support qhead=128 cases.

* Change get_mla_metadata() api. Make some less important parameters optional through a dict.

* fix potential problem on calculating tot_qo_tiles

typo

* refactor metadata files

aaa

* update metadata v1_2

* update gqa_128 mla_ps & fix metadata v1_2

* Optimize mla metadata v1.2

* Optimize mla metadata v1.2 Part.2

* Optimize mla metadata v1.2 Part.3

* update qlen <=4

* fix mla qlen1

* Optimize mla metadata v1.2 Part.4

* Make reduce_final_map be optional in mla_reduce_v1

* Slightly increase reduce perf

* Add persistent mode for mla reduce kernel

* add mla_a16w8_qh16_m16x4_n16x1_coex0_mask1_ps.co

* update deepseekv32 sparse mla metadata

* update mla_a16w8_qh16_m16x4_n16x1_coex0_mask1_ps.co

* Adjust code for sparse attn

* Optimize the a16w8 kernel

* Improve metadata v1.1 perf

* Make metadata v1.1 support sparse attn

bug fix

tiny fix

* Remove redundant code in mla_reduce

* futile struggle

* Fix issue after merge. The aiter main branch uses torch.library.infer_schema, which doesn't allow a dict as a parameter. Thus, change the API for metadata.

* Adjust metadata v1.1 and make this branch ready to be merged into the main branch.

* remove invalid co kernel

* Fix issue brought from f794ae4 which disabled hipify by default.

* support qolen>1 for sparse mla

* make code become prettier

* Fix issue in metadata v1.1

* Fix issue in test_mla.py

* Fix lint fails

* Fix sub-test fails in op_test/test_mla.py

* Fix regression in test_mla.py where mtp>1

* Add head_dim=128 support to reduce

* Add nhead=8 for pa and add assert to make sure the input tensors are in
float32.

* fix issue in vllm benchmark for deepseek: remove metadata v0 because it's not compatible with hip graph

* fix lint

* Revert all the change about mi350 gemm.

* add a8w8 and a16w8 kernel in mla mi350

* add A8W8 Non-persistent mode kernel

* Fix issue reported by Copilot

* add mla non-persistent test

* script: update a16w8 kernel

* rm test_mla_persistent_mi350.py and support mi350 in test_mla_persistent.py

* add mla_a16w16_qh16_m16x4_n16x1_coex0_mask1_ps.co

* fix a8w8 num_kv_split=1

* Fix issue in metadata v1.2 on qo_tiles > 1

* fix ut bandwidth

* Use nhead=16 to simulate cases where nhead=16*N and N is in range(32,16*32+1,16)

aaa

Fix regression in sparse attn from the fix in metadata v1.2 for multi qo tile issue

* Add new api get_mla_metadata_info

* fix lint format issues

* Adjust get_mla_metadata_info_v1's parameters.

* update A16W8 kernel

* update A16W8 kernel2

* update A16W8 for mi300

* fix ut and rename some kernels

* rename mla kernel name for head 128

* remove log

* fix format

* add natively back

* change zeros into empty

* fix with comments

---------

Co-authored-by: Jiming Ruan <Jiming.Ruan@amd.com>
Co-authored-by: zanzhang <zanzhang@amd.com>
Co-authored-by: Fang.Che <Fang.Che@amd.com>
Co-authored-by: minmengdie <memin@amd.com>

* Fix gemm tuner error mi350 (#1313)

* workaround: retry tuning when encountering an invalid pointer

* workaround: retry tuning when encountering an invalid pointer

* fix  lint error

* Update gemm_tuner.py

em timeout

* CI: Skip triton setup in Aiter standard/multigpu tests and add retries when setting up triton (#1325)

* CI: Skip triton in Aiter standard and multigpu tests

* Add retries when building triton

* Add ninja installation

* Fix global variable torch_fp8 initialization caused issue (#1322)

`hipGetDeviceProperties` is called during the `torch_fp8` initialization, which triggers full HIP runtime initialization while global variables are still being initialized. There are two issues:

- Several other global variables are involved in the runtime initialization, and the initialization order of global variables is not guaranteed, so the runtime initialization may read global variables that are not yet initialized.

- A forked child process needs to initialize its own HIP runtime to get a proper GPU driver kernel context and handles. But since the runtime was already initialized globally in the parent process, the forked process considers it initialized and uses it directly, even though it is actually invalid in the child.

The fix is to ensure `hipGetDeviceProperties` is only called when actually needed, not during static initialization.

To repro the issue (a minimal sketch follows below):
1. fork a child process
2. call torch.empty in the child process

Before the fix, this fails with a `hipErrorInvalidValue` error.
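
Below is a minimal, hypothetical repro sketch of the steps above (not code from this repo); it assumes the parent process has already triggered HIP initialization, e.g. by touching the GPU once before forking:

```python
# Hypothetical repro sketch: the parent initializes the HIP runtime, then a
# forked child reuses the parent's "already initialized" state. On the broken
# build the child's allocation fails with hipErrorInvalidValue.
import os
import torch

torch.empty(1, device="cuda")      # parent touches the GPU -> HIP runtime init

pid = os.fork()
if pid == 0:
    # The child should set up its own HIP runtime, but the stale global state
    # makes it skip that, so this call errors out before the fix.
    torch.empty(1, device="cuda")
    os._exit(0)
os.waitpid(pid, 0)
```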

Co-authored-by: Hui Zhou <huizhou@meta.com>

* Add transpose scale to the triton fused_rms_fp8_group_quant (#1291)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

* [Triton] 355 wip Llama FP4 triton fusion + TP8 triton decode shape tunning (#1315)

* update AOT, always pad x_scale when generating input, add UT

* update act_mul_mxfp4_quant, fused_rms_mxfp4_quant

* add LL FP4 configs and AOT files for TP8 shapes

* fix UT bug

* add LL TP2 and TP4 shapes

* [TRITON] Kernel naming: add reusable constexpr repr helper (#1260)

* Kernel naming: add reusable constexpr repr helper for gemm a16w16

* add missing params to the repr

* Merge tuned file (#1327)

* merge tuned_file with same prefix

* fix lint

* rename tuned_gemm.csv to bf16_tuned_gemm.csv to avoid matching wrong file

* update

* update README.md of bf16 GemmTuner

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* fix graph_breaks by return tensor for bool op (#1333)

* fix_bf16gemm_asm (#1329)

* fix_bf16gemm_asm

* update

* update

* Improve Memory Usage in MLA (#1338)

* Improve mla memory

* further reduce memory usage

* Fix lint issues

* Fix issue reported by Jun Chen.

* fix tune error caused by merge tuned_file (#1342)

* fix tune error caused by merge tuned_file

* fix lint error, rm some log

* rm rocblas in tuner (#1337)

* [Triton] DS a16w8 GEMM and fused reduce_rms_fp8_group_quant (#1328)

* add gemm_a16w8_blockscale and fused_reduce_rms_fp8_group_quant

* black formatting

* add MI300 config

* fix commit

* Add block_m=16 for a8w8_ck_moe_blockscale (#1081)

* Add block_m=16 for a8w8_ck_moe_blockscale

* fix moe_blk_scale token<64 to ck2stage for ds shape

* fp8 moe bugfix tuning rebase tuned_fmoe.csv

* add L2 check

* reformat

---------

Co-authored-by: Oscar Xu <huuaiguxu@amd.com>
Co-authored-by: xudoyuan <xudoyuan@amd.com>

* Add Fused RMSNorm + FP8 Per-tensor Static Quantization Triton Kernel (#1330)

* Fused Triton RMSNorm and FP8 static quantization

* Formatted python scripts

* [TRITON] GEMM kernels nomenclature changes (#1283)

* Kernel naming: add reusable constexpr repr helper (#1260)

* Add missing API documentation

* Temporarily run aiter standard and multigpu tests on the TW cluster, will switch back once the mirror registry is ready. (#1359)

* [Triton] Disable failing lean attention tests (#1357)

* add config (#1355)

* add how_v3_bf16_cvt control to the Python API (#1351)

* add how_v3_bf16_cvt in fwd_v3

* fix the fwd compile

* [fix]: car 6 rank coredump (#1335)

* [fix]: car 6 rank coredump

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* [fix]: add residual out in ar rms

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

---------

Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>

* Wrapper_flash_attn_backward custom op to avoid functionalize fallback and fix guard logic (#1348)

* make can_mha_v3 uninplace and fix guard

* fix error

* wrapper _flash_attn_backward in custom

* make gemm a8w8/a4w4 into custom

* add some op in custom

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* [TRITON] GEMM kernels nomenclature changes (#1292)

* implemented the use of kernel repr helper to standardize kernel metadata representation

* Add Missing API documentation

* remove commented code

* remove FILL_VALUE to keep kernels name meaningful

* [TRITON] Initial implementations of sparse attention kernels (#1296)

fp8_mqa_logits: Calculate the logits (prefill stage) to be used for topk
unified_attention_sparse_mla: Sparse attention implementation for the deepseek like MLA using the MHA approach where kv cache is [seq_len_kv, 1, HEAD_SIZE + kv_lora_rank + rope_rank] and q is [seq_len, NUM_HEADS, kv_lora_rank + rope_rank]
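
The layout described above, written out with hypothetical sizes (all dimension values below are assumptions, not the kernel's defaults):

```python
# Shape sketch for the sparse MLA kernels described above.
import torch

seq_len, seq_len_kv = 4, 1024                       # query / kv lengths (assumed)
num_heads = 16                                      # NUM_HEADS (assumed)
head_size, kv_lora_rank, rope_rank = 128, 512, 64   # assumed dimensions

kv_cache = torch.empty(
    seq_len_kv, 1, head_size + kv_lora_rank + rope_rank, dtype=torch.bfloat16
)
q = torch.empty(seq_len, num_heads, kv_lora_rank + rope_rank, dtype=torch.bfloat16)
print(kv_cache.shape, q.shape)
```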

* [MI35X]cktile moe a16w4 support (#1341)

* draft of cktile moe

* align the interface of main branch to make cktile moe compile pass

* refine code

* refine ck moe

* fix CI build fail about code style

* remove ck blockscale moe modification

* refine code

* fix CI build fail of unsupport block_m=16

* refine format

* fix conflict

* update

* format

* fix format

* update

* update

* update

* format

* format

* remove useless

* fix sorting

---------

Co-authored-by: solin <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu>
Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: Oscar Xu <huuaiguxu@amd.com>
Co-authored-by: zhimding <zhimding@amd.com>
Co-authored-by: felix <felix.li@amd.com>

* [TRITON] Batched GEMM kernels nomenclature changes (#1293)

* implemented the use of kernel repr helper to standardize kernel metadata representation

- batched_gemm_bf16.py
- batched_gemm_a8w8.py
- batched_gemm_afp4wfp4.py (main + reduce kernel)
- batched_gemm_afp4wfp4_pre_quant.py
- batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant.py

* Add Missing API documentation

* remove Dtype to avoid having invalid names in the repr

* [TRITON] Instruction shape fix for Gluon gemm_a8w8_blockscale kernel (#1261)

* fixed instr_shape error for mfma layout

* removed test skips for uneven K

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* moe mxfp4 block_m = 64/128 (#1266)

* moe mxfp4 block_m = 64/128

* update a4w4_gemm2_kernels_list

* add instance tile_m=32

* tuned configuration

* Update test_moe_2stage.py

* refactor

* update v1 pipeline

* update badcase

* fix fp4 moe tuner

* reformat

* revert ck update

* update ck

* Moe mxfp4 ck preshf bns (#1312)

* python code of nbs compatible

* bns compatible

* fix global

* bug fix

---------

Co-authored-by: zhimding <zhimding@amd.com>

* add AITER_MXFP4_MOE_SF switch for mxfp4 moe

* v3 n128

* 32x32 v1

* resolve ck conflict

* rm use_int4=True

* reformatted op_tests/test_moe_2stage.py

* AITER_MXFP4_MOE_SF bugfix

* revert torch.int4

---------

Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: zhimding <zhimding@amd.com>
Co-authored-by: felix <felix.li@amd.com>

* bug fix (#1370)

* [opus] enhance opus utility (#1324)

* enhance tuple and add waitcnt

* refactor vectorized issue space

* wip cached layout

* support cached layout

* add smem

* support broadcast dtype in store()

* Fix issue in metadata v1.2 where batch size is too large (#1352)

* Fix issue in metadata v1.2 where batch size is too large. V1.1 is hopeless in these cases...

* lds_partial_info is not used when there are no further tile splits on qo.

* [GEMM][Config] add a8w8 block scale tuned config for deepseek-v3 (#1310)

* add a8w8 gemm tuned config with block scale for deepseek-v3 shapes

* reorganize the config files

* revert unnecessnary changes

* add headers

* move ds configs to specific model config file

---------

Co-authored-by: guanbao <gyu@amd.com>

* support all logit values (#1323)

* support all logit values

* fix tests

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* CI: Skip triton in Aiter standard and multigpu tests (#1374)

Triton tests will only run when changing files under:
 - aiter/ops/triton
 - op_tests/triton_tests

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* add the performance data bar chart in the readme (#1372)

* add the performance data bar chart in the readme

* fix the wrong remarks

* force ds ptpc moe use 1 stage moe (#1373)

* [TRITON] MHA PA optimizations (#1245)

Optimizations:
- New config
- Preloading v
- Reordering the pe matmul
- Simplifying scaling ops (for sm_scale and log_2(e)-related scaling; see the sketch below)
- Masking-related changes
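
A minimal numerical sketch (not the kernel code) of the scaling simplification: fold log2(e) into sm_scale once so the inner loop can use exp2 instead of exp; head_dim and the inputs here are made up:

```python
import math
import torch

scores = torch.randn(4, 8)            # q @ k^T for one tile (made-up values)
sm_scale = 1.0 / math.sqrt(128)       # 1/sqrt(head_dim), head_dim assumed = 128

ref = torch.exp(scores * sm_scale)            # straightforward exp path
fused_scale = sm_scale * math.log2(math.e)    # fold log2(e) into the scale once
opt = torch.exp2(scores * fused_scale)        # exp2 path, typically cheaper on GPU

assert torch.allclose(ref, opt, atol=1e-6)
```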

* Enable fa multi target build on other arch (#1318)

* enable fa multi target build on other arch

* update arch info when dispatch in python api

* update

* Support mixed V2/V3 arches

* format

---------

Co-authored-by: Ilya Panfilov <Ilya.Panfilov@amd.com>

* [Triton] DS FP4 triton fusion (#1371)

* add fused_gemm_afp4wfp4_a16w16.py

* fix bug

* add fused_reduce_act_mul_and_mxfp4_quant

* add gemm_a16wfp4.py

* fix

* fix

* fix

* clean up

* repr

* update AOT with repr

* fix bug

* add dummy heuristics

* add output_unquantized_inp1 to fused_rms_mxfp4_quant

* add configs

* fix bug, tune fused_reduce_act_mul_and_mxfp4_quant

* fix

* fix

* final clean up

* add batched_gemm_a16wfp4

* clean up

* add config

* add default config

* remove old kernels, add API redirection and deprecation warning

* add fused_gemm_afp4wfp4_mul_add

* [TRITON] Simplify and optimize triton_kernels moe code and move it into aiter (#1326)

First PR for MoE with optimized support for GPTOSS shapes and fp8 x fp4.

* Use torch.zeros_like instead of empty_like to prevent accuracy drop (#1387)
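
A hedged illustration of why this matters: `torch.empty_like` returns uninitialized memory, so any untouched region contains garbage, whereas `torch.zeros_like` is deterministic (this is a generic sketch, not the code path fixed in the PR):

```python
import torch

src = torch.randn(4, 4)
buf_fast = torch.empty_like(src)   # contents undefined until fully written
buf_safe = torch.zeros_like(src)   # contents are all zeros

# If a kernel only writes part of the buffer, the rest is garbage vs. zeros.
buf_fast[:2] = src[:2]
buf_safe[:2] = src[:2]
print(buf_safe[2:])                # deterministic zeros; buf_fast[2:] is undefined
```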

* CI: Temporarily using old vllm nightly image (#1389)

* Revert "[Triton] DS FP4 triton fusion (#1371)" (#1392)

This reverts commit 4aabf79.

* add a8w8 ptpc gemm config for dsv3 (#1382)

* add ninja to install_requires in setup.py, fix ck gemm a8w8 bshuffle heuristic dispatch not supporting mnk=(2048,2112,7168)

* add 085 ptpc gemm tune config

* rename csv

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* Test the CI on both MI325 and MI355 (#1364)

* Always run tests on mi355

* [Triton] change BF16 GEMM config filename (#1398)

* Support distributed_init_method and DP in init_distributed (#1353)

* support distributed_init_method in init_distributed

* support dp in aiter distribute

* fix only tp error

* FA V3(fp8) and paged Attention compressed (CI green) (#1065)

FA V3(fp8) and paged Attention compressed

FA V3(fp8) and paged Attention

FP8 Prefill work compressed

Fa V3 api

Compress fp8 work so far

pull cast out of torch function

e2e fp8 stub

emulate fa v3

ignore

remove example

clean up forward

save

fp8 backward

ignore train artifacts

just use return_attn_probs

match fa behavior

save fa ref

add fa_ref

fix dropout bug

add link

optional fp8 p descale

rename to v3

fa v3

clean up

match backward

min diff

update varlen api

clean up FP8_P_DESCALE

update bench and test

lint

fix mha varlen bug

remove .gitignore

save

lint

remove skip

bring back skips

add fa module

update v2 interface

create mha_v3

add old v3 path

update fa module

tests passing

sync bwd changes

lint fa module

add kvcache api and test

fix lint

fp8 works

test fp8 only

add paged tests

add flash_attn_with_kvcache to v2

test varlen

move to _triton_kernels

test_mha_backward working with v3

upgrade to cleanedup modeule

get test_mha_backward_varlen working

clean up

fix lint bug

move casting functions to utils

fix lint

Update aiter/ops/triton/utils/mha_kernel_utils.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update aiter/ops/triton/utils/mha_kernel_utils.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

use Optional

update from main_perf

lint

update fp8 backward

lint

descale factor is fp32

lint

dequant backward

match dims

Sync with FA main_perf

dequant in backward

pass descale to kernel directly

better kernel naming

simple fp8 path. no transpose

clean up bwd

save

bring back split

min diff

pass descale to bwd

lint

fix bwd nans

FP8_AUTO_DESCALE

use hk for other backwards

fp8 wrapper

lint

rm matrix_instr_nonkdim

split v2 and v3 cleanly

lint

back to og

minimal change

test_mha passes

test green

* is_shuffled (#1377)

* is_shuffled

* shuffle_weight bugfix

* rm AITER_MXFP4_MOE_SF

* preshuffle bugfix

* refactor

* refactor bugfix

* add bns/preshuffle moe mxfp4 UT tests

* add L2 verification

* black op_tests/test_moe_2stage.py

---------

Co-authored-by: zhimding <zhimding@amd.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* Ar rms new interface (#1401)

* [fix]: fused_ar_rms interface

Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com>

* delete comment

Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com>

* change ut case

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* fix ut format err

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* [fix]: ar acc err

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

---------

Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com>
Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: amd-ruitang3 <Rui.Tang2@amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>

* minor fix for mi355 (#1408)

* [Fix] Add sliding window feature for paged_attention_v1 (#1362)

* add sliding window to paged_attention_v1

* Avoid using fmha_v3_varlen_fwd on unsupported architecture gfx90a

* make  `sliding_window` default to 0 for better compatibility

* fix possible used compilation problem

* fix ci failure

* fix ci failure

* add a single test to avoid increasing test time a lot

---------

Co-authored-by: Xiake Sun <xiake.sun@amd.com>

* fused_qk_rope_cat_and_cache_mla: Fix Triton compilation error and batch size constraint and output tensor sizing (#1407)

* Fix Triton compilation error by nesting OUTPUT_Q_NOPE_ZEROS condition

* Correct batch size constraint and output tensor sizing

* max mla splits perbatch (#1390)

* fix issues

* add limit for split num per batch

* fix non-ps num kv split

* fix issue for big batch size

* fix logits alloc

* fix black code stype

* fix ut

* update git ignore& remove aiter/install_mode

* update qh16 qseqlen4 kernel

* update

---------

Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: Fang.Che <Fang.Che@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* topk_per_row_opt (#1394)

* topk_per_row_opt

* Optimize topk by integrating air-topk kernel (TODO: proper workspace size computation and init) (#1409)

* update

* slightly optmize

* add back

---------

Co-authored-by: Cu Cui <cu.cui@alumni.uni-heidelberg.de>
Co-authored-by: carlushuang <carlus.huang@amd.com>

* Fix fused_rms_mxfp4_quant comment (#1369)

* leanAttn softmax fix for spurious data mismatch test failures (#1396)

* leanAttn softmax fix for spurious data mismatch test failures

* black fix

* Remove unused parameters per PR review request

* Black fix

* Add reduce_scatter api (#1413)

* add reduce_scatter api

* add reduce_scatter api

* fix error in fmoe_tuner (#1405)

* fix error in fmoe_tuner

* fix error when tuning QuantType.per_1x32

* rm redundant code

* optimize thread divergence (#1421)

* [TRITON] complex number multiplication that supports 3D ROPE triton kernel (#1061)

* complex number multiplication that supports 3D ROPE triton kernel

* Merge remote-tracking branch 'origin/feature/rope3d-fix' and resolve conflicts

* conflicts resolved

* conflicts resolved

* moved code places

* fix typo

* fix typo

---------

Co-authored-by: Zhu Jiale <root@hjbog-srdc-52.amd.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* Feat: pa_mqa_logits performance optimization & support kv_preshuffle + blocksize16/64 (#1424)

* Update gluon_pa_mqa_logits using preshuffle

* Minor update

* Add finer pipeline granularity

* Add sched_group_barrier optimization

* Minor update

* Support blocksize64 for preshuffle pa_mqa_logits

* Support logits JIT on triton 3.5

* Improve splitkv strategy

* Eliminate redundant condition check

* Add missing oob check

* Resolve reviews

* [Config] add tuned moe and gemm config for qwen3 235b (#1378)

* add moe tuned config

* add gemm tuned config

* move tuned moe config to model specific file

---------

Co-authored-by: guanbao <gyu@amd.com>

* fix repeated unnecessary device check (#1221)

* remove device check

* 8 devices

* more

* more

---------

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>

* remove lru func in fake (#1429)

* Temporarily disable the test on mi355 (#1437)

* Enable MI355 test on main branch

* CI: Aiter tests bug fix

* [M308] tune silu&act (#1404)

* Vectorized loads and stores combined with the packed multiply path

* format code II

---------

Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* add deepseek ep moe tune config (#1431)

* add ptpc deepseek ep moe tuned config

* add block deepseek ep moe tune config

* using 1stage moe for ptpc deepseek

* [TRITON] Moe a8w4 tuning (#1410)

* fuse routing kernels for small batches

* tune batch=1024

* [TRITON]  Apply config-aware naming (kernel_repr) to attention kernels (#1295)

* Apply kernel_repr to attention kernels

Applied make_kernel_repr helper to 4 attention kernel files:
- pa_decode.py (6 kernels)
- pa_prefill.py (2 kernels)
- chunked_pa_prefill.py (1 kernel)
- mla_decode_rope.py (2 kernels)

Each kernel now has config-aware naming with constexpr parameters
included in the kernel metadata name.

Base: amd/satya/kernel_config_to_name
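
A generic, hypothetical sketch of the config-aware naming idea (this is not aiter's actual make_kernel_repr helper or its signature):

```python
# Build a display name for a kernel instance from its constexpr parameters so
# profiles and traces show which configuration actually ran.
def kernel_repr(base_name: str, constexprs: dict) -> str:
    parts = [f"{k}={v}" for k, v in sorted(constexprs.items())]
    return f"{base_name}[{'_'.join(parts)}]"

print(kernel_repr("pa_decode", {"BLOCK_M": 32, "BLOCK_N": 64, "NUM_STAGES": 2}))
# -> pa_decode[BLOCK_M=32_BLOCK_N=64_NUM_STAGES=2]
```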

* Apply kernel_repr to attention kernels

* fix indentation error and add kernel_repr to a missed kernel

* Add descriptions to missing API descriptions

* remove unused imports

* fix runtime error

* revert lean atten to main

* lean atten repr and API desc

* formatting fix

* Update aiter/ops/triton/pod_attention.py

* [fix]: prebuild gen so (#1412)

Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* [TRITON] FP8 MQA optimizations (#1422)

FP8 MQA optimizations AND bench. script

* redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning (#1366)

* redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning

* add required keys to fused_moe_1stage_dict

* add kernel descriptors and code object files

* add 32x128 file descriptors and code objects for tuning

* move code objects and kernel descriptors to correct csv

* remove unnecessary import, add quant type argument

* move fused_moe_stage1_tkw1 into fused_moe.py

* remove unnecessary kernel code object files

* Add missing comma

* saved modified tuned fmoe config for testing purposes

* apply black required formatting

* remove fused_moe_stage1_tkw1 and place aiter.fmoe_g1u1_tkw1 under fused_moe_1stage

* remove unnecesary arguments

* apply black formatting

* simplify aiter.fmoe_g1u1_tkw1 call

* add doweight_stage1 column to fused_moe_1stage_dict map and remove elif condition to select run_1stage=True

* add doweight_stage1 to query key

* modify elif to select run_stage=True for tokens > 16

* apply black formatting

* removing csv and .co files as they will come in separate commit

* removing log logger.info(f[get_2stage_cfgs] run_1stage)

---------

Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>

* CI: Move some tests to MI355 due to the network issue of TW cluster (#1446)

* CI: Move some tests to MI355 due to the network issue of TW cluster

* Modify the GPU_ARCH of sglang tests

* CI: Move Triton tests from TW cluster to internal cluster (#1451)

* tune a8w8_blockscale&bpreshuffle for tencent (#1444)

* tune a8w8_blockscale&bpreshuffle for tencent

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update a8w8_bpreshuffle_tuned_gemm.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update a8w8_blockscale_bpreshuffle_tuned_gemm.csv

* update aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update a8w8_blockscale_tuned_gemm.csv

* updated a8w8_blockscale_tuned_gemm_ds_v3.csv&a8w8_bpreshuffle_tuned_gemm_dsv3.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update aiter/configs/model_configs/a8w8_bpreshuffle_tuned_gemm_dsv3.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

---------

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* [fix]: add ar switch (#1376)

* [fix]: add ar switch

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* [fix]: call ar naive

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* [fix]: add ar switch

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

---------

Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* CI: Test pypi connection (#1458)

* cktile weight preshuffle test and auto tuning for a8w8 (#1400)

* cktile bpreshuffle && tuning code

* refine code

* refine code

* refine code

* refine

* refine

* fix merge conflict

* fix conflict

* fix CI build fail

* refine code

* align aiter interface

* refine

* add get_padded_M

---------

Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: Ying Zhou <Ying.Zhou2@amd.com>

* fix_prebuiild_asm (#1439)

* mdf_bf16gemm

* update

* update f4gemm

* update

* update

* f4gemm bugs fix

* f4gemm fix2

* update

* update moe 2 stages

* update codegen

* update gemm_a8w8_asm

* update

* update

* update

* update

* update

* update

* update

---------

Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: Ying Zhou <Ying.Zhou2@amd.com>
Co-authored-by: amd-ruitang3 <Rui.Tang2@.com>

* fix merged tuned config error (#1460)

* fix merged tuned config error

* update

---------

Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* update triton  version check for pa_mqa_logits (#1440)

* update triton  version check for pa_mqa_logits

* Add some explanation for aot-check

* Support pa_mqa_logits aot load on triton>=3.5

* Support pa_mqa_logits aot load on triton>=3.5

---------

Co-authored-by: Feng Shijie <Shijie.Feng@amd.com>

* fix all_reduce_fake (#1465)

* CI: Use ausartifactory.amd.com in pip installation (#1469)

* update codegen (#1471)

* update codegen

* update

* update

* update

* fix

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: valarLip <340077269@qq.com>

* CI: Fix sglang CI test (#1473)

* CI: Add checkout retry in vLLM benchmark tests (#1476)

* CI: Move SGlang and Triton tests to MI300 runners (#1485)

* fix merging aiter config  (#1443)

* change to merge config when used

* fix lint

* fix error in GemmTuner

* fix lint

* fix error when runing deepseek

* fix lint error

* revert other format change

* fix gemm_op_a8w8.py

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* fix fmoe tune preshuffle error (#1430)

* set preshuffle=False default

* fix lint

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* Fix issue: Add nhead=128 support to bf16 and align restrictions (#1450)

* fix fwd_v3 output/lse is nan when kseq=0 and fix qseq >> kseq error (#1442)

* fix output/lse is nan when kseq=0

* fix gfx950 128_128 fwd_v3

* update the k_seq=0 error in MI300 and MI308

* tune a8w8_blockscale&bpreshuffle for tencent (#1444)

* tune a8w8_blockscale&bpreshuffle for tencent

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update a8w8_bpreshuffle_tuned_gemm.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update a8w8_blockscale_bpreshuffle_tuned_gemm.csv

* update aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update a8w8_blockscale_tuned_gemm.csv

* updated a8w8_blockscale_tuned_gemm_ds_v3.csv&a8w8_bpreshuffle_tuned_gemm_dsv3.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update aiter/configs/model_configs/a8w8_bpreshuffle_tuned_gemm_dsv3.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

---------

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update the smoke test

* update the smoke test

* fix MI300 and MI308 err

* fix qseq >> kseq error MI300 and MI308

* fix qseq >> kseq error in MI355

* fix the MI300 error

---------

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
Co-authored-by: Zhu Yuhua <yuhzhu@amd.com>
Co-authored-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* fix the build error of rtp (#1438)

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* Add 32x64 to tuned fmoe config (#1386)

* redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning

* add required keys to fused_moe_1stage_dict

* add kernel descriptors and code object files

* add 32x128 file descriptors and code objects for tuning

* move code objects and kernel descriptors to correct csv

* remove unnecessary import, add quant type argument

* move fused_moe_stage1_tkw1 into fused_moe.py

* remove unnecessary kernel code object files

* add kernel descriptor to tuned fmoe config, add kernel descriptors to related csvs and add related code objects

* create kernel descriptors and kernel co files with correct tags

* some fix for support gpt_oss (#1488)

* CI: Revert vllm_benchmark to use the latest nightly image (#1402)

* add sink_size parameter in mha_fwd and varlen_mha_fwd

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update ck api

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* Remove redundant assignment to sink_size

* Update mha_fwd_kernels.cu

* Update mha.py

* Add false argument to fmha_batch_prefill call

* update ck commit

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* Handle sink_size with conditional window_size length

* update fmha_api

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update ck commit

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update ck_commit

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update ck commmit

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* fix op test error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update mha.py

* Update csrc/include/torch/mha_fwd.h

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update csrc/py_itfs_ck/mha_fwd_kernels.cu

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update csrc/py_itfs_ck/mha_varlen_fwd_kernels.cu

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* add some comments

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update mha_fwd_generate.py

* Clarify sink_size parameter in asm_mha_varlen_fwd.cu

Updated the comment for the sink_size parameter.

* update ck commit

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

---------

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com>
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: la <46212055+junhaha666@users.noreply.github.com>
Co-authored-by: kliuae-amd <KuanFu.Liu@amd.com>
Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: azaidy <aliasger.zaidy@amd.com>
Co-authored-by: Shao-Chun Lee <Shao-Chun.Lee@amd.com>
Co-authored-by: Mehmet Cagri <mehmet.kaymak@amd.com>
Co-authored-by: Bruno Mazzotti <bruno.mazzotti@amd.com>
Co-authored-by: amd-ruitang3 <145657428+amd-ruitang3@users.noreply.github.com>
Co-authored-by: yzhou103 <Ying.Zhou2@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: minmengdie <memin@amd.com>
Co-authored-by: ZhangLirong <lirzhang@amd.com>
Co-authored-by: root <root@hjbog-srdc-39.amd.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: valechen <115046356+valechen@users.noreply.github.com>
Co-authored-by: ukannika <uma.kannikanti@amd.com>
Co-authored-by: root <root@smci355-ccs-aus-m06-29.cs-aus.dcgpu>
Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: who who who <fsx950223@outlook.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: baowendin <46412693+baowendin@users.noreply.github.com>
Co-authored-by: Lixun Zhang <Lixun.Zhang@amd.com>
Co-authored-by: Sami Remes <samremes@amd.com>
Co-authored-by: TennyWang1223 <Tenny.Wang@amd.com>
Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <hollowman@opensuse.org>
Co-authored-by: slippedJim <jim.guo@amd.com>
Co-authored-by: Jeff Huang <jiaji.huang73@gmail.com>
Co-authored-by: Jiming Ruan <Jiming.Ruan@amd.com>
Co-authored-by: zanzhang <zanzhang@amd.com>
Co-authored-by: Fang.Che <Fang.Che@amd.com>
Co-authored-by: Hui Zhou <zhou_hui@outlook.com>
Co-authored-by: Hui Zhou <huizhou@meta.com>
Co-authored-by: TJian <tunjian1996@gmail.com>
Co-authored-by: Satya Nikhil Kodukula <skodukul@amd.com>
Co-authored-by: huaiguxu <145733371+huaiguxu@users.noreply.github.com>
Co-authored-by: Oscar Xu <huuaiguxu@amd.com>
Co-authored-by: xudoyuan <xudoyuan@amd.com>
Co-authored-by: Farel Lukas <farlukas@amd.com>
Co-authored-by: Satya Nikhil Kodukula <nikhil.kodukula@gmail.com>
Co-authored-by: BingYuan.Zhou <BingYuan.Zhou@amd.com>
Co-authored-by: solin <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu>
Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: zhimding <zhimding@amd.com>
Co-authored-by: felix <felix.li@amd.com>
Co-authored-by: eky-amd <ethan.ky@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: gbyu-amd <Guanbao.Yu@amd.com>
Co-authored-by: guanbao <gyu@amd.com>
Co-authored-by: Ilya Panfilov <Ilya.Panfilov@amd.com>
Co-authored-by: Lukasz Burzawa <lukasz.burzawa@amd.com>
Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com>
Co-authored-by: Michael Melesse <micmelesse@gmail.com>
Co-authored-by: amd-ruitang3 <Rui.Tang2@amd.com>
Co-authored-by: luocheng25 <cheng.luo@amd.com>
Co-authored-by: Xiake Sun <xiake.sun@amd.com>
Co-authored-by: Lyu, Xudong <xudonlyu@amd.com>
Co-authored-by: Cu Cui <cu.cui@alumni.uni-heidelberg.de>
Co-authored-by: Drew Wadsworth <drew.wadsworth@gmail.com>
Co-authored-by: yinfengLiu <yinfeliu@amd.com>
Co-authored-by: Zhu Jiale <root@hjbog-srdc-52.amd.com>
Co-authored-by: Feng Shijie <Shijie.Feng@amd.com>
Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>
Co-authored-by: zufayu <zufa.yu@amd.com>
Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: Anton Saukkonen <63663359+antsaukk@users.noreply.github.com>
Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: amd-ruitang3 <Rui.Tang2@.com>
Co-authored-by: Zhu Yuhua <yuhzhu@amd.com>
ZhangLirong-amd added a commit that referenced this pull request Dec 29, 2025
* add sink_size parameter in mha_fwd and varlen_mha_fwd

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update mha.py

* update mha_varlen_fwd_kernels

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* ca_refactor_fix (#1268)

* ca_refactor_fix

* more update

* fix_fp4_quant_dtype (#1271)

* add sample use outer exponential (#1267)

* update sample outer exp

* update mix sample use outer exponential

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* Fix rowwise a8w8 gemm in swizzled hipb_mm (#1258)

* fix rowwise

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* guard hipb_mm output type in unit test

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* correct tuning for rowwise

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* expose b_preshuffle for tuning

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

* fix lint

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>

---------

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com>

* CI: Use TW cluster to run sglang tests (#1273)

* Catchall PR for all 355_wip related changes (#1148)

* Catchall PR for all 355_wip related changes

Co-authored-by: Shao-Chun Lee <Shao-Chun.Lee@amd.com>
Co-authored-by:  Mehmet Cagri <mehmet.kaymak@amd.com>

* add triton fp4 gemm preshuffle (#1187)

* Remove AOT path

* [TRITON] Add Positional Encoding (PE) support to Triton MHA kernels

Related PR: #1184

Patch applied according to Cagri's request.

* Tune more fp4 gemm shapes

* Add aot compilation option for fp4 preshuffled gemm

* Add precompiled fp4 preshuffled gemm kernels

* Add readme

* re-gen AOT binary files for LL 70B FP4 GEMM and update kernels and API helper, fix fused_mul_add logger bug (#1242)

* DS 355_wip fused_shared_expert (#1218)

documentation, fix some bugs, UT

* fused_rms_fp8_group_quant num_warps tunning

* black formatting

* remove redundent files

* unified attn. reorg., fixes, exp2 update

* add MI300 config for fused_gemm_a8w8_blockscale_a16w16

* recover test_mha to upstream

* black formatting

* update test fused_kv_cache

* bypass some UTs for MI300

* update FP8 assert for test_mha_varlen

* skip UT

---------

Co-authored-by: Shao-Chun Lee <Shao-Chun.Lee@amd.com>
Co-authored-by: Mehmet Cagri <mehmet.kaymak@amd.com>
Co-authored-by: Bruno Mazzotti <bruno.mazzotti@amd.com>

* [MI35X] fix core check (#1276)

* fix core check

* update

* Refactor gemm bf16 tuner (#1275)

* refactor GemmTuner

* update

* fix lint error

* fix lint error

* CI: Operators tuning pipelines (#1163)

* CI: Operators tunning pipelines

* Updates

* Updates

* Updates

* Updates

* Show computing units

* Updates

* Updates

* Add op_tune.sh

* Updates

* Disable a4w4

* Updates the error handling

* Updates the error handling

* Updates

* Updates

* Updates

* Updates

* Update .github/scripts/op_tune.sh

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Updates

* Add uloading tuned CSVs

* Updates

* Add shape name

* Add shape arg

* Allows users to select the shapes they want to tune and specify the arguments they need for tuning.

* Only triggered when modifying the untuned csv files under the aiter configs folder

* Test

* Updates

* Updates

* Update .github/workflows/operators-tuning.yaml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update csrc/ck_gemm_a8w8_blockscale_bpreshuffle/README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update a4w4_blockscale_untuned_gemm.csv

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix the problem that v3's performance is worse than ck's (#1237)

* fix fwd v3 kernel perf and opt err

* fix the python mha test error (#1277)

* Reuse custom decorator in core and torch guard (#1278)

* reuse custom decorator in core and torch guard

* remove useless

* rename all_reduce to avoid same name

* rebase

* fix bug

* use cpu device default

---------

Co-authored-by: root <root@hjbog-srdc-39.amd.com>

* integrate deep gemm (#1265)

* integrate m grouped gemm

* update ck

* add limit for 950

* rename deepgeem

* add a tuned config and insert entries in untuned config (#1243)

Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* Enable large batch size and optimization of non-Ragged batching (#1269)

* Enable large batch size and optimization of non-Ragged batching

* Add RAGGED_BATCH to test_la.py and bench_la.py

* add few more fw ds f4 untuned and tuned shapes for using asm kernel (#1298)

* CI: Optimize autotuning pipeline and initialize the docs (#1286)

* CI: Optimize autotuning pipeline and initialize the docs

* topk per row kernel (#1262)

* initial commit for topk per row kernel

* topk per row kernel initial commit

* Fix the typo issue

* Add the topk per row kernel

* optimizations for topk_per_row kernel

* fix overflow

* add unit test for topk_per_row_decode

* update test for decode

* apply vector dispatch from carlus

---------

Co-authored-by: root <root@smci355-ccs-aus-m06-29.cs-aus.dcgpu>
Co-authored-by: valarLip <340077269@qq.com>

* fix aot (#1279)

* fix aot

* remove other kernels path

* fix aot

* format code

---------

Co-authored-by: root <root@hjbog-srdc-24.amd.com>

* Fix ATOM fp8 model quant fail issue in torch compile (#1299)

* Fix fp8 issue in torch compile

* use less code

* feat - pa_fwd support block map with stride in num_kv_heads_dim (#1301)

* Fix how to update accumulator for dot_scaled (#1297)

* CI: Optimize autotuning pipeline docs (#1300)

* CI: Optimize autotuning pipeline docs

* Update docs/autotuning_pipeline.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix the lint issue (#1307)

* fix fwd perf calc error (#1305)

* fix fwd perf calc error

* black aiter/ops/triton/_triton_kernels/gemm_afp4wfp4.py

* add the asm kernel performance of fwd and bwd (#1270)

* add the asm kernel performance of the attention forwards and attention backwards

* modify perf data

* fix perf data

* add a16 perf data

* Fused TopK and Sigmoid kernel (#1251)

* Add topk softmax

* Add test for topk sigmoid

* register the op properly

* apply black

* don't use constexpr with std::string

* bump ck to include topk sigmoid commit

* hipify

* add argparse to the topk sigmoid test, also add pytest

* use own module instead of asm moe

* black formatting

* add missing file

* revert changes to module_moe_asm

* Ar rms (#1290)

* [fea]: add fused allreduce rmsnorm kernel

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* fix: fuse ar rms interface

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* fix: fuse ar rms interface

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* fix ck branch

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* update ar interface

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

---------

Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>

* Dsv32 cache (#1314)

* add indexer_k_quant_and_cache & cp_gather_indexer_k_quant_cache

* indexer_k_quant_and_cache opt kernel and add test

* update

* update2

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* Fix displaying supported architectures (#1316)

Currently it will look like this:
```log
File "TransformerEngine/3rdparty/aiter/aiter/jit/utils/chip_info.py", line 77, in get_gfx_custom_op_core
  raise KeyError(
KeyError: 'Unknown GPU architecture: . Supported architectures: [0, 1, 2, 3, 4, 5, 6, 7, 8]'
```

Signed-off-by: Hollow Man <hollowman@opensuse.org>

* using standalone pybind (#1317)

* using standalone pybind

* fix

* update

* Enable mha bwd hd192_hd128 (#1308)

* update codegen.py

* update kernels & kernel launch

* fix fa bwd dq_acc shape

* remove mask in python api

* CI: Add pre-check status check (#1252)

Creates a unified pre-checks.yaml workflow that runs Black, Ruff, and dependency checks, uploading success/failure signal artifacts.
The other heavy jobs download and verify these signal artifacts: if verification succeeds, the heavy jobs continue running; if it fails, they exit immediately.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [CK_TILE] fmha: Add backward pass support for padded inputs (#1212)

* [CK_TILE] fmha: Add backward pass support for padded inputs

Introduces support for padded sequence lengths in the backward pass of the variable-length flash attention (fmha_v3_varlen_bwd).
- Updated Python and C++ function signatures to accept optional `cu_seqlens_q_padded` and `cu_seqlens_k_padded` arguments.
- Modified the underlying CUDA kernels and code generation scripts to pass padding information via the new `seqlen_q_ptr` and `seqlen_k_ptr` fields in the CK `fmha_bwd_args` struct.
- Modified the underlying kernels and code generation scripts to correctly handle pointers for both padded and unpadded sequence data.
- Added comprehensive gradient verification to the test suite (`test_mha_varlen.py`) to ensure the correctness of the backward pass with various padding scenarios.

* [CK_TILE] fmha: Adapt to composable_kernel padding API changes

Refactor the FMHA forward and backward pass to align with the updated padding API in `composable_kernel`.

- Argument Simplification: Removed the manual calculation of `seqlen_q` and `seqlen_k` from `cu_seqlens` in the `mha.cu` interface. The underlying kernels now handle this logic.
- API Alignment: Updated the arguments passed to `aiter::mha_fwd` and `aiter::mha_bwd` to match the new `composable_kernel` API. This involves passing `cu_seqlen` pointers directly.
- Kernel Interface Update: Modified the `codegen.py` scripts for `gfx942` and `gfx950` to reflect the changes in the kernel's function signatures and argument handling for padded and unpadded sequence lengths.

* fix build error in op_tests/cpp/mha/benchmark_mha_*.cpp

* Mla splitkv enhance split alg inte (#1233)

* add num_kv_splits_indptr to mla for mtp<=4 case for now

* update

* update new kernel

* infrastructures

* 1st version of split kernel

* Fix issues raised by Lingpeng and fix the issue on batch_size

* update mla

* update mla_stage2

* 1st draft of v1 split program

* add kv_offset

* mla_splitkv_enhance_split_alg_inte

* splitkv debug

* 1st version of reduce kernel

* metadata & kernel finish

* add reduce

* final_lse is optional now.

* update kernel

* bug fix

* bug fix 1

* modify reduce api

* update kernel

* fix max splits

* bug fix 3

* fix s80 early return

* update calculation of partial_indx

* add per split test

* make lse support by ref

* test split

* fix redundant calculation of head offset in reduce kernel

* add custom test

* Add support of 128 head size

Fix how to get head count

fff

* update comments

* 1. Let large work be assigned first.
2. Add tolerance to the tile which is slightly smaller than kv_limit.

* Calculate kv_limit dynamically

* Fix bug about difference in split_kv(bool)

* add test

* fix seed

* Add global tolerance 16 in kv seqlen because main kernel cannot handle small splits (kv_seqlen<4) well.

* Fix warp=1 error

* Add redundant mode to make the size of the metadata output fixed; add new param: no_redundant. Reduce can support redundant input in reduce_indptr as well.

fix comm

* fp8 setup

* first version of device metadata

aaa

* Add work_ptrs

* Compatibility to CUDA Graph

* Refactor code. Merge 2 iterations of generate work together.

* Make sure that each batch of workload can never be splited to more than #cluster of tiles.

* Adjust metadata. Get 1% perf gain.

* Parallelize most of the metadata kernel

Make get_cost_top() parallelized.

aaa

* add scale

* 1. Use warp-level bitonic sort to sort batch idx based on their cost for reducing #splits. 2. Use CK's warp ops.

* fp8 function pass

* Fix issues:
1. avg_workload cannot handle any batch!
2. split_kv(bool) is not correct when all the clusters are full.

* fp8 ready

* fix

* persistent ready

* add nv acc test

* rename

* update metashape

* update reduce cu num

* update optest for mla

* fix cu num

* Update metadata and reduce kernels.

* rename kernels

* Add new param kv_granularity to metadata kernel.

* Introduce cal_workload_limit_global_v2

* Support qhead=128 cases.

* Change get_mla_metadata() api. Make some not important parameters be optional through a dict.

* fix potential problem on calculating tot_qo_tiles

typo

* refactor metadata files

aaa

* update metadata v1_2

* update gqa_128 mla_ps & fix metadata v1_2

* Optimize mla metadata v1.2

* Optimize mla metadata v1.2 Part.2

* Optimize mla metadata v1.2 Part.3

* update qlen <=4

* fix mla qlen1

* Optimize mla metadata v1.2 Part.4

* Make reduce_final_map be optional in mla_reduce_v1

* Slightly increase reduce perf

* Add persistent mode for mla reduce kernel

* add mla_a16w8_qh16_m16x4_n16x1_coex0_mask1_ps.co

* update deepseekv32 sparse mla metadata

* update mla_a16w8_qh16_m16x4_n16x1_coex0_mask1_ps.co

* Adjust code for sparse attn

* Optimize the a16w8 kernel

* Improve metadata v1.1 perf

* Make metadata v1.1 support sparse attn

bug fix

tiny fix

* Remove redundant code in mla_reduce

* futile struggle

* Fix issue after merge. aiter main branch is using torch.library.infer_schema which doesn't allow dict as parameter. Thus, change the API for metadata.

* Adjust metadata v1.1 and make this branch be ready to be merged to main branch.

* remove invalid co kernel

* Fix issue brought from f794ae4 which disabled hipify by default.

* support qolen>1 for sparse mla

* make code become prettier

* Fix issue in metadata v1.1

* Fix issue in test_mla.py

* Fix lint fails

* Fix sub-test fails in op_test/test_mla.py

* Fix regression in test_mla.py where mtp>1

* Add head_dim=128 support to reduce

* Add nhead=8 for pa and add assert to make sure the input tensors are in
float32.

* fix issue in vllm benchmark for deepseek: remove metadata v0 because it's not compatible with hip graph

* fix lint

* Revert all the change about mi350 gemm.

* add a8w8 and a16w8 kernel in mla mi350

* add A8W8 Non-persistent mode kernel

* Fix issue reported by Copilot

* add mla non-persistent test

* script: update a16w8 kernel

* rm test_mla_persistent_mi350.py and support mi350 in test_mla_persistent.py

* add mla_a16w16_qh16_m16x4_n16x1_coex0_mask1_ps.co

* fix a8w8 num_kv_split=1

* Fix issue in metadata v1.2 on qo_tiles > 1

* fix ut bandwidth

* Use nhead=16 simulate cases that nhead=16*N where N is in range(32,16*32+1,16)

aaa

Fix regression in sparse attn from the fix in metadata v1.2 for multi qo tile issue

* Add new api get_mla_metadata_info

* fix lint format issues

* Adjust get_mla_metadata_info_v1's parameters.

* update A16W8 kernel

* update A16W8 kernel2

* update A16W8 for mi300

* fix ut and rename some kernels

* rename mla kernel name for head 128

* remove log

* fix format

* add nativly back

* change zeros into empty

* fix with comments

---------

Co-authored-by: Jiming Ruan <Jiming.Ruan@amd.com>
Co-authored-by: zanzhang <zanzhang@amd.com>
Co-authored-by: Fang.Che <Fang.Che@amd.com>
Co-authored-by: minmengdie <memin@amd.com>

* Fix gemm tuner error mi350 (#1313)

* workaround: retry tuning when encountering an invalid pointer

* workaround: retry tuning when encountering an invalid pointer

* fix  lint error

* Update gemm_tuner.py

em timeout

* CI: Skip triton setup in Aiter standard/multigpu tests and add retries when setting up triton (#1325)

* CI: Skip triton in Aiter standard and multigpu tests

* Add retries when building triton

* Add ninja installation

* Fix global variable torch_fp8 initialization caused issue (#1322)

`hipGetDeviceProperties` is called during the `torch_fp8` initialization, which triggers full HIP runtime initialization while global variables are still being initialized. There are two issues:

- Several other global variables are involved in the runtime initialization, and the initialization order of global variables is not guaranteed, so the runtime initialization may read global variables that are not yet initialized.

- A forked child process needs to initialize its own HIP runtime to get a proper GPU driver kernel context and handles. But since the runtime was already initialized globally in the parent process, the forked process considers it initialized and uses it directly, even though it is actually invalid in the child.

The fix is to ensure `hipGetDeviceProperties` is only called when actually needed, not during static initialization.

To repro the issue:
1. fork a child process
2. call torch.empty in the child process

It will get a `hipErrorInvalidValue` error.

Co-authored-by: Hui Zhou <huizhou@meta.com>

* Add transpose scale to the triton fused_rms_fp8_group_quant (#1291)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

* [Triton] 355 wip Llama FP4 triton fusion + TP8 triton decode shape tunning (#1315)

* update AOT, always pad x_scale when generating input, add UT

* update act_mul_mxfp4_quant, fused_rms_mxfp4_quant

* add LL FP4 configs and AOT files for TP8 shapes

* fix UT bug

* add LL TP2 and TP4 shapes

* [TRITON] Kernel naming: add reusable constexpr repr helper (#1260)

* Kernel naming: add reusable constexpr repr helper for gemm a16w16

* add missing params to the repr

* Merge tuned file (#1327)

* merge tuned_file with same prefix

* fix lint

* rename tuned_gemm.csv to bf16_tuned_gemm.csv to avoid matching wrong file

* update

* update README.md of bf16 GemmTuner

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* fix graph_breaks by return tensor for bool op (#1333)
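
A hedged sketch of the pattern behind this fix (the helper names below are made up, not aiter's API): converting a tensor to a Python bool inside a compiled region forces a graph break, while returning the 0-dim bool tensor keeps the check traceable.

```python
import torch


def has_nonzero_bad(x: torch.Tensor) -> bool:
    # bool(...) materializes the value and breaks the graph under torch.compile.
    return bool((x != 0).any())


def has_nonzero_good(x: torch.Tensor) -> torch.Tensor:
    # Returning the 0-dim bool tensor keeps the op traceable end to end.
    return (x != 0).any()


@torch.compile(fullgraph=True)
def f(x):
    flag = has_nonzero_good(x)
    # Consume the flag with torch.where instead of a Python `if flag:` branch.
    return torch.where(flag, x * 2, x)


print(f(torch.tensor([0.0, 1.0])))
```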

* fix_bf16gemm_asm (#1329)

* fix_bf16gemm_asm

* update

* update

* Improve Memory Usage in MLA (#1338)

* Improve mla memory

* further reduce memory usage

* Fix lint issues

* Fix issue reported by Jun Chen.

* fix tune error caused by merge tuned_file (#1342)

* fix tune error caused by merge tuned_file

* fix lint error, rm some log

* rm rocblas in tuner (#1337)

* [Triton] DS a16w8 GEMM and fused reduce_rms_fp8_group_quant (#1328)

* add gemm_a16w8_blockscale and fused_reduce_rms_fp8_group_quant

* black formatting

* add MI300 config

* fix commit

* Add block_m=16 for a8w8_ck_moe_blockscale (#1081)

* Add block_m=16 for a8w8_ck_moe_blockscale

* fix moe_blk_scale token<64 to ck2stage for ds shape

* fp8 moe bugfix tuning rebase tuned_fmoe.csv

* add L2 check

* reformat

---------

Co-authored-by: Oscar Xu <huuaiguxu@amd.com>
Co-authored-by: xudoyuan <xudoyuan@amd.com>

* Add Fused RMSNorm + FP8 Per-tensor Static Quantization Triton Kernel (#1330)

* Fused Triton RMSNorm and FP8 static quantization

* Formatted python scripts

* [TRITON] GEMM kernels nomenclature changes (#1283)

* Kernel naming: add reusable constexpr repr helper (#1260)

* Add missing API documentation

* Temporarily run aiter standard and multigpu tests on the TW cluster, will switch back once the mirror registry is ready. (#1359)

* [Triton] Disable failing lean attention tests (#1357)

* add config (#1355)

* add how_v3_bf16_cvt control to the Python API (#1351)

* add how_v3_bf16_cvt in fwd_v3

* fix the fwd compile

* [fix]: car 6 rank coredump (#1335)

* [fix]: car 6 rank coredump

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* [fix]: add residual out in ar rms

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

---------

Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>

* Wrapper_flash_attn_backward custom op to avoid functionalize fallback and fix guard logic (#1348)

* make can_mha_v3 uninplace and fix guard

* fix error

* wrapper _flash_attn_backward in custom

* make gemm a8w8/a4w4 into custom

* add some op in custom

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
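
A minimal sketch of the wrapping technique used here, with an assumed op name and a simplified signature (the real backward returns dq/dk/dv and dispatches to the compiled kernel): registering the function as a custom op with a fake implementation lets torch.compile trace it without the functionalization fallback.

```python
import torch


@torch.library.custom_op("mylib::attn_bwd", mutates_args=())
def attn_bwd(dout: torch.Tensor, q: torch.Tensor, k: torch.Tensor,
             v: torch.Tensor) -> torch.Tensor:
    # A real implementation would call the flash-attention backward kernel;
    # this placeholder only returns a dq-shaped tensor.
    return torch.zeros_like(q)


@attn_bwd.register_fake
def _(dout, q, k, v):
    # Shape/dtype-only ("fake") implementation: the compiler can trace and
    # functionalize the call without executing the kernel.
    return torch.empty_like(q)
```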

* [TRITON] GEMM kernels nomenclature changes (#1292)

* implemented the use of kernel repr helper to standardize kernel metadata representation

* Add Missing API documentation

* remove commented code

* remove FILL_VALUE to keep kernel names meaningful

* [TRITON] Initial implementations of sparse attention kernels (#1296)

fp8_mqa_logits: calculates the logits (prefill stage) to be used for topk
unified_attention_sparse_mla: sparse attention implementation for DeepSeek-like MLA using the MHA approach, where the kv cache is [seq_len_kv, 1, HEAD_SIZE + kv_lora_rank + rope_rank] and q is [seq_len, NUM_HEADS, kv_lora_rank + rope_rank]
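
The shapes below mirror the layout described above; the head/rank sizes are made-up example values, not the kernels' actual configuration.

```python
import torch

seq_len_kv, seq_len_q = 4096, 16
num_heads = 16
head_size, kv_lora_rank, rope_rank = 128, 512, 64

# One shared KV "head" holding the nope, latent and rope parts back to back.
kv_cache = torch.empty(seq_len_kv, 1, head_size + kv_lora_rank + rope_rank,
                       dtype=torch.bfloat16)
# Queries carry only the latent (kv_lora_rank) and rope parts per head.
q = torch.empty(seq_len_q, num_heads, kv_lora_rank + rope_rank,
                dtype=torch.bfloat16)
print(kv_cache.shape, q.shape)
```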

* [MI35X]cktile moe a16w4 support (#1341)

* draft of cktile moe

* align the interface of main branch to make cktile moe compile pass

* refine code

* refine ck moe

* fix CI build fail about code style

* remove ck blockscale moe modification

* refine code

* fix CI build fail of unsupported block_m=16

* refine format

* fix conflict

* update

* format

* fix format

* update

* update

* update

* format

* format

* remove useless

* fix sorting

---------

Co-authored-by: solin <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu>
Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: Oscar Xu <huuaiguxu@amd.com>
Co-authored-by: zhimding <zhimding@amd.com>
Co-authored-by: felix <felix.li@amd.com>

* [TRITON] Batched GEMM kernels nomenclature changes (#1293)

* implemented the use of kernel repr helper to standardize kernel metadata representation

- batched_gemm_bf16.py
- batched_gemm_a8w8.py
- batched_gemm_afp4wfp4.py (main + reduce kernel)
- batched_gemm_afp4wfp4_pre_quant.py
- batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant.py

* Add Missing API documentation

* remove Dtype to avoid having invalid names in the repr

* [TRITON] Instruction shape fix for Gluon gemm_a8w8_blockscale kernel (#1261)

* fixed instr_shape error for mfma layout

* removed test skips for uneven K

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* moe mxfp4 block_m = 64/128 (#1266)

* moe mxfp4 block_m = 64/128

* update a4w4_gemm2_kernels_list

* add instance tile_m=32

* tuned configuration

* Update test_moe_2stage.py

* refactor

* update v1 pipeline

* update badcase

* fix fp4 moe tuner

* reformat

* revert ck update

* update ck

* Moe mxfp4 ck preshf bns (#1312)

* python code of nbs compatible

* bns compatible

* fix global

* bug fix

---------

Co-authored-by: zhimding <zhimding@amd.com>

* add AITER_MXFP4_MOE_SF switch for mxfp4 moe

* v3 n128

* 32x32 v1

* resolve ck conflict

* rm use_int4=True

* reformatted op_tests/test_moe_2stage.py

* AITER_MXFP4_MOE_SF bugfix

* revert torch.int4

---------

Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: zhimding <zhimding@amd.com>
Co-authored-by: felix <felix.li@amd.com>

* bug fix (#1370)

* [opus] enhance opus utility (#1324)

* enhance tuple and add waitcnt

* refactor vectorized issue space

* wip cached layout

* support cached layout

* add smem

* support broadcast dtype in store()

* Fix issue in metadata v1.2 where batch size is too large (#1352)

* Fix issue in metadata v1.2 where batch size is too large. V1.1 is hopeless in these cases...

* lds_partial_info is not used when there are no further tile splits on qo.

* [GEMM][Config] add a8w8 block scale tuned config for deepseek-v3 (#1310)

* add a8w8 gemm tuned config with block scale for deepseek-v3 shapes

* reorganize the config files

* revert unnecessnary changes

* add headers

* move ds configs to specific model config file

---------

Co-authored-by: guanbao <gyu@amd.com>

* support all logit values (#1323)

* support all logit values

* fix tests

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* CI: Skip triton in Aiter standard and multigpu tests (#1374)

Triton tests will only run when changing files under:
 - aiter/ops/triton
 - op_tests/triton_tests

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* add the performance data bar chart in the readme (#1372)

* add the performance data bar chart in the readme

* fix the wrong remarks

* force ds ptpc moe use 1 stage moe (#1373)

* [TRITON]  MHA PA optimizations (#1245)

Optimizations:
- New config
- Preloading v
- Reordering pe matmul
- Simplifying scaling ops (for sm_scale and log_2(e) related scaling)
- Masking-related changes
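
A small numeric check of the scaling simplification mentioned above: folding sm_scale and log_2(e) into one multiplier lets the kernel use exp2 instead of exp (the head size of 128 below is just an example).

```python
import math
import torch

sm_scale = 1.0 / math.sqrt(128)
qk = torch.randn(4, 4)

p_exp = torch.exp(qk * sm_scale)
p_exp2 = torch.exp2(qk * (sm_scale * math.log2(math.e)))

assert torch.allclose(p_exp, p_exp2, rtol=1e-5, atol=1e-6)
```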

* Enable fa multi target build on other arch (#1318)

* enable fa multi target build on other arch

* update arch info when dispatch in python api

* update

* Support mixed V2/V3 arches

* format

---------

Co-authored-by: Ilya Panfilov <Ilya.Panfilov@amd.com>

* [Triton] DS FP4 triton fusion (#1371)

* add fused_gemm_afp4wfp4_a16w16.py

* fix bug

* add fused_reduce_act_mul_and_mxfp4_quant

* add gemm_a16wfp4.py

* fix

* fix

* fix

* clean up

* repr

* update AOT with repr

* fix bug

* add dummy heuristics

* add output_unquantized_inp1 to fused_rms_mxfp4_quant

* add configs

* fix bug, tune fused_reduce_act_mul_and_mxfp4_quant

* fix

* fix

* final clean up

* add batched_gemm_a16wfp4

* clean up

* add config

* add default config

* remove old kernels, add API redirection and deprecation warning

* add fused_gemm_afp4wfp4_mul_add

* [TRITON] Simplify and optimize triton_kernels moe code and move it into aiter (#1326)

First PR for MoE with optimized support for GPTOSS shapes and fp8 x fp4.

* Use torch.zeros_like instead of empty_like to prevent accuracy drop (#1387)
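
A sketch of the assumed failure mode behind this change (not a reproduction of the original kernel): with empty_like, any slot the kernel never writes keeps stale memory, which can surface as accuracy drops; zeros_like makes the unwritten slots well defined.

```python
import torch

x = torch.randn(8, 16)
mask = torch.rand(8) > 0.5           # only some rows get written

out_bad = torch.empty_like(x)        # unwritten rows contain whatever was in memory
out_good = torch.zeros_like(x)       # unwritten rows are well-defined zeros

out_bad[mask] = x[mask] * 2
out_good[mask] = x[mask] * 2
```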

* CI: Temporarily using old vllm nightly image (#1389)

* Revert "[Triton] DS FP4 triton fusion (#1371)" (#1392)

This reverts commit 4aabf79.

* add a8w8 ptpc gemm config for dsv3 (#1382)

* add ninja to install_requires in setup.py, fix ck gemm a8w8 bshuffle heuristic dispatch not supporting mnk=(2048,2112,7168)

* add 085 ptpc gemm tune config

* rename csv

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* Test the CI on both MI325 and MI355 (#1364)

* Always run tests on mi355

* [Triton] change BF16 GEMM config filename (#1398)

* Support distributed_init_method and DP in init_distributed (#1353)

* support distributed_init_method in init_distributed

* support dp in aiter distribute

* fix only tp error

* FA V3(fp8) and paged Attention compressed (CI green) (#1065)

FA V3(fp8) and paged Attention compressed

FA V3(fp8) and paged Attention

FP8 Prefill work compressed

Fa V3 api

Compress fp8 work so far

pull cast out of torch function

e2e fp8 stub

emulate fa v3

ignore

remove example

clean up forward

save

fp8 backward

ignore train artifacts

just use return_attn_probs

match fa behavior

save fa ref

add fa_ref

fix dropout bug

add link

optional fp8 p descale

rename to v3

fa v3

clean up

match backward

min diff

update varlen api

clean up FP8_P_DESCALE

update bench and test

lint

fix mha varlen bug

remove .gitignore

save

lint

remove skip

bring back skips

add fa module

update v2 interface

create mha_v3

add old v3 path

update fa module

tests passing

sync bwd changes

lint fa module

add kvcache api and test

fix lint

fp8 works

test fp8 only

add paged tests

add flash_attn_with_kvcache to v2

test varlen

move to _triton_kernels

test_mha_backward working with v3

upgrade to cleaned-up module

get test_mha_backward_varlen working

clean up

fix lint bug

move casting functions to utils

fix lint

Update aiter/ops/triton/utils/mha_kernel_utils.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update aiter/ops/triton/utils/mha_kernel_utils.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

use Optional

update from main_perf

lint

update fp8 backward

lint

descale factor is fp32

lint

dequant backward

match dims

Sync with FA main_perf

dequant in backward

pass descale to kernel directly

better kernel naming

simple fp8 path. no transpose

clean up bwd

save

bring back split

min diff

pass descale to bwd

lint

fix bwd nans

FP8_AUTO_DESCALE

use hk for other backwards

fp8 wrapper

lint

rm matrix_instr_nonkdim

split v2 and v3 cleanly

lint

back to og

minimal change

test_mha passes

test green

* is_shuffled (#1377)

* is_shuffled

* shuffle_weight bugfix

* rm AITER_MXFP4_MOE_SF

* preshuffle bugfix

* refactor

* refactor bugfix

* add bns/preshuffle moe mxfp4 UT tests

* add L2 verification

* black op_tests/test_moe_2stage.py

---------

Co-authored-by: zhimding <zhimding@amd.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* Ar rms new interface (#1401)

* [fix]: fused_ar_rms interface

Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com>

* delete comment

Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com>

* change ut case

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* fix ut format err

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* [fix]: ar acc err

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

---------

Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com>
Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: amd-ruitang3 <Rui.Tang2@amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>

* minor fix for mi355 (#1408)

* [Fix] Add sliding window feature for paged_attention_v1 (#1362)

* add sliding window to paged_attention_v1

* Avoid use fmha_v3_varlen_fwd on unsupported architecture gfx90a

* make `sliding_window` default to 0 for better compatibility

* fix possible used compilation problem

* fix ci failure

* fix ci failure

* add a single test to avoid increasing test time a lot

---------

Co-authored-by: Xiake Sun <xiake.sun@amd.com>
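
A reference-style sketch of the sliding-window rule added here (plain PyTorch, not the paged-attention kernel): sliding_window == 0 keeps the old behaviour, otherwise a decode query only attends to the last `sliding_window` cached tokens.

```python
import torch


def decode_attention(q, k_cache, v_cache, context_len, sliding_window=0, scale=1.0):
    # q: [num_heads, head_dim]; k_cache/v_cache: [max_len, num_heads, head_dim]
    start = 0
    if sliding_window > 0:
        start = max(0, context_len - sliding_window)
    k = k_cache[start:context_len]
    v = v_cache[start:context_len]
    scores = torch.einsum("hd,chd->hc", q, k) * scale
    probs = scores.softmax(dim=-1)
    return torch.einsum("hc,chd->hd", probs, v)
```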

* fused_qk_rope_cat_and_cache_mla: Fix Triton compilation error and batch size constraint and output tensor sizing (#1407)

* Fix Triton compilation error by nesting OUTPUT_Q_NOPE_ZEROS condition

* Correct batch size constraint and output tensor sizing
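
A hedged Triton sketch of the constexpr-nesting pattern referenced above (not the actual fused_qk_rope_cat_and_cache_mla kernel): the store sits inside the constexpr `if`, so Triton never specializes a branch that touches an output pointer which may not be provided.

```python
import triton
import triton.language as tl


@triton.jit
def _maybe_zero_q_nope(q_nope_out_ptr, n_elems,
                       OUTPUT_Q_NOPE_ZEROS: tl.constexpr, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elems
    if OUTPUT_Q_NOPE_ZEROS:
        # Only compiled when the flag is set, so the pointer is never dereferenced
        # in specializations where no q_nope output buffer exists.
        tl.store(q_nope_out_ptr + offs,
                 tl.zeros((BLOCK,), dtype=tl.float32), mask=mask)
```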

* max mla splits perbatch (#1390)

* fix issues

* add limit for split num per batch

* fix non-ps num kv split

* fix issue for big batch size

* fix logits alloc

* fix black code stype

* fix ut

* update git ignore& remove aiter/install_mode

* update qh16 qseqlen4 kernel

* update

---------

Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: Fang.Che <Fang.Che@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* topk_per_row_opt (#1394)

* topk_per_row_opt

* Optimize topk by integrating air-topk kernel (TODO: proper workspace size computation and init) (#1409)

* update

* slightly optimize

* add back

---------

Co-authored-by: Cu Cui <cu.cui@alumni.uni-heidelberg.de>
Co-authored-by: carlushuang <carlus.huang@amd.com>

* Fix fused_rms_mxfp4_quant comment (#1369)

* leanAttn softmax fix for spurious data mismatch test failures (#1396)

* leanAttn softmax fix for spurious data mismatch test failures

* black fix

* Remove unused parameters per PR review request

* Black fix

* Add reduce_scatter api (#1413)

* add reduce_scatter api

* add reduce_scatter api

* fix error in fmoe_tuner (#1405)

* fix error in fmoe_tuner

* fix error when tuning QuantType.per_1x32

* rm redundant code

* optimize thread divergence (#1421)

* [TRITON] complex number multiplication that supports 3D ROPE triton kernel (#1061)

* complex number multiplication that supports 3D ROPE triton kernel

* Merge remote-tracking branch 'origin/feature/rope3d-fix' and resolve conflicts

* resolve conflicts

* resolve conflicts

* moved code places

* fix typo

* fix typo

---------

Co-authored-by: Zhu Jiale <root@hjbog-srdc-52.amd.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
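
A sketch of the complex-multiplication view of rotary embeddings behind this kernel (generic RoPE math, not the 3D-ROPE implementation itself): consecutive element pairs are treated as complex numbers and rotated by cos(theta) + i*sin(theta).

```python
import torch


def rope_complex(x, theta):
    # x: [..., d] with d even; theta: [..., d // 2] rotation angles.
    xc = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    rot = torch.polar(torch.ones_like(theta), theta)   # cos(theta) + i*sin(theta)
    return torch.view_as_real(xc * rot).flatten(-2).type_as(x)


x = torch.randn(2, 8)
theta = torch.rand(2, 4)
print(rope_complex(x, theta).shape)   # same shape as x
```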

* Feat: pa_mqa_logits performance optimization & support kv_preshuffle + blocksize16/64 (#1424)

* Update gluon_pa_mqa_logits using preshuffle

* Minor update

* Add finer pipeline granularity

* Add sched_group_barrier optimization

* Minor update

* Support blocksize64 for preshuffle pa_mqa_logits

* Support logits JIT on triton 3.5

* Improve splitkv strategy

* Eliminate redundant condition check

* Add missing oob check

* Resolve reviews

* [Config] add tuned moe and gemm config for qwen3 235b (#1378)

* add moe tuned config

* add gemm tuned config

* move tuned moe config to model specific file

---------

Co-authored-by: guanbao <gyu@amd.com>

* fix repeated unnecessary device check (#1221)

* remove device check

* 8 devices

* more

* more

---------

Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>

* remove lru func in fake (#1429)

* Temporarily disable the test on mi355 (#1437)

* Enable MI355 test on main branch

* CI: Aiter tests bug fix

* [M308] tune silu&act (#1404)

* Vectorized loads and stores combined with the packed multiply path

* format code II

---------

Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* add deepseek ep moe tune config (#1431)

* add ptpc deepseek ep moe tuned config

* add block deepseek ep moe tune config

* using 1stage moe for ptpc deepseek

* [TRITON] Moe a8w4 tuning (#1410)

* fuse routing kernels for small batches

* tune batch=1024

* [TRITON]  Apply config-aware naming (kernel_repr) to attention kernels (#1295)

* Apply kernel_repr to attention kernels

Applied make_kernel_repr helper to 4 attention kernel files:
- pa_decode.py (6 kernels)
- pa_prefill.py (2 kernels)
- chunked_pa_prefill.py (1 kernel)
- mla_decode_rope.py (2 kernels)

Each kernel now has config-aware naming with constexpr parameters
included in the kernel metadata name.

Base: amd/satya/kernel_config_to_name
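
For illustration, a minimal sketch of what such a config-aware repr helper can look like; it mirrors the make_kernel_repr idea above but is not the aiter implementation.

```python
def make_kernel_repr(base_name, constexpr_names):
    """Build a kernel name that embeds selected constexpr values."""

    def kernel_repr(constexprs):
        parts = [f"{name}={constexprs[name]}"
                 for name in constexpr_names if name in constexprs]
        return "_".join([base_name, *parts])

    return kernel_repr


repr_fn = make_kernel_repr("pa_decode", ("BLOCK_SIZE", "NUM_KV_SPLITS"))
print(repr_fn({"BLOCK_SIZE": 64, "NUM_KV_SPLITS": 4, "GQA_RATIO": 8}))
# -> pa_decode_BLOCK_SIZE=64_NUM_KV_SPLITS=4
```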

* Apply kernel_repr to attention kernels

* fix indentation error and add kernel_repr to a missed kernel

* Add descriptions to missing API descriptions

* remove unused imports

* fix runtime error

* revert lean atten to main

* lean atten repr and API desc

* formatting fix

* Update aiter/ops/triton/pod_attention.py

* [fix]: prebuild gen so (#1412)

Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* [TRITON] FP8 MQA optimizations (#1422)

FP8 MQA optimizations AND bench. script

* redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning (#1366)

* redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning

* add required keys to fused_moe_1stage_dict

* add kernel descriptors and code object files

* add 32x128 file descriptors and code objects for tuning

* move code objects and kernel descriptors to correct csv

* remove unnecessary import, add quant type argument

* move fused_moe_stage1_tkw1 into fused_moe.py

* remove unnecessary kernel code object files

* Add missing comma

* saved modified tuned fmoe config for testing purposes

* apply black required formatting

* remove fused_moe_stage1_tkw1 and place aiter.fmoe_g1u1_tkw1 under fused_moe_1stage

* remove unnecessary arguments

* apply black formatting

* simplify aiter.fmoe_g1u1_tkw1 call

* add doweight_stage1 column to fused_moe_1stage_dict map and remove elif condition to select run_1stage=True

* add doweight_stage1 to query key

* modify elif to select run_stage=True for tokens > 16

* apply black formatting

* removing csv and .co files as they will come in separate commit

* removing log logger.info(f[get_2stage_cfgs] run_1stage)

---------

Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>

* CI: Move some tests to MI355 due to the network issue of TW cluster (#1446)

* CI: Move some tests to MI355 due to the network issue of TW cluster

* Modify the GPU_ARCH of sglang tests

* CI: Move Triton tests from TW cluster to internal cluster (#1451)

* tune a8w8_blockscale&bpreshuffle for tencent (#1444)

* tune a8w8_blockscale&bpreshuffle for tencent

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update a8w8_bpreshuffle_tuned_gemm.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update a8w8_blockscale_bpreshuffle_tuned_gemm.csv

* update aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update a8w8_blockscale_tuned_gemm.csv

* updated a8w8_blockscale_tuned_gemm_ds_v3.csv&a8w8_bpreshuffle_tuned_gemm_dsv3.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update aiter/configs/model_configs/a8w8_bpreshuffle_tuned_gemm_dsv3.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

---------

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* [fix]: add ar switch (#1376)

* [fix]: add ar switch

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* [fix]: call ar naive

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* [fix]: add ar switch

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

---------

Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* CI: Test pypi connection (#1458)

* cktile weight preshuffle test and auto tuning for a8w8 (#1400)

* cktile bpreshuffle && tuning code

* refine code

* refine code

* refine code

* refine

* refine

* fix merge conflict

* fix conflict

* fix CI build fail

* refine code

* align aiter interface

* refine

* add get_padded_M

---------

Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: Ying Zhou <Ying.Zhou2@amd.com>

* fix_prebuiild_asm (#1439)

* mdf_bf16gemm

* update

* update f4gemm

* update

* update

* f4gemm bugs fix

* f4gemm fix2

* update

* update moe 2 stages

* update codegen

* update gemm_a8w8_asm

* update

* update

* update

* update

* update

* update

* update

---------

Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: Ying Zhou <Ying.Zhou2@amd.com>
Co-authored-by: amd-ruitang3 <Rui.Tang2@.com>

* fix merged tuned config error (#1460)

* fix merged tuned config error

* update

---------

Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* update triton version check for pa_mqa_logits (#1440)

* update triton version check for pa_mqa_logits

* Add some explanation for aot-check

* Support pa_mqa_logits aot load on triton>=3.5

* Support pa_mqa_logits aot load on triton>=3.5

---------

Co-authored-by: Feng Shijie <Shijie.Feng@amd.com>

* fix all_reduce_fake (#1465)

* CI: Use ausartifactory.amd.com in pip installation (#1469)

* update codegen (#1471)

* update codegen

* update

* update

* update

* fix

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: valarLip <340077269@qq.com>

* CI: Fix sglang CI test (#1473)

* CI: Add checkout retry in vLLM benchmark tests (#1476)

* CI: Move SGlang and Triton tests to MI300 runners (#1485)

* fix merging aiter config  (#1443)

* change to merge config when used

* fix lint

* fix error in GemmTuner

* fix lint

* fix error when running deepseek

* fix lint error

* revert other format change

* fix gemm_op_a8w8.py

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* fix fmoe tune preshuffle error (#1430)

* set preshuffle=False default

* fix lint

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* Fix issue: Add nhead=128 support to bf16 and align restrictions (#1450)

* fix fwd_v3 output/lse is nan when kseq=0 and fix qseq >> kseq error (#1442)

* fix output/lse is nan when kseq=0

* fix gfx950 128_128 fwd_v3

* update the k_seq=0 error in MI300 and MI308

* tune a8w8_blockscale&bpreshuffle for tencent (#1444)

* tune a8w8_blockscale&bpreshuffle for tencent

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update a8w8_bpreshuffle_tuned_gemm.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update a8w8_blockscale_bpreshuffle_tuned_gemm.csv

* update aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update a8w8_blockscale_tuned_gemm.csv

* updated a8w8_blockscale_tuned_gemm_ds_v3.csv&a8w8_bpreshuffle_tuned_gemm_dsv3.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update aiter/configs/model_configs/a8w8_bpreshuffle_tuned_gemm_dsv3.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

---------

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update the smoke test

* update the smoke test

* fix MI300 and MI308 err

* fix qseq >> kseq error MI300 and MI308

* fix qseq >> kseq error in MI355

* fix the MI300 error

---------

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
Co-authored-by: Zhu Yuhua <yuhzhu@amd.com>
Co-authored-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* fix the build error of rtp (#1438)

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* Add 32x64 to tuned fmoe config (#1386)

* redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning

* add required keys to fused_moe_1stage_dict

* add kernel descriptors and code object files

* add 32x128 file descriptors and code objects for tuning

* move code objects and kernel descriptors to correct csv

* remove unnecessary import, add quant type argument

* move fused_moe_stage1_tkw1 into fused_moe.py

* remove unnecessary kernel code object files

* add kernel descriptor to tuned fmoe config, add kernel descriptors to related csvs and add related code objects

* create kernel descriptors and kernel co files with correct tags

* some fixes to support gpt_oss (#1488)

* CI: Revert vllm_benchmark to use the latest nightly image (#1402)

* add sink_size parameter in mha_fwd and varlen_mha_fwd

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update ck api

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* Remove redundant assignment to sink_size

* Update mha_fwd_kernels.cu

* Update mha.py

* Add false argument to fmha_batch_prefill call

* update ck commit

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* Handle sink_size with conditional window_size length

* update fmha_api

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update ck commit

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update ck_commit

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update ck commmit

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* fix op test error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update mha.py

* Update csrc/include/torch/mha_fwd.h

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update csrc/py_itfs_ck/mha_fwd_kernels.cu

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update csrc/py_itfs_ck/mha_varlen_fwd_kernels.cu

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* add some comments

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update mha_fwd_generate.py

* Clarify sink_size parameter in asm_mha_varlen_fwd.cu

Updated the comment for the sink_size parameter.

* update ck commit

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

---------

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com>
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: la <46212055+junhaha666@users.noreply.github.com>
Co-authored-by: kliuae-amd <KuanFu.Liu@amd.com>
Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: azaidy <aliasger.zaidy@amd.com>
Co-authored-by: Shao-Chun Lee <Shao-Chun.Lee@amd.com>
Co-authored-by: Mehmet Cagri <mehmet.kaymak@amd.com>
Co-authored-by: Bruno Mazzotti <bruno.mazzotti@amd.com>
Co-authored-by: amd-ruitang3 <145657428+amd-ruitang3@users.noreply.github.com>
Co-authored-by: yzhou103 <Ying.Zhou2@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: minmengdie <memin@amd.com>
Co-authored-by: ZhangLirong <lirzhang@amd.com>
Co-authored-by: root <root@hjbog-srdc-39.amd.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: valechen <115046356+valechen@users.noreply.github.com>
Co-authored-by: ukannika <uma.kannikanti@amd.com>
Co-authored-by: root <root@smci355-ccs-aus-m06-29.cs-aus.dcgpu>
Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: who who who <fsx950223@outlook.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: baowendin <46412693+baowendin@users.noreply.github.com>
Co-authored-by: Lixun Zhang <Lixun.Zhang@amd.com>
Co-authored-by: Sami Remes <samremes@amd.com>
Co-authored-by: TennyWang1223 <Tenny.Wang@amd.com>
Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <hollowman@opensuse.org>
Co-authored-by: slippedJim <jim.guo@amd.com>
Co-authored-by: Jeff Huang <jiaji.huang73@gmail.com>
Co-authored-by: Jiming Ruan <Jiming.Ruan@amd.com>
Co-authored-by: zanzhang <zanzhang@amd.com>
Co-authored-by: Fang.Che <Fang.Che@amd.com>
Co-authored-by: Hui Zhou <zhou_hui@outlook.com>
Co-authored-by: Hui Zhou <huizhou@meta.com>
Co-authored-by: TJian <tunjian1996@gmail.com>
Co-authored-by: Satya Nikhil Kodukula <skodukul@amd.com>
Co-authored-by: huaiguxu <145733371+huaiguxu@users.noreply.github.com>
Co-authored-by: Oscar Xu <huuaiguxu@amd.com>
Co-authored-by: xudoyuan <xudoyuan@amd.com>
Co-authored-by: Farel Lukas <farlukas@amd.com>
Co-authored-by: Satya Nikhil Kodukula <nikhil.kodukula@gmail.com>
Co-authored-by: BingYuan.Zhou <BingYuan.Zhou@amd.com>
Co-authored-by: solin <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu>
Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: zhimding <zhimding@amd.com>
Co-authored-by: felix <felix.li@amd.com>
Co-authored-by: eky-amd <ethan.ky@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: gbyu-amd <Guanbao.Yu@amd.com>
Co-authored-by: guanbao <gyu@amd.com>
Co-authored-by: Ilya Panfilov <Ilya.Panfilov@amd.com>
Co-authored-by: Lukasz Burzawa <lukasz.burzawa@amd.com>
Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com>
Co-authored-by: Michael Melesse <micmelesse@gmail.com>
Co-authored-by: amd-ruitang3 <Rui.Tang2@amd.com>
Co-authored-by: luocheng25 <cheng.luo@amd.com>
Co-authored-by: Xiake Sun <xiake.sun@amd.com>
Co-authored-by: Lyu, Xudong <xudonlyu@amd.com>
Co-authored-by: Cu Cui <cu.cui@alumni.uni-heidelberg.de>
Co-authored-by: Drew Wadsworth <drew.wadsworth@gmail.com>
Co-authored-by: yinfengLiu <yinfeliu@amd.com>
Co-authored-by: Zhu Jiale <root@hjbog-srdc-52.amd.com>
Co-authored-by: Feng Shijie <Shijie.Feng@amd.com>
Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>
Co-authored-by: zufayu <zufa.yu@amd.com>
Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: Anton Saukkonen <63663359+antsaukk@users.noreply.github.com>
Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: amd-ruitang3 <Rui.Tang2@.com>
Co-authored-by: Zhu Yuhua <yuhzhu@amd.com>