Enable large batch size and optimization of non-Ragged batching #1269
Merged
Conversation
micmelesse approved these changes on Oct 29, 2025
ganyi1996ppo pushed a commit that referenced this pull request on Nov 19, 2025
* Enable large batch size and optimization of non-Ragged batching
* Add RAGGED_BATCH to test_la.py and bench_la.py
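For context: "ragged" batching packs sequences of different lengths and tracks them through cumulative offsets, while a non-ragged batch uses one uniform sequence length, which is what the optimization can exploit. The sketch below only illustrates that distinction for the new `RAGGED_BATCH` switch; the flag name comes from the commit message, but the helper, shapes, and defaults are assumptions rather than the actual `test_la.py`/`bench_la.py` code.

```python
# Illustrative sketch (not the actual aiter test/bench code): how a
# RAGGED_BATCH toggle could select between per-sequence lengths and a
# uniform-length fast path when building cu_seqlens for attention inputs.
import torch

RAGGED_BATCH = False  # hypothetical switch mirroring the one added to the scripts


def build_cu_seqlens(batch_size: int, max_seqlen: int, ragged: bool) -> torch.Tensor:
    """Return cumulative sequence-length offsets of shape [batch_size + 1]."""
    if ragged:
        # Ragged: every sequence gets its own length, so offsets must be scanned.
        seqlens = torch.randint(1, max_seqlen + 1, (batch_size,), dtype=torch.int32)
    else:
        # Non-ragged: all sequences are max_seqlen long, so offsets are simply
        # batch_index * max_seqlen and kernels can skip per-row length lookups.
        seqlens = torch.full((batch_size,), max_seqlen, dtype=torch.int32)
    cu_seqlens = torch.zeros(batch_size + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(seqlens, dim=0)
    return cu_seqlens


# Large batch sizes (the other half of this PR) are just a bigger batch_size here.
print(build_cu_seqlens(batch_size=4096, max_seqlen=128, ragged=RAGGED_BATCH)[-1])
```

Presumably the test and benchmark scripts parametrize over both settings so that the uniform-length path is exercised alongside the general ragged path.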
zhuyuhua-v pushed a commit that referenced this pull request on Nov 23, 2025
* Enable large batch size and optimization of non-Ragged batching
* Add RAGGED_BATCH to test_la.py and bench_la.py
LJ-underdog pushed a commit that referenced this pull request on Nov 27, 2025
* Enable large batch size and optimization of non-Ragged batching
* Add RAGGED_BATCH to test_la.py and bench_la.py
valarLip added a commit that referenced this pull request on Dec 16, 2025
* add sink_size parameter in mha_fwd and varlen_mha_fwd Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update mha.py * update mha_varlen_fwd_kernels Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * ca_refactor_fix (#1268) * ca_refactor_fix * more update * fix_fp4_quant_dtype (#1271) * add sample use outer exponential (#1267) * update sample outer exp * update mix sample use outer exponential --------- Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * Fix rowwise a8w8 gemm in swizzled hipb_mm (#1258) * fix rowwise Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * guard hipb_mm output type in unit test Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * correct tuning for rowwise Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * expose b_preshuffle for tuning Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * fix lint Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> --------- Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com> * CI: Use TW cluster to run sglang tests (#1273) * Catchall PR for all 355_wip related changes (#1148) * Catchall PR for all 355_wip related changes Co-authored-by: Shao-Chun Lee <Shao-Chun.Lee@amd.com> Co-authored-by: Mehmet Cagri <mehmet.kaymak@amd.com> * add triton fp4 gemm preshuffle (#1187) * Remove AOT path * [TRITON] Add Positional Encoding (PE) support to Triton MHA kernels Related PR: #1184 Patch applied according to Cagri's request. * Tune more fp4 gemm shapes * Add aot compilation option for fp4 preshuffled gemm * Add precompiled fp4 preshuffled gemm kernels * Add readme * re-gen AOT binary files for LL 70B FP4 GEMM and update kernels and API helper, fix fused_mul_add logger bug (#1242) * DS 355_wip fused_shared_expert (#1218) documentation, fix some bugs, UT * fused_rms_fp8_group_quant num_warps tunning * black formatting * remove redundent files * unified attn. reorg., fixes, exp2 update * add MI300 config for fused_gemm_a8w8_blockscale_a16w16 * recover test_mha to upstream * black formatting * update test fused_kv_cache * bypass some UTs for MI300 * update FP8 assert for test_mha_varlen * skip UT --------- Co-authored-by: Shao-Chun Lee <Shao-Chun.Lee@amd.com> Co-authored-by: Mehmet Cagri <mehmet.kaymak@amd.com> Co-authored-by: Bruno Mazzotti <bruno.mazzotti@amd.com> * [MI35X] fix core check (#1276) * fix core check * update * Refactor gemm bf16 tuner (#1275) * refactor GemmTuner * update * fix lint error * fix lint error * CI: Operators tuning pipelines (#1163) * CI: Operators tunning pipelines * Updates * Updates * Updates * Updates * Show computing unints * Updates * Updates * Add op_tune.sh * Updates * Disable a4w4 * Updates the error handling * Updates the error handling * Updates * Updates * Updates * Updates * Update .github/scripts/op_tune.sh Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Updates * Add uloading tuned CSVs * Updates * Add shape name * Add shape arg * Allows users to select the shapes they want to tune and specify the arguments they need for tuning. 
* Only be triggered when modify the untuned csv files under aiter configs foleder * Test * Updates * Updates * Update .github/workflows/operators-tuning.yaml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update csrc/ck_gemm_a8w8_blockscale_bpreshuffle/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update a4w4_blockscale_untuned_gemm.csv --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix the problem that v3's performance is worse than ck's (#1237) * fix fwd v3 kernel perf and opt err * fix the python mha test error (#1277) * Reuse custom decorator in core and torch guard (#1278) * reuse custom decorator in core and torch guard * remove useless * rename all_reduce to avoid same name * rebase * fix bug * use cpu device default --------- Co-authored-by: root <root@hjbog-srdc-39.amd.com> * integrate deep gemm (#1265) * integrate m grouped gemm * update ck * add limit for 950 * rename deepgeem * add a tuned config and insert entries in untuned config (#1243) Co-authored-by: Xin Huang <Xin.Huang@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * Enable large batch size and optimization of non-Ragged batching (#1269) * Enable large batch size and optimization of non-Ragged batching * Add RAGGED_BATCH to test_la.py and bench_la.py * add few more fw ds f4 untuned and tuned shapes for using asm kernel (#1298) * CI: Optimize autotuning pipeline and inital the docs (#1286) * CI: Optimize autotuning pipeline and inital the docs * topk per row kernel (#1262) * initial commit for topk per row kernel * topk per row kernel initial commit * Fix the typo issue * Add the topk per row kernel * optimizations for topk_per_row kernel * fix overflow * add unit test for topk_per_row_decode * update test for decode * apply vector dispatch from carlus --------- Co-authored-by: root <root@smci355-ccs-aus-m06-29.cs-aus.dcgpu> Co-authored-by: valarLip <340077269@qq.com> * fix aot (#1279) * fix aot * remove other kernels path * fix aot * format code --------- Co-authored-by: root <root@hjbog-srdc-24.amd.com> * Fix ATOM fp8 model quant fail issue in torch compile (#1299) * Fix fp8 issue in torch compile * use less code * feat - pa_fwd support block map with stride in num_kv_heads_dim (#1301) * Fix how to update accumulator for dot_scaled (#1297) * CI: Optimize autotuning pipeline docs (#1300) * CI: Optimize autotuning pipeline docs * Update docs/autotuning_pipeline.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Fix the lint issue (#1307) * fix fwd perf calc error (#1305) * fix fwd perf calc error * black aiter/ops/triton/_triton_kernels/gemm_afp4wfp4.py * add the asm kernel performance of fwd and bwd (#1270) * add the asm kernel performance of the attention forwards and attention backwards * modify perf data * fix perf data * add a16 perf data * Fused TopK and Sigmoid kernel (#1251) * Add topk softmax * Add test for topk sigmoid * register the op properly * apply black * don't use constexpr with std::string * bump ck to include topk sigmoid commit * hipify * add argparse to the topk sigmoid test, also add pytest * use own module instead of asm moe * black formatting * add missing file * revert changes to module_moe_asm * Ar rms (#1290) * [fea]: add fused allreduce rmsnorm kernel Signed-off-by: root <root@hjbog-srdc-24.amd.com> * fix: fuse ar rms interface Signed-off-by: root 
<root@hjbog-srdc-24.amd.com> * fix: fuse ar rms interface Signed-off-by: root <root@hjbog-srdc-24.amd.com> * fix ck branch Signed-off-by: root <root@hjbog-srdc-24.amd.com> * update ar interface Signed-off-by: root <root@hjbog-srdc-24.amd.com> --------- Signed-off-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: root <root@hjbog-srdc-24.amd.com> * Dsv32 cache (#1314) * add indexer_k_quant_and_cache & cp_gather_indexer_k_quant_cache * ndexer_k_quant_and_cache opt kernel and add test * update * update2 --------- Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * Fix displaying supported architectures (#1316) Currently it will look like this: ```log File "TransformerEngine/3rdparty/aiter/aiter/jit/utils/chip_info.py", line 77, in get_gfx_custom_op_core raise KeyError( KeyError: 'Unknown GPU architecture: . Supported architectures: [0, 1, 2, 3, 4, 5, 6, 7, 8]' ``` Signed-off-by: Hollow Man <hollowman@opensuse.org> * using standalone pybind (#1317) * using standalone pybind * fix * update * Enable mha bwd hd192_hd128 (#1308) * update codegen.py * update kernels & kernel launch * fix fa bwd dq_acc shape * remove mask in python api * CI: Add pre-check status check (#1252) Creates a unified pre-checks.yaml workflow that runs Black, Ruff, and dependency checks, uploading success/failure signal artifacts Download and verify the signal artifacts in the other heavy jobs. If the verification succeeds, the heavy jobs will continue running. If the verification fails, the heavy jobs will exit immediately. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * [CK_TILE] fmha: Add backward pass support for padded inputs (#1212) * [CK_TILE] fmha: Add backward pass support for padded inputs Introduces support for padded sequence lengths in the backward pass of the variable-length flash attention (fmha_v3_varlen_bwd). - Updated Python and C++ function signatures to accept optional `cu_seqlens_q_padded` and `cu_seqlens_k_padded` arguments. - Modified the underlying CUDA kernels and code generation scripts to pass padding information via the new `seqlen_q_ptr` and `seqlen_k_ptr` fields in the CK `fmha_bwd_args` struct. - Modified the underlying kernels and code generation scripts to correctly handle pointers for both padded and unpadded sequence data. - Added comprehensive gradient verification to the test suite (`test_mha_varlen.py`) to ensure the correctness of the backward pass with various padding scenarios. * [CK_TILE] fmha: Adapt to composable_kernel padding API changes Refactor the FMHA forward and backward pass to align with the updated padding API in `composable_kernel`. - Argument Simplification: Removed the manual calculation of `seqlen_q` and `seqlen_k` from `cu_seqlens` in the `mha.cu` interface. The underlying kernels now handle this logic. - API Alignment: Updated the arguments passed to `aiter::mha_fwd` and `aiter::mha_bwd` to match the new `composable_kernel` API. This involves passing `cu_seqlen` pointers directly. - Kernel Interface Update: Modified the `codegen.py` scripts for `gfx942` and `gfx950` to reflect the changes in the kernel's function signatures and argument handling for padded and unpadded sequence lengths. 
* fix build error in op_tests/cpp/mha/benchmark_mha_*.cpp * Mla splitkv enhance split alg inte (#1233) * add num_kv_splits_indptr to mla for mtp<=4 case for now * update * update new kernel * infrastructures * 1st version of split kernel * Fix issues raised by Lingpeng and fix the issue on batch_size * update mla * update mla_stage2 * 1st draft of v1 split program * add kv_offset * mla_splitkv_enhance_split_alg_inte * splitkv debug * 1st version of reduce kernel * metadata & kernel finish * add reduce * final_lse is optional now. * update kernel * bug fix * bug fix 1 * modify reduce api * update kernel * fix max splits * bug fix 3 * fix s80 early return * udpate calculation of partial_indx * add per split test * make lse support by ref * test split * fix redundant calculation of head offset in reduce kernel * add custom test * Add support of 128 head size Fix how to get head count fff * update comments * 1. Let large work be assigned first. 2. Add tolerance to the tile which is slightly smaller than kv_limit. * Calculate kv_limit dynamically * Fix bug about difference in split_kv(bool) * add test * fix seed * Add global tolerance 16 in kv seqlen because main kernel cannot handle small splits (kv_seqlen<4) well. * Fix warp=1 error * Add redundant mode to make the size of output of metadata be fixed add new param: no_redundant. Reduce can support redundant input in reduce_indptr as well. fix comm * fp8 setup * first version of device metadata aaa * Add work_ptrs * Compatibility to CUDA Graph * Refactor code. Merge 2 iterations of generate work together. * Make sure that each batch of workload can never be splited to more than #cluster of tiles. * Adjust metadata. Get 1% perf gain. * Paralize most of metadata kernel Make get_cost_top() paralized. aaa * add scale * 1. Use warp-level bitonic sort to sort batch idx based on their cost for reducing #splits. 2. Use CK's warp ops. * fp8 function pass * Fix issues: 1. avg_workload cannot handle any batch! 2. split_kv(bool) is not correct when all the clusters are full. * fp8 ready * fix * persistent ready * add nv acc test * rename * updata metashape * update reduce cu num * update optest for mla * fix cu num * Update metadata and reduce kernels. * rename kernels * Add new param kv_granularity to metadata kernel. * Introduce cal_workload_limit_global_v2 * Support qhead=128 cases. * Change get_mla_metadata() api. Make some not important parameters be optional through a dict. * fix potential problem on calculating tot_qo_tiles typo * refactor metadata files aaa * update metadata v1_2 * update gqa_128 mla_ps & fix metadata v1_2 * Optimize mla metadata v1.2 * Optimize mla metadata v1.2 Part.2 * Optimize mla metadata v1.2 Part.3 * update qlen <=4 * fix mla qlen1 * Optimize mla metadata v1.2 Part.4 * Make reduce_final_map be optional in mla_reduce_v1 * Slightly increase reduce perf * Add persistent mode for mla reduce kernel * add mla_a16w8_qh16_m16x4_n16x1_coex0_mask1_ps.co * update deepseekv32 sparse mla metadata * update mla_a16w8_qh16_m16x4_n16x1_coex0_mask1_ps.co * Adjust code for sparse attn * Optimize the a16w8 kernel * Improve metadata v1.1 perf * Make metadata v1.1 support sparse attn bug fix tiny fix * Remove redundant code in mla_reduce * futile struggle * Fix issue after merge. aiter main branch is using torch.library.infer_schema which doesn't allow dict as parameter. Thus, change the API for metadata. * Adjust metadata v1.1 and make this branch be ready to be merged to main branch. 
* remove invalid co kernel * Fix issue brought from f794ae4 which disabled hipify by default. * support qolen>1 for sparse mla * make code become prettier * Fix issue in metadata v1.1 * Fix issue in test_mla.py * Fix lint fails * Fix sub-test fails in op_test/test_mla.py * Fix regression in test_mla.py where mtp>1 * Add head_dim=128 support to reduce * Add nhead=8 for pa and add assert to make sure the input tensors are in float32. * fix issue in vllm benchmark for deepseek: remove metadata v0 because it's not compatible with hip graph * fix lint * Revert all the change about mi350 gemm. * add a8w8 and a16w8 kernel in mla mi350 * add A8W8 Non-persistent mode kernel * Fix issue reported by Copilot * add mla non-persistent test * script: update a16w8 kernel * rm test_mla_persistent_mi350.py and support mi350 in test_mla_persistent.py * add mla_a16w16_qh16_m16x4_n16x1_coex0_mask1_ps.co * fix a8w8 num_kv_split=1 * Fix issue in metadata v1.2 on qo_tiles > 1 * fix ut bandwidth * Use nhead=16 simulate cases that nhead=16*N where N is in range(32,16*32+1,16) aaa Fix regression in sparse attn from the fix in metadata v1.2 for multi qo tile issue * Add new api get_mla_metadata_info * fix lint format issues * Adjust get_mla_metadata_info_v1's parameters. * update A16W8 kernel * update A16W8 kernel2 * update A16W8 for mi300 * fix ut and rename some kernels * rename mla kernel name for head 128 * remove log * fix format * add nativly back * change zeros into empty * fix with comments --------- Co-authored-by: Jiming Ruan <Jiming.Ruan@amd.com> Co-authored-by: zanzhang <zanzhang@amd.com> Co-authored-by: Fang.Che <Fang.Che@amd.com> Co-authored-by: minmengdie <memin@amd.com> * Fix gemm tuner error mi350 (#1313) * workaround-retry tuning when encounter invalid pointer * workaround-retry tuning when encounter invalid pointer * fix lint error * Update gemm_tuner.py em timeout * CI: Skip triton setup in Aiter standard/multigpu tests and add retries when setting up triton (#1325) * CI: Skip triton in Aiter standard and multigpu tests * Add retries when building triton * Add ninja installation * Fix global variable torch_fp8 initialization caused issue (#1322) `hipGetDeviceProperties` is called by the `torch_fp8` initialization. It will trigger all the HIP runtime initialization in global variable initialization. There are two issues: - There are several global variables involved in the runtime initialization too. The initialization order of global variables is not guaranteed. So it may use uninitialized global variables for the runtime initialization. - When there is a forked child process, needs to initialize its own HIP runtime to get proper GPU driver kernel context and handles. But since there is a runtime initialized globally in the parent process, the forked process will just consider the runtime is initialized and use it directly. But it is actually invalid. The fix is to ensure `hipGetDeviceProperties` is only called when actually needed, not during static initialization To repro the issue: 1. fork a child process 2. call torch.empty on the child process It will get a `hipErrorInvalidValue` error. 
Co-authored-by: Hui Zhou <huizhou@meta.com> * Add transpose scale to the triton fused_rms_fp8_group_quant (#1291) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> * [Triton] 355 wip Llama FP4 triton fusion + TP8 triton decode shape tunning (#1315) * update AOT, always pad x_scale when generating input, add UT * update act_mul_mxfp4_quant, fused_rms_mxfp4_quant * add LL FP4 configs and AOT files for TP8 shapes * fix UT bug * add LL TP2 and TP4 shapes * [TRITON] Kernel naming: add reusable constexpr repr helper (#1260) * Kernel naming: add reusable constexpr repr helper for gemm a16w16 * add missing params to the repr * Merge tuned file (#1327) * merge tuned_file with same prefix * fix lint * rename tuned_gemm.csv to bf16_tuned_gemm.csv to avoid matching wrong file * update * update README.md of bf16 GemmTuner --------- Co-authored-by: Xin Huang <Xin.Huang@amd.com> * fix graph_breaks by return tensor for bool op (#1333) * fix_bf16gemm_asm (#1329) * fix_bf16gemm_asm * update * update * Improve Memory Usage in MLA (#1338) * Improve mla memory * further reduce memory usage * Fix lint issues * Fix issue reported by Jun Chen. * fix tune error caused by merge tuned_file (#1342) * fix tune error caused by merge tuned_file * fix lint error, rm some log * rm rocblas in tuner (#1337) * [Triton] DS a16w8 GEMM and fused reduce_rms_fp8_group_quant (#1328) * add gemm_a16w8_blockscale and fused_reduce_rms_fp8_group_quant * black formatting * add MI300 config * fix commit * Add block_m=16 for a8w8_ck_moe_blockscale (#1081) * Add block_m=16 for a8w8_ck_moe_blockscale * fix moe_blk_scale token<64 to ck2stage for ds shape * fp8 moe bugfix tuning rebase tuned_fmoe.csv * add L2 check * reformat --------- Co-authored-by: Oscar Xu <huuaiguxu@amd.com> Co-authored-by: xudoyuan <xudoyuan@amd.com> * Add Fused RMSNorm + FP8 Per-tensor Static Quantization Triton Kernel (#1330) * Fused Triton RMSNorm and FP8 static quantization * Formatted python scripts * [TRITON] GEMM kernels nomenclature changes (#1283) * Kernel naming: add reusable constexpr repr helper (#1260) * Add missing API documentation * Temporarily run aiter standard and multigpu tests on the TW cluster, will switch back once the mirror registry is ready. 
(#1359) * [Triton] Disable failing lean attention tests (#1357) * add config (#1355) * add how_v3_bf16_cvt control to the Python API (#1351) * add how_v3_bf16_cvt in fwd_v3 * fix the fwd compile * [fix]: car 6 rank coredump (#1335) * [fix]: car 6 rank coredump Signed-off-by: root <root@hjbog-srdc-24.amd.com> * [fix]: add residual out in ar rms Signed-off-by: root <root@hjbog-srdc-24.amd.com> --------- Signed-off-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: root <root@hjbog-srdc-24.amd.com> * Wrapper_flash_attn_backward custom op to avoid functionalize fallback and fix guard logic (#1348) * make can_mha_v3 uninplace and fix guard * fix error * wrapper _flash_attn_backward in custom * make gemm a8w8/a4w4 into custom * add some op in custom --------- Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * [TRITON] GEMM kernels nomenclature changes (#1292) * implemented the use of kernel repr helper to standardize kernel metadata representation * Add Missing API documentation * remove commented code * remove FILL_VALUE to keep kernels name meaningful * [TRITON] Initial implementations of sparse attention kernels (#1296) fp8_mqa_logits: Calculate the logits (prefill stage) to be used for topk unified_attention_sparse_mla: Sparse attention implementation for the deepseek like MLA using the MHA approach where kv cache is [seq_len_kv, 1, HEAD_SIZE + kv_lora_rank + rope_rank] and q is [seq_len, NUM_HEADS, kv_lora_rank + rope_rank] * [MI35X]cktile moe a16w4 support (#1341) * draft of cktile moe * align the interface of main branch to make cktile moe compile pass * refine code * refine ck moe * fix CI build fail about code style * remove ck blockscale moe modification * refine code * fix CI build fail of unsupport block_m=16 * refine format * fix conflict * update * format * fix format * update * update * update * format * format * remove useless * fix sorting --------- Co-authored-by: solin <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu> Co-authored-by: solin <bingzhou@amd.com> Co-authored-by: Oscar Xu <huuaiguxu@amd.com> Co-authored-by: zhimding <zhimding@amd.com> Co-authored-by: felix <felix.li@amd.com> * [TRITON] Batched GEMM kernels nomenclature changes (#1293) * implemented the use of kernel repr helper to standardize kernel metadata representation - batched_gemm_bf16.py - batched_gemm_a8w8.py - batched_gemm_afp4wfp4.py (main + reduce kernel) - batched_gemm_afp4wfp4_pre_quant.py - batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant.py * Add Missing API documentation * remove Dtype to avoid having invalid names in the repr * [TRITON] Instruction shape fix for Gluon gemm_a8w8_blockscale kernel (#1261) * fixed instr_shape error for mfma layout * removed test skips for uneven K --------- Co-authored-by: Xin Huang <Xin.Huang@amd.com> * moe mxfp4 block_m = 64/128 (#1266) * moe mxfp4 block_m = 64/128 * update a4w4_gemm2_kernels_list * add instance tile_m=32 * tuned configuration * Update test_moe_2stage.py * refactor * update v1 pipeline * update badcase * fix fp4 moe tuner * reformat * revert ck update * update ck * Moe mxfp4 ck preshf bns (#1312) * python code of nbs compatible * bns compatible * fix global * bug fix --------- Co-authored-by: zhimding <zhimding@amd.com> * add AITER_MXFP4_MOE_SF switch for mxfp4 moe * v3 n128 * 32x32 v1 * resolve ck conflict * rm use_int4=True * reformatted op_tests/test_moe_2stage.py * AITER_MXFP4_MOE_SF bugfix * revert torch.int4 --------- Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: zhimding 
<zhimding@amd.com> Co-authored-by: felix <felix.li@amd.com> * bug fix (#1370) * [opus] enhance opus utility (#1324) * enhance tuple and add waitcnt * refactor vectorized issue space * wip cached layout * support cached layout * add smem * support broadcast dtype in store() * Fix issue in metadata v1.2 where batch size is too large (#1352) * Fix issue in metadata v1.2 where batch size is too large. V1.1 is hopeless in these cases... * lds_partial_info is not used when there is no further tile splits on qo. * [GEMM][Config] add a8w8 block scale tuned config for deepseek-v3 (#1310) * add a8w8 gemm tuned config with block scale for deepseek-v3 shapes * reorganize the config files * revert unnecessnary changes * add headers * move ds configs to specific model config file --------- Co-authored-by: guanbao <gyu@amd.com> * support all logit values (#1323) * support all logit values * fix tests --------- Co-authored-by: Xin Huang <Xin.Huang@amd.com> * CI: Skip triton in Aiter standard and multigpu tests (#1374) Triton tests will only run when changing file under: - aiter/ops/triton - op_tests/triton_tests Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * add the performance data bar chart in the readme (#1372) * add the performance data bar chart in the readme * fix the wrong remarks * force ds ptpc moe use 1 stage moe (#1373) * [TRITON] MHA PA optimizations (#1245) Optimizations: New config. Preloading v Reordering pe matmul Simplifying scaling ops (for sm_scale and log_2(e) related scaling) Masking related changes * Enable fa multii target build on other arch (#1318) * enable fa multii target build on other arch * update arch info when dispatch in python api * update * Support mixed V2/V3 arches * format --------- Co-authored-by: Ilya Panfilov <Ilya.Panfilov@amd.com> * [Triton] DS FP4 triton fusion (#1371) * add fused_gemm_afp4wfp4_a16w16.py * fix bug * add fused_reduce_act_mul_and_mxfp4_quant * add gemm_a16wfp4.py * fix * fix * fix * clean up * repr * update AOT with repr * fix bug * add dummy heuristics * add output_unquantized_inp1 to fused_rms_mxfp4_quant * add configs * fix bug, tune fused_reduce_act_mul_and_mxfp4_quant * fix * fix * final clean up * add batched_gemm_a16wfp4 * clean up * add config * add default config * remove old kernels, add API redirection and deprecation warning * add fused_gemm_afp4wfp4_mul_add * [TRITON] Simplify and optimize triton_kernels moe code and move it into aiter (#1326) First PR for MoE with optimized support for GPTOSS shapes and fp8 x fp4. * Use torch.zeros_like instead of empty_like to prevent accruacy drop (#1387) * CI: Temporarily using old vllm nightly image (#1389) * Revert "[Triton] DS FP4 triton fusion (#1371)" (#1392) This reverts commit 4aabf79. 
* add a8w8 ptpc gemm config for dsv3 (#1382) * add ninja to install_requires in setup.py, fix ck gemm a8w8 bshuffle heuristic dispatch not support mnk=(2048,2112,7168) * add 085 ptpc gemm tune config * rename csv --------- Co-authored-by: Xin Huang <Xin.Huang@amd.com> * Test the CI on both MI325 and MI355 (#1364) * Always run tests on mi355 * [Triton] change BF16 GEMM config filename (#1398) * Support distributed_init_method and DP in init_distributed (#1353) * support distributed_init_method in init_distributed * support dp in aiter distribute * fix only tp error * FA V3(fp8) and paged Attention compressed (CI green) (#1065) FA V3(fp8) and paged Attention compressed FA V3(fp8) and paged Attention FP8 Prefill work compressed Fa V3 api Compress fp8 work so far pull cast out of torch function e2e fp8 stub emulate fa v3 ignore remove example clean up forward save fp8 backward ignore train artifacts just use return_attn_probs match fa behvaior save fa ref add fa_ref fix dropout bug add link optional fp8 p descale rename to v3 fa v3 clean up match backward min diff update varlen api clean up FP8_P_DESCALE update bench and test lint fix mha varlen bug remove .gitignore save lint remove skip bring back skips add fa module update v2 interface create mha_v3 add old v3 path update fa module tests passing sync bwd changes lint fa module add kvcache api and test fix lint fp8 works test fp8 only add paged tests add flash_attn_with_kvcache to v2 test varlen move to _triton_kernels test_mha_backward working with v3 upgrade to cleanedup modeule get test_mha_backward_varlen working clean up fix lint bug move casting functions to utils fix lint Update aiter/ops/triton/utils/mha_kernel_utils.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Update aiter/ops/triton/utils/mha_kernel_utils.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> use Optional update from main_perf lint update fp8 backward lint descale factor is fp32 lint dequant backward match dims Sync with FA main_perf dequant in backward pass descale to kernel directly better kernel naming simple fp8 path. 
no transpose clean up bwd save bring back split min diff pass descale to bwd lint fix bwd nans FP8_AUTO_DESCALE use hk for other backwards fp8 wrapper lint rm matrix_instr_nonkdim split v2 and v3 cleanly lint back to og minimal change test_mha passes test green * is_shuffled (#1377) * is_shuffled * shuffle_weight bugfix * rm AITER_MXFP4_MOE_SF * preshuffle bugfix * refactor * refactor bugfix * add bns/preshuffle moe mxfp4 UT tests * add L2 verification * black op_tests/test_moe_2stage.py --------- Co-authored-by: zhimding <zhimding@amd.com> Co-authored-by: Xin Huang <Xin.Huang@amd.com> * Ar rms new interface (#1401) * [fix]: fused_ar_rms interface Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com> * delete comment Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com> * change ut case Signed-off-by: root <root@hjbog-srdc-24.amd.com> * fix ut format err Signed-off-by: root <root@hjbog-srdc-24.amd.com> * [fix]: ar acc err Signed-off-by: root <root@hjbog-srdc-24.amd.com> --------- Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com> Signed-off-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: amd-ruitang3 <Rui.Tang2@amd.com> Co-authored-by: root <root@hjbog-srdc-24.amd.com> * minor fix for mi355 (#1408) * [Fix] Add sliding window feature for paged_attention_v1 (#1362) * add sliding window to paged_attention_v1 * Avoid use fmha_v3_varlen_fwd on unsupported architecture gfx90a * make `sliding_window` default to 0 for better compatibility * fix possible used compilation problem * fix ci failure * fix ci failure * add a single test to avoid increasing test time a lot --------- Co-authored-by: Xiake Sun <xiake.sun@amd.com> * fused_qk_rope_cat_and_cache_mla: Fix Triton compilation error and batch size constraint and output tensor sizing (#1407) * Fix Triton compilation error by nesting OUTPUT_Q_NOPE_ZEROS condition * Correct batch size constraint and output tensor sizing * max mla splits perbatch (#1390) * fix issues * add limit for split num per batch * fix non-ps num kv split * fix issue for big batch size * fix logits alloc * fix black code stype * fix ut * update git ignore& remove aiter/install_mode * update qh16 qseqlen4 kernel * update --------- Co-authored-by: valarLip <340077269@qq.com> Co-authored-by: Fang.Che <Fang.Che@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * topk_per_row_opt (#1394) * topk_per_row_opt * Optimize topk by integrating air-topk kernel (TODO: proper workspace size computation and init) (#1409) * update * slightly optmize * add back --------- Co-authored-by: Cu Cui <cu.cui@alumni.uni-heidelberg.de> Co-authored-by: carlushuang <carlus.huang@amd.com> * Fix fused_rms_mxfp4_quant comment (#1369) * leanAttn softmax fix for spurious data mismatch test failures (#1396) * leanAttn softmax fix for spurious data mismatch test failures * black fix * Remove unused parameters per PR review request * Black fix * Add reduce_scatter api (#1413) * add reduce_scatter api * add reduce_scatter api * fix error in fmoe_tuner (#1405) * fix error in fmoe_tuner * fix error when tuning QuantType.per_1x32 * rm redundant code * optimize thread divergence (#1421) * [TRITON] complex number multiplication that supports 3D ROPE triton kernel (#1061) * complex number multiplication that supports 3D ROPE triton kernel * Merge remote-tracking branch 'origin/feature/rope3d-fix' and resolve conflicts * confilcts resolve * conflicts resolve * moved code places * fix typo * fix typo --------- Co-authored-by: Zhu Jiale <root@hjbog-srdc-52.amd.com> Co-authored-by: Xin Huang 
<Xin.Huang@amd.com> * Feat: pa_mqa_logits performance optimization & support kv_preshuffle + blocksize16/64 (#1424) * Update gluon_pa_mqa_logits using preshuffle * Minor update * Add finer pipeline granularity * Add sched_group_barrier optimization * Minor update * Support blocksize64 for preshuffle pa_mqa_logits * Support logits JIT on triton 3.5 * Improve splitkv strategy * Eliminate redundant conditon check * Add missing oob check * Resolve reviews * [Config] add tuned moe and gemm config for qwen3 235b (#1378) * add moe tuned config * add gemm tuned config * move tuned moe config to model specific file --------- Co-authored-by: guanbao <gyu@amd.com> * fix repeated unnecessary device check (#1221) * remove device check * 8 devices * more * more --------- Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com> * remove lru func in fake (#1429) * Temporarily disable the test on mi355 (#1437) * Enable MI355 test on main branch * CI: Aiter tests bug fix * [M308] tune silu&act (#1404) * Vectorized loads and stores combined with the packed multiply path * format code II --------- Co-authored-by: zufayu <zufayu@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * add deepseek ep moe tune config (#1431) * add ptpc deepseek ep moe tuned config * add block deepseek ep moe tune config * using 1stage moe for ptpc deepseek * [TRITON] Moe a8w4 tuning (#1410) * fuse routing kernels for small batches * tune batch=1024 * [TRITON] Apply config-aware naming (kernel_repr) to attention kernels (#1295) * Apply kernel_repr to attention kernels Applied make_kernel_repr helper to 4 attention kernel files: - pa_decode.py (6 kernels) - pa_prefill.py (2 kernels) - chunked_pa_prefill.py (1 kernel) - mla_decode_rope.py (2 kernels) Each kernel now has config-aware naming with constexpr parameters included in the kernel metadata name. Base: amd/satya/kernel_config_to_name * Apply kernel_repr to attention kernels * fix indentation error and add kernel_repr to a missed kernel * Add descriptions to missing API descriptions * remove unused imports * fix runtime error * revert lean atten to main * lean atten repr and API desc * formatting fix * Update aiter/ops/triton/pod_attention.py * [fix]: prebuild gen so (#1412) Signed-off-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> Co-authored-by: Xin Huang <Xin.Huang@amd.com> * [TRITON] FP8 MQA optimizations (#1422) FP8 MQA optimizations AND bench. 
script * redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning (#1366) * redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning * add required keys to fused_moe_1stage_dict * add kernel descriptors and code object files * add 32x128 file descriptors and code objects for tuning * move code objects and kernel descriptors to correct csv * remove unnecessary import, add quant type argument * move fused_moe_stage1_tkw1 into fused_moe.py * remove unnecessary kernel code object files * Add missing comma * saved modified tuned fmoe config for testing purposes * apply black required formatting * remove fused_moe_stage1_tkw1 and place aiter.fmoe_g1u1_tkw1 under fused_moe_1stage * remove unnecesary arguments * apply black formatting * simplify aiter.fmoe_g1u1_tkw1 call * add doweight_stage1 column to fused_moe_1stage_dict map and remove elif condition to select run_1stage=True * add doweight_stage1 to query key * modidy elif to select run_stage=True for tokens > 16 * apply black formatting * removing csv and .co files as they will come in separate commit * removing log logger.info(f[get_2stage_cfgs] run_1stage) --------- Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com> * CI: Move some tests to MI355 due to the network issue of TW cluster (#1446) * CI: Move some tests to MI355 due to the network issue of TW cluster * Modify the GPU_ARCH of sglang tests * CI: Move Triton tests from TW cluster to internal cluster (#1451) * tune a8w8_blockscale&bpreshuffle for tencent (#1444) * tune a8w8_blockscale&bpreshuffle for tencent Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update a8w8_bpreshuffle_tuned_gemm.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update a8w8_blockscale_bpreshuffle_tuned_gemm.csv * update aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update a8w8_blockscale_tuned_gemm.csv * updated a8w8_blockscale_tuned_gemm_ds_v3.csv&a8w8_bpreshuffle_tuned_gemm_dsv3.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update aiter/configs/model_configs/a8w8_bpreshuffle_tuned_gemm_dsv3.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> --------- Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * [fix]: add ar switch (#1376) * [fix]: add ar switch Signed-off-by: root <root@hjbog-srdc-24.amd.com> * [fix]: call ar naive Signed-off-by: root <root@hjbog-srdc-24.amd.com> * [fix]: add ar switch Signed-off-by: root <root@hjbog-srdc-24.amd.com> --------- Signed-off-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * CI: Test pypi connection (#1458) * cktile weight preshuffle test and auto tuning for a8w8 (#1400) * cktile bpreshuffle && tuning code * refine code * refine code * refine code * refine * refine * fix merge conflict * fix conflict * fix CI build fail * refine code * align aiter interface * refine * add get_padded_M --------- Co-authored-by: solin <bingzhou@amd.com> Co-authored-by: Ying Zhou <Ying.Zhou2@amd.com> * fix_prebuiild_asm (#1439) * mdf_bf16gemm * update * update f4gemm * update * update * f4gemm bugs fix * f4gemm fix2 * update * update moe 2 stages * update codegen * update gemm_a8w8_asm * update * update * update * update * update * update * update --------- Co-authored-by: zufayu <zufayu@amd.com> Co-authored-by: chenjun <junchen2@amd.com> Co-authored-by: Ying Zhou <Ying.Zhou2@amd.com> Co-authored-by: amd-ruitang3 <Rui.Tang2@.com> * fix merged tuned config error (#1460) * fix merged 
tuned config error * update --------- Co-authored-by: valarLip <340077269@qq.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * update triton version check for pa_mqa_logits (#1440) * update triton version check for pa_mqa_logits * Add some explanation for aot-check * Support pa_mqa_logits aot load on triton>=3.5 * Support pa_mqa_logits aot load on triton>=3.5 --------- Co-authored-by: Feng Shijie <Shijie.Feng@amd.com> * fix all_reduce_fake (#1465) * CI: Use ausartifactory.amd.com in pip installation (#1469) * update codegen (#1471) * update codegen * update * update * update * fix --------- Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> Co-authored-by: valarLip <340077269@qq.com> * CI: Fix sglang CI test (#1473) * CI: Add checkout retry in vLLM benchmark tests (#1476) * CI: Move SGlang and Triton tests to MI300 runners (#1485) * fix merging aiter config (#1443) * change to merge config when used * fix lint * fix error in GemmTuner * fix lint * fix error when runing deepseek * fix lint error * revert other format change * fix gemm_op_a8w8.py --------- Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * fix fmoe tune preshuffle error (#1430) * set preshuffle=False default * fix lint --------- Co-authored-by: Xin Huang <Xin.Huang@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * Fix issue: Add nhead=128 support to bf16 and align restrictions (#1450) * fix fwd_v3 output/lse is nan when kseq=0 and fix qseq >> kseq error (#1442) * fix output/lse is nan when kseq=0 * fix gfx950 128_128 fwd_v3 * update the k_seq=0 error in MI300 and MI308 * tune a8w8_blockscale&bpreshuffle for tencent (#1444) * tune a8w8_blockscale&bpreshuffle for tencent Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update a8w8_bpreshuffle_tuned_gemm.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update a8w8_blockscale_bpreshuffle_tuned_gemm.csv * update aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update a8w8_blockscale_tuned_gemm.csv * updated a8w8_blockscale_tuned_gemm_ds_v3.csv&a8w8_bpreshuffle_tuned_gemm_dsv3.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update aiter/configs/model_configs/a8w8_bpreshuffle_tuned_gemm_dsv3.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> --------- Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update the smoke test * update the smoke test * fix MI300 and MI308 err * fix qseq >> kseq error MI300 and MI308 * fix qseq >> kseq error in MI355 * fix the MI300 error --------- Signed-off-by: LJ-underdog <Jun.Lin@amd.com> Co-authored-by: Zhu Yuhua <yuhzhu@amd.com> Co-authored-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * fix the build error of rtp (#1438) Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * Add 32x64 to tuned fmoe config (#1386) * redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning * add required keys to fused_moe_1stage_dict * add kernel descriptors and code object files * add 32x128 file descriptors and code objects for tuning * move code objects and kernel descriptors to correct csv * remove unnecessary import, add quant type argument * move fused_moe_stage1_tkw1 into fused_moe.py * remove unnecessary kernel code object files * add kernel descriptor to tuned fmoe config, add kernel descriptors to related csvs and add related code objects * create kernel descriptors and kernel co files with correct 
tags * some fix for support gpt_oss (#1488) * CI: Revert vllm_benchmark to use the latest nightly image (#1402) * add sink_size parameter in mha_fwd and varlen_mha_fwd Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update ck api Signed-off-by: JL-underdog <Jun.Lin@amd.com> * Remove redundant assignment to sink_size * Update mha_fwd_kernels.cu * Update mha.py * Add false argument to fmha_batch_prefill call * update ck commit Signed-off-by: JL-underdog <Jun.Lin@amd.com> * Handle sink_size with conditional window_size length * update fmha_api Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update ck commit Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update ck_commit Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update ck commmit Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * fix op test error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update mha.py * Update csrc/include/torch/mha_fwd.h Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update csrc/py_itfs_ck/mha_fwd_kernels.cu Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update csrc/py_itfs_ck/mha_varlen_fwd_kernels.cu Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * add some comments Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update mha_fwd_generate.py * Clarify sink_size parameter in asm_mha_varlen_fwd.cu Updated the comment for the sink_size parameter. * update ck commit Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> --------- Signed-off-by: LJ-underdog <Jun.Lin@amd.com> Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> Signed-off-by: root <root@hjbog-srdc-24.amd.com> Signed-off-by: Hollow Man <hollowman@opensuse.org> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com> Signed-off-by: JL-underdog <Jun.Lin@amd.com> Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: la <46212055+junhaha666@users.noreply.github.com> Co-authored-by: kliuae-amd <KuanFu.Liu@amd.com> Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com> Co-authored-by: Xin Huang <Xin.Huang@amd.com> Co-authored-by: azaidy <aliasger.zaidy@amd.com> Co-authored-by: Shao-Chun Lee <Shao-Chun.Lee@amd.com> Co-authored-by: Mehmet Cagri <mehmet.kaymak@amd.com> Co-authored-by: Bruno Mazzotti <bruno.mazzotti@amd.com> Co-authored-by: amd-ruitang3 <145657428+amd-ruitang3@users.noreply.github.com> Co-authored-by: yzhou103 <Ying.Zhou2@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: minmengdie <memin@amd.com> Co-authored-by: ZhangLirong <lirzhang@amd.com> Co-authored-by: root <root@hjbog-srdc-39.amd.com> Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com> Co-authored-by: valechen <115046356+valechen@users.noreply.github.com> Co-authored-by: ukannika <uma.kannikanti@amd.com> Co-authored-by: root <root@smci355-ccs-aus-m06-29.cs-aus.dcgpu> Co-authored-by: valarLip <340077269@qq.com> Co-authored-by: who who who <fsx950223@outlook.com> Co-authored-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: baowendin <46412693+baowendin@users.noreply.github.com> Co-authored-by: Lixun Zhang <Lixun.Zhang@amd.com> Co-authored-by: Sami Remes <samremes@amd.com> Co-authored-by: TennyWang1223 <Tenny.Wang@amd.com> Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <hollowman@opensuse.org> Co-authored-by: slippedJim <jim.guo@amd.com> Co-authored-by: Jeff Huang <jiaji.huang73@gmail.com> Co-authored-by: Jiming Ruan 
<Jiming.Ruan@amd.com> Co-authored-by: zanzhang <zanzhang@amd.com> Co-authored-by: Fang.Che <Fang.Che@amd.com> Co-authored-by: Hui Zhou <zhou_hui@outlook.com> Co-authored-by: Hui Zhou <huizhou@meta.com> Co-authored-by: TJian <tunjian1996@gmail.com> Co-authored-by: Satya Nikhil Kodukula <skodukul@amd.com> Co-authored-by: huaiguxu <145733371+huaiguxu@users.noreply.github.com> Co-authored-by: Oscar Xu <huuaiguxu@amd.com> Co-authored-by: xudoyuan <xudoyuan@amd.com> Co-authored-by: Farel Lukas <farlukas@amd.com> Co-authored-by: Satya Nikhil Kodukula <nikhil.kodukula@gmail.com> Co-authored-by: BingYuan.Zhou <BingYuan.Zhou@amd.com> Co-authored-by: solin <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu> Co-authored-by: solin <bingzhou@amd.com> Co-authored-by: zhimding <zhimding@amd.com> Co-authored-by: felix <felix.li@amd.com> Co-authored-by: eky-amd <ethan.ky@amd.com> Co-authored-by: carlushuang <carlus.huang@amd.com> Co-authored-by: gbyu-amd <Guanbao.Yu@amd.com> Co-authored-by: guanbao <gyu@amd.com> Co-authored-by: Ilya Panfilov <Ilya.Panfilov@amd.com> Co-authored-by: Lukasz Burzawa <lukasz.burzawa@amd.com> Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com> Co-authored-by: Michael Melesse <micmelesse@gmail.com> Co-authored-by: amd-ruitang3 <Rui.Tang2@amd.com> Co-authored-by: luocheng25 <cheng.luo@amd.com> Co-authored-by: Xiake Sun <xiake.sun@amd.com> Co-authored-by: Lyu, Xudong <xudonlyu@amd.com> Co-authored-by: Cu Cui <cu.cui@alumni.uni-heidelberg.de> Co-authored-by: Drew Wadsworth <drew.wadsworth@gmail.com> Co-authored-by: yinfengLiu <yinfeliu@amd.com> Co-authored-by: Zhu Jiale <root@hjbog-srdc-52.amd.com> Co-authored-by: Feng Shijie <Shijie.Feng@amd.com> Co-authored-by: b8zhong <b8zhong@uwaterloo.ca> Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com> Co-authored-by: zufayu <zufa.yu@amd.com> Co-authored-by: zufayu <zufayu@amd.com> Co-authored-by: Anton Saukkonen <63663359+antsaukk@users.noreply.github.com> Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com> Co-authored-by: chenjun <junchen2@amd.com> Co-authored-by: amd-ruitang3 <Rui.Tang2@.com> Co-authored-by: Zhu Yuhua <yuhzhu@amd.com>
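Among the many changes folded into the squashed message above is the `torch_fp8` initialization fix (#1322), whose description includes a two-step repro: fork a child process, then call `torch.empty` in it. A minimal sketch of that repro, assuming a ROCm build of PyTorch on Linux (where the fork start method is available), might look like the following; it is illustrative only and not taken from the aiter test suite.

```python
# Hypothetical repro sketch for the issue described in #1322: if the HIP
# runtime is initialized during static/global initialization in the parent
# process (as the old global torch_fp8 setup did via hipGetDeviceProperties),
# a fork()ed child inherits that state and its first allocation can fail with
# hipErrorInvalidValue.
import multiprocessing as mp

import torch


def child() -> None:
    # Before the fix, this allocation in the forked child could fail because
    # the child reused the parent's runtime state instead of initializing its own.
    x = torch.empty(16, device="cuda")  # "cuda" maps to HIP on ROCm builds
    print("child allocation ok:", tuple(x.shape))


if __name__ == "__main__":
    ctx = mp.get_context("fork")  # fork, not spawn, is what triggers the problem
    p = ctx.Process(target=child)
    p.start()
    p.join()
```

The fix keeps `hipGetDeviceProperties` out of static initialization so that each process touches the HIP runtime only when it actually needs it.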
zhuyuhua-v added a commit that referenced this pull request on Dec 17, 2025
Co-authored-by: Hui Zhou <huizhou@meta.com> * Add transpose scale to the triton fused_rms_fp8_group_quant (#1291) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> * [Triton] 355 wip Llama FP4 triton fusion + TP8 triton decode shape tunning (#1315) * update AOT, always pad x_scale when generating input, add UT * update act_mul_mxfp4_quant, fused_rms_mxfp4_quant * add LL FP4 configs and AOT files for TP8 shapes * fix UT bug * add LL TP2 and TP4 shapes * [TRITON] Kernel naming: add reusable constexpr repr helper (#1260) * Kernel naming: add reusable constexpr repr helper for gemm a16w16 * add missing params to the repr * Merge tuned file (#1327) * merge tuned_file with same prefix * fix lint * rename tuned_gemm.csv to bf16_tuned_gemm.csv to avoid matching wrong file * update * update README.md of bf16 GemmTuner --------- Co-authored-by: Xin Huang <Xin.Huang@amd.com> * fix graph_breaks by return tensor for bool op (#1333) * fix_bf16gemm_asm (#1329) * fix_bf16gemm_asm * update * update * Improve Memory Usage in MLA (#1338) * Improve mla memory * further reduce memory usage * Fix lint issues * Fix issue reported by Jun Chen. * fix tune error caused by merge tuned_file (#1342) * fix tune error caused by merge tuned_file * fix lint error, rm some log * rm rocblas in tuner (#1337) * [Triton] DS a16w8 GEMM and fused reduce_rms_fp8_group_quant (#1328) * add gemm_a16w8_blockscale and fused_reduce_rms_fp8_group_quant * black formatting * add MI300 config * fix commit * Add block_m=16 for a8w8_ck_moe_blockscale (#1081) * Add block_m=16 for a8w8_ck_moe_blockscale * fix moe_blk_scale token<64 to ck2stage for ds shape * fp8 moe bugfix tuning rebase tuned_fmoe.csv * add L2 check * reformat --------- Co-authored-by: Oscar Xu <huuaiguxu@amd.com> Co-authored-by: xudoyuan <xudoyuan@amd.com> * Add Fused RMSNorm + FP8 Per-tensor Static Quantization Triton Kernel (#1330) * Fused Triton RMSNorm and FP8 static quantization * Formatted python scripts * [TRITON] GEMM kernels nomenclature changes (#1283) * Kernel naming: add reusable constexpr repr helper (#1260) * Add missing API documentation * Temporarily run aiter standard and multigpu tests on the TW cluster, will switch back once the mirror registry is ready. 
(#1359) * [Triton] Disable failing lean attention tests (#1357) * add config (#1355) * add how_v3_bf16_cvt control to the Python API (#1351) * add how_v3_bf16_cvt in fwd_v3 * fix the fwd compile * [fix]: car 6 rank coredump (#1335) * [fix]: car 6 rank coredump Signed-off-by: root <root@hjbog-srdc-24.amd.com> * [fix]: add residual out in ar rms Signed-off-by: root <root@hjbog-srdc-24.amd.com> --------- Signed-off-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: root <root@hjbog-srdc-24.amd.com> * Wrapper_flash_attn_backward custom op to avoid functionalize fallback and fix guard logic (#1348) * make can_mha_v3 uninplace and fix guard * fix error * wrapper _flash_attn_backward in custom * make gemm a8w8/a4w4 into custom * add some op in custom --------- Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * [TRITON] GEMM kernels nomenclature changes (#1292) * implemented the use of kernel repr helper to standardize kernel metadata representation * Add Missing API documentation * remove commented code * remove FILL_VALUE to keep kernels name meaningful * [TRITON] Initial implementations of sparse attention kernels (#1296) fp8_mqa_logits: Calculate the logits (prefill stage) to be used for topk unified_attention_sparse_mla: Sparse attention implementation for the deepseek like MLA using the MHA approach where kv cache is [seq_len_kv, 1, HEAD_SIZE + kv_lora_rank + rope_rank] and q is [seq_len, NUM_HEADS, kv_lora_rank + rope_rank] * [MI35X]cktile moe a16w4 support (#1341) * draft of cktile moe * align the interface of main branch to make cktile moe compile pass * refine code * refine ck moe * fix CI build fail about code style * remove ck blockscale moe modification * refine code * fix CI build fail of unsupport block_m=16 * refine format * fix conflict * update * format * fix format * update * update * update * format * format * remove useless * fix sorting --------- Co-authored-by: solin <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu> Co-authored-by: solin <bingzhou@amd.com> Co-authored-by: Oscar Xu <huuaiguxu@amd.com> Co-authored-by: zhimding <zhimding@amd.com> Co-authored-by: felix <felix.li@amd.com> * [TRITON] Batched GEMM kernels nomenclature changes (#1293) * implemented the use of kernel repr helper to standardize kernel metadata representation - batched_gemm_bf16.py - batched_gemm_a8w8.py - batched_gemm_afp4wfp4.py (main + reduce kernel) - batched_gemm_afp4wfp4_pre_quant.py - batched_gemm_a8w8_a_per_token_group_prequant_w_per_batched_tensor_quant.py * Add Missing API documentation * remove Dtype to avoid having invalid names in the repr * [TRITON] Instruction shape fix for Gluon gemm_a8w8_blockscale kernel (#1261) * fixed instr_shape error for mfma layout * removed test skips for uneven K --------- Co-authored-by: Xin Huang <Xin.Huang@amd.com> * moe mxfp4 block_m = 64/128 (#1266) * moe mxfp4 block_m = 64/128 * update a4w4_gemm2_kernels_list * add instance tile_m=32 * tuned configuration * Update test_moe_2stage.py * refactor * update v1 pipeline * update badcase * fix fp4 moe tuner * reformat * revert ck update * update ck * Moe mxfp4 ck preshf bns (#1312) * python code of nbs compatible * bns compatible * fix global * bug fix --------- Co-authored-by: zhimding <zhimding@amd.com> * add AITER_MXFP4_MOE_SF switch for mxfp4 moe * v3 n128 * 32x32 v1 * resolve ck conflict * rm use_int4=True * reformatted op_tests/test_moe_2stage.py * AITER_MXFP4_MOE_SF bugfix * revert torch.int4 --------- Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: zhimding 
<zhimding@amd.com> Co-authored-by: felix <felix.li@amd.com> * bug fix (#1370) * [opus] enhance opus utility (#1324) * enhance tuple and add waitcnt * refactor vectorized issue space * wip cached layout * support cached layout * add smem * support broadcast dtype in store() * Fix issue in metadata v1.2 where batch size is too large (#1352) * Fix issue in metadata v1.2 where batch size is too large. V1.1 is hopeless in these cases... * lds_partial_info is not used when there is no further tile splits on qo. * [GEMM][Config] add a8w8 block scale tuned config for deepseek-v3 (#1310) * add a8w8 gemm tuned config with block scale for deepseek-v3 shapes * reorganize the config files * revert unnecessnary changes * add headers * move ds configs to specific model config file --------- Co-authored-by: guanbao <gyu@amd.com> * support all logit values (#1323) * support all logit values * fix tests --------- Co-authored-by: Xin Huang <Xin.Huang@amd.com> * CI: Skip triton in Aiter standard and multigpu tests (#1374) Triton tests will only run when changing file under: - aiter/ops/triton - op_tests/triton_tests Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * add the performance data bar chart in the readme (#1372) * add the performance data bar chart in the readme * fix the wrong remarks * force ds ptpc moe use 1 stage moe (#1373) * [TRITON] MHA PA optimizations (#1245) Optimizations: New config. Preloading v Reordering pe matmul Simplifying scaling ops (for sm_scale and log_2(e) related scaling) Masking related changes * Enable fa multii target build on other arch (#1318) * enable fa multii target build on other arch * update arch info when dispatch in python api * update * Support mixed V2/V3 arches * format --------- Co-authored-by: Ilya Panfilov <Ilya.Panfilov@amd.com> * [Triton] DS FP4 triton fusion (#1371) * add fused_gemm_afp4wfp4_a16w16.py * fix bug * add fused_reduce_act_mul_and_mxfp4_quant * add gemm_a16wfp4.py * fix * fix * fix * clean up * repr * update AOT with repr * fix bug * add dummy heuristics * add output_unquantized_inp1 to fused_rms_mxfp4_quant * add configs * fix bug, tune fused_reduce_act_mul_and_mxfp4_quant * fix * fix * final clean up * add batched_gemm_a16wfp4 * clean up * add config * add default config * remove old kernels, add API redirection and deprecation warning * add fused_gemm_afp4wfp4_mul_add * [TRITON] Simplify and optimize triton_kernels moe code and move it into aiter (#1326) First PR for MoE with optimized support for GPTOSS shapes and fp8 x fp4. * Use torch.zeros_like instead of empty_like to prevent accruacy drop (#1387) * CI: Temporarily using old vllm nightly image (#1389) * Revert "[Triton] DS FP4 triton fusion (#1371)" (#1392) This reverts commit 4aabf79. 
* add a8w8 ptpc gemm config for dsv3 (#1382) * add ninja to install_requires in setup.py, fix ck gemm a8w8 bshuffle heuristic dispatch not support mnk=(2048,2112,7168) * add 085 ptpc gemm tune config * rename csv --------- Co-authored-by: Xin Huang <Xin.Huang@amd.com> * Test the CI on both MI325 and MI355 (#1364) * Always run tests on mi355 * [Triton] change BF16 GEMM config filename (#1398) * Support distributed_init_method and DP in init_distributed (#1353) * support distributed_init_method in init_distributed * support dp in aiter distribute * fix only tp error * FA V3(fp8) and paged Attention compressed (CI green) (#1065) FA V3(fp8) and paged Attention compressed FA V3(fp8) and paged Attention FP8 Prefill work compressed Fa V3 api Compress fp8 work so far pull cast out of torch function e2e fp8 stub emulate fa v3 ignore remove example clean up forward save fp8 backward ignore train artifacts just use return_attn_probs match fa behvaior save fa ref add fa_ref fix dropout bug add link optional fp8 p descale rename to v3 fa v3 clean up match backward min diff update varlen api clean up FP8_P_DESCALE update bench and test lint fix mha varlen bug remove .gitignore save lint remove skip bring back skips add fa module update v2 interface create mha_v3 add old v3 path update fa module tests passing sync bwd changes lint fa module add kvcache api and test fix lint fp8 works test fp8 only add paged tests add flash_attn_with_kvcache to v2 test varlen move to _triton_kernels test_mha_backward working with v3 upgrade to cleanedup modeule get test_mha_backward_varlen working clean up fix lint bug move casting functions to utils fix lint Update aiter/ops/triton/utils/mha_kernel_utils.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Update aiter/ops/triton/utils/mha_kernel_utils.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> use Optional update from main_perf lint update fp8 backward lint descale factor is fp32 lint dequant backward match dims Sync with FA main_perf dequant in backward pass descale to kernel directly better kernel naming simple fp8 path. 
no transpose clean up bwd save bring back split min diff pass descale to bwd lint fix bwd nans FP8_AUTO_DESCALE use hk for other backwards fp8 wrapper lint rm matrix_instr_nonkdim split v2 and v3 cleanly lint back to og minimal change test_mha passes test green * is_shuffled (#1377) * is_shuffled * shuffle_weight bugfix * rm AITER_MXFP4_MOE_SF * preshuffle bugfix * refactor * refactor bugfix * add bns/preshuffle moe mxfp4 UT tests * add L2 verification * black op_tests/test_moe_2stage.py --------- Co-authored-by: zhimding <zhimding@amd.com> Co-authored-by: Xin Huang <Xin.Huang@amd.com> * Ar rms new interface (#1401) * [fix]: fused_ar_rms interface Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com> * delete comment Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com> * change ut case Signed-off-by: root <root@hjbog-srdc-24.amd.com> * fix ut format err Signed-off-by: root <root@hjbog-srdc-24.amd.com> * [fix]: ar acc err Signed-off-by: root <root@hjbog-srdc-24.amd.com> --------- Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com> Signed-off-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: amd-ruitang3 <Rui.Tang2@amd.com> Co-authored-by: root <root@hjbog-srdc-24.amd.com> * minor fix for mi355 (#1408) * [Fix] Add sliding window feature for paged_attention_v1 (#1362) * add sliding window to paged_attention_v1 * Avoid use fmha_v3_varlen_fwd on unsupported architecture gfx90a * make `sliding_window` default to 0 for better compatibility * fix possible used compilation problem * fix ci failure * fix ci failure * add a single test to avoid increasing test time a lot --------- Co-authored-by: Xiake Sun <xiake.sun@amd.com> * fused_qk_rope_cat_and_cache_mla: Fix Triton compilation error and batch size constraint and output tensor sizing (#1407) * Fix Triton compilation error by nesting OUTPUT_Q_NOPE_ZEROS condition * Correct batch size constraint and output tensor sizing * max mla splits perbatch (#1390) * fix issues * add limit for split num per batch * fix non-ps num kv split * fix issue for big batch size * fix logits alloc * fix black code stype * fix ut * update git ignore& remove aiter/install_mode * update qh16 qseqlen4 kernel * update --------- Co-authored-by: valarLip <340077269@qq.com> Co-authored-by: Fang.Che <Fang.Che@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * topk_per_row_opt (#1394) * topk_per_row_opt * Optimize topk by integrating air-topk kernel (TODO: proper workspace size computation and init) (#1409) * update * slightly optmize * add back --------- Co-authored-by: Cu Cui <cu.cui@alumni.uni-heidelberg.de> Co-authored-by: carlushuang <carlus.huang@amd.com> * Fix fused_rms_mxfp4_quant comment (#1369) * leanAttn softmax fix for spurious data mismatch test failures (#1396) * leanAttn softmax fix for spurious data mismatch test failures * black fix * Remove unused parameters per PR review request * Black fix * Add reduce_scatter api (#1413) * add reduce_scatter api * add reduce_scatter api * fix error in fmoe_tuner (#1405) * fix error in fmoe_tuner * fix error when tuning QuantType.per_1x32 * rm redundant code * optimize thread divergence (#1421) * [TRITON] complex number multiplication that supports 3D ROPE triton kernel (#1061) * complex number multiplication that supports 3D ROPE triton kernel * Merge remote-tracking branch 'origin/feature/rope3d-fix' and resolve conflicts * confilcts resolve * conflicts resolve * moved code places * fix typo * fix typo --------- Co-authored-by: Zhu Jiale <root@hjbog-srdc-52.amd.com> Co-authored-by: Xin Huang 
<Xin.Huang@amd.com> * Feat: pa_mqa_logits performance optimization & support kv_preshuffle + blocksize16/64 (#1424) * Update gluon_pa_mqa_logits using preshuffle * Minor update * Add finer pipeline granularity * Add sched_group_barrier optimization * Minor update * Support blocksize64 for preshuffle pa_mqa_logits * Support logits JIT on triton 3.5 * Improve splitkv strategy * Eliminate redundant conditon check * Add missing oob check * Resolve reviews * [Config] add tuned moe and gemm config for qwen3 235b (#1378) * add moe tuned config * add gemm tuned config * move tuned moe config to model specific file --------- Co-authored-by: guanbao <gyu@amd.com> * fix repeated unnecessary device check (#1221) * remove device check * 8 devices * more * more --------- Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com> * remove lru func in fake (#1429) * Temporarily disable the test on mi355 (#1437) * Enable MI355 test on main branch * CI: Aiter tests bug fix * [M308] tune silu&act (#1404) * Vectorized loads and stores combined with the packed multiply path * format code II --------- Co-authored-by: zufayu <zufayu@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * add deepseek ep moe tune config (#1431) * add ptpc deepseek ep moe tuned config * add block deepseek ep moe tune config * using 1stage moe for ptpc deepseek * [TRITON] Moe a8w4 tuning (#1410) * fuse routing kernels for small batches * tune batch=1024 * [TRITON] Apply config-aware naming (kernel_repr) to attention kernels (#1295) * Apply kernel_repr to attention kernels Applied make_kernel_repr helper to 4 attention kernel files: - pa_decode.py (6 kernels) - pa_prefill.py (2 kernels) - chunked_pa_prefill.py (1 kernel) - mla_decode_rope.py (2 kernels) Each kernel now has config-aware naming with constexpr parameters included in the kernel metadata name. Base: amd/satya/kernel_config_to_name * Apply kernel_repr to attention kernels * fix indentation error and add kernel_repr to a missed kernel * Add descriptions to missing API descriptions * remove unused imports * fix runtime error * revert lean atten to main * lean atten repr and API desc * formatting fix * Update aiter/ops/triton/pod_attention.py * [fix]: prebuild gen so (#1412) Signed-off-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> Co-authored-by: Xin Huang <Xin.Huang@amd.com> * [TRITON] FP8 MQA optimizations (#1422) FP8 MQA optimizations AND bench. 
script * redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning (#1366) * redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning * add required keys to fused_moe_1stage_dict * add kernel descriptors and code object files * add 32x128 file descriptors and code objects for tuning * move code objects and kernel descriptors to correct csv * remove unnecessary import, add quant type argument * move fused_moe_stage1_tkw1 into fused_moe.py * remove unnecessary kernel code object files * Add missing comma * saved modified tuned fmoe config for testing purposes * apply black required formatting * remove fused_moe_stage1_tkw1 and place aiter.fmoe_g1u1_tkw1 under fused_moe_1stage * remove unnecesary arguments * apply black formatting * simplify aiter.fmoe_g1u1_tkw1 call * add doweight_stage1 column to fused_moe_1stage_dict map and remove elif condition to select run_1stage=True * add doweight_stage1 to query key * modidy elif to select run_stage=True for tokens > 16 * apply black formatting * removing csv and .co files as they will come in separate commit * removing log logger.info(f[get_2stage_cfgs] run_1stage) --------- Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com> * CI: Move some tests to MI355 due to the network issue of TW cluster (#1446) * CI: Move some tests to MI355 due to the network issue of TW cluster * Modify the GPU_ARCH of sglang tests * CI: Move Triton tests from TW cluster to internal cluster (#1451) * tune a8w8_blockscale&bpreshuffle for tencent (#1444) * tune a8w8_blockscale&bpreshuffle for tencent Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update a8w8_bpreshuffle_tuned_gemm.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update a8w8_blockscale_bpreshuffle_tuned_gemm.csv * update aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update a8w8_blockscale_tuned_gemm.csv * updated a8w8_blockscale_tuned_gemm_ds_v3.csv&a8w8_bpreshuffle_tuned_gemm_dsv3.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update aiter/configs/model_configs/a8w8_bpreshuffle_tuned_gemm_dsv3.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> --------- Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * [fix]: add ar switch (#1376) * [fix]: add ar switch Signed-off-by: root <root@hjbog-srdc-24.amd.com> * [fix]: call ar naive Signed-off-by: root <root@hjbog-srdc-24.amd.com> * [fix]: add ar switch Signed-off-by: root <root@hjbog-srdc-24.amd.com> --------- Signed-off-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * CI: Test pypi connection (#1458) * cktile weight preshuffle test and auto tuning for a8w8 (#1400) * cktile bpreshuffle && tuning code * refine code * refine code * refine code * refine * refine * fix merge conflict * fix conflict * fix CI build fail * refine code * align aiter interface * refine * add get_padded_M --------- Co-authored-by: solin <bingzhou@amd.com> Co-authored-by: Ying Zhou <Ying.Zhou2@amd.com> * fix_prebuiild_asm (#1439) * mdf_bf16gemm * update * update f4gemm * update * update * f4gemm bugs fix * f4gemm fix2 * update * update moe 2 stages * update codegen * update gemm_a8w8_asm * update * update * update * update * update * update * update --------- Co-authored-by: zufayu <zufayu@amd.com> Co-authored-by: chenjun <junchen2@amd.com> Co-authored-by: Ying Zhou <Ying.Zhou2@amd.com> Co-authored-by: amd-ruitang3 <Rui.Tang2@.com> * fix merged tuned config error (#1460) * fix merged 
tuned config error * update --------- Co-authored-by: valarLip <340077269@qq.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * update triton version check for pa_mqa_logits (#1440) * update triton version check for pa_mqa_logits * Add some explanation for aot-check * Support pa_mqa_logits aot load on triton>=3.5 * Support pa_mqa_logits aot load on triton>=3.5 --------- Co-authored-by: Feng Shijie <Shijie.Feng@amd.com> * fix all_reduce_fake (#1465) * CI: Use ausartifactory.amd.com in pip installation (#1469) * update codegen (#1471) * update codegen * update * update * update * fix --------- Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> Co-authored-by: valarLip <340077269@qq.com> * CI: Fix sglang CI test (#1473) * CI: Add checkout retry in vLLM benchmark tests (#1476) * CI: Move SGlang and Triton tests to MI300 runners (#1485) * fix merging aiter config (#1443) * change to merge config when used * fix lint * fix error in GemmTuner * fix lint * fix error when runing deepseek * fix lint error * revert other format change * fix gemm_op_a8w8.py --------- Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * fix fmoe tune preshuffle error (#1430) * set preshuffle=False default * fix lint --------- Co-authored-by: Xin Huang <Xin.Huang@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * Fix issue: Add nhead=128 support to bf16 and align restrictions (#1450) * fix fwd_v3 output/lse is nan when kseq=0 and fix qseq >> kseq error (#1442) * fix output/lse is nan when kseq=0 * fix gfx950 128_128 fwd_v3 * update the k_seq=0 error in MI300 and MI308 * tune a8w8_blockscale&bpreshuffle for tencent (#1444) * tune a8w8_blockscale&bpreshuffle for tencent Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update a8w8_bpreshuffle_tuned_gemm.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update a8w8_blockscale_bpreshuffle_tuned_gemm.csv * update aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update a8w8_blockscale_tuned_gemm.csv * updated a8w8_blockscale_tuned_gemm_ds_v3.csv&a8w8_bpreshuffle_tuned_gemm_dsv3.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update aiter/configs/model_configs/a8w8_bpreshuffle_tuned_gemm_dsv3.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> --------- Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update the smoke test * update the smoke test * fix MI300 and MI308 err * fix qseq >> kseq error MI300 and MI308 * fix qseq >> kseq error in MI355 * fix the MI300 error --------- Signed-off-by: LJ-underdog <Jun.Lin@amd.com> Co-authored-by: Zhu Yuhua <yuhzhu@amd.com> Co-authored-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * fix the build error of rtp (#1438) Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * Add 32x64 to tuned fmoe config (#1386) * redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning * add required keys to fused_moe_1stage_dict * add kernel descriptors and code object files * add 32x128 file descriptors and code objects for tuning * move code objects and kernel descriptors to correct csv * remove unnecessary import, add quant type argument * move fused_moe_stage1_tkw1 into fused_moe.py * remove unnecessary kernel code object files * add kernel descriptor to tuned fmoe config, add kernel descriptors to related csvs and add related code objects * create kernel descriptors and kernel co files with correct 
tags * some fix for support gpt_oss (#1488) * CI: Revert vllm_benchmark to use the latest nightly image (#1402) * add sink_size parameter in mha_fwd and varlen_mha_fwd Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update ck api Signed-off-by: JL-underdog <Jun.Lin@amd.com> * Remove redundant assignment to sink_size * Update mha_fwd_kernels.cu * Update mha.py * Add false argument to fmha_batch_prefill call * update ck commit Signed-off-by: JL-underdog <Jun.Lin@amd.com> * Handle sink_size with conditional window_size length * update fmha_api Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update ck commit Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update ck_commit Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update ck commmit Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * fix op test error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update mha.py * Update csrc/include/torch/mha_fwd.h Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update csrc/py_itfs_ck/mha_fwd_kernels.cu Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update csrc/py_itfs_ck/mha_varlen_fwd_kernels.cu Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * add some comments Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update mha_fwd_generate.py * Clarify sink_size parameter in asm_mha_varlen_fwd.cu Updated the comment for the sink_size parameter. * update ck commit Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> --------- Signed-off-by: LJ-underdog <Jun.Lin@amd.com> Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> Signed-off-by: root <root@hjbog-srdc-24.amd.com> Signed-off-by: Hollow Man <hollowman@opensuse.org> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com> Signed-off-by: JL-underdog <Jun.Lin@amd.com> Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: la <46212055+junhaha666@users.noreply.github.com> Co-authored-by: kliuae-amd <KuanFu.Liu@amd.com> Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com> Co-authored-by: Xin Huang <Xin.Huang@amd.com> Co-authored-by: azaidy <aliasger.zaidy@amd.com> Co-authored-by: Shao-Chun Lee <Shao-Chun.Lee@amd.com> Co-authored-by: Mehmet Cagri <mehmet.kaymak@amd.com> Co-authored-by: Bruno Mazzotti <bruno.mazzotti@amd.com> Co-authored-by: amd-ruitang3 <145657428+amd-ruitang3@users.noreply.github.com> Co-authored-by: yzhou103 <Ying.Zhou2@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: minmengdie <memin@amd.com> Co-authored-by: ZhangLirong <lirzhang@amd.com> Co-authored-by: root <root@hjbog-srdc-39.amd.com> Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com> Co-authored-by: valechen <115046356+valechen@users.noreply.github.com> Co-authored-by: ukannika <uma.kannikanti@amd.com> Co-authored-by: root <root@smci355-ccs-aus-m06-29.cs-aus.dcgpu> Co-authored-by: valarLip <340077269@qq.com> Co-authored-by: who who who <fsx950223@outlook.com> Co-authored-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: baowendin <46412693+baowendin@users.noreply.github.com> Co-authored-by: Lixun Zhang <Lixun.Zhang@amd.com> Co-authored-by: Sami Remes <samremes@amd.com> Co-authored-by: TennyWang1223 <Tenny.Wang@amd.com> Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <hollowman@opensuse.org> Co-authored-by: slippedJim <jim.guo@amd.com> Co-authored-by: Jeff Huang <jiaji.huang73@gmail.com> Co-authored-by: Jiming Ruan 
<Jiming.Ruan@amd.com> Co-authored-by: zanzhang <zanzhang@amd.com> Co-authored-by: Fang.Che <Fang.Che@amd.com> Co-authored-by: Hui Zhou <zhou_hui@outlook.com> Co-authored-by: Hui Zhou <huizhou@meta.com> Co-authored-by: TJian <tunjian1996@gmail.com> Co-authored-by: Satya Nikhil Kodukula <skodukul@amd.com> Co-authored-by: huaiguxu <145733371+huaiguxu@users.noreply.github.com> Co-authored-by: Oscar Xu <huuaiguxu@amd.com> Co-authored-by: xudoyuan <xudoyuan@amd.com> Co-authored-by: Farel Lukas <farlukas@amd.com> Co-authored-by: Satya Nikhil Kodukula <nikhil.kodukula@gmail.com> Co-authored-by: BingYuan.Zhou <BingYuan.Zhou@amd.com> Co-authored-by: solin <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu> Co-authored-by: solin <bingzhou@amd.com> Co-authored-by: zhimding <zhimding@amd.com> Co-authored-by: felix <felix.li@amd.com> Co-authored-by: eky-amd <ethan.ky@amd.com> Co-authored-by: carlushuang <carlus.huang@amd.com> Co-authored-by: gbyu-amd <Guanbao.Yu@amd.com> Co-authored-by: guanbao <gyu@amd.com> Co-authored-by: Ilya Panfilov <Ilya.Panfilov@amd.com> Co-authored-by: Lukasz Burzawa <lukasz.burzawa@amd.com> Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com> Co-authored-by: Michael Melesse <micmelesse@gmail.com> Co-authored-by: amd-ruitang3 <Rui.Tang2@amd.com> Co-authored-by: luocheng25 <cheng.luo@amd.com> Co-authored-by: Xiake Sun <xiake.sun@amd.com> Co-authored-by: Lyu, Xudong <xudonlyu@amd.com> Co-authored-by: Cu Cui <cu.cui@alumni.uni-heidelberg.de> Co-authored-by: Drew Wadsworth <drew.wadsworth@gmail.com> Co-authored-by: yinfengLiu <yinfeliu@amd.com> Co-authored-by: Zhu Jiale <root@hjbog-srdc-52.amd.com> Co-authored-by: Feng Shijie <Shijie.Feng@amd.com> Co-authored-by: b8zhong <b8zhong@uwaterloo.ca> Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com> Co-authored-by: zufayu <zufa.yu@amd.com> Co-authored-by: zufayu <zufayu@amd.com> Co-authored-by: Anton Saukkonen <63663359+antsaukk@users.noreply.github.com> Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com> Co-authored-by: chenjun <junchen2@amd.com> Co-authored-by: amd-ruitang3 <Rui.Tang2@.com> Co-authored-by: Zhu Yuhua <yuhzhu@amd.com>
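Several commits folded into the merge above (notably the CK_TILE fmha backward change, #1212) pass both unpadded and padded cumulative-sequence-length offsets to the varlen kernels. The snippet below is only an illustrative sketch, using plain PyTorch and made-up lengths, of how `cu_seqlens_q` and a hypothetical fixed-padding `cu_seqlens_q_padded` could relate; it is not the aiter API.

```python
import torch

# Illustrative only: three variable-length sequences packed into one varlen batch.
seqlens_q = torch.tensor([5, 3, 7])   # actual token counts per sequence (made-up)
pad_to = 8                            # assumed fixed per-sequence padded length

# Unpadded offsets: exclusive prefix sum of the real lengths -> [0, 5, 8, 15]
cu_seqlens_q = torch.cat([torch.zeros(1, dtype=seqlens_q.dtype), seqlens_q.cumsum(0)])

# Padded offsets: every sequence occupies a fixed padded slot -> [0, 8, 16, 24]
cu_seqlens_q_padded = torch.arange(len(seqlens_q) + 1) * pad_to

print(cu_seqlens_q.tolist(), cu_seqlens_q_padded.tolist())
```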
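The torch_fp8 initialization fix (#1322) in the same merge describes its failure mode in enough detail to sketch a repro: a forked child inherits HIP runtime state that was set up during the parent's static initialization, and its first allocation fails. A rough, hedged reconstruction of those repro steps (requires a ROCm PyTorch build with a visible GPU; in the real bug the runtime was initialized implicitly by the torch_fp8 global, not by an explicit call) might look like:

```python
import multiprocessing as mp
import torch

def child():
    # Before the lazy-initialization fix, the child reused the parent's
    # statically-initialized HIP runtime and this first allocation failed
    # with hipErrorInvalidValue, per the commit description.
    x = torch.empty(16, device="cuda")
    print("child allocation ok:", x.shape)

if __name__ == "__main__":
    # Stand-in for the static initialization: touch the GPU in the parent
    # so the runtime is already initialized before forking.
    torch.empty(1, device="cuda")
    p = mp.get_context("fork").Process(target=child)
    p.start()
    p.join()
```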
ZhangLirong-amd added a commit that referenced this pull request Dec 29, 2025
no transpose clean up bwd save bring back split min diff pass descale to bwd lint fix bwd nans FP8_AUTO_DESCALE use hk for other backwards fp8 wrapper lint rm matrix_instr_nonkdim split v2 and v3 cleanly lint back to og minimal change test_mha passes test green * is_shuffled (#1377) * is_shuffled * shuffle_weight bugfix * rm AITER_MXFP4_MOE_SF * preshuffle bugfix * refactor * refactor bugfix * add bns/preshuffle moe mxfp4 UT tests * add L2 verification * black op_tests/test_moe_2stage.py --------- Co-authored-by: zhimding <zhimding@amd.com> Co-authored-by: Xin Huang <Xin.Huang@amd.com> * Ar rms new interface (#1401) * [fix]: fused_ar_rms interface Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com> * delete comment Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com> * change ut case Signed-off-by: root <root@hjbog-srdc-24.amd.com> * fix ut format err Signed-off-by: root <root@hjbog-srdc-24.amd.com> * [fix]: ar acc err Signed-off-by: root <root@hjbog-srdc-24.amd.com> --------- Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com> Signed-off-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: amd-ruitang3 <Rui.Tang2@amd.com> Co-authored-by: root <root@hjbog-srdc-24.amd.com> * minor fix for mi355 (#1408) * [Fix] Add sliding window feature for paged_attention_v1 (#1362) * add sliding window to paged_attention_v1 * Avoid use fmha_v3_varlen_fwd on unsupported architecture gfx90a * make `sliding_window` default to 0 for better compatibility * fix possible used compilation problem * fix ci failure * fix ci failure * add a single test to avoid increasing test time a lot --------- Co-authored-by: Xiake Sun <xiake.sun@amd.com> * fused_qk_rope_cat_and_cache_mla: Fix Triton compilation error and batch size constraint and output tensor sizing (#1407) * Fix Triton compilation error by nesting OUTPUT_Q_NOPE_ZEROS condition * Correct batch size constraint and output tensor sizing * max mla splits perbatch (#1390) * fix issues * add limit for split num per batch * fix non-ps num kv split * fix issue for big batch size * fix logits alloc * fix black code stype * fix ut * update git ignore& remove aiter/install_mode * update qh16 qseqlen4 kernel * update --------- Co-authored-by: valarLip <340077269@qq.com> Co-authored-by: Fang.Che <Fang.Che@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * topk_per_row_opt (#1394) * topk_per_row_opt * Optimize topk by integrating air-topk kernel (TODO: proper workspace size computation and init) (#1409) * update * slightly optmize * add back --------- Co-authored-by: Cu Cui <cu.cui@alumni.uni-heidelberg.de> Co-authored-by: carlushuang <carlus.huang@amd.com> * Fix fused_rms_mxfp4_quant comment (#1369) * leanAttn softmax fix for spurious data mismatch test failures (#1396) * leanAttn softmax fix for spurious data mismatch test failures * black fix * Remove unused parameters per PR review request * Black fix * Add reduce_scatter api (#1413) * add reduce_scatter api * add reduce_scatter api * fix error in fmoe_tuner (#1405) * fix error in fmoe_tuner * fix error when tuning QuantType.per_1x32 * rm redundant code * optimize thread divergence (#1421) * [TRITON] complex number multiplication that supports 3D ROPE triton kernel (#1061) * complex number multiplication that supports 3D ROPE triton kernel * Merge remote-tracking branch 'origin/feature/rope3d-fix' and resolve conflicts * confilcts resolve * conflicts resolve * moved code places * fix typo * fix typo --------- Co-authored-by: Zhu Jiale <root@hjbog-srdc-52.amd.com> Co-authored-by: Xin Huang 
<Xin.Huang@amd.com> * Feat: pa_mqa_logits performance optimization & support kv_preshuffle + blocksize16/64 (#1424) * Update gluon_pa_mqa_logits using preshuffle * Minor update * Add finer pipeline granularity * Add sched_group_barrier optimization * Minor update * Support blocksize64 for preshuffle pa_mqa_logits * Support logits JIT on triton 3.5 * Improve splitkv strategy * Eliminate redundant conditon check * Add missing oob check * Resolve reviews * [Config] add tuned moe and gemm config for qwen3 235b (#1378) * add moe tuned config * add gemm tuned config * move tuned moe config to model specific file --------- Co-authored-by: guanbao <gyu@amd.com> * fix repeated unnecessary device check (#1221) * remove device check * 8 devices * more * more --------- Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com> * remove lru func in fake (#1429) * Temporarily disable the test on mi355 (#1437) * Enable MI355 test on main branch * CI: Aiter tests bug fix * [M308] tune silu&act (#1404) * Vectorized loads and stores combined with the packed multiply path * format code II --------- Co-authored-by: zufayu <zufayu@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * add deepseek ep moe tune config (#1431) * add ptpc deepseek ep moe tuned config * add block deepseek ep moe tune config * using 1stage moe for ptpc deepseek * [TRITON] Moe a8w4 tuning (#1410) * fuse routing kernels for small batches * tune batch=1024 * [TRITON] Apply config-aware naming (kernel_repr) to attention kernels (#1295) * Apply kernel_repr to attention kernels Applied make_kernel_repr helper to 4 attention kernel files: - pa_decode.py (6 kernels) - pa_prefill.py (2 kernels) - chunked_pa_prefill.py (1 kernel) - mla_decode_rope.py (2 kernels) Each kernel now has config-aware naming with constexpr parameters included in the kernel metadata name. Base: amd/satya/kernel_config_to_name * Apply kernel_repr to attention kernels * fix indentation error and add kernel_repr to a missed kernel * Add descriptions to missing API descriptions * remove unused imports * fix runtime error * revert lean atten to main * lean atten repr and API desc * formatting fix * Update aiter/ops/triton/pod_attention.py * [fix]: prebuild gen so (#1412) Signed-off-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> Co-authored-by: Xin Huang <Xin.Huang@amd.com> * [TRITON] FP8 MQA optimizations (#1422) FP8 MQA optimizations AND bench. 
script * redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning (#1366) * redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning * add required keys to fused_moe_1stage_dict * add kernel descriptors and code object files * add 32x128 file descriptors and code objects for tuning * move code objects and kernel descriptors to correct csv * remove unnecessary import, add quant type argument * move fused_moe_stage1_tkw1 into fused_moe.py * remove unnecessary kernel code object files * Add missing comma * saved modified tuned fmoe config for testing purposes * apply black required formatting * remove fused_moe_stage1_tkw1 and place aiter.fmoe_g1u1_tkw1 under fused_moe_1stage * remove unnecesary arguments * apply black formatting * simplify aiter.fmoe_g1u1_tkw1 call * add doweight_stage1 column to fused_moe_1stage_dict map and remove elif condition to select run_1stage=True * add doweight_stage1 to query key * modidy elif to select run_stage=True for tokens > 16 * apply black formatting * removing csv and .co files as they will come in separate commit * removing log logger.info(f[get_2stage_cfgs] run_1stage) --------- Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com> * CI: Move some tests to MI355 due to the network issue of TW cluster (#1446) * CI: Move some tests to MI355 due to the network issue of TW cluster * Modify the GPU_ARCH of sglang tests * CI: Move Triton tests from TW cluster to internal cluster (#1451) * tune a8w8_blockscale&bpreshuffle for tencent (#1444) * tune a8w8_blockscale&bpreshuffle for tencent Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update a8w8_bpreshuffle_tuned_gemm.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update a8w8_blockscale_bpreshuffle_tuned_gemm.csv * update aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update a8w8_blockscale_tuned_gemm.csv * updated a8w8_blockscale_tuned_gemm_ds_v3.csv&a8w8_bpreshuffle_tuned_gemm_dsv3.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update aiter/configs/model_configs/a8w8_bpreshuffle_tuned_gemm_dsv3.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> --------- Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * [fix]: add ar switch (#1376) * [fix]: add ar switch Signed-off-by: root <root@hjbog-srdc-24.amd.com> * [fix]: call ar naive Signed-off-by: root <root@hjbog-srdc-24.amd.com> * [fix]: add ar switch Signed-off-by: root <root@hjbog-srdc-24.amd.com> --------- Signed-off-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * CI: Test pypi connection (#1458) * cktile weight preshuffle test and auto tuning for a8w8 (#1400) * cktile bpreshuffle && tuning code * refine code * refine code * refine code * refine * refine * fix merge conflict * fix conflict * fix CI build fail * refine code * align aiter interface * refine * add get_padded_M --------- Co-authored-by: solin <bingzhou@amd.com> Co-authored-by: Ying Zhou <Ying.Zhou2@amd.com> * fix_prebuiild_asm (#1439) * mdf_bf16gemm * update * update f4gemm * update * update * f4gemm bugs fix * f4gemm fix2 * update * update moe 2 stages * update codegen * update gemm_a8w8_asm * update * update * update * update * update * update * update --------- Co-authored-by: zufayu <zufayu@amd.com> Co-authored-by: chenjun <junchen2@amd.com> Co-authored-by: Ying Zhou <Ying.Zhou2@amd.com> Co-authored-by: amd-ruitang3 <Rui.Tang2@.com> * fix merged tuned config error (#1460) * fix merged 
tuned config error * update --------- Co-authored-by: valarLip <340077269@qq.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * update triton version check for pa_mqa_logits (#1440) * update triton version check for pa_mqa_logits * Add some explanation for aot-check * Support pa_mqa_logits aot load on triton>=3.5 * Support pa_mqa_logits aot load on triton>=3.5 --------- Co-authored-by: Feng Shijie <Shijie.Feng@amd.com> * fix all_reduce_fake (#1465) * CI: Use ausartifactory.amd.com in pip installation (#1469) * update codegen (#1471) * update codegen * update * update * update * fix --------- Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> Co-authored-by: valarLip <340077269@qq.com> * CI: Fix sglang CI test (#1473) * CI: Add checkout retry in vLLM benchmark tests (#1476) * CI: Move SGlang and Triton tests to MI300 runners (#1485) * fix merging aiter config (#1443) * change to merge config when used * fix lint * fix error in GemmTuner * fix lint * fix error when runing deepseek * fix lint error * revert other format change * fix gemm_op_a8w8.py --------- Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * fix fmoe tune preshuffle error (#1430) * set preshuffle=False default * fix lint --------- Co-authored-by: Xin Huang <Xin.Huang@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * Fix issue: Add nhead=128 support to bf16 and align restrictions (#1450) * fix fwd_v3 output/lse is nan when kseq=0 and fix qseq >> kseq error (#1442) * fix output/lse is nan when kseq=0 * fix gfx950 128_128 fwd_v3 * update the k_seq=0 error in MI300 and MI308 * tune a8w8_blockscale&bpreshuffle for tencent (#1444) * tune a8w8_blockscale&bpreshuffle for tencent Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update a8w8_bpreshuffle_tuned_gemm.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update a8w8_blockscale_bpreshuffle_tuned_gemm.csv * update aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update a8w8_blockscale_tuned_gemm.csv * updated a8w8_blockscale_tuned_gemm_ds_v3.csv&a8w8_bpreshuffle_tuned_gemm_dsv3.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update aiter/configs/model_configs/a8w8_bpreshuffle_tuned_gemm_dsv3.csv Signed-off-by: LJ-underdog <Jun.Lin@amd.com> --------- Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update the smoke test * update the smoke test * fix MI300 and MI308 err * fix qseq >> kseq error MI300 and MI308 * fix qseq >> kseq error in MI355 * fix the MI300 error --------- Signed-off-by: LJ-underdog <Jun.Lin@amd.com> Co-authored-by: Zhu Yuhua <yuhzhu@amd.com> Co-authored-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * fix the build error of rtp (#1438) Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * Add 32x64 to tuned fmoe config (#1386) * redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning * add required keys to fused_moe_1stage_dict * add kernel descriptors and code object files * add 32x128 file descriptors and code objects for tuning * move code objects and kernel descriptors to correct csv * remove unnecessary import, add quant type argument * move fused_moe_stage1_tkw1 into fused_moe.py * remove unnecessary kernel code object files * add kernel descriptor to tuned fmoe config, add kernel descriptors to related csvs and add related code objects * create kernel descriptors and kernel co files with correct 
tags * some fix for support gpt_oss (#1488) * CI: Revert vllm_benchmark to use the latest nightly image (#1402) * add sink_size parameter in mha_fwd and varlen_mha_fwd Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update ck api Signed-off-by: JL-underdog <Jun.Lin@amd.com> * Remove redundant assignment to sink_size * Update mha_fwd_kernels.cu * Update mha.py * Add false argument to fmha_batch_prefill call * update ck commit Signed-off-by: JL-underdog <Jun.Lin@amd.com> * Handle sink_size with conditional window_size length * update fmha_api Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update ck commit Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update ck_commit Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update ck commmit Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * fix op test error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update mha.py * Update csrc/include/torch/mha_fwd.h Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update csrc/py_itfs_ck/mha_fwd_kernels.cu Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update csrc/py_itfs_ck/mha_varlen_fwd_kernels.cu Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * add some comments Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update mha_fwd_generate.py * Clarify sink_size parameter in asm_mha_varlen_fwd.cu Updated the comment for the sink_size parameter. * update ck commit Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> --------- Signed-off-by: LJ-underdog <Jun.Lin@amd.com> Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> Signed-off-by: root <root@hjbog-srdc-24.amd.com> Signed-off-by: Hollow Man <hollowman@opensuse.org> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com> Signed-off-by: JL-underdog <Jun.Lin@amd.com> Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: la <46212055+junhaha666@users.noreply.github.com> Co-authored-by: kliuae-amd <KuanFu.Liu@amd.com> Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com> Co-authored-by: Xin Huang <Xin.Huang@amd.com> Co-authored-by: azaidy <aliasger.zaidy@amd.com> Co-authored-by: Shao-Chun Lee <Shao-Chun.Lee@amd.com> Co-authored-by: Mehmet Cagri <mehmet.kaymak@amd.com> Co-authored-by: Bruno Mazzotti <bruno.mazzotti@amd.com> Co-authored-by: amd-ruitang3 <145657428+amd-ruitang3@users.noreply.github.com> Co-authored-by: yzhou103 <Ying.Zhou2@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: minmengdie <memin@amd.com> Co-authored-by: ZhangLirong <lirzhang@amd.com> Co-authored-by: root <root@hjbog-srdc-39.amd.com> Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com> Co-authored-by: valechen <115046356+valechen@users.noreply.github.com> Co-authored-by: ukannika <uma.kannikanti@amd.com> Co-authored-by: root <root@smci355-ccs-aus-m06-29.cs-aus.dcgpu> Co-authored-by: valarLip <340077269@qq.com> Co-authored-by: who who who <fsx950223@outlook.com> Co-authored-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: baowendin <46412693+baowendin@users.noreply.github.com> Co-authored-by: Lixun Zhang <Lixun.Zhang@amd.com> Co-authored-by: Sami Remes <samremes@amd.com> Co-authored-by: TennyWang1223 <Tenny.Wang@amd.com> Co-authored-by: ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 <hollowman@opensuse.org> Co-authored-by: slippedJim <jim.guo@amd.com> Co-authored-by: Jeff Huang <jiaji.huang73@gmail.com> Co-authored-by: Jiming Ruan 
<Jiming.Ruan@amd.com> Co-authored-by: zanzhang <zanzhang@amd.com> Co-authored-by: Fang.Che <Fang.Che@amd.com> Co-authored-by: Hui Zhou <zhou_hui@outlook.com> Co-authored-by: Hui Zhou <huizhou@meta.com> Co-authored-by: TJian <tunjian1996@gmail.com> Co-authored-by: Satya Nikhil Kodukula <skodukul@amd.com> Co-authored-by: huaiguxu <145733371+huaiguxu@users.noreply.github.com> Co-authored-by: Oscar Xu <huuaiguxu@amd.com> Co-authored-by: xudoyuan <xudoyuan@amd.com> Co-authored-by: Farel Lukas <farlukas@amd.com> Co-authored-by: Satya Nikhil Kodukula <nikhil.kodukula@gmail.com> Co-authored-by: BingYuan.Zhou <BingYuan.Zhou@amd.com> Co-authored-by: solin <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu> Co-authored-by: solin <bingzhou@amd.com> Co-authored-by: zhimding <zhimding@amd.com> Co-authored-by: felix <felix.li@amd.com> Co-authored-by: eky-amd <ethan.ky@amd.com> Co-authored-by: carlushuang <carlus.huang@amd.com> Co-authored-by: gbyu-amd <Guanbao.Yu@amd.com> Co-authored-by: guanbao <gyu@amd.com> Co-authored-by: Ilya Panfilov <Ilya.Panfilov@amd.com> Co-authored-by: Lukasz Burzawa <lukasz.burzawa@amd.com> Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com> Co-authored-by: Michael Melesse <micmelesse@gmail.com> Co-authored-by: amd-ruitang3 <Rui.Tang2@amd.com> Co-authored-by: luocheng25 <cheng.luo@amd.com> Co-authored-by: Xiake Sun <xiake.sun@amd.com> Co-authored-by: Lyu, Xudong <xudonlyu@amd.com> Co-authored-by: Cu Cui <cu.cui@alumni.uni-heidelberg.de> Co-authored-by: Drew Wadsworth <drew.wadsworth@gmail.com> Co-authored-by: yinfengLiu <yinfeliu@amd.com> Co-authored-by: Zhu Jiale <root@hjbog-srdc-52.amd.com> Co-authored-by: Feng Shijie <Shijie.Feng@amd.com> Co-authored-by: b8zhong <b8zhong@uwaterloo.ca> Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com> Co-authored-by: zufayu <zufa.yu@amd.com> Co-authored-by: zufayu <zufayu@amd.com> Co-authored-by: Anton Saukkonen <63663359+antsaukk@users.noreply.github.com> Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com> Co-authored-by: chenjun <junchen2@amd.com> Co-authored-by: amd-ruitang3 <Rui.Tang2@.com> Co-authored-by: Zhu Yuhua <yuhzhu@amd.com>
Motivation
Fix a code bug that caused data mismatches at large batch sizes, and add an optimization for non-ragged batching.
Technical Details
Cast address pointers and tensor strides to 64-bit for large batch/head sizes, i.e. a large number of output tiles, so the computed offsets do not overflow 32-bit arithmetic.
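For illustration only, here is a minimal Triton sketch of the kind of fix described above (this is not the lean-attention kernel from this PR; the kernel and helper names are made up). Program IDs and tile offsets are promoted to int64 before being multiplied by the strides, so the flat element offset cannot wrap around a 32-bit integer when batch x heads x tiles gets large:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def copy_tile_kernel(src_ptr, dst_ptr, stride_batch, stride_row, BLOCK: tl.constexpr):
    # Illustrative kernel, not the actual lean-attention kernel.
    batch_id = tl.program_id(0)
    row_id = tl.program_id(1)

    # Promote indices to int64 *before* multiplying by the strides. With a large
    # batch/head count the 32-bit product can exceed 2**31 elements and silently
    # wrap, corrupting the load/store address.
    base = batch_id.to(tl.int64) * stride_batch + row_id.to(tl.int64) * stride_row

    offs = base + tl.arange(0, BLOCK).to(tl.int64)
    tl.store(dst_ptr + offs, tl.load(src_ptr + offs))


def copy_batched(src: torch.Tensor) -> torch.Tensor:
    """Copy a (batch, rows, BLOCK) tensor tile by tile; BLOCK must be a power of two."""
    dst = torch.empty_like(src)
    batch, rows, block = src.shape
    copy_tile_kernel[(batch, rows)](src, dst, src.stride(0), src.stride(1), BLOCK=block)
    return dst
```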
Test Plan
Added test cases in test_la.py
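A minimal sketch of the kind of parametrized check such a test plan relies on (this is not the actual contents of test_la.py; torch's built-in scaled-dot-product attention stands in for the kernel under test, and the shapes are illustrative):

```python
import pytest
import torch
import torch.nn.functional as F


def ref_attention(q, k, v):
    # Naive reference: softmax(Q K^T / sqrt(d)) V per batch/head.
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
    return attn @ v


@pytest.mark.parametrize("batch", [1, 64, 256, 512])  # large batches stress the tile-offset math
def test_attention_matches_reference(batch):
    torch.manual_seed(0)
    # (batch, heads, seq, head_dim); shapes chosen only for illustration.
    q = torch.randn(batch, 16, 128, 64, dtype=torch.float16, device="cuda")
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    out = F.scaled_dot_product_attention(q, k, v)  # stand-in for the kernel under test
    torch.testing.assert_close(out, ref_attention(q, k, v), rtol=2e-2, atol=2e-2)
```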
Test Result
Tests pass with batch sizes up to 512.
Submission Checklist