add concat_and_cache_mla kernel #1194
Conversation
Pull Request Overview
This PR adds a new kernel function, concat_and_cache_mla, that concatenates KV components and caches them for MLA (Multi-head Latent Attention). The implementation includes the CUDA kernel, Python bindings, and comprehensive test coverage.
Key changes:
- Implements a new CUDA kernel for concatenating KV components and caching them with optional FP8 quantization (see the reference sketch after this list)
- Adds Python API bindings and function declarations
- Provides comprehensive test suite with performance benchmarking
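For orientation, here is a minimal PyTorch sketch of the semantics the kernel implements on the non-quantized (kAuto) path. It is an illustration, not the PR's code: the names k_pe and slot_mapping and the [num_blocks, block_size, kv_lora_rank + pe_dim] cache layout are assumptions inferred from the shape checks visible in the diff.

```python
import torch

def concat_and_cache_mla_reference(
    kv_c: torch.Tensor,         # [num_tokens, kv_lora_rank]  latent / NoPE part
    k_pe: torch.Tensor,         # [num_tokens, pe_dim]        RoPE part -- name assumed
    kv_cache: torch.Tensor,     # [num_blocks, block_size, kv_lora_rank + pe_dim]
    slot_mapping: torch.Tensor, # [num_tokens] flat slot index per token -- name assumed
) -> None:
    """Pure-PyTorch sketch of the kAuto (no FP8 quantization) path."""
    block_size = kv_cache.size(1)
    for token_idx in range(kv_c.size(0)):
        slot = int(slot_mapping[token_idx])
        if slot < 0:  # padded tokens are conventionally skipped (assumption)
            continue
        block_idx = slot // block_size
        block_offset = slot % block_size
        # Concatenate the latent and RoPE parts and write them into the cache slot.
        kv_cache[block_idx, block_offset] = torch.cat(
            (kv_c[token_idx], k_pe[token_idx]), dim=-1
        ).to(kv_cache.dtype)
```

The FP8 path instead converts each element to the cache dtype via ck_tile::type_convert, as seen in the first review comment below.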
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| op_tests/test_concat_cache_mla.py | Comprehensive test suite with performance benchmarks for the new kernel |
| csrc/kernels/cache_kernels.cu | Core CUDA kernel implementation and host function for concat_and_cache_mla |
| csrc/include/rocm_ops.hpp | Python binding definitions for the new function |
| csrc/include/cache.h | Function declaration for concat_and_cache_mla |
| aiter/ops/cache.py | Python API wrapper for the new operation (usage sketched below) |
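A usage sketch of the wrapper follows. The import path, argument order, the "auto" dtype string, and the scale argument are assumptions based on the file list above and the host-function arguments visible in the diff, not the confirmed signature.

```python
import torch
# Import path assumed from the file layout (aiter/ops/cache.py).
from aiter.ops.cache import concat_and_cache_mla

num_tokens, kv_lora_rank, pe_dim = 32, 512, 64
num_blocks, block_size = 8, 16

kv_c = torch.randn(num_tokens, kv_lora_rank, dtype=torch.bfloat16, device="cuda")
k_pe = torch.randn(num_tokens, pe_dim, dtype=torch.bfloat16, device="cuda")
kv_cache = torch.zeros(num_blocks, block_size, kv_lora_rank + pe_dim,
                       dtype=torch.bfloat16, device="cuda")
slot_mapping = torch.arange(num_tokens, dtype=torch.int64, device="cuda")
scale = torch.ones(1, dtype=torch.float32, device="cuda")  # assumed; used by the FP8 path

# Argument order mirrors the CUDA host function seen in the diff (assumption);
# "auto" selects the pass-through (non-FP8) path.
concat_and_cache_mla(kv_c, k_pe, kv_cache, slot_mapping, "auto", scale)
```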
    if constexpr (kv_dt == vllm::Fp8KVCacheDataType::kAuto) {
        dst[dst_idx] = src[src_idx];
    } else {
        dst[dst_idx]= ck_tile::type_convert<cache_t>(
Copilot AI · Oct 14, 2025
Missing space before the assignment operator. Should be dst[dst_idx] = ck_tile::type_convert<cache_t>(.
    int block_size = kv_cache.size(1);

    TORCH_CHECK(kv_cache.size(2) == kv_lora_rank + pe_dim);
    //TORCH_CHECK(kv_cache_dtype != "fp8");
Copilot AI · Oct 14, 2025
Commented-out code should be removed rather than left in the codebase. If this check is needed for future implementation, consider adding a TODO comment explaining why it's disabled.
Suggested change (replace the commented-out check with):

    // TODO: Enable the following check if/when "fp8" support is implemented.
    // TORCH_CHECK(kv_cache_dtype != "fp8");
    //if (kv_cache_dtype == "fp8_ds_mla") {
    //  dim3 grid(num_tokens);
    //  // For the NoPE part, each tile of 128 elements is handled by half of one
    //  // warp (16 threads). There are 4 total tiles, so 2 warps (64 threads).
    //  // Lanes 0 and 16 of each warp write the scale values for that warp's tiles.
    //  // The RoPE part (last 64 elements) is handled by another 1 warp (32
    //  // threads). So in total, we use 3 warps (96 threads) per block.
    //  dim3 block(96);
    //  DISPATCH_BY_KV_CACHE_DTYPE(kv_c.dtype(), kv_cache_dtype,
    //                             CALL_CONCAT_AND_CACHE_DS_MLA);
    //} else {
    dim3 grid(num_tokens);
    dim3 block(std::min(kv_lora_rank, 512));
    DISPATCH_BY_KV_CACHE_DTYPE(kv_c.dtype(), kv_cache_dtype,
                               CALL_CONCAT_AND_CACHE_MLA);
    //}
Copilot AI · Oct 14, 2025
Large blocks of commented-out code should be removed. If this functionality is planned for future implementation, consider using feature flags or moving it to a separate branch.
Suggested change (drop the commented-out fp8_ds_mla block and keep only the active launch configuration):

    dim3 grid(num_tokens);
    dim3 block(std::min(kv_lora_rank, 512));
    DISPATCH_BY_KV_CACHE_DTYPE(kv_c.dtype(), kv_cache_dtype,
                               CALL_CONCAT_AND_CACHE_MLA);
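Putting numbers to the launch configuration above: the active path launches one block per token with up to 512 threads covering the latent dimension, while the disabled fp8_ds_mla path budgets 96 threads per block (4 NoPE tiles × 16 threads plus one 32-thread warp for the RoPE part). A small illustrative calculation, not part of the PR:

```python
def mla_launch_config(num_tokens: int, kv_lora_rank: int) -> tuple[int, int]:
    """Grid/block sizing used by the active path: one block per token,
    one thread per latent element, capped at 512 threads per block."""
    grid = num_tokens
    block = min(kv_lora_rank, 512)
    return grid, block

# Disabled fp8_ds_mla path (per the commented-out code above):
# 4 tiles of 128 NoPE elements, each handled by 16 threads (half a warp),
# plus one 32-thread warp for the 64-element RoPE part.
ds_mla_threads = 4 * 16 + 32
assert ds_mla_threads == 96

print(mla_launch_config(num_tokens=32, kv_lora_rank=512))  # (32, 512)
```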
* add concat_and_cache_mla kernel
* fix interface
Motivation
Technical Details
Test Plan
python op_tests/test_concat_cache_mla.py
Test Result
Submission Checklist