dsa: remove block_table_convert_triton in dsa #658

Draft

junhaha666 wants to merge 1 commit into main from jun/dsa_fix

Conversation

@junhaha666 (Contributor)

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Copilot AI review requested due to automatic review settings April 28, 2026 10:23
junhaha666 marked this pull request as draft April 28, 2026 10:23

Copilot AI left a comment


Pull request overview

This PR updates DeepSeek V3.2 sparse-MLA (“DSA”) indexing to stop relying on block_table_convert_triton and instead compute global KV indices from raw block tables using an explicit page/block size.

Changes:

  • Update the DSA prefill Triton index-conversion kernel to use PAGE_SIZE (KV cache block size) when mapping token indices to physical KV slots.
  • Adjust DeepSeek V2 indexer KV-cache handling to reshape by kv_cache_block_size and enable preshuffle/Preshuffle options in relevant AITer ops.
  • Disable quantization for weights_proj in the indexer path.
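For orientation, here is a minimal sketch of the PAGE_SIZE-based mapping described above, assuming req-local token indices and a standard paged-KV layout; the function name and signature are illustrative, not the repo's actual kernel:

```python
import torch

def req_local_to_global_kv_index(
    block_table: torch.Tensor,  # [max_blocks] physical KV block ids for one request
    token_idx: torch.Tensor,    # [n] req-local token positions chosen by the indexer
    page_size: int,             # PAGE_SIZE: the KV cache block size
) -> torch.Tensor:
    # A req-local position decomposes into (logical block, offset within block);
    # the global KV slot is then physical_block_id * PAGE_SIZE + offset.
    offset = token_idx % page_size
    physical_block = block_table[token_idx // page_size]
    return physical_block * page_size + offset
```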

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

  • atom/models/deepseek_v2.py: Reshapes the indexer KV cache by the configured block size; enables preshuffle flags; disables quantization for weights_proj.
  • atom/model_ops/attentions/aiter_mla.py: Comments out block-table conversion usage and stops passing block_tables_converted in cudagraph capture metadata.
  • atom/model_ops/attention_mla.py: Updates the DSA prefill req-local → global KV index conversion to use PAGE_SIZE.


Comment on lines +384 to +391
# if self.block_ratio > 1:
# block_table_convert_triton(
# var["block_tables"].gpu[:bs],
# var["block_tables_converted"].gpu[:bs],
# var["context_lens"].gpu[:bs],
# self.block_ratio,
# )
# attn_metadata.block_tables = var["block_tables_converted"].gpu[:bs]
Copilot AI Apr 28, 2026

The block_table_convert_triton(...) logic is now commented out, but the module still imports block_table_convert_triton, and the commented block will drift out of sync over time. If the intent is to remove this conversion path, please delete the dead code and drop the unused import; also consider removing/guarding any remaining block_tables_converted plumbing so it can’t accidentally override block_tables in AttentionMetaData.__init__.
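A minimal sketch of the guard this comment asks for, assuming an AttentionMetaData constructor shaped like the one described in this thread (names illustrative):

```python
class AttentionMetaData:
    def __init__(self, block_tables, block_tables_converted=None):
        self.block_tables = block_tables
        if block_tables_converted is not None:
            # The conversion kernel is gone, so any non-None converted table
            # can only be stale; fail loudly instead of silently overriding
            # block_tables with it.
            raise ValueError(
                "block_tables_converted is no longer supported; "
                "pass the raw block_tables instead"
            )
```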

Comment on lines +790 to +794
# block_tables_converted=(
# var["block_tables_converted"].gpu[:bs]
# if "block_tables_converted" in var
# else None
# ),
Copilot AI Apr 28, 2026

build_for_cudagraph_capture() no longer passes block_tables_converted, but other paths in this file still pass it when present (e.g., build_ubatch_metadata()), and AttentionMetaData.__init__ will override block_tables when block_tables_converted is not None. Since the conversion kernels are now disabled/commented out, block_tables_converted can be stale/uninitialized and silently break sparse attention under TBO/CUDAGraph. Please make the handling consistent: either always populate block_tables_converted before passing it, or stop passing/allocating it everywhere for MLA.

Suggested change
-            # block_tables_converted=(
-            #     var["block_tables_converted"].gpu[:bs]
-            #     if "block_tables_converted" in var
-            #     else None
-            # ),
+            # MLA block-table conversion is currently disabled. Always pass
+            # None here so AttentionMetaData keeps using block_tables instead
+            # of accidentally overriding it with stale/uninitialized converted
+            # data.
+            block_tables_converted=None,
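The same hardening would apply at the other call sites the comment mentions; a hedged sketch for a build_ubatch_metadata-style site, reusing the field names visible in the snippets above:

```python
def build_ubatch_metadata(var, bs):
    # Sketch only: mirror the cudagraph-capture fix at every metadata
    # builder so no path can resurrect the stale converted table.
    return AttentionMetaData(
        block_tables=var["block_tables"].gpu[:bs],
        block_tables_converted=None,  # conversion kernel is disabled everywhere
    )
```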

    hidden_size,
    self.n_head,
-   quant_config=quant_config,
+   quant_config=None,
Copilot AI Apr 28, 2026

Changing weights_proj from quant_config=quant_config to quant_config=None changes quantization behavior for the indexer path while the surrounding projections (wq_b, wk) still honor quant_config. If this layer must remain unquantized for correctness/accuracy (similar to the MoE gate), add an explicit comment explaining why; otherwise this likely regresses performance or makes quantization behavior inconsistent for this module.

Suggested change
-   quant_config=None,
+   quant_config=quant_config,
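If keeping the layer unquantized is intentional, the alternative the comment describes would look roughly like this sketch; Linear stands in for whatever projection class the repo actually uses, and the surrounding names are assumed from the diff:

```python
weights_proj = Linear(
    hidden_size,
    n_head,
    # Intentionally unquantized: the indexer's score projection is
    # accuracy-sensitive, mirroring the convention used for the MoE gate.
    quant_config=None,
)
```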

Comment on lines 381 to +385
  if attn_metadata.block_tables is None:
      self.prepare_block_tables(batch)
      attn_metadata.block_tables = var["block_tables"].copy_to_gpu(bs)
-     if self.block_ratio > 1:
-         block_table_convert_triton(
-             var["block_tables"].gpu[:bs],
-             var["block_tables_converted"].gpu[:bs],
-             var["context_lens"].gpu[:bs],
-             self.block_ratio,
-         )
-         attn_metadata.block_tables = var["block_tables_converted"].gpu[:bs]
+     # if self.block_ratio > 1:
+     #     block_table_convert_triton(
Copilot AI Apr 28, 2026

In prepare_prefill, attn_metadata comes from CommonAttentionBuilder.prepare_prefill(), which will populate block_tables_converted when has_cached and block_ratio > 1, and AttentionMetaData.__init__ replaces self.block_tables with the converted table. With the updated DSA index conversion (now using PAGE_SIZE and raw physical block ids), this if attn_metadata.block_tables is None: gate means cached-prefill requests will continue using a converted per-token table, which can produce incorrect KV indices. Consider always resetting attn_metadata.block_tables to the raw var["block_tables"] for the sparse/DSA path (or disabling the common conversion for MLA builders) so the DSA kernel consistently sees physical block ids.
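A sketch of the reset this comment suggests, using the variables visible in the diff above:

```python
# Sketch only: in the sparse/DSA prefill path, unconditionally take the raw
# physical block table, regardless of what the common builder populated, so
# the PAGE_SIZE-based index conversion always sees physical block ids.
self.prepare_block_tables(batch)
attn_metadata.block_tables = var["block_tables"].copy_to_gpu(bs)
```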
