dsa: remove block_table_convert_triton in dsa #658
Conversation
Pull request overview
This PR updates DeepSeek V3.2 sparse-MLA (“DSA”) indexing to stop relying on block_table_convert_triton and instead compute global KV indices from raw block tables using an explicit page/block size.
Changes:
- Update the DSA prefill Triton index-conversion kernel to use `PAGE_SIZE` (the KV-cache block size) when mapping token indices to physical KV slots (a sketch of this mapping follows the list).
- Adjust DeepSeek V2 indexer KV-cache handling to reshape by `kv_cache_block_size` and enable `preshuffle`/`Preshuffle` options in the relevant AITer ops.
- Disable quantization for `weights_proj` in the indexer path.
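The `PAGE_SIZE`-based conversion is standard paged-KV arithmetic. A minimal sketch, assuming a per-request block table of physical block ids; the names `PAGE_SIZE` and `token_to_kv_slot` are illustrative only, not the actual Triton kernel interface:

```python
# Illustrative sketch only: maps a request-local token index to a global KV
# slot using the raw block table and an explicit page size. The real
# conversion is a Triton kernel in atom/model_ops/attention_mla.py.
PAGE_SIZE = 64  # assumed KV-cache block size


def token_to_kv_slot(block_table: list[int], token_idx: int) -> int:
    """Return the physical KV-cache slot for a request-local token index."""
    physical_block = block_table[token_idx // PAGE_SIZE]
    offset_in_block = token_idx % PAGE_SIZE
    return physical_block * PAGE_SIZE + offset_in_block


# Example: with block_table=[7, 3], token 70 falls in the second logical
# block (physical block 3) at offset 6, i.e. slot 3 * 64 + 6 = 198.
assert token_to_kv_slot([7, 3], 70) == 198
```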
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `atom/models/deepseek_v2.py` | Reshapes indexer KV cache by configured block size; enables preshuffle flags; disables quantization for `weights_proj`. |
| `atom/model_ops/attentions/aiter_mla.py` | Comments out block-table conversion usage and stops passing `block_tables_converted` for cudagraph capture metadata. |
| `atom/model_ops/attention_mla.py` | Updates DSA prefill req-local → global KV index conversion to use `PAGE_SIZE`. |
```python
# if self.block_ratio > 1:
#     block_table_convert_triton(
#         var["block_tables"].gpu[:bs],
#         var["block_tables_converted"].gpu[:bs],
#         var["context_lens"].gpu[:bs],
#         self.block_ratio,
#     )
# attn_metadata.block_tables = var["block_tables_converted"].gpu[:bs]
```
The block_table_convert_triton(...) logic is now commented out, but the module still imports block_table_convert_triton, and the commented block will drift out of sync over time. If the intent is to remove this conversion path, please delete the dead code and drop the unused import; also consider removing/guarding any remaining block_tables_converted plumbing so it can’t accidentally override block_tables in AttentionMetaData.__init__.
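A hedged sketch of the guarding option, assuming a dataclass-style metadata object; the field names and the `conversion_enabled` flag are assumptions rather than the actual `AttentionMetaData` API:

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class AttentionMetaDataSketch:
    # Raw per-request table of physical KV block ids.
    block_tables: torch.Tensor
    # Converted table; only meaningful when the conversion kernel actually ran.
    block_tables_converted: Optional[torch.Tensor] = None
    conversion_enabled: bool = False

    def __post_init__(self) -> None:
        # Guard: a stale or uninitialized converted table must never silently
        # replace the raw block table while the conversion path is disabled.
        if self.conversion_enabled and self.block_tables_converted is not None:
            self.block_tables = self.block_tables_converted
```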
```python
# block_tables_converted=(
#     var["block_tables_converted"].gpu[:bs]
#     if "block_tables_converted" in var
#     else None
# ),
```
build_for_cudagraph_capture() no longer passes block_tables_converted, but other paths in this file still pass it when present (e.g., build_ubatch_metadata()), and AttentionMetaData.__init__ will override block_tables when block_tables_converted is not None. Since the conversion kernels are now disabled/commented out, block_tables_converted can be stale/uninitialized and silently break sparse attention under TBO/CUDAGraph. Please make the handling consistent: either always populate block_tables_converted before passing it, or stop passing/allocating it everywhere for MLA.
Suggested change:

```diff
- # block_tables_converted=(
- #     var["block_tables_converted"].gpu[:bs]
- #     if "block_tables_converted" in var
- #     else None
- # ),
+ # MLA block-table conversion is currently disabled. Always pass
+ # None here so AttentionMetaData keeps using block_tables instead
+ # of accidentally overriding it with stale/uninitialized converted
+ # data.
+ block_tables_converted=None,
```
```diff
  hidden_size,
  self.n_head,
- quant_config=quant_config,
+ quant_config=None,
```
Changing weights_proj from quant_config=quant_config to quant_config=None changes quantization behavior for the indexer path while the surrounding projections (wq_b, wk) still honor quant_config. If this layer must remain unquantized for correctness/accuracy (similar to the MoE gate), add an explicit comment explaining why; otherwise this likely regresses performance or makes quantization behavior inconsistent for this module.
Suggested change:

```diff
- quant_config=None,
+ quant_config=quant_config,
```
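If keeping `weights_proj` unquantized is intentional, something along these lines would make the intent explicit; a sketch only, using a plain `nn.Linear` stand-in because the actual linear wrapper and the real rationale are not shown in this diff:

```python
import torch.nn as nn


class IndexerSketch(nn.Module):
    """Toy stand-in for the indexer module; only weights_proj is shown."""

    def __init__(self, hidden_size: int, n_head: int, quant_config=None):
        super().__init__()
        # weights_proj is intentionally left unquantized (comparable to the
        # MoE gate): the indexer's scoring projection is small but
        # accuracy-sensitive. (Assumed rationale; the PR should state the
        # real one.) quant_config is deliberately ignored here.
        self.weights_proj = nn.Linear(hidden_size, n_head, bias=False)
```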
```diff
  if attn_metadata.block_tables is None:
      self.prepare_block_tables(batch)
      attn_metadata.block_tables = var["block_tables"].copy_to_gpu(bs)
-     if self.block_ratio > 1:
-         block_table_convert_triton(
-             var["block_tables"].gpu[:bs],
-             var["block_tables_converted"].gpu[:bs],
-             var["context_lens"].gpu[:bs],
-             self.block_ratio,
-         )
-         attn_metadata.block_tables = var["block_tables_converted"].gpu[:bs]
+     # if self.block_ratio > 1:
+     #     block_table_convert_triton(
```
In prepare_prefill, attn_metadata comes from CommonAttentionBuilder.prepare_prefill(), which will populate block_tables_converted when has_cached and block_ratio > 1, and AttentionMetaData.__init__ replaces self.block_tables with the converted table. With the updated DSA index conversion (now using PAGE_SIZE and raw physical block ids), this if attn_metadata.block_tables is None: gate means cached-prefill requests will continue using a converted per-token table, which can produce incorrect KV indices. Consider always resetting attn_metadata.block_tables to the raw var["block_tables"] for the sparse/DSA path (or disabling the common conversion for MLA builders) so the DSA kernel consistently sees physical block ids.
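A minimal sketch of the suggested behavior, written as a free-standing helper because the real change lives inside the MLA builder's prepare_prefill; the argument names mirror the diff above and are not the actual atom API:

```python
def reset_block_tables_for_dsa(attn_metadata, var, bs, prepare_block_tables, batch):
    """Always hand the DSA kernel the raw physical block table.

    Unconditionally rebuilds block_tables from var["block_tables"], even when
    the common builder already installed a converted per-token table, so the
    PAGE_SIZE-based index conversion consistently sees physical block ids.
    """
    prepare_block_tables(batch)
    attn_metadata.block_tables = var["block_tables"].copy_to_gpu(bs)
    return attn_metadata
```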
Motivation
Technical Details
Test Plan
Test Result
Submission Checklist