Skip to content

Conversation

@SiriusNEO
Copy link
Collaborator

@SiriusNEO SiriusNEO commented Jan 5, 2026

#1549 exposes some bugs in current lowering ParallelOp and VectorizeLoop implementation, mainly two problems:

  1. Some IsLocal conditions don't take local.var into account.
  2. If indices of a BufferStore are invarient, we shouldn't vectorize this loop.

Summary by CodeRabbit

  • Bug Fixes

    • Resolved vectorization issue when allocating scalar variables within parallel operations
    • Improved handling of store operations during loop vectorization to prevent incorrect vector size calculations
    • Enhanced local buffer classification for more accurate tile-lowering decisions
  • Tests

    • Added regression test for variable vectorization edge cases

✏️ Tip: You can customize this high-level summary in your review settings.

@SiriusNEO SiriusNEO requested a review from LeiWang1999 January 5, 2026 09:36
@github-actions
Copy link

github-actions bot commented Jan 5, 2026

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 5, 2026

📝 Walkthrough

Walkthrough

This PR extends buffer scope handling by introducing an allow_var parameter to IsLocalBuffer in the public API, updates loop vectorization logic to differentiate between load and store paths with a new is_store flag, adjusts local buffer detection in tile lowering by passing true to updated IsLocalBuffer calls, introduces a new test for variable allocation vectorization, and adds debugging instrumentation to the lowering pipeline.

Changes

Cohort / File(s) Summary
Buffer Scope Extension
src/op/utils.h, src/transform/lower_tile_op.cc
Extended IsLocalBuffer to accept optional allow_var parameter (default false); when true, accepts both "local" and "local.var" scopes. Updated three call sites in lower_tile_op.cc to pass true, affecting store_into_local, local_register_only, and has_non_local buffer checks.
Vectorization Logic
src/transform/loop_vectorize.cc
Added is_store boolean parameter to UpdateVectorSize method to track load vs. store paths; store paths now force vector_size_ to 1 when element offset is loop-invariant, effectively disabling vectorization for stores.
Testing & Debugging
testing/python/issue/test_tilelang_issue_1549.py, tilelang/engine/phase.py
New test module validating kernel behavior with T.alloc_var scalar allocation and parallel assignment. Added AST debugging print statements in LowerAndLegalize pipeline for inspection.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • LeiWang1999

Poem

🐰 A bunny hops through vectors bright,
Stores now skip the speedup flight,
Local vars get special care,
With scopes extended everywhere,
Debugging prints light the way—
Tile magic works another day! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 15.38% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main fixes: handling of IsLocal conditions for local.var and preventing vectorization of invariant BufferStore indices.
✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Fix all issues with AI Agents 🤖
In @tilelang/engine/phase.py:
- Around line 171-172: The unconditional debug prints in LowerAndLegalize (the
print(mod) and tilelang.analysis.ASTPrinter()(mod) calls) should be removed or
wrapped in the same config check used earlier; guard them with
should_enable_ast_print() so they only run when debugging is enabled, e.g., call
should_enable_ast_print() before invoking print(mod) or ASTPrinter; ensure you
reference the ASTPrinter class (tilelang.analysis.ASTPrinter) and the
LowerAndLegalize function when making the change.
🧹 Nitpick comments (2)
testing/python/issue/test_tilelang_issue_1549.py (1)

26-28: Remove unused kernel creation at line 26.

Line 26 creates a kernel that is immediately overwritten by line 28. This is unnecessary and potentially confusing.

🔎 Proposed fix
-    kernel = get_wrong_kernel()
     M = 2048
     kernel = get_wrong_kernel(M)
src/transform/loop_vectorize.cc (1)

193-193: Minor: Fix grammatical issues in comments.

  • Line 193: "Specially" → "Specifically" or "In particular"
  • Line 207: "indices is invariant" → "indices are invariant"
🔎 Proposed fixes
-      // Specially, if it's a BufferStore, we should not vectorize it.
+      // Specifically, if it's a BufferStore, we should not vectorize it.
       if (is_store) {
         vector_size_ = 1;
       }
     } else if (is_store) {
-      // If the indices is invariant for BufferStore, we should also not
+      // If the indices are invariant for BufferStore, we should also not
       // vectorize it.
       vector_size_ = 1;
     }

Also applies to: 207-208

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32aec8a and 5395503.

📒 Files selected for processing (5)
  • src/op/utils.h
  • src/transform/loop_vectorize.cc
  • src/transform/lower_tile_op.cc
  • testing/python/issue/test_tilelang_issue_1549.py
  • tilelang/engine/phase.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.

Applied to files:

  • testing/python/issue/test_tilelang_issue_1549.py
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/issue/test_tilelang_issue_1549.py
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

  • src/transform/loop_vectorize.cc
🧬 Code graph analysis (5)
tilelang/engine/phase.py (2)
tilelang/tools/Analyzer.py (1)
  • analysis (209-218)
tilelang/analysis/ast_printer.py (1)
  • ASTPrinter (79-102)
src/transform/lower_tile_op.cc (2)
src/op/utils.h (1)
  • IsFragmentBuffer (33-35)
src/transform/common/loop_fusion_utils.h (1)
  • IsFragmentBuffer (70-75)
testing/python/issue/test_tilelang_issue_1549.py (1)
tilelang/language/kernel.py (2)
  • num_threads (221-225)
  • threads (214-218)
src/transform/loop_vectorize.cc (2)
src/transform/atomicadd_vectorize.cc (6)
  • UpdateVectorSize (111-151)
  • UpdateVectorSize (111-112)
  • node (186-211)
  • node (186-186)
  • node (213-291)
  • node (213-213)
src/transform/loop_partition.cc (8)
  • node (44-51)
  • node (44-44)
  • node (52-59)
  • node (52-52)
  • node (174-187)
  • node (174-174)
  • node (237-245)
  • node (237-237)
src/op/utils.h (1)
src/transform/legalize_safe_memory_access.cc (6)
  • buffer (61-69)
  • buffer (61-61)
  • buffer (72-115)
  • buffer (72-73)
  • buffer (266-271)
  • buffer (266-266)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (4)
src/op/utils.h (1)

50-57: LGTM! Clean extension of local buffer detection.

The addition of the allow_var parameter properly extends IsLocalBuffer to optionally recognize local.var scope while maintaining backward compatibility through the default parameter value. This addresses the PR objective of fixing conditions that didn't account for local.var.

src/transform/lower_tile_op.cc (1)

738-738: LGTM! Consistent update of IsLocalBuffer calls.

All call sites are systematically updated to pass true for the allow_var parameter, ensuring that both local and local.var scopes are recognized in:

  • Store-into-local detection (line 738)
  • Local-register-only analysis (lines 751, 755)
  • Non-local buffer detection (lines 769-770, 774-775)

This properly addresses the PR objective of fixing IsLocal conditions that didn't account for local.var.

Also applies to: 751-751, 755-755, 769-770, 774-775

src/transform/loop_vectorize.cc (2)

173-174: LGTM! Correct logic to prevent invalid store vectorization.

The addition of the is_store parameter properly addresses the PR objective: "Loops whose BufferStore indices are invariant should not be vectorized." The implementation correctly disables vectorization for stores in two scenarios:

  1. Lines 192-197: When the element offset is independent of the loop variable (all iterations would write to the same location)
  2. Lines 206-210: When the offset is invariant within vector boundaries (all vector lanes would write to the same location)

Both conditions correctly identify cases where store vectorization would be semantically incorrect.

Also applies to: 191-210


132-132: LGTM! Correct differentiation between load and store paths.

The call sites properly pass false for loads (line 132) and true for stores (line 140), enabling the vectorization logic to apply different rules for each case.

Also applies to: 140-140

@SiriusNEO
Copy link
Collaborator Author

@regression-perf

@github-actions
Copy link

github-actions bot commented Jan 5, 2026

Performance Regression Test Report

Triggered by: @SiriusNEO
Workflow run: https://github.com/tile-ai/tilelang/actions/runs/20711728528

Results

File Original Latency Current Latency Speedup
example_warp_specialize_gemm_softpipe_stage2 0.037057 0.039713 0.93312
example_gemv 0.276373 0.285696 0.967369
example_tilelang_gemm_fp8_2xAcc 0.187556 0.19253 0.974162
example_gemm 0.022784 0.023041 0.988846
example_topk 0.010912 0.011008 0.991279
example_dequant_groupedgemm_bf16_mxfp4_hopper 3.51573 3.54349 0.992166
example_gqa_decode 0.048449 0.048769 0.993438
example_dynamic 0.656749 0.658221 0.997764
example_mha_bwd_bshd_wgmma_pipelined 0.0255473 0.0255941 0.998173
example_gemm_intrinsics 0.03504 0.035104 0.998177
example_mha_bwd_bhsd 0.0400633 0.0401055 0.998949
example_warp_specialize_gemm_copy_0_gemm_1 0.040032 0.040065 0.999176
tilelang_example_sparse_tensorcore 0.0150255 0.0150347 0.999384
example_gqa_bwd_wgmma_pipelined 0.0740342 0.0740715 0.999497
example_mha_bwd_bshd 0.0408694 0.0408767 0.999822
example_fusedmoe_tilelang 0.130585 0.130603 0.999865
example_tilelang_gemm_splitk 1.40829 1.40845 0.999887
block_sparse_attn_tilelang 0.0102467 0.0102472 0.999951
example_linear_attn_fwd 0.0364848 0.036485 0.999995
example_gqa_bwd_tma_reduce_varlen 0.0636028 0.0635931 1.00015
example_mha_fwd_varlen 0.0454515 0.0454421 1.00021
example_linear_attn_bwd 0.151923 0.151889 1.00023
example_tilelang_gemm_splitk_vectorize_atomicadd 1.40964 1.40895 1.00049
example_gemm_schedule 0.0325652 0.032547 1.00056
example_gqa_bwd 0.0498179 0.0497896 1.00057
example_vertical_slash_sparse_attn 0.237454 0.237315 1.00059
example_elementwise_add 0.29577 0.295587 1.00062
example_tilelang_gemm_fp8_intrinsic 0.467351 0.467002 1.00075
example_tilelang_gemm_fp8 0.32223 0.321383 1.00264
example_gemm_autotune 0.022273 0.022209 1.00288
example_dequant_gemv_fp16xint4 0.0285017 0.0283418 1.00564
example_mha_inference 0.072482 0.072002 1.00667
example_convolution_autotune 1.00962 0.994389 1.01531
example_per_token_cast_to_fp8 0.0075021 0.00735538 1.01995
example_dequant_gemm_w4a8 5.5166 5.39793 1.02198
example_tilelang_nsa_decode 0.00690423 0.00674148 1.02414
example_tilelang_nsa_fwd 0.00715736 0.0069825 1.02504
topk_selector 0.0549409 0.0535491 1.02599
sparse_mla_fwd 0.145089 0.141008 1.02894
example_warp_specialize_gemm_copy_1_gemm_0 0.037953 0.036768 1.03223
sparse_mla_fwd_pipelined 0.0986826 0.0955311 1.03299
sparse_mla_bwd 0.396097 0.381529 1.03818
example_dequant_gemm_bf16_mxfp4_hopper 0.52967 0.509194 1.04021
example_warp_specialize_gemm_barrierpipe_stage2 0.039713 0.038145 1.04111
example_tilelang_block_sparse_attn 0.0105794 0.0101608 1.0412
fp8_lighting_indexer 0.0376083 0.0358558 1.04888
example_mha_sink_fwd_bhsd_wgmma_pipelined 0.0162064 0.0154454 1.04927
example_dequant_gemm_bf16_fp4_hopper 0.601899 0.573217 1.05004
example_mha_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0162677 0.0154904 1.05018
example_mha_sink_fwd_bhsd_sliding_window 0.0165204 0.0157196 1.05094
example_mha_sink_fwd_bhsd 0.0167141 0.0158726 1.05302
example_gqa_sink_fwd_bhsd_wgmma_pipelined 0.0152178 0.0144421 1.05372
example_tilelang_sparse_gqa_decode_varlen_indice 0.0180609 0.0171373 1.0539
example_group_per_split_token_cast_to_fp8 0.010824 0.0102425 1.05677
example_tilelang_sparse_gqa_decode_varlen_mask 0.0247203 0.023354 1.0585
example_blocksparse_gemm 0.0239958 0.0226646 1.05873
example_gqa_sink_fwd_bhsd_wgmma_pipelined_sliding_window 0.0154126 0.0145562 1.05883
example_mha_sink_bwd_bhsd_sliding_window 0.0470414 0.0443863 1.05982
example_mha_sink_bwd_bhsd 0.0659296 0.0615957 1.07036
example_gqa_sink_bwd_bhsd_sliding_window 0.027382 0.0255256 1.07273
example_dequant_gemm_fp4_hopper 1.09396 1.01936 1.07319
example_gqa_sink_bwd_bhsd 0.044876 0.0416197 1.07824
example_convolution 1.44261 1.33138 1.08354
example_mla_decode 0.501768 0.46132 1.08768

Artifacts

  • regression_result.png (speedup plot) is attached as a workflow artifact. Download it from the workflow run page above.

@LeiWang1999 LeiWang1999 merged commit 1b00220 into tile-ai:main Jan 6, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants