[BugFix] Fix some bugs in lowering ParallelOp and VectorizeLoop #1607

SiriusNEO · 2026-01-05T09:36:25Z

#1549 exposes some bugs in current lowering ParallelOp and VectorizeLoop implementation, mainly two problems:

Some IsLocal conditions don't take local.var into account.
If indices of a BufferStore are invarient, we shouldn't vectorize this loop.

Summary by CodeRabbit

Bug Fixes
- Resolved vectorization issue when allocating scalar variables within parallel operations
- Improved handling of store operations during loop vectorization to prevent incorrect vector size calculations
- Enhanced local buffer classification for more accurate tile-lowering decisions
Tests
- Added regression test for variable vectorization edge cases

_{✏️ Tip: You can customize this high-level summary in your review settings.}

github-actions · 2026-01-05T09:36:35Z

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai · 2026-01-05T09:36:39Z

📝 Walkthrough

Walkthrough

This PR extends buffer scope handling by introducing an allow_var parameter to IsLocalBuffer in the public API, updates loop vectorization logic to differentiate between load and store paths with a new is_store flag, adjusts local buffer detection in tile lowering by passing true to updated IsLocalBuffer calls, introduces a new test for variable allocation vectorization, and adds debugging instrumentation to the lowering pipeline.

Changes

Cohort / File(s)	Summary
Buffer Scope Extension `src/op/utils.h`, `src/transform/lower_tile_op.cc`	Extended `IsLocalBuffer` to accept optional `allow_var` parameter (default `false`); when `true`, accepts both "local" and "local.var" scopes. Updated three call sites in lower_tile_op.cc to pass `true`, affecting store_into_local, local_register_only, and has_non_local buffer checks.
Vectorization Logic `src/transform/loop_vectorize.cc`	Added `is_store` boolean parameter to `UpdateVectorSize` method to track load vs. store paths; store paths now force `vector_size_` to 1 when element offset is loop-invariant, effectively disabling vectorization for stores.
Testing & Debugging `testing/python/issue/test_tilelang_issue_1549.py`, `tilelang/engine/phase.py`	New test module validating kernel behavior with `T.alloc_var` scalar allocation and parallel assignment. Added AST debugging print statements in `LowerAndLegalize` pipeline for inspection.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

[Bugfix] Avoid considering local.var buffer as local #1541: Directly modifies IsLocalBuffer implementation in src/op/utils.h to adjust handling of "local.var" scope—may have conflicting or complementary intent.
[Bugfix] Fix of T.Fill for local.var #1543: Adds separate handling for "local.var" buffer scope and updates Fill lowering logic—related to the same buffer scope extension pattern.
[BugFix] Use BufferRegion in tl.cumsum to infer buffer shape #1321: Adds AST printing instrumentation in tilelang/engine/phase.py—overlaps with debugging additions in this PR.

Suggested reviewers

LeiWang1999

Poem

🐰 A bunny hops through vectors bright,
Stores now skip the speedup flight,
Local vars get special care,
With scopes extended everywhere,
Debugging prints light the way—
Tile magic works another day! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 15.38% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main fixes: handling of IsLocal conditions for local.var and preventing vectorization of invariant BufferStore indices.

✨ Finishing touches

📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Fix all issues with AI Agents 🤖

In @tilelang/engine/phase.py:
- Around line 171-172: The unconditional debug prints in LowerAndLegalize (the
print(mod) and tilelang.analysis.ASTPrinter()(mod) calls) should be removed or
wrapped in the same config check used earlier; guard them with
should_enable_ast_print() so they only run when debugging is enabled, e.g., call
should_enable_ast_print() before invoking print(mod) or ASTPrinter; ensure you
reference the ASTPrinter class (tilelang.analysis.ASTPrinter) and the
LowerAndLegalize function when making the change.

🧹 Nitpick comments (2)

testing/python/issue/test_tilelang_issue_1549.py (1)
26-28: Remove unused kernel creation at line 26.

Line 26 creates a kernel that is immediately overwritten by line 28. This is unnecessary and potentially confusing.
🔎 Proposed fix
-    kernel = get_wrong_kernel()
     M = 2048
     kernel = get_wrong_kernel(M)
src/transform/loop_vectorize.cc (1)
193-193: Minor: Fix grammatical issues in comments.

Line 193: "Specially" → "Specifically" or "In particular"

Line 207: "indices is invariant" → "indices are invariant"
🔎 Proposed fixes
-      // Specially, if it's a BufferStore, we should not vectorize it.
+      // Specifically, if it's a BufferStore, we should not vectorize it.
       if (is_store) {
         vector_size_ = 1;
       }
     } else if (is_store) {
-      // If the indices is invariant for BufferStore, we should also not
+      // If the indices are invariant for BufferStore, we should also not
       // vectorize it.
       vector_size_ = 1;
     }
Also applies to: 207-208

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32aec8a and 5395503.

📒 Files selected for processing (5)

src/op/utils.h
src/transform/loop_vectorize.cc
src/transform/lower_tile_op.cc
testing/python/issue/test_tilelang_issue_1549.py
tilelang/engine/phase.py

🧰 Additional context used

🧠 Learnings (3)

📚 Learning: 2025-12-18T04:50:00.512Z

Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.

Applied to files:

testing/python/issue/test_tilelang_issue_1549.py

📚 Learning: 2025-11-14T07:56:11.098Z

Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

testing/python/issue/test_tilelang_issue_1549.py

📚 Learning: 2025-09-12T09:47:46.474Z

Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

src/transform/loop_vectorize.cc

🧬 Code graph analysis (5)

tilelang/engine/phase.py (2)

tilelang/tools/Analyzer.py (1)

analysis (209-218)

tilelang/analysis/ast_printer.py (1)

ASTPrinter (79-102)

src/transform/lower_tile_op.cc (2)

src/op/utils.h (1)

IsFragmentBuffer (33-35)

src/transform/common/loop_fusion_utils.h (1)

IsFragmentBuffer (70-75)

testing/python/issue/test_tilelang_issue_1549.py (1)

tilelang/language/kernel.py (2)

num_threads (221-225)

threads (214-218)

src/transform/loop_vectorize.cc (2)

src/transform/atomicadd_vectorize.cc (6)

UpdateVectorSize (111-151)

UpdateVectorSize (111-112)

node (186-211)

node (186-186)

node (213-291)

node (213-213)

src/transform/loop_partition.cc (8)

node (44-51)

node (44-44)

node (52-59)

node (52-52)

node (174-187)

node (174-174)

node (237-245)

node (237-237)

src/op/utils.h (1)

src/transform/legalize_safe_memory_access.cc (6)

buffer (61-69)

buffer (61-61)

buffer (72-115)

buffer (72-73)

buffer (266-271)

buffer (266-266)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Test for Python 3.12 with Metal (on macos-latest)

🔇 Additional comments (4)

src/op/utils.h (1)

50-57: LGTM! Clean extension of local buffer detection.

The addition of the allow_var parameter properly extends IsLocalBuffer to optionally recognize local.var scope while maintaining backward compatibility through the default parameter value. This addresses the PR objective of fixing conditions that didn't account for local.var.

src/transform/lower_tile_op.cc (1)

738-738: LGTM! Consistent update of IsLocalBuffer calls.

All call sites are systematically updated to pass true for the allow_var parameter, ensuring that both local and local.var scopes are recognized in:

Store-into-local detection (line 738)

Local-register-only analysis (lines 751, 755)

Non-local buffer detection (lines 769-770, 774-775)

This properly addresses the PR objective of fixing IsLocal conditions that didn't account for local.var.

Also applies to: 751-751, 755-755, 769-770, 774-775

src/transform/loop_vectorize.cc (2)

173-174: LGTM! Correct logic to prevent invalid store vectorization.

The addition of the is_store parameter properly addresses the PR objective: "Loops whose BufferStore indices are invariant should not be vectorized." The implementation correctly disables vectorization for stores in two scenarios:

Lines 192-197: When the element offset is independent of the loop variable (all iterations would write to the same location)

Lines 206-210: When the offset is invariant within vector boundaries (all vector lanes would write to the same location)

Both conditions correctly identify cases where store vectorization would be semantically incorrect.

Also applies to: 191-210

132-132: LGTM! Correct differentiation between load and store paths.

The call sites properly pass false for loads (line 132) and true for stores (line 140), enabling the vectorization logic to apply different rules for each case.

Also applies to: 140-140

tilelang/engine/phase.py

SiriusNEO · 2026-01-05T09:58:57Z

@regression-perf

github-actions · 2026-01-05T11:08:03Z

Performance Regression Test Report

Triggered by: @SiriusNEO
Workflow run: https://github.com/tile-ai/tilelang/actions/runs/20711728528

Results

File	Original Latency	Current Latency	Speedup
example_warp_specialize_gemm_softpipe_stage2	0.037057	0.039713	0.93312
example_gemv	0.276373	0.285696	0.967369
example_tilelang_gemm_fp8_2xAcc	0.187556	0.19253	0.974162
example_gemm	0.022784	0.023041	0.988846
example_topk	0.010912	0.011008	0.991279
example_dequant_groupedgemm_bf16_mxfp4_hopper	3.51573	3.54349	0.992166
example_gqa_decode	0.048449	0.048769	0.993438
example_dynamic	0.656749	0.658221	0.997764
example_mha_bwd_bshd_wgmma_pipelined	0.0255473	0.0255941	0.998173
example_gemm_intrinsics	0.03504	0.035104	0.998177
example_mha_bwd_bhsd	0.0400633	0.0401055	0.998949
example_warp_specialize_gemm_copy_0_gemm_1	0.040032	0.040065	0.999176
tilelang_example_sparse_tensorcore	0.0150255	0.0150347	0.999384
example_gqa_bwd_wgmma_pipelined	0.0740342	0.0740715	0.999497
example_mha_bwd_bshd	0.0408694	0.0408767	0.999822
example_fusedmoe_tilelang	0.130585	0.130603	0.999865
example_tilelang_gemm_splitk	1.40829	1.40845	0.999887
block_sparse_attn_tilelang	0.0102467	0.0102472	0.999951
example_linear_attn_fwd	0.0364848	0.036485	0.999995
example_gqa_bwd_tma_reduce_varlen	0.0636028	0.0635931	1.00015
example_mha_fwd_varlen	0.0454515	0.0454421	1.00021
example_linear_attn_bwd	0.151923	0.151889	1.00023
example_tilelang_gemm_splitk_vectorize_atomicadd	1.40964	1.40895	1.00049
example_gemm_schedule	0.0325652	0.032547	1.00056
example_gqa_bwd	0.0498179	0.0497896	1.00057
example_vertical_slash_sparse_attn	0.237454	0.237315	1.00059
example_elementwise_add	0.29577	0.295587	1.00062
example_tilelang_gemm_fp8_intrinsic	0.467351	0.467002	1.00075
example_tilelang_gemm_fp8	0.32223	0.321383	1.00264
example_gemm_autotune	0.022273	0.022209	1.00288
example_dequant_gemv_fp16xint4	0.0285017	0.0283418	1.00564
example_mha_inference	0.072482	0.072002	1.00667
example_convolution_autotune	1.00962	0.994389	1.01531
example_per_token_cast_to_fp8	0.0075021	0.00735538	1.01995
example_dequant_gemm_w4a8	5.5166	5.39793	1.02198
example_tilelang_nsa_decode	0.00690423	0.00674148	1.02414
example_tilelang_nsa_fwd	0.00715736	0.0069825	1.02504
topk_selector	0.0549409	0.0535491	1.02599
sparse_mla_fwd	0.145089	0.141008	1.02894
example_warp_specialize_gemm_copy_1_gemm_0	0.037953	0.036768	1.03223
sparse_mla_fwd_pipelined	0.0986826	0.0955311	1.03299
sparse_mla_bwd	0.396097	0.381529	1.03818
example_dequant_gemm_bf16_mxfp4_hopper	0.52967	0.509194	1.04021
example_warp_specialize_gemm_barrierpipe_stage2	0.039713	0.038145	1.04111
example_tilelang_block_sparse_attn	0.0105794	0.0101608	1.0412
fp8_lighting_indexer	0.0376083	0.0358558	1.04888
example_mha_sink_fwd_bhsd_wgmma_pipelined	0.0162064	0.0154454	1.04927
example_dequant_gemm_bf16_fp4_hopper	0.601899	0.573217	1.05004
example_mha_sink_fwd_bhsd_wgmma_pipelined_sliding_window	0.0162677	0.0154904	1.05018
example_mha_sink_fwd_bhsd_sliding_window	0.0165204	0.0157196	1.05094
example_mha_sink_fwd_bhsd	0.0167141	0.0158726	1.05302
example_gqa_sink_fwd_bhsd_wgmma_pipelined	0.0152178	0.0144421	1.05372
example_tilelang_sparse_gqa_decode_varlen_indice	0.0180609	0.0171373	1.0539
example_group_per_split_token_cast_to_fp8	0.010824	0.0102425	1.05677
example_tilelang_sparse_gqa_decode_varlen_mask	0.0247203	0.023354	1.0585
example_blocksparse_gemm	0.0239958	0.0226646	1.05873
example_gqa_sink_fwd_bhsd_wgmma_pipelined_sliding_window	0.0154126	0.0145562	1.05883
example_mha_sink_bwd_bhsd_sliding_window	0.0470414	0.0443863	1.05982
example_mha_sink_bwd_bhsd	0.0659296	0.0615957	1.07036
example_gqa_sink_bwd_bhsd_sliding_window	0.027382	0.0255256	1.07273
example_dequant_gemm_fp4_hopper	1.09396	1.01936	1.07319
example_gqa_sink_bwd_bhsd	0.044876	0.0416197	1.07824
example_convolution	1.44261	1.33138	1.08354
example_mla_decode	0.501768	0.46132	1.08768

Artifacts

regression_result.png (speedup plot) is attached as a workflow artifact. Download it from the workflow run page above.

SiriusNEO added 2 commits January 4, 2026 18:03

fix

f2fe56c

[BugFix] Fix some bugs in lowering ParallelOp and VectorizeLoop

5395503

SiriusNEO requested a review from LeiWang1999 January 5, 2026 09:36

coderabbitai bot reviewed Jan 5, 2026

View reviewed changes

tilelang/engine/phase.py Outdated Show resolved Hide resolved

remove debug output

11fe7a1

LeiWang1999 merged commit 1b00220 into tile-ai:main Jan 6, 2026
6 checks passed

coderabbitai bot mentioned this pull request Jan 7, 2026

[Bugfix] Do not consider local.var as local buffer during LowerTileOP #1628

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BugFix] Fix some bugs in lowering ParallelOp and VectorizeLoop #1607

[BugFix] Fix some bugs in lowering ParallelOp and VectorizeLoop #1607

Uh oh!

SiriusNEO commented Jan 5, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

github-actions bot commented Jan 5, 2026

Uh oh!

coderabbitai bot commented Jan 5, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

SiriusNEO commented Jan 5, 2026

Uh oh!

github-actions bot commented Jan 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[BugFix] Fix some bugs in lowering ParallelOp and VectorizeLoop #1607

[BugFix] Fix some bugs in lowering ParallelOp and VectorizeLoop #1607

Uh oh!

Conversation

SiriusNEO commented Jan 5, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

github-actions bot commented Jan 5, 2026

Uh oh!

coderabbitai bot commented Jan 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

SiriusNEO commented Jan 5, 2026

Uh oh!

github-actions bot commented Jan 5, 2026

Performance Regression Test Report

Results

Artifacts

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

SiriusNEO commented Jan 5, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 5, 2026 •

edited

Loading