[Feat] PDL Support #1494
Conversation
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run … We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.
📝 Walkthrough
Adds CUDA Programmatic Dependent Launch (PDL) support: new TileLang helpers (T.pdl_trigger()/T.pdl_sync()), a pass that marks or eliminates the CUDA sync calls depending on target support, CUDA/NVRTC codegen and host-wrapper changes for PDL-aware launches, and accompanying tests.
Sequence Diagram(s)
```mermaid
sequenceDiagram
autonumber
participant Dev as Developer
participant TL as TileLang front-end
participant Compiler as TVM Compiler
participant MarkPass as MarkCudaSyncCalls (pass)
participant Codegen as CUDA codegen / NVRTC source
participant NVRTC as NVRTC (compiler/runtime)
participant Host as Host runtime / wrapper
Dev->>TL: Write kernel using T.pdl_trigger()/T.pdl_sync()
TL->>Compiler: Lower to TIR (extern calls emitted)
Compiler->>MarkPass: Run MarkCudaSyncCalls (OptimizeForTarget)
MarkPass->>Compiler: Annotate PrimFunc attrs (has_cuda_pdl_*)
Compiler->>Codegen: Generate CUDA source (may inject device runtime header)
Codegen->>NVRTC: Compile and return kernel + metadata (pdl_map)
NVRTC->>Host: Provide compiled module and pdl metadata
Host->>Host: Consult pdl_sync_map for function
alt function marked for PDL
Host->>NVRTC: Launch with programmatic stream-serialization attributes
else
Host->>NVRTC: Launch normally
end
    NVRTC-->>Dev: Kernel runs and returns results
```
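For readers unfamiliar with PDL, the device-side pattern that `T.pdl_trigger()` and `T.pdl_sync()` lower to (via `call_extern`) looks roughly like the sketch below. This is only an illustration under the assumption that the helpers map one-to-one onto the two CUDA device runtime calls named in this PR; the kernels themselves are made up and the code TileLang actually emits may differ.

```cuda
// Producer kernel: once its remaining work no longer needs to block the
// next kernel, it signals that the dependent launch may proceed.
__global__ void producer(float* buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] += 1.0f;
  // Corresponds to T.pdl_trigger()
  cudaTriggerProgrammaticLaunchCompletion();
}

// Consumer kernel: it may start early (prologue work is allowed), but it
// must wait for the producer grid before reading the data it produced.
__global__ void consumer(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  // Corresponds to T.pdl_sync()
  cudaGridDependencySynchronize();
  if (i < n) out[i] = in[i] * 2.0f;
}
```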
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Actionable comments posted: 5
🧹 Nitpick comments (8)
tilelang/jit/adapter/libgen.py (1)
151-153: Consider truncating `lib_code` in the error message. Including the full source code in the exception can produce extremely long error messages, especially for large kernels. Consider truncating or omitting `self.lib_code` from the error, or logging it separately at debug level.
🔎 Proposed improvement
```diff
 if ret.returncode != 0:
     command_str = " ".join(command)
-    raise RuntimeError(f"Compilation Failed! {command_str}\n {self.lib_code}")
+    raise RuntimeError(f"Compilation Failed! Command: {command_str}\nSource saved at: {src.name}")
```
testing/python/jit/test_tilelang_jit_nvrtc.py (1)
505-552: Consider adding a compute capability check for PDL support. Based on `tilelang/contrib/nvcc.py` (lines 589-594), PDL requires compute capability >= 9.0 (Hopper or later). Other tests in this file, like `test_nvrtc_im2col_tma_desc`, skip on non-Hopper GPUs. Consider adding a similar check here to avoid test failures on older hardware.
🔎 Proposed fix
```diff
 def test_nvrtc_pdl():
     """Test pdl."""
+    if not check_hopper():
+        import pytest
+
+        pytest.skip("Test requires Hopper GPU (compute capability 9.0) for PDL support")
     N = 64
```
tilelang/jit/adapter/wrapper.py (1)
213-213: Consider using `set` instead of `dict` for `pdl_sync_map`. The `pdl_sync_map` dictionary always stores `0` as the value (line 450). If the value is unused, a `set` would be more appropriate and clearer in intent.
🔎 Proposed change
```diff
-        self.pdl_sync_map: dict[str, int] | None = {}
+        self.pdl_sync_map: set[str] | None = set()
```
And update usages:
- if "has_cuda_pdl_sync" in attrs: - self.pdl_sync_map[function_name] = 0 + if "has_cuda_pdl_sync" in attrs: + self.pdl_sync_map.add(function_name)tilelang/transform/__init__.py (1)
403-406: LGTM with suggestion: Consider expanding the docstring. The function follows the established pattern for pass wrappers. Consider adding a more detailed docstring similar to other functions in this file, documenting the `have_pdl` parameter and return type.
🔎 Suggested docstring enhancement
```diff
 def MarkCudaSyncCalls(have_pdl: bool = False):
-    """MarkCudaSyncCalls"""
+    """Mark CUDA synchronization calls for PDL support.
+
+    Parameters
+    ----------
+    have_pdl : bool
+        Whether the target supports PDL (Programmatic Device Launch).
+        PDL is available on compute capability >= 9.0 (Hopper+).
+
+    Returns
+    -------
+    fpass : tvm.transform.Pass
+        The result pass
+    """
     return _ffi_api.MarkCudaSyncCalls(have_pdl)  # type: ignore
```
tilelang/contrib/nvcc.py (1)
590-596: Minor: Prefix unused variable with underscore. The `minor` variable from `parse_compute_version` is unpacked but never used. This is a valid static analysis hint.
Proposed fix
```diff
 def have_pdl(target):
     if target.kind.name != "cuda":
         return False
     compute_version = get_target_compute_version(target)
-    major, minor = parse_compute_version(compute_version)
+    major, _ = parse_compute_version(compute_version)
     return major >= 9
```
testing/python/jit/test_tilelang_jit_ctypes.py (1)
1-1: Remove unused import. The `tvm` import is not used in this test file.
Proposed fix
```diff
-from tilelang import tvm as tvm
 import tilelang.language as T
```
src/transform/lower_pdl.cc (2)
83-83: Typo in class name: `ElininateCudaSyncCalls` → `EliminateCudaSyncCalls`. The class name has a typo ("Elininate" instead of "Eliminate"). While this doesn't affect functionality, fixing it improves code readability and maintainability.
Proposed fix
```diff
-class ElininateCudaSyncCalls : public StmtExprMutator {
+class EliminateCudaSyncCalls : public StmtExprMutator {
 public:
   static PrimFunc Substitute(PrimFunc f) {
-    ElininateCudaSyncCalls mutator;
+    EliminateCudaSyncCalls mutator;
     PrimFunc new_f = f;
     new_f.CopyOnWrite()->body = mutator.VisitStmt(f->body);
     return new_f;
   }
   // ... rest of class ...
 private:
-  ElininateCudaSyncCalls() = default;
+  EliminateCudaSyncCalls() = default;
 };
```
Also update line 141:
```diff
   return have_pdl ? MarkCudaSyncCalls::Substitute(f)
-                  : ElininateCudaSyncCalls::Substitute(f);
+                  : EliminateCudaSyncCalls::Substitute(f);
```
57-78: Duplicate `private:` access specifier. There are two `private:` access specifiers (lines 57 and 76). Consider consolidating them for cleaner code organization.
Proposed fix
```diff
-private:
   void CheckCall(const tir::CallNode *call) {
     if (!call)
       return;
     // ... implementation ...
   }
-private:
   bool has_trigger_launch_ = false;
   bool has_grid_sync_ = false;
   MarkCudaSyncCalls() = default;
+
+private:
+  void CheckCall(const tir::CallNode *call) {
+    if (!call)
+      return;
+    // ... implementation ...
+  }
+
+  bool has_trigger_launch_ = false;
+  bool has_grid_sync_ = false;
+
+  MarkCudaSyncCalls() = default;
 };
```
Note: The Cppcheck "syntax error" on line 147 is a false positive — `TVM_FFI_STATIC_INIT_BLOCK` is a valid TVM macro for FFI registration.
📜 Review details
📒 Files selected for processing (13)
docs/programming_guides/instructions.md
src/transform/lower_pdl.cc
testing/python/jit/test_tilelang_jit_ctypes.py
testing/python/jit/test_tilelang_jit_nvrtc.py
testing/python/language/test_tilelang_language_pdl.py
tilelang/contrib/nvcc.py
tilelang/engine/phase.py
tilelang/jit/adapter/libgen.py
tilelang/jit/adapter/nvrtc/wrapper.py
tilelang/jit/adapter/wrapper.py
tilelang/language/__init__.py
tilelang/language/pdl.py
tilelang/transform/__init__.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/jit/test_tilelang_jit_nvrtc.py
testing/python/jit/test_tilelang_jit_ctypes.py
tilelang/jit/adapter/nvrtc/wrapper.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.
Applied to files:
testing/python/language/test_tilelang_language_pdl.py
🧬 Code graph analysis (10)
tilelang/transform/__init__.py (2)
src/transform/lower_pdl.cc (1)
MarkCudaSyncCalls(80-80)tilelang/contrib/nvcc.py (1)
have_pdl(590-595)
tilelang/language/__init__.py (1)
tilelang/language/pdl.py (2)
pdl_trigger(10-14)pdl_sync(17-21)
tilelang/language/pdl.py (1)
tilelang/language/tir/op.py (1)
call_extern(173-195)
tilelang/engine/phase.py (3)
tilelang/contrib/nvcc.py (1)
have_pdl(590-595)tilelang/transform/__init__.py (1)
MarkCudaSyncCalls(403-405)tilelang/language/ast/ir.py (1)
target(1677-1707)
testing/python/language/test_tilelang_language_pdl.py (1)
tilelang/language/pdl.py (2)
pdl_trigger(10-14)pdl_sync(17-21)
testing/python/jit/test_tilelang_jit_ctypes.py (3)
testing/python/jit/test_tilelang_jit_nvrtc.py (7)
test_nvrtc_pdl(505-552)multi_kernels_with_pdl(511-532)main(30-50)main(138-158)main(384-413)main(513-530)kernel(469-484)testing/python/language/test_tilelang_language_pdl.py (2)
main(7-16)main(23-32)tilelang/language/pdl.py (2)
pdl_trigger(10-14)pdl_sync(17-21)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
call_extern(173-195)tilelang/contrib/nvcc.py (1)
have_pdl(590-595)
tilelang/contrib/nvcc.py (1)
tilelang/contrib/rocm.py (1)
parse_compute_version(179-201)
tilelang/jit/adapter/wrapper.py (1)
tilelang/jit/adapter/utils.py (1)
parse_function_call_args(315-357)
tilelang/jit/adapter/libgen.py (2)
tilelang/jit/adapter/ctypes/adapter.py (1)
lib_code(287-289)tilelang/jit/adapter/cython/adapter.py (1)
lib_code(371-373)
🪛 Cppcheck (2.18.0)
src/transform/lower_pdl.cc
[error] 147-147: syntax error
(syntaxError)
🪛 Ruff (0.14.8)
tilelang/language/__init__.py
120-120: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
121-121: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
tilelang/language/pdl.py
4-7: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
tilelang/contrib/nvcc.py
594-594: Unpacked variable minor is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
tilelang/jit/adapter/libgen.py
153-153: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (8)
tilelang/jit/adapter/wrapper.py (3)
141-155: LGTM: PDL kernel launch code structure looks correct. The `KERNEL_LAUNCH_FUNC_CODE` correctly configures `cudaLaunchConfig_t` with `cudaLaunchAttributeProgrammaticStreamSerialization` and uses `cudaLaunchKernelEx` for the kernel launch. The attribute setup follows CUDA's programmatic stream serialization pattern.
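For reference, the host-side recipe this generated code follows is roughly the standard CUDA runtime pattern sketched below. This is a generic illustration, not the wrapper's literal output; the kernel, launch dimensions, and helper name are invented for the example.

```cuda
#include <cuda_runtime.h>

__global__ void my_kernel(float* data) { data[threadIdx.x] *= 2.0f; }

// Launch with programmatic stream serialization enabled, so the next kernel
// in the same stream can be launched early and synchronize on this grid
// with cudaGridDependencySynchronize().
cudaError_t launch_with_pdl(float* data, cudaStream_t stream) {
  cudaLaunchConfig_t config = {};
  config.gridDim = dim3(1);
  config.blockDim = dim3(128);
  config.dynamicSmemBytes = 0;
  config.stream = stream;

  cudaLaunchAttribute attrs[1];
  attrs[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attrs[0].val.programmaticStreamSerializationAllowed = 1;
  config.attrs = attrs;
  config.numAttrs = 1;

  // C++ overload: kernel arguments are forwarded directly.
  return cudaLaunchKernelEx(&config, my_kernel, data);
}
```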
313-330: LGTM: PDL branch in dispatch function is correctly structured. The new PDL path properly:
- Parses function call arguments
- Validates argument count matches parameters
- Uses `KERNEL_LAUNCH_FUNC_CODE` for kernel launch with PDL attributes
- Includes error checking after launch
448-450: LGTM: PDL attribute detection correctly populates the sync map. The `has_cuda_pdl_sync` attribute is checked and the function is registered in `pdl_sync_map` for later use during code generation.
tilelang/language/__init__.py (1)
119-123: LGTM! The PDL function re-exports follow the established pattern in this file. The
# noqa: F401comments are appropriate for suppressing "imported but unused" warnings in__init__.pyfiles intended to expose public API.docs/programming_guides/instructions.md (1)
141-144: LGTM!Documentation for the new PDL synchronization helpers is clear and follows the established format of the instruction reference.
tilelang/engine/phase.py (1)
255-257: LGTM!The
`MarkCudaSyncCalls` pass is correctly integrated after `SplitHostDevice`. The unconditional application is appropriate since the pass internally handles both cases: marking sync calls when PDL is available (sm_90+) or eliminating them for older architectures.
testing/python/language/test_tilelang_language_pdl.py (1)
37-61: Test implementation looks good. The tests correctly verify that PDL synchronization calls are present in generated code for sm_90 (Hopper+) and absent for sm_75 (pre-Hopper). This validates the
MarkCudaSyncCallspass behavior.src/transform/lower_pdl.cc (1)
29-43: LGTM on the MarkCudaSyncCalls implementation.The mutator correctly scans for CUDA sync extern calls and annotates the
PrimFuncwithhas_cuda_pdl_triggerandhas_cuda_pdl_syncattributes. The pattern of visiting both statements and expressions ensures comprehensive detection.
Actionable comments posted: 0
🧹 Nitpick comments (4)
src/transform/lower_pdl.cc (3)
9-9: Remove duplicate include. The header `tvm/tir/builtin.h` is included twice (line 9 and line 14).
🔎 Proposed fix
#include "../op/builtin.h" #include "../target/utils.h" #include "tvm/ir/type.h" #include "tvm/tir/builtin.h" #include "tvm/tir/expr.h" #include "tvm/tir/stmt.h" #include <tvm/ffi/reflection/registry.h> #include <tvm/tir/analysis.h> -#include <tvm/tir/builtin.h> #include <tvm/tir/stmt_functor.h> #include <tvm/tir/transform.h>Also applies to: 14-14
58-74: Extract duplicated logic into a helper function. The `CheckCall` methods in both `MarkCudaSyncCalls` (lines 58-74) and `EliminateCudaSyncCalls` (lines 111-130) contain nearly identical logic for detecting PDL sync function names. Consider extracting this into a common helper function to improve maintainability.
🔎 Proposed refactor
Add a helper function in an anonymous namespace:
namespace { // Returns true if the call is a CUDA PDL sync call bool IsPDLSyncCall(const tir::CallNode *call) { if (!call) return false; if (call->op.same_as(builtin::call_extern())) { if (!call->args.empty()) { if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) { std::string func_name = str_node->value; return func_name == "cudaTriggerProgrammaticLaunchCompletion" || func_name == "cudaGridDependencySynchronize"; } } } return false; } // Returns the specific PDL sync type (0=none, 1=trigger, 2=sync) int GetPDLSyncType(const tir::CallNode *call) { if (!call) return 0; if (call->op.same_as(builtin::call_extern())) { if (!call->args.empty()) { if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) { std::string func_name = str_node->value; if (func_name == "cudaTriggerProgrammaticLaunchCompletion") { return 1; } else if (func_name == "cudaGridDependencySynchronize") { return 2; } } } } return 0; } } // anonymous namespaceThen update both classes to use these helpers.
Also applies to: 111-130
138-145: Consider a more descriptive pass name. The pass name `"tl.MarkCudaSyncCalls"` only mentions "Mark" behavior, but the pass can also eliminate calls when `have_pdl` is false. Consider a name like `"tl.ProcessCudaSyncCalls"` or document the dual behavior clearly.
tilelang/jit/adapter/nvrtc/wrapper.py (1)
479-486: Docstring enhancement suggestion (optional): The docstring is functional but could be improved to match the detail level of `generate_l2_persistent_map`. Consider documenting what PDL synchronization is, the expected structure of `pdl_sync_map`, and when/why this returns non-empty code:
```diff
 def generate_pdl_sync_code(self, function_name: str) -> str:
-    """
-    Generate Python code to insert PDL synchronization for a given kernel.
-    """
+    """Generate Python code to configure PDL synchronization for a kernel.
+
+    PDL (Programmatic Device Launch) enables programmatic stream serialization
+    for fine-grained kernel synchronization on compatible CUDA architectures.
+
+    Args:
+        function_name: Kernel name to check for PDL sync config
+
+    Returns:
+        Python code that sets programmatic stream serialization attributes,
+        or empty string if PDL sync is not configured for this kernel.
+    """
```
📜 Review details
📒 Files selected for processing (2)
src/transform/lower_pdl.cc
tilelang/jit/adapter/nvrtc/wrapper.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.
🧬 Code graph analysis (1)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
call_extern(173-195)tilelang/contrib/nvcc.py (1)
have_pdl(590-595)
🪛 Cppcheck (2.18.0)
src/transform/lower_pdl.cc
[error] 147-147: syntax error
(syntaxError)
🔇 Additional comments (5)
src/transform/lower_pdl.cc (2)
29-81: LGTM! The `MarkCudaSyncCalls` class correctly implements the TVM `StmtExprMutator` pattern to detect PDL synchronization calls and mark the function with appropriate attributes. The logic is defensive and properly chains to the base visitor methods.
147-151: LGTM! Static analysis false positive can be ignored. The FFI registration correctly exposes the pass through TVM's reflection mechanism. The cppcheck syntax error is a false positive — cppcheck doesn't recognize the `TVM_FFI_STATIC_INIT_BLOCK` macro, which is a valid TVM construct.
tilelang/jit/adapter/nvrtc/wrapper.py (3)
172-186: LGTM! Past config overwriting issue has been resolved. The PDL synchronization code correctly assumes the `config` object already exists and modifies it in place, rather than creating a new instance. The placement at line 207 (via placeholder `{11}`) ensures this code executes after `config = CUlaunchConfig()` is created at line 198.
207-207: LGTM! PDL sync placeholder correctly positioned. The placeholder `{11}` is correctly placed after `config` creation and before kernel launch, ensuring PDL attributes can be applied to the existing config object.
423-424: LGTM! PDL sync code correctly integrated. The per-kernel PDL synchronization code is generated and passed as the 11th format argument, correctly matching the `{11}` placeholder in `KERNEL_LAUNCH_FUNC_PY`. The pattern follows the established approach used for L2 persistent map generation.
Also applies to: 438-438
The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/transform/lower_pdl.cc (2)
58-74: Consider extracting duplicated CheckCall logic. The logic for detecting CUDA PDL function names is duplicated between `MarkCudaSyncCalls::CheckCall` and `EliminateCudaSyncCalls::CheckCall`. Consider extracting this into a shared helper function to improve maintainability.
🔎 Proposed refactor to extract common helper
Add a helper function in the anonymous namespace:
+namespace { + +enum class CudaPdlCallType { kNone, kTriggerLaunch, kGridSync }; + +CudaPdlCallType GetCudaPdlCallType(const tir::CallNode *call) { + if (!call || !call->op.same_as(builtin::call_extern()) || call->args.empty()) { + return CudaPdlCallType::kNone; + } + + if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) { + std::string func_name = str_node->value; + if (func_name == "cudaTriggerProgrammaticLaunchCompletion") { + return CudaPdlCallType::kTriggerLaunch; + } else if (func_name == "cudaGridDependencySynchronize") { + return CudaPdlCallType::kGridSync; + } + } + return CudaPdlCallType::kNone; +} + +} // namespace + class MarkCudaSyncCalls : public StmtExprMutator {Then simplify both CheckCall methods:
private: void CheckCall(const tir::CallNode *call) { - if (!call) - return; - if (call->op.same_as(builtin::call_extern())) { - if (!call->args.empty()) { - if (const auto *str_node = - call->args[0].as<tvm::tir::StringImmNode>()) { - std::string func_name = str_node->value; - if (func_name == "cudaTriggerProgrammaticLaunchCompletion") { - has_trigger_launch_ = true; - } else if (func_name == "cudaGridDependencySynchronize") { - has_grid_sync_ = true; - } - } - } + auto call_type = GetCudaPdlCallType(call); + if (call_type == CudaPdlCallType::kTriggerLaunch) { + has_trigger_launch_ = true; + } else if (call_type == CudaPdlCallType::kGridSync) { + has_grid_sync_ = true; } }And in
EliminateCudaSyncCalls:private: bool CheckCall(const tir::CallNode *call) { - if (!call) - return false; - - if (call->op.same_as(builtin::call_extern())) { - if (!call->args.empty()) { - if (const auto *str_node = - call->args[0].as<tvm::tir::StringImmNode>()) { - std::string func_name = str_node->value; - if (func_name == "cudaTriggerProgrammaticLaunchCompletion") { - return true; - } else if (func_name == "cudaGridDependencySynchronize") { - return true; - } - } - } - } - - return false; + return GetCudaPdlCallType(call) != CudaPdlCallType::kNone; }Also applies to: 111-130
138-145: Pass name doesn't reflect dual behavior. The pass is named `"tl.MarkCudaSyncCalls"` but it performs two different operations based on the `have_pdl` parameter: marking calls when true, or eliminating calls when false. Consider renaming to something more generic like `"tl.HandleCudaSyncCalls"` or `"tl.ProcessCudaSyncCalls"` to better reflect this dual behavior.
🔎 Proposed rename
```diff
-  return CreatePrimFuncPass(pass_func, 0, "tl.MarkCudaSyncCalls", {});
+  return CreatePrimFuncPass(pass_func, 0, "tl.HandleCudaSyncCalls", {});
```
📜 Review details
📒 Files selected for processing (1)
src/transform/lower_pdl.cc
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.
🧬 Code graph analysis (1)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
call_extern(173-195)tilelang/contrib/nvcc.py (1)
have_pdl(590-595)
🪛 Cppcheck (2.19.0)
src/transform/lower_pdl.cc
[error] 147-147: syntax error
(syntaxError)
🔇 Additional comments (1)
src/transform/lower_pdl.cc (1)
147-151: Static analysis syntax error is a false positive. The Cppcheck syntax error reported at line 147 is a false positive. The `TVM_FFI_STATIC_INIT_BLOCK` macro is a valid TVM framework construct for registering FFI functions at static initialization time. The implementation is correct.
Thanks for the clarification. From my understanding, the Cython backend itself is fine. In particular, with the tvm_ffi path we eventually call cudaLaunchKernelEx, but it's not obvious to me where (or whether) we can intercept and adjust the cudaLaunchConfig_t there.
You can manipulate this in …
Thanks for the suggestion. Unfortunately, this approach cannot achieve what we need.
https://github.com/apache/tvm/blob/main/src/runtime/cuda/cuda_module.cc#L184-L207 There is an existing draft PR here: apache/tvm#18604
Actionable comments posted: 1
🧹 Nitpick comments (1)
testing/python/jit/test_tilelang_jit_cython.py (1)
41-43: Consider optimizing output tensor initialization. Tensors `b` and `c` are initialized with `torch.randn` but are immediately overwritten by the kernel. Consider using `torch.empty` instead for better performance.
🔎 Proposed optimization
```diff
 a = torch.randn(N, dtype=torch.float32).cuda()
-b = torch.randn(N, dtype=torch.float32).cuda()
-c = torch.randn(N, dtype=torch.float32).cuda()
+b = torch.empty(N, dtype=torch.float32).cuda()
+c = torch.empty(N, dtype=torch.float32).cuda()
```
📜 Review details
📒 Files selected for processing (1)
testing/python/jit/test_tilelang_jit_cython.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/jit/test_tilelang_jit_cython.py
🧬 Code graph analysis (1)
testing/python/jit/test_tilelang_jit_cython.py (4)
tilelang/language/kernel.py (1)
threads (214-218)
tilelang/language/loop.py (1)
Parallel (13-33)
tilelang/language/pdl.py (2)
pdl_trigger (10-14), pdl_sync (17-21)
tilelang/utils/tensor.py (1)
torch_assert_close(231-319)
🔇 Additional comments (5)
testing/python/jit/test_tilelang_jit_cython.py (5)
1-6: LGTM - Imports are correct. All necessary imports are present for the PDL test functionality.
8-11: LGTM - Test setup is appropriate. The test function is properly defined with a reasonable test size for validating basic PDL functionality.
14-35: LGTM - PDL usage and kernel logic are correct. The PDL primitives are used correctly:
- `T.pdl_trigger()` is called after the first kernel completes its computation, signaling completion
- `T.pdl_sync()` is called before the second kernel begins processing, ensuring proper synchronization
The kernel logic correctly implements the expected computation pipeline (B = A + 1.0, then C = B * 2.0).
48-55: LGTM - Test execution and verification are correct. The test properly executes the kernel, computes reference results, and verifies correctness using appropriate tolerances.
58-59: LGTM - Main guard is correctly implemented. The main guard follows the standard pattern for tilelang test files.
I locally implemented similar functionality with that draft PR (: I believe this implementation is sufficient, and we can proceed by adding …
Thanks. Since the ctypes execution backend was removed in #1510, we can also remove the related caching mechanisms in this PR :)
silentCoder-dev
left a comment
At this stage, we only need to implement the extension for the TVM FFI backend.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/transform/lower_pdl.cc (2)
10-10: Remove duplicate include. `tvm/tir/builtin.h` is included on both line 10 and line 15. Remove one of them.
🔎 Proposed fix
#include "tvm/tir/builtin.h" #include "tvm/tir/expr.h" #include "tvm/tir/stmt.h" #include <tvm/ffi/reflection/registry.h> #include <tvm/tir/analysis.h> -#include <tvm/tir/builtin.h> #include <tvm/tir/stmt_functor.h> #include <tvm/tir/transform.h>Also applies to: 15-15
59-73: Consider extracting magic strings to named constants. The function name strings `"cudaTriggerProgrammaticLaunchCompletion"` and `"cudaGridDependencySynchronize"` are hardcoded. While acceptable, extracting them to named constants would improve maintainability.
🔎 Proposed refactor
At the top of the file or in the class:
```cpp
namespace {
constexpr const char* kCudaTriggerPDL = "cudaTriggerProgrammaticLaunchCompletion";
constexpr const char* kCudaGridSync = "cudaGridDependencySynchronize";
}  // namespace
```
Then update the checks:
```diff
         std::string func_name = str_node->value;
-        if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
+        if (func_name == kCudaTriggerPDL) {
           has_trigger_launch_ = true;
-        } else if (func_name == "cudaGridDependencySynchronize") {
+        } else if (func_name == kCudaGridSync) {
           has_grid_sync_ = true;
         }
```
📜 Review details
📒 Files selected for processing (2)
src/transform/lower_pdl.cc
tilelang/engine/phase.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tilelang/engine/phase.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.
Applied to files:
src/transform/lower_pdl.cc
🪛 Cppcheck (2.19.0)
src/transform/lower_pdl.cc
[error] 92-92: syntax error
(syntaxError)
🔇 Additional comments (3)
src/transform/lower_pdl.cc (3)
24-34: LGTM! The `CheckLDGCalls` visitor correctly detects `__ldg` calls and enforces the constraint that they cannot be used with PDL synchronization.
38-57: LGTM! The transformation logic is correct:
- Properly checks PDL support and fails if PDL functions are used when unsupported
- Correctly annotates the function with attributes based on detected PDL calls
- Runs `CheckLDGCalls` on the original body to detect incompatible `__ldg` usage
84-96: LGTM! Cppcheck syntax error is a false positive. The pass creation and FFI registration follow standard TVM patterns and are implemented correctly. The Cppcheck syntax error reported at line 92 is a false positive — `TVM_FFI_STATIC_INIT_BLOCK()` is a TVM macro that expands to valid C++ code, but Cppcheck doesn't understand the macro expansion.
I have thought about this concern, and my reasoning is as follows.
For example, in the following kernel:
```cuda
__global__ void square(int* array, int n) {
  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  if (tid < n)
    array[tid] = array[tid] * array[tid];
  cudaGridDependencySynchronize();
  array[tid] = array[tid] * array[tid];
}
```
```text
# nvcc -gencode arch=compute_90,code=sm_90
# compiled via godbolt.org
...
STG.E desc[UR4][R2.64], R5
.L_x_1:
BSYNC B0
.L_x_0:
ACQBULK
LDG.E R0, desc[UR4][R2.64]
IMAD R5, R0, R0, RZ
STG.E desc[UR4][R2.64], R5
EXIT
...
```
This positional dependency is the key reason I am cautious about aggressively running … If we model PDL via … I also considered an alternative using …:
```python
with T.Kernel(T.ceildiv(N, block_size), threads=block_size) as (bx2,):
    for i in T.Parallel(block_size):
        idx = bx2 * block_size + i
        if idx < N:
            C[idx] = B[idx] * 2.0
    T.pdl_sync()
    for i in T.Parallel(block_size):
        idx = bx2 * block_size + i
        T.pdl_sync()
        if idx < N:
            C[idx] = B[idx] * 2.0
```
In this case, … One possible direction is to avoid … These are my current thoughts. If there are mistakes in this reasoning, or if you see a cleaner or more idiomatic approach, I would greatly appreciate your guidance.
Testing on SM_80 reveals that NVCC fails due to the … A similar pattern can be observed with the …
Actionable comments posted: 1
📜 Review details
📒 Files selected for processing (3)
tilelang/contrib/nvrtc.py
tilelang/jit/adapter/nvrtc/libgen.py
tilelang/jit/adapter/nvrtc/wrapper.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-24T17:20:27.444Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:27.444Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.
Applied to files:
tilelang/jit/adapter/nvrtc/libgen.py
🔇 Additional comments (6)
tilelang/contrib/nvrtc.py (1)
73-76: LGTM! Simple and effective conditional header inclusion for PDL support. The string-based detection for `cudaGridDependencySynchronize` and `cudaTriggerProgrammaticLaunchCompletion` is pragmatic. The header is correctly inserted after the standard NVRTC header, ensuring the device runtime API declarations are available when PDL intrinsics are used.
tilelang/jit/adapter/nvrtc/wrapper.py (5)
45-46: Good refactor: imports moved to module-level template. Moving the `CUlaunchAttribute` and `CUlaunchAttributeID` imports to `PREDEF_HOST_FUNC_PY` avoids redundant imports per kernel and keeps the generated code cleaner.
174-182: LGTM! Config overwrite issue resolved. The template now correctly modifies the existing `config` object created in `KERNEL_LAUNCH_FUNC_PY` rather than creating a new one. The programmatic stream serialization attribute is properly configured.
203-203: Correct placement of PDL sync code injection. The placeholder `{11}` is positioned after `config` initialization but before `cuLaunchKernelEx`, ensuring PDL attributes are applied to the launch configuration at the right time.
419-434: Clean integration with existing two-pass code generation. The PDL sync code generation follows the same pattern as L2 persistent map handling, maintaining consistency with the existing per-kernel configuration approach.
475-482: No changes needed. `pdl_sync_map` is properly initialized in the parent class `TLCUDASourceWrapper` at line 213 as `self.pdl_sync_map: dict[str, int] | None = {}`, and populated at line 450 when kernels have the `has_cuda_pdl_sync` attribute. The method correctly accesses this inherited attribute.
Actionable comments posted: 1
🧹 Nitpick comments (3)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (3)
7-7: LGTM: Module-level pytest import is appropriate. The module-level `pytest` import supports the new PDL test and is best practice. As a follow-up, consider removing the redundant local `import pytest` at line 377 in `test_tvm_ffi_im2col_tma_desc()`.
446-451: LGTM: Compute capability check is correct for PDL. The function correctly gates PDL tests on `compute_capability[0] >= 9`, aligning with the SM_90+ requirement discussed in the PR comments. Note that an identical `check_pdl()` helper exists in `test_tilelang_jit_nvrtc.py`. Consolidating these into a shared test utility (e.g., `tilelang.testing`) could reduce duplication across test modules.
454-505: LGTM: PDL test correctly exercises trigger/sync semantics. The test accurately validates PDL synchronization by having the first kernel trigger completion after writing `B`, and the second kernel sync before reading `B`. The PDL API usage aligns with the definitions in `tilelang/language/pdl.py`, and the result validation is thorough. Optional refinements:
- Test coverage: The current test uses `N=64` with `block_size=256`, launching only a single block per kernel. Consider adding a test case with a larger `N` (e.g., 1024 or 2048) to verify PDL behavior across multiple thread blocks.
- Tensor allocation: Lines 491–492 initialize output tensors `b` and `c` with `torch.randn()` but the kernel immediately overwrites them. Using `torch.empty()` would be slightly more efficient.
📜 Review details
📒 Files selected for processing (3)
3rdparty/tvm
src/target/rt_mod_cuda.cc
testing/python/jit/test_tilelang_jit_tvm_ffi.py
✅ Files skipped from review due to trivial changes (1)
- 3rdparty/tvm
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/jit/test_tilelang_jit_tvm_ffi.py
🧬 Code graph analysis (1)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (1)
tilelang/language/pdl.py (2)
pdl_trigger(10-14)pdl_sync(17-21)
🔇 Additional comments (1)
src/target/rt_mod_cuda.cc (1)
1-1: LGTM: Includes are necessary for PDL support. The new includes provide the attribute definitions (`kHasGridSync`) and runtime metadata (`kUseProgramaticDependentLaunch`) required by the PDL detection logic at lines 43-46.
Also applies to: 4-4
src/target/rt_mod_cuda.cc
Outdated
```cpp
if (f->HasNonzeroAttr(tl::attr::kHasGridSync)) {
  info.launch_param_tags.push_back(
      runtime::launch_param::kUseProgramaticDependentLaunch);
}
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for both spelling variants and the constant definition
echo "=== Searching for 'Programatic' (single-m) ==="
rg 'Programatic' -C 2
echo -e "\n=== Searching for 'Programmatic' (double-m) ==="
rg 'Programmatic' -C 2
echo -e "\n=== Checking the specific constant definition ==="
rg 'kUseProgramatic' -C 3
Repository: tile-ai/tilelang
Length of output: 6726
The constant name contains a spelling error that should be corrected.
The constant kUseProgramaticDependentLaunch uses single-m "Programatic", but all CUDA API functions (cudaTriggerProgrammaticLaunchCompletion), attributes (cudaLaunchAttributeProgrammaticStreamSerialization), and comments throughout the codebase consistently use double-m "Programmatic". Rename the constant to kUseProgrammaticDependentLaunch for consistency.
🤖 Prompt for AI Agents
In src/target/rt_mod_cuda.cc around lines 43 to 46, the constant name
kUseProgramaticDependentLaunch is misspelled (single “m”); rename all
uses/definitions to kUseProgrammaticDependentLaunch to match CUDA naming and
existing codebase conventions, update any declaration, definition, and
references (including headers and enums) to the corrected identifier, and run a
project-wide search/replace to ensure consistency and rebuild to catch any
remaining references.
silentCoder-dev
left a comment
LGTM!
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
```
\t\tconfig.gridDim = {0};
\t\tconfig.blockDim = {1};
\t\tconfig.dynamicSmemBytes = {2};
\t\tcudaLaunchKernelEx(&config, {4}, {3});
```
Pack kernel args for cudaLaunchKernelEx
cudaLaunchKernelEx expects its third parameter to be a void** array of pointers to the kernel arguments, but in the PDL path {3} is populated from call_args = ", ".join(args_list) (see the PDL branch in this file). That means the generated call passes raw arguments instead of an argument pointer array, so the runtime will interpret the first kernel argument as the void** array (and the next as extra), leading to invalid parameter packing or crashes for any has_cuda_pdl_sync kernel launched via the C++ wrapper.
`cudaLaunchKernelExC` uses a `void**` for its arguments, while `cudaLaunchKernelEx` employs a template to accept `ArgsTypes&&... args`.
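For context, the two entry points differ roughly as follows (paraphrased from the CUDA runtime headers; treat the exact signatures as approximate):

```cuda
// C-style entry point: kernel arguments are packed into a void** array,
// exactly as with the classic cudaLaunchKernel.
cudaError_t cudaLaunchKernelExC(const cudaLaunchConfig_t* config,
                                const void* func, void** args);

// C++ template overload: arguments are forwarded directly and packed into
// the pointer array by compiler-generated glue.
template <typename... ExpTypes, typename... ActTypes>
cudaError_t cudaLaunchKernelEx(const cudaLaunchConfig_t* config,
                               void (*kernel)(ExpTypes...),
                               ActTypes&&... args);
```

So a comma-separated argument list is only valid against the template overload; the C-style variant needs the `void**` packing described in the comment above.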
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/target/rt_mod_cuda.cc (1)
36-48: Fix missing closing brace for `kHasGridSync` condition. Line 36 opens an `if` statement for `kHasGridSync`, but line 39 starts the next `if` statement without closing the first one. Add a closing brace after line 38, before the `use_cooperative_groups` check.
```cpp
if (f->HasNonzeroAttr(tl::attr::kHasGridSync)) {
  info.launch_param_tags.push_back(
      runtime::launch_param::kUseProgramaticDependentLaunch);
}
```
♻️ Duplicate comments (1)
src/target/rt_mod_cuda.cc (1)
38-38: Spelling error already flagged in previous review. The constant name `kUseProgramaticDependentLaunch` uses single-m "Programatic", but should be "Programmatic" to match CUDA API naming. This issue was already identified in the previous review.
🧹 Nitpick comments (2)
tilelang/contrib/nvcc.py (1)
589-594: Use `_` for the unused `minor` variable. The `minor` variable is unpacked but never used. Following Python conventions and addressing the static analysis hint, prefix it with an underscore to indicate it's intentionally unused.
🔎 Proposed fix
```diff
 def have_pdl(target):
     if target.kind.name != "cuda":
         return False
     compute_version = get_target_compute_version(target)
-    major, minor = parse_compute_version(compute_version)
+    major, _ = parse_compute_version(compute_version)
     return major >= 9
```
tilelang/language/__init__.py (1)
110-113: Remove unnecessary `# noqa: F401` directives. The F401 rule (unused imports) is not enabled in your Ruff configuration, making these `# noqa` directives unnecessary. You can safely remove them to reduce noise.
🔎 Proposed fix
```diff
 from .pdl import (
-    pdl_trigger,  # noqa: F401
-    pdl_sync,  # noqa: F401
+    pdl_trigger,
+    pdl_sync,
 )
```
📜 Review details
📒 Files selected for processing (3)
src/target/rt_mod_cuda.cc
tilelang/contrib/nvcc.py
tilelang/language/__init__.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-24T17:20:32.819Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:32.819Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.
Applied to files:
tilelang/contrib/nvcc.py
🧬 Code graph analysis (2)
tilelang/language/__init__.py (1)
tilelang/language/pdl.py (2)
pdl_trigger(10-14)pdl_sync(17-21)
tilelang/contrib/nvcc.py (1)
tilelang/contrib/rocm.py (1)
parse_compute_version(179-201)
🪛 Ruff (0.14.10)
tilelang/language/__init__.py
111-111: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
112-112: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
tilelang/contrib/nvcc.py
593-593: Unpacked variable minor is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
🔇 Additional comments (1)
src/target/rt_mod_cuda.cc (1)
1-1: LGTM! The new includes provide the necessary attribute definitions and runtime metadata structures for PDL support.
Also applies to: 4-4
* Introduced a new method, get_host_source, in the CythonKernelAdapter class to return the source code of the host function. This addition enhances the functionality of the adapter by providing direct access to the host kernel source.
Thanks @w169q169 @silentCoder-dev. I tested the example, and the kernels overlap in both the …
Merged. |

Add PDL support as per Issue #1463
Summary by CodeRabbit
New Features
Documentation
Tests
Behavior