
Conversation

Contributor

@w169q169 w169q169 commented Dec 22, 2025

Add PDL support as per Issue #1463

Summary by CodeRabbit

  • New Features

    • Added CUDA PDL primitives pdl_trigger() and pdl_sync() and end-to-end support for programmatic dependent launches on CUDA CC ≥ 9.
  • Documentation

    • Instruction Reference gains a new "Synchronization helpers" subsection documenting the new primitives.
  • Tests

    • Added tests exercising NVRTC, JIT (ctypes/FFI), and TileLang PDL integration.
  • Behavior

    • Host launch and codegen paths now recognize and honor per-kernel PDL synchronization hints.


@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Contributor

coderabbitai bot commented Dec 22, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Adds CUDA Programmatic Dependent Launch (PDL) support: new TileLang helpers (T.pdl_trigger, T.pdl_sync), a compiler pass to detect/annotate PDL externs, codegen/runtime/NVRTC/wrappers to honor annotations, and tests for NVRTC/FFI/Cython paths.
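For orientation, here is a minimal sketch of how the two helpers are meant to be used, adapted from the tests and examples in this thread (the @tilelang.jit decorator and T.Tensor annotations follow common TileLang usage and may differ slightly from the actual test code):

import tilelang
import tilelang.language as T


@tilelang.jit
def two_kernels_with_pdl(N: int = 64, block_size: int = 256):

    @T.prim_func
    def main(
            A: T.Tensor((N,), "float32"),
            B: T.Tensor((N,), "float32"),
            C: T.Tensor((N,), "float32"),
    ):
        # Producer grid: compute B, then signal that dependent launches may start.
        with T.Kernel(T.ceildiv(N, block_size), threads=block_size) as (bx,):
            for i in T.Parallel(block_size):
                idx = bx * block_size + i
                if idx < N:
                    B[idx] = A[idx] + 1.0
            T.pdl_trigger()  # lowers to cudaTriggerProgrammaticLaunchCompletion()

        # Consumer grid: wait on the producer grid before reading B.
        with T.Kernel(T.ceildiv(N, block_size), threads=block_size) as (bx,):
            T.pdl_sync()  # lowers to cudaGridDependencySynchronize()
            for i in T.Parallel(block_size):
                idx = bx * block_size + i
                if idx < N:
                    C[idx] = B[idx] * 2.0

    return main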

Changes

Cohort / File(s) Summary
Docs & Language API
docs/programming_guides/instructions.md, tilelang/language/pdl.py, tilelang/language/__init__.py
Add documentation subsection and public helpers pdl_trigger() / pdl_sync() with re-exports.
Compiler Transform & Attrs
src/transform/lower_pdl.cc, src/transform/common/attr.h, tilelang/transform/__init__.py
New MarkCudaSyncCalls pass scans TIR extern string literals for PDL APIs, annotates PrimFunc with has_cuda_pdl_trigger / has_cuda_pdl_sync, and exposes MarkCudaSyncCallsPass via FFI; new attr names added.
Engine Pipeline
tilelang/engine/phase.py
Inject MarkCudaSyncCalls into OptimizeForTarget conditioned on have_pdl(target).
Target Codegen & Runtime Metadata
src/target/codegen_cuda.cc, src/target/rt_mod_cuda.cc
Respect has_cuda_pdl_sync: suppress __restrict__ emission when set and append kUseProgramaticDependentLaunch tag to launch metadata.
IR Rewriter
src/transform/warp_specialized_rewriter.cc
Treat PDL extern calls (cudaGridDependencySynchronize, cudaTriggerProgrammaticLaunchCompletion) as dual producer/consumer (kBoth) in warp rewriting.
NVRTC / Compiler Glue
tilelang/contrib/nvcc.py, tilelang/contrib/nvrtc.py, tilelang/jit/adapter/nvrtc/libgen.py
have_pdl(target) helper (CUDA sm >= 9); conditional inclusion of cuda_device_runtime_api.h when PDL symbols are referenced; conditional NVRTC include path handling by compiler version.
NVRTC Launch Wrapper
tilelang/jit/adapter/nvrtc/wrapper.py, tilelang/jit/adapter/nvrtc/libgen.py
Add PDL_SYNC_PY snippet and generate_pdl_sync_code() to inject per-kernel PDL launch attribute code into NVRTC Python launch wrapper.
Host Launch Wrapper
tilelang/jit/adapter/wrapper.py
Add pdl_sync_map and alternate programmatic stream-serialization launch path for functions with has_cuda_pdl_sync; change host function iteration to use name hints.
Cython Adapter
tilelang/jit/adapter/cython/adapter.py
Expose get_host_source() to return host kernel source for host-wrapper needs.
Tests
testing/python/language/test_tilelang_language_pdl.py, testing/python/jit/test_tilelang_jit_nvrtc.py, testing/python/jit/test_tilelang_jit_cython.py, testing/python/jit/test_tilelang_jit_tvm_ffi.py
Add unit/integration tests asserting emitted extern names and end-to-end PDL flows across NVRTC/FFI/Cython; tests gated on CUDA compute capability >= 9 where appropriate.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Dev as Developer
    participant TL as TileLang front-end
    participant Compiler as TVM Compiler
    participant MarkPass as MarkCudaSyncCalls (pass)
    participant Codegen as CUDA codegen / NVRTC source
    participant NVRTC as NVRTC (compiler/runtime)
    participant Host as Host runtime / wrapper

    Dev->>TL: Write kernel using T.pdl_trigger()/T.pdl_sync()
    TL->>Compiler: Lower to TIR (extern calls emitted)
    Compiler->>MarkPass: Run MarkCudaSyncCalls (OptimizeForTarget)
    MarkPass->>Compiler: Annotate PrimFunc attrs (has_cuda_pdl_*)
    Compiler->>Codegen: Generate CUDA source (may inject device runtime header)
    Codegen->>NVRTC: Compile and return kernel + metadata (pdl_map)
    NVRTC->>Host: Provide compiled module and pdl metadata
    Host->>Host: Consult pdl_sync_map for function
    alt function marked for PDL
        Host->>NVRTC: Launch with programmatic stream-serialization attributes
    else
        Host->>NVRTC: Launch normally
    end
    NVRTC-->>Dev: Kernel runs and returns results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related issues

  • [Feature Request] PDL Support #1463: Implements the requested PDL feature (T.pdl_trigger / T.pdl_sync) including detection/annotation and runtime/launcher wiring described in the issue.

"i'm a rabbit in a code-lined patch,
i nibble bugs and leave a match.
triggers chirp and syncs align,
kernels flow in CUDA time.
hop! 🐇✨"

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 31.15% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title '[Feat] PDL Support' directly summarizes the main feature addition across all changes in the changeset.

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e6988e5 and bfd2c23.

📒 Files selected for processing (1)
  • tilelang/jit/adapter/cython/adapter.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-26T06:45:51.789Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1483
File: tilelang/jit/adapter/cutedsl/adapter.py:93-95
Timestamp: 2025-12-26T06:45:51.789Z
Learning: For the CuTeDSL backend in tilelang/jit/adapter/cutedsl/adapter.py, the host_kernel_source and device_kernel_source have the same value.

Applied to files:

  • tilelang/jit/adapter/cython/adapter.py
🧬 Code graph analysis (1)
tilelang/jit/adapter/cython/adapter.py (2)
tilelang/jit/adapter/tvm_ffi.py (1)
  • get_host_source (298-302)
tilelang/jit/kernel.py (1)
  • get_host_source (451-458)
🔇 Additional comments (1)
tilelang/jit/adapter/cython/adapter.py (1)

388-391: LGTM!

The new get_host_source() method is correctly implemented and aligns with the interface expected by tilelang/jit/kernel.py. The host_kernel_source attribute is guaranteed to be set in both __init__ (line 127) and from_database (line 167) initialization paths, so returning it directly is safe.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (8)
tilelang/jit/adapter/libgen.py (1)

151-153: Consider truncating lib_code in error message.

Including the full source code in the exception can produce extremely long error messages, especially for large kernels. Consider truncating or omitting self.lib_code from the error, or logging it separately at debug level.

🔎 Proposed improvement
         if ret.returncode != 0:
             command_str = " ".join(command)
-            raise RuntimeError(f"Compilation Failed! {command_str}\n {self.lib_code}")
+            raise RuntimeError(f"Compilation Failed! Command: {command_str}\nSource saved at: {src.name}")
testing/python/jit/test_tilelang_jit_nvrtc.py (1)

505-552: Consider adding a compute capability check for PDL support.

Based on tilelang/contrib/nvcc.py (lines 589-594), PDL requires compute capability >= 9.0 (Hopper or later). Other tests in this file, like test_nvrtc_im2col_tma_desc, skip on non-Hopper GPUs. Consider adding a similar check here to avoid test failures on older hardware.

🔎 Proposed fix
 def test_nvrtc_pdl():
     """Test pdl."""
+    if not check_hopper():
+        import pytest
+
+        pytest.skip("Test requires Hopper GPU (compute capability 9.0) for PDL support")
 
     N = 64
tilelang/jit/adapter/wrapper.py (1)

213-213: Consider using set instead of dict for pdl_sync_map.

The pdl_sync_map dictionary always stores 0 as the value (line 450). If the value is unused, a set would be more appropriate and clearer in intent.

🔎 Proposed change
-        self.pdl_sync_map: dict[str, int] | None = {}
+        self.pdl_sync_map: set[str] | None = set()

And update usages:

-            if "has_cuda_pdl_sync" in attrs:
-                self.pdl_sync_map[function_name] = 0
+            if "has_cuda_pdl_sync" in attrs:
+                self.pdl_sync_map.add(function_name)
tilelang/transform/__init__.py (1)

403-406: LGTM with suggestion: Consider expanding the docstring.

The function follows the established pattern for pass wrappers. Consider adding a more detailed docstring similar to other functions in this file, documenting the have_pdl parameter and return type.

🔎 Suggested docstring enhancement
 def MarkCudaSyncCalls(have_pdl: bool = False):
-    """MarkCudaSyncCalls"""
+    """Mark CUDA synchronization calls for PDL support.
+
+    Parameters
+    ----------
+    have_pdl : bool
+        Whether the target supports PDL (Programmatic Dependent Launch).
+        PDL is available on compute capability >= 9.0 (Hopper+).
+
+    Returns
+    -------
+    fpass : tvm.transform.Pass
+        The result pass
+    """
     return _ffi_api.MarkCudaSyncCalls(have_pdl)  # type: ignore
tilelang/contrib/nvcc.py (1)

590-596: Minor: Prefix unused variable with underscore.

The minor variable from parse_compute_version is unpacked but never used. This is a valid static analysis hint.

Proposed fix
 def have_pdl(target):
     if target.kind.name != "cuda":
         return False
     compute_version = get_target_compute_version(target)
-    major, minor = parse_compute_version(compute_version)
+    major, _ = parse_compute_version(compute_version)
     return major >= 9
testing/python/jit/test_tilelang_jit_ctypes.py (1)

1-1: Remove unused import.

The tvm import is not used in this test file.

Proposed fix
-from tilelang import tvm as tvm
 import tilelang.language as T
src/transform/lower_pdl.cc (2)

83-83: Typo in class name: ElininateCudaSyncCalls → EliminateCudaSyncCalls.

The class name has a typo ("Elininate" instead of "Eliminate"). While this doesn't affect functionality, fixing it improves code readability and maintainability.

Proposed fix
-class ElininateCudaSyncCalls : public StmtExprMutator {
+class EliminateCudaSyncCalls : public StmtExprMutator {
 public:
   static PrimFunc Substitute(PrimFunc f) {
-    ElininateCudaSyncCalls mutator;
+    EliminateCudaSyncCalls mutator;
     PrimFunc new_f = f;
     new_f.CopyOnWrite()->body = mutator.VisitStmt(f->body);
 
     return new_f;
   }
   // ... rest of class ...
 private:
-  ElininateCudaSyncCalls() = default;
+  EliminateCudaSyncCalls() = default;
 };

Also update line 141:

   return have_pdl ? MarkCudaSyncCalls::Substitute(f)
-                  : ElininateCudaSyncCalls::Substitute(f);
+                  : EliminateCudaSyncCalls::Substitute(f);

57-78: Duplicate private: access specifier.

There are two private: access specifiers (lines 57 and 76). Consider consolidating them for cleaner code organization.

Proposed fix
-private:
   void CheckCall(const tir::CallNode *call) {
     if (!call)
       return;
     // ... implementation ...
   }

-private:
   bool has_trigger_launch_ = false;
   bool has_grid_sync_ = false;

   MarkCudaSyncCalls() = default;
+
+private:
+  void CheckCall(const tir::CallNode *call) {
+    if (!call)
+      return;
+    // ... implementation ...
+  }
+
+  bool has_trigger_launch_ = false;
+  bool has_grid_sync_ = false;
+
+  MarkCudaSyncCalls() = default;
 };

Note: The Cppcheck "syntax error" on line 147 is a false positive — TVM_FFI_STATIC_INIT_BLOCK is a valid TVM macro for FFI registration.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a431797 and 2a67387.

📒 Files selected for processing (13)
  • docs/programming_guides/instructions.md
  • src/transform/lower_pdl.cc
  • testing/python/jit/test_tilelang_jit_ctypes.py
  • testing/python/jit/test_tilelang_jit_nvrtc.py
  • testing/python/language/test_tilelang_language_pdl.py
  • tilelang/contrib/nvcc.py
  • tilelang/engine/phase.py
  • tilelang/jit/adapter/libgen.py
  • tilelang/jit/adapter/nvrtc/wrapper.py
  • tilelang/jit/adapter/wrapper.py
  • tilelang/language/__init__.py
  • tilelang/language/pdl.py
  • tilelang/transform/__init__.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/jit/test_tilelang_jit_nvrtc.py
  • testing/python/jit/test_tilelang_jit_ctypes.py
  • tilelang/jit/adapter/nvrtc/wrapper.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.

Applied to files:

  • testing/python/language/test_tilelang_language_pdl.py
🧬 Code graph analysis (10)
tilelang/transform/__init__.py (2)
src/transform/lower_pdl.cc (1)
  • MarkCudaSyncCalls (80-80)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
tilelang/language/__init__.py (1)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
tilelang/language/pdl.py (1)
tilelang/language/tir/op.py (1)
  • call_extern (173-195)
tilelang/engine/phase.py (3)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
tilelang/transform/__init__.py (1)
  • MarkCudaSyncCalls (403-405)
tilelang/language/ast/ir.py (1)
  • target (1677-1707)
testing/python/language/test_tilelang_language_pdl.py (1)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
testing/python/jit/test_tilelang_jit_ctypes.py (3)
testing/python/jit/test_tilelang_jit_nvrtc.py (7)
  • test_nvrtc_pdl (505-552)
  • multi_kernels_with_pdl (511-532)
  • main (30-50)
  • main (138-158)
  • main (384-413)
  • main (513-530)
  • kernel (469-484)
testing/python/language/test_tilelang_language_pdl.py (2)
  • main (7-16)
  • main (23-32)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
  • call_extern (173-195)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
tilelang/contrib/nvcc.py (1)
tilelang/contrib/rocm.py (1)
  • parse_compute_version (179-201)
tilelang/jit/adapter/wrapper.py (1)
tilelang/jit/adapter/utils.py (1)
  • parse_function_call_args (315-357)
tilelang/jit/adapter/libgen.py (2)
tilelang/jit/adapter/ctypes/adapter.py (1)
  • lib_code (287-289)
tilelang/jit/adapter/cython/adapter.py (1)
  • lib_code (371-373)
🪛 Cppcheck (2.18.0)
src/transform/lower_pdl.cc

[error] 147-147: syntax error

(syntaxError)

🪛 Ruff (0.14.8)
tilelang/language/__init__.py

120-120: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


121-121: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

tilelang/language/pdl.py

4-7: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

tilelang/contrib/nvcc.py

594-594: Unpacked variable minor is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

tilelang/jit/adapter/libgen.py

153-153: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (8)
tilelang/jit/adapter/wrapper.py (3)

141-155: LGTM: PDL kernel launch code structure looks correct.

The KERNEL_LAUNCH_FUNC_CODE correctly configures cudaLaunchConfig_t with cudaLaunchAttributeProgrammaticStreamSerialization and uses cudaLaunchKernelEx for the kernel launch. The attribute setup follows CUDA's programmatic stream serialization pattern.


313-330: LGTM: PDL branch in dispatch function is correctly structured.

The new PDL path properly:

  1. Parses function call arguments
  2. Validates argument count matches parameters
  3. Uses KERNEL_LAUNCH_FUNC_CODE for kernel launch with PDL attributes
  4. Includes error checking after launch

448-450: LGTM: PDL attribute detection correctly populates the sync map.

The has_cuda_pdl_sync attribute is checked and the function is registered in pdl_sync_map for later use during code generation.
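To make the bookkeeping concrete, here is a rough sketch of how such a map could be populated and consulted; only pdl_sync_map and the has_cuda_pdl_sync attribute name come from this PR, while the module iteration and helper names are illustrative:

# Illustrative sketch: kernels marked by the MarkCudaSyncCalls pass are recorded in
# pdl_sync_map, which the wrapper consults when choosing the launch template.
pdl_sync_map: dict[str, int] = {}


def register_pdl_kernels(device_mod) -> None:
    """Record every device function that carries the has_cuda_pdl_sync attribute."""
    for global_var, prim_func in device_mod.functions.items():  # assumed IRModule-style API
        attrs = prim_func.attrs or {}
        if "has_cuda_pdl_sync" in attrs:
            pdl_sync_map[global_var.name_hint] = 0  # value unused; membership is what matters


def needs_pdl_launch(function_name: str) -> bool:
    """True if this kernel should be launched via cudaLaunchKernelEx with the
    programmatic stream-serialization attribute."""
    return function_name in pdl_sync_map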

tilelang/language/__init__.py (1)

119-123: LGTM!

The PDL function re-exports follow the established pattern in this file. The # noqa: F401 comments are appropriate for suppressing "imported but unused" warnings in __init__.py files intended to expose public API.

docs/programming_guides/instructions.md (1)

141-144: LGTM!

Documentation for the new PDL synchronization helpers is clear and follows the established format of the instruction reference.

tilelang/engine/phase.py (1)

255-257: LGTM!

The MarkCudaSyncCalls pass is correctly integrated after SplitHostDevice. The unconditional application is appropriate since the pass internally handles both cases: marking sync calls when PDL is available (sm_90+) or eliminating them for older architectures.

testing/python/language/test_tilelang_language_pdl.py (1)

37-61: Test implementation looks good.

The tests correctly verify that PDL synchronization calls are present in generated code for sm_90 (Hopper+) and absent for sm_75 (pre-Hopper). This validates the MarkCudaSyncCalls pass behavior.

src/transform/lower_pdl.cc (1)

29-43: LGTM on the MarkCudaSyncCalls implementation.

The mutator correctly scans for CUDA sync extern calls and annotates the PrimFunc with has_cuda_pdl_trigger and has_cuda_pdl_sync attributes. The pattern of visiting both statements and expressions ensures comprehensive detection.

@w169q169 w169q169 mentioned this pull request Dec 22, 2025
1 task
@silentCoder-dev silentCoder-dev self-assigned this Dec 22, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
src/transform/lower_pdl.cc (3)

9-9: Remove duplicate include.

The header tvm/tir/builtin.h is included twice (line 9 and line 14).

🔎 Proposed fix
 #include "../op/builtin.h"
 #include "../target/utils.h"
 #include "tvm/ir/type.h"
 #include "tvm/tir/builtin.h"
 #include "tvm/tir/expr.h"
 #include "tvm/tir/stmt.h"
 #include <tvm/ffi/reflection/registry.h>
 #include <tvm/tir/analysis.h>
-#include <tvm/tir/builtin.h>
 #include <tvm/tir/stmt_functor.h>
 #include <tvm/tir/transform.h>

Also applies to: 14-14


58-74: Extract duplicated logic into a helper function.

The CheckCall methods in both MarkCudaSyncCalls (lines 58-74) and EliminateCudaSyncCalls (lines 111-130) contain nearly identical logic for detecting PDL sync function names. Consider extracting this into a common helper function to improve maintainability.

🔎 Proposed refactor

Add a helper function in an anonymous namespace:

namespace {
// Returns true if the call is a CUDA PDL sync call
bool IsPDLSyncCall(const tir::CallNode *call) {
  if (!call)
    return false;
  
  if (call->op.same_as(builtin::call_extern())) {
    if (!call->args.empty()) {
      if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) {
        std::string func_name = str_node->value;
        return func_name == "cudaTriggerProgrammaticLaunchCompletion" ||
               func_name == "cudaGridDependencySynchronize";
      }
    }
  }
  return false;
}

// Returns the specific PDL sync type (0=none, 1=trigger, 2=sync)
int GetPDLSyncType(const tir::CallNode *call) {
  if (!call)
    return 0;
  
  if (call->op.same_as(builtin::call_extern())) {
    if (!call->args.empty()) {
      if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) {
        std::string func_name = str_node->value;
        if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
          return 1;
        } else if (func_name == "cudaGridDependencySynchronize") {
          return 2;
        }
      }
    }
  }
  return 0;
}
} // anonymous namespace

Then update both classes to use these helpers.

Also applies to: 111-130


138-145: Consider more descriptive pass name.

The pass name "tl.MarkCudaSyncCalls" only mentions "Mark" behavior, but the pass can also eliminate calls when have_pdl is false. Consider a name like "tl.ProcessCudaSyncCalls" or document the dual behavior clearly.

tilelang/jit/adapter/nvrtc/wrapper.py (1)

479-486: Docstring enhancement suggestion (optional):

The docstring is functional but could be improved to match the detail level of generate_l2_persistent_map. Consider documenting what PDL synchronization is, the expected structure of pdl_sync_map, and when/why this returns non-empty code:

    def generate_pdl_sync_code(self, function_name: str) -> str:
-        """
-        Generate Python code to insert PDL synchronization for a given kernel.
-        """
+        """Generate Python code to configure PDL synchronization for a kernel.
+
+        PDL (Programmatic Dependent Launch) enables programmatic stream serialization
+        for fine-grained kernel synchronization on compatible CUDA architectures.
+
+        Args:
+            function_name: Kernel name to check for PDL sync config
+
+        Returns:
+            Python code that sets programmatic stream serialization attributes,
+            or empty string if PDL sync is not configured for this kernel.
+        """
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f50aa99 and cb052f4.

📒 Files selected for processing (2)
  • src/transform/lower_pdl.cc
  • tilelang/jit/adapter/nvrtc/wrapper.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.
🧬 Code graph analysis (1)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
  • call_extern (173-195)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
🪛 Cppcheck (2.18.0)
src/transform/lower_pdl.cc

[error] 147-147: syntax error

(syntaxError)

🔇 Additional comments (5)
src/transform/lower_pdl.cc (2)

29-81: LGTM!

The MarkCudaSyncCalls class correctly implements the TVM StmtExprMutator pattern to detect PDL synchronization calls and mark the function with appropriate attributes. The logic is defensive and properly chains to the base visitor methods.


147-151: LGTM! Static analysis false positive can be ignored.

The FFI registration correctly exposes the pass through TVM's reflection mechanism. The cppcheck syntax error is a false positive—cppcheck doesn't recognize the TVM_FFI_STATIC_INIT_BLOCK macro, which is a valid TVM construct.

tilelang/jit/adapter/nvrtc/wrapper.py (3)

172-186: LGTM! Past config overwriting issue has been resolved.

The PDL synchronization code correctly assumes the config object already exists and modifies it in place, rather than creating a new instance. The placement at line 207 (via placeholder {11}) ensures this code executes after config = CUlaunchConfig() is created at line 198.


207-207: LGTM! PDL sync placeholder correctly positioned.

The placeholder {11} is correctly placed after config creation and before kernel launch, ensuring PDL attributes can be applied to the existing config object.
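For reference, the attribute the injected snippet configures corresponds roughly to the following cuda-python driver calls (a sketch; the import path, struct field names, and enum spelling depend on the cuda-python version and are assumptions, not the wrapper's exact code):

from cuda import cuda  # cuda-python driver bindings; import path varies by version


def enable_programmatic_stream_serialization(config) -> None:
    """Attach the programmatic stream-serialization attribute to an existing
    CUlaunchConfig in place, mirroring what the generated PDL snippet does."""
    attr = cuda.CUlaunchAttribute()
    attr.id = cuda.CUlaunchAttributeID.CU_LAUNCH_ATTRIBUTE_PROGRAMMATIC_STREAM_SERIALIZATION
    attr.value.programmaticStreamSerializationAllowed = 1
    config.attrs = [attr]
    config.numAttrs = 1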


423-424: LGTM! PDL sync code correctly integrated.

The per-kernel PDL synchronization code is generated and passed as the 11th format argument, correctly matching the {11} placeholder in KERNEL_LAUNCH_FUNC_PY. The pattern follows the established approach used for L2 persistent map generation.

Also applies to: 438-438

@silentCoder-dev
Collaborator

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
src/transform/lower_pdl.cc (2)

58-74: Consider extracting duplicated CheckCall logic.

The logic for detecting CUDA PDL function names is duplicated between MarkCudaSyncCalls::CheckCall and EliminateCudaSyncCalls::CheckCall. Consider extracting this into a shared helper function to improve maintainability.

🔎 Proposed refactor to extract common helper

Add a helper function in the anonymous namespace:

+namespace {
+
+enum class CudaPdlCallType { kNone, kTriggerLaunch, kGridSync };
+
+CudaPdlCallType GetCudaPdlCallType(const tir::CallNode *call) {
+  if (!call || !call->op.same_as(builtin::call_extern()) || call->args.empty()) {
+    return CudaPdlCallType::kNone;
+  }
+  
+  if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) {
+    std::string func_name = str_node->value;
+    if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
+      return CudaPdlCallType::kTriggerLaunch;
+    } else if (func_name == "cudaGridDependencySynchronize") {
+      return CudaPdlCallType::kGridSync;
+    }
+  }
+  return CudaPdlCallType::kNone;
+}
+
+} // namespace
+
 class MarkCudaSyncCalls : public StmtExprMutator {

Then simplify both CheckCall methods:

 private:
   void CheckCall(const tir::CallNode *call) {
-    if (!call)
-      return;
-    if (call->op.same_as(builtin::call_extern())) {
-      if (!call->args.empty()) {
-        if (const auto *str_node =
-                call->args[0].as<tvm::tir::StringImmNode>()) {
-          std::string func_name = str_node->value;
-          if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
-            has_trigger_launch_ = true;
-          } else if (func_name == "cudaGridDependencySynchronize") {
-            has_grid_sync_ = true;
-          }
-        }
-      }
+    auto call_type = GetCudaPdlCallType(call);
+    if (call_type == CudaPdlCallType::kTriggerLaunch) {
+      has_trigger_launch_ = true;
+    } else if (call_type == CudaPdlCallType::kGridSync) {
+      has_grid_sync_ = true;
     }
   }

And in EliminateCudaSyncCalls:

 private:
   bool CheckCall(const tir::CallNode *call) {
-    if (!call)
-      return false;
-
-    if (call->op.same_as(builtin::call_extern())) {
-      if (!call->args.empty()) {
-        if (const auto *str_node =
-                call->args[0].as<tvm::tir::StringImmNode>()) {
-          std::string func_name = str_node->value;
-          if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
-            return true;
-          } else if (func_name == "cudaGridDependencySynchronize") {
-            return true;
-          }
-        }
-      }
-    }
-
-    return false;
+    return GetCudaPdlCallType(call) != CudaPdlCallType::kNone;
   }

Also applies to: 111-130


138-145: Pass name doesn't reflect dual behavior.

The pass is named "tl.MarkCudaSyncCalls" but it performs two different operations based on the have_pdl parameter: marking calls when true, or eliminating calls when false. Consider renaming to something more generic like "tl.HandleCudaSyncCalls" or "tl.ProcessCudaSyncCalls" to better reflect this dual behavior.

🔎 Proposed rename
-  return CreatePrimFuncPass(pass_func, 0, "tl.MarkCudaSyncCalls", {});
+  return CreatePrimFuncPass(pass_func, 0, "tl.HandleCudaSyncCalls", {});
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cb052f4 and 466c94a.

📒 Files selected for processing (1)
  • src/transform/lower_pdl.cc
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.
🧬 Code graph analysis (1)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
  • call_extern (173-195)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
🪛 Cppcheck (2.19.0)
src/transform/lower_pdl.cc

[error] 147-147: syntax error

(syntaxError)

🔇 Additional comments (1)
src/transform/lower_pdl.cc (1)

147-151: Static analysis syntax error is a false positive.

The Cppcheck syntax error reported at line 147 is a false positive. The TVM_FFI_STATIC_INIT_BLOCK macro is a valid TVM framework construct for registering FFI functions at static initialization time. The implementation is correct.

@w169q169
Copy link
Contributor Author

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

Thanks for the clarification. From my understanding, the Cython backend itself is fine.
The difficulty we’re running into is that PDL requires modifying the launchConfig for the secondary kernel, and I’m not sure how this can be done cleanly when going through tvm_ffi.

In particular, with the tvm_ffi path we eventually call cudaLaunchKernelEx, but it’s not obvious to me where (or whether) we can intercept and adjust the cudaLaunchConfig_t there.
If there is a supported or recommended way to modify the cudaLaunchKernelEx config in the tvm_ffi flow, that would be ideal. Otherwise, this seems to be the main blocker for enabling PDL in this backend.

@silentCoder-dev
Collaborator


You can manipulate this in src/runtime/runtime.cc. For example, see the test test_tvm_ffi_l2_persistent_map in testing/python/jit/test_tilelang_jit_tvm_ffi.py, which invokes tvm_cuda_stream_set_access_policy_window.

@w169q169
Contributor Author


Thanks for the suggestion. Unfortunately, this approach cannot achieve what we need.

tvm_cuda_stream_set_access_policy_window only modifies the behavior of the CUDA stream, which is fine for L2 persistence, but PDL specifically requires modifying the launchConfig of the kernel. In the tvm_ffi path, cudaLaunchKernel is implemented inside TVM itself, at:

https://github.com/apache/tvm/blob/main/src/runtime/cuda/cuda_module.cc#L184-L207

There is an existing draft PR here: apache/tvm#18604
Once this is merged in TVM, we can continue implementing PDL support in the tvm_ffi backend.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
testing/python/jit/test_tilelang_jit_cython.py (1)

41-43: Consider optimizing output tensor initialization.

Tensors b and c are initialized with torch.randn but are immediately overwritten by the kernel. Consider using torch.empty instead for better performance.

🔎 Proposed optimization
 a = torch.randn(N, dtype=torch.float32).cuda()
-b = torch.randn(N, dtype=torch.float32).cuda()
-c = torch.randn(N, dtype=torch.float32).cuda()
+b = torch.empty(N, dtype=torch.float32).cuda()
+c = torch.empty(N, dtype=torch.float32).cuda()
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 466c94a and 05b8dc0.

📒 Files selected for processing (1)
  • testing/python/jit/test_tilelang_jit_cython.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/jit/test_tilelang_jit_cython.py
🧬 Code graph analysis (1)
testing/python/jit/test_tilelang_jit_cython.py (4)
tilelang/language/kernel.py (1)
  • threads (214-218)
tilelang/language/loop.py (1)
  • Parallel (13-33)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
tilelang/utils/tensor.py (1)
  • torch_assert_close (231-319)
🔇 Additional comments (5)
testing/python/jit/test_tilelang_jit_cython.py (5)

1-6: LGTM - Imports are correct.

All necessary imports are present for the PDL test functionality.


8-11: LGTM - Test setup is appropriate.

The test function is properly defined with a reasonable test size for validating basic PDL functionality.


14-35: LGTM - PDL usage and kernel logic are correct.

The PDL primitives are used correctly:

  • T.pdl_trigger() is called after the first kernel completes its computation, signaling completion
  • T.pdl_sync() is called before the second kernel begins processing, ensuring proper synchronization

The kernel logic correctly implements the expected computation pipeline (B = A + 1.0, then C = B * 2.0).


48-55: LGTM - Test execution and verification are correct.

The test properly executes the kernel, computes reference results, and verifies correctness using appropriate tolerances.


58-59: LGTM - Main guard is correctly implemented.

The main guard follows the standard pattern for tilelang test files.

@silentCoder-dev
Collaborator


I locally implemented similar functionality with that draft PR :) I believe this implementation is sufficient, and we can proceed by adding has_programmatic_dependent_launch soon after that PR is merged.

@LeiWang1999
Member

Thanks. Since the ctypes execution backend was removed in #1510, we can also remove the related caching mechanisms in this PR :)

Collaborator

@silentCoder-dev silentCoder-dev left a comment


At this stage, we only need to implement the extension for the TVM FFI backend.

@silentCoder-dev silentCoder-dev dismissed their stale review December 25, 2025 02:47

This issue is resolved

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
src/transform/lower_pdl.cc (2)

10-10: Remove duplicate include.

tvm/tir/builtin.h is included on both line 10 and line 15. Remove one of them.

🔎 Proposed fix
 #include "tvm/tir/builtin.h"
 #include "tvm/tir/expr.h"
 #include "tvm/tir/stmt.h"
 #include <tvm/ffi/reflection/registry.h>
 #include <tvm/tir/analysis.h>
-#include <tvm/tir/builtin.h>
 #include <tvm/tir/stmt_functor.h>
 #include <tvm/tir/transform.h>

Also applies to: 15-15


59-73: Consider extracting magic strings to named constants.

The function name strings "cudaTriggerProgrammaticLaunchCompletion" and "cudaGridDependencySynchronize" are hardcoded. While acceptable, extracting them to named constants would improve maintainability.

🔎 Proposed refactor

At the top of the file or in the class:

namespace {
constexpr const char* kCudaTriggerPDL = "cudaTriggerProgrammaticLaunchCompletion";
constexpr const char* kCudaGridSync = "cudaGridDependencySynchronize";
}

Then update the checks:

           std::string func_name = str_node->value;
-          if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
+          if (func_name == kCudaTriggerPDL) {
             has_trigger_launch_ = true;
-          } else if (func_name == "cudaGridDependencySynchronize") {
+          } else if (func_name == kCudaGridSync) {
             has_grid_sync_ = true;
           }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between af5f26d and c123fd7.

📒 Files selected for processing (2)
  • src/transform/lower_pdl.cc
  • tilelang/engine/phase.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tilelang/engine/phase.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

  • src/transform/lower_pdl.cc
🪛 Cppcheck (2.19.0)
src/transform/lower_pdl.cc

[error] 92-92: syntax error

(syntaxError)

🔇 Additional comments (3)
src/transform/lower_pdl.cc (3)

24-34: LGTM!

The CheckLDGCalls visitor correctly detects __ldg calls and enforces the constraint that they cannot be used with PDL synchronization.


38-57: LGTM!

The transformation logic is correct:

  • Properly checks PDL support and fails if PDL functions are used when unsupported
  • Correctly annotates the function with attributes based on detected PDL calls
  • Runs CheckLDGCalls on the original body to detect incompatible __ldg usage

84-96: LGTM! Cppcheck syntax error is a false positive.

The pass creation and FFI registration follow standard TVM patterns and are implemented correctly. The Cppcheck syntax error reported at line 92 is a false positive—TVM_FFI_STATIC_INIT_BLOCK() is a TVM macro that expands to valid C++ code, but Cppcheck doesn't understand the macro expansion.

@w169q169
Contributor Author

When have_pdl is not supported, we should skip the lower_pdl pass. Running it to try and eliminate PDL calls would be incorrect or unnecessary.

I have thought about this concern, and my reasoning is as follows.

cudaGridDependencySynchronize() and cudaTriggerProgrammaticLaunchCompletion() are not just semantic markers; they lower to concrete device instructions whose placement in the instruction stream matters.

For example, in the following kernel:

__global__ void square(int* array, int n) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < n)
        array[tid] = array[tid] * array[tid];
        cudaGridDependencySynchronize();
        array[tid] = array[tid] * array[tid];
}

cudaGridDependencySynchronize() is lowered to specific SASS instructions (BSYNC B0 and ACQBULK), which are emitted at precise locations in the control flow:

# nvcc -gencode arch=compute_90,code=sm_90
# compiled via godbolt.org

...
 STG.E desc[UR4][R2.64], R5 
.L_x_1:
 BSYNC B0 
.L_x_0:
 ACQBULK 
 LDG.E R0, desc[UR4][R2.64] 
 IMAD R5, R0, R0, RZ 
 STG.E desc[UR4][R2.64], R5 
 EXIT 
...

This positional dependency is the key reason I am cautious about aggressively running lower_pdl when have_pdl is not supported.

If we model PDL via tir.call_extern, then I agree this introduces extra passes whose sole purpose is to eliminate these calls on non-CUDA targets. That is redundant, but at least it preserves exact placement semantics until codegen.

I also considered an alternative using block.attr, similar to T.annotate_l2_hit_ratio. This avoids enabling a CUDA-specific lowering pass on non-CUDA targets. However, the drawback is that we lose precise location information. For example:

with T.Kernel(T.ceildiv(N, block_size), threads=block_size) as (bx2,):
    for i in T.Parallel(block_size):
        idx = bx2 * block_size + i
        if idx < N:
            C[idx] = B[idx] * 2.0
    T.pdl_sync()
    for i in T.Parallel(block_size):
        idx = bx2 * block_size + i
        T.pdl_sync()
        if idx < N:
            C[idx] = B[idx] * 2.0

In this case, pdl_sync becomes attached to the T.Kernel block, but its exact position relative to surrounding instructions is no longer explicit, which is problematic for something that lowers to synchronization instructions.

One possible direction is to avoid call_extern and instead use attr, then explicitly materialize the corresponding device-side calls during codegen for each backend. However, the trade-off is that this approach still requires backend-specific handling and is conceptually similar in cost to eliminating function calls via passes.

These are my current thoughts. If there are mistakes in this reasoning, or if you see a cleaner or more idiomatic approach, I would greatly appreciate your guidance.

@silentCoder-dev
Collaborator

silentCoder-dev commented Dec 25, 2025


Testing on SM_80 reveals that NVCC fails due to the griddepcontrol instruction requiring SM_90+. To ensure consistent behavior, the lower_pdl pass has been updated to throw the same error. In my view, addressing this hardware limitation (by targeting a compatible architecture) is a user responsibility; the pass should not automatically eliminate PDL calls as a workaround.

A similar pattern can be observed with the __ldg intrinsic. For example, in testing/python/language/test_tilelang_language_intrinsics_codegen.py, targeting a non-CUDA backend like "metal" results in the error: "tvm.error.InternalError: Unresolved call Op(tl.__ldg)".

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 877962b and 22455f2.

📒 Files selected for processing (3)
  • tilelang/contrib/nvrtc.py
  • tilelang/jit/adapter/nvrtc/libgen.py
  • tilelang/jit/adapter/nvrtc/wrapper.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-24T17:20:27.444Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:27.444Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.

Applied to files:

  • tilelang/jit/adapter/nvrtc/libgen.py
🔇 Additional comments (6)
tilelang/contrib/nvrtc.py (1)

73-76: LGTM! Simple and effective conditional header inclusion for PDL support.

The string-based detection for cudaGridDependencySynchronize and cudaTriggerProgrammaticLaunchCompletion is pragmatic. The header is correctly inserted after the standard NVRTC header, ensuring the device runtime API declarations are available when PDL intrinsics are used.
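A rough sketch of the detection described here (only the two CUDA symbol names and the cuda_device_runtime_api.h header come from this PR; the function name and the insertion mechanics are illustrative):

_PDL_SYMBOLS = (
    "cudaGridDependencySynchronize",
    "cudaTriggerProgrammaticLaunchCompletion",
)


def maybe_add_device_runtime_header(cuda_source: str, nvrtc_header: str) -> str:
    """Insert the device runtime API header right after the standard NVRTC header
    when the kernel source references a PDL intrinsic."""
    if not any(sym in cuda_source for sym in _PDL_SYMBOLS):
        return cuda_source
    return cuda_source.replace(
        nvrtc_header,
        nvrtc_header + "\n#include <cuda_device_runtime_api.h>",
        1,
    )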

tilelang/jit/adapter/nvrtc/wrapper.py (5)

45-46: Good refactor: imports moved to module-level template.

Moving CUlaunchAttribute and CUlaunchAttributeID imports to PREDEF_HOST_FUNC_PY avoids redundant imports per kernel and keeps the generated code cleaner.


174-182: LGTM! Config overwrite issue resolved.

The template now correctly modifies the existing config object created in KERNEL_LAUNCH_FUNC_PY rather than creating a new one. The programmatic stream serialization attribute is properly configured.


203-203: Correct placement of PDL sync code injection.

The placeholder {11} is positioned after config initialization but before cuLaunchKernelEx, ensuring PDL attributes are applied to the launch configuration at the right time.


419-434: Clean integration with existing two-pass code generation.

The PDL sync code generation follows the same pattern as L2 persistent map handling, maintaining consistency with the existing per-kernel configuration approach.


475-482: No changes needed. pdl_sync_map is properly initialized in the parent class TLCUDASourceWrapper at line 213 as self.pdl_sync_map: dict[str, int] | None = {}, and populated at line 450 when kernels have the has_cuda_pdl_sync attribute. The method correctly accesses this inherited attribute.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (3)

7-7: LGTM: Module-level pytest import is appropriate.

The module-level pytest import supports the new PDL test and is best practice. As a follow-up, consider removing the redundant local import pytest at line 377 in test_tvm_ffi_im2col_tma_desc().


446-451: LGTM: Compute capability check is correct for PDL.

The function correctly gates PDL tests on compute_capability[0] >= 9, aligning with the SM_90+ requirement discussed in the PR comments. Note that an identical check_pdl() helper exists in test_tilelang_jit_nvrtc.py. Consolidating these into a shared test utility (e.g., tilelang.testing) could reduce duplication across test modules.
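A shared helper of the kind suggested here could look roughly like this (assuming torch supplies the device capability, as in the existing tests; the names and module location are hypothetical):

import pytest
import torch


def check_pdl() -> bool:
    """True when the current GPU supports PDL (compute capability 9.0+, Hopper or newer)."""
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(torch.cuda.current_device())
    return major >= 9


def skip_unless_pdl() -> None:
    """Skip the calling test on hardware without PDL support."""
    if not check_pdl():
        pytest.skip("PDL requires compute capability >= 9.0 (Hopper or newer)")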


454-505: LGTM: PDL test correctly exercises trigger/sync semantics.

The test accurately validates PDL synchronization by having the first kernel trigger completion after writing B, and the second kernel sync before reading B. The PDL API usage aligns with the definitions in tilelang/language/pdl.py, and the result validation is thorough.

Optional refinements:

  • Test coverage: The current test uses N=64 with block_size=256, launching only a single block per kernel. Consider adding a test case with a larger N (e.g., 1024 or 2048) to verify PDL behavior across multiple thread blocks.
  • Tensor allocation: Lines 491–492 initialize output tensors b and c with torch.randn() but the kernel immediately overwrites them. Using torch.empty() would be slightly more efficient.
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 22455f2 and 2f83ea7.

📒 Files selected for processing (3)
  • 3rdparty/tvm
  • src/target/rt_mod_cuda.cc
  • testing/python/jit/test_tilelang_jit_tvm_ffi.py
✅ Files skipped from review due to trivial changes (1)
  • 3rdparty/tvm
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/jit/test_tilelang_jit_tvm_ffi.py
🧬 Code graph analysis (1)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (1)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
🔇 Additional comments (1)
src/target/rt_mod_cuda.cc (1)

1-1: LGTM: Includes are necessary for PDL support.

The new includes provide the attribute definitions (kHasGridSync) and runtime metadata (kUseProgramaticDependentLaunch) required by the PDL detection logic at lines 43-46.

Also applies to: 4-4

Comment on lines 43 to 46
if (f->HasNonzeroAttr(tl::attr::kHasGridSync)) {
  info.launch_param_tags.push_back(
      runtime::launch_param::kUseProgramaticDependentLaunch);
}
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for both spelling variants and the constant definition

echo "=== Searching for 'Programatic' (single-m) ==="
rg 'Programatic' -C 2

echo -e "\n=== Searching for 'Programmatic' (double-m) ==="
rg 'Programmatic' -C 2

echo -e "\n=== Checking the specific constant definition ==="
rg 'kUseProgramatic' -C 3

Repository: tile-ai/tilelang

Length of output: 6726


The constant name contains a spelling error that should be corrected.

The constant kUseProgramaticDependentLaunch uses single-m "Programatic", but all CUDA API functions (cudaTriggerProgrammaticLaunchCompletion), attributes (cudaLaunchAttributeProgrammaticStreamSerialization), and comments throughout the codebase consistently use double-m "Programmatic". Rename the constant to kUseProgrammaticDependentLaunch for consistency.

🤖 Prompt for AI Agents
In src/target/rt_mod_cuda.cc around lines 43 to 46, the constant name
kUseProgramaticDependentLaunch is misspelled (single “m”); rename all
uses/definitions to kUseProgrammaticDependentLaunch to match CUDA naming and
existing codebase conventions, update any declaration, definition, and
references (including headers and enums) to the corrected identifier, and run a
project-wide search/replace to ensure consistency and rebuild to catch any
remaining references.

Copy link
Collaborator

@silentCoder-dev silentCoder-dev left a comment


LGTM!

@LeiWang1999 LeiWang1999 self-requested a review December 26, 2025 09:39
@LeiWang1999
Copy link
Member

@codex review

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


\t\tconfig.gridDim = {0};
\t\tconfig.blockDim = {1};
\t\tconfig.dynamicSmemBytes = {2};
\t\tcudaLaunchKernelEx(&config, {4}, {3});


P1: Pack kernel args for cudaLaunchKernelEx

cudaLaunchKernelEx expects its third parameter to be a void** array of pointers to the kernel arguments, but in the PDL path {3} is populated from call_args = ", ".join(args_list) (see the PDL branch in this file). That means the generated call passes raw arguments instead of an argument pointer array, so the runtime will interpret the first kernel argument as the void** array (and the next as extra), leading to invalid parameter packing or crashes for any has_cuda_pdl_sync kernel launched via the C++ wrapper.


Copy link
Collaborator

@silentCoder-dev silentCoder-dev Dec 26, 2025


cudaLaunchKernelExC uses a void** for its arguments, while cudaLaunchKernelEx employs a template to accept ArgsTypes&&... args.
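For reference, a small ctypes illustration of the void** packing that the C-style entry points (cudaLaunchKernelExC, cuLaunchKernel) expect; the values below are placeholders and this is not the wrapper's generated code:

import ctypes

# Each kernel parameter lives in its own host object; the launch call then
# receives an array of pointers to those objects, i.e. a void**.
a_dev = ctypes.c_void_p(0)  # placeholder device pointer
b_dev = ctypes.c_void_p(0)  # placeholder device pointer
n = ctypes.c_int(1024)

kernel_args = (ctypes.c_void_p * 3)(
    ctypes.addressof(a_dev),
    ctypes.addressof(b_dev),
    ctypes.addressof(n),
)
# kernel_args is what the void**-style launch takes as its argument array;
# passing a_dev, b_dev, n directly would be misinterpreted by the runtime.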

@kurisu6912 kurisu6912 linked an issue Dec 29, 2025 that may be closed by this pull request
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/target/rt_mod_cuda.cc (1)

36-48: Fix missing closing brace for kHasGridSync condition.

Line 36 opens an if statement for kHasGridSync, but line 39 starts the next if statement without closing the first one. Add a closing brace after line 38, before the use_cooperative_groups check.

    if (f->HasNonzeroAttr(tl::attr::kHasGridSync)) {
      info.launch_param_tags.push_back(
          runtime::launch_param::kUseProgramaticDependentLaunch);
    }
♻️ Duplicate comments (1)
src/target/rt_mod_cuda.cc (1)

38-38: Spelling error already flagged in previous review.

The constant name kUseProgramaticDependentLaunch uses single-m "Programatic", but should be "Programmatic" to match CUDA API naming. This issue was already identified in the previous review.

🧹 Nitpick comments (2)
tilelang/contrib/nvcc.py (1)

589-594: Use _ for the unused minor variable.

The minor variable is unpacked but never used. Following Python conventions and addressing the static analysis hint, prefix it with an underscore to indicate it's intentionally unused.

🔎 Proposed fix
 def have_pdl(target):
     if target.kind.name != "cuda":
         return False
     compute_version = get_target_compute_version(target)
-    major, minor = parse_compute_version(compute_version)
+    major, _ = parse_compute_version(compute_version)
     return major >= 9
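As a usage note, a sketch of how a compilation pipeline could gate on this helper; the pass name tilelang.transform.MarkCudaSyncCalls is taken from the PR summary, and the surrounding structure is illustrative rather than the actual phase.py:

from tilelang.contrib.nvcc import have_pdl
import tilelang.transform as tl_transform

def pdl_passes(target):
    # Only schedule PDL-related marking when the target can use it (CUDA, SM >= 9.0).
    if not have_pdl(target):
        return []
    # Assumption: the pass is exposed to Python as MarkCudaSyncCalls.
    return [tl_transform.MarkCudaSyncCalls()]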
tilelang/language/__init__.py (1)

110-113: Remove unnecessary # noqa: F401 directives.

The F401 rule (unused imports) is not enabled in your Ruff configuration, making these # noqa directives unnecessary. You can safely remove them to reduce noise.

🔎 Proposed fix
 from .pdl import (
-    pdl_trigger,  # noqa: F401
-    pdl_sync,  # noqa: F401
+    pdl_trigger,
+    pdl_sync,
 )
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c1cdf7c and 8eaddd7.

📒 Files selected for processing (3)
  • src/target/rt_mod_cuda.cc
  • tilelang/contrib/nvcc.py
  • tilelang/language/__init__.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-24T17:20:32.819Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:32.819Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.

Applied to files:

  • tilelang/contrib/nvcc.py
🧬 Code graph analysis (2)
tilelang/language/__init__.py (1)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
tilelang/contrib/nvcc.py (1)
tilelang/contrib/rocm.py (1)
  • parse_compute_version (179-201)
🪛 Ruff (0.14.10)
tilelang/language/__init__.py

111-111: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


112-112: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

tilelang/contrib/nvcc.py

593-593: Unpacked variable minor is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🔇 Additional comments (1)
src/target/rt_mod_cuda.cc (1)

1-1: LGTM!

The new includes provide the necessary attribute definitions and runtime metadata structures for PDL support.

Also applies to: 4-4

silentCoder-dev and others added 2 commits December 31, 2025 13:29
* Introduced a new method, get_host_source, in the CythonKernelAdapter class to return the source code of the host function. This addition enhances the functionality of the adapter by providing direct access to the host kernel source.
@LeiWang1999
Copy link
Member

Thanks @w169q169 @silentCoder-dev. I tested the example, and the kernels overlap in both the cython and tvm_ffi execution backends. Note that for tvm_ffi, overlapping occurs after the first launch, as the initial run includes JIT overhead.
[screenshot showing the kernel overlap]

@LeiWang1999
Copy link
Member

Merged.

@LeiWang1999 LeiWang1999 merged commit 9fd6ea7 into tile-ai:main Jan 4, 2026
2 checks passed


Development

Successfully merging this pull request may close these issues.

[Feature Request] PDL Support
