
Conversation

@XuehaiPan (Collaborator) commented Dec 19, 2025

Resolves #1478

Summary by CodeRabbit

  • New Features

    • Added set_log_level() for runtime logging control.
    • Added a CUDA driver stub that provides a safe CPU-only fallback with runtime-loaded CUDA wrappers.
  • Performance / UX

    • Heavy native dependencies now lazy-load to reduce import-time overhead.
    • Header inclusion tightened to enforce stub usage.
  • Chores

    • Unified library output/install layout and removed direct CUDA runtime links at install time.
  • CI Improvements

    • Added CI env vars, caching and setup steps, replaced yum with dnf, adjusted CUDA build matrix, and added Python 3.12 wheel validation/testing.


@XuehaiPan XuehaiPan self-assigned this Dec 19, 2025
@XuehaiPan XuehaiPan added enhancement New feature or request dependencies Pull requests that update a dependency file labels Dec 19, 2025
@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀


coderabbitai bot commented Dec 19, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Adds a lazy-load mechanism to defer heavy/shared-library loads, exposes set_log_level(level), provides a CPU-safe libcuda stub with extern-C wrappers and include guards, updates sources to use the stub, adds a cuda_stub CMake target with install-time patchelf removal of libcuda, and updates CI to use dnf/uv and validate built wheels.
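
For orientation, this is the shape of the lazy-load context manager described above, as quoted in the review diffs further down (Unix-only, since RTLD_LAZY is a POSIX dlopen flag; a sketch, not the exact merged code):

import contextlib
import ctypes
import os
import sys

@contextlib.contextmanager
def _lazy_load_lib():
    # Defer symbol resolution for every shared library opened while the
    # context is active, so unresolved CUDA functions don't fail at import.
    old_flags = sys.getdlopenflags()
    old_init = ctypes.CDLL.__init__

    def lazy_init(self, name, mode=ctypes.DEFAULT_MODE, *args, **kwargs):
        return old_init(self, name, mode | os.RTLD_LAZY, *args, **kwargs)

    sys.setdlopenflags(old_flags | os.RTLD_LAZY)
    ctypes.CDLL.__init__ = lazy_init
    try:
        yield
    finally:
        sys.setdlopenflags(old_flags)
        ctypes.CDLL.__init__ = old_init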

Changes

Cohort / File(s) Summary
Workflow — dist & wheel validation
.github/workflows/dist.yml
Add CI env/cache vars (UV_INDEX_STRATEGY, UV_HTTP_TIMEOUT, XDG_CACHE_HOME, PIP_CACHE_DIR, UV_CACHE_DIR); replace yum → dnf; add astral-sh/setup-uv@v7 step; parse/export CUDA parts and UV_INDEX; add Python 3.12 wheel-test venv steps; apply across build-wheels and build-sdist.
Module init & logging
tilelang/__init__.py
Introduce _lazy_load_lib context manager to adjust dlopen/ctypes flags and defer heavy imports (env, tvm, libinfo, submodules); remove prior TqdmLoggingHandler/_init_logger pattern; add public set_log_level(level); compute __version__ and delete helper symbols after use.
CUDA runtime stubs — impl & header
src/target/stubs/cuda.h, src/target/stubs/cuda.cc
Add CUDADriverAPI singleton that dlopen()s libcuda.so(.1), resolves required/optional CUDA driver symbols, exposes get()/is_available()/get_handle(), and supplies extern "C" wrappers so driver calls route through the stub for CPU-only machines.
Source includes updated to stubs
src/op/builtin.cc, src/op/copy.cc, src/runtime/runtime.cc
Replace #include "../target/cuda.h" → #include "../target/stubs/cuda.h"; no behavior changes.
Build system — CUDA stub, RPATH & patchelf
CMakeLists.txt
Add cuda_stub shared target, include it in TILELANG_OUTPUT_TARGETS, link it into targets when USE_CUDA; unify per-target output dirs and INSTALL_RPATH; detect patchelf and run install-time patchelf --remove-needed to strip libcuda.so(.1) from installed artifacts.
Packaging config / wheel repair
pyproject.toml
Switch Linux installer steps from yum to dnf; remove some nvidia packages from the install list; extend repair/audit exclusions to omit libcuda.so.1/libcuda.so and /usr/local/cuda*; minor formatting tweaks.
Header include guard tightening
src/target/stubs/vendor/cuda.h
Add an include-time guard that enforces the header is included only by the stub implementation (compile-time error otherwise).

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Importer as "Python importer"
  participant Tilelang as "tilelang/__init__"
  participant CUDAStub as "CUDADriverAPI (cuda_stub)"
  participant LibCUDA as "libcuda.so (system)"

  rect rgba(200,230,255,0.18)
  Note over Tilelang,CUDAStub: Lazy-load import-time behavior and CPU-safe stub
  end

  Importer->>Tilelang: import tilelang
  Tilelang->>Tilelang: enter _lazy_load_lib (adjust dlopen/ctypes flags)
  Tilelang->>CUDAStub: request CUDADriverAPI singleton
  CUDAStub->>LibCUDA: dlopen("libcuda.so.1" / "libcuda.so")
  alt libcuda found
    LibCUDA-->>CUDAStub: handle + symbols
    CUDAStub-->>Tilelang: resolved function pointers (is_available=true)
    Tilelang-->>Importer: import completes (CUDA available)
  else libcuda not found
    LibCUDA-->>CUDAStub: open failed
    CUDAStub-->>Tilelang: provide safe stubs (is_available=false)
    Tilelang-->>Importer: import completes (CPU-only safe)
  end
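
The stub itself is C++ (src/target/stubs/cuda.cc); as a Python analogue of the probe order shown in the diagram, for illustration only:

import ctypes

def cuda_driver_available() -> bool:
    # Mirror the stub's fallback: try the driver SONAME first, then the
    # development symlink; report CPU-only if neither can be dlopen()ed.
    for name in ("libcuda.so.1", "libcuda.so"):
        try:
            ctypes.CDLL(name)
            return True
        except OSError:
            continue
    return False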

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • LeiWang1999

Poem

🐇 I tuck my libs in quiet paws,

I open slow when loaders call,
If CUDA sleeps, I shim and stay,
I patch the wheels and test by day,
Hop-hop — light imports for all.

Pre-merge checks

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 25.00%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (4 passed)

  • Description Check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check ✅ Passed: The title clearly and concisely describes the main objective: enabling tilelang import on CPU-only machines without CUDA libraries, which matches the core changes throughout the PR.
  • Linked Issues check ✅ Passed: The PR addresses issue #1478 by implementing lazy loading of CUDA libraries to allow import on CPU-only machines: it adds CUDA driver stubs with lazy dlopen loading, modifies tilelang/__init__.py to use the lazy-loading context, updates the build configuration to support dynamic CUDA library loading, and ensures import succeeds without a hard CUDA dependency.
  • Out of Scope Changes check ✅ Passed: All changes are directly aligned with the objective of supporting CPU-only imports: CI/build configuration updates for lazy loading, the CUDA stub library implementation, header path adjustments, and module-initialization refactoring all serve the core goal without extraneous modifications.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@XuehaiPan XuehaiPan marked this pull request as ready for review December 19, 2025 16:06

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
tilelang/__init__.py (2)

69-79: Consider warning on invalid level strings.

If an invalid level string is passed (e.g., set_log_level('DEBG')), it silently falls back to INFO. This could mask typos.

🔎 Optional enhancement to warn on invalid levels
 def set_log_level(level):
     if isinstance(level, str):
-        level = getattr(logging, level.upper(), logging.INFO)
+        upper_level = level.upper()
+        if not hasattr(logging, upper_level):
+            warnings.warn(f"Unknown log level '{level}', defaulting to INFO", stacklevel=2)
+            level = logging.INFO
+        else:
+            level = getattr(logging, upper_level)
     logger = logging.getLogger(__name__)
     logger.setLevel(level)
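
Either way, the public API accepts both spellings; a usage sketch:

import logging
import tilelang

tilelang.set_log_level("debug")          # by name (case-insensitive)
tilelang.set_log_level(logging.WARNING)  # or by numeric level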

113-162: Static analysis: noqa: F401 directives may be unnecessary.

Ruff reports these noqa: F401 directives as unused because the F401 rule appears to be disabled in your project configuration. You could remove them to reduce noise, or enable F401 in your Ruff config if you want to enforce unused import checks.

This is a low-priority cleanup item.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2217eb7 and 226d896.

📒 Files selected for processing (2)
  • .github/workflows/dist.yml (1 hunks)
  • tilelang/__init__.py (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-10T13:29:29.347Z
Learnt from: XuehaiPan
Repo: tile-ai/tilelang PR: 973
File: .github/workflows/ci.yml:13-15
Timestamp: 2025-10-10T13:29:29.347Z
Learning: In .github/workflows/ci.yml for tilelang (GitHub Actions), actions/cache@v4 and setup-python’s cache feature require GITHUB_TOKEN with actions: write to save caches; with a permissions block that only sets contents: read, unspecified actions permission becomes none, so caches will restore but not save.

Applied to files:

  • .github/workflows/dist.yml
🧬 Code graph analysis (1)
tilelang/__init__.py (3)
tilelang/env.py (3)
  • enable_cache (269-270)
  • disable_cache (272-273)
  • is_cache_enabled (266-267)
tilelang/libinfo.py (1)
  • find_lib_path (7-35)
tilelang/cache/__init__.py (1)
  • clear_cache (43-55)
🪛 Ruff (0.14.8)
tilelang/__init__.py

113-113: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


114-114: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


117-117: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


118-118: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


138-138: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


139-139: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


140-140: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


142-142: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


143-143: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


146-146: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


147-147: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


150-150: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


151-151: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


152-152: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


153-153: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


154-154: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


156-156: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


157-157: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


158-158: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


159-159: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


161-161: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


162-162: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build SDist
  • GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
🔇 Additional comments (4)
.github/workflows/dist.yml (2)

168-173: LGTM - Python setup step added for wheel testing.

The setup step correctly configures Python 3.12 and uv for the subsequent wheel validation tests.


175-190: Good addition for wheel validation on CPU-only runners.

This test step validates that the built wheels can be installed and imported successfully, which directly tests the lazy loading changes. The cd / ensures the installed package is tested rather than the source directory.

One minor observation: the echo ... | uv run --no-project --script - pattern is less common than using python -c. Consider using:

uv run --no-project python -c "import tilelang; print(tilelang.__version__)"

However, the current approach should work correctly.

tilelang/__init__.py (2)

112-136: Verify that RTLD_LAZY achieves the goal for TVM imports.

The RTLD_LAZY flag defers symbol resolution for native libraries loaded via dlopen/ctypes.CDLL. However, TVM may load its runtime library during import tvm using its own loading mechanism.

The effectiveness depends on whether TVM's internal library loading respects the process-wide dlopenflags. The workflow test step should validate this works correctly.
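
A quick manual probe of what sys.setdlopenflags() actually controls (a Unix-only sketch; TVM's ctypes.CDLL calls only benefit if these process-wide flags, or the patched ctypes.CDLL.__init__, are in effect when libtvm is opened):

import os
import sys

# True while a _lazy_load_lib-style context has set the flag;
# False once the context restores the original flags.
print(bool(sys.getdlopenflags() & os.RTLD_LAZY))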


49-50: Good namespace hygiene.

The del statements for internal helpers (_compute_version, _init_logger, _lazy_load_lib) keep the module's public namespace clean. This is a good practice.

Also applies to: 96-97, 163-164


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tilelang/__init__.py (1)

102-117: Critical: Windows compatibility issue remains unresolved.

This is the same issue flagged in the previous review. os.RTLD_LAZY and sys.getdlopenflags() are Unix-only and will raise AttributeError on Windows, breaking imports on that platform.

🔎 Recommended fix with platform guard
 @contextlib.contextmanager
 def _lazy_load_lib():
+    # RTLD_LAZY is Unix-only; Windows doesn't support dlopen flags
+    if not hasattr(os, 'RTLD_LAZY') or not hasattr(sys, 'getdlopenflags'):
+        yield
+        return
+    
     old_flags = sys.getdlopenflags()
     old_init = ctypes.CDLL.__init__

     def lazy_init(self, name, mode=ctypes.DEFAULT_MODE, *args, **kwargs):
         return old_init(self, name, mode | os.RTLD_LAZY, *args, **kwargs)

     sys.setdlopenflags(old_flags | os.RTLD_LAZY)
     ctypes.CDLL.__init__ = lazy_init
     try:
         yield
     finally:
         sys.setdlopenflags(old_flags)
         ctypes.CDLL.__init__ = old_init
🧹 Nitpick comments (1)
tilelang/__init__.py (1)

120-169: Optional cleanup: Remove unnecessary noqa directives.

Ruff reports that the # noqa: F401 directives on many lines are unnecessary because F401 (unused imports) is not enabled in your linter configuration. If these imports are intentionally re-exported, consider using __all__ to make this explicit instead of suppressing warnings.

Alternative approach using __all__

You could replace the noqa comments with an explicit __all__ declaration at the end of the file:

__all__ = [
    'enable_cache', 'disable_cache', 'is_cache_enabled', 'env',
    'DataType', 'jit', 'lazy_jit', 'JITKernel', 'compile', 'par_compile',
    'Profiler', 'clear_cache', 'TensorSupplyType', 'deprecated',
    'Layout', 'Fragment', 'analysis', 'transform', 'language', 'engine',
    'tools', 'dtypes', 'autotune', 'PassConfigKey', 'lower',
    'register_cuda_postproc', 'register_hip_postproc', 'ir', 'tileop',
    'set_log_level', '__version__',
]

This makes the public API explicit and is more maintainable than individual noqa comments.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 226d896 and d2b6a9c.

📒 Files selected for processing (1)
  • tilelang/__init__.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/__init__.py (2)
tilelang/env.py (3)
  • enable_cache (269-270)
  • disable_cache (272-273)
  • is_cache_enabled (266-267)
tilelang/libinfo.py (1)
  • find_lib_path (7-35)
🪛 Ruff (0.14.8)
tilelang/__init__.py

120-120: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


121-121: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


124-124: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


125-125: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


145-145: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


146-146: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


147-147: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


149-149: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


150-150: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


153-153: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


154-154: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


157-157: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


158-158: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


159-159: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


160-160: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


161-161: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


163-163: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


164-164: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


165-165: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


166-166: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


168-168: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


169-169: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
  • GitHub Check: Build SDist
🔇 Additional comments (5)
tilelang/__init__.py (5)

1-8: LGTM: Necessary imports for lazy loading.

The added standard library imports (contextlib, os, sys) are required for implementing the lazy loading mechanism.


12-50: LGTM: Version computation with proper cleanup.

The version computation logic is sound, and the namespace cleanup on line 50 follows best practices.


53-97: LGTM: Well-structured logging infrastructure.

The logging infrastructure is properly implemented:

  • TqdmLoggingHandler correctly integrates with tqdm to prevent output interference
  • set_log_level provides a user-friendly public API
  • Logger initialization and cleanup follow best practices

119-171: Heavy imports properly wrapped in lazy loading context.

The import structure is correct, and moving all heavy imports inside the with _lazy_load_lib() context achieves the intended lazy loading behavior.


102-117: RTLD_LAZY implementation is correct but requires runtime validation on CPU-only systems.

The _lazy_load_lib() context manager at lines 102-117 is an intentional enhancement (commit: "Set RTLD_LAZY flag on CDLL") that correctly wraps TVM and tilelang library imports. RTLD_LAZY defers symbol resolution until symbols are actually used, which is the appropriate mechanism for allowing imports without CUDA libraries present.

The implementation properly:

  • Modifies process-level dlopen flags via sys.setdlopenflags()
  • Wraps CDLL initialization to append RTLD_LAZY to the mode
  • Restores original state in a finally block

CPU-only support is explicitly intended (CMakeLists.txt mentions "Build CPU-only when we explicitly disable CUDA"), confirming the use case is real.

However, no explicit test validating successful import on a CPU-only machine (with CUDA libraries hidden/unavailable) was found. The existing test file (test_tilelang_language_lazy_jit.py) covers the @lazy_jit decorator for JIT compilation, not the import-time lazy loading behavior. To definitively confirm this solves the OSError: libcuda.so.1 problem, add a test that imports tilelang with CUDA libraries unavailable.
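
A test along the suggested lines could be a subprocess import check (a sketch; it is only a true CPU-only test on a runner where libcuda.so.1 is genuinely absent, e.g. a container without the NVIDIA driver):

import subprocess
import sys

def test_import_without_cuda():
    # Fresh interpreter so no CUDA state from the test session leaks in.
    result = subprocess.run(
        [sys.executable, "-c", "import tilelang; print(tilelang.__version__)"],
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0, result.stderr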


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
tilelang/__init__.py (1)

100-117: Platform compatibility: os.RTLD_LAZY is Unix-only.

This issue was flagged in a previous review but remains unaddressed. os.RTLD_LAZY is not available on Windows, and accessing it will raise AttributeError. The libinfo.py snippet shows Windows is a supported platform (win32 checks exist), so this needs a guard.

🔎 Proposed fix for Windows compatibility
 @contextlib.contextmanager
 def _lazy_load_lib():
     import torch  # noqa: F401 # preload torch to avoid dlopen errors

+    if not hasattr(os, 'RTLD_LAZY'):
+        # Windows doesn't support dlopen flags; skip lazy loading adjustments
+        yield
+        return
+
     old_flags = sys.getdlopenflags()
     old_init = ctypes.CDLL.__init__

     def lazy_init(self, name, mode=ctypes.DEFAULT_MODE, *args, **kwargs):
         return old_init(self, name, mode | os.RTLD_LAZY, *args, **kwargs)

     sys.setdlopenflags(old_flags | os.RTLD_LAZY)
     ctypes.CDLL.__init__ = lazy_init
     try:
         yield
     finally:
         sys.setdlopenflags(old_flags)
         ctypes.CDLL.__init__ = old_init
🧹 Nitpick comments (1)
tilelang/__init__.py (1)

119-170: Lazy loading structure achieves the PR objective.

The heavy imports (tvm, ctypes library loading, submodules) are correctly wrapped within the _lazy_load_lib() context, deferring shared library resolution until import time with RTLD_LAZY semantics.

Regarding static analysis hints: The # noqa: F401 directives are intentional for re-exported symbols. If F401 is not enabled in your Ruff configuration, you could remove them for cleaner code, but they're harmless.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d2b6a9c and a9d01bc.

📒 Files selected for processing (1)
  • tilelang/__init__.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/__init__.py (1)
tilelang/libinfo.py (1)
  • find_lib_path (7-35)
🪛 Ruff (0.14.8)
tilelang/__init__.py

82-82: Do not catch blind exception: Exception

(BLE001)


102-102: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


120-120: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


121-121: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


124-124: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


125-125: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


145-145: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


146-146: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


147-147: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


149-149: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


150-150: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


153-153: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


154-154: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


157-157: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


158-158: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


159-159: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


160-160: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


161-161: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


163-163: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


164-164: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


165-165: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


166-166: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


168-168: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


169-169: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Quick Lint
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.8
  • GitHub Check: Build SDist
  • GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
🔇 Additional comments (3)
tilelang/__init__.py (3)

1-48: LGTM!

The import statements and version computation logic are well-structured with appropriate fallbacks.


66-97: LGTM!

The logger initialization with TqdmLoggingHandler is well-implemented. The broad exception catch in emit() (line 82) follows the standard logging.Handler pattern—logging should never raise, and handleError is correctly invoked as a fallback.


102-108: RTLD_LAZY defers function symbol resolution correctly, but verify no CUDA data dependencies exist.

RTLD_LAZY resolves symbols only as the code that references them is executed, and if the symbol is never referenced, then it is never resolved. This approach should allow import tilelang to succeed on CPU-only machines without libcuda.so.1.

However, lazy binding is performed only for function references; references to variables are always immediately bound when the shared object is loaded. If the tilelang library has direct data dependencies on CUDA symbols (not just function calls), those will fail at dlopen() time regardless of RTLD_LAZY. Confirm that TVM's CUDA integration uses function pointers rather than direct variable references.
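
The two binding modes are easy to compare with ctypes (a sketch using libm on Linux/glibc as a stand-in; both calls succeed here because libm has no unresolved references, but the flags show where each mode would fail):

import ctypes
import os

# RTLD_LAZY: function symbols resolve at first call; data symbols still
# bind at load time, so an unresolved *variable* reference to libcuda
# would fail the dlopen() regardless of this flag.
lazy = ctypes.CDLL("libm.so.6", mode=os.RTLD_LAZY)

# RTLD_NOW: every symbol must resolve during dlopen().
eager = ctypes.CDLL("libm.so.6", mode=os.RTLD_NOW)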


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a9d01bc and eadeafd.

📒 Files selected for processing (1)
  • pyproject.toml (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.8
  • GitHub Check: Build SDist
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8

@XuehaiPan XuehaiPan force-pushed the lazy-dlopen branch 2 times, most recently from 8ec5352 to 1929031 on December 19, 2025 18:11

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
.github/workflows/dist.yml (1)

168-174: Cache may not save due to missing actions: write permission.

Based on learnings, the astral-sh/setup-uv@v7 action's caching feature requires GITHUB_TOKEN with actions: write permission to save caches. The current permissions block (line 26) only grants contents: read, so caches will restore but not save.

This is not blocking since caches will still restore from previous runs, but consider adding actions: write to enable cache saving:

permissions:
  contents: read
  actions: write
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e7c1604 and 19c3864.

📒 Files selected for processing (1)
  • .github/workflows/dist.yml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-10T13:29:29.347Z
Learnt from: XuehaiPan
Repo: tile-ai/tilelang PR: 973
File: .github/workflows/ci.yml:13-15
Timestamp: 2025-10-10T13:29:29.347Z
Learning: In .github/workflows/ci.yml for tilelang (GitHub Actions), actions/cache@v4 and setup-python’s cache feature require GITHUB_TOKEN with actions: write to save caches; with a permissions block that only sets contents: read, unspecified actions permission becomes none, so caches will restore but not save.

Applied to files:

  • .github/workflows/dist.yml


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
CMakeLists.txt (1)

330-333: Consider using WARNING instead of STATUS when patchelf is not found.

When patchelf is missing, the libcuda.so dependency won't be removed, which means the resulting library may fail to import on CPU-only machines. A STATUS message could be easily missed in build output. Using WARNING would make this more visible to developers.

🔎 Suggested change
   find_program(PATCHELF_EXECUTABLE patchelf)
   if (NOT PATCHELF_EXECUTABLE)
-    message(STATUS "patchelf not found.")
+    message(WARNING "patchelf not found - libcuda.so dependency will not be removed from built libraries")
   endif()
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 414055d and 5bd8c50.

📒 Files selected for processing (1)
  • CMakeLists.txt (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
  • GitHub Check: Build SDist
🔇 Additional comments (4)
CMakeLists.txt (4)

309-316: Good refactoring to centralize target handling.

Defining TILELANG_OUTPUT_TARGETS improves maintainability by ensuring consistent handling across RPATH settings, patching, and installation. The unset(PATCHELF_EXECUTABLE) ensures a clean state before the conditional find_program call.


337-339: LGTM!

Clean refactoring that applies consistent RPATH settings to all output targets via the loop.


359-361: LGTM!

Using ${TILELANG_OUTPUT_TARGETS} maintains consistency with the variable defined at line 309 and ensures any future target additions only need updating in one place.


341-357: Well-structured patching approach that correctly handles targets from different directories.

The separation of POST_BUILD for local targets and a custom target for subdirectory targets is the right approach. Using ALL ensures patch_tvm_libs runs during normal builds. Since the install target depends on the ALL target by default, patching is guaranteed to occur before installation regardless of the build workflow.


@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
src/target/stubs/cuda.h (1)

69-70: Move cuLaunchKernelEx to optional APIs for CUDA < 12.0 compatibility.

As flagged in a previous review, cuLaunchKernelEx was introduced in CUDA 12.0. Listing it in TILELANG_LIBCUDA_API_REQUIRED means the lazy loader will throw an error on systems with CUDA 11.x drivers, even if this function is never called. This contradicts the PR's goal of supporting varied environments.

🔎 Proposed fix

Move cuLaunchKernelEx to the optional list or conditionally include it based on CUDA version:

 #define TILELANG_LIBCUDA_API_REQUIRED(_)                                       \
   _(cuGetErrorName)                                                            \
   _(cuGetErrorString)                                                          \
   _(cuCtxGetDevice)                                                            \
   _(cuCtxGetLimit)                                                             \
   _(cuCtxSetLimit)                                                             \
   _(cuCtxResetPersistingL2Cache)                                               \
   _(cuDeviceGetName)                                                           \
   _(cuDeviceGetAttribute)                                                      \
   _(cuModuleLoadData)                                                          \
   _(cuModuleLoadDataEx)                                                        \
   _(cuModuleUnload)                                                            \
   _(cuModuleGetFunction)                                                       \
   _(cuModuleGetGlobal)                                                         \
   _(cuFuncSetAttribute)                                                        \
   _(cuLaunchKernel)                                                            \
-  _(cuLaunchKernelEx)                                                          \
   _(cuLaunchCooperativeKernel)                                                 \
   _(cuMemsetD32)                                                               \
   _(cuStreamSetAttribute)

 // Optional APIs (may not exist in older drivers or specific configurations)
 // These are loaded but may be nullptr if not available
 #if defined(CUDA_VERSION) && (CUDA_VERSION >= 12000)
 #define TILELANG_LIBCUDA_API_OPTIONAL(_)                                       \
+  _(cuLaunchKernelEx)                                                          \
   _(cuTensorMapEncodeTiled)                                                    \
   _(cuTensorMapEncodeIm2col)

Also move the corresponding extern "C" declaration inside the #if defined(CUDA_VERSION) && (CUDA_VERSION >= 12000) block at lines 183-186.

🧹 Nitpick comments (2)
src/target/stubs/cuda.h (2)

104-107: Naming inconsistency between X-macro and actual CUDA API names.

The X-macro creates function pointer members using names like cuModuleGetGlobal_ and cuMemsetD32_ (line 104-107), but the actual CUDA driver API versioned names are cuModuleGetGlobal_v2 and cuMemsetD32_v2 (as seen in extern "C" declarations at lines 171 and 192). This mismatch is explained in comments (lines 97-99) but could lead to confusion during maintenance.

💡 Consider using versioned names in the X-macro

Update the X-macro to use the actual versioned API names to avoid confusion:

 #define TILELANG_LIBCUDA_API_REQUIRED(_)                                       \
   _(cuGetErrorName)                                                            \
   _(cuGetErrorString)                                                          \
   _(cuCtxGetDevice)                                                            \
   _(cuCtxGetLimit)                                                             \
   _(cuCtxSetLimit)                                                             \
   _(cuCtxResetPersistingL2Cache)                                               \
   _(cuDeviceGetName)                                                           \
   _(cuDeviceGetAttribute)                                                      \
   _(cuModuleLoadData)                                                          \
   _(cuModuleLoadDataEx)                                                        \
   _(cuModuleUnload)                                                            \
   _(cuModuleGetFunction)                                                       \
-  _(cuModuleGetGlobal)                                                         \
+  _(cuModuleGetGlobal_v2)                                                      \
   _(cuFuncSetAttribute)                                                        \
   _(cuLaunchKernel)                                                            \
   _(cuLaunchKernelEx)                                                          \
   _(cuLaunchCooperativeKernel)                                                 \
-  _(cuMemsetD32)                                                               \
+  _(cuMemsetD32_v2)                                                            \
   _(cuStreamSetAttribute)

Note: This would require updating the member access in cuda.cc to use the versioned names with trailing underscore (e.g., cuModuleGetGlobal_v2_).


145-195: Document error handling behavior for wrapper functions.

The extern "C" wrapper functions are declared but their error handling behavior when CUDA is unavailable is not documented. Based on the implementation snippet from cuda.cc, these wrappers call CUDADriverAPI::get(), which throws std::runtime_error if libcuda.so cannot be loaded.

Add documentation clarifying the error handling behavior:

📝 Suggested documentation enhancement

Add to the section comment at lines 138-144:

 // ============================================================================
 // Global wrapper functions for lazy-loaded CUDA driver API
 // ============================================================================
 // These functions provide drop-in replacements for CUDA driver API calls.
 // They are exported from the stub library and can be linked against directly.
 // The implementations are in cuda.cc.
+//
+// Error handling: All wrapper functions call CUDADriverAPI::get(), which will
+// throw std::runtime_error if libcuda.so cannot be loaded or if required
+// symbols are missing. Use CUDADriverAPI::is_available() to check CUDA
+// availability without throwing.
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ac22b7a and 5a7edd0.

📒 Files selected for processing (2)
  • src/target/stubs/cuda.h
  • src/target/stubs/vendor/cuda.h
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • src/target/stubs/cuda.h
🧬 Code graph analysis (1)
src/target/stubs/cuda.h (1)
src/target/stubs/cuda.cc (8)
  • get (108-118)
  • get (108-108)
  • is_available (106-106)
  • is_available (106-106)
  • get_handle (100-104)
  • get_handle (100-100)
  • cuModuleLoadData (165-167)
  • cuModuleLoadData (165-165)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-13.0
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-13.0
  • GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.8
  • GitHub Check: Build SDist
🔇 Additional comments (2)
src/target/stubs/cuda.h (2)

1-37: LGTM! Clear documentation and well-designed include guard pattern.

The file-level documentation clearly explains the purpose and provides concrete usage examples. The include guard pattern (_TILELANG_CUDA_STUB_INCLUDE_GUARD) is a good design to enforce that vendor/cuda.h can only be included through this stub header.


39-48: LGTM! Standard cross-platform symbol visibility pattern.

The platform-specific export/import macros follow standard practices for shared library development on Windows and Unix-like systems.


@oraluben oraluben left a comment


Overall LGTM, amazing work!


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
.github/workflows/dist.yml (1)

183-200: Critical: uv run --no-project still bypasses the installed wheel.

While line 191 has been improved to use uv pip install directly, line 195 still uses uv run --no-project -- python, which creates an ephemeral environment instead of testing the wheel installed in test-venv. Even though the venv is activated, uv run --no-project ignores the activated environment.

Additionally, error handling could be improved by placing set -e at the top of the run block (before the for loop) rather than inside the subshell.

🔎 Recommended fix
       - name: Test built wheels
         run: |
+          set -e
           for WHEEL in wheelhouse/*.whl; do
             echo "Testing wheel: ${WHEEL}"
-            (
-              set -e
-              uv venv --python=3.12 test-venv
-              source test-venv/bin/activate
-              uv pip install -v "${WHEEL}"
-              (
-                set -e
-                cd /
-                uv run --no-project -- python -c "import tilelang; print(tilelang.__version__)"
-              )
-              deactivate
-              rm -rf test-venv
-            )
+            uv venv --python=3.12 test-venv
+            uv pip install --python test-venv -v "${WHEEL}"
+            # Test from root directory to ensure we're not importing from source
+            (cd / && "${GITHUB_WORKSPACE}/test-venv/bin/python" -c "import tilelang; print(tilelang.__version__)")
+            rm -rf test-venv
           done

This fix:

  • Adds set -e at the top for fail-fast behavior
  • Uses uv pip install --python test-venv to install directly into the venv
  • Invokes test-venv/bin/python directly to test the installed wheel
  • Removes unnecessary activation/deactivation and nested subshells
🧹 Nitpick comments (1)
.github/workflows/dist.yml (1)

157-161: Consider using only UV_TORCH_BACKEND for PyTorch backend selection.

Both UV_INDEX and UV_TORCH_BACKEND are currently set, which may lead to ambiguous package resolution behavior when multiple indices provide the same package. Based on the past discussion and UV's PyTorch integration guide, UV_TORCH_BACKEND is the explicit, recommended approach for selecting CUDA backends.

Consider removing line 160 (UV_INDEX) and relying solely on UV_TORCH_BACKEND (line 161) for clearer, more predictable behavior.

Based on learnings from past discussions between maintainers about UV_INDEX vs UV_TORCH_BACKEND.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 20beda8 and 42278c3.

📒 Files selected for processing (1)
  • .github/workflows/dist.yml
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-12-24T17:20:32.819Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:32.819Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.

Applied to files:

  • .github/workflows/dist.yml
📚 Learning: 2025-12-15T08:56:23.827Z
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.

Applied to files:

  • .github/workflows/dist.yml
📚 Learning: 2025-10-10T13:29:29.347Z
Learnt from: XuehaiPan
Repo: tile-ai/tilelang PR: 973
File: .github/workflows/ci.yml:13-15
Timestamp: 2025-10-10T13:29:29.347Z
Learning: In .github/workflows/ci.yml for tilelang (GitHub Actions), actions/cachev4 and setup-python’s cache feature require GITHUB_TOKEN with actions: write to save caches; with a permissions block that only sets contents: read, unspecified actions permission becomes none, so caches will restore but not save.

Applied to files:

  • .github/workflows/dist.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-13.0
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-13.0
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.8
  • GitHub Check: Build SDist
🔇 Additional comments (4)
.github/workflows/dist.yml (4)

38-42: LGTM with minor note.

The UV environment variables are configured appropriately. The cache directories will resolve to absolute paths at runtime, which should work correctly with the caching actions.

Note: The inline comment mentions "to be updated" - ensure these paths are finalized before merging.


66-72: LGTM!

The setup-uv action is properly configured with caching enabled, which will improve workflow performance.


176-181: LGTM!

Consistent setup of UV across both build-sdist and build-wheels jobs enables proper wheel testing.


116-117: The CUDA-13.0 version in these lines is the current standard in the build matrix, alongside CUDA-12.8. No evidence of a CUDA-13.1 version exists in the codebase to support the claim of a downgrade.

Likely an incorrect or invalid review comment.

oraluben previously approved these changes Dec 31, 2025
@XuehaiPan (Collaborator, Author) commented:

cc @LeiWang1999 to stamp this

oraluben previously approved these changes Jan 5, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
.github/workflows/dist.yml (1)

189-207: Critical: uv run --no-project bypasses the installed wheel, invalidating the test.

The test creates and activates a venv (Lines 195-196), installs the wheel (Line 197), but then uses uv run --no-project (Line 201) which spawns an ephemeral environment that doesn't contain the installed wheel. This completely defeats the validation purpose.

Additionally, error handling is inadequate - the outer subshell doesn't have set -e, so failures in venv creation or installation won't fail the CI step.

🔎 Proposed fix to directly invoke the venv's Python interpreter
       - name: Test built wheels
         run: |
+          set -e
           for WHEEL in wheelhouse/*.whl; do
             echo "Testing wheel: ${WHEEL}"
-            (
-              set -e
-              uv venv --python=3.12 test-venv
-              source test-venv/bin/activate
-              uv pip install -v "${WHEEL}"
-              (
-                set -e
-                cd /
-                uv run --no-project -- python -c "import tilelang; print(tilelang.__version__)"
-              )
-              deactivate
-              rm -rf test-venv
-            )
+            uv venv --python=3.12 test-venv
+            uv pip install --python test-venv -v "${WHEEL}"
+            # Test from root directory to ensure we're not importing from source
+            (cd / && "${GITHUB_WORKSPACE}/test-venv/bin/python" -c "import tilelang; print(tilelang.__version__)")
+            rm -rf test-venv
           done

This approach:

  • Adds set -e at the top to fail fast on any error
  • Removes the activation/deactivation which is unnecessary when directly targeting a venv
  • Uses uv pip install --python test-venv to install directly into the venv
  • Invokes test-venv/bin/python directly to test the installed package
  • Removes unnecessary subshells
🧹 Nitpick comments (2)
.github/workflows/dist.yml (2)

160-161: Consider whether both UV_INDEX and UV_TORCH_BACKEND are necessary.

Based on previous discussion in this PR, UV_TORCH_BACKEND is the more explicit and recommended approach for PyTorch backend selection. Setting both UV_INDEX and relying on UV_INDEX_STRATEGY: "unsafe-best-match" (Line 38) may introduce ambiguity when multiple indexes provide the same package.

If UV_TORCH_BACKEND alone is sufficient for your use case, consider removing the UV_INDEX configuration to simplify the setup.

🔎 Proposed simplification
-            echo "UV_INDEX=https://download.pytorch.org/whl/cu${CUDA_VERSION_MAJMIN_NODOT}" | tee -a "${GITHUB_ENV}"
             echo "UV_TORCH_BACKEND=cu${CUDA_VERSION_MAJMIN_NODOT}" | tee -a "${GITHUB_ENV}"

40-42: Remove misleading "to be updated" comments on cache environment variables.

The manually configured cache paths (XDG_CACHE_HOME, PIP_CACHE_DIR, UV_CACHE_DIR at lines 40-42) align with setup-uv's documented best practices for self-hosted runners, where workspace-relative cache paths prevent unbounded cache growth. However, the "# to be updated" comments are misleading—these paths are correct and don't require adjustment. Remove the comments to clarify intent.

Note: For GitHub-hosted runners where enable-cache: auto applies, the setup-uv@v7 action handles uv caching automatically; the manual UV_CACHE_DIR env var is redundant but harmless. Ensure that PIP_CACHE_DIR remains if pip is used separately outside of uv workflows.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 80a3084 and 27b9543.

📒 Files selected for processing (1)
  • .github/workflows/dist.yml
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-12-24T17:20:32.819Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:32.819Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.

Applied to files:

  • .github/workflows/dist.yml
📚 Learning: 2025-12-15T08:56:23.827Z
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.

Applied to files:

  • .github/workflows/dist.yml
📚 Learning: 2025-10-10T13:29:29.347Z
Learnt from: XuehaiPan
Repo: tile-ai/tilelang PR: 973
File: .github/workflows/ci.yml:13-15
Timestamp: 2025-10-10T13:29:29.347Z
Learning: In .github/workflows/ci.yml for tilelang (GitHub Actions), actions/cachev4 and setup-python’s cache feature require GITHUB_TOKEN with actions: write to save caches; with a permissions block that only sets contents: read, unspecified actions permission becomes none, so caches will restore but not save.

Applied to files:

  • .github/workflows/dist.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with Nightly-CUDA-13.1
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with Nightly-CUDA-13.1
  • GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
  • GitHub Check: Build SDist
🔇 Additional comments (2)
.github/workflows/dist.yml (2)

172-172: LGTM: Package manager updated to dnf.

The change from yum to dnf is appropriate for modern Fedora/RHEL-based container images used by cibuildwheel. This aligns with the current standard package manager in these distributions.


182-187: LGTM: Python and uv setup for wheel validation.

The addition of this setup step is appropriate for testing built wheels. Using Python 3.12 and enabling environment activation provides a clean environment for validation.

oraluben previously approved these changes Jan 5, 2026
@oraluben oraluben self-requested a review January 5, 2026 08:59

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Fix all issues with AI Agents 🤖
In @.github/workflows/dist.yml:
- Around line 160-167: The workflow sets UV_INDEX to the PyTorch nightly index
when matrix.target.toolkit starts with "Nightly-" which triggers a known uv
infinite-download bug; change the nightly branch that sets UV_INDEX (and avoids
setting UV_TORCH_BACKEND) to instead avoid pointing UV at the nightly
index—either pin a specific torch wheel URL/version into UV_INDEX, skip setting
UV_INDEX and use an editable/local/explicit install for nightly builds, or add
logic to set UV_INDEX to a non-nightly/temporal mirror; update the branch that
checks matrix.target.toolkit, the UV_INDEX assignment, and any use of
CUDA_VERSION_MAJMIN_NODOT so nightly jobs don’t use
https://download.pytorch.org/whl/nightly/... directly.
♻️ Duplicate comments (1)
.github/workflows/dist.yml (1)

195-213: uv run --no-project bypasses the installed wheel in test-venv.

The test creates a venv and installs the wheel (line 203), but then uses uv run --no-project (line 207) to run the import test. The --no-project flag creates an ephemeral isolated environment that doesn't use the packages installed in test-venv, defeating the purpose of the validation.

Additionally, set -e is only inside the inner subshell (line 205), so wheel installation failures won't fail the outer loop.

🔎 Proposed fix using direct venv interpreter
       - name: Test built wheels
         run: |
+          set -e
           for WHEEL in wheelhouse/*.whl; do
             echo "Testing wheel: ${WHEEL}"
-            (
-              set -e
-              uv venv --python=3.12 test-venv
-              source test-venv/bin/activate
-              uv pip install --refresh -v "${WHEEL}"
-              (
-                set -e
-                cd /
-                uv run --no-project -- python -c "import tilelang; print(tilelang.__version__)"
-              )
-              deactivate
-              rm -rf test-venv
-            )
+            uv venv --python=3.12 test-venv
+            uv pip install --python test-venv --refresh -v "${WHEEL}"
+            # Test from root directory to ensure we're not importing from source
+            (cd / && "${GITHUB_WORKSPACE}/test-venv/bin/python" -c "import tilelang; print(tilelang.__version__)")
+            rm -rf test-venv
           done

This approach:

  • Adds set -e at the top level to fail fast on any error
  • Uses uv pip install --python test-venv to install directly into the venv without activation
  • Invokes test-venv/bin/python directly to test the installed wheel
  • Removes unnecessary nested subshells and activation/deactivation steps
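
To make the set -e point concrete, here is a standalone illustration in plain bash (outside GitHub Actions, which typically injects -e into run steps on its own):

#!/usr/bin/env bash
# Without `set -e` at the top level, a failure inside the subshell only
# terminates the subshell; the surrounding loop keeps iterating.
for wheel in first second; do
  (
    set -e
    false                       # stand-in for a failed wheel install
    echo "never reached"        # skipped: the subshell already exited
  )
  echo "loop continues after ${wheel}"   # prints on both iterations
done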
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a827b69 and dd7b82b.

📒 Files selected for processing (1)
  • .github/workflows/dist.yml
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-12-24T17:20:32.819Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:32.819Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.

Applied to files:

  • .github/workflows/dist.yml
📚 Learning: 2025-12-15T08:56:23.827Z
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.

Applied to files:

  • .github/workflows/dist.yml
📚 Learning: 2025-10-10T13:29:29.347Z
Learnt from: XuehaiPan
Repo: tile-ai/tilelang PR: 973
File: .github/workflows/ci.yml:13-15
Timestamp: 2025-10-10T13:29:29.347Z
Learning: In .github/workflows/ci.yml for tilelang (GitHub Actions), actions/cache@v4 and setup-python’s cache feature require GITHUB_TOKEN with actions: write to save caches; with a permissions block that only sets contents: read, unspecified actions permission becomes none, so caches will restore but not save.

Applied to files:

  • .github/workflows/dist.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with Nightly-CUDA-13.0
  • GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
  • GitHub Check: Build SDist
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with Nightly-CUDA-13.0
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
🔇 Additional comments (4)
.github/workflows/dist.yml (4)

116-117: LGTM! CUDA version change aligns with PyTorch nightly availability.

Changing from CUDA 13.1 to 13.0 ensures compatibility with available PyTorch nightly wheels. As mentioned in past discussions, the minor CUDA version is not critical when kernels are JIT-compiled.


178-178: LGTM! Updated to modern package manager.

Replacing yum with dnf is appropriate for recent Linux distributions used by cibuildwheel.


188-193: LGTM! Setup step for wheel testing.

The uv setup step is appropriate for the subsequent wheel testing workflow.


40-42: github.workspace is not interpolated in workflow-level env block.

The cache directory paths use ${{ github.workspace }}, but the workflow-level env block is evaluated before job contexts are available. These variables will contain the literal string ${{ github.workspace }} rather than the actual workspace path.

Move these environment variable definitions to job-level env blocks (under build-sdist and build-wheels jobs) where github.workspace is available, or use relative paths like .cache, .cache/pip, and .cache/uv.

🔎 Proposed fix to move cache paths to job-level env

Remove from workflow-level env (lines 40-42):

-  XDG_CACHE_HOME: "${{ github.workspace }}/.cache" # to be updated
-  PIP_CACHE_DIR: "${{ github.workspace }}/.cache/pip" # to be updated
-  UV_CACHE_DIR: "${{ github.workspace }}/.cache/uv" # to be updated

Then add to each job's env block (e.g., under build-sdist.env and build-wheels.env):

  XDG_CACHE_HOME: "${{ github.workspace }}/.cache"
  PIP_CACHE_DIR: "${{ github.workspace }}/.cache/pip"
  UV_CACHE_DIR: "${{ github.workspace }}/.cache/uv"

Or use relative paths in the workflow-level env:

-  XDG_CACHE_HOME: "${{ github.workspace }}/.cache" # to be updated
-  PIP_CACHE_DIR: "${{ github.workspace }}/.cache/pip" # to be updated
-  UV_CACHE_DIR: "${{ github.workspace }}/.cache/uv" # to be updated
+  XDG_CACHE_HOME: ".cache"
+  PIP_CACHE_DIR: ".cache/pip"
+  UV_CACHE_DIR: ".cache/uv"
⛔ Skipped due to learnings
Learnt from: XuehaiPan
Repo: tile-ai/tilelang PR: 973
File: .github/workflows/ci.yml:13-15
Timestamp: 2025-10-10T13:29:29.347Z
Learning: In .github/workflows/ci.yml for tilelang (GitHub Actions), actions/cache@v4 and setup-python’s cache feature require GITHUB_TOKEN with actions: write to save caches; with a permissions block that only sets contents: read, unspecified actions permission becomes none, so caches will restore but not save.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Fix all issues with AI Agents 🤖
In @.github/workflows/dist.yml:
- Around lines 40-42: Update the inline comments on the cache env vars in the
workflow: either remove the "# to be updated" comment if the XDG_CACHE_HOME,
PIP_CACHE_DIR, and UV_CACHE_DIR values are correct and final, or replace each
comment with a precise note describing what must change (e.g., target path, use
of runner temp, or reason for customization) so the intent is clear. Apply this
to all three entries in dist.yml.
♻️ Duplicate comments (2)
.github/workflows/dist.yml (2)

160-167: Critical: PyTorch nightly index bug remains unresolved.

This segment still sets UV_INDEX to the PyTorch nightly URL (line 162), which was flagged in previous reviews as triggering a known uv bug (issues #9651/#10307) causing infinite download loops. The stable builds (lines 163-166) set both UV_INDEX and UV_TORCH_BACKEND, but previous discussion suggests UV_TORCH_BACKEND alone may be sufficient for stable builds.

Consider:

  1. For nightly builds: Pin a specific torch wheel URL or use an alternative installation method to avoid the nightly index bug
  2. For stable builds: Evaluate whether UV_INDEX is needed when UV_TORCH_BACKEND is set, as the backend selection may be sufficient
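
Whichever option is chosen, a one-line post-install check (placement is a suggestion, not part of the current workflow) makes the resolved build visible in CI logs:

# Print the torch version and its CUDA toolkit (None for CPU-only builds).
python -c "import torch; print(torch.__version__, torch.version.cuda)"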

195-219: Critical: Test still uses uv run --no-project which bypasses the installed wheel.

The test creates and activates test-venv, installs the wheel into it (line 209), but then uses uv run --no-project (line 213) to run the import test. The --no-project flag creates an ephemeral environment that bypasses the venv and the installed wheel, defeating the purpose of the validation.

The fix from previous reviews remains valid: invoke the venv's Python interpreter directly instead of using uv run --no-project.

🔎 Recommended fix using direct venv Python
       - name: Test built wheels
         run: |
+          set -e
           for WHEEL in wheelhouse/*.whl; do
             echo "Testing wheel: ${WHEEL}"
-            (
-              set -e
-              uv venv --python=3.12 test-venv
-              source test-venv/bin/activate
-
-              uv pip install --upgrade pip setuptools wheel
-              if [[ "${UV_INDEX}" == *"/nightly/"* ]]; then
-                uv pip install --prerelease=allow -v torch
-              fi
-
-              uv pip install -v "${WHEEL}"
-              (
-                set -e
-                cd /
-                uv run --no-project -- python -c "import tilelang; print(tilelang.__version__)"
-              )
-              deactivate
-              rm -rf test-venv
-            )
+            uv venv --python=3.12 test-venv
+            uv pip install --python test-venv --upgrade pip setuptools wheel
+            if [[ "${UV_INDEX}" == *"/nightly/"* ]]; then
+              uv pip install --python test-venv --prerelease=allow -v torch
+            fi
+            uv pip install --python test-venv -v "${WHEEL}"
+            # Test from root to ensure we're not importing from source
+            (cd / && "${GITHUB_WORKSPACE}/test-venv/bin/python" -c "import tilelang; print(tilelang.__version__)")
+            rm -rf test-venv
           done

This eliminates venv activation/deactivation, uses --python test-venv for installs, and invokes the venv's Python directly.
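
For reference, the same activation-free pattern in isolation; rich is an arbitrary stand-in for the built wheel:

# Sketch: install into and run from a venv without ever activating it.
uv venv --python=3.12 demo-venv
uv pip install --python demo-venv rich          # rich stands in for the wheel
demo-venv/bin/python -c "import rich; print(rich.__version__)"
rm -rf demo-venv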

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0abf514 and 6d0d01e.

📒 Files selected for processing (1)
  • .github/workflows/dist.yml
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-12-24T17:20:32.819Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:32.819Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.

Applied to files:

  • .github/workflows/dist.yml
📚 Learning: 2025-12-15T08:56:23.827Z
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.

Applied to files:

  • .github/workflows/dist.yml
📚 Learning: 2025-10-10T13:29:29.347Z
Learnt from: XuehaiPan
Repo: tile-ai/tilelang PR: 973
File: .github/workflows/ci.yml:13-15
Timestamp: 2025-10-10T13:29:29.347Z
Learning: In .github/workflows/ci.yml for tilelang (GitHub Actions), actions/cachev4 and setup-python’s cache feature require GITHUB_TOKEN with actions: write to save caches; with a permissions block that only sets contents: read, unspecified actions permission becomes none, so caches will restore but not save.

Applied to files:

  • .github/workflows/dist.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Build SDist
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with CUDA-12.8
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-latest with Nightly-CUDA-13.0
  • GitHub Check: Build wheels for Python 3.9 on ubuntu-24.04-arm with Nightly-CUDA-13.0
  • GitHub Check: Build wheels for Python 3.9 on macos-latest with Metal
🔇 Additional comments (4)
.github/workflows/dist.yml (4)

66-72: LGTM!

The uv setup with caching is configured correctly.


116-117: LGTM!

The CUDA version update to 13.0 for nightly builds is appropriate.


178-178: LGTM!

Updating from yum to dnf is appropriate for modern RHEL/Fedora-based build containers.


188-194: LGTM!

The uv setup with caching is configured correctly for the build-wheels job.

@XuehaiPan XuehaiPan merged commit cfbc49b into tile-ai:main Jan 6, 2026
13 checks passed
@XuehaiPan XuehaiPan deleted the lazy-dlopen branch January 6, 2026 04:16


Development

Successfully merging this pull request may close these issues.

[Feature Request] Cannot import tilelang on pure CPU machine with CUDA build
