[Cuda] Move implementations from jit_cuda.h into jit_cuda.cpp, so easier to … #23
Conversation
…modify without triggering major recompilation
Pull Request Overview
This PR refactors jit_cuda.h by moving inline implementations into jit_cuda.cpp, reducing header recompilation when implementations change.
- Declarations for `JITModuleCUDA` and `JITSessionCUDA` methods are moved to the header.
- Method definitions are relocated to the `.cpp` file.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| taichi/runtime/cuda/jit_cuda.h | Stripped out inline implementations, leaving only declarations |
| taichi/runtime/cuda/jit_cuda.cpp | Added moved implementations for JITModuleCUDA and JITSessionCUDA |
bo-rc
left a comment
LGTM (moving implementations to .cpp files only).
Would be great to have PR titles that fit in one line. I think this is good practice that makes it more convenient to get a quick overview of the git tree.
duburcqa
left a comment
I don't get why it was implemented in the header before. It doesn't make any sense to me.
Fair. Feel free to refuse to Approve PRs until I modify the title to your satisfaction, so I can get a feel for what works :)
I know why, because I always do the same thing: it's easier to write a small class in a single file than to spend a few minutes splitting it across two and adding ...
Ok, I'm probably just so used to splitting declaration and definition that it feels completely natural to me.
I used to rather like generating the header file contents automatically: https://github.com/hughperkins/DeepCL/blob/46a7bba5eb9f0012d9d705e20c64848bad01285a/src/clmath/CLMathWrapper.h#L37-L55
Introduce the helper machinery that the per-class to_torch / to_numpy methods will migrate to in subsequent commits. Existing public symbols (`can_zerocopy`, `dlpack_to_torch`, `invalidate_zerocopy_cache`, `current_arch_is_cpu`) are preserved as deprecated shims so the in-tree pre-rework callers continue to work; they will be removed once every call site is migrated.

New surface:

- `_ZerocopyCache`: per-instance container with two independent slots (torch tensor + numpy ndarray), each filled lazily on first access via `torch.utils.dlpack.from_dlpack` and `numpy.from_dlpack` respectively. Numpy zero-copy now bypasses torch entirely (closes review #6).
- `make_zerocopy_cache_if_supported(owner, ...)`: constructs a cache when zero-copy is supported and registers `owner` with `pyquadrants.cache_holders` so invalidation is wired automatically (closes review #18).
- `get_zerocopy_torch` / `get_zerocopy_numpy`: thin entry points that implement the always-zerocopy-then-clone semantic (closes review #15, #16, #21) and the Apple Metal double-sync (`qd.sync()` on read AND `torch.mps.synchronize()` after `.clone()`/`.to()`; closes review #1, #22, #23).

Also applies the small lints from the review:

- Module-level constant for the torch>2.9.1 MPS bytes_offset probe; drops the pointless `lru_cache` wrapper around a zero-arg helper (closes review #2).
- ASCII '...' instead of Unicode horizontal ellipsis '\u2026' in the docstring (closes review #3).
- Top-level imports for numpy and torch (try/except for the no-torch CI case); no per-call lazy imports in the new code path (closes review #7, #9).

The deprecated shim still does what the existing per-class methods expect; the new helpers are torch-clean. `cache_holders` is still empty until the next commits register Ndarray / ScalarField / MatrixField; this commit alone is behaviourally a no-op.
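The two-slot lazy cache described in this commit message can be sketched as below. This is a minimal illustration of the pattern, not the actual pyquadrants code: the class name, slot layout, and callable-based export are assumptions, and the torch slot is left as a stub since only the numpy path is exercised here.

```python
import numpy as np


class ZerocopyCache:
    """Sketch: per-instance container with two independently lazy slots.

    Each slot is filled on first access via a zero-copy DLPack import.
    The numpy slot never goes through torch (hypothetical shape of the
    _ZerocopyCache described above, not the real implementation).
    """

    def __init__(self, export_dlpack):
        # export_dlpack: callable returning an object implementing __dlpack__
        self._export_dlpack = export_dlpack
        self._numpy = None
        self._torch = None  # would be filled via torch.utils.dlpack.from_dlpack

    def get_numpy(self):
        if self._numpy is None:
            # numpy.from_dlpack gives a zero-copy view of the exporter
            self._numpy = np.from_dlpack(self._export_dlpack())
        return self._numpy

    def invalidate(self):
        # drop both slots; next access rebuilds them lazily
        self._numpy = None
        self._torch = None


# Demo: a numpy array is itself a DLPack exporter, so it can stand in
# for the device buffer.
src = np.arange(6, dtype=np.float32)
cache = ZerocopyCache(lambda: src)
view = cache.get_numpy()
src[0] = 42.0  # mutating the source is visible through the zero-copy view
```

Keeping the two slots independent means a caller that only ever asks for numpy never pays for (or imports) torch, which is the point of review #6.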
Convert `Ndarray.to_torch` / `_ndarray_to_numpy` / `_ndarray_matrix_to_numpy` to thin calls into `_interop.get_zerocopy_torch` / `get_zerocopy_numpy`.

Behaviour changes vs. the previous PR state:

- `to_torch(copy=True)` now ALSO uses the zerocopy export and clones the result, instead of skipping zerocopy entirely. Zerocopy DLPack is cheaper than the kernel-copy fallback even when followed by a clone (closes review #15, #16, #21).
- `to_torch` on Apple Metal now does both syncs: `qd.sync()` before the read AND `torch.mps.synchronize()` after the `.clone()` so the cloned buffer is actually populated by the time the call returns (closes review #1, #22, #23).
- `to_numpy(copy=False)` now goes through `numpy.from_dlpack` directly -- no torch round-trip, no torch import required (closes review #6).
- `to_numpy` default (`copy=None`) keeps the current "independent copy" semantics: numpy arrays conventionally outlive their source, so the default is the safer one. Zero-copy is opt-in via `copy=False`.

Cache invalidation switches from the legacy `_qd_dlpack_tc` attribute to the per-class `_zerocopy_cache` (cached_property) + `_invalidate_zerocopy_cache` method. Registration in `pyquadrants.cache_holders` happens automatically on first cache access (closes review #18).

The existing `_reset` hook drops its zerocopy invalidation call: that work is now handled by the `cache_holders` loop in `impl.reset()` BEFORE C++ teardown, eliminating the use-after-free that motivated the Genesis `to_numpy()` always-copy workaround.
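The always-zerocopy-then-copy semantic this commit describes can be sketched with a numpy-only stand-in (hypothetical helper and argument names; the real code goes through DLPack and torch as well):

```python
import numpy as np


def to_numpy(export_dlpack, copy=None):
    """Sketch of the semantic above, not the real Ndarray method.

    copy=None / copy=True -> independent array: zero-copy export, then
                             materialise with .copy() (cheaper than a
                             kernel-copy fallback, per the commit message)
    copy=False            -> zero-copy view, no torch round-trip
    """
    view = np.from_dlpack(export_dlpack())
    if copy is False:
        return view
    return view.copy()


src = np.ones(4, dtype=np.float32)
aliased = to_numpy(lambda: src, copy=False)   # shares memory with src
independent = to_numpy(lambda: src)           # default: safe independent copy
src[0] = 7.0  # visible through `aliased`, not through `independent`
```

The default staying "independent copy" matches the lifetime argument above: numpy arrays conventionally outlive their source, so aliasing is opt-in.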
Convert `ScalarField.to_torch` / `to_numpy` to thin calls into `_interop.get_zerocopy_torch` / `get_zerocopy_numpy`, mirroring the Ndarray migration in the previous commit.

- Adds `Field._invalidate_zerocopy_cache` method (used by all subclasses) so the `impl.reset()` `cache_holders` loop has a single contract to call.
- Adds `ScalarField._zerocopy_cache` as a cached_property; constructed once per instance instead of re-checking `can_zerocopy` on every call (closes review #17).
- Registers ScalarField in `pyquadrants.cache_holders` on first cache access (closes review #18).
- `to_torch(copy=True)` now zerocopies and clones rather than skipping zerocopy entirely (closes review #15, #16, #21); `to_torch` on Metal does both syncs (closes review #1, #22, #23).
- `to_numpy(copy=False)` goes through `numpy.from_dlpack` directly, no torch round-trip and no torch import required (closes review #6).

Default `to_numpy` semantic (`copy=None`) remains "independent copy" for lifetime safety; zero-copy is opt-in via `copy=False`.
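The cached_property-plus-registration pattern this commit describes can be sketched as follows. Class and attribute names are taken from the commit message; the registry and reset loop are simplified stand-ins for `pyquadrants.cache_holders` and `impl.reset()`:

```python
import weakref
from functools import cached_property

# stand-in for pyquadrants.cache_holders; WeakSet so holders can be GC'd
cache_holders = weakref.WeakSet()


class Field:
    """Single invalidation contract shared by all field subclasses."""

    def _invalidate_zerocopy_cache(self):
        # cached_property stores its value in the instance dict;
        # popping it forces a rebuild on next access
        self.__dict__.pop("_zerocopy_cache", None)


class ScalarField(Field):
    @cached_property
    def _zerocopy_cache(self):
        # constructed once per instance; registration happens automatically
        # on first cache access, so callers never wire invalidation manually
        cache_holders.add(self)
        return {}  # placeholder for the real per-instance cache object


def reset():
    # mimics the impl.reset() loop: invalidate every registered holder
    # BEFORE runtime teardown, so no cached view outlives its buffer
    for holder in list(cache_holders):
        holder._invalidate_zerocopy_cache()


f = ScalarField()
_ = f._zerocopy_cache  # first access builds the cache and registers f
reset()                # cache dropped; next access would rebuild and re-register
```

Running invalidation before teardown (rather than in a per-object `_reset` hook after it) is what removes the use-after-free window mentioned in the previous commit.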
Convert `MatrixField.to_torch` / `to_numpy` to thin calls into `_interop.get_zerocopy_torch` / `get_zerocopy_numpy`.

Adds a single `_matrix_view_shape()` helper that returns `(expected_shape, as_vector)`; both `to_torch` and `to_numpy` now go through it instead of duplicating the n / m / keep_dims branching (closes review #13, #14).

Same wins as the ScalarField migration:

- `_zerocopy_cache` as cached_property (closes review #17)
- automatic registration with `pyquadrants.cache_holders` (closes #18)
- `copy=True` now zerocopies + clones (closes #15, #16, #21)
- `to_numpy(copy=False)` goes through `numpy.from_dlpack` directly, no torch round-trip (closes #6)
- Apple Metal double-sync via the centralised helpers (closes #1, #22, #23)

The redundant reshape after zerocopy stays (DLPack returns the field's flat n-d shape; numpy/torch users expect the matrix dims appended) but it's now in one place per method instead of two.
Issue: #
Brief Summary
No code changes. Just moves implementations from .h to .cpp, so they are easier to modify without triggering major recompilation.
copilot:summary
Walkthrough
copilot:walkthrough