
[Cuda] Move implementations from jit_cuda.h into jit_cuda.cpp, so easier to … #23

Merged: hughperkins merged 7 commits into main from hp/move-jit-cuda-implementationt-to-cpp, Jun 11, 2025

Conversation

@hughperkins
Collaborator

…modify without triggering major recompilation

Issue: #

Brief Summary

No code changes: just moves the implementations from .h to .cpp, so they are easier to modify without triggering major recompilation.

copilot:summary

Walkthrough

copilot:walkthrough

@hughperkins hughperkins requested a review from Copilot June 10, 2025 02:20

This comment was marked as outdated.

@hughperkins hughperkins requested a review from Copilot June 10, 2025 02:23

This comment was marked as outdated.

@hughperkins hughperkins requested a review from Copilot June 10, 2025 02:25

Copilot AI left a comment


Pull Request Overview

This PR refactors jit_cuda.h by moving inline implementations into jit_cuda.cpp, reducing header recompilation when implementations change.

  • Declarations for JITModuleCUDA and JITSessionCUDA methods remain in the header.
  • Method definitions are relocated to the .cpp file.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File | Description
taichi/runtime/cuda/jit_cuda.h | Stripped out inline implementations, leaving only declarations
taichi/runtime/cuda/jit_cuda.cpp | Added the moved implementations for JITModuleCUDA and JITSessionCUDA

Contributor

@bo-rc bo-rc left a comment


LGTM (moving implementations to .cpp files only).

@duburcqa
Contributor

Would be great to have PR titles that fit in one line. I think this is good practice that makes it more convenient to get a quick overview of the git tree.

Contributor

@duburcqa duburcqa left a comment


I don't get why it was implemented in the header before. It doesn't make any sense to me.

@hughperkins
Collaborator Author

Would be great to have PR titles that fit in one line. I think this is good practice that makes it more convenient to get a quick overview of the git tree.

Fair. Feel free to refuse to Approve PRs until I modify the title to your satisfaction, so I can get a feel for what works :)

@hughperkins
Collaborator Author

I don't get why it was implemented in the header before. It doesn't make any sense to me.

I know why, because I always do the same thing: it's easier to write a small class in a single file than to spend a few minutes splitting it across two files and adding [Classname]:: in front of each method name. (Then later, once it starts getting annoying, I eventually split it into two files.)

@hughperkins hughperkins enabled auto-merge (squash) June 10, 2025 21:56
@duburcqa
Contributor

duburcqa commented Jun 10, 2025

Ok, I'm probably just too used to splitting declaration and definition that it feels completely natural to me.

@hughperkins
Collaborator Author

Ok, I'm probably just too used to splitting declaration and definition that it feels completely natural to me.

I used to rather like generating the header file contents automatically https://github.com/hughperkins/DeepCL/blob/46a7bba5eb9f0012d9d705e20c64848bad01285a/src/clmath/CLMathWrapper.h#L37-L55


@hughperkins hughperkins merged commit ea7c78f into main Jun 11, 2025
16 of 17 checks passed
@hughperkins hughperkins deleted the hp/move-jit-cuda-implementationt-to-cpp branch June 11, 2025 00:25
hughperkins added a commit that referenced this pull request Apr 26, 2026
Introduce the helper machinery that the per-class to_torch / to_numpy methods
will migrate to in subsequent commits. Existing public symbols (can_zerocopy,
dlpack_to_torch, invalidate_zerocopy_cache, current_arch_is_cpu) are preserved
as deprecated shims so the in-tree pre-rework callers continue to work; they
will be removed once every call site is migrated.

New surface:

- _ZerocopyCache: per-instance container with two independent slots (torch
  tensor + numpy ndarray), each filled lazily on first access via
  torch.utils.dlpack.from_dlpack and numpy.from_dlpack respectively. Numpy
  zero-copy now bypasses torch entirely (closes review #6).

- make_zerocopy_cache_if_supported(owner, ...): constructs a cache when
  zero-copy is supported and registers `owner` with `pyquadrants.cache_holders`
  so invalidation is wired automatically (closes review #18).

- get_zerocopy_torch / get_zerocopy_numpy: thin entry points that implement
  the always-zerocopy-then-clone semantic (closes review #15, #16, #21) and
  the Apple Metal double-sync (qd.sync() on read AND torch.mps.synchronize()
  after .clone()/.to(); closes review #1, #22, #23).

Also applies the small lints from the review:
- Module-level constant for the torch>2.9.1 MPS bytes_offset probe; drops the
  pointless lru_cache wrapper around a zero-arg helper (closes review #2).
- ASCII '...' instead of Unicode horizontal ellipsis '\u2026' in the docstring
  (closes review #3).
- Top-level imports for numpy and torch (try/except for the no-torch CI
  case); no per-call lazy imports in the new code path (closes review #7,
  #9). The deprecated shim still does what the existing per-class methods
  expect; the new helpers are torch-clean.

cache_holders is still empty until the next commits register Ndarray /
ScalarField / MatrixField; this commit alone is behaviourally a no-op.
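The two-slot lazy cache described above can be sketched roughly as follows. This is a hypothetical illustration of the idea, not the actual _ZerocopyCache code: the real slots are filled by torch.utils.dlpack.from_dlpack and numpy.from_dlpack, while here the loaders are injected as callables so the lazy/invalidate behaviour is visible without torch or numpy installed.

```python
class ZerocopyCache:
    """Per-instance cache with two independent, lazily filled slots.

    Sketch of the _ZerocopyCache idea: each loader is called at most
    once, on first access, and invalidate() drops both slots so the
    next access re-exports a fresh zero-copy view.
    """

    def __init__(self, torch_loader, numpy_loader):
        # In the real class these would be DLPack exports of the
        # owning field/ndarray; here they are injected for clarity.
        self._torch_loader = torch_loader
        self._numpy_loader = numpy_loader
        self._torch = None
        self._numpy = None

    def get_torch(self):
        if self._torch is None:
            self._torch = self._torch_loader()
        return self._torch

    def get_numpy(self):
        # Independent of the torch slot: numpy zero-copy does not
        # round-trip through torch.
        if self._numpy is None:
            self._numpy = self._numpy_loader()
        return self._numpy

    def invalidate(self):
        # Called before backend teardown so no stale views survive.
        self._torch = None
        self._numpy = None
```

The key property is that the numpy and torch slots fail or fill independently, matching the "numpy zero-copy now bypasses torch entirely" point above.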
hughperkins added a commit that referenced this pull request Apr 26, 2026
Convert Ndarray.to_torch / _ndarray_to_numpy / _ndarray_matrix_to_numpy
to thin calls into _interop.get_zerocopy_torch / get_zerocopy_numpy.

Behaviour changes vs. the previous PR state:

- to_torch(copy=True) now ALSO uses the zerocopy export and clones the
  result, instead of skipping zerocopy entirely. Zerocopy DLPack is
  cheaper than the kernel-copy fallback even when followed by a clone
  (closes review #15, #16, #21).

- to_torch on Apple Metal now does both syncs: qd.sync() before the read
  AND torch.mps.synchronize() after the .clone() so the cloned buffer
  is actually populated by the time the call returns (closes review
  #1, #22, #23).

- to_numpy(copy=False) now goes through numpy.from_dlpack directly --
  no torch round-trip, no torch import required (closes review #6).

- to_numpy default (copy=None) keeps the current "independent copy"
  semantics: numpy arrays conventionally outlive their source, so the
  default is the safer one. Zero-copy is opt-in via copy=False.

Cache invalidation switches from the legacy _qd_dlpack_tc attribute to
the per-class _zerocopy_cache (cached_property) + _invalidate_zerocopy_cache
method. Registration in pyquadrants.cache_holders happens automatically
on first cache access (closes review #18).

The existing _reset hook drops its zerocopy invalidation call: that work
is now handled by the cache_holders loop in impl.reset() BEFORE C++
teardown, eliminating the use-after-free that motivated the Genesis
to_numpy() always-copy workaround.
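The "always zerocopy, then clone" semantic for copy=True can be sketched as below. This is a simplified stand-in for the get_zerocopy_torch entry point named above, with the cache and tensor types stubbed out; the real helper also handles the Apple Metal double-sync, omitted here.

```python
def get_zerocopy_torch(cache, copy):
    """Return the cached zero-copy view; clone it when copy=True.

    Sketch only: `cache` stands in for the per-instance zerocopy
    cache, and the returned object for a torch tensor. The point is
    that the DLPack export is reused even for copy=True, because a
    zerocopy export followed by .clone() is cheaper than a
    kernel-copy fallback.
    """
    view = cache.get_torch()   # cheap, cached DLPack export
    if copy:
        return view.clone()    # independent buffer on request
    return view                # shared-memory view otherwise
```

A caller asking for copy=False gets the shared view itself, so mutations are visible on both sides; copy=True callers get a detached buffer but still avoid the slower copy path.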
hughperkins added a commit that referenced this pull request Apr 26, 2026
Convert ScalarField.to_torch / to_numpy to thin calls into
_interop.get_zerocopy_torch / get_zerocopy_numpy, mirroring the Ndarray
migration in the previous commit.

- Adds Field._invalidate_zerocopy_cache method (used by all subclasses)
  so the impl.reset() cache_holders loop has a single contract to call.
- Adds ScalarField._zerocopy_cache as a cached_property; constructed once
  per instance instead of re-checking can_zerocopy on every call (closes
  review #17).
- Registers ScalarField in pyquadrants.cache_holders on first cache
  access (closes review #18).
- to_torch(copy=True) now zerocopies and clones rather than skipping
  zerocopy entirely (closes review #15, #16, #21); to_torch on Metal
  does both syncs (closes review #1, #22, #23).
- to_numpy(copy=False) goes through numpy.from_dlpack directly, no
  torch round-trip and no torch import required (closes review #6).

Default to_numpy semantic (copy=None) remains "independent copy" for
lifetime safety; zero-copy is opt-in via copy=False.
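The cached_property plus register-on-first-access pattern described in this commit can be sketched like this. The class name and the module-level `cache_holders` list are simplified stand-ins for the real ScalarField and pyquadrants.cache_holders, assumed for illustration.

```python
from functools import cached_property

# Stand-in for pyquadrants.cache_holders: the set of objects whose
# zerocopy caches must be invalidated before backend teardown.
cache_holders = []

class ScalarFieldLike:
    @cached_property
    def _zerocopy_cache(self):
        # Constructed once per instance. Registering here, on first
        # access, wires invalidation automatically: objects that never
        # hand out a zero-copy view never appear in the reset loop.
        if self not in cache_holders:
            cache_holders.append(self)
        return {}  # placeholder for the real cache object

    def _invalidate_zerocopy_cache(self):
        # Single contract called by the impl.reset() loop; dropping
        # the cached_property value forces a rebuild on next access.
        self.__dict__.pop("_zerocopy_cache", None)
```

Deleting the entry from `__dict__` is how a `cached_property` is reset in standard Python, so the next attribute access re-runs the property body and builds a fresh cache.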
hughperkins added a commit that referenced this pull request Apr 26, 2026
Convert MatrixField.to_torch / to_numpy to thin calls into
_interop.get_zerocopy_torch / get_zerocopy_numpy.

Adds a single _matrix_view_shape() helper that returns (expected_shape,
as_vector); both to_torch and to_numpy now go through it instead of
duplicating the n / m / keep_dims branching (closes review #13, #14).

Same wins as the ScalarField migration:
- _zerocopy_cache as cached_property (closes review #17)
- automatic registration with pyquadrants.cache_holders (closes #18)
- copy=True now zerocopies + clones (closes #15, #16, #21)
- to_numpy(copy=False) goes through numpy.from_dlpack directly, no torch
  round-trip (closes #6)
- Apple Metal double-sync via the centralised helpers (closes #1, #22, #23)

The redundant reshape after zerocopy stays (DLPack returns the field's flat
n-d shape; numpy/torch users expect the matrix dims appended) but it's now
in one place per method instead of two.
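The shared _matrix_view_shape() helper might look roughly like the sketch below. The exact branching is an assumption based on the usual vector/matrix convention (an n x 1 matrix with keep_dims=False exports as a vector), not taken from the actual diff.

```python
def matrix_view_shape(field_shape, n, m, keep_dims):
    """Return (expected_shape, as_vector) for the exported view.

    Sketch of the single helper both to_torch and to_numpy route
    through, replacing the duplicated n / m / keep_dims branching.
    `field_shape` is the field's n-d index shape; (n, m) are the
    per-element matrix dims.
    """
    as_vector = (m == 1 and not keep_dims)
    if as_vector:
        # n x 1 matrices collapse to vectors: append only (n,).
        return tuple(field_shape) + (n,), True
    # Otherwise append the full matrix dims (n, m).
    return tuple(field_shape) + (n, m), False
```

Centralising this keeps the reshape-after-zerocopy (DLPack hands back the flat field shape) consistent between the torch and numpy paths.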