
TracyCUDA: show GPU zones for CUDA Graph-launched kernels #1330

Merged
wolfpld merged 21 commits into wolfpld:master from bmilanich:cuda-graph-fallback
Apr 15, 2026
Conversation

@bmilanich
Contributor

@bmilanich bmilanich commented Apr 6, 2026

Problem

When kernels are launched via CUDA Graphs (cudaGraphLaunch / cuGraphLaunch), Tracy shows 0 GPU zones. CUPTI delivers CONCURRENT_KERNEL and MEMCPY activity records but no per-kernel API callback fires, so matchActivityToAPICall() always fails and every zone is silently dropped.

Root cause

Each activity record has a correlationId that Tracy matches against API callbacks stored in cudaCallSiteInfo. For CUDA Graph launches there are no individual-kernel callbacks, so the map has no entries for those correlation IDs.

Key CUPTI finding (discovered during implementation)

All kernels in one cuGraphLaunch call share the same correlationId as the launch itself. This was confirmed by instrumenting both the API callback and activity buffer simultaneously:

cudaGraphLaunch ENTER corr=9
KERNEL graphId=2  corr=9   ← same as launch
KERNEL graphId=2  corr=9   ← same as launch

This means the fix is straightforward once cuGraphLaunch is tracked in the API callback machinery.

Note: CUPTI_ACTIVITY_KIND_GRAPH_TRACE was investigated but must not be enabled — it suppresses individual CONCURRENT_KERNEL records for graph-launched kernels, replacing them with graph-level summaries.

Fix

Core graph correlation

  • Add cudaGraphLaunch_v10000 / cuGraphLaunch (both runtime and driver API variants) to the existing callback tracker maps so their CPU call site is captured in cudaCallSiteInfo
  • Introduce matchGraphActivityToAPICall() helper (analogous to matchActivityToAPICall()):
    • For the first kernel/memcpy/memset from a graph launch, matchActivityToAPICall succeeds normally (consuming the launch entry) and the result is cached in cudaGraphCurrentLaunch[graphId]
    • Subsequent operations from the same launch (same correlationId, same graphId) find the cached APICallInfo via graphId
    • Each new launch of the same graph overwrites the cache entry via insert_or_assign, which acquires the write lock only once
    • If neither path succeeds, the caller emits a diagnostic matchError() in the profiler timeline (same behaviour as for non-graph activity)
  • using GraphID = uint32_t typedef used throughout for graphId-typed variables
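The two-step lookup described above can be sketched in plain C++. This is a hypothetical, simplified model under assumed names: `apiCalls` stands in for `cudaCallSiteInfo` (keyed by correlationId) and `cache` for `cudaGraphCurrentLaunch` (keyed by graphId); the real helper operates on Tracy's locked containers.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_map>

using GraphID = uint32_t;

struct APICallInfo { uint64_t cpuTimestamp; };  // stand-in for the real struct

std::optional<APICallInfo> matchGraphActivity(
    uint32_t correlationId, GraphID graphId,
    std::unordered_map<uint32_t, APICallInfo>& apiCalls,
    std::unordered_map<GraphID, APICallInfo>& cache )
{
    // First activity of a launch: the cuGraphLaunch entry is still present.
    auto it = apiCalls.find( correlationId );
    if( it != apiCalls.end() )
    {
        APICallInfo info = it->second;
        apiCalls.erase( it );                    // entry is consumed, as usual
        cache.insert_or_assign( graphId, info ); // remember for sibling nodes
        return info;
    }
    // Subsequent activities of the same launch: fall back to the graphId cache.
    auto cit = cache.find( graphId );
    if( cit != cache.end() ) return cit->second;
    return std::nullopt;  // caller emits matchError()
}
```

A later launch of the same graph takes the first branch again (new correlationId), so `insert_or_assign` refreshes the cache entry before the fallback path is ever consulted.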

Cache retirement (prevents unbounded growth)

  • On cudaGraphExecDestroy (ENTER callback, while handle is still valid): call cuptiGetGraphExecId to translate exec handle → graphId and add to a pending-retirement set (graphExecPendingRetire)
  • Erasure is deferred to OnBufferCompleted because cudaGraphExecDestroy doesn't wait for GPU completion — CUPTI may still have undelivered activity records in its buffers
  • A graphId is erased only when a full buffer arrives containing no records bearing that graphId, indicating all in-flight records have been delivered
  • getGraphIdFromRecord() helper extracts the graphId field from CONCURRENT_KERNEL / MEMCPY / MEMSET activity records for per-buffer tracking
  • std::atomic<bool> graphRetirePending dirty flag lets OnBufferCompleted skip the retirement mutex on the hot path (no pending retirements = no lock acquired)
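The deferred-retirement idea can be illustrated with a minimal sketch, assuming simplified names (`pendingRetire`, `onBufferCompleted`) and a plain set of graphIds observed per buffer; the real code walks CUPTI activity records via getGraphIdFromRecord().

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_set>
#include <vector>

using GraphID = uint32_t;

struct GraphRetirement
{
    std::atomic<bool> pending { false };      // hot-path dirty flag
    std::mutex mutex;
    std::unordered_set<GraphID> pendingRetire;

    // Called from the cudaGraphExecDestroy ENTER callback.
    void onExecDestroy( GraphID id )
    {
        std::lock_guard lock( mutex );
        pendingRetire.insert( id );
        pending.store( true, std::memory_order_release );
    }

    // idsInBuffer: graphIds seen in the just-completed activity buffer.
    // Returns the ids now safe to erase from the launch cache.
    std::vector<GraphID> onBufferCompleted( const std::unordered_set<GraphID>& idsInBuffer )
    {
        std::vector<GraphID> retired;
        if( !pending.load( std::memory_order_acquire ) ) return retired; // no lock taken
        std::lock_guard lock( mutex );
        for( auto it = pendingRetire.begin(); it != pendingRetire.end(); )
        {
            if( idsInBuffer.count( *it ) == 0 )  // no in-flight records remain
            {
                retired.push_back( *it );
                it = pendingRetire.erase( it );
            }
            else ++it;
        }
        if( pendingRetire.empty() ) pending.store( false, std::memory_order_release );
        return retired;
    }
};
```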

ConcurrentHashMap fixes

  • ConcurrentHashMap: added insert_or_assign wrapping std::unordered_map::insert_or_assign under a single write lock
  • ConcurrentHashMap::fetch(): fixed a pre-existing missing read lock — mapping.find() was called without holding any lock, a data race now exercised on every graph activity record

Memory tracking cleanup

  • Remove cudaMalloc/cudaFree/cuMemAlloc_v2/cuMemFree_v2 (6 CBIDs total) from cbidRuntimeTrackers / cbidDriverTrackers — the MEMORY2 handler never calls matchActivityToAPICall and never emits GPU zones, so these entries leaked indefinitely
  • MEMORY2 handler: removed the early return on failed matchActivityToAPICall. CUpti_ActivityMemory3 has no graphId field, so graph-launched alloc nodes (cudaGraphAddMemAllocNode) and pre-profiling allocations can never be correlated to an API call. The handler only needs address, size, and timestamp from the activity record

Other

  • CUPTI_ACTIVITY_KIND_SYNCHRONIZATION handler: confirmed correct as-is. In-graph event record/wait nodes produce no SYNCHRONIZATION activity records (empirically verified on H100)
  • examples/CUDAGraphRepro/Makefile: added -arch=native so NVCC auto-detects the target GPU architecture

Results

Tested on H100, CUDA 13.1, driver 580.105.08.

Single graph (10 launches × 3 nodes):

Tracy version GPU zones
Unpatched 0
This PR 30 ✅ (20 vector_add + 10 CUDA::memcpy)

Multiple graphs, interleaved launches — two distinct graphs launched alternately (A, B, A, B, …) to stress graphId cache switching:

  • Graph A: kernel(add) + memcpy + kernel(add) — 3 nodes, 5 launches = 15 zones
  • Graph B: kernel(scale) + kernel(add) + kernel(scale) — 3 nodes, 5 launches = 15 zones
Zone type Expected Got
vector_add 15 15 ✅
vector_scale 10 10 ✅
CUDA::memcpy (graph) 5 5 ✅
Computation result c[0] 6.0 6.0 ✅

Regular (non-graph) CUDA operations — verified that memory CBID tracker removal doesn't regress normal profiling:

Metric Expected Got
GPU zones 31 31 ✅
Memory allocs 36 36 ✅
Memory frees 36 36 ✅

GPU zones are correctly correlated to the cuGraphLaunch CPU call site: selecting a GPU zone highlights the corresponding CPU time range in the timeline.

Debug build of tracy-capture (asserts enabled) confirmed no assertions fire at server/TracyWorker.cpp:5988.

Note on GPU zone source locations

All CUDA Graph GPU zones point to TracyCUDA.hpp rather than the user's call site. This is the same behaviour as regular CUDA zones — CUPTI delivers activity records asynchronously in a background thread that has no knowledge of the original call stack.

Files changed

  • public/tracy/TracyCUDA.hpp — core fix
  • examples/CUDAGraphRepro/repro.cu — updated to use TracyCUDA headers and verify zones
  • examples/CUDAGraphRepro/Makefile — release and debug variants; -arch=native for GPU arch detection

🤖 Generated with Claude Code

bmilanich and others added 7 commits March 24, 2026 09:24
When kernels are launched via CUDA Graphs (cuGraphLaunch), CUPTI delivers
CONCURRENT_KERNEL, MEMCPY, and MEMSET activity records but no
corresponding API callback fires for the individual operations. This
means matchActivityToAPICall() always fails, and every GPU activity
record is silently dropped by matchError().

Fix this by falling back to a synthetic APICallInfo using the GPU
timestamps from the activity record when no API correlation exists.
This produces correct GPU zones with kernel names and timing — just
without the CPU-to-GPU launch correlation arrow.

Tested on NVIDIA H100 with CUDA 13.1: before this fix, 0 GPU zones
appeared for CUDA Graph workloads; after, all kernel and memcpy zones
are visible in the Tracy timeline.
Minimal reproducer showing that CUDA Graph-launched kernels produce
0 GPU zones in Tracy. The repro creates a simple graph (2 kernels +
1 memcpy), launches it 10 times, and expects ~30 GPU zones. Without
the fallback patch, all activity records are dropped by matchError().

Tested on NVIDIA H100, CUDA 13.1.
Replace the synthetic APICallInfo hack with proper correlation via
CUPTI_ACTIVITY_KIND_GRAPH_TRACE. When cuGraphLaunch fires an API
callback, its correlationId is stored in cudaCallSiteInfo. The
GRAPH_TRACE activity record carries the same correlationId plus the
graphId, which lets us build a graphId→APICallInfo map. Kernel/memcpy/
memset activities then look up this map via their graphId field.

Key changes:
- Add cuGraphLaunch/cuGraphLaunch_ptsz to cbidDriverTrackers so the
  API callback machinery captures the CPU call site
- Enable CUPTI_ACTIVITY_KIND_GRAPH_TRACE and handle it in
  DoProcessDeviceEvent to populate cudaGraphCurrentLaunch[graphId]
- Add cudaGraphCurrentLaunch map to PersistentState
- Two-pass buffer processing in OnBufferCompleted so GRAPH_TRACE
  records (which complete last on GPU) are processed before the
  kernel/memcpy/memset records that depend on them
- Replace graphId=0 fallback in kernel/memcpy/memset with proper
  cudaGraphCurrentLaunch lookup; fall through to matchError if
  the graphId is not found
- Update repro to include TracyCUDA headers and properly test
  GPU zone correlation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The repro uses cudaGraphLaunch (runtime API) not cuGraphLaunch (driver
API). Add cudaGraphLaunch_v10000 and its _ptsz variant to
cbidRuntimeTrackers so that graphs launched via the runtime API also
get their CPU call site captured for GPU zone correlation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CUPTI discovery: all kernels launched by one cuGraphLaunch share the
same correlationId as the launch call itself. GRAPH_TRACE was the
wrong approach — enabling it suppresses per-kernel CONCURRENT_KERNEL
records entirely, replacing them with graph-level summaries.

New approach:
- Drop CUPTI_ACTIVITY_KIND_GRAPH_TRACE (it conflicts with CONCURRENT_KERNEL)
- Drop two-pass buffer processing (no longer needed)
- On the first kernel/memcpy/memset from a graph launch, matchActivityToAPICall
  succeeds (consuming the cuGraphLaunch entry) and the result is cached in
  cudaGraphCurrentLaunch[graphId]
- Subsequent operations from the same launch find the cached entry via graphId

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests:
- Two distinct graphs (different graphIds) on the same stream
- Graph A: kernel + memcpy + kernel (3 nodes)
- Graph B: scale + add + scale (3 nodes)
- 5 interleaved launches of each, stressing the graphId cache
- Expected 30 graph GPU zones total

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The kernel, memcpy, and memset cases all had identical logic for
handling graph-launched activities. Extract it into a single helper
next to matchActivityToAPICall.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
bmilanich and others added 3 commits April 6, 2026 13:33
NVCC 13.1 defaults to a PTX version incompatible with the installed
driver (580.105.08), causing kernels to silently fail with "provided
PTX was compiled with an unsupported toolchain". Use -arch=native so
NVCC auto-detects the target GPU (H100, sm_90) at build time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The cudaGraphCurrentLaunch cache update was acquiring the write lock
twice (once for erase, once for emplace). Wrapping
std::unordered_map::insert_or_assign under a single write lock lets
the caller do it in one operation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…helper

- Fix wrong comments on graph launch tracker entries: they claimed
  correlation works "via CUPTI_ACTIVITY_KIND_GRAPH_TRACE", but that
  approach was rejected (GRAPH_TRACE suppresses per-kernel records).
  The actual mechanism is the shared correlationId across all nodes
  in one graph launch.
- Fix ConcurrentHashMap::fetch() missing its read lock — a pre-existing
  data race now exercised by the new graph correlation hot path.
- Cache PersistentState::Get().cudaGraphCurrentLaunch in a local ref
  inside matchGraphActivityToAPICall instead of calling Get() twice.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@bmilanich
Contributor Author

Here is how it looks in the profiler now:
[Screenshot: Tracy timeline showing CUDA Graph GPU zones]

@bmilanich bmilanich marked this pull request as ready for review April 6, 2026 19:23
@wolfpld
Owner

wolfpld commented Apr 7, 2026

Please review if these apply:

  1. SYNCHRONIZATION and MEMORY2 handlers not updated — These still call matchActivityToAPICall directly. If graph-launched sync events have a non-zero graphId, they'd be silently dropped the same way kernels were. Should these also use matchGraphActivityToAPICall, or is there a reason they don't need it? (Worth a comment if intentional.)
  2. Stale cache entries in cudaGraphCurrentLaunch — No cleanup when a graph is destroyed. Since CUPTI reuses graphId values, a stale entry could theoretically be consumed by a later unrelated graph launch on the same graphId. The insert_or_assign overwrite handles same-graph relaunches, but a destroyed-then-recreated graph with a recycled graphId could inherit stale APICallInfo. Consider adding a note or defensive check.

bmilanich and others added 2 commits April 7, 2026 15:30
CUpti_ActivityMemory3 has no graphId field, so matchGraphActivityToAPICall
cannot be applied. Graph-launched cudaGraphAddMemAllocNode emits multiple
MEMORY2 records sharing the launch correlationId; only the first is
tracked, subsequent ones fire a spurious matchError.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CUpti_ActivityMemory3 has no graphId field, so graph-launched alloc
nodes and pre-profiling allocations can't be correlated to an API call.
The handler only needs address, size, and timestamp from the activity
record — apiCall is never used. Remove the early return so memory
tracking works in all cases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@bmilanich
Contributor Author

bmilanich commented Apr 7, 2026

Please review if these apply:

  1. SYNCHRONIZATION and MEMORY2 handlers not updated — These still call matchActivityToAPICall directly. If graph-launched sync events have a non-zero graphId, they'd be silently dropped the same way kernels were. Should these also use matchGraphActivityToAPICall, or is there a reason they don't need it? (Worth a comment if intentional.)

I had a separate patch for memory tracking with graphs, so I can add it here just as well: there is no graphId in CUpti_ActivityMemory3, so the solution is to proceed with memory tracking regardless of whether matchActivityToAPICall() matches (I have updated the PR with it).
Graphs do not produce SYNCHRONIZATION activity in CUPTI; their synchronization stays inside the GPU, so there is no need to update the SYNCHRONIZATION handler.

  1. Stale cache entries in cudaGraphCurrentLaunch — No cleanup when a graph is destroyed. Since CUPTI reuses graphId values, a stale entry could theoretically be consumed by a later unrelated graph launch on the same graphId. The insert_or_assign overwrite handles same-graph relaunches, but a destroyed-then-recreated graph with a recycled graphId could inherit stale APICallInfo. Consider adding a note or defensive check.

It does not seem to be an issue: a new graph reusing the same graphId will still have a different correlationId, so it will succeed in matchGraphActivityToAPICall() and replace the stale cache entry.

Calling matchActivityToAPICall in the MEMORY2 handler was consuming
the cudaCallSiteInfo entry for the graph launch correlationId. If a
graph mixes alloc nodes with kernel/memcpy nodes, all activities share
the same correlationId — consuming it here would cause
matchGraphActivityToAPICall to fail for the kernel/memcpy records that
follow, silently dropping their GPU zones.

Since apiCall is never used by the MEMORY2 handler (only address, size,
and timestamp from the activity record are needed), remove the call
entirely and leave the entry for the kernel/memcpy to consume and cache.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@slomp slomp self-requested a review April 7, 2026 21:38
@slomp
Collaborator

slomp commented Apr 8, 2026

@bmilanich Thank you for the PR!
I will try to carve some time next week to review/test it properly, but from a first glance, I like the solution.

Quick hypothetical question:
What happens if one calls cuGraphLaunch() while there's another cuGraphLaunch() on the same graphId still in-flight? Presumably, CUDA should give the second call a different correlationId (as it would for other API calls), but I could not find definitive documentation on that.

bmilanich and others added 2 commits April 9, 2026 09:28
Tests two questions:
1. Does relaunching the same cudaGraphExec produce a new correlationId
   each time, or is it reused?
2. Do two different cudaGraphExec handles from the same cudaGraph share
   a graphId?

Results on H100, CUDA 13.1:
- Each launch of the same exec handle gets a strictly unique, monotonically
  increasing correlationId. CPU callback corrId == GPU activity corrId.
  This is formally documented in cupti_activity.h:
    "Each graph launch is assigned a unique correlation ID that is
     identical to the correlation ID in the driver API activity record
     that launched the graph."
- graphId identifies the exec handle (instantiation), not the graph
  definition. Two cudaGraphInstantiate calls on the same graph produce
  different graphIds.

These findings confirm that the cudaGraphCurrentLaunch cache in
matchGraphActivityToAPICall is always refreshed by the first activity
of each new launch before the graphId fallback path is ever used.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests whether CUPTI recycles graphId values after cudaGraphExecDestroy,
which would be the only scenario where the graphLaunchCache in TracyCUDA
could serve stale entries for a non-matching exec handle.

Result (H100, CUDA 12, CUPTI): graphId is a monotonically increasing
counter that is never recycled. 22 create/instantiate/launch/destroy
cycles produced unique IDs ranging from 2 to 65 (incrementing by 3 per
cycle — one unit per node created during graph construction).

This confirms that the stale-cache concern raised in code review is not
a real risk in practice: two distinct exec handles always have distinct
graphIds.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@bmilanich
Contributor Author

@bmilanich Thank you for the PR! I will try to carve some time next week to review/test it properly, but from a first glance, I like the solution.

Quick hypothetical question: What happens if one calls cuGraphLaunch() while there's another cuGraphLaunch() on the same graphId still in-flight? Presumably, CUDA should give the second call a different correlationId (as it would for other API calls), but I could not find definitive documentation on that.

This is a good point: if records with the same graphId but different correlation IDs arrive interleaved, it will mess up the cache. I need to build a good repro for this and figure out how to handle it.

@bmilanich
Contributor Author

So it seems to be impossible to produce interleaved records from the same graph handle launches. This is actually documented here: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH_1g1920584881db959c8c74130d79019b73

Description
Executes graphExec in stream. Only one instance of graphExec may be executing at a time. Each launch is ordered behind both any previous work in stream and any previous launches of graphExec. To execute a graph concurrently, it must be instantiated multiple times into multiple executable graphs.

@slomp
Collaborator

slomp commented Apr 9, 2026

So it seems to be impossible to produce interleaved records from the same graph handle launches. This is actually documented here: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH_1g1920584881db959c8c74130d79019b73

Description
Executes graphExec in stream. Only one instance of graphExec may be executing at a time. Each launch is ordered behind both any previous work in stream and any previous launches of graphExec. To execute a graph concurrently, it must be instantiated multiple times into multiple executable graphs.

Great, thanks for digging!

Collaborator

@slomp slomp left a comment


Doing a first pass of the code here.
I'll give it a run on some workloads next week.

return false;
}
} else if (graphId != 0) {
graphLaunchCache.insert_or_assign(graphId, apiCallInfo);
Collaborator


Is insert_or_assign really necessary (as opposed to just using operator[])?

Contributor Author


operator[] grabs the read lock and then returns a plain reference; any assignment through that reference happens with no lock held. insert_or_assign() grabs the write lock and updates under it, so it's safe.

Collaborator

@slomp slomp Apr 14, 2026


Sorry, I meant to say emplace, not operator[].

If I am following the logic correctly, whatever is placed in graphLaunchCache stays there until it gets overwritten, so emplace would just keep remembering the very first insertion, whereas here you want to enforce that it gets rewritten, since it could be a different launch of the same graph.

(I wonder if there's a CUPTI activity record that tells us when a graph launch has started and when it has ended.)

Collaborator


It looks like there's this:
https://docs.nvidia.com/cupti/api/structCUpti__ActivityGraphTrace2.html

It will be posted in the activity record queue after all graph nodes have executed, so you could use that to "cleanup" the entry from the map, I suppose.

Contributor Author


Sorry, I meant to say emplace, not operator[].

If I am following the logic correctly, whatever is placed in graphLaunchCache stays there until it gets overwritten, so emplace would just keep remembering the very first insertion, whereas here you want to enforce that it gets rewritten, since it could be a different launch of the same graph.

Yes, but the graphLaunchCache is indexed by the graphId, which remains the same between different cuGraphLaunch() executions, so the CPU correlation would point to the very first launch of this graph.
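To illustrate the emplace vs insert_or_assign distinction being discussed, here is a small sketch on a plain std::unordered_map (the real map is a locked wrapper; `LaunchInfo` is a stand-in type):

```cpp
#include <cstdint>
#include <unordered_map>

using GraphID = uint32_t;
using LaunchInfo = int;  // stand-in for the cached call-site payload

// emplace keeps the first value for a key; insert_or_assign overwrites it.
// The launch cache needs the latter, since each relaunch of a graph carries
// fresh CPU call-site info under the same graphId.
LaunchInfo cacheAndRead( std::unordered_map<GraphID, LaunchInfo>& cache,
                         GraphID id, LaunchInfo info, bool overwrite )
{
    if( overwrite ) cache.insert_or_assign( id, info );
    else cache.emplace( id, info );  // silently keeps any existing entry
    return cache.at( id );
}
```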

(I wonder if there's a CPUTI activity record that tells us when a graph launch has started and when it has ended.)

There is and I explored it but it's only tracing the whole graph execution and suppresses individual kernel activities as stated in NV's doc:

This activity record represents execution for a graph without giving visibility about the execution of its nodes. This is intended to reduce overheads in tracing each node. The activity kind is CUPTI_ACTIVITY_KIND_GRAPH_TRACE

Thus the comments at https://github.com/wolfpld/tracy/pull/1330/changes#diff-f80a13989fdff31b7a1416dbe2081b9aa8d022005f2b1a2b8dc859be3c9f33e7R1244

Collaborator


Hmmm, yeah, what a bizarre choice (or overlook) from NVIDIA...

bmilanich and others added 4 commits April 13, 2026 10:23
Without retirement, the cache grows by one entry per unique exec handle
ever launched and never shrinks. While bounded by the number of distinct
execs in the application, long-running programs creating and destroying
many exec handles accumulate stale entries indefinitely.

Retirement mechanism:
- At cudaGraphExecDestroy (ENTER, while handle is still valid): call
  cuptiGetGraphExecId to translate exec handle → graphId and add to a
  pending-retirement set. Works for both runtime (cudaGraphExecDestroy)
  and driver (cuGraphExecDestroy) APIs. No new subscription needed —
  the existing cuptiEnableDomain already routes all API callbacks here.

- Deferral in OnBufferCompleted: erasure is not done immediately because
  cudaGraphExecDestroy does not wait for GPU completion. CUPTI may still
  have undelivered activity records for the last launch in its internal
  buffers. We defer the erase until a full buffer arrives that contains
  no records bearing the retired graphId, indicating all in-flight
  records have been delivered.

- getGraphIdFromRecord: new helper that extracts the graphId field from
  CONCURRENT_KERNEL / MEMCPY / MEMSET activity records (the three kinds
  that carry a graphId) for use in the per-buffer tracking.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cudaMalloc/cudaFree (and driver equivalents) were tracked in
cbidRuntimeTrackers/cbidDriverTrackers, creating a cudaCallSiteInfo
entry on each API call. But the MEMORY2 handler never calls
matchActivityToAPICall (and never calls EmitGpuZone) — it only needs
the address, size, and timestamp from the activity record itself. Since
no activity handler consumes these entries, they leaked indefinitely.

Remove the 6 memory API CBIDs from both tracker maps so no entry is
created. This eliminates the leak with no change in visible behavior:
the MEMORY2 handler already operates independently of cudaCallSiteInfo.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add `using GraphID = uint32_t` typedef and use it throughout for
  graphId-typed variables (PersistentState, matchGraphActivityToAPICall,
  getGraphIdFromRecord, retirement set, buffer loop).

- Move matchError from matchGraphActivityToAPICall to caller sites
  (KERNEL, MEMCPY, MEMSET handlers). Keeping the error at the caller
  provides more debugging context about which activity kind failed.
  Remove the now-unnecessary `kind` parameter from the function.

- Replace insert_or_assign with operator[] assignment in
  matchGraphActivityToAPICall. Access to graphLaunchCache is
  single-threaded (CUPTI worker), so the simpler syntax is sufficient.
  Remove the insert_or_assign method from ConcurrentHashMap entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
operator[] on ConcurrentHashMap returns a reference after releasing the
read lock — the subsequent assignment happens with no lock held. This is
a latent data race if the map is ever accessed from multiple threads.

insert_or_assign performs the lookup and assignment atomically under a
single write lock, which is the correct pattern for a ConcurrentHashMap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bmilanich and others added 2 commits April 13, 2026 10:23
Replace the mutex-guarded empty check in OnBufferCompleted with an
std::atomic<bool> dirty flag. The mutex is now only acquired when
there is actual retirement work to do. Also update stale comment
on cudaGraphCurrentLaunch that said "let them leak".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The comment still described consuming/not-consuming cudaCallSiteInfo
entries, but memory CBIDs are no longer tracked so no entry exists.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bmilanich bmilanich requested a review from slomp April 13, 2026 16:51
@slomp
Collaborator

slomp commented Apr 15, 2026

@wolfpld Are you OK with a new example being added with this PR?
Other than that, LGTM

@wolfpld
Owner

wolfpld commented Apr 15, 2026

Are you OK with a new example being added with this PR?

Well, these are not really examples, but rather repro cases.

I don't really like the specific wording around them, or that these are AI generated.

On the other hand, I don't really see a viable way to provide these repro cases in some other way that would be convenient. And the actual programs do show where the problems are.

@wolfpld wolfpld merged commit ebf3f02 into wolfpld:master Apr 15, 2026
7 checks passed
@slomp
Collaborator

slomp commented Apr 15, 2026

@wolfpld I'll add to my bucket list a task to consolidate this CUDA example with other CUDA code I use here for testing, and release it as a test instead.
