TracyCUDA: show GPU zones for CUDA Graph-launched kernels #1330
wolfpld merged 21 commits into wolfpld:master
Conversation
When kernels are launched via CUDA Graphs (cuGraphLaunch), CUPTI delivers CONCURRENT_KERNEL, MEMCPY, and MEMSET activity records but no corresponding API callback fires for the individual operations. This means matchActivityToAPICall() always fails, and every GPU activity record is silently dropped by matchError(). Fix this by falling back to a synthetic APICallInfo using the GPU timestamps from the activity record when no API correlation exists. This produces correct GPU zones with kernel names and timing — just without the CPU-to-GPU launch correlation arrow. Tested on NVIDIA H100 with CUDA 13.1: before this fix, 0 GPU zones appeared for CUDA Graph workloads; after, all kernel and memcpy zones are visible in the Tracy timeline.
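The fallback described here (later superseded in this PR by graphId-based correlation) can be sketched roughly as follows. The struct shapes and the `matchOrSynthesize` name are hypothetical simplified stand-ins, not Tracy's actual definitions in TracyCUDA.hpp:

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-ins, loosely modeled on the names in this PR.
struct APICallInfo { uint64_t startNs; uint64_t endNs; bool synthetic; };
struct ActivityRecord { uint32_t correlationId; uint64_t gpuStartNs; uint64_t gpuEndNs; };

// Stand-in for cudaCallSiteInfo: API callbacks recorded by correlationId.
std::unordered_map<uint32_t, APICallInfo> callSiteInfo;

// When no API callback matches the activity's correlationId (the CUDA Graph
// case), fall back to a synthetic APICallInfo built from the GPU timestamps
// so the zone is emitted instead of being dropped by matchError().
APICallInfo matchOrSynthesize(const ActivityRecord& rec)
{
    auto it = callSiteInfo.find(rec.correlationId);
    if (it != callSiteInfo.end()) return it->second;  // normal correlation
    return APICallInfo{ rec.gpuStartNs, rec.gpuEndNs, /*synthetic=*/true };
}
```

The `synthetic` flag marks why the CPU-to-GPU launch correlation arrow is lost: the zone's timing comes entirely from the GPU side.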
Minimal reproducer showing that CUDA Graph-launched kernels produce 0 GPU zones in Tracy. The repro creates a simple graph (2 kernels + 1 memcpy), launches it 10 times, and expects ~30 GPU zones. Without the fallback patch, all activity records are dropped by matchError(). Tested on NVIDIA H100, CUDA 13.1.
Replace the synthetic APICallInfo hack with proper correlation via CUPTI_ACTIVITY_KIND_GRAPH_TRACE. When cuGraphLaunch fires an API callback, its correlationId is stored in cudaCallSiteInfo. The GRAPH_TRACE activity record carries the same correlationId plus the graphId, which lets us build a graphId→APICallInfo map. Kernel/memcpy/memset activities then look up this map via their graphId field.

Key changes:
- Add cuGraphLaunch/cuGraphLaunch_ptsz to cbidDriverTrackers so the API callback machinery captures the CPU call site
- Enable CUPTI_ACTIVITY_KIND_GRAPH_TRACE and handle it in DoProcessDeviceEvent to populate cudaGraphCurrentLaunch[graphId]
- Add cudaGraphCurrentLaunch map to PersistentState
- Two-pass buffer processing in OnBufferCompleted so GRAPH_TRACE records (which complete last on GPU) are processed before the kernel/memcpy/memset records that depend on them
- Replace graphId=0 fallback in kernel/memcpy/memset with proper cudaGraphCurrentLaunch lookup; fall through to matchError if the graphId is not found
- Update repro to include TracyCUDA headers and properly test GPU zone correlation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The repro uses cudaGraphLaunch (runtime API) not cuGraphLaunch (driver API). Add cudaGraphLaunch_v10000 and its _ptsz variant to cbidRuntimeTrackers so that graphs launched via the runtime API also get their CPU call site captured for GPU zone correlation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CUPTI discovery: all kernels launched by one cuGraphLaunch share the same correlationId as the launch call itself. GRAPH_TRACE was the wrong approach — enabling it suppresses per-kernel CONCURRENT_KERNEL records entirely, replacing them with graph-level summaries.

New approach:
- Drop CUPTI_ACTIVITY_KIND_GRAPH_TRACE (it conflicts with CONCURRENT_KERNEL)
- Drop two-pass buffer processing (no longer needed)
- On the first kernel/memcpy/memset from a graph launch, matchActivityToAPICall succeeds (consuming the cuGraphLaunch entry) and the result is cached in cudaGraphCurrentLaunch[graphId]
- Subsequent operations from the same launch find the cached entry via graphId

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
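The caching scheme in this commit can be sketched like this, using plain `std::unordered_map` stand-ins for Tracy's concurrent maps and a hypothetical one-field `APICallInfo`:

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

struct APICallInfo { uint64_t cpuCallSite; };  // hypothetical payload

std::unordered_map<uint32_t, APICallInfo> cudaCallSiteInfo;       // correlationId -> launch call site
std::unordered_map<uint32_t, APICallInfo> cudaGraphCurrentLaunch; // graphId -> cached launch info

// The first activity from a graph launch consumes the cuGraphLaunch entry
// (all nodes share its correlationId) and caches it under the graphId; later
// activities from the same launch hit the cache. A miss on both maps falls
// through to matchError in the real code.
std::optional<APICallInfo> matchGraphActivity(uint32_t correlationId, uint32_t graphId)
{
    auto it = cudaCallSiteInfo.find(correlationId);
    if (it != cudaCallSiteInfo.end()) {
        APICallInfo info = it->second;
        cudaCallSiteInfo.erase(it);             // consume the launch entry
        cudaGraphCurrentLaunch[graphId] = info; // refresh the per-graph cache
        return info;
    }
    auto cached = cudaGraphCurrentLaunch.find(graphId);
    if (cached != cudaGraphCurrentLaunch.end()) return cached->second;
    return std::nullopt;
}
```

Because each new launch of the same graph produces a fresh correlationId, the first activity of every launch refreshes the cache before any later node needs it.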
Tests:
- Two distinct graphs (different graphIds) on the same stream
- Graph A: kernel + memcpy + kernel (3 nodes)
- Graph B: scale + add + scale (3 nodes)
- 5 interleaved launches of each, stressing the graphId cache
- Expected 30 graph GPU zones total

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The kernel, memcpy, and memset cases all had identical logic for handling graph-launched activities. Extract it into a single helper next to matchActivityToAPICall. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NVCC 13.1 defaults to a PTX version incompatible with the installed driver (580.105.08), causing kernels to silently fail with "provided PTX was compiled with an unsupported toolchain". Use -arch=native so NVCC auto-detects the target GPU (H100, sm_90) at build time. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The cudaGraphCurrentLaunch cache update was acquiring the write lock twice (once for erase, once for emplace). Wrapping std::unordered_map::insert_or_assign under a single write lock lets the caller do it in one operation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…helper

- Fix wrong comments on graph launch tracker entries: they claimed correlation works "via CUPTI_ACTIVITY_KIND_GRAPH_TRACE", but that approach was rejected (GRAPH_TRACE suppresses per-kernel records). The actual mechanism is the shared correlationId across all nodes in one graph launch.
- Fix ConcurrentHashMap::fetch() missing its read lock — a pre-existing data race now exercised by the new graph correlation hot path.
- Cache PersistentState::Get().cudaGraphCurrentLaunch in a local ref inside matchGraphActivityToAPICall instead of calling Get() twice.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Please review if these apply:
CUpti_ActivityMemory3 has no graphId field, so matchGraphActivityToAPICall cannot be applied. Graph-launched cudaGraphAddMemAllocNode emits multiple MEMORY2 records sharing the launch correlationId; only the first is tracked, subsequent ones fire a spurious matchError. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CUpti_ActivityMemory3 has no graphId field, so graph-launched alloc nodes and pre-profiling allocations can't be correlated to an API call. The handler only needs address, size, and timestamp from the activity record — apiCall is never used. Remove the early return so memory tracking works in all cases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
So I had a separate patch for memory tracking with graphs, I can add it here just as well: there is no
It does not seem to be an issue as the new graph with the same
Calling matchActivityToAPICall in the MEMORY2 handler was consuming the cudaCallSiteInfo entry for the graph launch correlationId. If a graph mixes alloc nodes with kernel/memcpy nodes, all activities share the same correlationId — consuming it here would cause matchGraphActivityToAPICall to fail for the kernel/memcpy records that follow, silently dropping their GPU zones. Since apiCall is never used by the MEMORY2 handler (only address, size, and timestamp from the activity record are needed), remove the call entirely and leave the entry for the kernel/memcpy to consume and cache. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
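A toy illustration of the hazard this commit removes, using an `int` stand-in for the call-site payload; the handler names are illustrative, not Tracy's actual functions:

```cpp
#include <cstdint>
#include <unordered_map>

std::unordered_map<uint32_t, int> callSiteInfo;  // correlationId -> call site (int stand-in)

// Bad: consumes the shared entry that a later kernel/memcpy record (same
// correlationId, since all graph nodes share the launch's id) still needs.
bool memory2HandlerConsuming(uint32_t corrId) { return callSiteInfo.erase(corrId) != 0; }

// Good: the MEMORY2 handler ignores callSiteInfo entirely; address, size,
// and timestamp come from the activity record itself.
void memory2HandlerNonConsuming(uint32_t /*corrId*/) {}

// The kernel handler that runs afterwards must still find the entry.
bool kernelHandlerFindsEntry(uint32_t corrId) { return callSiteInfo.count(corrId) != 0; }
```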
@bmilanich Thank you for the PR! Quick hypothetical question:
Tests two questions:
1. Does relaunching the same cudaGraphExec produce a new correlationId
each time, or is it reused?
2. Do two different cudaGraphExec handles from the same cudaGraph share
a graphId?
Results on H100, CUDA 13.1:
- Each launch of the same exec handle gets a strictly unique, monotonically
increasing correlationId. CPU callback corrId == GPU activity corrId.
This is formally documented in cupti_activity.h:
"Each graph launch is assigned a unique correlation ID that is
identical to the correlation ID in the driver API activity record
that launched the graph."
- graphId identifies the exec handle (instantiation), not the graph
definition. Two cudaGraphInstantiate calls on the same graph produce
different graphIds.
These findings confirm that the cudaGraphCurrentLaunch cache in
matchGraphActivityToAPICall is always refreshed by the first activity
of each new launch before the graphId fallback path is ever used.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests whether CUPTI recycles graphId values after cudaGraphExecDestroy, which would be the only scenario where the graphLaunchCache in TracyCUDA could serve stale entries for a non-matching exec handle. Result (H100, CUDA 12, CUPTI): graphId is a monotonically increasing counter that is never recycled. 22 create/instantiate/launch/destroy cycles produced unique IDs ranging from 2 to 65 (incrementing by 3 per cycle — one unit per node created during graph construction). This confirms that the stale-cache concern raised in code review is not a real risk in practice: two distinct exec handles always have distinct graphIds. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This is a good point: if records with the same graphId but different correlation IDs arrive interleaved, it will mess up the cache. I need to make a good repro for this and figure out how to handle that.
So it seems to be impossible to produce interleaved records from launches of the same graph handle. This is actually documented here: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH_1g1920584881db959c8c74130d79019b73
Great, thanks for digging!
slomp
left a comment
Doing a first pass of the code here.
I'll give it a run on some workloads next week.
```cpp
        return false;
    }
} else if (graphId != 0) {
    graphLaunchCache.insert_or_assign(graphId, apiCallInfo);
```
Is insert_or_assign really necessary (as opposed to just using operator[])?
operator[] grabs the read lock and then returns a plain reference, which, if assigned to, is written without any locking. insert_or_assign() grabs the write lock and, if necessary, updates under the lock, so it's safe.
Sorry, I meant to say emplace, not operator[].
If I am following the logic correctly, whatever is placed in graphLaunchCache stays there until it gets overwritten, so emplace would just keep remembering the very first insertion, whereas here you want to enforce that it gets rewritten, since it could be a different launch of the same graph.
(I wonder if there's a CUPTI activity record that tells us when a graph launch has started and when it has ended.)
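The emplace-vs-overwrite distinction the review raises can be shown with a plain `std::unordered_map`; the key 7 and the "launch-N" strings are just placeholders:

```cpp
#include <string>
#include <unordered_map>

// emplace keeps the first value ever stored for a key, while insert_or_assign
// overwrites it, which is what a "current launch" cache needs since the same
// graphId recurs across launches of the same graph.
std::string afterEmplaceTwice() {
    std::unordered_map<int, std::string> cache;
    cache.emplace(7, "launch-1");
    cache.emplace(7, "launch-2");           // no effect: key 7 already present
    return cache.at(7);
}

std::string afterInsertOrAssign() {
    std::unordered_map<int, std::string> cache;
    cache.emplace(7, "launch-1");
    cache.insert_or_assign(7, "launch-2");  // overwrites the stale entry
    return cache.at(7);
}
```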
It looks like there's this:
https://docs.nvidia.com/cupti/api/structCUpti__ActivityGraphTrace2.html
It will be posted in the activity record queue after all graph nodes have executed, so you could use that to "cleanup" the entry from the map, I suppose.
Sorry, I meant to say emplace, not operator[]. If I am following the logic correctly, whatever is placed in graphLaunchCache stays there until it gets overwritten, so emplace would just keep remembering the very first insertion, whereas here you want to enforce that it gets rewritten, since it could be a different launch of the same graph.
Yes, but the graphLaunchCache is indexed by the graphId, which remains the same between different cuGraphLaunch() executions, so with emplace the CPU correlation would point to the very first launch of this graph.
(I wonder if there's a CUPTI activity record that tells us when a graph launch has started and when it has ended.)
There is and I explored it but it's only tracing the whole graph execution and suppresses individual kernel activities as stated in NV's doc:
This activity record represents execution for a graph without giving visibility about the execution of its nodes. This is intended to reduce overheads in tracing each node. The activity kind is CUPTI_ACTIVITY_KIND_GRAPH_TRACE
Thus the comments at https://github.com/wolfpld/tracy/pull/1330/changes#diff-f80a13989fdff31b7a1416dbe2081b9aa8d022005f2b1a2b8dc859be3c9f33e7R1244
Hmmm, yeah, what a bizarre choice (or overlook) from NVIDIA...
Without retirement, the cache grows by one entry per unique exec handle ever launched and never shrinks. While bounded by the number of distinct execs in the application, long-running programs creating and destroying many exec handles accumulate stale entries indefinitely.

Retirement mechanism:
- At cudaGraphExecDestroy (ENTER, while handle is still valid): call cuptiGetGraphExecId to translate exec handle → graphId and add to a pending-retirement set. Works for both runtime (cudaGraphExecDestroy) and driver (cuGraphExecDestroy) APIs. No new subscription needed — the existing cuptiEnableDomain already routes all API callbacks here.
- Deferral in OnBufferCompleted: erasure is not done immediately because cudaGraphExecDestroy does not wait for GPU completion. CUPTI may still have undelivered activity records for the last launch in its internal buffers. We defer the erase until a full buffer arrives that contains no records bearing the retired graphId, indicating all in-flight records have been delivered.
- getGraphIdFromRecord: new helper that extracts the graphId field from CONCURRENT_KERNEL / MEMCPY / MEMSET activity records (the three kinds that carry a graphId) for use in the per-buffer tracking.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
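A simplified sketch of the deferred-retirement idea, with an `int` standing in for the cached launch info and plain containers instead of Tracy's concurrent ones:

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using GraphID = uint32_t;

std::unordered_map<GraphID, int> graphLaunchCache;  // graphId -> cached info (int stand-in)
std::unordered_set<GraphID> pendingRetire;          // queued at exec-destroy time

void onExecDestroy(GraphID id) { pendingRetire.insert(id); }

// A retired graphId is only evicted once a completed buffer contains no
// records bearing it, meaning all in-flight records have been delivered.
void onBufferCompleted(const std::vector<GraphID>& graphIdsInBuffer)
{
    std::unordered_set<GraphID> seen(graphIdsInBuffer.begin(), graphIdsInBuffer.end());
    for (auto it = pendingRetire.begin(); it != pendingRetire.end(); ) {
        if (seen.count(*it)) {
            ++it;                         // records still in flight: defer
        } else {
            graphLaunchCache.erase(*it);  // drained: safe to evict
            it = pendingRetire.erase(it);
        }
    }
}
```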
cudaMalloc/cudaFree (and driver equivalents) were tracked in cbidRuntimeTrackers/cbidDriverTrackers, creating a cudaCallSiteInfo entry on each API call. But the MEMORY2 handler never calls matchActivityToAPICall (and never calls EmitGpuZone) — it only needs the address, size, and timestamp from the activity record itself. Since no activity handler consumes these entries, they leaked indefinitely. Remove the 6 memory API CBIDs from both tracker maps so no entry is created. This eliminates the leak with no change in visible behavior: the MEMORY2 handler already operates independently of cudaCallSiteInfo. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add `using GraphID = uint32_t` typedef and use it throughout for graphId-typed variables (PersistentState, matchGraphActivityToAPICall, getGraphIdFromRecord, retirement set, buffer loop).
- Move matchError from matchGraphActivityToAPICall to caller sites (KERNEL, MEMCPY, MEMSET handlers). Keeping the error at the caller provides more debugging context about which activity kind failed. Remove the now-unnecessary `kind` parameter from the function.
- Replace insert_or_assign with operator[] assignment in matchGraphActivityToAPICall. Access to graphLaunchCache is single-threaded (CUPTI worker), so the simpler syntax is sufficient. Remove the insert_or_assign method from ConcurrentHashMap entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
operator[] on ConcurrentHashMap returns a reference after releasing the read lock — the subsequent assignment happens with no lock held. This is a latent data race if the map is ever accessed from multiple threads. insert_or_assign performs the lookup and assignment atomically under a single write lock, which is the correct pattern for a ConcurrentHashMap. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the mutex-guarded empty check in OnBufferCompleted with an std::atomic<bool> dirty flag. The mutex is now only acquired when there is actual retirement work to do. Also update stale comment on cudaGraphCurrentLaunch that said "let them leak". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
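A sketch of the dirty-flag pattern this commit describes, with an instrumentation counter added to show that the hot path never takes the mutex when nothing is pending; the names are illustrative, not Tracy's exact members:

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <unordered_set>

std::atomic<bool> graphRetirePending{ false };
std::mutex retireMutex;
std::unordered_set<uint32_t> pendingRetire;
int bufferPathLocks = 0;  // counts mutex acquisitions on the buffer path

void queueRetirement(uint32_t graphId)
{
    std::lock_guard<std::mutex> lock(retireMutex);
    pendingRetire.insert(graphId);
    graphRetirePending.store(true, std::memory_order_release);
}

void onBufferCompleted()
{
    // Hot path: a single atomic load, no mutex, when nothing is queued.
    if (!graphRetirePending.load(std::memory_order_acquire)) return;
    std::lock_guard<std::mutex> lock(retireMutex);
    ++bufferPathLocks;
    pendingRetire.clear();  // stand-in for the actual retirement work
    graphRetirePending.store(false, std::memory_order_release);
}
```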
The comment still described consuming/not-consuming cudaCallSiteInfo entries, but memory CBIDs are no longer tracked so no entry exists. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@wolfpld Are you OK with a new
Well, these are not really examples, but rather repro cases. I don't really like the specific wording around them, or that these are AI generated. On the other hand, I don't really see a viable way to provide these repro cases in some other way that would be convenient. And the actual programs do show where the problems are.
@wolfpld I'll add to my bucket list a task to consolidate this CUDA

Problem
When kernels are launched via CUDA Graphs (`cudaGraphLaunch`/`cuGraphLaunch`), Tracy shows 0 GPU zones. CUPTI delivers `CONCURRENT_KERNEL` and `MEMCPY` activity records but no per-kernel API callback fires, so `matchActivityToAPICall()` always fails and every zone is silently dropped.

Root cause
Each activity record has a `correlationId` that Tracy matches against API callbacks stored in `cudaCallSiteInfo`. For CUDA Graph launches there are no individual-kernel callbacks, so the map has no entries for those correlation IDs.

Key CUPTI finding (discovered during implementation)
All kernels in one `cuGraphLaunch` call share the same `correlationId` as the launch itself. This was confirmed by instrumenting both the API callback and the activity buffer simultaneously. This means the fix is straightforward once `cuGraphLaunch` is tracked in the API callback machinery.

Note: `CUPTI_ACTIVITY_KIND_GRAPH_TRACE` was investigated but must not be enabled — it suppresses individual `CONCURRENT_KERNEL` records for graph-launched kernels, replacing them with graph-level summaries.

Fix
Core graph correlation
- Add `cudaGraphLaunch_v10000`/`cuGraphLaunch` (both runtime and driver API variants) to the existing callback tracker maps so their CPU call site is captured in `cudaCallSiteInfo`
- New `matchGraphActivityToAPICall()` helper (analogous to `matchActivityToAPICall()`):
  - On the first activity of a launch, `matchActivityToAPICall` succeeds normally (consuming the launch entry) and the result is cached in `cudaGraphCurrentLaunch[graphId]`
  - Subsequent operations from the same launch (same `correlationId`, same `graphId`) find the cached `APICallInfo` via `graphId`
  - The cache update uses `insert_or_assign`, which acquires the write lock only once
  - If no cached entry exists, fall through to `matchError()` in the profiler timeline (same behaviour as for non-graph activity)
- `using GraphID = uint32_t` typedef used throughout for graphId-typed variables

Cache retirement (prevents unbounded growth)
- At `cudaGraphExecDestroy` (ENTER callback, while handle is still valid): call `cuptiGetGraphExecId` to translate exec handle → graphId and add it to a pending-retirement set (`graphExecPendingRetire`)
- Erasure is deferred to `OnBufferCompleted` because `cudaGraphExecDestroy` doesn't wait for GPU completion — CUPTI may still have undelivered activity records in its buffers
- New `getGraphIdFromRecord()` helper extracts the graphId field from `CONCURRENT_KERNEL`/`MEMCPY`/`MEMSET` activity records for per-buffer tracking
- An `std::atomic<bool> graphRetirePending` dirty flag lets `OnBufferCompleted` skip the retirement mutex on the hot path (no pending retirements = no lock acquired)

ConcurrentHashMap fixes
- `ConcurrentHashMap`: added `insert_or_assign` wrapping `std::unordered_map::insert_or_assign` under a single write lock
- `ConcurrentHashMap::fetch()`: fixed a pre-existing missing read lock — `mapping.find()` was called without holding any lock, a data race now exercised on every graph activity record

Memory tracking cleanup
- Removed `cudaMalloc`/`cudaFree`/`cuMemAlloc_v2`/`cuMemFree_v2` (6 CBIDs total) from `cbidRuntimeTrackers`/`cbidDriverTrackers` — the MEMORY2 handler never calls `matchActivityToAPICall` and never emits GPU zones, so these entries leaked indefinitely
- The MEMORY2 handler no longer calls `matchActivityToAPICall` — `CUpti_ActivityMemory3` has no `graphId` field, so graph-launched alloc nodes (`cudaGraphAddMemAllocNode`) and pre-profiling allocations can never be correlated to an API call. The handler only needs address, size, and timestamp from the activity record

Other
- `CUPTI_ACTIVITY_KIND_SYNCHRONIZATION` handler: confirmed correct as-is. In-graph event record/wait nodes produce no `SYNCHRONIZATION` activity records (empirically verified on H100)
- `examples/CUDAGraphRepro/Makefile`: added `-arch=native` so NVCC auto-detects the target GPU architecture

Results
Tested on H100, CUDA 13.1, driver 580.105.08.
Single graph (10 launches × 3 nodes):
- 30 GPU zones (20 `vector_add` + 10 `CUDA::memcpy`)

Multiple graphs, interleaved launches — two distinct graphs launched alternately (A, B, A, B, …) to stress graphId cache switching:
- Graph A: `kernel(add)` + `memcpy` + `kernel(add)` — 3 nodes, 5 launches = 15 zones
- Graph B: `kernel(scale)` + `kernel(add)` + `kernel(scale)` — 3 nodes, 5 launches = 15 zones
- (flattened results table listing the observed zones `vector_add`, `vector_scale`, `CUDA::memcpy` (graph) and the `c[0]` correctness check)

Regular (non-graph) CUDA operations — verified that memory CBID tracker removal doesn't regress normal profiling:
GPU zones are correctly correlated to the `cuGraphLaunch` CPU call site: selecting a GPU zone highlights the corresponding CPU time range in the timeline.

Debug build of `tracy-capture` (asserts enabled) confirmed no assertions fire at `server/TracyWorker.cpp:5988`.

Note on GPU zone source locations
All CUDA Graph GPU zones point to `TracyCUDA.hpp` rather than the user's call site. This is the same behaviour as regular CUDA zones — CUPTI delivers activity records asynchronously in a background thread that has no knowledge of the original call stack.

Files changed
- `public/tracy/TracyCUDA.hpp` — core fix
- `examples/CUDAGraphRepro/repro.cu` — updated to use TracyCUDA headers and verify zones
- `examples/CUDAGraphRepro/Makefile` — release and debug variants; `-arch=native` for GPU arch detection

🤖 Generated with Claude Code