TracyCUDA: show GPU zones for CUDA Graph-launched kernels #1330
wolfpld merged 21 commits into wolfpld:master
Conversation
When kernels are launched via CUDA Graphs (cuGraphLaunch), CUPTI delivers CONCURRENT_KERNEL, MEMCPY, and MEMSET activity records but no corresponding API callback fires for the individual operations. This means matchActivityToAPICall() always fails, and every GPU activity record is silently dropped by matchError(). Fix this by falling back to a synthetic APICallInfo using the GPU timestamps from the activity record when no API correlation exists. This produces correct GPU zones with kernel names and timing — just without the CPU-to-GPU launch correlation arrow. Tested on NVIDIA H100 with CUDA 13.1: before this fix, 0 GPU zones appeared for CUDA Graph workloads; after, all kernel and memcpy zones are visible in the Tracy timeline.
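The fallback described here (later superseded in this PR by graphId-based correlation) can be sketched roughly as follows. The struct shapes and the `matchOrSynthesize` name are hypothetical simplified stand-ins, not Tracy's actual definitions in TracyCUDA.hpp:

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-ins, loosely modeled on the names in this PR.
struct APICallInfo { uint64_t startNs; uint64_t endNs; bool synthetic; };
struct ActivityRecord { uint32_t correlationId; uint64_t gpuStartNs; uint64_t gpuEndNs; };

// Stand-in for cudaCallSiteInfo: API callbacks recorded by correlationId.
std::unordered_map<uint32_t, APICallInfo> callSiteInfo;

// When no API callback matches the activity's correlationId (the CUDA Graph
// case), fall back to a synthetic APICallInfo built from the GPU timestamps
// so the zone is emitted instead of being dropped by matchError().
APICallInfo matchOrSynthesize(const ActivityRecord& rec)
{
    auto it = callSiteInfo.find(rec.correlationId);
    if (it != callSiteInfo.end()) return it->second;  // normal correlation
    return APICallInfo{ rec.gpuStartNs, rec.gpuEndNs, /*synthetic=*/true };
}
```

The `synthetic` flag marks why the CPU-to-GPU launch correlation arrow is lost: the zone's timing comes entirely from the GPU side.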
Minimal reproducer showing that CUDA Graph-launched kernels produce 0 GPU zones in Tracy. The repro creates a simple graph (2 kernels + 1 memcpy), launches it 10 times, and expects ~30 GPU zones. Without the fallback patch, all activity records are dropped by matchError(). Tested on NVIDIA H100, CUDA 13.1.
Replace the synthetic APICallInfo hack with proper correlation via CUPTI_ACTIVITY_KIND_GRAPH_TRACE. When cuGraphLaunch fires an API callback, its correlationId is stored in cudaCallSiteInfo. The GRAPH_TRACE activity record carries the same correlationId plus the graphId, which lets us build a graphId→APICallInfo map. Kernel/memcpy/memset activities then look up this map via their graphId field.

Key changes:
- Add cuGraphLaunch/cuGraphLaunch_ptsz to cbidDriverTrackers so the API callback machinery captures the CPU call site
- Enable CUPTI_ACTIVITY_KIND_GRAPH_TRACE and handle it in DoProcessDeviceEvent to populate cudaGraphCurrentLaunch[graphId]
- Add cudaGraphCurrentLaunch map to PersistentState
- Two-pass buffer processing in OnBufferCompleted so GRAPH_TRACE records (which complete last on GPU) are processed before the kernel/memcpy/memset records that depend on them
- Replace graphId=0 fallback in kernel/memcpy/memset with proper cudaGraphCurrentLaunch lookup; fall through to matchError if the graphId is not found
- Update repro to include TracyCUDA headers and properly test GPU zone correlation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The repro uses cudaGraphLaunch (runtime API) not cuGraphLaunch (driver API). Add cudaGraphLaunch_v10000 and its _ptsz variant to cbidRuntimeTrackers so that graphs launched via the runtime API also get their CPU call site captured for GPU zone correlation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CUPTI discovery: all kernels launched by one cuGraphLaunch share the same correlationId as the launch call itself. GRAPH_TRACE was the wrong approach — enabling it suppresses per-kernel CONCURRENT_KERNEL records entirely, replacing them with graph-level summaries.

New approach:
- Drop CUPTI_ACTIVITY_KIND_GRAPH_TRACE (it conflicts with CONCURRENT_KERNEL)
- Drop two-pass buffer processing (no longer needed)
- On the first kernel/memcpy/memset from a graph launch, matchActivityToAPICall succeeds (consuming the cuGraphLaunch entry) and the result is cached in cudaGraphCurrentLaunch[graphId]
- Subsequent operations from the same launch find the cached entry via graphId

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
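The caching scheme in this commit can be sketched like this, using plain `std::unordered_map` stand-ins for Tracy's concurrent maps and a hypothetical one-field `APICallInfo`:

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

struct APICallInfo { uint64_t cpuCallSite; };  // hypothetical payload

std::unordered_map<uint32_t, APICallInfo> cudaCallSiteInfo;       // correlationId -> launch call site
std::unordered_map<uint32_t, APICallInfo> cudaGraphCurrentLaunch; // graphId -> cached launch info

// The first activity from a graph launch consumes the cuGraphLaunch entry
// (all nodes share its correlationId) and caches it under the graphId; later
// activities from the same launch hit the cache. A miss on both maps falls
// through to matchError in the real code.
std::optional<APICallInfo> matchGraphActivity(uint32_t correlationId, uint32_t graphId)
{
    auto it = cudaCallSiteInfo.find(correlationId);
    if (it != cudaCallSiteInfo.end()) {
        APICallInfo info = it->second;
        cudaCallSiteInfo.erase(it);             // consume the launch entry
        cudaGraphCurrentLaunch[graphId] = info; // refresh the per-graph cache
        return info;
    }
    auto cached = cudaGraphCurrentLaunch.find(graphId);
    if (cached != cudaGraphCurrentLaunch.end()) return cached->second;
    return std::nullopt;
}
```

Because each new launch of the same graph produces a fresh correlationId, the first activity of every launch refreshes the cache before any later node needs it.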
Tests:
- Two distinct graphs (different graphIds) on the same stream
- Graph A: kernel + memcpy + kernel (3 nodes)
- Graph B: scale + add + scale (3 nodes)
- 5 interleaved launches of each, stressing the graphId cache
- Expected 30 graph GPU zones total

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The kernel, memcpy, and memset cases all had identical logic for handling graph-launched activities. Extract it into a single helper next to matchActivityToAPICall. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NVCC 13.1 defaults to a PTX version incompatible with the installed driver (580.105.08), causing kernels to silently fail with "provided PTX was compiled with an unsupported toolchain". Use -arch=native so NVCC auto-detects the target GPU (H100, sm_90) at build time. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The cudaGraphCurrentLaunch cache update was acquiring the write lock twice (once for erase, once for emplace). Wrapping std::unordered_map::insert_or_assign under a single write lock lets the caller do it in one operation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…helper

- Fix wrong comments on graph launch tracker entries: they claimed correlation works "via CUPTI_ACTIVITY_KIND_GRAPH_TRACE", but that approach was rejected (GRAPH_TRACE suppresses per-kernel records). The actual mechanism is the shared correlationId across all nodes in one graph launch.
- Fix ConcurrentHashMap::fetch() missing its read lock — a pre-existing data race now exercised by the new graph correlation hot path.
- Cache PersistentState::Get().cudaGraphCurrentLaunch in a local ref inside matchGraphActivityToAPICall instead of calling Get() twice.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Please review if these apply:
CUpti_ActivityMemory3 has no graphId field, so matchGraphActivityToAPICall cannot be applied. Graph-launched cudaGraphAddMemAllocNode emits multiple MEMORY2 records sharing the launch correlationId; only the first is tracked, subsequent ones fire a spurious matchError. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CUpti_ActivityMemory3 has no graphId field, so graph-launched alloc nodes and pre-profiling allocations can't be correlated to an API call. The handler only needs address, size, and timestamp from the activity record — apiCall is never used. Remove the early return so memory tracking works in all cases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
So I had a separate patch for memory tracking with graphs, I can add it here just as well: there is no
It does not seem to be an issue as the new graph with the same
Calling matchActivityToAPICall in the MEMORY2 handler was consuming the cudaCallSiteInfo entry for the graph launch correlationId. If a graph mixes alloc nodes with kernel/memcpy nodes, all activities share the same correlationId — consuming it here would cause matchGraphActivityToAPICall to fail for the kernel/memcpy records that follow, silently dropping their GPU zones. Since apiCall is never used by the MEMORY2 handler (only address, size, and timestamp from the activity record are needed), remove the call entirely and leave the entry for the kernel/memcpy to consume and cache. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
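A toy illustration of the hazard this commit removes, using an `int` stand-in for the call-site payload; the handler names are illustrative, not Tracy's actual functions:

```cpp
#include <cstdint>
#include <unordered_map>

std::unordered_map<uint32_t, int> callSiteInfo;  // correlationId -> call site (int stand-in)

// Bad: consumes the shared entry that a later kernel/memcpy record (same
// correlationId, since all graph nodes share the launch's id) still needs.
bool memory2HandlerConsuming(uint32_t corrId) { return callSiteInfo.erase(corrId) != 0; }

// Good: the MEMORY2 handler ignores callSiteInfo entirely; address, size,
// and timestamp come from the activity record itself.
void memory2HandlerNonConsuming(uint32_t /*corrId*/) {}

// The kernel handler that runs afterwards must still find the entry.
bool kernelHandlerFindsEntry(uint32_t corrId) { return callSiteInfo.count(corrId) != 0; }
```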
@bmilanich Thank you for the PR! Quick hypothetical question:
Tests two questions:
1. Does relaunching the same cudaGraphExec produce a new correlationId
each time, or is it reused?
2. Do two different cudaGraphExec handles from the same cudaGraph share
a graphId?
Results on H100, CUDA 13.1:
- Each launch of the same exec handle gets a strictly unique, monotonically
increasing correlationId. CPU callback corrId == GPU activity corrId.
This is formally documented in cupti_activity.h:
"Each graph launch is assigned a unique correlation ID that is
identical to the correlation ID in the driver API activity record
that launched the graph."
- graphId identifies the exec handle (instantiation), not the graph
definition. Two cudaGraphInstantiate calls on the same graph produce
different graphIds.
These findings confirm that the cudaGraphCurrentLaunch cache in
matchGraphActivityToAPICall is always refreshed by the first activity
of each new launch before the graphId fallback path is ever used.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests whether CUPTI recycles graphId values after cudaGraphExecDestroy, which would be the only scenario where the graphLaunchCache in TracyCUDA could serve stale entries for a non-matching exec handle. Result (H100, CUDA 12, CUPTI): graphId is a monotonically increasing counter that is never recycled. 22 create/instantiate/launch/destroy cycles produced unique IDs ranging from 2 to 65 (incrementing by 3 per cycle — one unit per node created during graph construction). This confirms that the stale-cache concern raised in code review is not a real risk in practice: two distinct exec handles always have distinct graphIds. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This is a good point: if records with the same graphId but different correlation IDs arrive interleaved, it will mess up the cache. I need to make a good repro for this and figure out how to handle that.
So it seems to be impossible to produce interleaved records from launches of the same graph handle. This is actually documented here: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH_1g1920584881db959c8c74130d79019b73
Great, thanks for digging!
slomp
left a comment
Doing a first pass of the code here.
I'll give it a run on some workloads next week.
```cpp
        return false;
    }
} else if (graphId != 0) {
    graphLaunchCache.insert_or_assign(graphId, apiCallInfo);
```
Is insert_or_assign really necessary (as opposed to just using operator[])?
operator[] grabs the read lock and then returns a plain reference, which, if assigned to, is written without any locking. insert_or_assign() grabs the write lock and, if necessary, updates under the lock, so it's safe.
Sorry, I meant to say emplace, not operator[].
If I am following the logic correctly, whatever is placed in graphLaunchCache stays there until it gets overwritten, so emplace would just keep remembering the very first insertion, whereas here you want to enforce that it gets rewritten, since it could be a different launch of the same graph.
(I wonder if there's a CUPTI activity record that tells us when a graph launch has started and when it has ended.)
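The emplace-vs-overwrite distinction the review raises can be shown with a plain `std::unordered_map`; the key 7 and the "launch-N" strings are just placeholders:

```cpp
#include <string>
#include <unordered_map>

// emplace keeps the first value ever stored for a key, while insert_or_assign
// overwrites it, which is what a "current launch" cache needs since the same
// graphId recurs across launches of the same graph.
std::string afterEmplaceTwice() {
    std::unordered_map<int, std::string> cache;
    cache.emplace(7, "launch-1");
    cache.emplace(7, "launch-2");           // no effect: key 7 already present
    return cache.at(7);
}

std::string afterInsertOrAssign() {
    std::unordered_map<int, std::string> cache;
    cache.emplace(7, "launch-1");
    cache.insert_or_assign(7, "launch-2");  // overwrites the stale entry
    return cache.at(7);
}
```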
It looks like there's this:
https://docs.nvidia.com/cupti/api/structCUpti__ActivityGraphTrace2.html
It will be posted in the activity record queue after all graph nodes have executed, so you could use that to "cleanup" the entry from the map, I suppose.
Sorry, I meant to say emplace, not operator[]. If I am following the logic correctly, whatever is placed in graphLaunchCache stays there until it gets overwritten, so emplace would just keep remembering the very first insertion, whereas here you want to enforce that it gets rewritten, since it could be a different launch of the same graph.
Yes, but the graphLaunchCache is indexed by the graphId, which remains the same between different cuGraphLaunch() executions, so with emplace the CPU correlation would point to the very first launch of this graph.
(I wonder if there's a CUPTI activity record that tells us when a graph launch has started and when it has ended.)
There is and I explored it but it's only tracing the whole graph execution and suppresses individual kernel activities as stated in NV's doc:
This activity record represents execution for a graph without giving visibility about the execution of its nodes. This is intended to reduce overheads in tracing each node. The activity kind is CUPTI_ACTIVITY_KIND_GRAPH_TRACE
Thus the comments at https://github.com/wolfpld/tracy/pull/1330/changes#diff-f80a13989fdff31b7a1416dbe2081b9aa8d022005f2b1a2b8dc859be3c9f33e7R1244
Hmmm, yeah, what a bizarre choice (or overlook) from NVIDIA...
Without retirement, the cache grows by one entry per unique exec handle ever launched and never shrinks. While bounded by the number of distinct execs in the application, long-running programs creating and destroying many exec handles accumulate stale entries indefinitely.

Retirement mechanism:
- At cudaGraphExecDestroy (ENTER, while handle is still valid): call cuptiGetGraphExecId to translate exec handle → graphId and add to a pending-retirement set. Works for both runtime (cudaGraphExecDestroy) and driver (cuGraphExecDestroy) APIs. No new subscription needed — the existing cuptiEnableDomain already routes all API callbacks here.
- Deferral in OnBufferCompleted: erasure is not done immediately because cudaGraphExecDestroy does not wait for GPU completion. CUPTI may still have undelivered activity records for the last launch in its internal buffers. We defer the erase until a full buffer arrives that contains no records bearing the retired graphId, indicating all in-flight records have been delivered.
- getGraphIdFromRecord: new helper that extracts the graphId field from CONCURRENT_KERNEL / MEMCPY / MEMSET activity records (the three kinds that carry a graphId) for use in the per-buffer tracking.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
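A simplified sketch of the deferred-retirement idea, with an `int` standing in for the cached launch info and plain containers instead of Tracy's concurrent ones:

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using GraphID = uint32_t;

std::unordered_map<GraphID, int> graphLaunchCache;  // graphId -> cached info (int stand-in)
std::unordered_set<GraphID> pendingRetire;          // queued at exec-destroy time

void onExecDestroy(GraphID id) { pendingRetire.insert(id); }

// A retired graphId is only evicted once a completed buffer contains no
// records bearing it, meaning all in-flight records have been delivered.
void onBufferCompleted(const std::vector<GraphID>& graphIdsInBuffer)
{
    std::unordered_set<GraphID> seen(graphIdsInBuffer.begin(), graphIdsInBuffer.end());
    for (auto it = pendingRetire.begin(); it != pendingRetire.end(); ) {
        if (seen.count(*it)) {
            ++it;                         // records still in flight: defer
        } else {
            graphLaunchCache.erase(*it);  // drained: safe to evict
            it = pendingRetire.erase(it);
        }
    }
}
```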
cudaMalloc/cudaFree (and driver equivalents) were tracked in cbidRuntimeTrackers/cbidDriverTrackers, creating a cudaCallSiteInfo entry on each API call. But the MEMORY2 handler never calls matchActivityToAPICall (and never calls EmitGpuZone) — it only needs the address, size, and timestamp from the activity record itself. Since no activity handler consumes these entries, they leaked indefinitely. Remove the 6 memory API CBIDs from both tracker maps so no entry is created. This eliminates the leak with no change in visible behavior: the MEMORY2 handler already operates independently of cudaCallSiteInfo. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add `using GraphID = uint32_t` typedef and use it throughout for graphId-typed variables (PersistentState, matchGraphActivityToAPICall, getGraphIdFromRecord, retirement set, buffer loop).
- Move matchError from matchGraphActivityToAPICall to caller sites (KERNEL, MEMCPY, MEMSET handlers). Keeping the error at the caller provides more debugging context about which activity kind failed. Remove the now-unnecessary `kind` parameter from the function.
- Replace insert_or_assign with operator[] assignment in matchGraphActivityToAPICall. Access to graphLaunchCache is single-threaded (CUPTI worker), so the simpler syntax is sufficient. Remove the insert_or_assign method from ConcurrentHashMap entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
operator[] on ConcurrentHashMap returns a reference after releasing the read lock — the subsequent assignment happens with no lock held. This is a latent data race if the map is ever accessed from multiple threads. insert_or_assign performs the lookup and assignment atomically under a single write lock, which is the correct pattern for a ConcurrentHashMap. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the mutex-guarded empty check in OnBufferCompleted with an std::atomic<bool> dirty flag. The mutex is now only acquired when there is actual retirement work to do. Also update stale comment on cudaGraphCurrentLaunch that said "let them leak". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
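A sketch of the dirty-flag pattern this commit describes, with an instrumentation counter added to show that the hot path never takes the mutex when nothing is pending; the names are illustrative, not Tracy's exact members:

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <unordered_set>

std::atomic<bool> graphRetirePending{ false };
std::mutex retireMutex;
std::unordered_set<uint32_t> pendingRetire;
int bufferPathLocks = 0;  // counts mutex acquisitions on the buffer path

void queueRetirement(uint32_t graphId)
{
    std::lock_guard<std::mutex> lock(retireMutex);
    pendingRetire.insert(graphId);
    graphRetirePending.store(true, std::memory_order_release);
}

void onBufferCompleted()
{
    // Hot path: a single atomic load, no mutex, when nothing is queued.
    if (!graphRetirePending.load(std::memory_order_acquire)) return;
    std::lock_guard<std::mutex> lock(retireMutex);
    ++bufferPathLocks;
    pendingRetire.clear();  // stand-in for the actual retirement work
    graphRetirePending.store(false, std::memory_order_release);
}
```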
The comment still described consuming/not-consuming cudaCallSiteInfo entries, but memory CBIDs are no longer tracked so no entry exists. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@wolfpld Are you OK with a new
Well, these are not really examples, but rather repro cases. I don't really like the specific wording around them, or that these are AI generated. On the other hand, I don't really see a viable way to provide these repro cases in some other way that would be convenient. And the actual programs do show where the problems are.
@wolfpld I'll add to my bucket list a task to consolidate this CUDA

Problem
When kernels are launched via CUDA Graphs (`cudaGraphLaunch`/`cuGraphLaunch`), Tracy shows 0 GPU zones. CUPTI delivers `CONCURRENT_KERNEL` and `MEMCPY` activity records but no per-kernel API callback fires, so `matchActivityToAPICall()` always fails and every zone is silently dropped.

Root cause
Each activity record has a `correlationId` that Tracy matches against API callbacks stored in `cudaCallSiteInfo`. For CUDA Graph launches there are no individual-kernel callbacks, so the map has no entries for those correlation IDs.

Key CUPTI finding (discovered during implementation)
All kernels in one `cuGraphLaunch` call share the same `correlationId` as the launch itself. This was confirmed by instrumenting both the API callback and the activity buffer simultaneously. This means the fix is straightforward once `cuGraphLaunch` is tracked in the API callback machinery.

Note: `CUPTI_ACTIVITY_KIND_GRAPH_TRACE` was investigated but must not be enabled — it suppresses individual `CONCURRENT_KERNEL` records for graph-launched kernels, replacing them with graph-level summaries.

Fix
Core graph correlation
- Add `cudaGraphLaunch_v10000`/`cuGraphLaunch` (both runtime and driver API variants) to the existing callback tracker maps so their CPU call site is captured in `cudaCallSiteInfo`
- New `matchGraphActivityToAPICall()` helper (analogous to `matchActivityToAPICall()`):
  - On the first activity of a launch, `matchActivityToAPICall` succeeds normally (consuming the launch entry) and the result is cached in `cudaGraphCurrentLaunch[graphId]`
  - Subsequent operations from the same launch (same `correlationId`, same `graphId`) find the cached `APICallInfo` via `graphId`
  - The cache update uses `insert_or_assign`, which acquires the write lock only once
  - If no cached entry exists, fall through to `matchError()` in the profiler timeline (same behaviour as for non-graph activity)
- `using GraphID = uint32_t` typedef used throughout for graphId-typed variables

Cache retirement (prevents unbounded growth)
- At `cudaGraphExecDestroy` (ENTER callback, while handle is still valid): call `cuptiGetGraphExecId` to translate exec handle → graphId and add it to a pending-retirement set (`graphExecPendingRetire`)
- Erasure is deferred to `OnBufferCompleted` because `cudaGraphExecDestroy` doesn't wait for GPU completion — CUPTI may still have undelivered activity records in its buffers
- New `getGraphIdFromRecord()` helper extracts the graphId field from `CONCURRENT_KERNEL`/`MEMCPY`/`MEMSET` activity records for per-buffer tracking
- An `std::atomic<bool> graphRetirePending` dirty flag lets `OnBufferCompleted` skip the retirement mutex on the hot path (no pending retirements = no lock acquired)

ConcurrentHashMap fixes
- `ConcurrentHashMap`: added `insert_or_assign` wrapping `std::unordered_map::insert_or_assign` under a single write lock
- `ConcurrentHashMap::fetch()`: fixed a pre-existing missing read lock — `mapping.find()` was called without holding any lock, a data race now exercised on every graph activity record

Memory tracking cleanup
- Removed `cudaMalloc`/`cudaFree`/`cuMemAlloc_v2`/`cuMemFree_v2` (6 CBIDs total) from `cbidRuntimeTrackers`/`cbidDriverTrackers` — the MEMORY2 handler never calls `matchActivityToAPICall` and never emits GPU zones, so these entries leaked indefinitely
- The MEMORY2 handler no longer calls `matchActivityToAPICall` — `CUpti_ActivityMemory3` has no `graphId` field, so graph-launched alloc nodes (`cudaGraphAddMemAllocNode`) and pre-profiling allocations can never be correlated to an API call. The handler only needs address, size, and timestamp from the activity record

Other
- `CUPTI_ACTIVITY_KIND_SYNCHRONIZATION` handler: confirmed correct as-is. In-graph event record/wait nodes produce no `SYNCHRONIZATION` activity records (empirically verified on H100)
- `examples/CUDAGraphRepro/Makefile`: added `-arch=native` so NVCC auto-detects the target GPU architecture

Results
Tested on H100, CUDA 13.1, driver 580.105.08.
Single graph (10 launches × 3 nodes):
- 30 GPU zones (20 `vector_add` + 10 `CUDA::memcpy`)

Multiple graphs, interleaved launches — two distinct graphs launched alternately (A, B, A, B, …) to stress graphId cache switching:
- Graph A: `kernel(add)` + `memcpy` + `kernel(add)` — 3 nodes, 5 launches = 15 zones
- Graph B: `kernel(scale)` + `kernel(add)` + `kernel(scale)` — 3 nodes, 5 launches = 15 zones
- (flattened results table listing the observed zones `vector_add`, `vector_scale`, `CUDA::memcpy` (graph) and the `c[0]` correctness check)

Regular (non-graph) CUDA operations — verified that memory CBID tracker removal doesn't regress normal profiling:
GPU zones are correctly correlated to the `cuGraphLaunch` CPU call site: selecting a GPU zone highlights the corresponding CPU time range in the timeline.

Debug build of `tracy-capture` (asserts enabled) confirmed no assertions fire at `server/TracyWorker.cpp:5988`.

Note on GPU zone source locations
All CUDA Graph GPU zones point to `TracyCUDA.hpp` rather than the user's call site. This is the same behaviour as regular CUDA zones — CUPTI delivers activity records asynchronously in a background thread that has no knowledge of the original call stack.

Files changed
- `public/tracy/TracyCUDA.hpp` — core fix
- `examples/CUDAGraphRepro/repro.cu` — updated to use TracyCUDA headers and verify zones
- `examples/CUDAGraphRepro/Makefile` — release and debug variants; `-arch=native` for GPU arch detection

🤖 Generated with Claude Code