[graph_trainer] Fix CUDAGraph warmup to stay on current stream#2922

Closed
bobrenjc93 wants to merge 2 commits into gh/bobrenjc93/40/base from gh/bobrenjc93/40/head

Conversation


@bobrenjc93 bobrenjc93 commented Apr 9, 2026

Stack from ghstack (oldest at bottom):

The warmup phase was running on _cg_manager.stream via
_use_cuda_memory_pool_manager, which caused NCCL collectives to
execute on a non-default stream. On multi-node IB/RoCE setups this
leads to illegal memory access errors because NCCL expects to run
on the stream it was initialized with.

Fix by passing torch.cuda.current_stream() to
_use_cuda_memory_pool_manager instead of _cg_manager.stream. This
keeps NCCL collectives on the correct stream while still directing
warmup allocations into the graph memory pool (avoiding fragmentation
between eager and graph pools).

Also remove a redundant torch.cuda.synchronize() before graph
recording — torch.cuda.graph.__enter__ already performs a full
device sync. Add capture_error_mode="thread_local" for better
error reporting during graph capture.
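The stream-routing issue described above can be modeled without CUDA. The sketch below is purely illustrative: `use_memory_pool` is a stand-in for `_use_cuda_memory_pool_manager`, and the strings stand in for real CUDA streams; none of this is the actual graph_trainer source.

```python
from contextlib import contextmanager

executed_on = []

@contextmanager
def use_memory_pool(pool, stream):
    # The pool argument controls where allocations land; the stream
    # argument controls which stream enqueued work (e.g. NCCL
    # collectives) actually runs on.
    yield stream

def nccl_all_reduce(stream):
    # NCCL expects to run on the stream it was initialized with
    # (the current/default stream in this scenario).
    executed_on.append(stream)

# Buggy warmup: work lands on the capture stream.
with use_memory_pool("graph_pool", "capture_stream") as s:
    nccl_all_reduce(s)

# Fixed warmup: work stays on the caller's current stream.
with use_memory_pool("graph_pool", "current_stream") as s:
    nccl_all_reduce(s)

print(executed_on)  # ['capture_stream', 'current_stream']
```

In both cases the allocations go to the same pool; only the execution stream differs, which is exactly the one-argument change the fix makes.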

[ghstack-poisoned]
bobrenjc93 added a commit that referenced this pull request Apr 10, 2026

ghstack-source-id: 18aa00a
Pull-Request: #2922
bobrenjc93 (Contributor, Author) commented:

@claude


claude bot commented Apr 10, 2026

Claude finished @bobrenjc93's task in 3m 9s.


Review of PR #2922


Overall: This is a well-motivated fix. The net change (vs. base at HEAD~2) is small and focused — switching the warmup stream parameter from _cg_manager.stream to torch.cuda.current_stream(), plus adding capture_error_mode="thread_local" to graph capture. The comments explaining the rationale are clear and helpful.

Changes look correct. Here are some detailed observations:

Warmup stream fix (cudagraph.py:192-194)

The core fix — passing torch.cuda.current_stream() instead of _cg_manager.stream to _use_cuda_memory_pool_manager — makes sense. _use_cuda_memory_pool_manager directs allocations into the graph memory pool, but the stream argument controls which stream the operations actually execute on. Using _cg_manager.stream (a non-default stream) during warmup would cause NCCL collectives to run on a stream they weren't initialized with, which explains the illegal memory access errors on multi-node setups.

One thing to verify: torch.cuda.current_stream() is evaluated at call time, so this will correctly capture whatever stream is active when warmup runs. This is the right behavior — if the user is on the default stream (typical), warmup stays on the default stream.
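The call-time-evaluation point can be checked with a CUDA-free stand-in: the stream passed to warmup is whatever is current when warmup runs, not whatever was current at setup time. Here `current_stream` mimics `torch.cuda.current_stream()`; everything else is illustrative.

```python
# Stack of active streams; the last entry is the "current" one.
active = ["default"]

def current_stream():
    # Stand-in for torch.cuda.current_stream(): looks up the
    # active stream at the moment of the call.
    return active[-1]

def warmup():
    # The stream is resolved here, when warmup actually runs.
    return current_stream()

print(warmup())  # default

active.append("user_stream")  # user switched streams before warmup
print(warmup())  # user_stream
```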

Removed torch.cuda.synchronize() before graph recording (cudagraph.py:204-209)

The PR description says torch.cuda.graph.__enter__ already performs a full device sync. This is accurate based on the PyTorch implementation — the torch.cuda.graph context manager synchronizes the stream before starting capture. Removing the redundant sync is a clean improvement.

capture_error_mode="thread_local" (cudagraph.py:209)

This is a good addition for debugging. "thread_local" restricts error checking to the current thread, which gives more precise error attribution during graph capture compared to the default "global" mode. This was already applied to the dummy graph at line 59, so it's consistent to use it here too.

Minor notes

  • The intermediate commit (ffc45b0) removed _use_cuda_memory_pool_manager entirely and replaced warmup with bare synchronize() + self._runnable(*args) + synchronize(). The final commit (9c5b2b6) brings the pool manager back but with the correct stream. The net diff is clean — only the stream argument changed and the redundant sync/comment removed. The two-commit history is a bit noisy (the intermediate commit is effectively reverted then improved), but since this is a ghstack PR, the squashed result will be fine.

  • The comment block at lines 187-191 is well-written and explains the "why" clearly. It documents both the memory pool motivation and the stream choice, which will help future readers.

LGTM — the fix is targeted, well-explained, and the risk surface is small.


Contributor commented on these lines of the capture call:

    self._cudagraph,
    pool=_cg_manager.graph_pool,
    stream=_cg_manager.stream,
    capture_error_mode="thread_local",

Would this line be sufficient on its own, if we don't have the change above in:

            with _use_cuda_memory_pool_manager(
                device, _cg_manager.graph_pool, torch.cuda.current_stream()
            ):

bobrenjc93 (Contributor, Author) replied:

I don't think so? This just gives better error reporting but the underlying issue is IMAs stemming from the warmup run on the wrong stream (which gets solved with the quoted code).

Is there a reason you think this one line would be sufficient?

Contributor replied:

  1. Warmup on the capture stream has been working with NCCL for a while. Can we find the root cause of why torch.cuda.current_stream() is needed to avoid the IMA?
  2. If using torch.cuda.current_stream(), there would be high memory overhead due to fragmentation between the capture stream and the default stream in the private pool.
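The fragmentation concern in point 2 can be illustrated with a toy model of a caching allocator that, like PyTorch's, only reuses a cached block on the stream it was last used on. This is a simplified stand-in, not the real allocator.

```python
class ToyCachingAllocator:
    """Toy model: cached free blocks are keyed by stream."""
    def __init__(self):
        self.free = {}      # stream -> list of free block sizes
        self.reserved = 0   # total memory held by the pool

    def alloc(self, stream, size):
        blocks = self.free.get(stream, [])
        if size in blocks:
            blocks.remove(size)    # reuse a block cached for this stream
        else:
            self.reserved += size  # otherwise grow the pool
        return (stream, size)

    def free_block(self, block):
        stream, size = block
        self.free.setdefault(stream, []).append(size)

# Warmup on the default stream allocates and frees a 100-byte buffer...
pool = ToyCachingAllocator()
b = pool.alloc("default", 100)
pool.free_block(b)
# ...but capture on the capture stream cannot reuse it: the pool grows.
pool.alloc("capture", 100)
print(pool.reserved)  # 200

# Warming up on the capture stream lets capture reuse the cached block.
pool2 = ToyCachingAllocator()
b = pool2.alloc("capture", 100)
pool2.free_block(b)
pool2.alloc("capture", 100)
print(pool2.reserved)  # 100
```

Under this model, warming up on the current stream roughly doubles the private pool's reserved memory for warmup-sized buffers, which is the overhead the reviewer is pointing at.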

@bobrenjc93 bobrenjc93 requested a review from BoyuanFeng April 10, 2026 21:15
bobrenjc93 (Contributor, Author) commented:

Talked offline; will try to get precise profile data first, and then we can figure out the best workaround.

@bobrenjc93 bobrenjc93 closed this Apr 13, 2026

Labels: ciflow/8gpu, CLA Signed
