fix: Disable cudnn sdpa backend when using activation checkpointing#1717

Merged

terrykong merged 2 commits into main from yifu/fix_act2 on Jan 6, 2026

Conversation

@yfw
Contributor

@yfw yfw commented Jan 5, 2026

What does this PR do ?

The previous fix (#1676) did not cover all cases: under certain configurations the cudnn backend was still selected during activation recomputation. The root cause is that the context manager that sets the sdpa backend was only active during the forward call, not during activation recomputation. This PR instead disables the cudnn backend globally when activation checkpointing is enabled on the dtensor path, so both the forward pass and the recomputation see the same backend set.

Related: #1663

This is likely related to this change in the cudnn backend: https://github.com/pytorch/pytorch/pull/155958/files#diff-0af86060a6f34f46e562971d76a9ad8ddaeb945c8fbd6693186f1d60304de438L263
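A minimal, torch-free sketch of the failure mode described above (all names here are hypothetical illustrations, not the actual worker code): a context manager that restricts the backend only around the forward call leaves the global setting untouched when checkpointing later re-runs the forward to recompute activations.

```python
from contextlib import contextmanager

# Hypothetical global flag standing in for the process-wide SDPA backend set.
CUDNN_ENABLED = True

@contextmanager
def restrict_sdpa_backends():
    """Disable the cudnn backend only inside the `with` block."""
    global CUDNN_ENABLED
    prev, CUDNN_ENABLED = CUDNN_ENABLED, False
    try:
        yield
    finally:
        CUDNN_ENABLED = prev

def attention_forward():
    # Records which backend a real SDPA call would pick at this moment.
    return "cudnn" if CUDNN_ENABLED else "flash"

# Forward pass: the context manager is active, so cudnn is avoided.
saved = []
with restrict_sdpa_backends():
    saved.append(attention_forward())

# Backward pass: checkpointing re-runs the forward *outside* the context
# manager, so the recomputation sees cudnn enabled again -> mismatch.
recomputed = attention_forward()

print(saved[0], recomputed)  # flash cudnn
```

Flipping the setting globally once, instead of scoping it to the forward call, makes the forward and the recomputation agree; in current PyTorch the global switch is `torch.backends.cuda.enable_cudnn_sdp(False)` (stated as an assumption here, not quoted from the PR diff).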


Summary by CodeRabbit

  • Bug Fixes

    • Improved activation checkpointing reliability by ensuring consistent CUDA backend behavior during model training.
  • Refactor

    • Simplified internal configuration logic for activation checkpointing and CUDA optimization coordination.


Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@yfw yfw requested review from a team as code owners January 5, 2026 22:52
@github-actions

github-actions Bot commented Jan 5, 2026

⚠️ File Consistency Check

Check based on commit: 9b386a4 (PR #1717 from yifu/fix_act2)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/workers/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/workers/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/workers/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@yfw yfw changed the title Disable cudnn sdpa backend when using activation checkpointing fix: Disable cudnn sdpa backend when using activation checkpointing Jan 5, 2026
@yfw yfw added the CI:L1 Run doctests, unit tests, and functional tests label Jan 5, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Jan 5, 2026

📝 Walkthrough

Modified dtensor_policy_worker_v2.py to refine activation checkpointing and SDPA backend handling. Changed context-parallel activation checkpointing condition, removed SDPA exclusion logic, and added explicit runtime disablement of CUDNN SDPA backend when activation checkpointing is enabled.

Changes

  • SDPA and Activation Checkpointing Logic (nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py): Added CUDA backend utilities import. Narrowed the CP-related activation checkpointing condition to trigger only when cp_size > 1 (excluding the OR clause). Removed the conditional block excluding activation_checkpointing in SDPA init setup. Introduced explicit runtime disablement of the cuDNN SDPA backend when activation checkpointing is enabled, after model load.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

CI:L1, Run CICD

Suggested reviewers

  • terrykong
  • joyang-nv

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Test Results For Major Changes: ⚠️ Warning. The PR implements cuDNN SDPA backend disablement for activation checkpointing but lacks validation showing the fix resolves the recomputation error and does not cause convergence regressions. Resolution: add test results confirming the fix resolves the recomputation metadata error, validate that no convergence regressions occur, and document the performance impact of disabling the cuDNN SDPA backend.

✅ Passed checks (3 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, above the required threshold of 80.00%.
  • Title check: ✅ Passed. The title accurately summarizes the main change: disabling the cudnn SDPA backend when activation checkpointing is enabled, which is the core fix described in the PR objectives.

Ensure both the forward and recompute don't use cudnn

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@github-actions

github-actions Bot commented Jan 6, 2026

⚠️ File Consistency Check

Check based on commit: c91fb39 (PR #1717 from yifu/fix_act2)

⚠️ DTensor Policy Worker Synchronization Warning: same warning as on the previous commit; nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py was modified without a matching update to nemo_rl/models/policy/workers/dtensor_policy_worker.py.

@terrykong terrykong added r0.5.0 CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jan 6, 2026
@terrykong terrykong enabled auto-merge (squash) January 6, 2026 06:56
@terrykong terrykong merged commit 705d25f into main Jan 6, 2026
53 of 55 checks passed
@terrykong terrykong deleted the yifu/fix_act2 branch January 6, 2026 10:29
chtruong814 pushed a commit that referenced this pull request Jan 6, 2026
…1717)

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
parthmannan pushed a commit to parthmannan/RL that referenced this pull request Jan 15, 2026
…VIDIA-NeMo#1717)

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Parth Mannan <pmannan@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 12, 2026
…VIDIA-NeMo#1717)

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
…VIDIA-NeMo#1717)

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
seonjinn pushed a commit that referenced this pull request Mar 9, 2026

Labels

CI:L1 Run doctests, unit tests, and functional tests r0.5.0


2 participants