fix: Disable cudnn sdpa backend when using activation checkpointing #1717
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
📝 Walkthrough
Modified dtensor_policy_worker_v2.py to refine activation checkpointing and SDPA backend handling. Changed the context-parallel activation checkpointing condition, removed the SDPA exclusion logic, and added explicit runtime disablement of the cuDNN SDPA backend when activation checkpointing is enabled.
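The runtime disablement described in the walkthrough can be sketched roughly as follows. This is a minimal illustration, not the code in dtensor_policy_worker_v2.py: the function name `disable_cudnn_sdpa_if_checkpointing` and its boolean parameter are hypothetical, and the only real API used is PyTorch's `torch.backends.cuda.enable_cudnn_sdp`, which toggles the cuDNN SDPA backend process-wide in recent PyTorch releases.

```python
def disable_cudnn_sdpa_if_checkpointing(activation_checkpointing: bool) -> bool:
    """Hypothetical helper: turn off the cuDNN SDPA backend for the whole
    process when activation checkpointing is enabled.

    Returns True if the backend was disabled, False if nothing was changed.
    """
    if not activation_checkpointing:
        # Without checkpointing there is no recompute pass, so leave the
        # backend selection untouched.
        return False

    # Imported here so the no-op path above does not require PyTorch.
    import torch

    # Process-wide switch: it applies to the original forward pass and to
    # any later recomputation triggered during backward.
    torch.backends.cuda.enable_cudnn_sdp(False)
    return True
```

Because the switch is global rather than scoped to a `with` block, it would presumably be called once during worker setup, before the first forward pass.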
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Ensure both the forward and recompute don't use cudnn
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
…1717) Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
…VIDIA-NeMo#1717) Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
…VIDIA-NeMo#1717) Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
…1717) Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
What does this PR do?
The previous fix (#1676) did not cover all cases. Specifically, it failed for certain configurations where the cudnn backend still appeared to be selected during activation recomputation. This is because the context manager that sets the sdpa backend was only active during the `forward` call, not during activation recomputation. To avoid this, this PR disables the cudnn backend globally when activation checkpointing is enabled for the dtensor path.
Related: #1663
This is likely related to this change in the cudnn backend: https://github.com/pytorch/pytorch/pull/155958/files#diff-0af86060a6f34f46e562971d76a9ad8ddaeb945c8fbd6693186f1d60304de438L263
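To make the failure mode concrete, here is a pure-Python toy model of the behavior described above. It does not use PyTorch: `_state`, `cudnn_disabled`, `attention`, `ToyCheckpoint`, and `run` are all hypothetical stand-ins, and the sketch only assumes that recomputation re-runs the saved function during backward, after the forward-scoped context manager has already exited.

```python
from contextlib import contextmanager

# Hypothetical stand-in for the process-wide SDPA backend setting.
_state = {"cudnn_enabled": True}


@contextmanager
def cudnn_disabled():
    """Temporarily disable the (simulated) cuDNN backend."""
    previous = _state["cudnn_enabled"]
    _state["cudnn_enabled"] = False
    try:
        yield
    finally:
        _state["cudnn_enabled"] = previous


def attention(log):
    """Record which backend would be selected at call time."""
    log.append("cudnn" if _state["cudnn_enabled"] else "other")


class ToyCheckpoint:
    """Toy activation checkpoint: backward re-runs the wrapped function."""

    def __init__(self, fn):
        self.fn = fn

    def forward(self, log):
        self.fn(log)

    def backward(self, log):
        # Recomputation happens here, outside any forward-only context.
        self.fn(log)


def run(disable_globally):
    """Return the backends used by [forward, recompute]."""
    _state["cudnn_enabled"] = not disable_globally
    log = []
    block = ToyCheckpoint(attention)
    with cudnn_disabled():  # mirrors the forward-only context manager
        block.forward(log)
    block.backward(log)  # context manager is no longer active
    return log
```

In this model, `run(False)` shows the recompute falling back to the default backend once the context exits, while `run(True)` keeps both passes off the simulated cuDNN path, which is the effect this PR aims for.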
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information