fix: Fix crash when using activation_checkpointing #1676
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Walkthrough
The change modifies the condition in DTensorPolicyWorkerV2.__init__ that determines when to apply a specialized SDPA backend. The condition now triggers when cp_size > 1 OR activation_checkpointing is enabled, whereas previously it only checked cp_size > 1. A comment clarifies that activation_checkpointing requires excluding CUDNN_ATTENTION due to a known error.
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
What does this PR do ?
Fixes a crash that occasionally occurs when activation_checkpointing is enabled in the dtensor path.
Previously, we were seeing the following error:
This seemed to happen when the CUDNN_ATTENTION SDPA implementation was selected while activation_checkpointing was enabled. As a workaround, we do not allow CUDNN_ATTENTION when activation_checkpointing is enabled in the dtensor path.
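The workaround can be sketched roughly as follows. This is a minimal illustration, not the actual diff: the function name `select_sdpa_backends` and the exact contents of the allowed-backend list are assumptions, and backend names are plain strings (mirroring member names of `torch.nn.attention.SDPBackend`) so the sketch runs without PyTorch.

```python
from typing import List, Optional

# Assumed allow-list for illustration only; the PR's actual list may differ.
# The names mirror members of torch.nn.attention.SDPBackend.
NON_CUDNN_BACKENDS = ["FLASH_ATTENTION", "EFFICIENT_ATTENTION"]


def select_sdpa_backends(
    cp_size: int, activation_checkpointing: bool
) -> Optional[List[str]]:
    """Hypothetical helper: return a restricted backend list, or None.

    None means "do not restrict", i.e. leave backend selection to the
    framework default.
    """
    # Before this PR, the condition was only `cp_size > 1`.
    if cp_size > 1 or activation_checkpointing:
        # CUDNN_ATTENTION is deliberately absent from the returned list.
        return list(NON_CUDNN_BACKENDS)
    return None


# Activation checkpointing alone now triggers the restriction.
restricted = select_sdpa_backends(cp_size=1, activation_checkpointing=True)
print(restricted)
```

In a real PyTorch setup, a list like this would hold `SDPBackend` members and could be applied with the `torch.nn.attention.sdpa_kernel` context manager; how the worker actually wires it in is not shown in this PR description.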
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information