Fixes distributed training hanging issue #3273

Merged: Mayankm96 merged 1 commit into isaac-sim:main from kellyguo11:fix/distributed-crash on Aug 27, 2025
@kellyguo11
Contributor

Description

We have been hunting down a strange issue in distributed training setups with rendering enabled, where the process would often hang midway through training and cause NCCL timeouts. We found a workaround: setting app.execution.debug.forceSerial = true, which forces serialized scheduling of the omni graph within the same thread. This appears to have resolved the hang and did not cause performance regressions.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • [x] I have run the pre-commit checks with ./isaaclab.sh --format
  • [x] I have made corresponding changes to the documentation
  • [x] My changes generate no new warnings
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [ ] I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • [ ] I have added my name to the CONTRIBUTORS.md or my name already exists there

Comment thread apps/isaaclab.python.headless.rendering.kit
@Mayankm96 Mayankm96 merged commit 3edc06c into isaac-sim:main Aug 27, 2025
7 of 8 checks passed
george-nehma pushed a commit to george-nehma/DreamLander-IsaacLab that referenced this pull request Oct 24, 2025