@IsaacYangSLA
Contributor

Signed-off-by: Isaac Yang <isaacy@nvidia.com>

Fixes #1236.

Description

GPU idle detection can hit a race condition when multiple jobs are launched at nearly the same time. This PR adds a random delay to break the tie.

The logs from the four jobs in the reported failed pipeline show that all of them were assigned GPUs 0 and 3. This means they all reached the detection code at the same time, when none of them was using a GPU yet.

The number 16 can be increased to reduce the probability of two jobs being launched side by side.
The number 60 gives the unit test code time to start allocating GPU memory, and gives the other instances time to notice that those GPUs are in use.
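
As a rough illustration of the two numbers above (not the code changed in this PR), the sketch below assumes each job picks one of 16 random slots and waits that many multiples of 60 seconds before running GPU idle detection. How the PR actually combines 16 and 60 is an assumption here, and the names `staggered_delay_seconds` and `collision_probability` are hypothetical.

```python
import random


def staggered_delay_seconds(slots=16, slot_seconds=60, rng=None):
    """Pick a random start delay in seconds.

    Assumption (not confirmed by the PR text): the delay is a random slot
    index in [0, slots) multiplied by slot_seconds.
    """
    if rng is None:
        rng = random
    return rng.randrange(slots) * slot_seconds


def collision_probability(num_jobs, slots=16):
    """Probability that at least two of num_jobs jobs draw the same slot."""
    if num_jobs > slots:
        return 1.0  # more jobs than slots: a shared slot is unavoidable
    p_distinct = 1.0
    for i in range(num_jobs):
        # The i-th job must avoid the i slots already taken.
        p_distinct *= (slots - i) / slots
    return 1.0 - p_distinct


if __name__ == "__main__":
    # A job would call time.sleep(delay) here before probing for idle GPUs.
    delay = staggered_delay_seconds()
    print(f"would sleep for {delay} seconds before GPU idle detection")
    print(f"P(collision) for 4 jobs, 16 slots: {collision_probability(4, 16):.3f}")
    print(f"P(collision) for 4 jobs, 64 slots: {collision_probability(4, 64):.3f}")
```

Under this assumption, four jobs sharing 16 slots still have about a one in three chance that two of them draw the same slot, which is consistent with the remark that 16 can be made larger.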

Status

Ready

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh --codeformat --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

Signed-off-by: Isaac Yang <isaacy@nvidia.com>
Contributor

@wyli wyli left a comment

thanks!

@wyli wyli merged commit 870258b into Project-MONAI:master Nov 17, 2020
@IsaacYangSLA IsaacYangSLA deleted the add_random_delay_during_concurrent_job_launching branch November 17, 2020 17:25
wyli added a commit to wyli/MONAI that referenced this pull request Apr 5, 2021
Signed-off-by: Wenqi Li <wenqil@nvidia.com>
@wyli wyli mentioned this pull request Apr 5, 2021
1 task
Development

Successfully merging this pull request may close these issues.

cron integration test memory error

2 participants