Contributor

@wyli wyli commented Nov 25, 2020

Signed-off-by: Wenqi Li <wenqil@nvidia.com>

Fixes #926

Description

Adds a utility for running distributed tests.

Status

Ready

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh --codeformat --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

@wyli wyli marked this pull request as draft November 25, 2020 23:39
@wyli wyli force-pushed the 926-distributed-training-tests branch 4 times, most recently from a81a782 to 424b607 Compare November 26, 2020 11:10
@wyli wyli requested review from Nic-Ma and ericspod and removed request for Nic-Ma November 26, 2020 11:34
@wyli wyli marked this pull request as ready for review November 26, 2020 11:34
Contributor Author

wyli commented Nov 26, 2020

- tested single-machine multiprocess on macOS and Ubuntu w/o GPU

On Windows CI the distributed test fails with:

OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\hostedtoolcache\windows\Python\3.8.6\x64\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.

I think this is a limitation of the CI instance rather than a code issue, so I skip the tests on Windows.

updates:

  • fixed windows issue with 3rd party github action al-cheb/configure-pagefile-action@v1.2
  • tested single machine multiprocess on macos/ubuntu/windows w/o GPU
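
For readers following along, the platform skip mentioned in this comment could look roughly like the sketch below. It uses only the standard library `unittest.skipIf` decorator; the names `IS_WINDOWS`, `DistributedSmokeTest`, and `run_suite` are hypothetical and are not taken from this PR, and the test body is a placeholder rather than a real distributed test.

```python
import sys
import unittest

# Hypothetical flag (not from the PR): True on Windows, where the CI hit WinError 1455.
IS_WINDOWS = sys.platform == "win32"


@unittest.skipIf(IS_WINDOWS, "distributed tests skipped on Windows CI (paging file too small)")
class DistributedSmokeTest(unittest.TestCase):
    """Placeholder standing in for a distributed test case."""

    def test_placeholder(self):
        # A real test would launch worker processes here; this only checks the wiring.
        self.assertEqual(1 + 1, 2)


def run_suite():
    """Load and run the placeholder test case, returning the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DistributedSmokeTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    result = run_suite()
    print("skipped:", len(result.skipped), "failures:", len(result.failures))
```

A skip like this hides the problem rather than fixing it, which is presumably why the update above replaces it with a larger pagefile on the CI runner.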

@wyli wyli requested a review from Nic-Ma November 26, 2020 11:42
Contributor Author

wyli commented Nov 26, 2020

/integration-test

wyli added 3 commits November 26, 2020 17:11
Signed-off-by: Wenqi Li <wenqil@nvidia.com>
Signed-off-by: Wenqi Li <wenqil@nvidia.com>
Signed-off-by: Wenqi Li <wenqil@nvidia.com>
@wyli wyli force-pushed the 926-distributed-training-tests branch from 98b4034 to 4748431 Compare November 26, 2020 17:15
Signed-off-by: Wenqi Li <wenqil@nvidia.com>
@wyli wyli force-pushed the 926-distributed-training-tests branch from 4748431 to da4092b Compare November 26, 2020 17:30
@Nic-Ma Nic-Ma self-requested a review November 27, 2020 14:15
Contributor

@Nic-Ma Nic-Ma left a comment

Let's use spawn for now, good enough for the first version, maybe some expert can help update it later.

Thanks.

Contributor Author

wyli commented Nov 27, 2020

> Let's use spawn for now, good enough for the first version, maybe some expert can help update it later.
>
> Thanks.

sure thanks! this also needs to be extended to multi-node test cases
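
As a rough illustration of the spawn approach discussed here (not the utility merged in this PR), the sketch below starts one process per rank with the standard library's `multiprocessing` "spawn" start method and sets the `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` variables that `torch.distributed` reads for `env://` initialization. The helper names `rank_environment`, `run_in_processes`, and `check_rank`, and the port 29500, are assumptions made for the example. It covers a single machine only; multi-node cases are out of scope.

```python
import multiprocessing as mp
import os


def rank_environment(rank, world_size, master_addr="127.0.0.1", master_port=29500):
    """Build per-rank environment variables (address and port are arbitrary example values)."""
    return {
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }


def _worker(rank, world_size, target):
    # Runs inside the child process: export the rank settings, then call the test body.
    os.environ.update(rank_environment(rank, world_size))
    target(rank, world_size)


def run_in_processes(target, world_size=2, timeout=60):
    """Run `target(rank, world_size)` in `world_size` spawned processes; return exit codes.

    `target` must be a module-level function so that "spawn" can pickle it.
    """
    ctx = mp.get_context("spawn")
    procs = [ctx.Process(target=_worker, args=(rank, world_size, target)) for rank in range(world_size)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join(timeout)
        if proc.is_alive():
            # A hung rank is terminated so the test run cannot block forever.
            proc.terminate()
            proc.join()
    return [proc.exitcode for proc in procs]


def check_rank(rank, world_size):
    """Example test body: each process should see its own rank in the environment."""
    assert os.environ["RANK"] == str(rank)
    assert os.environ["WORLD_SIZE"] == str(world_size)


if __name__ == "__main__":
    # An exit code of 0 for every rank means every process passed its assertions.
    print(run_in_processes(check_rank, world_size=2))
```

"spawn" starts each child from a fresh interpreter, so it behaves the same on Linux, macOS, and Windows, at the cost of slower startup than "fork".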

@wyli wyli merged commit dcc0a38 into Project-MONAI:master Nov 27, 2020
@wyli wyli deleted the 926-distributed-training-tests branch April 12, 2021 14:21

Development

Successfully merging this pull request may close these issues.

Add support to run distributed training tests in CI
