[horovod] Horovod+Ray Pytorch Lightning Accelerator #13458
richardliaw merged 26 commits into ray-project:master from
Conversation
I'm seeing these warning messages at the end of the training run.
python/ray/util/lightning_accelerators/tests/test_horovod_ray_accelerator.py
cc @tgaddair
python/ray/util/lightning_accelerators/examples/ptl_horovod_ray_example.py
self.executor.start(executable_cls=get_executable_cls())

def train(self):
    results = self.executor.run(self.train_remote)
Hmm, just as a note, I think this will actually not work for any model larger than 4GB because we end up serializing the whole state. One possibility is to dereference the pointer to self.model and then use the Object store to move it over.
I wouldn't block this PR on that though
Oh that's a good point. Is that how it's currently being handled with standard Horovod+Ray?
In standard Horovod/Ray, the model is defined and instantiated within the training function. However, in PTL, the model is expected to be instantiated before the training function is serialized.
Ah that's right, got it.
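The distinction above is the crux of the serialization concern. A minimal, hypothetical sketch of the two styles (with a small dict standing in for a real model, and `build_model` as an illustrative helper not taken from the PR):

```python
def build_model():
    # Stand-in for a model; in practice this could hold gigabytes of weights.
    return {"weights": [1.0, 2.0, 3.0]}

# Standard Horovod-on-Ray style: the model is created *inside* the training
# function, so shipping the function to a worker serializes only its
# definition, not any weights.
def train_fn_horovod():
    model = build_model()  # instantiated on the worker itself
    return sum(model["weights"])

# PyTorch Lightning style: the model already exists when the training
# function is defined, so a closure over it would pull the full model
# state into the serialized payload sent to each worker.
model = build_model()

def train_fn_ptl():
    return sum(model["weights"])

print(train_fn_horovod(), train_fn_ptl())  # both compute the same result
```

Both functions behave identically when called locally; the difference only matters when the function itself must be serialized and shipped to remote workers, which is why the closure-captured model is the problematic case.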
This should be supported now: the entire trainer is put into the object store and then fetched later by each worker.
richardliaw left a comment:

One comment about reducing code by using bolts.
@richardliaw if this looks good to you then let's merge this? The failing tests are all unrelated and are known to be flaky.
…ect#13458)"

This reverts commit 7340cf1.
Why are these changes needed?
TODO:
TODO for future PRs:
Related issue number
Checks
scripts/format.sh to lint the changes in this PR.