
[horovod] Horovod+Ray Pytorch Lightning Accelerator #13458

Merged
richardliaw merged 26 commits into ray-project:master from amogkam:horovod-ptl on Jan 23, 2021

Conversation

@amogkam (Contributor) commented Jan 14, 2021

Why are these changes needed?

TODO:

  • Add unit tests
  • Test on multi-node, multi-GPU cluster

TODO for future PRs:

  • Add release test
  • Add documentation
  • Integrate with Tune

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

amogkam (Contributor, Author) commented Jan 14, 2021

I'm seeing these warning messages at the end of the training run:

(pid=25122) /usr/local/Cellar/python@3.8/3.8.6_2/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
(pid=25122)   warnings.warn('resource_tracker: There appear to be %d '
(pid=25111) /usr/local/Cellar/python@3.8/3.8.6_2/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
(pid=25111)   warnings.warn('resource_tracker: There appear to be %d '
(pid=25116) /usr/local/Cellar/python@3.8/3.8.6_2/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
(pid=25116)   warnings.warn('resource_tracker: There appear to be %d '
(pid=25110) /usr/local/Cellar/python@3.8/3.8.6_2/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
(pid=25110)   warnings.warn('resource_tracker: There appear to be %d '

amogkam (Contributor, Author) commented Jan 14, 2021

cc @tgaddair

@amogkam changed the title from "[WIP] Horovod+Ray PTL Accelerator" to "Horovod+Ray Pytorch Lightning Accelerator" on Jan 22, 2021
@amogkam amogkam marked this pull request as ready for review January 22, 2021 08:58
@amogkam amogkam requested a review from richardliaw January 22, 2021 08:58
        self.executor.start(executable_cls=get_executable_cls())

    def train(self):
        results = self.executor.run(self.train_remote)
Contributor:
Hmm, just as a note, I think this will actually not work for any model larger than 4GB, because we end up serializing the whole state. One possibility is to dereference the pointer to self.model and then use the object store to move it over.

I wouldn't block this PR on that though
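The serialization concern above can be demonstrated with a small stand-alone sketch (hypothetical names; plain `pickle` standing in for the cloudpickle serialization Ray uses): pickling a bound method such as `self.train_remote` drags the whole instance, model included, into the payload.

```python
import pickle

class Trainer:
    def __init__(self):
        # Stand-in for a large model held on the trainer.
        self.model = bytearray(10_000_000)

    def train_remote(self):
        return "trained"

t = Trainer()

# Pickling the bound method serializes the whole Trainer instance,
# so the 10 MB "model" rides along with the training function.
payload = pickle.dumps(t.train_remote)
print(len(payload) > 10_000_000)  # True
```

With a real multi-gigabyte model the same mechanism hits serialization limits, which is the reviewer's point.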

Contributor Author:
Oh that's a good point. Is that how it's currently being handled with standard Horovod+Ray?

Contributor:
In standard horovod/ray, the model is defined + instantiated within the training function. However, in PTL, the model is expected to be instantiated before the training function is serialized.
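The contrast can be sketched with plain `pickle` (again a stand-in; Ray ships functions with cloudpickle): a training function that builds its model internally serializes to a few bytes, because nothing large is captured.

```python
import pickle

def build_model():
    # Hypothetical factory; in the standard horovod/ray pattern this runs
    # on each worker, inside the training function.
    return bytearray(1_000_000)

def train_fn():
    model = build_model()  # instantiated after deserialization, per worker
    return len(model)

# A top-level function pickles by reference (module + name), so the
# payload stays tiny; the 1 MB model never crosses the wire up front.
payload = pickle.dumps(train_fn)
print(len(payload) < 200)  # True
```

In PTL the trainer and model already exist before training starts, so this trick is unavailable and the whole object must be moved explicitly.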

Contributor Author:
Ah that's right, got it.

Contributor Author:
This should be supported now: the entire trainer is put into the object store and then fetched later by each worker.
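The mechanics of that fix can be sketched without Ray (plain bytes standing in for the object store; in the PR, `ray.put`/`ray.get` play this role): the trainer is serialized once, and each worker rebuilds its own copy.

```python
import pickle

class Trainer:
    def __init__(self, model):
        self.model = model

    def fit(self):
        return sum(self.model)

# Serialize the whole trainer once; with Ray this blob would go into the
# object store via ray.put, and workers would receive an ObjectRef.
blob = pickle.dumps(Trainer(model=[1, 2, 3]))

def worker(shared_blob):
    # Each worker fetches the blob and reconstructs its own trainer copy.
    trainer = pickle.loads(shared_blob)
    return trainer.fit()

print([worker(blob) for _ in range(2)])  # [6, 6]
```

Storing the blob once and letting each worker fetch it avoids re-serializing the trainer per task.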

@richardliaw (Contributor) left a comment:
One comment about reducing code by using bolts.

amogkam (Contributor, Author) commented Jan 22, 2021

@richardliaw if this looks good to you then let's merge this? The failing tests are all unrelated and are known to be flaky.

@amogkam amogkam added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jan 22, 2021
@richardliaw changed the title from "Horovod+Ray Pytorch Lightning Accelerator" to "[horovod] Horovod+Ray Pytorch Lightning Accelerator" on Jan 23, 2021
@richardliaw richardliaw merged commit 01d74af into ray-project:master Jan 23, 2021
fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021