[horovod] Horovod+Ray Pytorch Lightning Accelerator #13458
richardliaw merged 26 commits into ray-project:master from
Conversation
I'm seeing these warning messages at the end of the training run.
python/ray/util/lightning_accelerators/tests/test_horovod_ray_accelerator.py
cc @tgaddair
python/ray/util/lightning_accelerators/examples/ptl_horovod_ray_example.py
self.executor.start(executable_cls=get_executable_cls())

def train(self):
    results = self.executor.run(self.train_remote)
Hmm, just as a note, I think this will actually not work for any model larger than 4GB because we end up serializing the whole state. One possibility is to dereference the pointer to self.model and then use the Object store to move it over.
I wouldn't block this PR on that though
Oh that's a good point. Is that how it's currently being handled with standard Horovod+Ray?
In standard Horovod/Ray, the model is defined and instantiated within the training function. However, in PTL, the model is expected to be instantiated before the training function is serialized.
Ah that's right, got it.
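The distinction above is the crux of the serialization concern. A minimal, hypothetical sketch of the two styles (with a small dict standing in for a real model, and `build_model` as an illustrative helper not taken from the PR):

```python
def build_model():
    # Stand-in for a model; in practice this could hold gigabytes of weights.
    return {"weights": [1.0, 2.0, 3.0]}

# Standard Horovod-on-Ray style: the model is created *inside* the training
# function, so shipping the function to a worker serializes only its
# definition, not any weights.
def train_fn_horovod():
    model = build_model()  # instantiated on the worker itself
    return sum(model["weights"])

# PyTorch Lightning style: the model already exists when the training
# function is defined, so a closure over it would pull the full model
# state into the serialized payload sent to each worker.
model = build_model()

def train_fn_ptl():
    return sum(model["weights"])

print(train_fn_horovod(), train_fn_ptl())  # both compute the same result
```

Both functions behave identically when called locally; the difference only matters when the function itself must be serialized and shipped to remote workers, which is why the closure-captured model is the problematic case.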
This should be supported now: the entire trainer is put into the object store and then fetched later by each worker.
richardliaw left a comment:

One comment about reducing code by using bolts.
@richardliaw if this looks good to you then let's merge this? The failing tests are all unrelated and are known to be flaky.
…ect#13458)"

This reverts commit 7340cf1.
Why are these changes needed?
TODO:
TODO for future PRs:
Related issue number
Checks
scripts/format.sh to lint the changes in this PR.