[tune/train] Consolidate checkpoint manager 2: Ray Train #24772
Merged
krfricke merged 40 commits into ray-project:master on Jun 7, 2022
added 6 commits June 2, 2022 10:55
# Conflicts:
#   python/ray/train/checkpoint.py
added 4 commits June 3, 2022 21:55
# Conflicts:
#   python/ray/air/train/integrations/huggingface/huggingface_trainer.py
# Conflicts:
#   python/ray/air/train/data_parallel_trainer.py
#   python/ray/air/train/integrations/huggingface/huggingface_trainer.py
amogkam approved these changes Jun 7, 2022
Contributor
Actually, we may want to move the common classes to ray.air right now. We'll be deprecating ml_utils soon.
Contributor Author
thanks for the review - deprecating ML utils sounds good, I just want to make sure we're clear about where this code goes then. In Let's defer this when we actually deprecate the ml_utils - I'll attend to the other comments and merge once tests are passing.
krfricke added a commit that referenced this pull request Jun 7, 2022
#24772 broke the smoke test as it was not run on CI - this PR hotfixes this
krfricke added a commit that referenced this pull request Jun 8, 2022
**Update**: This PR is now part 3 of a three PR group to consolidate the checkpoints.
1. Part 1 adds the common checkpoint management class #24771
2. Part 2 adds the integration for Ray Train #24772
3. This PR builds on #24772 and includes all changes. It moves the Ray Tune integration to use the new common checkpoint manager class.

Old PR description:
This PR consolidates the Ray Train and Tune checkpoint managers. These concepts previously did something very similar but in different modules. To simplify maintenance in the future, we've consolidated the common core.
- This PR keeps full compatibility with the previous interfaces and implementations. This means that for now, Train and Tune will have separate CheckpointManagers that both extend the common core
- This PR prepares Tune to move to a CheckpointStrategy object
- In follow-up PRs, we can further unify interfacing with the common core, possibly removing any train- or tune-specific adjustments (e.g. moving to setup on init rather than on runtime for Ray Train)

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
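The layering described in the commit message above (a common core, with separate Train and Tune managers that both extend it) can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `KeepStrategy`, `CommonCheckpointManager`, `TrainStyleManager`, `TuneStyleManager`, and the `configure` method are hypothetical names, and checkpoints are plain strings held in memory instead of real checkpoint objects on disk.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class KeepStrategy:
    """Hypothetical stand-in for a checkpoint strategy object."""

    num_to_keep: Optional[int] = None  # None keeps every checkpoint
    score_attribute: str = "score"
    higher_is_better: bool = True


class CommonCheckpointManager:
    """Hypothetical shared core: tracks checkpoints and evicts the worst one."""

    def __init__(self, strategy: KeepStrategy) -> None:
        self.strategy = strategy
        # Entries are (priority, insertion order, checkpoint).
        self._kept: List[Tuple[float, int, Any]] = []
        self._counter = 0
        self.deleted: List[Any] = []

    def register(self, checkpoint: Any, metrics: Dict[str, float]) -> None:
        score = metrics[self.strategy.score_attribute]
        # Normalize so that a larger priority always means "better".
        priority = score if self.strategy.higher_is_better else -score
        self._kept.append((priority, self._counter, checkpoint))
        self._counter += 1
        limit = self.strategy.num_to_keep
        if limit is not None and len(self._kept) > limit:
            # Evict the lowest priority; the oldest entry loses ties.
            worst = min(self._kept, key=lambda entry: (entry[0], entry[1]))
            self._kept.remove(worst)
            self._delete(worst[2])

    def _delete(self, checkpoint: Any) -> None:
        # Hook for subclasses; here deletion is only recorded.
        self.deleted.append(checkpoint)

    def best(self) -> Optional[Any]:
        if not self._kept:
            return None
        return max(self._kept, key=lambda entry: (entry[0], entry[1]))[2]


class TrainStyleManager(CommonCheckpointManager):
    """Train-like variant (assumption): the strategy arrives at runtime."""

    def __init__(self) -> None:
        super().__init__(KeepStrategy())

    def configure(self, strategy: KeepStrategy) -> None:
        self.strategy = strategy


class TuneStyleManager(CommonCheckpointManager):
    """Tune-like variant (assumption): legacy-style arguments map to a strategy."""

    def __init__(self, keep_checkpoints_num: Optional[int], checkpoint_score_attr: str) -> None:
        super().__init__(
            KeepStrategy(num_to_keep=keep_checkpoints_num, score_attribute=checkpoint_score_attr)
        )


tune = TuneStyleManager(keep_checkpoints_num=2, checkpoint_score_attr="acc")
for i, acc in enumerate([0.1, 0.9, 0.5]):
    tune.register(f"ckpt_{i}", {"acc": acc})

train = TrainStyleManager()
train.configure(KeepStrategy(num_to_keep=1, score_attribute="loss", higher_is_better=False))
train.register("a", {"loss": 1.0})
train.register("b", {"loss": 0.5})
```

In this sketch both subclasses only translate their own setup conventions into the shared strategy object, so the retention logic lives in one place; that is the kind of maintenance saving the description aims at, though the real classes may divide responsibilities differently.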
sumanthratna pushed a commit to sumanthratna/ray that referenced this pull request Jun 8, 2022
…#24772)
This is a follow-up from ray-project#24771 which moves the Ray Train implementation to use the new common checkpoint manager class.
sumanthratna pushed a commit to sumanthratna/ray that referenced this pull request Jun 8, 2022
ray-project#24772 broke the smoke test as it was not run on CI - this PR hotfixes this
sumanthratna pushed a commit to sumanthratna/ray that referenced this pull request Jun 8, 2022
…24430)
**Update**: This PR is now part 3 of a three PR group to consolidate the checkpoints.
1. Part 1 adds the common checkpoint management class ray-project#24771
2. Part 2 adds the integration for Ray Train ray-project#24772
3. This PR builds on ray-project#24772 and includes all changes. It moves the Ray Tune integration to use the new common checkpoint manager class.

Old PR description:
This PR consolidates the Ray Train and Tune checkpoint managers. These concepts previously did something very similar but in different modules. To simplify maintenance in the future, we've consolidated the common core.
- This PR keeps full compatibility with the previous interfaces and implementations. This means that for now, Train and Tune will have separate CheckpointManagers that both extend the common core
- This PR prepares Tune to move to a CheckpointStrategy object
- In follow-up PRs, we can further unify interfacing with the common core, possibly removing any train- or tune-specific adjustments (e.g. moving to setup on init rather than on runtime for Ray Train)

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Why are these changes needed?
This is a follow-up from #24771 which moves the Ray Train implementation to use the new common checkpoint manager class.
Related issue number
Checks
scripts/format.sh to lint the changes in this PR.