Skip to content

[tune/train] Consolidate checkpoint manager 2: Ray Train#24772

Merged
krfricke merged 40 commits intoray-project:masterfrom
krfricke:tune-train/checkpoints-train
Jun 7, 2022
Merged

[tune/train] Consolidate checkpoint manager 2: Ray Train#24772
krfricke merged 40 commits intoray-project:masterfrom
krfricke:tune-train/checkpoints-train

Conversation

@krfricke
Copy link
Contributor

@krfricke krfricke commented May 13, 2022

Why are these changes needed?

This is a follow-up from #24771 which moves the Ray Train implementation to use the new common checkpoint manager class.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@krfricke krfricke added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jun 3, 2022
Kai Fricke added 4 commits June 3, 2022 21:55
# Conflicts:
#	python/ray/air/train/integrations/huggingface/huggingface_trainer.py
# Conflicts:
#	python/ray/air/train/data_parallel_trainer.py
#	python/ray/air/train/integrations/huggingface/huggingface_trainer.py
@krfricke krfricke removed the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jun 6, 2022
@krfricke krfricke requested a review from amogkam June 6, 2022 09:54
@krfricke krfricke added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jun 6, 2022
Copy link
Contributor

@amogkam amogkam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @krfricke!

@amogkam
Copy link
Contributor

amogkam commented Jun 7, 2022

Actually we may want to move the common classes to ray.air right now itself. We’ll be deprecating ml_utils soon.

@krfricke
Copy link
Contributor Author

krfricke commented Jun 7, 2022

thanks for the review - deprecating ML utils sounds good, I just want to make sure we're clear about where this code goes then. In ray.air it may be confusing with the Checkpoint class. I think one question would be if this should just go into Tune once we removed the custom implementations (just like tune.report()).

Let's defer this when we actually deprecate the ml_utils - I'll attend to the other comments and merge once tests are passing.

@krfricke krfricke removed the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jun 7, 2022
@krfricke krfricke merged commit 984b9a5 into ray-project:master Jun 7, 2022
@krfricke krfricke deleted the tune-train/checkpoints-train branch June 7, 2022 12:51
krfricke added a commit that referenced this pull request Jun 7, 2022
#24772 broke the smoke test as it was not run on CI - this PR hotfixes this
krfricke added a commit that referenced this pull request Jun 8, 2022
**Update**: This PR is now part 3 of a three PR group to consolidate the checkpoints.

1. Part 1 adds the common checkpoint management class #24771 
2. Part 2 adds the integration for Ray Train #24772
3. This PR builds on #24772 and includes all changes. It moves the Ray Tune integration to use the new common checkpoint manager class.

Old PR description:

This PR consolidates the Ray Train and Tune checkpoint managers. These concepts previously did something very similar but in different modules. To simplify maintenance in the future, we've consolidated the common core.

- This PR keeps full compatibility with the previous interfaces and implementations. This means that for now, Train and Tune will have separate CheckpointManagers that both extend the common core
- This PR prepares Tune to move to a CheckpointStrategy object
- In follow-up PRs, we can further unify interfacing with the common core, possibly removing any train- or tune-specific adjustments (e.g. moving to setup on init rather on runtime for Ray Train)

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
sumanthratna pushed a commit to sumanthratna/ray that referenced this pull request Jun 8, 2022
…#24772)

This is a follow-up from ray-project#24771 which moves the Ray Train implementation to use the new common checkpoint manager class.
sumanthratna pushed a commit to sumanthratna/ray that referenced this pull request Jun 8, 2022
ray-project#24772 broke the smoke test as it was not run on CI - this PR hotfixes this
sumanthratna pushed a commit to sumanthratna/ray that referenced this pull request Jun 8, 2022
…24430)

**Update**: This PR is now part 3 of a three PR group to consolidate the checkpoints.

1. Part 1 adds the common checkpoint management class ray-project#24771 
2. Part 2 adds the integration for Ray Train ray-project#24772
3. This PR builds on ray-project#24772 and includes all changes. It moves the Ray Tune integration to use the new common checkpoint manager class.

Old PR description:

This PR consolidates the Ray Train and Tune checkpoint managers. These concepts previously did something very similar but in different modules. To simplify maintenance in the future, we've consolidated the common core.

- This PR keeps full compatibility with the previous interfaces and implementations. This means that for now, Train and Tune will have separate CheckpointManagers that both extend the common core
- This PR prepares Tune to move to a CheckpointStrategy object
- In follow-up PRs, we can further unify interfacing with the common core, possibly removing any train- or tune-specific adjustments (e.g. moving to setup on init rather on runtime for Ray Train)

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants