[Core][Autoscaler] Configure idleTimeoutSeconds per node type #48813
rickyyx merged 10 commits into ray-project:master
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
TODO: @ryanaoleary I'll update this PR with doc/API changes and comments containing my manual testing process.
Manual testing process
KubeRay:
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org> Signed-off-by: ryanaoleary <113500783+ryanaoleary@users.noreply.github.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Before merging this PR, would you mind:
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Autoscaler logs show available_node_types: worker group with worker group without
There was a CI error for a Ray Serve test, but I think it's unrelated to this PR.
cc @rickyyx this PR looks good to me. Would you mind taking a look? Thanks!
# The maximal number of worker nodes can be launched for this node type.
max_worker_nodes: int
# Idle timeout seconds for worker nodes of this node type.
idle_timeout_s: Optional[float] = None
nit: should we enforce it as integer with a cast when we add this? I see it being int as part of the schema
Or we could make this a float in the schema too. No preference over this.
I'll change it to a number type in the schema and then add a cast to float when we call idle_timeout_s = group_spec.get(IDLE_SECONDS_KEY), since I implemented it as an int in the RayCluster CRD for consistency with the other field: https://github.com/ray-project/kuberay/blob/925effe34022c72c41691c0b79d8d3051d4a1b77/ray-operator/apis/ray/v1/raycluster_types.go#L94
I ran the tests again and implemented this change in: 1bd8afb
Awesome, thanks for the great work!
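A minimal sketch of the cast discussed in this thread. It assumes the worker group spec is a plain dict and that `IDLE_SECONDS_KEY` maps to the CRD field name `idleTimeoutSeconds`; the constant's value and the helper name `read_idle_timeout_s` are illustrative assumptions, not the code merged in this PR.

```python
from typing import Any, Dict, Optional

# Assumed value: the thread names the constant but does not show its definition.
IDLE_SECONDS_KEY = "idleTimeoutSeconds"


def read_idle_timeout_s(group_spec: Dict[str, Any]) -> Optional[float]:
    """Return the worker group's idle timeout as a float, or None if unset.

    The RayCluster CRD stores the value as an integer, while the node type
    config field is Optional[float], so the value is cast on the way in.
    """
    raw = group_spec.get(IDLE_SECONDS_KEY)
    return float(raw) if raw is not None else None
```

Returning `None` when the key is absent keeps "not configured" distinguishable from an explicit timeout of `0`.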
@pytest.mark.parametrize("node_type_idle_timeout_s", [1, 2, 10])
def test_idle_termination_with_node_type_idle_timeout(node_type_idle_timeout_s):
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
…oject#48813)

## Why are these changes needed?

Adds `idle_timeout_s` as a field to `node_type_configs`, enabling the v2 autoscaler to configure idle termination per worker type.

This PR depends on a change in KubeRay to the RayCluster CRD, since we want to support passing `idleTimeoutSeconds` to individual worker groups such that they can specify a custom idle duration: ray-project/kuberay#2558

## Related issue number

Closes ray-project#36888

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Ricky Xu <xuchen727@hotmail.com>
Signed-off-by: Connor Sanders <connor@elastiflow.com>

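Why are these changes needed?
Adds `idle_timeout_s` as a field to `node_type_configs`, enabling the v2 autoscaler to configure idle termination per worker type.
This PR depends on a change in KubeRay to the RayCluster CRD, since we want to support passing `idleTimeoutSeconds` to individual worker groups such that they can specify a custom idle duration: ray-project/kuberay#2558
Related issue number
Closes #36888
Checks
- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(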
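To illustrate the per-node-type idle termination described under "Why are these changes needed?", here is a rough, self-contained sketch. `NodeTypeConfig` is a trimmed, hypothetical stand-in for the autoscaler's node type config; the fallback to a cluster-wide timeout when `idle_timeout_s` is `None` and the strict `>` comparison are assumptions made for illustration, not the autoscaler's confirmed behavior.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class NodeTypeConfig:
    # Hypothetical, trimmed stand-in for the real node type config.
    name: str
    max_worker_nodes: int
    # Per-node-type idle timeout; None means "not configured for this type".
    idle_timeout_s: Optional[float] = None


def effective_idle_timeout_s(
    cfg: NodeTypeConfig, cluster_idle_timeout_s: float
) -> float:
    """Prefer the node type's own timeout; otherwise use the cluster default (assumed)."""
    if cfg.idle_timeout_s is not None:
        return cfg.idle_timeout_s
    return cluster_idle_timeout_s


def idle_nodes_to_terminate(
    nodes: List[Tuple[str, NodeTypeConfig, float]],
    cluster_idle_timeout_s: float,
) -> List[str]:
    """Return IDs of nodes idle for longer than their effective timeout.

    Each entry in `nodes` is (node_id, node_type_config, idle_duration_s).
    """
    return [
        node_id
        for node_id, cfg, idle_s in nodes
        if idle_s > effective_idle_timeout_s(cfg, cluster_idle_timeout_s)
    ]
```

With a cluster-wide timeout of 60 seconds and a worker group configured with `idle_timeout_s=10.0`, a node in that group that has been idle for 30 seconds would be selected, while a node of an unconfigured type idle for the same 30 seconds would not.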