Skip to content

[Core] Fix session_name not reused when GCS restarts + node ip address not set for driver#39211

Closed
rkooo567 wants to merge 46 commits intoray-project:masterfrom
rkooo567:option-2-node-ip
Closed

[Core] Fix session_name not reused when GCS restarts + node ip address not set for driver#39211
rkooo567 wants to merge 46 commits intoray-project:masterfrom
rkooo567:option-2-node-ip

Conversation

@rkooo567
Copy link
Contributor

@rkooo567 rkooo567 commented Sep 1, 2023

Why are these changes needed?

This is the PR that combines #37644 and also allows to reuse the session_name when GCS is restarted.

Every GCS HA users are supposed to use RAY_external_storage_namespace, and we are using this name as a session_name. This also means all the directory will have different style naming when GCS HA is used (instead of session_, it will become session_).

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(


You should also set the OS environment variable ``RAY_external_storage_namespace`` to isolate the data stored in Redis.
This makes sure that there is no data conflicts if multiple Ray clusters share the same Redis instance.
``RAY_external_storage_namespace`` must be an unique ID, and whenever you restart a head node,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we give users an example of unique id or generator code like uuid.uuid4() ?

self._session_name = f"session_{date_str}_{os.getpid()}"
else:
assert not self._default_worker
session_name = ray._private.utils.internal_kv_get_with_retry(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This not a broker) . the behavior of worker node will be changed if worker node also set RAY_external_storage_namespace.
origin:
worker node's session_name always fetch from kv storage.

Now:
If worker node set RAY_external_storage_namespace. worker node's session_name will use RAY_external_storage_namespace first.

If the RAY_external_storage_namespace is set improperly by the user, it can potentially cause issues with the worker nodes.

@rkooo567
Copy link
Contributor Author

rkooo567 commented Sep 5, 2023

Sorry @larrylian this was the emergency PR in case we cannot get #39194 in. I think #39194 will work out, so let me close the issue

@rkooo567 rkooo567 closed this Sep 5, 2023
@larrylian
Copy link
Contributor

ok

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants