[CoreWorker] lazy bind core_work's job_config through task spec. #31375

Merged
scv119 merged 28 commits into ray-project:master from scv119:job_config
Jan 12, 2023
Conversation

Contributor

scv119 commented Dec 30, 2022

Why are these changes needed?

Previously, the worker got its job_config from the raylet on construction, which prevented us from lazily binding job_config to workers. This PR enables lazy binding by piggybacking job_config in TaskSpec and initializing the job_config when the worker receives a task execution request (the push_task call).

We also refactor the WorkerContext and RayletClient as part of the change.
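
The lazy-binding flow described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code in this PR: JobConfig, TaskSpec, and LazyWorkerContext are simplified hypothetical stand-ins for rpc::JobConfig, TaskSpecification, and WorkerContext, and MaybeBindJob is a made-up method name.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical stand-in for rpc::JobConfig.
struct JobConfig {
  std::string runtime_env;
};

// Hypothetical stand-in for a task spec that carries the job config with it.
struct TaskSpec {
  int job_id;
  JobConfig job_config;
};

// Hypothetical worker-side context that starts without any job information.
class LazyWorkerContext {
 public:
  // Intended to run when a task execution request arrives. The first call
  // binds the job; later calls only check that the task is from the same job.
  bool MaybeBindJob(const TaskSpec &spec) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!bound_) {
      job_id_ = spec.job_id;
      job_config_ = spec.job_config;
      bound_ = true;
      return true;
    }
    return job_id_ == spec.job_id;
  }

  bool IsBound() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return bound_;
  }

  // Returns a copy made while the lock is held.
  JobConfig GetJobConfig() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return job_config_;
  }

 private:
  mutable std::mutex mutex_;
  bool bound_ = false;
  int job_id_ = 0;
  JobConfig job_config_;
};

// Usage sketch: the context is unbound until the first task arrives.
inline void ExampleLazyBind() {
  LazyWorkerContext ctx;
  assert(!ctx.IsBound());
  assert(ctx.MaybeBindJob(TaskSpec{1, JobConfig{"env-a"}}));
  assert(ctx.GetJobConfig().runtime_env == "env-a");
  // A task from a different job is rejected and does not rebind.
  assert(!ctx.MaybeBindJob(TaskSpec{2, JobConfig{"env-b"}}));
}
```

The sketch reports a mismatched job by returning false; whether a real worker should instead treat that as a fatal error is a design choice this example does not settle.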

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

scv119 changed the title from "[CoreWorker] populate taskspec with job_config" to "[CoreWorker] populate job_config through task spec." Jan 1, 2023
scv119 changed the title from "[CoreWorker] populate job_config through task spec." to "[CoreWorker] lazy bind core_work's job_config through task spec." Jan 1, 2023
scv119 marked this pull request as ready for review January 1, 2023 01:04
Contributor Author

scv119 commented Jan 2, 2023

@liuyang-my the Java test failed, but I'm not sure what exactly happened from reading the logs. Do you know what might have gone wrong? (Presumably we are hitting a deadlock?) The failed test is https://buildkite.com/ray-project/oss-ci-build-pr/builds/8396#01857127-086b-406b-92da-6f935dcc8447

Comment thread src/ray/core_worker/context.cc Outdated
Comment thread src/ray/core_worker/context.cc Outdated
}
// ---Actor death contexts end----

message JobConfig {
Contributor Author

move JobConfig to common.proto to break the circular dependency

Comment thread cpp/src/ray/runtime/local_mode_ray_runtime.cc Outdated
Comment thread src/ray/core_worker/context.cc Outdated
Comment thread src/ray/core_worker/context.cc Outdated
Comment thread src/ray/core_worker/core_worker.cc
std::string task_name =
invocation.name.empty() ? functionDescriptor->DefaultTaskName() : invocation.name;

static rpc::JobConfig kDefaultJobConfig;
Member

Is static a premature optimization? Is it possible for this to be called from multiple threads and corrupt the object?
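
On the thread-safety part of this question: since C++11, the initialization of a function-local static is guaranteed to run exactly once even if several threads reach it concurrently, but later mutation of the object is not protected. One common way to make a shared default safe is to declare it const and hand out only const references. A minimal sketch, assuming a hypothetical JobConfig struct in place of rpc::JobConfig and a made-up DefaultJobConfig() helper:

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for rpc::JobConfig.
struct JobConfig {
  std::string ray_namespace;
};

// Returns a shared, immutable default. The static is initialized once on
// first use; because it is const, callers can only read it afterwards.
const JobConfig &DefaultJobConfig() {
  static const JobConfig kDefaultJobConfig{};
  return kDefaultJobConfig;
}

// Usage sketch: every call sees the same default-constructed object.
inline void ExampleDefault() {
  const JobConfig &a = DefaultJobConfig();
  const JobConfig &b = DefaultJobConfig();
  assert(&a == &b);
  assert(a.ray_namespace.empty());
}
```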

Contributor

fishbone commented Jan 6, 2023

Question: will this slow down performance? I think this adds the runtime env to all task specs (previously, only one). Do you mind benchmarking for a perf regression?

Besides this, do you think it would be good to pass the job config to the workers through stdin? Doing it that way, we could probably limit all the changes to the worker pool.

I'm also thinking this may need extension in the future. I believe we probably don't want to pass everything through the task spec.

By the way, I'm OK with this approach if the benchmark with job config is OK. But let's add a comment to the job config proto to let people know it's passed to all tasks repeatedly.

Still reviewing...

const TaskID &GetCurrentTaskId();

const JobID &GetCurrentJobID();
JobID GetCurrentJobID();
Contributor

fishbone Jan 6, 2023

Why update this one but not the rest?

Contributor Author

This calls into the context, which returns by value instead of by reference, so we changed it accordingly.
Changing the rest to return by value is a great idea, but it would double the size of the PR and touch a lot of C++ runtime code, so we prefer not to change them in this PR.

std::string task_name =
invocation.name.empty() ? functionDescriptor->DefaultTaskName() : invocation.name;

rpc::JobConfig kDefaultJobConfig;
Contributor

nit: make it a global const static variable?

Not sure, but I always feel the kXYZ naming is for global const variables.

Comment thread src/ray/common/task/task_spec.cc Outdated
return JobID::FromBinary(message_->job_id());
}

rpc::JobConfig TaskSpecification::JobConfig() const { return message_->job_config(); }
Contributor

why not const reference?

Contributor

+1

Language::PYTHON,
FunctionDescriptorBuilder::BuildPython("", "", "", ""),
job_id,
config,
Contributor

nit: rpc::JobConfig() seems easier to read. Otherwise we need to check what config is in the code.

Comment thread src/ray/core_worker/context.h Outdated
Comment on lines +113 to +114
JobID current_job_id_ GUARDED_BY(mutex_);
rpc::JobConfig job_config_ GUARDED_BY(mutex_);
Contributor

should we use optional here if it's lazily initialized?

Contributor

+1. Why don't we use optional instead of default config?
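
The suggestion above could look roughly like the following. This is only a sketch under assumptions: JobConfig is a hypothetical stand-in for rpc::JobConfig, LazyJobConfig is a made-up holder class, and std::optional (C++17) is used so that "not bound yet" can be told apart from "bound to a default-constructed config".

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for rpc::JobConfig.
struct JobConfig {
  std::string ray_namespace;
};

// Hypothetical holder for a lazily initialized config.
class LazyJobConfig {
 public:
  bool IsSet() const { return config_.has_value(); }

  // Stores the config on the first call only; later calls are ignored.
  void SetOnce(JobConfig config) {
    if (!config_.has_value()) {
      config_ = std::move(config);
    }
  }

  // Returns a copy of the config, or std::nullopt if nothing is bound yet.
  std::optional<JobConfig> Get() const { return config_; }

 private:
  std::optional<JobConfig> config_;
};

// Usage sketch: an empty holder differs from one holding a default config.
inline void ExampleOptional() {
  LazyJobConfig holder;
  assert(!holder.IsSet());
  holder.SetOnce(JobConfig{});
  assert(holder.IsSet());
  assert(holder.Get()->ray_namespace.empty());
}
```

With a plain default-constructed member, the last two states in the usage sketch would be indistinguishable, which is the trade-off the reviewers are pointing at.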

Contributor Author

scv119 commented Jan 6, 2023

thanks for reviewing!

kicking off benchmark here: https://buildkite.com/ray-project/release-tests-pr/builds/24753

Contributor

rkooo567 left a comment

This PR doesn't touch the worker pool code. In that case, are the workers started by each job still considered to belong to that job?

Also, I am curious about the behavior changes. Previously,

  1. When the worker starts, it belongs to the job
  2. When the job terminates the workers are killed.

With this change, how are these semantics changed?

Comment thread src/ray/core_worker/context.cc Outdated
job_config_ = job_config;
}
RAY_CHECK(current_job_id_ == job_id);
RAY_CHECK(google::protobuf::util::MessageDifferencer::Equals(job_config_, job_config_));
Contributor

Is this necessary? Is the overhead of MessageDifferencer::Equals small?

return current_job_id_;
}

rpc::JobConfig WorkerContext::GetCurrentJobConfig() const {
Contributor

const reference?

Contributor Author

Accessing a reference to state guarded by a critical section, after that section has been left, is a data race and yields undefined behavior.
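
To illustrate the point about references: a getter that returns a reference hands the caller a view of the guarded member that outlives the lock, while a getter that returns by value copies the state before the lock is released. A minimal sketch with a hypothetical GuardedConfig class, where std::string stands in for rpc::JobConfig:

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical holder for a value guarded by a mutex.
class GuardedConfig {
 public:
  void Set(const std::string &value) {
    std::lock_guard<std::mutex> lock(mutex_);
    value_ = value;
  }

  // Copies the value while the lock is held, so the caller gets a stable
  // snapshot. A `const std::string &Get() const` variant would release the
  // lock on return and leave the caller reading value_ unprotected.
  std::string Get() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return value_;
  }

 private:
  mutable std::mutex mutex_;
  std::string value_;
};

// Usage sketch: a snapshot is unaffected by later writes.
inline void ExampleSnapshot() {
  GuardedConfig config;
  config.Set("a");
  std::string snapshot = config.Get();
  config.Set("b");
  assert(snapshot == "a");
  assert(config.Get() == "b");
}
```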

Comment thread src/ray/core_worker/context.h
Comment thread src/ray/core_worker/core_worker.cc
Comment thread src/ray/core_worker/core_worker.cc
Comment thread src/ray/core_worker/core_worker.cc Outdated

if (options_.worker_type == WorkerType::DRIVER &&
!options_.serialized_job_config.empty()) {
// Driver populates job_config through worker startup options.
Contributor

IIUC, the driver is not started with worker startup options?

Maybe it should be "The driver populates the job config via initialization. Workers populate it when the first task is received"?

rkooo567 added the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) Jan 6, 2023
Contributor

rkooo567 left a comment

Requesting changes since Cade already approved it.

scv119 merged commit 302a7e5 into ray-project:master Jan 12, 2023
MisterLin1995 mentioned this pull request Jan 12, 2023
7 tasks
scv119 pushed a commit that referenced this pull request Jan 12, 2023
Fix and reopen java tests closed in #31375

Co-authored-by: Marcus Zhang <zxl265370@antgroup.com>
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
)

AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
Fix and reopen java tests closed in #31375

Co-authored-by: Marcus Zhang <zxl265370@antgroup.com>
Labels

@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.

5 participants