[Serve][Deployment Graph][Perf] Add minimal executor DAGNode #24754
edoakes merged 21 commits into ray-project:master
Conversation
Tests passed for me on Python 3.7; I think the CI tests are Python 3.6 only. I will take a closer look tomorrow.
edoakes
left a comment
@jiaodong the high-level approach makes sense to me (having separate runtime implementations for execution), but I'm a bit confused by the scope of what these node types are doing. It seems like there should only be a very minimal path that we can take in execution (essentially only calling function nodes or deployment method nodes) -- what do we need the deployment executor node for?
```python
from ray.serve.pipeline.deployment_node import DeploymentNode
from ray.serve.pipeline.deployment_method_node import DeploymentMethodNode
from ray.serve.pipeline.deployment_function_node import DeploymentFunctionNode
from ray.serve.deployment_executor_node import DeploymentExecutorNode
from ray.serve.deployment_method_executor_node import (
    DeploymentMethodExecutorNode,
)
from ray.serve.deployment_function_executor_node import (
    DeploymentFunctionExecutorNode,
)
```
It seems bad to me that we need to register these in the core DAG code... if an external party wanted to add a DAG implementation, would they also need to do this? Is there a way we can define an interface to register these instead of hard-coding them? cc @ericl if you have thoughts
```python
    - Meanwhile, __init__, _copy_impl and _execute_impl are on the critical
      path of execution for every request.
```
why is _copy_impl on the critical path for execution?
Also, as an aside, these should not be underscore-prefixed if they're part of the public interface that must be implemented to make a functional DAG node.
This is because upon executing the DAG, we need to recursively find and replace each node with its corresponding _execute_impl output value. Then in https://sourcegraph.com/github.com/ray-project/ray/-/blob/python/ray/experimental/dag/dag_node.py?L174 we return a new instance with the replaced args/kwargs.
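The "replace children, then return a new instance" step described here can be sketched roughly as follows (a minimal illustration with made-up names, not Ray's actual DAGNode implementation):

```python
# Minimal sketch of "return a new instance with replaced args/kwargs".
# NodeSketch is a made-up stand-in for Ray's DAGNode; not the real API.

class NodeSketch:
    def __init__(self, args):
        self._args = args  # may contain other NodeSketch children

    def _copy_impl(self, new_args):
        # Build a fresh node whose children have been replaced by
        # their already-computed execution results.
        return NodeSketch(new_args)

child = NodeSketch([])
parent = NodeSketch([child, "plain-value"])

# Pretend the child executed and produced "child-output"; the parent
# is then re-created with that value substituted in.
replaced = parent._copy_impl(["child-output", "plain-value"])
assert replaced._args == ["child-output", "plain-value"]
```

This re-creation on every execution is exactly the per-request overhead the PR is trying to minimize.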
The first part of what you said makes sense, but the second part I don't quite understand. Shouldn't we only need to call _execute_impl in the execution path? Why do we need to "return a new instance with replaced args/kwargs"?
```python
def _execute_impl(self, *args, **kwargs):
    """Does not call into anything or produce a new value, as by the time
    this function gets called, all child nodes are already resolved to
    ObjectRefs.
    """
    return self._deployment_handle
```
I'm confused about what _execute_impl does... should this just be calling the deployment handle? I thought the contract of _execute_impl was that it always returns an ObjectRef?
Actually, is it even valid for this to be executed directly? I think we should only have deployment method nodes at time of execution, right?
We need to have a node that captures class/deployment creation. For the Ray Core DAG, this means creating a new actor; for Serve's DeploymentNode, we defer the work to the subsequent Deployment objects that get deployed in serve.run().
So in the minimal executor DAG version of DeploymentNode, it should do literally nothing but return the handle, but it's also necessary because we need to support users passing a DeploymentNode into another function or class constructor.
A simpler way to understand this function is something like:

```python
@serve.deployment
class Combine:
    def __init__(self, model):
        self.model = model
```

where `model` here is a DeploymentNode, and we swap it with a DeploymentExecutorNode that, upon execution, returns the handle to `model` for the init args.
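The swap described above can be sketched like this (illustrative stand-in classes, not the actual Serve implementation):

```python
# Sketch: at execution time the node standing in for `model` resolves
# to a handle, which is what Combine's __init__ actually receives.
# DeploymentExecutorSketch is a made-up name, not Serve's real class.

class DeploymentExecutorSketch:
    def __init__(self, handle):
        self._handle = handle

    def _execute_impl(self):
        # "Does nothing" but return the handle, so it can be passed
        # into another deployment's constructor as a resolved value.
        return self._handle

class Combine:
    def __init__(self, model):
        self.model = model

handle = object()  # stand-in for a ServeHandle
node = DeploymentExecutorSketch(handle)

# During DAG execution, the node is resolved before Combine is built.
combine = Combine(node._execute_impl())
assert combine.model is handle
```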
Why is this "upon execution"? Shouldn't this just happen in the initial transformation step where we go from DeploymentNode -> Handle? Not understanding why the intermediate transformation to DeploymentExecutorNode is necessary.
My mental model of how DAG execution works is to start from the output node, recursively call the _execute_impl of its dependencies, and replace the dependencies with the result of the call.
For example, a MethodNode first calls its parent DeploymentNode's _execute_impl, which returns the handle; the MethodNode's parent_node then becomes the handle object, so the method node's impl can be something like self.parent_node.method_name.remote().
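That mental model can be sketched as a minimal recursive evaluator (all names here are illustrative; this is not Ray's actual DAGNode code):

```python
# Minimal sketch of the recursive DAG execution model described above:
# resolve dependencies bottom-up, then call this node's _execute_impl
# with the resolved values.

class SimpleNode:
    def __init__(self, args):
        self._args = args  # may contain other SimpleNodes

    def _execute_impl(self, *resolved_args):
        raise NotImplementedError

def execute(node):
    if not isinstance(node, SimpleNode):
        return node  # plain values pass through unchanged
    # Recursively resolve children first, then execute this node.
    resolved = [execute(a) for a in node._args]
    return node._execute_impl(*resolved)

class AddNode(SimpleNode):
    def _execute_impl(self, x, y):
        return x + y

inner = AddNode((1, 2))       # resolves to 3
outer = AddNode((inner, 3))   # resolves to 3 + 3
assert execute(outer) == 6
```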
```python
        new_kwargs,
    )

    def _execute_impl(self, *args, **kwargs):
```

```python
        new_kwargs: Dict[str, Any],
        new_options: Dict[str, Any],
        new_other_args_to_resolve: Dict[str, Any],
    ):
```

```python
        self._deployment_handle = deployment_handle
        super().__init__(dag_args, dag_kwargs, {}, {})

    def _copy_impl(
```
what is _copy_impl used for?
```python
method_body = getattr(
    self._deployment_node_replaced_by_handle, self._deployment_method_name
)
```
why not put this in the constructor?
Updated -- the value of self._deployment_node_replaced_by_handle is not a deployment handle until execution time, so getting the deployment method handle in the constructor won't work. From the benchmark it makes a negligible perf difference.
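The timing constraint described here (getattr must wait until the parent has resolved to a handle) can be illustrated with a small sketch; FakeHandle and MethodExecutorSketch are made-up names, not Serve's real classes:

```python
# Sketch of why the method lookup is deferred to execution time:
# at construction time the parent is still a DAG node, not a handle,
# so the method doesn't exist yet.

class FakeHandle:
    def predict(self, x):
        return x * 2

class MethodExecutorSketch:
    def __init__(self, method_name):
        # Only the method name can be recorded at construction time.
        self._method_name = method_name
        self._resolved_handle = None

    def resolve(self, handle):
        # Called at execution time, once the parent node produced a handle.
        self._resolved_handle = handle

    def execute(self, *args):
        # getattr only works now that the handle is resolved.
        method_body = getattr(self._resolved_handle, self._method_name)
        return method_body(*args)

node = MethodExecutorSketch("predict")
node.resolve(FakeHandle())  # execution time: parent resolved to a handle
assert node.execute(21) == 42
```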
```python
def __getattr__(self, method_name: str):
    return DeploymentMethodExecutorNode(
        method_name,
        (),
        {},
        other_args_to_resolve={
            PARENT_CLASS_NODE_KEY: self,
        },
    )
```
Is this valid? I don't think we should ever be calling getattr on this at execution time, right? It should already be replaced with a deployment method node.
This method is actually only called at our own transformation time, because we need to interpret deployment.method nodes; at execution time the only methods that matter are _execute_impl and _copy_impl.
I do think this PR reveals a bigger issue: we don't have an architecture/design doc for how the Ray DAG works -- what _execute_impl, _copy_impl, and _other_args_to_resolve are all about. This makes it hard for people to contribute to and review the relevant code.
simon-mo
left a comment
The new code looks fine; more questions about deleting the old code.
```
@@ -67,6 +69,19 @@ def build(ray_dag_root_node: DAGNode) -> List[Deployment]:
        lambda node: transform_ray_dag_to_serve_dag(node, node_name_generator)
```
Now I'm wondering: can we go directly from ClassNode/FunctionNode to these executor nodes? The transform process can spit out:
- A list of deployment objects
- A minimal DAG for execution

So we can delete all the existing Deployment{, Function, Method}Node implementations.
Yes, this is exactly my thinking. I think we only need to do the following:
- Collect the list of deployments that need to be deployed (w/ args, options).
- Transform ClassNode -> ServeHandle.
- Transform {ClassMethodNode, FunctionNode} -> DAGExecutorNode.
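The simplified single-pass transform proposed above could look roughly like this (a hypothetical sketch; all the `*Sketch` class names are made up, not the Ray Serve API):

```python
# Hypothetical sketch of a single transform pass that both collects
# the deployments to deploy and maps each DAG node type to its
# minimal execution-time form. All names are illustrative.

class ClassNodeSketch:
    def __init__(self, name):
        self.name = name

class MethodNodeSketch:
    def __init__(self, parent, method):
        self.parent, self.method = parent, method

class HandleSketch:
    def __init__(self, name):
        self.name = name

class ExecutorNodeSketch:
    def __init__(self, handle, method):
        self.handle, self.method = handle, method

def transform(node, deployments):
    if isinstance(node, ClassNodeSketch):
        # 1) record the deployment that must be deployed
        deployments.append(node.name)
        # 2) ClassNode -> handle
        return HandleSketch(node.name)
    if isinstance(node, MethodNodeSketch):
        # 3) MethodNode -> minimal executor node over the handle
        handle = transform(node.parent, deployments)
        return ExecutorNodeSketch(handle, node.method)
    return node

deployments = []
dag = MethodNodeSketch(ClassNodeSketch("Model"), "predict")
out = transform(dag, deployments)
assert deployments == ["Model"]
assert isinstance(out, ExecutorNodeSketch) and out.method == "predict"
```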
We eventually should be able to. In April I spent about a week trying it, but it has a bit more magic regarding how we deal with deployment and DAG handles as deployment args in our 1.12 sprint... so this PR is more about adding the final ideal state with the perf wins; then we can work backwards to simplify what happened in the first pass.
The ideal case is to split up the responsibilities of the first pass from ray_dag -> serve_dag. We need to use the DAG traversal mechanism to do both 1) map Ray DAGNode types to Serve DAGNode types and 2) generate the list of Deployment objects. This part is currently messy as it mixes both things in one pass, under one node type, so slowing down execution is the obvious user-facing issue, but a clear abstraction is the other missing piece. Ideally I think it should be structured as:
We have 1-3 working now; this PR adds 4) to optimize user-facing perf. But 2) and 3) have tech debt that we can split into separate steps to remove unnecessary code.
Per offline conversation, this PR can be merged as long as tests pass and the small nits are resolved.
```python
@pytest.mark.asyncio
@pytest.mark.skip(reason="async handle not enabled yet")
```
I thought async handle needs to be implemented for performance?
Yes, but not in this PR yet.
```
- def test_deploment_options_func_class_with_class_method():
+ def test_deploment_options_func_class_with_class_method(serve_instance):
```
This shouldn't be required if the minimal executor node uses a lazy handle.
Oh, there was one commit where I removed the lazy handle, but I later realized we still need it to run the DAG without a Serve instance. Let me change this back.
simon-mo
left a comment
LGTM.
Please make sure to follow up with
- Async handle changes
- Cleanup changes
I'm ok with merging this for now; let's sync up about the end state we want to work towards -- I think things are much more complex than they need to be right now.
Sounds good, let me put up an internal onboarding-style doc for the DAG stuff in general, and we can see with more clarity how we map Ray nodes to deployment graph nodes.
Why are these changes needed?
closes #24475
The current deployment graph has big perf issues compared with using a plain deployment handle, mostly because of the overhead of the DAGNode traversal mechanism. We need this mechanism to power the DAG API, especially deeply nested objects in args where we rely on pickling; but meanwhile each execution ends up re-creating and replacing every DAGNode instance involved, which incurs overhead. Some overhead is inevitable due to pickling and executing DAGNode Python code, but it could be quite minimal. As I profiled earlier, pickling itself is quite fast for our benchmarks, at the magnitude of microseconds.
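As a rough sanity check of the microseconds claim, one can time round-tripping a small nested args structure through pickle (the data below is an illustrative stand-in, not the actual benchmark):

```python
import pickle
import timeit

# Illustrative stand-in for nested DAG args; not the actual benchmark data.
args = {"model": "resnet", "weights": list(range(100)), "nested": {"a": (1, 2)}}

n = 10_000
seconds = timeit.timeit(lambda: pickle.loads(pickle.dumps(args)), number=n)
per_call_us = seconds / n * 1e6  # microseconds per round trip

# Typically single-digit microseconds on modern hardware; assert a loose bound.
assert per_call_us < 10_000
```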
Meanwhile, the elephant in the room is that DeploymentNode and its relatives do far more work in their constructors than necessary, slowing everything down. So the fix is as simple as
The current ray dag -> serve dag transform mixes in a lot of stuff related to deployment generation and init args; in the longer term we should remove it, but our correctness depends on it, so I'd rather leave that as a separate PR.
Current 10-node chain with deployment graph .bind()

Using raw deployment handle without DAG overhead

After this PR:
Related issue number
Checks
scripts/format.sh to lint the changes in this PR.