[FLINK-10945] Use InputDependencyConstraint to avoid resource deadlocks for finite stream jobs when resources are limited #7255

Conversation
azagrebin left a comment:
Thanks for your contribution @zhuzhurk! I have left some comments.
Resolved review threads on:
- flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java
- ...ntime/src/main/java/org/apache/flink/runtime/executiongraph/IntermediateResultPartition.java (two threads)
- flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ExecutionVertex.java
- .../src/test/java/org/apache/flink/runtime/deployment/InputChannelDeploymentDescriptorTest.java
azagrebin left a comment:
Thanks @zhuzhurk! I also added a couple of comments for the tests.
		return null;
	},
	executor);
if (consumerVertex.checkInputDependencyConstraints()) {
I think the TODO comment belongs on the first line of the new scheduleConsumer method.
From my understanding, the TODO comment is related to the "consumerState == CREATED" section in scheduleOrUpdateConsumers, which invokes cachePartitionInfo first and then schedules the vertex. The cachePartitionInfo action is needed to avoid a deployment race, at the cost of sending redundant partition infos to the task, which is the concern described in the TODO comment.
So far the redundant partition info is not a big problem, but I think we can optimize it later. One possible solution in my mind is to remove known partition infos from the cache when creating the InputChannelDeploymentDescriptor.
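The optimization proposed in the reply above could look roughly like the following minimal model. All class and method names here are hypothetical stand-ins for illustration, not Flink's actual Execution/IntermediateResultPartition API.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: partition infos already covered by a freshly created
// deployment descriptor are dropped from the cache, so they are not
// redundantly sent to the task again as updates later.
class PartitionInfoCache {
    private final Set<String> cachedPartitionInfos = new LinkedHashSet<>();

    // Called when a partition info arrives before the consumer is deployed.
    void cachePartitionInfo(String partitionId) {
        cachedPartitionInfos.add(partitionId);
    }

    // Simplified stand-in for creating an InputChannelDeploymentDescriptor:
    // the descriptor covers the known partitions, so remove them from the cache.
    List<String> createDeploymentDescriptor(List<String> knownPartitions) {
        cachedPartitionInfos.removeAll(knownPartitions);
        return new ArrayList<>(knownPartitions);
    }

    // Partition infos that still have to be sent as updates after deployment.
    Set<String> pendingUpdates() {
        return cachedPartitionInfos;
    }
}
```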
Resolved review threads on:
- flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ExecutionVertex.java
- ...rc/test/java/org/apache/flink/runtime/executiongraph/ExecutionVertexInputConstraintTest.java (three threads)
- ...e/src/test/java/org/apache/flink/runtime/executiongraph/IntermediateResultPartitionTest.java
…endencyConstraint == ANY 2. Fixes for tests
azagrebin left a comment:
Thanks for addressing the comments @zhuzhurk
tillrohrmann left a comment:
Thanks for your contribution @zhuzhurk. The changes look very good. I had some minor comments which we could address before merging.
Resolved review thread on:
- flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ExecutionVertex.java
public boolean checkInputDependencyConstraints() {
	if (getExecutionGraph().getInputDependencyConstraint() == InputDependencyConstraint.ANY) {
		// InputDependencyConstraint == ANY
		return IntStream.range(0, inputEdges.length).anyMatch(this::isInputConsumable);
Moving isInputConsumable into IntermediateResult would allow getting rid of the indirection of the IntStream.
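The ANY/ALL constraint semantics being reviewed here can be illustrated with a minimal self-contained sketch. These simplified types are stand-ins for Flink's ExecutionVertex and its input edges, not the actual classes.

```java
import java.util.stream.IntStream;

// ANY: the vertex is schedulable as soon as at least one input result is
// consumable; ALL: every input result must be consumable first.
enum InputDependencyConstraint { ANY, ALL }

class VertexSketch {
    private final InputDependencyConstraint constraint;
    private final boolean[] inputConsumable; // one flag per input edge

    VertexSketch(InputDependencyConstraint constraint, boolean... inputConsumable) {
        this.constraint = constraint;
        this.inputConsumable = inputConsumable;
    }

    // Mirrors the shape of checkInputDependencyConstraints in the hunk above,
    // using anyMatch for ANY and allMatch for ALL.
    boolean checkInputDependencyConstraints() {
        if (constraint == InputDependencyConstraint.ANY) {
            return IntStream.range(0, inputConsumable.length).anyMatch(i -> inputConsumable[i]);
        } else {
            return IntStream.range(0, inputConsumable.length).allMatch(i -> inputConsumable[i]);
        }
    }
}
```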
 * @return whether the input constraint is satisfied
 */
public boolean checkInputDependencyConstraints() {
	if (getExecutionGraph().getInputDependencyConstraint() == InputDependencyConstraint.ANY) {
Shouldn't the InputDependencyConstraint rather be a value of the ExecutionJobVertex than of the ExecutionGraph? I guess it should be configurable for each operator individually.
I've moved InputDependencyConstraint to JobVertex. The job-wide default value can be configured in ExecutionConfig, but I haven't made it configurable through the DataSet/DataStream API yet.
I agree we should make the constraint configurable for each operator. But I'm not quite sure whether we should support it in the DataSet API now, or later in the stream/batch unified StreamGraph/Transformation API. Could you share your suggestion?
In our production experience, a job-wide configured input constraint satisfies most users, together with BATCH_FORCED execution mode, to ensure a batch job can finish with limited resources.
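The per-vertex setting with a job-wide fallback described above could be modeled as follows. This is a hypothetical sketch of the configuration split, not Flink's exact JobVertex/ExecutionConfig API; names are illustrative.

```java
// Hypothetical sketch: each job vertex may carry its own constraint, and an
// ExecutionConfig-style job-wide default applies when none is set on the vertex.
enum Constraint { ANY, ALL }

class JobVertexSketch {
    private Constraint constraint; // null means "use the job-wide default"

    void setInputDependencyConstraint(Constraint c) {
        this.constraint = c;
    }

    // Resolve the effective constraint for scheduling this vertex.
    Constraint resolveConstraint(Constraint jobWideDefault) {
        return constraint != null ? constraint : jobWideDefault;
    }
}
```

This keeps the common case (one job-wide setting, as in the production experience mentioned above) simple while still allowing a per-operator override.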
if (partition.getIntermediateResult().getResultType().isPipelined()) {
	// Schedule or update receivers of this partition
	partition.markSomePipelinedDataProduced();
Could this be generalized by having a partition#markDataProduced and then calling it for all intermediate results independent of the type?
Sure.
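The generalization agreed on above might look like this minimal sketch: markDataProduced is called for every partition regardless of result type, and the type only matters when deciding consumability. These are simplified stand-in types, not Flink's actual IntermediateResultPartition.

```java
// Hypothetical sketch: pipelined results become consumable as soon as some
// data is produced; blocking results only once the partition is finished.
enum ResultType { PIPELINED, BLOCKING }

class PartitionSketch {
    private final ResultType type;
    private boolean dataProduced;
    private boolean finished;

    PartitionSketch(ResultType type) {
        this.type = type;
    }

    // Called uniformly for all result types when data is produced.
    void markDataProduced() {
        dataProduced = true;
    }

    // Called when the producer finishes writing the partition.
    void markFinished() {
        finished = true;
    }

    // Result-type-specific consumability check.
    boolean isConsumable() {
        return type == ResultType.PIPELINED ? dataProduced : finished;
    }
}
```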
try {
	executionGraph.setInputDependencyConstraint(
		jobGraph.getSerializedExecutionConfig().deserializeValue(classLoader).getInputDependencyConstraint());
I understand that this was the easiest way how to get the InputDependencyConstraint into the ExecutionGraph but I think it should be part of the JobVertex, because it is not a global setting but rather controls how each vertex is scheduled.
Good suggestion. I've moved InputDependencyConstraint to JobVertex.
…gurable through API yet) 2. Refine input consumable checks
tillrohrmann left a comment:
Thanks for the fixup @zhuzhurk. My last comment is whether markDataProduced should not always be called in the scheduleOrUpdate method, independent of the result type. Once we have resolved this comment, the PR is ready to be merged.
Resolved review threads on:
- flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ExecutionVertex.java (three threads)
Thanks for addressing my comments @zhuzhurk. Looks really good now. Merging this PR now.
…locks for finite stream jobs when resources are limited This commit adds a job config InputDependencyConstraint, which helps to avoid resource deadlocks in LAZY_FROM_SOURCES scheduling when resources are limited. The InputDependencyConstraint controls across multiple inputs when consumers are scheduled. Currently it supports ANY and ALL. ANY means that any input intermediate result partition must be consumable and ALL means that all input intermediate result partitions (from all inputs) need to be consumable in order to schedule the consumer task. This closes apache#7255.
Thanks Andrey (@azagrebin) and Till (@tillrohrmann) for the review.
…locks in LAZY_FROM_SOURCES scheduling when resources are limited
What is the purpose of the change
This PR adds a job config InputDependencyConstraint, which helps to avoid resource deadlocks in LAZY_FROM_SOURCES scheduling when resources are limited, as described in FLINK-10945.
Brief change log
Verifying this change
This change added tests and can be verified as follows:
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes)
Documentation