SAMZA-1860: Modularize Join input validation in ExecutionPlanner #637
Conversation
This change breaks down the validation of partition counts of input and
intermediate streams participating in Join operations into 3 separate steps:
1. Grouping InputOperatorSpecs by the JoinOperatorSpecs of the Join
operations they participate in
2. Replacing InputOperatorSpecs with their corresponding StreamEdges
3. Verifying/Inferring partition counts of input/intermediate streams

This change covers stream-stream Joins only.
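A rough sketch of how the three steps fit together in ExecutionPlanner is below. Apart from validateJoinedStreamsGroupPartitions (which appears in the diff), the helper names and signatures are illustrative assumptions, not the exact code in this PR.

```java
// Sketch only; helper names other than validateJoinedStreamsGroupPartitions are assumed.
// Step 1: group InputOperatorSpecs by the JoinOperatorSpecs they feed into.
Multimap<JoinOperatorSpec, InputOperatorSpec> joinedInputOpSpecs =
    operatorSpecGraphAnalyzer.getJoinedInputOperatorSpecs(operatorSpecGraph);

// Step 2: replace each InputOperatorSpec with its corresponding StreamEdge,
// producing one JoinedStreamsGroup per Join operation.
List<JoinedStreamsGroup> joinedStreamsGroups = groupStreamEdgesByJoin(joinedInputOpSpecs, jobGraph);

// Step 3: verify agreement between joined input/intermediate streams,
// inferring partition counts of intermediate stream edges where needed.
joinedStreamsGroups.forEach(ExecutionPlanner::validateJoinedStreamsGroupPartitions);
```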
@bharathkk @vjagadish1989 Can you take a look at this?

cc @nickpan47, since this affects the ExecutionPlanner.
vjagadish1989 left a comment
Looks great, Ahmad! The planner is one of the complex pieces of the Samza code-base, thanks much for refactoring it!
// Verify agreement between joined input/intermediate streams.
// This may involve setting partition counts of intermediate stream edges.
joinedStreamsGroups.forEach(ExecutionPlanner::validateJoinedStreamsGroupPartitions);
It looks like validate is also "assigning" partition counts. Would it be cleaner to separate the computation of partition counts from their validation? As an example, the computation of partition counts depends on the order in which we process the StreamEdgeGroups, while validation may not.
Yes, validate is also assigning partition counts. It's a little difficult to separate the two operations at the moment. For groups containing only stream edges with known partition counts, we just need verification, which is the easy case. On the other hand, stream edge groups with a mix of set/unset partitions require a mix of verification and assignment.

I thought of deciding whether to do verification vs. assignment based on the JoinedStreamsGroupCategory of a JoinedStreamsGroup. The problems with that are:

- For a group with a mix of set/unset partitions, I could very easily need to do both verification and assignment within the same group, e.g. {e1 (8), e2 (?), e3 (8)}.
- JoinedStreamsGroupCategory is not really reliable once we start setting partition counts (which is another reason why StreamEdges are better off being immutable). For instance, by the time we process group #2 of {e1 (8), e2 (?)} and {e2 (?), e3 (8)}, it will only require verification even though its (stale) category will be SOME_PARTITION_COUNT_SET.

I think we can just change the verb from validate to something else that conveys the possibility of mutation. I'll try to come up with something, but I'm also open to suggestions.
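To make the mixed case concrete, here is a minimal self-contained sketch of verify-plus-assign over a group like {e1 (8), e2 (?), e3 (8)}. The Edge class and method names are hypothetical stand-ins, not the actual StreamEdge/ExecutionPlanner API:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for StreamEdge; -1 means "partition count not set yet".
class Edge {
  final String name;
  int partitions;
  Edge(String name, int partitions) { this.name = name; this.partitions = partitions; }
}

public class JoinGroupPartitionsSketch {
  // Verifies that edges with known partition counts agree, then assigns the
  // agreed count to edges whose partition count is still unset.
  static void verifyAndAssign(List<Edge> group) {
    int agreed = -1;
    for (Edge e : group) {
      if (e.partitions == -1) continue;           // unset; assigned below
      if (agreed == -1) agreed = e.partitions;    // first known count
      else if (agreed != e.partitions) {
        throw new IllegalStateException("Joined streams disagree on partition count");
      }
    }
    if (agreed == -1) {
      throw new IllegalStateException("No partition count known for this group yet");
    }
    for (Edge e : group) {
      if (e.partitions == -1) e.partitions = agreed; // the "assignment" part
    }
  }

  public static void main(String[] args) {
    // The {e1 (8), e2 (?), e3 (8)} example: e1/e3 are verified, e2 is assigned 8.
    List<Edge> group = Arrays.asList(new Edge("e1", 8), new Edge("e2", -1), new Edge("e3", 8));
    verifyAndAssign(group);
    group.forEach(e -> System.out.println(e.name + " -> " + e.partitions));
  }
}
```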
 * processing them in the above order (most constrained first) is guaranteed to
 * yield correct assignment of partition counts of e3 and e4 in a single scan.
 */
Collections.sort(joinedStreamsGroups, Comparator.comparing(JoinedStreamsGroup::getCategory));
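Since Comparator.comparing on an enum uses declaration order, this sort works when the JoinedStreamsGroupCategory constants are declared most constrained first, along the lines of the sketch below. Only SOME_PARTITION_COUNT_SET appears in the diff; the other constant names are assumptions:

```java
// Assumed declaration order: most constrained groups sort first, so the planner
// resolves fully-known groups before groups that still need counts inferred.
enum JoinedStreamsGroupCategory {
  ALL_PARTITION_COUNT_SET,   // every edge's partition count is already known
  SOME_PARTITION_COUNT_SET,  // mix of known and unknown counts
  NO_PARTITION_COUNT_SET     // no counts known yet
}
```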
Not necessarily in scope for this PR, but do you have a sense of what it would take to make StreamEdge immutable?
For example, it seems like the setPartitionCount method on StreamEdge could be replaced with a map of StreamEdge -> partitionCount.
I came to the same conclusion actually: making StreamEdge immutable can greatly improve the ExecutionPlanner. I didn't want to do it in this series of PRs to avoid scope creep though.
I think one idea we can explore is making partition count a ctor param and a readonly property of a StreamEdge. This would require deferring the creation of any StreamEdge until its partition count is known, which is probably not going to be difficult after this PR.
I'll take note of this and send a follow-up PR later on.
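A minimal sketch of the constructor-parameter idea; the class and field names are illustrative, not the actual StreamEdge:

```java
// Illustrative only: an edge whose partition count is fixed at construction,
// so there is no setPartitionCount and no partially-initialized state.
final class ImmutableStreamEdge {
  private final String streamId;
  private final int partitionCount;

  ImmutableStreamEdge(String streamId, int partitionCount) {
    if (partitionCount <= 0) {
      throw new IllegalArgumentException("Partition count must be known and positive");
    }
    this.streamId = streamId;
    this.partitionCount = partitionCount;
  }

  String getStreamId() { return streamId; }
  int getPartitionCount() { return partitionCount; }
}
```

Until then, the interim suggestion of tracking counts in a Map<StreamEdge, Integer> during planning would at least keep the mutation out of StreamEdge itself.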
} else {
  category = JoinedStreamsGroupCategory.SOME_PARTITION_COUNT_SET;
}
The notion of a "category" looks like a detail of the JoinedStreamsGroup, which can be inferred from the other params. What do you think about moving the logic that determines "category" into the constructor of JoinedStreamsGroup?
A nice property is that we could avoid inconsistent object states. For example, with the current constructor JoinedStreamsGroup(groupId, streamEdges, category), one could create an instance of JoinedStreamsGroup whose streamEdges and category contradict each other.
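A sketch of that suggestion, deriving the category from the edges at construction time; the field names, the getPartitionCount accessor, and the ALL_/NO_ constants are assumptions for illustration:

```java
// Illustrative constructor: category is computed from streamEdges, so the two
// can never contradict each other at construction time.
JoinedStreamsGroup(String groupId, Set<StreamEdge> streamEdges) {
  this.groupId = groupId;
  this.streamEdges = streamEdges;
  long edgesWithCounts = streamEdges.stream()
      .filter(e -> e.getPartitionCount() > 0)
      .count();
  if (edgesWithCounts == streamEdges.size()) {
    this.category = JoinedStreamsGroupCategory.ALL_PARTITION_COUNT_SET;
  } else if (edgesWithCounts == 0) {
    this.category = JoinedStreamsGroupCategory.NO_PARTITION_COUNT_SET;
  } else {
    this.category = JoinedStreamsGroupCategory.SOME_PARTITION_COUNT_SET;
  }
}
```

As the response below notes, this trades an extra pass over the StreamEdges and still cannot prevent later inconsistency while StreamEdges remain mutable.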
I did consider this and I would have really liked to do so. However:

- Deciding the category in JoinedStreamsGroup requires one more (redundant) iteration over the StreamEdges.
- More importantly, there is no way to avoid inconsistent object states in JoinedStreamsGroup because StreamEdges are mutable. In fact, we already throw every JoinedStreamsGroup with intermediate streams into this inconsistent state once we start setting partition counts. This made me less inclined to incur the overhead of the extra iteration in #1.

Together, these 2 points made me prefer to keep JoinedStreamsGroup as a passive data object that contains zero logic. I thought this would help set readers' expectations that this object is just a way of organizing data w/o maintaining any invariants.
 */
private static <T> void traverse(T vertex, Consumer<T> visitor, Function<T, Iterable<? extends T>> getNextVertexes) {
  visitor.accept(vertex);
  for (T nextVertex : getNextVertexes.apply(vertex)) {
Question: could you have cycles anywhere in the traversal? If so, should this method guard against that? Alternatively, if the visitors are expected to track and avoid cycles, it would be worth documenting that.
I don't think there could be cycles in the OperatorSpecGraph, and if there could be then we never handled them.
A visitor's responsibility is strictly dictated by getNextVertexes, and since both are user-supplied, it's all up to the user. There are no general requirements on visitors.
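For illustration, here is how a caller could compose the helper, with a visited set inside getNextVertexes acting as a cycle guard if one were ever needed. The lambda bodies and the getRegisteredOperatorSpecs accessor are assumptions, not a prescription from this PR:

```java
// Illustrative caller-side use of traverse(vertex, visitor, getNextVertexes).
// The visited set makes getNextVertexes skip vertexes it has already expanded,
// which is one way a user-supplied function could guard against cycles.
Set<OperatorSpec> visited = new HashSet<>();
traverse(
    inputOpSpec,
    opSpec -> { /* e.g. record join <-> input associations here */ },
    opSpec -> opSpec.getRegisteredOperatorSpecs().stream()
        .filter(visited::add)
        .collect(Collectors.toList()));
```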
bharathkk left a comment
Thanks a lot for putting this together. It is definitely looking much better.
 * A utility class that encapsulates the logic for traversing an {@link OperatorSpecGraph} and building
 * associations between related {@link OperatorSpec}s.
 */
/* package private */ class OperatorSpecGraphAnalyzer {
Here are my thoughts on this class. I feel we don't have a pressing need for it to be generified yet. Can we start with a simple helper class that does the traversal and returns a mapping in one go? That way:
- we are still isolating the traversal logic
- it simplifies the review
- it simplifies testing

We can always refactor to extract the traversal part if we plan to introduce more visitors. Thoughts?
@bharathkk : We'll likely introduce more visitors for computing partition-counts for StreamTableJoin and side-inputs. Once we have those follow-up PRs, we can decide if the current implementation is overly general. If that is indeed the case, we can certainly revisit it.
Until then, it's probably efficient to leave this PR in its current state. What do you think?
If we're only doing stream-stream join then I agree. The reason I wrote it this way, though, is that I have another PR that will add one more visitor in order to support stream-table joins. I wanted to lay the groundwork for the upcoming change since I have already finished coding it.
@bharathkk @vjagadish1989 Would you prefer I send you the follow-up PRs now or is this PR good to go?
this PR should be good to go.
@ahmedahamid sorry to be late to the party. One comment for your future PR: OperatorSpecGraph is only available for high-level APIs. In the effort to unify the runtime support for both high-level and low-level APIs, ExecutionPlanner and the corresponding JobGraph/JobNode classes now only have access to ApplicationDescriptorImpl. Any need to traverse the graph should now start from ApplicationDescriptorImpl.getInputOperators() (see PR #642).
No worries. All the changes I have been making to the ExecutionPlanner only rely on InputOperatorSpecs.
approved
merged and submitted!
Set<StreamEdge> streamEdges = new HashSet<>();

for (InputOperatorSpec inputOpSpec : inputOpSpecs) {
  StreamEdge streamEdge = jobGraph.getOrCreateStreamEdge(getStreamSpec(inputOpSpec.getStreamId(), streamConfig));
Just for the record, I have a strong concern here that we are potentially modifying the StreamEdge in jobGraph in a method called validateJoinInputStreamPartitions(). This also breaks the abstraction that createJobGraph() should have already created the StreamEdges and JobNodes needed based on a traversal of the operator DAG. Are we saying that even after createJobGraph() is called, JobGraph can be missing some StreamEdges? That should not be the case, since each StreamEdge should correspond to an explicit partitionBy() operator, or be explicitly defined as an input/output stream. Let's sync up on the purpose and use case of this code.
Let's take this comment to a separate PR. I realized that this is an existing pattern in ExecutionPlanner/JobGraph. Ideally, get/create StreamEdge from JobGraph should be separate methods; the get should be read-only.
Certainly agree. This call site is actually using getOrCreateStreamEdge to retrieve existing StreamEdges, not create new ones.
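A sketch of the split discussed above, assuming JobGraph keeps its edges in an internal map keyed by stream id; the field names and the StreamEdge constructor call are hypothetical:

```java
// Hypothetical split of getOrCreateStreamEdge into an explicit create and a read-only lookup.
StreamEdge createStreamEdge(StreamSpec streamSpec) {
  return edges.computeIfAbsent(streamSpec.getId(), id -> new StreamEdge(streamSpec, config));
}

// Read-only lookup: callers such as validateJoinInputStreamPartitions() would use
// this and fail fast instead of silently adding edges after createJobGraph() has run.
StreamEdge getStreamEdge(String streamId) {
  StreamEdge edge = edges.get(streamId);
  if (edge == null) {
    throw new IllegalStateException("No StreamEdge was created for stream: " + streamId);
  }
  return edge;
}
```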