KAFKA-9739: Fixes null key changing child node by bbejeck · Pull Request #8400 · apache/kafka

bbejeck · 2020-04-01T00:03:50Z

For some context: when building a Streams application, the optimizer keeps track of the key-changing operations and of any repartition nodes that are descendants of the key-changer. During the optimization phase (if enabled), those repartition nodes are logically collapsed into one. The optimizer updates the graph by inserting the single repartition node between the key-changing node and its first child node. To find that child, it searches for a node that has the key-changing node as one of its direct parents, starting from the repartition node and walking up the parent hierarchy.
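To make the upward search concrete, here is a minimal, self-contained sketch. It is not the Kafka Streams implementation: `Node`, `addChild`, and `findChildOf` are hypothetical stand-ins invented for illustration, and the graph is a toy one.

```java
import java.util.ArrayList;
import java.util.List;

final class KeyChangingChildSearch {

    /** Hypothetical stand-in for a topology graph node; not the Kafka Streams class. */
    static final class Node {
        final String name;
        final List<Node> parents = new ArrayList<>();

        Node(final String name) {
            this.name = name;
        }

        /** Links child under this node and returns the child so calls can be chained. */
        Node addChild(final Node child) {
            child.parents.add(this);
            return child;
        }
    }

    /**
     * Starting at start and walking up through the parents, returns the first
     * node that has target as a direct parent, or null if no such node exists.
     */
    static Node findChildOf(final Node target, final Node start) {
        if (start.parents.contains(target)) {
            return start;
        }
        for (final Node parent : start.parents) {
            final Node found = findChildOf(target, parent);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    public static void main(final String[] args) {
        // source -> selectKey (key changer) -> filter -> repartition
        final Node source = new Node("source");
        final Node keyChanger = source.addChild(new Node("selectKey"));
        final Node filter = keyChanger.addChild(new Node("filter"));
        final Node repartition = filter.addChild(new Node("repartition"));

        // The collapsed repartition node would be inserted between selectKey and this node.
        System.out.println(findChildOf(keyChanger, repartition).name); // prints "filter"
    }
}
```

In this toy graph the search starts at `repartition` and stops at `filter`, because `filter` is the node whose direct parent is the key changer.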

The one exception to this rule is a merge node that is a descendant of the key-changing node. In that case, during the optimization phase, the map tracking key-changers to repartition nodes is updated to use the merge node as the key. The optimization process then updates the graph to place the single repartition node between the merge node and its first child node.

The error in KAFKA-9739 occurred because the code assumed that the repartition nodes are always children of the merge node. In the topology from KAFKA-9739, however, the repartition node was a parent of the merge node. So the search for the first child of the merge node found nothing, resulting in StreamsException(Found a null keyChangingChild node for..)

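This PR fixes the bug by first checking that all repartition nodes marked for optimization are children of the merge node. Only then is the optimization map updated to use the merge node instead of the key-changing node.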
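The guard can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: `Node`, `hasAncestor`, `allBelowMerge`, and `chooseOptimizationKey` are hypothetical names, "children" is read as "reachable below the merge node", and the two toy graphs only mimic the shapes described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

final class MergeNodeGuardSketch {

    /** Hypothetical stand-in for a topology graph node; not the Kafka Streams class. */
    static final class Node {
        final String name;
        final List<Node> parents = new ArrayList<>();

        Node(final String name) {
            this.name = name;
        }

        /** Links child under this node and returns the child so calls can be chained. */
        Node addChild(final Node child) {
            child.parents.add(this);
            return child;
        }
    }

    /** True if ancestor is reachable from node by following parent links. */
    static boolean hasAncestor(final Node node, final Node ancestor) {
        for (final Node parent : node.parents) {
            if (parent == ancestor || hasAncestor(parent, ancestor)) {
                return true;
            }
        }
        return false;
    }

    /** The guard: every repartition node must sit below the merge node. */
    static boolean allBelowMerge(final Node mergeNode, final Collection<Node> repartitionNodes) {
        return repartitionNodes.stream().allMatch(r -> hasAncestor(r, mergeNode));
    }

    /** Re-key the optimization entry to the merge node only when the guard holds. */
    static Node chooseOptimizationKey(final Node keyChanger,
                                      final Node mergeNode,
                                      final Collection<Node> repartitionNodes) {
        return allBelowMerge(mergeNode, repartitionNodes) ? mergeNode : keyChanger;
    }

    public static void main(final String[] args) {
        // Shape A: key changer -> merge -> two repartition nodes.
        final Node keyChangerA = new Node("source-1").addChild(new Node("selectKey-A"));
        final Node mergeA = keyChangerA.addChild(new Node("merge-A"));
        new Node("source-2").addChild(mergeA);
        final Node repartitionA1 = mergeA.addChild(new Node("repartition-A1"));
        final Node repartitionA2 = mergeA.addChild(new Node("repartition-A2"));
        System.out.println(chooseOptimizationKey(
            keyChangerA, mergeA, Arrays.asList(repartitionA1, repartitionA2)).name); // prints "merge-A"

        // Shape B (as in KAFKA-9739): key changer -> repartition -> aggregate -> merge.
        final Node keyChangerB = new Node("source-3").addChild(new Node("selectKey-B"));
        final Node repartitionB = keyChangerB.addChild(new Node("repartition-B"));
        final Node mergeB = repartitionB.addChild(new Node("aggregate-B")).addChild(new Node("merge-B"));
        new Node("source-4").addChild(mergeB);
        System.out.println(chooseOptimizationKey(
            keyChangerB, mergeB, Arrays.asList(repartitionB)).name); // prints "selectKey-B"
    }
}
```

In shape B the repartition node is upstream of the merge node, so the guard fails and the entry stays keyed by the key-changing node, which avoids the null child lookup.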

This PR includes a test with the topology from KAFKA-9739.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

bbejeck · 2020-04-01T00:05:32Z

An excellent example of this in action is https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/processor/internals/RepartitionWithMergeOptimizingTest.java.

Here's the un-optimized topology

And the optimized one

bbejeck · 2020-04-01T00:11:05Z

ping @guozhangwang, @mjsax, and @vvcephei

guozhangwang · 2020-04-01T00:13:14Z

Thanks @bbejeck ! also cc @ableegoldman @cadonna to take a look as well.

bbejeck · 2020-04-01T13:23:24Z

-                    mergeNodesToKeyChangers.get(mergeNode).add(key);
+            final Set<Map.Entry<StreamsGraphNode, LinkedHashSet<OptimizableRepartitionNode<?, ?>>>> entrySet = keyChangingOperationsToOptimizableRepartitionNodes.entrySet();
+            for (final Map.Entry<StreamsGraphNode, LinkedHashSet<OptimizableRepartitionNode<?, ?>>> entry : entrySet) {
+                if (mergeNodeHasRepartitionChildren(mergeNode, entry.getValue())) {


This is the fix

bbejeck · 2020-04-01T14:02:27Z

Java 11 failed with kafka.api.ConsumerBounceTest.testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup; I updated the existing Jira ticket.

Java 8 failed with org.apache.kafka.streams.integration.QueryableStateIntegrationTest.shouldAllowConcurrentAccesses; I created a new Jira ticket for this.

retest this please.

bbejeck · 2020-04-01T16:38:50Z

Java 8 failed with kafka.api.PlaintextProducerSendTest.testNonBlockingProducer; I've updated the Jira ticket.

Java 11 passed

retest this please.

vvcephei

Thanks for the lucid PR in response to a truly mind-bending bug. The explanation sounds right to me, and the code looks right. The test looks good, too.

Thanks!

bbejeck · 2020-04-03T16:05:16Z

Merged #8400 into trunk.

vvcephei · 2020-04-03T16:09:34Z

Thanks, @bbejeck !

A port of #8400 for 2.3. The process of sorting source and sink nodes changed in 2.4, so we can't cherry-pick the PR directly as we need to update the expected topology to what it would be in the 2.3 version. Reviewers: John Roesler <john@confluent.io>, Andrew Choi <a24choi@edu.uwaterloo.ca>

This is a port of #8400 for the 2.5 branch. Reviewers: John Roesler <john@confluent.io>

bbejeck added 4 commits March 29, 2020 16:50

KAFKA-9739: When optimizing with merge nodes, the merge node must be the parent of at least one repartition topics to be optimized.

c42e849

KAFKA-9739: Clean up from cherry-pick, update naming since original work was done on 2.4 branch and the naming conventions for repartition topics has changed.

a7a18ee

KAFKA-9739: Add comment about topology remove println statement

96c3b65

KAFKA-9739: Change to all repartition topics need to be children of the merge node to update the optimization map with the merge node vs. the key-changing node.

238943c

bbejeck added the streams label Apr 1, 2020

vvcephei approved these changes Apr 2, 2020

andrewchoi5 approved these changes Apr 3, 2020

bbejeck merged commit 9783b85 into apache:trunk Apr 3, 2020

bbejeck deleted the KAFKA-9739_trunk_branch_null_keyChangingChildNode branch April 3, 2020 16:05

bbejeck mentioned this pull request Apr 3, 2020

KAFKA-9739: Fixes null key changing child node #8416

Merged

3 tasks

bbejeck added a commit that referenced this pull request Apr 3, 2020

KAFKA-9739: Fixes null key changing child node (#8416)

9c91e05

2.4 port of #8400 since cherry-picking not possible Reviewers: John Roesler <john@confluent.io>

bbejeck added a commit to bbejeck/kafka that referenced this pull request Apr 3, 2020

KAFKA-9739: Fixes null key changing child node (apache#8416)

e172ac9

2.4 port of apache#8400 since cherry-picking not possible Reviewers: John Roesler <john@confluent.io>

bbejeck mentioned this pull request Apr 3, 2020

KAFKA-9739: 2.3 null child node fix #8419

Merged

3 tasks

bbejeck mentioned this pull request Apr 15, 2020

KAFKA-9739: Fix for 2.5 branch #8492

Merged

3 tasks

guozhangwang mentioned this pull request May 7, 2020

MINOR: Log4j Improvements on Fetcher #8629

Merged

3 tasks

qq619618919 pushed a commit to qq619618919/kafka that referenced this pull request May 12, 2020

KAFKA-9739: Fixes null key changing child node (apache#8416)

95b933d

2.4 port of apache#8400 since cherry-picking not possible Reviewers: John Roesler <john@confluent.io>
