Skip to content

KAFKA-9739: Fixes null key changing child node#8400

Merged
bbejeck merged 4 commits intoapache:trunkfrom
bbejeck:KAFKA-9739_trunk_branch_null_keyChangingChildNode
Apr 3, 2020
Merged

KAFKA-9739: Fixes null key changing child node#8400
bbejeck merged 4 commits intoapache:trunkfrom
bbejeck:KAFKA-9739_trunk_branch_null_keyChangingChildNode

Conversation

@bbejeck
Copy link
Copy Markdown
Member

@bbejeck bbejeck commented Apr 1, 2020

For some context, when building a streams application, the optimizer keeps track of the key-changing operations and any repartition nodes that are descendants of the key-changer. During the optimization phase (if enabled), any repartition nodes are logically collapsed into one. The optimizer updates the graph by inserting the single repartition node between the key-changing node and its first child node. This graph update process is done by searching for a node that has the key-changing node as one of its direct parents, and the search starts from the repartition node, going up in the parent hierarchy.

The one exception to this rule is if there is a merge node that is a descendant of the key-changing node, then during the optimization phase, the map tracking key-changers to repartition nodes is updated to have the merge node as the key. Then the optimization process updates the graph to place the single repartition node between the merge node and its first child node.

The error in KAFKA-9739 occurred because there was an assumption that the repartition nodes are children of the merge node. But in the topology from KAFKA-9739, the repartition node was a parent of the merge node. So when attempting to find the first child of the merge node, nothing was found (obviously) resulting in StreamException(Found a null keyChangingChild node for..)

This PR fixes this bug by first checking that all repartition nodes for optimization are children of the merge node.

This PR includes a test with the topology from KAFKA-9739.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

bbejeck added 4 commits March 29, 2020 16:50
…the parent of at least one repartition topics to be optimized.
…ork was done on 2.4 branch and the naming conventions for repartition topics has changed.
…he merge node to update the optimization map with the merge node vs. the key-changing node.
@bbejeck
Copy link
Copy Markdown
Member Author

bbejeck commented Apr 1, 2020

An excellent example of this in action is https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/processor/internals/RepartitionWithMergeOptimizingTest.java.

Here's the un-optimized topology
merge_node_with_repartition_before_optimization

And the optimized one

merge_node_with_repartition_optimized

@bbejeck bbejeck added the streams label Apr 1, 2020
@bbejeck
Copy link
Copy Markdown
Member Author

bbejeck commented Apr 1, 2020

ping @guozhangwang, @mjsax, and @vvcephei

@guozhangwang
Copy link
Copy Markdown
Contributor

Thanks @bbejeck ! also cc @ableegoldman @cadonna to take a look as well.

mergeNodesToKeyChangers.get(mergeNode).add(key);
final Set<Map.Entry<StreamsGraphNode, LinkedHashSet<OptimizableRepartitionNode<?, ?>>>> entrySet = keyChangingOperationsToOptimizableRepartitionNodes.entrySet();
for (final Map.Entry<StreamsGraphNode, LinkedHashSet<OptimizableRepartitionNode<?, ?>>> entry : entrySet) {
if (mergeNodeHasRepartitionChildren(mergeNode, entry.getValue())) {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the fix

@bbejeck
Copy link
Copy Markdown
Member Author

bbejeck commented Apr 1, 2020

Java 11 failed with kafka.api.ConsumerBounceTest.testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup updated existing Jira ticket

Java 8 failed with org.apache.kafka.streams.integration.QueryableStateIntegrationTest.shouldAllowConcurrentAccesses created a new Jira ticket for this.

retest this please.

@bbejeck
Copy link
Copy Markdown
Member Author

bbejeck commented Apr 1, 2020

Java 8 failed with kafka.api.PlaintextProducerSendTest.testNonBlockingProducer I've updated the Jira ticket

Java 11 passed

retest this please.

Copy link
Copy Markdown
Contributor

@vvcephei vvcephei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the lucid PR in response to a truly mind-bending bug. The explanation sounds right to me, and the code looks right. The test looks good, too.

Thanks!

@bbejeck bbejeck merged commit 9783b85 into apache:trunk Apr 3, 2020
@bbejeck
Copy link
Copy Markdown
Member Author

bbejeck commented Apr 3, 2020

Merged #8400 into trunk.

@bbejeck bbejeck deleted the KAFKA-9739_trunk_branch_null_keyChangingChildNode branch April 3, 2020 16:05
@vvcephei
Copy link
Copy Markdown
Contributor

vvcephei commented Apr 3, 2020

Thanks, @bbejeck !

bbejeck added a commit that referenced this pull request Apr 3, 2020
2.4 port of #8400 since cherry-picking not possible

Reviewers: John Roesler <john@confluent.io>
bbejeck added a commit to bbejeck/kafka that referenced this pull request Apr 3, 2020
2.4 port of apache#8400 since cherry-picking not possible

Reviewers: John Roesler <john@confluent.io>
@bbejeck bbejeck mentioned this pull request Apr 3, 2020
3 tasks
bbejeck added a commit that referenced this pull request Apr 4, 2020
A port of #8400 for 2.3. The process of sorting source and sink nodes changed in 2.4, so we can't cherry-pick the PR directly as we need to update the expected topology to what it would be in the 2.3 version.

Reviewers: John Roesler <john@confluent.io>, Andrew Choi <a24choi@edu.uwaterloo.ca>
bbejeck added a commit that referenced this pull request Apr 4, 2020
A port of #8400 for 2.3. The process of sorting source and sink nodes changed in 2.4, so we can't cherry-pick the PR directly as we need to update the expected topology to what it would be in the 2.3 version.

Reviewers: John Roesler <john@confluent.io>, Andrew Choi <a24choi@edu.uwaterloo.ca>
bbejeck added a commit to bbejeck/kafka that referenced this pull request Apr 15, 2020
For some context, when building a streams application, the optimizer keeps track of the key-changing operations and any repartition nodes that are descendants of the key-changer. During the optimization phase (if enabled), any repartition nodes are logically collapsed into one. The optimizer updates the graph by inserting the single repartition node between the key-changing node and its first child node. This graph update process is done by searching for a node that has the key-changing node as one of its direct parents, and the search starts from the repartition node, going up in the parent hierarchy.

The one exception to this rule is if there is a merge node that is a descendant of the key-changing node, then during the optimization phase, the map tracking key-changers to repartition nodes is updated to have the merge node as the key. Then the optimization process updates the graph to place the single repartition node between the merge node and its first child node.

The error in KAFKA-9739 occurred because there was an assumption that the repartition nodes are children of the merge node. But in the topology from KAFKA-9739, the repartition node was a parent of the merge node. So when attempting to find the first child of the merge node, nothing was found (obviously) resulting in StreamException(Found a null keyChangingChild node for..)

This PR fixes this bug by first checking that all repartition nodes for optimization are children of the merge node.

This PR includes a test with the topology from KAFKA-9739.

Reviewers: John Roesler <john@confluent.io>
@bbejeck bbejeck mentioned this pull request Apr 15, 2020
3 tasks
bbejeck added a commit that referenced this pull request Apr 15, 2020
This is a port of #8400 for the 2.5 branch

For some context, when building a streams application, the optimizer keeps track of the key-changing operations and any repartition nodes that are descendants of the key-changer. During the optimization phase (if enabled), any repartition nodes are logically collapsed into one. The optimizer updates the graph by inserting the single repartition node between the key-changing node and its first child node. This graph update process is done by searching for a node that has the key-changing node as one of its direct parents, and the search starts from the repartition node, going up in the parent hierarchy.

The one exception to this rule is if there is a merge node that is a descendant of the key-changing node, then during the optimization phase, the map tracking key-changers to repartition nodes is updated to have the merge node as the key. Then the optimization process updates the graph to place the single repartition node between the merge node and its first child node.

The error in KAFKA-9739 occurred because there was an assumption that the repartition nodes are children of the merge node. But in the topology from KAFKA-9739, the repartition node was a parent of the merge node. So when attempting to find the first child of the merge node, nothing was found (obviously) resulting in StreamException(Found a null keyChangingChild node for..)

This PR fixes this bug by first checking that all repartition nodes for optimization are children of the merge node.

Reviewers: John Roesler <john@confluent.io>
qq619618919 pushed a commit to qq619618919/kafka that referenced this pull request May 12, 2020
2.4 port of apache#8400 since cherry-picking not possible

Reviewers: John Roesler <john@confluent.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants