KIP-221 / Add KStream#repartition operation #7170
…mber of partitions based on InternalTopicProperties
…, KeyValueMapper)
…erations method with InteralTopicConfig and StreamPartitioner parameters
…cNamesWithProperties; Moved InternalTopicProperties class to dedicated file
… after repartition operation is performed
|
@lkokhreidze Thanks for the PR. There are some checkstyle errors. Can you please fix them before we review your PR? |
|
@mjsax done |
mjsax left a comment
Thanks for the PR.
Made an initial pass. I still need to wrap my head around the optimization layer and how we merge repartition nodes. We need to add more tests to RepartitiontTopicNameTest and/or StreamsGraphTest IMHO, to verify that the new repartition() operator works as intended.
Also, it seems you forgot to update groupBy() and groupByKey().
Finally, thinking about the KIP once more: as we extend groupBy to configure the internal repartition topic, I am wondering if we should extend the KIP and also allow this for join(), which may also create repartition topics? \cc @guozhangwang @bbejeck @vvcephei @ableegoldman @cadonna @abbccdda
@mjsax this is the first PR (written in PR description) |
Agreed, I'll do that. I wanted to tag @bbejeck as seems like he's the main author behind optimization logic. I'll add tests for optimization logic to make sure nothing breaks. |
|
Ah. I guess, I skipped the PR description... Sorry for that. I discussed the proposal with @vvcephei in person, and thinking about the semantics once more, I am actually wondering if it is wise to change We have basically two dimensions with 2 cases each to consider for
Case (1), (2), and (4) are straightforward. However, case (2) is somewhat awkward because we actually want to treat Therefore, only case (4) is left in which passing in Therefore, I don't see a good use case for which it makes sense to pass in Would be great if you could share your thoughts about it? A second point I discussed with @vvcephei is about the optimization. We both have the impression that
Not sure I agree @mjsax -- maybe you just want to control the parallelism in case a repartition is required? You could enforce users to step through their whole topology, figure out when/where repartitioning is needed, and use |
I don't see this as a use case in practice. Why would one want to change the parallelism? Because the aggregation operation is over- or under-provisioned, and thus one wants to decrease or increase the parallelism. If I am ok with the "default" parallelism in case there is no repartitioning, why would I not be ok with it if data is repartitioned?
This is less an issue IMHO, because if I want to scale up for example, it's sufficient to insert |
|
My question is, why do you need Similar to your second comment, if you want to "scale up" again later, you call |
|
Ok, well I am fine with this framing it as a "set parallelism" operation...I don't want to stall this KIP/PR further, but what if this was split into a new set of |
|
Hello @mjsax @ableegoldman @vvcephei While, for me, as a user, 2nd option looks much more appealing, similarly how key selector for Again, your arguments are totally valid, and all can be achieved just by having |
|
Thanks @vvcephei for the update and no worries :) |
vvcephei left a comment
Haha, well. I did start the review, and made a fair amount of progress before getting sidetracked by a global catastrophe...
It's still in my "actively working on this" bucket, and I'll commit to not starting new work until I finish my review. For now, I'll go ahead and ask this one question, which came up early in my review. I skimmed over the KIP and discussion thread, but didn't see a specific discussion of the overload in question.
|
test this please |
vvcephei left a comment
Hey, @lkokhreidze , I finally finished my review, and it looks good to me. I'm not sure if @mjsax wants to make another pass.
    }

    @Test
    public void shouldCreateOnlyOneRepartitionTopicWhenRepartitionIsFollowedByGroupByKey() throws ExecutionException, InterruptedException {
Similar to above: we should be able to test this via unit tests using Topology#describe()
Thought about that, but somehow it felt "safer" with integration tests. Mainly because I was more comfortable verifying that topics actually get created when using repartition operation.
I had a similar thought, that it looks like good fodder for unit testing, but I did like the safety blanket of verifying the actual partition counts. I guess I'm fine either way, with a preference for whatever is already in the PR ;)
> Mainly because I was more comfortable verifying that topics actually get created when using repartition operation.
I guess that is fair. (I just try to keep test runtime short if we can -- let's keep the integration test.)
    }

    @Test
    public void shouldGenerateRepartitionTopicWhenNameIsNotSpecified() throws ExecutionException, InterruptedException {
Seems to be unit-testable via Topology#describe()?
Thought about that, but somehow it felt "safer" with integration tests. Mainly because I was more comfortable verifying that topics actually get created when using repartition operation.
    }

    @Test
    public void shouldGoThroughRebalancingCorrectly() throws ExecutionException, InterruptedException {
Not sure what this test is about, i.e., how does it relate to the repartition() feature?
It's related to this comment #7170 (comment)
    }

    @Test
    public void shouldInvokePartitionerWhenSet() {
Not sure what this test actually verifies?
This was the "easiest" way I could figure out to verify that custom partitioner is invoked when it's set
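For readers wondering what "verifying that a custom partitioner is invoked" can look like in general, here is a minimal, self-contained sketch. It does not reproduce the PR's test. The `Partitioner` interface is a hypothetical stand-in with the same method shape as a stream partitioner, and `counting` and `send` are illustrative helpers invented for this example.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PartitionerInvocationSketch {

    // Hypothetical stand-in: maps a record to a partition number.
    interface Partitioner<K, V> {
        Integer partition(String topic, K key, V value, int numPartitions);
    }

    // Wraps a partitioner so that every call increments the given counter.
    static <K, V> Partitioner<K, V> counting(final Partitioner<K, V> delegate,
                                             final AtomicInteger calls) {
        return (topic, key, value, numPartitions) -> {
            calls.incrementAndGet();
            return delegate.partition(topic, key, value, numPartitions);
        };
    }

    // Hypothetical "send" step that consults the partitioner once per record.
    static <K, V> int send(final Partitioner<K, V> partitioner,
                           final String topic,
                           final K key,
                           final V value,
                           final int numPartitions) {
        return partitioner.partition(topic, key, value, numPartitions);
    }

    public static void main(final String[] args) {
        final AtomicInteger calls = new AtomicInteger();
        final Partitioner<Integer, String> byKeyHash =
            counting((topic, key, value, n) -> Math.floorMod(key.hashCode(), n), calls);

        send(byKeyHash, "some-topic", 1, "A", 4);

        // The counter tells the test whether the custom partitioner was actually used.
        System.out.println("partitioner invocations: " + calls.get());
    }
}
```

The counter is the whole trick: the assertion is on the side effect of the partitioner being called, not on where the record ends up.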
Co-Authored-By: John Roesler <vvcephei@users.noreply.github.com>
|
Hi @mjsax, I've addressed your comments, would appreciate another review. |
|
Small update: f2bcdfe In this commit I've added Topology optimization option as test parameter. This PR touches topology optimization (indirectly). In order to make sure that everything works as expected, I thought it would be beneficial in the integration tests verifying both, Regards,
|
Wow, that's great. Thanks, @lkokhreidze ! |
        Arrays.asList(StreamsConfig.OPTIMIZE, StreamsConfig.NO_OPTIMIZATION)
            .forEach(x -> values.add(new Object[]{x}));

        return values;
Seems unnecessarily complex? A simple
return Arrays.asList(new String[][] {
{StreamsConfig.OPTIMIZE},
{StreamsConfig.NO_OPTIMIZATION}
});
would do, too :)
(Feel free to ignore the comment.)
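To make the comparison concrete, here is a minimal, self-contained sketch of both ways of building the parameter list. It swaps the two StreamsConfig constants for local string stand-ins (`OPTIMIZE`, `NO_OPTIMIZATION`), and the class and method names are hypothetical, so it shows the shape of the two variants rather than the PR's actual test code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

public class ParameterListSketch {

    // Local stand-ins for StreamsConfig.OPTIMIZE and StreamsConfig.NO_OPTIMIZATION (assumption).
    static final String OPTIMIZE = "all";
    static final String NO_OPTIMIZATION = "none";

    // Shape of the version under review: fill a list imperatively.
    static Collection<Object[]> viaForEach() {
        final List<Object[]> values = new ArrayList<>();
        Arrays.asList(OPTIMIZE, NO_OPTIMIZATION)
            .forEach(x -> values.add(new Object[]{x}));
        return values;
    }

    // Shape of the suggested version: wrap a two-dimensional array literal.
    static Collection<Object[]> viaArrayLiteral() {
        return Arrays.asList(new String[][] {
            {OPTIMIZE},
            {NO_OPTIMIZATION}
        });
    }

    public static void main(final String[] args) {
        // Both variants describe the same two single-element parameter rows.
        System.out.println(Arrays.deepToString(viaForEach().toArray()));
        System.out.println(Arrays.deepToString(viaArrayLiteral().toArray()));
    }
}
```

Both methods yield one row per optimization setting, so the choice between them is purely about readability.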
        return values;
    }

    public KStreamRepartitionIntegrationTest(final String topologyOptimization) {
A simple
@Parameter
public String topologyOptimization;
Would be sufficient instead of adding a constructor and those lines could go into before().
(As above, feel free to ignore this comment.)
|
Merged to |
|
Yes, thank you @lkokhreidze for seeing this through! |
KIP-221: Enhance DSL with Connecting Topic Creation and Repartition Hint
Tickets: KAFKA-6037 KAFKA-8611
Description
This is the PR for KIP-221. The goal of this PR is to introduce a new KStream#repartition operator and the underlying machinery that can be used for repartition configuration on a KStream instance.
Notable Changes
- org.apache.kafka.streams.kstream.internals.graph.UnoptimizableRepartitionNode. This node is NOT subject to the optimization algorithm; therefore, each repartition operation is excluded from the optimization algorithm.
- org.apache.kafka.streams.processor.internals.InternalTopicProperties class that can be used for capturing repartition topic configurations passed via DSL operations.
- org.apache.kafka.streams.processor.internals.InternalTopologyBuilder#internalTopicNamesWithProperties map for storing the mapping between internal topics and their corresponding configuration. If a configuration is present, RepartitionTopicConfig is enriched with the configurations passed via DSL operations (in this case via the org.apache.kafka.streams.kstream.Repartitioned class).
- KStreamRepartitionIntegrationTest for testing different scenarios of KStream#repartition.
Committer Checklist (excluded from commit message)