KAFKA-5505: Incremental cooperative rebalancing in Connect (KIP-415) #6363
rhauch merged 66 commits into apache:trunk
@ewencp @hachikuji @rhauch @mumrah the PR is ready for review. The following items are expected to be addressed along with your comments:
Some files contain changes that are introduced by other outstanding PRs. If still present here, please skip or review the changes in their respective PRs.
Really looking forward to your comments! Thanks!
Nice job, @kkonstantine. Took my first pass, and overall it looks good. With your review guidance above, I couldn't find any major issues, but I have quite a few comments/questions. Found and logged some nits when I happened to notice them, but I wasn't looking for them. :-D
BTW, #6342 is now merged.
rayokota left a comment
Looks great, @kkonstantine! I am very excited for this feature! Just a few comments and nits. Thanks!
mumrah left a comment
Looks great @kkonstantine! My only real concern is the complexity and length of the methods in IncrementalCooperativeAssignor. I kind of wonder if a pattern other than procedural is warranted? I think we should at least consider changing the private methods to package-private and adding some unit tests.
@kkonstantine, thanks for responding to my feedback! I just had one remaining comment. Looking great!
kkonstantine left a comment
@rhauch @mumrah I've addressed almost all your comments with changes or replies.
Would you mind returning to these discussions to see if we can resolve them?
There are a couple remaining items regarding javadocs and error handling during assignment, if I'm not mistaken, that I will definitely address before merging. Thanks!
kkonstantine left a comment
@ewencp thanks a lot for what I assume is your first round of comments!
I fixed/replied to the majority of the comments. Will return to the ones that need more work very soon in a second pass.
Thanks @rhauch @mumrah @rayokota @ryannedolan and @ewencp for all the insightful and useful comments! I believe I've addressed everything, except a few cleanup/refactoring suggestions that were deemed high risk at this point and will be addressed in a follow-up PR after this feature is merged. Soak testing has also been performed and has confirmed correct execution for several days. More extensive testing and performance benchmarking will follow in the next few days. I'll be glad if we can get this in. Thanks!
rhauch left a comment
Fantastic work, @kkonstantine. I wish this weren't such a big PR, but I've been steadily tracking the progress of the latest commits as you've been running multiple tests. As you say, there are some minor things that could be cleaned up and improved, but given the size of the PR it'd be good to handle those separately in the coming days, since they shouldn't affect behavior or functionality but will be more about maintainability.
I'm approving pending a green build and successful Connect tests. Most of the recent PR builds have been great, but I know you changed just a few test-related things (e.g., Jenkinsfile to run the Connect tests many times) that you've now reverted, and they theoretically shouldn't affect the build.
…pache#6363)

Added the incremental cooperative rebalancing in Connect to avoid global rebalances on all connectors and tasks with each new/changed/removed connector. This new protocol is backward compatible and will work with heterogeneous clusters that exist during a rolling upgrade, but once the cluster consists of new workers only, only the affected connectors and tasks will be rebalanced: connectors and tasks on existing nodes still in the cluster and not added/changed/removed will continue running while the affected connectors and tasks are rebalanced.

This commit attempted to minimize the changes to the existing V0 protocol logic, though that was not entirely possible.

This commit adds extensive unit and integration tests for both the old V0 protocol and the new V1 protocol. Soak testing has been performed multiple times to verify behavior while connectors are added, changed, and removed and while workers are added and removed from the cluster.

Author: Konstantine Karantasis <konstantine@confluent.io>
Reviewers: Randall Hauch <rhauch@gmail.com>, Ewen Cheslack-Postava <me@ewencp.org>, Robert Yokota <rayokota@gmail.com>, David Arthur <mumrah@gmail.com>, Ryanne Dolan <ryannedolan@gmail.com>
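To make the difference described in the commit message concrete, here is a toy sketch for the case where one new worker joins. It is not the logic of `IncrementalCooperativeAssignor`; the class `RebalanceSketch`, its method names, and the "revoke only the surplus above a fair share" rule are assumptions made purely for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only (hypothetical names); not the actual Connect assignor.
final class RebalanceSketch {

    // Eager-style rebalance: every worker gives up everything it runs,
    // so all connectors and tasks stop before being reassigned.
    static Set<String> revokedEager(Map<String, List<String>> current) {
        Set<String> revoked = new TreeSet<>();
        for (List<String> tasks : current.values()) {
            revoked.addAll(tasks);
        }
        return revoked;
    }

    // Cooperative-style rebalance (simplified assumption): each worker keeps
    // up to its fair share and gives up only the surplus, so the remaining
    // tasks keep running while the surplus is handed to the new worker.
    static Set<String> revokedCooperative(Map<String, List<String>> current,
                                          int workersAfterJoin) {
        int total = 0;
        for (List<String> tasks : current.values()) {
            total += tasks.size();
        }
        // Fair share per worker, rounded up.
        int fairShare = (total + workersAfterJoin - 1) / workersAfterJoin;
        Set<String> revoked = new TreeSet<>();
        for (List<String> tasks : current.values()) {
            for (int i = fairShare; i < tasks.size(); i++) {
                revoked.add(tasks.get(i));
            }
        }
        return revoked;
    }

    public static void main(String[] args) {
        Map<String, List<String>> current = new LinkedHashMap<>();
        current.put("worker-1", Arrays.asList("c1-t0", "c1-t1", "c1-t2"));
        current.put("worker-2", Arrays.asList("c2-t0", "c2-t1", "c2-t2"));
        System.out.println("eager revokes:       " + revokedEager(current));
        System.out.println("cooperative revokes: " + revokedCooperative(current, 3));
    }
}
```

In this toy run, two workers hold three tasks each and a third worker joins: the eager variant stops all six tasks, while the cooperative variant stops only one task per existing worker.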
Hi, can you please tell me in which Kafka version this issue is fixed? I am still seeing the problem: every time I add a new connector, all the connectors get restarted.
@findnanda, the Jira issue (https://issues.apache.org/jira/browse/KAFKA-5505) shows that this was merged and completed in AK 2.3.0. If you're using AK 2.3.0 or later and still having problems, please create a new Jira issue and provide the Connect worker configs and a lot more detail about a procedure to replicate the problem. Thanks!
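When gathering worker configs for such a report, one setting that may be worth checking first is `connect.protocol`, which KIP-415 defines with the values `eager` and `compatible`. The sketch below only tests whether a worker config explicitly selects `eager`, which would keep the old stop-everything behavior. The class `WorkerProtocolCheck` and its method are hypothetical helpers, not part of Kafka.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration; not part of Kafka Connect.
final class WorkerProtocolCheck {

    // Returns true only when the config text explicitly sets
    // connect.protocol=eager. An absent key returns false, because the
    // effective default depends on the Kafka version in use.
    static boolean forcesEager(String workerConfigText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(workerConfigText));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        String protocol = props.getProperty("connect.protocol");
        return protocol != null && protocol.trim().equalsIgnoreCase("eager");
    }

    public static void main(String[] args) {
        // Example worker config text; the group.id value is made up.
        String config = "group.id=connect-cluster\nconnect.protocol=eager\n";
        System.out.println("forces eager rebalancing: " + forcesEager(config));
    }
}
```

If this returns true on any worker, that alone could explain all connectors restarting when a new one is added. If it returns false, the fuller details requested above are still needed.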