Skip to content

KAFKA-5505: Incremental cooperative rebalancing in Connect (KIP-415)#6363

Merged
rhauch merged 66 commits intoapache:trunkfrom
kkonstantine:kafka-5505
May 17, 2019
Merged

KAFKA-5505: Incremental cooperative rebalancing in Connect (KIP-415)#6363
rhauch merged 66 commits intoapache:trunkfrom
kkonstantine:kafka-5505

Conversation

@kkonstantine
Copy link
Copy Markdown
Contributor

@kkonstantine kkonstantine commented Mar 4, 2019

  • Resolve the stop-the-world effect in Connect by allowing connectors and tasks to keep running if possible.
  • Maintain connector and task assignment in the presence of quick restarts or rolling upgrades of Connect workers
  • Upgrade Connect protocol to allow workers to report their active assignments and the leader to revoke assignments.
  • Select the Connect protocol during runtime in a fully backwards compatible way.
  • Introduce ConnectAssignor and make task assignment scheduling pluggable.
  • Implement incremental cooperative rebalancing of connectors and tasks
  • https://cwiki.apache.org/confluence/display/KAFKA/KIP-415%3A+Incremental+Cooperative+Rebalancing+in+Kafka+Connect

Tested via:

  • Unit tests for protocol compatibility
  • Unit tests on task assignment scheduling
  • Unit tests on incremental cooperative rebalancing
  • Integration tests on incremental cooperative rebalancing

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@kkonstantine kkonstantine force-pushed the kafka-5505 branch 3 times, most recently from 0dd3063 to ef8d122 Compare March 6, 2019 21:50
Copy link
Copy Markdown
Contributor Author

@kkonstantine kkonstantine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marked a set of files that belong to different PRs and will be removed once those PRs are merged.
The PRs are:
#6340
#6342

Comment thread clients/src/main/java/org/apache/kafka/common/protocol/types/Schema.java Outdated
Comment thread connect/runtime/src/test/resources/log4j.properties Outdated
@kkonstantine kkonstantine force-pushed the kafka-5505 branch 2 times, most recently from 24cffd0 to abbf617 Compare March 11, 2019 04:23
@kkonstantine
Copy link
Copy Markdown
Contributor Author

@ewencp @hachikuji @rhauch @mumrah the PR is ready for review. The following items are expected to be addressed along with your comments:

  • The code on task assignment is currently written to be readable and evaluate correctness. I'd like to add micro-benchmarks and I expect it'll be optimized.
  • Unit tests for DistributedHerder and WorkerCoordinatorIncrementalTest will be expanded to guard against changes in the new protocol more extensively.
  • More logging will be added appropriately and some integration with metrics will be considered.
  • A few more integration tests will be added.

Some files contain changes that are introduced by other outstanding PRs. If still present here, please skip or review the changes in their respective PRs.

Really looking forward to your comments! Thanks!

Copy link
Copy Markdown
Contributor

@rhauch rhauch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job, @kkonstantine. Took my first pass, and overall it looks good. With your review guidance above, I couldn't find any major issues, but I have quite a few comments/questions. Found and logged some nits when I happened to notice them, but I wasn't looking for them. :-D

BTW, #6342 is now merged.

Copy link
Copy Markdown
Contributor

@rayokota rayokota left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @kkonstantine ! I am very excited for this feature! Just a few comments and nits. Thanks!

Copy link
Copy Markdown
Contributor Author

@kkonstantine kkonstantine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments. @rayokota I think I've replied to all your comments.
@rhauch I didn't get to all your comments yet. Next I'll update the config and continue with the rest of the comments.

Copy link
Copy Markdown
Member

@mumrah mumrah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @kkonstantine! My only real concern is the complexity and length of the methods in IncrementalCooperativeAssignor. I kind of wonder if a pattern other than procedural is warranted? I think we should at least consider changing the private methods to package-private and adding some unit tests.

@rayokota
Copy link
Copy Markdown
Contributor

@kkonstantine , thanks for responding to my feedback! I just had one remaining comment. Looking great!

@kkonstantine kkonstantine force-pushed the kafka-5505 branch 2 times, most recently from 9ce15ad to 662a259 Compare April 25, 2019 22:36
Copy link
Copy Markdown
Contributor Author

@kkonstantine kkonstantine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhauch @mumrah I've addressed almost all your comments with changes or replies.
Would you mind returning to these discussions to see if we can resolve them?
There are a couple remaining items regarding javadocs and error handling during assignment, if I'm not mistaken, that I will definitely address before merging. Thanks!

Comment thread connect/runtime/src/test/resources/log4j.properties Outdated
@kkonstantine kkonstantine force-pushed the kafka-5505 branch 2 times, most recently from b6d69f2 to 79677ae Compare April 29, 2019 23:37
Copy link
Copy Markdown
Contributor Author

@kkonstantine kkonstantine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ewencp thanks a lot for what I assume is your first round of comments!
I fixed/replied to the majority of the comments. Will return to the ones that need more work very soon in a second pass.

Comment thread connect/runtime/src/test/resources/log4j.properties Outdated
@kkonstantine kkonstantine force-pushed the kafka-5505 branch 2 times, most recently from 145a2fd to 736668b Compare May 16, 2019 23:22
@kkonstantine
Copy link
Copy Markdown
Contributor Author

Thanks @rhauch @mumrah @rayokota @ryannedolan and @ewencp for all the insightful and useful comments! I believe I've addressed everything, except a few cleanup/refactoring suggestions that deemed high risk at this point and will be addressed in a follow up PR after this feature is merged.

Soak testing has been also performed and has confirmed correct execution for several days. More extensive testing and performance benchmarking will follow up in the next few days.

I'll be glad if we can get this in. Thanks!

Copy link
Copy Markdown
Contributor

@rhauch rhauch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic work, @kkonstantine. I wish this weren't such a big PR, but I've been steadily tracking the progress of the latest commits as you've been running multiple tests. As you say, there are some minor things that could be cleaned up and improved, but given the size of the PR it'd be good to handle those separately in the coming days, since they shouldn't affect behavior or functionality but will be more about maintainability.

I'm approving pending a green build and successful Connect tests. Most of the recent PR builds have been great, but I know you changed just a few test-related things (e.g., Jenkinsfile to run the Connect tests many times) that you've now reverted, and they theoretically shouldn't affect the build.

@rhauch rhauch merged commit ce584a0 into apache:trunk May 17, 2019
pengxiaolong pushed a commit to pengxiaolong/kafka that referenced this pull request Jun 14, 2019
…pache#6363)

Added the incremental cooperative rebalancing in Connect to avoid global rebalances on all connectors and tasks with each new/changed/removed connector. This new protocol is backward compatible and will work with heterogeneous clusters that exist during a rolling upgrade, but once the clusters consist of new workers only some affected connectors and tasks will be rebalanced: connectors and tasks on existing nodes still in the cluster and not added/changed/removed will continue running while the affected connectors and tasks are rebalanced.

This commit attempted to minimize the changes to the existing V0 protocol logic, though that was not entirely possible.

This commit adds extensive unit and integration tests for both the old V0 protocol and the new v1 protocol. Soak testing has been performed multiple times to verify behavior while connectors and added, changed, and removed and while workers are added and removed from the cluster.

Author: Konstantine Karantasis <konstantine@confluent.io>
Reviewers: Randall Hauch <rhauch@gmail.com>, Ewen Cheslack-Postava <me@ewencp.org>, Robert Yokota <rayokota@gmail.com>, David Arthur <mumrah@gmail.com>, Ryanne Dolan <ryannedolan@gmail.com>
@findnanda
Copy link
Copy Markdown

Hi Can you please suggest in which kafka version is this issue fixed as I am still seeing this problem every time i add a new connector, all the connector gets restarted?

@rhauch
Copy link
Copy Markdown
Contributor

rhauch commented Nov 15, 2019

@findnanda, the Jira issue (https://issues.apache.org/jira/browse/KAFKA-5505) shows that this was merged and completed in AK 2.3.0.

If you're using AK 2.3.0 or later and still having problems, please create a new Jira issue and provide the Connect worker configs and a lot more detail about a procedure to replicate the problem. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants