KAFKA-12440: ClusterId validation for Vote, BeginQuorum, EndQuorum and FetchSnapshot (#10289)
Conversation
Hello @dajac, PTAL.
jsancio
left a comment
Thanks for the PR. One quick comment. I'll look at the rest of the PR later this week.
It is sad that we have to add KafkaRaftClient to this list. Do you know what exactly pushed this over the threshold? This would allow us to look into ways to re-organize the code so that it is not so complex.
The handleVoteRequest() method has too many if conditions.
I agree it is unfortunate. There are probably ways we can improve this. For example, this logic smells a little bit:
```java
if (quorum.isLeader()) {
    logger.debug("Rejecting vote request {} with epoch {} since we are already leader on that epoch",
        request, candidateEpoch);
    voteGranted = false;
} else if (quorum.isCandidate()) {
    logger.debug("Rejecting vote request {} with epoch {} since we are already candidate on that epoch",
        request, candidateEpoch);
    voteGranted = false;
} else if (quorum.isResigned()) {
    logger.debug("Rejecting vote request {} with epoch {} since we have resigned as candidate/leader in this epoch",
        request, candidateEpoch);
    voteGranted = false;
} else if (quorum.isFollower()) {
    FollowerState state = quorum.followerStateOrThrow();
    logger.debug("Rejecting vote request {} with epoch {} since we already have a leader {} on that epoch",
        request, candidateEpoch, state.leaderId());
    voteGranted = false;
```

It might be possible to push this logic into EpochState, or at least to make use of the name() method in the logging. @dengziming would you be interested in following up on this separately?
Thank you, I will take some time to improve this.
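One possible shape for that follow-up, sketched here with hypothetical names (this is not KafkaRaftClient's actual API): give each state a human-readable name and collapse the four near-identical rejection branches into one log line.

```java
// Hypothetical sketch; Kafka's real per-epoch state classes differ.
public class VoteRejectionSketch {
    // Stand-in for the per-epoch state objects, each exposing a
    // log-friendly name, similar to the existing name() method.
    enum EpochState {
        LEADER("leader"), CANDIDATE("candidate"),
        RESIGNED("resigned"), FOLLOWER("follower");

        private final String stateName;
        EpochState(String stateName) { this.stateName = stateName; }
        String stateName() { return stateName; }
    }

    // One rejection message instead of four duplicated branches.
    static String rejectionReason(EpochState state, int candidateEpoch) {
        return String.format(
            "Rejecting vote request with epoch %d since we are already %s in that epoch",
            candidateEpoch, state.stateName());
    }

    public static void main(String[] args) {
        System.out.println(rejectionReason(EpochState.LEADER, 5));
    }
}
```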
What should we do if we see this error in a response? It looks like it would currently hit handleUnexpectedError, which just logs an error. That might be ok for now. I think there is a window during startup when we could consider these errors to be fatal. This would be helpful for detecting configuration problems. We probably do not want them to be fatal in all cases though, because that might result in a misconfigured node killing a stable cluster.
It's a bit difficult to figure out how to add that window; we cannot simply rely on a fixed configuration. I added a ticket to track this problem: https://issues.apache.org/jira/browse/KAFKA-12465.
We can implement that when handling a response: an invalid cluster id is fatal unless a previous response contained a valid cluster id.
@jsancio This is simple but not perfect. Consider a four-node cluster A-0 (clusterId=A), A-1 (clusterId=A), B-0 (clusterId=B), B-1 (clusterId=B). When starting, they all become candidates and send vote requests to the other nodes. If each node first receives a vote response from a node with the same clusterId as itself, they all live; but if each first receives a vote response from a node with a different clusterId, they are all killed. The logic is similar to leader election, which needs to reach a consensus. So we'd better treat these errors as non-fatal for now and have some discussion to reach a consensus about whether we should treat them as fatal.
Yes. @dengziming, in that example the user has incorrectly configured the cluster: all of the controllers have each other's listener (connection) information, but the cluster ids are different.
The question is: do we want to catch those misconfigurations early by shutting down the brokers/controllers? Or do we want to continue executing, with the user potentially missing that the controllers/brokers are incorrectly configured?
There have been conversations about having the first controller leader generate the cluster id and replicate that information to all of the nodes. The current implementation generates the cluster id in the StorageTool, which the user has to run when configuring the controllers.
I am okay leaving it as is and addressing this in a future PR.
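The earlier "fatal until the first valid response" idea could be sketched as a small latch. The class and method names here are illustrative, not Kafka's actual code.

```java
// Illustrative sketch, not Kafka's implementation: treat a cluster id
// mismatch in a response as fatal only until we have seen at least one
// response whose cluster id matches our own.
public class ClusterIdMismatchLatch {
    private final String localClusterId;
    private boolean sawMatchingClusterId = false;

    public ClusterIdMismatchLatch(String localClusterId) {
        this.localClusterId = localClusterId;
    }

    // Returns true when the mismatch should be treated as fatal.
    public boolean isFatalMismatch(String responseClusterId) {
        if (localClusterId.equals(responseClusterId)) {
            sawMatchingClusterId = true; // latch: later mismatches become non-fatal
            return false;
        }
        return !sawMatchingClusterId;
    }
}
```

In the A/B split-cluster example above, whether a node survives under this rule depends only on whether a matching response happens to arrive before the first mismatch, which illustrates why treating mismatches as non-fatal is the safer default for now.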
Force-pushed 80382b3 to 22f0f76
I think a better way to do this is to modify validateVoterOnlyRequest and validateLeaderOnlyRequest so that we pass the clusterId. Then we can get rid of getClusterId.
I tried this approach. It seems that voter and leader are partition-level terminology, so validateVoterOnlyRequest produces a partition-level error, while clusterId validation is a request-level error. We'd better separate these two errors since we are making way for multi-raft. I changed the getClusterId method to pass the clusterId to it directly. WDYT?
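The separation being described might look roughly like this. The method names and the null-means-skip rule are assumptions for illustration, not the PR's exact code.

```java
import java.util.Set;

// Illustrative sketch of request-level vs. partition-level validation.
public class RaftRequestValidation {
    private final String clusterId;

    public RaftRequestValidation(String clusterId) {
        this.clusterId = clusterId;
    }

    // Request-level check: the whole request is rejected on a mismatch.
    // A null cluster id is accepted here for older clients (assumption).
    public boolean validateClusterId(String requestClusterId) {
        return requestClusterId == null || clusterId.equals(requestClusterId);
    }

    // The partition-level check stays separate: e.g. whether the requester
    // is a known voter for this partition (stubbed for the sketch).
    public boolean validateVoterOnlyRequest(int voterId, Set<Integer> voters) {
        return voters.contains(voterId);
    }
}
```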
Force-pushed 22f0f76 to b01a7c2
Force-pushed b01a7c2 to ef799af
hachikuji
left a comment
LGTM. Thanks for the patch!
More detailed description of your change
This PR follows up #10129, which added clusterId validation to FetchRequest.
Summary of testing strategy (including rationale)
Unit test.