KAFKA-14154; Ensure AlterPartition not sent to stale controller #12499
hachikuji wants to merge 4 commits into apache:trunk from
    handleResponse(request)
    ))
    controllerOpt.foreach { activeController =>
      if (activeController.epoch >= request.minControllerEpoch) {
To confirm, this check is done on the broker side, right? I guess you allude to this in the PR description: a potentially more ideal solution would be for the controller to do the check server side, but that would require a version bump.
Yes, that's right.
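The broker-side check discussed in this thread can be illustrated with a minimal sketch. `ControllerInfo`, `QueuedRequest`, and `EpochGate` are hypothetical, simplified stand-ins introduced only for illustration; they are not the classes touched by this patch. Only the comparison against `minControllerEpoch` comes from the diff above.

```scala
// Hypothetical, simplified types for illustration; not the actual Kafka classes.
final case class ControllerInfo(nodeId: Int, epoch: Int)
final case class QueuedRequest(name: String, minControllerEpoch: Int)

object EpochGate {
  // A request may be sent only when a controller is known and its epoch is
  // at least the minimum epoch required by the request.
  def canSend(controllerOpt: Option[ControllerInfo], request: QueuedRequest): Boolean =
    controllerOpt.exists(_.epoch >= request.minControllerEpoch)
}
```

Under this sketch, a request that requires epoch 5 would be held back while the only known controller has epoch 4, and it would also be held back when no controller is known at all.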
    info(s"Recorded new controller, from now on will use broker $controllerNode")
    updateControllerAddress(controllerNode)
    metadataUpdater.setNodes(Seq(controllerNode).asJava)
    case Some(controllerNodeAndEpoch) =>
Is this where/how eventually the LeaderAndIsr from the new controller gets applied?
So we prevent sending the request if the epoch is lower? And is it the case that there is always a controller with an epoch at least as large? Or in some cases would we need to wait/retry until such a controller exists?
@jolshan Yes, that is right. Ensuring some level of monotonicity seems like a good general change even outside the original bug. It is weird to allow the broker to send requests to a controller that it knows for sure is stale, and it makes the system harder to reason about. One thing I have been trying to think through is how this bug affects kraft. The kraft controller will also return
I am going to close this PR. On the one hand, it does not address the problem for KRaft; on the other, we have thought of a simpler fix for the zk controller, which I will open shortly.
It is currently possible for a leader to send an AlterPartition request to a stale controller which does not have the latest leader epoch discovered through a LeaderAndIsr request. In this case, the stale controller returns FENCED_LEADER_EPOCH, which causes the partition leader to get stuck. This is a change in behavior following #12032. Prior to that patch, the request would either be accepted (potentially incorrectly) if the LeaderAndIsr state matched that on the controller, or it would have returned NOT_CONTROLLER after the stale controller failed to apply the update to ZooKeeper.
This patch fixes the problem by ensuring that AlterPartition is sent to a controller with an epoch which is at least as large as that of the controller which sent the LeaderAndIsr request. The way this is achieved is by tracking the controller epoch in BrokerToControllerChannelManager and ensuring that it is only updated monotonically regardless of the source. If we find a controller epoch through LeaderAndIsr which is larger than what we have in the MetadataCache, then the controller node is reset and we wait until we have discovered the controller node with a higher epoch. This ensures that the FENCED_LEADER_EPOCH error from the controller can be trusted.
A more elegant solution to this problem would probably be to include the controller epoch in the AlterPartition request, but this would require a version bump. Alternatively, we considered letting the controller return UNKNOWN_LEADER_EPOCH instead of FENCED_LEADER_EPOCH when the epoch is larger than what it has in its context. This too likely would require a version bump. Finally, we considered reverting #12032, which would restore the looser validation logic which allows the controller to accept AlterPartition requests with larger leader epochs. We rejected this option because we feel it can lead to correctness violations.
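The monotonic epoch tracking described above can be sketched roughly as follows. This is an illustrative model only, not the code in this patch: `ControllerEpochTracker`, `onLeaderAndIsr`, and `onMetadataUpdate` are hypothetical names, and the real BrokerToControllerChannelManager deals with networking and request queues that are omitted here.

```scala
// Hypothetical, simplified model for illustration; not the actual Kafka classes.
final case class ControllerNodeAndEpoch(nodeId: Int, epoch: Int)

final class ControllerEpochTracker {
  private var highestEpoch: Int = -1
  private var controller: Option[ControllerNodeAndEpoch] = None

  def currentEpoch: Int = this.synchronized(highestEpoch)
  def currentController: Option[ControllerNodeAndEpoch] = this.synchronized(controller)

  // Assumed hook: a LeaderAndIsr request revealed a controller epoch.
  // A higher epoch raises the tracked epoch and resets a controller node
  // that is now known to be stale; lower or equal epochs change nothing.
  def onLeaderAndIsr(epoch: Int): Unit = this.synchronized {
    if (epoch > highestEpoch) {
      highestEpoch = epoch
      if (controller.exists(_.epoch < epoch)) controller = None
    }
  }

  // Assumed hook: the metadata cache reported a controller node.
  // The node is accepted only if its epoch is not older than the highest
  // epoch seen so far, so the tracked epoch never moves backwards.
  def onMetadataUpdate(candidate: ControllerNodeAndEpoch): Unit = this.synchronized {
    if (candidate.epoch >= highestEpoch) {
      highestEpoch = candidate.epoch
      controller = Some(candidate)
    }
  }
}
```

In this sketch, a broker that knows controller node 1 at epoch 5 and then sees epoch 6 through LeaderAndIsr forgets node 1, ignores further metadata that still points at epoch 5, and resumes sending only once a controller at epoch 6 or higher has been discovered.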