MINOR: Refactor controller partition reassignment logic into separate class #7339
stanislavkozlovski wants to merge 6 commits into apache:trunk
Of note to reviewers:
|
3ef4dc5 to a05b75c
|
JDK 8 passed; both JDK 11 builds were grey. retest this please
a05b75c to b497ba9
|
I'm having trouble running the tests locally. I have only verified ReassignPartitionsClusterTest and DeleteTopicsTest locally.
We should be using mockito for new tests.
The use of Nil: _* as a second argument is needed because the call does not compile otherwise in Scala, due to its auto-tupling feature (article explaining it).
There are some workarounds, like adding a helper method or using mockito-scala. I opted for repeating the argument for the time being.
Moved all these instantiations below the handlers so that ReassignmentManager can reference a handler.
|
JDK 11/2.13 failed (although it expired); JDK 11/8 passed
|
retest this please |
|
JDK 11 / Scala 2.13 - JDK 8 / Scala 2.11 - Seems like a flake. Created https://issues.apache.org/jira/browse/KAFKA-8967 |
|
retest this please |
viktorsomogyi left a comment
Hey, this is great stuff. I was also looking at this class (and also KafkaApis) as something we could tear apart. There it would maybe be better to separate based on API calls, but that's for another evening.
So here are a few questions/suggestions:
- Is there any reason for passing the epoch as a function, or is it just preference? Can't you just use the one in controllerContext, as both seem to do the same thing?
- For sendUpdateMetadata I think we could pass the method itself to the ReassignmentHelper. Even though it's a 3-liner, I wouldn't copy-paste it.
- For tests I think we might be able to use the parameterized unit tests if the goal is to test the ReassignmentHelper (one example I did recently is https://github.com/apache/kafka/pull/7361/files#diff-3e5b61802d5dae0d374bf75f6c06a10a)
I had said
We can use the same epoch; I just wanted to keep the code as similar as possible.
That was my initial approach; in the end I went without it due to its simplicity. Passing a method looked weirder to me. I don't have a strong opinion.
Thanks! That is a good example. I would prefer we defer this to another PR.
51e7d80 to 3f98f39
|
Rebased with c620b73. cc @hachikuji @cmccabe for a second round.
3f98f39 to 5c8a02d
I don't really like passing in the eventManager here. It is used:
- when we remove the /reassign_partitions znode
- when we register a new ISR ZNodeChangeHandler
We could circumvent this by passing in two methods, onZNodeDeletion and onReassignmentStart, depending on people's thoughts.
I like your idea of adding callbacks. I think we can probably turn it into a Listener or something. For example:
    trait ReassignmentListener {
      def onReassignmentUpdated(): Unit  // invoked in `updateCurrentReassignment`
      def onReassignmentResumed(): Unit  // invoked at the start of `onPartitionReassignment`
      def onReassignmentFinished(): Unit // invoked at the end of `onPartitionReassignment` (case B)
    }
Using this approach, we can probably also get rid of the dependence on TopicDeletionManager.
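To make the listener idea concrete, here is a minimal sketch of how a manager could report progress through callbacks instead of holding the eventManager. Only ReassignmentListener and its three callbacks come from the suggestion above; RecordingListener and SketchReassignmentManager are hypothetical names invented for illustration and are not the code in this PR.

```scala
import scala.collection.mutable.ArrayBuffer

trait ReassignmentListener {
  def onReassignmentUpdated(): Unit  // invoked in `updateCurrentReassignment`
  def onReassignmentResumed(): Unit  // invoked at the start of `onPartitionReassignment`
  def onReassignmentFinished(): Unit // invoked at the end of `onPartitionReassignment` (case B)
}

// Hypothetical test double: records the order in which callbacks fire.
class RecordingListener extends ReassignmentListener {
  val events: ArrayBuffer[String] = ArrayBuffer.empty
  override def onReassignmentUpdated(): Unit = events += "updated"
  override def onReassignmentResumed(): Unit = events += "resumed"
  override def onReassignmentFinished(): Unit = events += "finished"
}

// Hypothetical stand-in for the manager: it depends only on the listener,
// not on an event manager or on TopicDeletionManager.
class SketchReassignmentManager(listener: ReassignmentListener) {
  def updateCurrentReassignment(): Unit = listener.onReassignmentUpdated()

  // `allNewReplicasInIsr` is an assumed flag standing in for the case B check.
  def onPartitionReassignment(allNewReplicasInIsr: Boolean): Unit = {
    listener.onReassignmentResumed()
    if (allNewReplicasInIsr) listener.onReassignmentFinished()
  }
}
```

In a layout like this, the controller would supply the real listener (for example, one that enqueues events), while unit tests could pass a recording one and assert on the callback order.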
5c8a02d to 30dcb5d
… class This patch adds a ReassignmentManager class which encapsulates most of the nitty-gritty details of reassigning a partition. Splitting the logic helps with testability and this patch leverages that to add unit tests for partition reassignments
c21588e to b661bc5
hachikuji left a comment
Thanks for the refactor. Left a few comments.
 * A helper class which contains logic for driving partition reassignments.
 * This class is not thread-safe.
 */
class ReassignmentsManager(controllerContext: ControllerContext,
nit: maybe just ReassignmentManager. More in line with classes like ReplicaManager.
      !zkPartitionsResumed.contains(tp)
    }
  } catch {
    case e: IllegalStateException => handleIllegalState(e)
Prior to this patch, handleIllegalState is only protecting calls to sendRequestsToBrokers which re-throws all unexpected exceptions as IllegalStateException. If we want this to be useful here, we should do the same.
However, to be honest, I am not sure when it makes sense to force resignation of the current controller. The current protection only for sendRequestsToBrokers seems arbitrary. I'm inclined to say it should be rare though which makes me doubt the changes here. We need to be sure that the next controller will actually be able to recover. In some cases, it seems like it would clearly be preferable to just let the current operation fail and go on to the next event.
Can we leave behavioral changes like this out of this PR since the focus here is improving testability? I'd prefer to try and come up with a principled approach to handling unexpected errors in the controller.
I did this because maybeResumeReassignments calls onPartitionReassignment, which calls sendRequestsToBrokers in phase B or in phase A's updateLeaderEpochAndSendRequest call.
I wanted to ensure that any errors there are still caught; otherwise this patch would change the behavior by letting them propagate. Not catching them is itself a behavioral change.
Is there concern that other methods may raise an IllegalStateException?
Perhaps we can re-throw the exception from sendRequestsToBrokers as a different type and catch only that?
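A rough sketch of that last idea, under assumptions: BrokerRequestSendException and the helpers in SendFailureSketch are hypothetical names for illustration only, not existing Kafka classes. The point is that only failures raised while sending are routed to the handler, while an IllegalStateException thrown anywhere else keeps propagating as before.

```scala
import scala.util.control.NonFatal

// Hypothetical dedicated exception type; not part of Kafka.
final class BrokerRequestSendException(cause: Throwable) extends RuntimeException(cause)

object SendFailureSketch {
  // Stand-in for sendRequestsToBrokers: wraps unexpected failures in the dedicated type.
  def sendRequestsToBrokers(send: () => Unit): Unit =
    try send()
    catch { case NonFatal(e) => throw new BrokerRequestSendException(e) }

  // Stand-in for the call site: only send failures reach the handler,
  // any other exception is left to propagate.
  def process(body: => Unit)(onSendFailure: Throwable => Unit): Unit =
    try body
    catch { case e: BrokerRequestSendException => onSendFailure(e.getCause) }
}
```

Whether the handler should still resign the controller is a separate question; the sketch only narrows what gets caught.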
Yeah, my concern was that something else might raise IllegalStateException. I am actually a bit tempted to get rid of handleIllegalState altogether. It just seems so arbitrary. Looking through the code, I cannot see the specific case we're trying to protect. The call to sendRequestsToBrokers just builds the requests and puts them in a queue. Perhaps I'm missing something?
The handleIllegalState stuff was added by Flavio to fix a bug that had been reported. We kept that through the Controller refactoring. A few related PRs:
Thanks @ijuma. That helps. It looks like it was specifically trying to protect the validation we do in newBatch(), but the root cause of the reported issue was evidently unknown. And neither have I heard of any recurrence of it. So my first inclination after seeing this is to also get rid of newBatch along with the logic to resign the controller (which seems like massive overkill). Will look more carefully tomorrow if I get a chance.
}

/**
 * Phase B of a partition reassignment is the part where all the new replicas are in ISR
Is this duplication necessary? It seems likely to diverge over time.
Theoretically, if we change the code, these tests would fail and this would get updated, but I can see how it's likely to get missed. Let's keep the comments in between the test code, though.
mockTopicDeletionManager = Mockito.mock(classOf[TopicDeletionManager])
mockControllerBrokerRequestBatch = Mockito.mock(classOf[ControllerBrokerRequestBatch])
mockReplicaStateMachine = Mockito.mock(classOf[ReplicaStateMachine])
mockPartitionStateMachine = Mockito.mock(classOf[PartitionStateMachine])
I'd suggest using MockPartitionStateMachine and MockReplicaStateMachine to simplify these test cases.
Makes sense on the MockPartitionStateMachine. With the replica state machine, isn't it more useful to keep it an EasyMock for now since we only use it for the assertion we have in A2 of testPhaseAOfPartitionReassignment ?
The nice thing that comes from MockReplicaStateMachine is validation of the state changes.
  new ControllerBrokerRequestBatch(config, controllerChannelManager, eventManager, controllerContext, stateChangeLogger))
val topicDeletionManager = new TopicDeletionManager(config, controllerContext, replicaStateMachine,
  partitionStateMachine, new ControllerDeletionClient(this, zkClient))
val reassignmentsManager = new ReassignmentsManager(controllerContext, zkClient, topicDeletionManager,
High level, I think this refactor is a good improvement. It is similar to the refactor that introduced TopicDeletionManager and it does make testing a bit easier. That said, I want to mention a kind of drawback. Although it succeeds in factoring some of the logic out of KafkaController, it doesn't really do anything about the complex interdependencies between the various components and the fact that any one of them can mutate the controller state. And in fact, it makes it a little harder to track all these mutations because they are spread over more classes.
I think it would be a useful exercise to try and think about some of these components in more of a functional way. Rather than allowing the reassignment manager to directly mutate any and all state that the controller owns, perhaps we can treat it more like a function which accepts the current state of the world, makes some modifications, and then returns the proposed new state of the world. Then it could be up to the controller to decide how to enact the new state (e.g. by making changes in ZK and sending UpdateMetadata requests). The nice thing then is that we don't need all the dependencies and all the nasty mocking that comes with them.
Anyway, this is more of a "food for thought" comment. I'm not exactly sure how to do this myself.
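As a very rough illustration of the "state in, proposed state out" shape described above, and not a claim about how the controller should actually be structured: every type below (Partition, ControllerState, Action, Proposal, ReassignmentPlanner) is hypothetical and invented for this sketch. The planner never mutates its input; it returns a new state plus a description of the side effects the controller would still have to enact.

```scala
// All types here are hypothetical and exist only for this sketch.
final case class Partition(topic: String, id: Int)
final case class ControllerState(assignments: Map[Partition, Seq[Int]])

// Side effects are described as data and left for the controller to enact.
sealed trait Action
final case class WriteAssignment(partition: Partition, replicas: Seq[Int]) extends Action
final case class SendUpdateMetadata(partition: Partition) extends Action

final case class Proposal(newState: ControllerState, actions: Seq[Action])

object ReassignmentPlanner {
  // Pure function: returns the proposed state and actions, leaving `state` untouched.
  def propose(state: ControllerState, partition: Partition, target: Seq[Int]): Proposal =
    if (state.assignments.get(partition).contains(target)) Proposal(state, Seq.empty)
    else Proposal(
      state.copy(assignments = state.assignments.updated(partition, target)),
      Seq(WriteAssignment(partition, target), SendUpdateMetadata(partition)))
}
```

A test for something like this needs no mocks at all: it builds a state, calls propose, and compares values.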
I agree that it does not fix the underlying issue of complexity, and that it makes the mutations harder to track.
If we are to start refactoring towards a more functional approach, I think it would be easier to start with the lowest-level classes which still mutate the state: the state machines.
Otherwise it'd be pretty difficult to make the ReassignmentManager functional when the components it uses mutate the state underneath.
zkClient.unregisterZNodeChangeHandler(path)
if (deletedZNode) {
  // Ensure we detect future reassignments
  eventManager.put(ZkPartitionReassignment)
I wonder if we can just call
    isActive && zkClient.registerZNodeChangeHandlerAndCheckExistence(partitionReassignmentHandler)
to outright register the change handler here?
|
retest this please |
}
val reassignmentsManager = new ReassignmentManager(controllerContext, zkClient, reassignmentListener,
  replicaStateMachine, partitionStateMachine, brokerRequestBatch, stateChangeLogger,
  shouldSkipReassignment = tp => {
I had one suggestion which you are free to reject. We have this kind of awkward back and forth between the controller and the manager when a reassignment is triggered:
1. Controller detects reassignment
2. Controller delegates trigger to Manager
3. Manager asks Controller if reassignment is allowed
4. Manager executes reassignment
I am wondering if it would be better to leave all of the trigger logic inside the controller and only delegate to the manager after step 3. In other words, perhaps maybeTriggerPartitionReassignment can be left inside the controller. What do you think?
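One way this could look, as a sketch under assumptions: the controller filters first and the manager only executes, so no shouldSkipReassignment callback has to be passed in. PartitionRef, ExecutingManager and TriggeringController are hypothetical names, and skipping topics queued for deletion is an assumed rule used only to give the filter something to do.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical types for illustration; not the classes in this PR.
final case class PartitionRef(topic: String, id: Int)

// The manager only executes; it never asks the controller for permission.
class ExecutingManager {
  val executed: ArrayBuffer[PartitionRef] = ArrayBuffer.empty
  def onPartitionReassignment(p: PartitionRef): Unit = executed += p
}

class TriggeringController(manager: ExecutingManager, topicsQueuedForDeletion: Set[String]) {
  // Assumed skip rule: partitions of topics queued for deletion are not reassigned.
  private def shouldSkipReassignment(p: PartitionRef): Boolean =
    topicsQueuedForDeletion.contains(p.topic)

  // The trigger logic stays in the controller; delegation happens after the check.
  def maybeTriggerPartitionReassignment(partitions: Seq[PartitionRef]): Unit =
    partitions.filterNot(shouldSkipReassignment).foreach(manager.onPartitionReassignment)
}
```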
|
This PR is being marked as stale since it has not had any activity in 90 days. If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact). If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.
|
This PR has been closed since it has not had any activity in 120 days. If you feel like this |