MINOR: onControllerResignation should be invoked if triggerControllerMove is called #2935
ijuma wants to merge 6 commits into apache:trunk
|
Link to previously failing Jenkins build: https://builds.apache.org/job/kafka-trunk-jdk8/1461/testReport/kafka.controller/ControllerFailoverTest/testMetadataUpdate/ |
|
Refer to this link for build results (access rights to CI server needed): |
|
Review by @onurkaraman and @junrao. |
|
I think it would be worth figuring out whether the transiently failing tests were actually coming from bb663d0 and exactly why (as in, what in that patch caused the change). The So there's no change stemming from |
|
@onurkaraman, thanks, I was hoping you'd be able to track down whether it was indeed bb663d0. It looks like the underlying issue was there too: The difference is that the NPE didn't seem to happen. Since the NPE requires a particular event ordering, it seems plausible that making the controller single-threaded made this more likely than before. |
|
Interesting. I ran the test 60 times and they all passed. I ran it against my local trunk which just went up to commit bb663d0 (KAFKA-5028). |
|
retest this please |
I'm not sure if removing this line is the right move, as it will cause events to be processed by the controller even after triggering the controller move.
triggerControllerMove now just deletes the /controller znode. The Reelect corresponding to the /controller znode deletion will come behind events already in the queue. Since the PR removes the activeControllerId.set(-1) line, the controller will still think it's active and will process the queued events in between the current event and the Reelect event.
When I first saw the PR title, I thought you'd simply add a line calling onControllerResignation within triggerControllerMove. I think this makes more sense.
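The ordering concern above can be illustrated with a small, self-contained model. This is only a sketch: the event names `TriggerMove` and `Work`, the `run` helper, and the `active` flag are hypothetical stand-ins, not the actual controller classes. It assumes a single FIFO queue in which `Reelect` is enqueued behind events that are already waiting.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of a single-threaded controller event queue.
object QueueOrderingSketch {
  sealed trait Event
  case object TriggerMove extends Event        // stands in for an event that calls triggerControllerMove
  case class Work(name: String) extends Event  // any event already waiting in the queue
  case object Reelect extends Event            // enqueued once the /controller znode deletion is observed

  // Returns the Work events processed while the controller still considered itself active.
  def run(resetActiveIdOnMove: Boolean): List[String] = {
    var active = true
    val queue = mutable.Queue[Event](TriggerMove, Work("a"), Work("b"))
    val processed = mutable.ListBuffer.empty[String]
    while (queue.nonEmpty) {
      queue.dequeue() match {
        case TriggerMove =>
          if (resetActiveIdOnMove) active = false // models activeControllerId.set(-1)
          queue.enqueue(Reelect)                  // lands behind the events already queued
        case Work(name) =>
          if (active) processed += name
        case Reelect =>
          active = false
      }
    }
    processed.toList
  }
}
```

In this model, `run(resetActiveIdOnMove = false)` still processes both queued events before `Reelect` is reached, while `run(resetActiveIdOnMove = true)` processes none of them.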
I tried adding onControllerResignation in that method first, but that causes another exception to be thrown. It may be that this test needs to be rethought.
Yes, I agree that perhaps we should improve the test in ControllerFailoverTest. My understanding is that the test tries to simulate a case where the controller gets into an illegal state and checks whether the controller can recover from it. But the way that it simulates the illegal state is quite involved. Perhaps we can do the following: (1) Consolidate all try/catch blocks for IllegalStateException into ControllerEventThread.doWork(). (2) Change the test to insert a MockEvent type that throws IllegalStateException.
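A rough sketch of the two suggestions, using hypothetical names throughout: `ControllerEvent`, `EventLoop`, and the `onIllegalState` callback are illustrative and do not reflect the real Kafka signatures. The idea is a single catch site in the event loop plus a test-only event that throws.

```scala
object MockEventSketch {
  trait ControllerEvent { def process(): Unit }

  // Hypothetical test-only event that simulates the controller reaching an illegal state.
  object MockEvent extends ControllerEvent {
    def process(): Unit = throw new IllegalStateException("simulated illegal state")
  }

  // Stand-in for the event thread: the only place where IllegalStateException is caught.
  class EventLoop(onIllegalState: () => Unit) {
    def doWork(event: ControllerEvent): Unit =
      try event.process()
      catch { case _: IllegalStateException => onIllegalState() }
  }
}
```

A test could then pass a callback that records the recovery attempt, for example `new MockEventSketch.EventLoop(() => moved += 1).doWork(MockEventSketch.MockEvent)`, and assert that the callback ran once.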
|
By the way, your latest Jenkins run still shows an NPE in the exact same spot in ControllerFailoverTest.testMetadataUpdate: I took a peek at the test itself. It's quite complicated. Do we know what it's actually trying to test? |
Force-pushed from 241b989 to 919901e.
|
@onurkaraman, yes, I noticed that the latest Jenkins run triggered that NPE again. I looked a bit more and it seems that the test is racy, as it calls non-thread-safe code from a different thread. That may be the source of the NPE. So, there is more than one issue in play. |
Force-pushed from 919901e to c188e61.
It seems like this should be in the try/catch since it can throw an IllegalStateException and it seems like updateLeaderEpoch doesn't do anything with this field. But please double-check.
Force-pushed from 6e922ca to 8e5d282.
|
@junrao @onurkaraman, here's another attempt that uses a mock event as suggested by Jun. I didn't move the |
Both of these should be waitUntilTrue as they happen after onControllerResigned
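For context, a polling assertion of this kind can be sketched as below. This is a hypothetical helper written for illustration, not the actual Kafka test utility; the parameter names and defaults are assumptions. It retries a condition until it holds or a deadline passes, which suits state that changes asynchronously on another thread.

```scala
object WaitSketch {
  // Hypothetical polling helper: re-evaluates `condition` until it is true or the deadline passes.
  def waitUntilTrue(condition: () => Boolean, msg: String,
                    waitTimeMs: Long = 5000L, pauseMs: Long = 100L): Unit = {
    val deadline = System.currentTimeMillis() + waitTimeMs
    while (!condition()) {
      if (System.currentTimeMillis() > deadline) throw new AssertionError(msg)
      Thread.sleep(math.min(pauseMs, waitTimeMs))
    }
  }
}
```

An assertion on state that is already settled when the check runs does not need this wrapper and can be a plain assert.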
|
retest this please Tests passed, but running them again, just in case. |
This assertion probably doesn't need to be in waitUntilTrue()?
It seems that we don't really need epochMap. previousEpoch could be obtained from the previous controller. If a broker is not a controller, the controller epoch is not valid.
Force-pushed from fa7405f to e603ee9.
|
@junrao, I addressed your feedback and fixed a trivial merge conflict. |
|
@ijuma : Thanks for the latest patch. LGTM |
…Move is called

Also update the test to be simpler since we can use a mock event to simulate the issue more easily (thanks Jun for the suggestion).

This should fix two issues:

1. A transient test failure due to a NPE in ControllerFailoverTest.testMetadataUpdate:

```text
Caused by: java.lang.NullPointerException
    at kafka.controller.ControllerBrokerRequestBatch.addUpdateMetadataRequestForBrokers(ControllerChannelManager.scala:338)
    at kafka.controller.KafkaController.sendUpdateMetadataRequest(KafkaController.scala:975)
    at kafka.controller.ControllerFailoverTest.testMetadataUpdate(ControllerFailoverTest.scala:141)
```

The test was creating an additional thread and it does not seem like it was doing the appropriate synchronization (perhaps this became more of an issue after we changed the Controller to be single-threaded and changed the locking)

2. Setting `activeControllerId.set(-1)` in `triggerControllerMove` causes `Reelect` not to invoke `onControllerResignation`. Among other things, this causes an `IllegalStateException` to be thrown when `KafkaScheduler.startup` is invoked for the second time without the corresponding `shutdown`. We now simply call `onControllerResignation` as part of `triggerControllerMove`.

Finally, I included a few clean-ups:

1. No longer update the broker state in `onControllerFailover`. This is no longer needed since we removed the `RunningAsController` state (KAFKA-3761).
2. Trivial clean-ups in KafkaController
3. Removed unused parameter in `ZkUtils.getPartitionLeaderAndIsrForTopics`

Author: Ismael Juma <ismael@juma.me.uk>

Reviewers: Jun Rao <junrao@gmail.com>

Closes #2935 from ijuma/on-controller-resignation-if-trigger-controller-move

(cherry picked from commit 6021618)
Signed-off-by: Jun Rao <junrao@gmail.com>
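The second issue can be reproduced in miniature. The sketch below is a simplified model under stated assumptions: `SimpleScheduler`, `Controller`, `reelect`, and `survivesMove` are hypothetical, and the assumption that re-election only resigns when the broker was active beforehand is inferred from the description above rather than taken from the real code.

```scala
object ResignationSketch {
  // Hypothetical scheduler that rejects a second startup without an intervening shutdown.
  class SimpleScheduler {
    private var started = false
    def startup(): Unit = {
      if (started) throw new IllegalStateException("scheduler already started")
      started = true
    }
    def shutdown(): Unit = { started = false }
  }

  class Controller(brokerId: Int, resignOnMove: Boolean) {
    private val scheduler = new SimpleScheduler
    private var activeControllerId = -1
    def isActive: Boolean = activeControllerId == brokerId

    def onControllerFailover(): Unit = { scheduler.startup(); activeControllerId = brokerId }
    def onControllerResignation(): Unit = scheduler.shutdown()

    def triggerControllerMove(): Unit = {
      if (resignOnMove) onControllerResignation() // the behaviour described in the commit message
      activeControllerId = -1
    }

    // Assumed shape of re-election: resign only if this broker was active before the change.
    def reelect(): Unit = {
      val wasActive = isActive
      activeControllerId = -1
      if (wasActive && !isActive) onControllerResignation()
      onControllerFailover() // this broker wins the election again
    }
  }

  // True if re-election after a controller move completes without an IllegalStateException.
  def survivesMove(resignOnMove: Boolean): Boolean = {
    val c = new Controller(brokerId = 0, resignOnMove = resignOnMove)
    c.onControllerFailover()
    c.triggerControllerMove()
    try { c.reelect(); true }
    catch { case _: IllegalStateException => false }
  }
}
```

In this model, `survivesMove(resignOnMove = false)` fails because the scheduler is started twice, while `survivesMove(resignOnMove = true)` succeeds because the resignation shuts the scheduler down first.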