MINOR: fix KRaftClusterTest and KRaft integration test failure#13647
showuon wants to merge 1 commit into apache:trunk
@mumrah, please take a look. Thanks.
Ah, looks like I introduced this problem. You're correct that we shouldn't call the fault handler here for expected errors like a controller failover. I added this logic to catch other errors inside the activation event regarding migration state. For example, if an established KRaft cluster is restarted with migrations enabled, we should terminate the controller with an error. Since we actually need to read some state from the metadata log to determine this, we can't just do a simple config validation as we start ControllerServer. Can we keep the exception handler, but only call the fault handler for RuntimeExceptions?
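The idea of keeping the exception handler while skipping expected errors can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the actual `QuorumController` code or the fix that landed in #13651: `RecordingFaultHandler`, `NotControllerException`, `ShuttingDownException`, `isExpected`, and `runActivation` are stand-ins defined here, not Kafka's classes. Errors expected during a failover or shutdown are rethrown unchanged so the activation event simply fails, while any other error (for example, a migration-state problem) is reported to the fault handler.

```java
import java.util.ArrayList;
import java.util.List;

public class ActivationFaultSketch {
    // Hypothetical stand-in for a fault handler; it records every fault it receives
    // and returns an exception for the caller to throw.
    static final class RecordingFaultHandler {
        final List<String> faults = new ArrayList<>();

        RuntimeException handleFault(String message, Throwable cause) {
            faults.add(message + ": " + cause.getMessage());
            return new RuntimeException(message, cause);
        }
    }

    // Hypothetical stand-ins for errors that are normal during failover or shutdown.
    static final class NotControllerException extends RuntimeException {
        NotControllerException(String msg) { super(msg); }
    }

    static final class ShuttingDownException extends RuntimeException {
        ShuttingDownException(String msg) { super(msg); }
    }

    // Returns true for errors that should only fail the event, never terminate the process.
    static boolean isExpected(Throwable t) {
        return t instanceof NotControllerException || t instanceof ShuttingDownException;
    }

    // Runs the activation step. Expected errors are rethrown as-is so the event fails
    // normally; anything else is routed through the fault handler.
    static void runActivation(Runnable activation, RecordingFaultHandler faultHandler) {
        try {
            activation.run();
        } catch (RuntimeException e) {
            if (isExpected(e)) {
                throw e;
            }
            throw faultHandler.handleFault("Unexpected error during controller activation", e);
        }
    }

    public static void main(String[] args) {
        RecordingFaultHandler handler = new RecordingFaultHandler();
        try {
            runActivation(() -> { throw new NotControllerException("No controller appears to be active."); }, handler);
        } catch (NotControllerException e) {
            System.out.println("event failed normally: " + e.getMessage());
        }
        try {
            runActivation(() -> { throw new IllegalStateException("migrations enabled on an established KRaft cluster"); }, handler);
        } catch (RuntimeException e) {
            System.out.println("faults reported: " + handler.faults.size());
        }
    }
}
```

The design choice here is to classify by exception type at a single catch site, so the activation logic itself does not need to know which failures are fatal.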
Yes, the fault handler should be invoked only for non-… (Technically …
Fixing the issue in #13651.
@mumrah, the original PR had the same failures. How come we merged it?
The test failure was introduced by a commit fairly late in #13407. I did briefly investigate it, but couldn't reproduce it locally, so I figured it was existing flakiness. Basically, it's just my fault for not looking more closely at the test failures. |
Saw a bunch of tests in RaftClusterSnapshotTest, KRaftClusterTest, and QuorumControllerTest fail in recent builds (#_1801, #_1800) after #13407 was merged. After some investigation, I found that they all failed because of this kind of error (e.g. here, here):
In most cases, the exception is thrown with the message "No controller appears to be active." This happens because leadership moved to another node while the active controller was writing and committing the activation records, which is quite normal. We should not treat this case as a fatal error.

There are also cases like this one (here):
This happens because the controller is shutting down while it is being activated. Again, this should not be a fatal error; instead, we can just fail this commit, as before.
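The two scenarios above can be modeled with a small, hypothetical sketch. It is not how `QuorumController` is implemented; the `Controller` class, its `failover`, `beginShutdown`, and `commitActivation` methods, and the two exception types are assumptions made up for illustration. The activation records are tagged with the leader epoch they were written in; if the epoch has moved on by commit time (a failover) or the controller is shutting down, the returned future simply completes exceptionally, so the event fails and nothing here terminates the process.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class ActivationCommitSketch {
    // Hypothetical stand-ins for the two expected failure modes.
    static final class NotControllerException extends RuntimeException {
        NotControllerException(String msg) { super(msg); }
    }

    static final class ShuttingDownException extends RuntimeException {
        ShuttingDownException(String msg) { super(msg); }
    }

    // Hypothetical, simplified controller: tracks the latest leader epoch and a shutdown flag.
    static final class Controller {
        private int leaderEpoch;
        private boolean shuttingDown;

        Controller(int leaderEpoch) { this.leaderEpoch = leaderEpoch; }

        int leaderEpoch() { return leaderEpoch; }

        // Simulates another node becoming leader.
        void failover() { leaderEpoch++; }

        void beginShutdown() { shuttingDown = true; }

        // Commits activation records written in writeEpoch. A lost race is reported
        // by failing the future, not by invoking a fatal fault handler.
        CompletableFuture<Void> commitActivation(int writeEpoch) {
            CompletableFuture<Void> future = new CompletableFuture<>();
            if (shuttingDown) {
                future.completeExceptionally(new ShuttingDownException("controller is shutting down"));
            } else if (writeEpoch != leaderEpoch) {
                future.completeExceptionally(new NotControllerException("No controller appears to be active."));
            } else {
                future.complete(null);
            }
            return future;
        }
    }

    public static void main(String[] args) {
        Controller controller = new Controller(5);
        int writeEpoch = controller.leaderEpoch(); // activation records written in epoch 5
        controller.failover();                      // leadership moves before the commit
        try {
            controller.commitActivation(writeEpoch).join();
        } catch (CompletionException e) {
            // join() wraps the failure; the cause is the stand-in NotControllerException.
            System.out.println("activation failed: " + e.getCause().getClass().getSimpleName());
        }
    }
}
```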