Skip to content

MINOR: fix KRaftClusterTest and KRaft integration test failure#13647

Closed
showuon wants to merge 1 commit intoapache:trunkfrom
showuon:fixKRaftClusterTest
Closed

MINOR: fix KRaftClusterTest and KRaft integration test failure#13647
showuon wants to merge 1 commit intoapache:trunkfrom
showuon:fixKRaftClusterTest

Conversation

@showuon
Copy link
Copy Markdown
Member

@showuon showuon commented Apr 27, 2023

Saw a bunch of tests in RaftClusterSnapshotTest, KRaftClusterTest, QuorumControllerTest failed in recent builds: #_1801, #_1800 after this PR merged: #13407.

Did some investigation, found they all failed because of this kind of error: (ex: here, here)

org.apache.kafka.server.fault.FaultHandlerException: fatalFaultHandler: exception while activating controller: xxxxx

Mostly, the reason of exception thrown is No controller appears to be active. And it's because when the active controller tried to write and commit the activation messages, the leadership changed to other nodes, which is quite normal. We should not treat it as fatal error with this case.

And there are also cases like this (here)

org.apache.kafka.server.fault.FaultHandlerException: fatalFaultHandler: exception while activating controller: java.util.concurrent.RejectedExecutionException

It's because while activating the controller, it's shutting down. Again, this should not be a fatal error, instead, we can just fail this commit as before.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@showuon showuon force-pushed the fixKRaftClusterTest branch from 32108a4 to 4430fde Compare April 27, 2023 09:53
@showuon showuon force-pushed the fixKRaftClusterTest branch from 4430fde to 07dee34 Compare April 27, 2023 09:54
@showuon showuon changed the title MINOR: fix KRaftClusterTest MINOR: fix KRaftClusterTest and KRaft integration test failure Apr 27, 2023
@showuon
Copy link
Copy Markdown
Member Author

showuon commented Apr 27, 2023

@mumrah , please take a look. Thanks.

@mumrah
Copy link
Copy Markdown
Member

mumrah commented Apr 27, 2023

Ah, looks like I introduced this problem. You're correct that we shouldn't call the fault handler here for expected errors like a controller failover. I added this logic to catch other errors inside the activation event regarding migration state. For example, if an established KRaft cluster is restarted with migrations enabled, we should terminate the controller with an error. Since we actually need to read some state from the metadata log to determine this, we can't just do a simple config validation as we start ControllerServer.

Can we keep the exception handler, but only call the fault handler for RuntimeExceptions?

@mumrah mumrah added the kraft label Apr 27, 2023
@cmccabe
Copy link
Copy Markdown
Contributor

cmccabe commented Apr 27, 2023

Yes, the fault handler should be invoked only for non-ApiException classes

(Technically ApiException itself is a RuntimException ... a questionable decision but there's no changing it now)

@showuon
Copy link
Copy Markdown
Member Author

showuon commented Apr 28, 2023

Fixing the issue in #13651

@showuon showuon closed this Apr 28, 2023
@ijuma
Copy link
Copy Markdown
Member

ijuma commented Apr 28, 2023

@mumrah the original PR had the same failures, how come we merged it?

@mumrah
Copy link
Copy Markdown
Member

mumrah commented Apr 28, 2023

The test failure was introduced by a commit fairly late in #13407. I did briefly investigate it, but couldn't reproduce it locally, so I figured it was existing flakiness. Basically, it's just my fault for not looking more closely at the test failures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants