KAFKA-13955: Fix failing KRaftClusterTest tests #12238
…er.heartbeat.interval.ms
|
Interesting bug. If we decided to increase the default However, I'm wondering if increasing the |
|
@showuon Yes, after we reach a consensus here, we should update the KIP and note it in the KIP discussion thread. |
|
KAFKA-13955 has been created for these failing tests, to avoid duplicated work by other people (they are failing in many PR build results). |
|
@dengziming Thanks for looking at this. Did you consider fixing the logic for when the controller allows the broker to unfence? The current implementation requires that For example, should we allow the broker to be behind by
|
I came up with a similar solution yesterday, but I don't think it is a good one, because it might break the expectation/assumption that when a broker is up, it has already caught up with all the metadata changes. There are chances the
I like this idea. We can make sure the broker has already caught up to the last committed offset as of when the heartbeat was sent, which means the metadata changes made before broker startup have all been caught up. Thank you. |
I like this solution too. However, there are some complexities here (we'd want to make sure the heartbeat wasn't too long ago). Bumping the no-op timeout to might be a good quick fix until we have time to implement that (although I wonder why we can't use 4 seconds rather than 5?). |
|
These are the failing tests: They look unrelated to KRaft. |
|
I merged this PR to fix the tests and I created https://issues.apache.org/jira/browse/KAFKA-13959.
@dengziming @showuon My preference is to fix KAFKA-13959 and revert this commit before we ship 3.3.0. |
More detailed description of your change
We generate a NoOpRecord periodically to advance the metadata LEO. However, when a broker starts up, we wait until its metadata LEO catches up with the controller LEO. We generate a NoOpRecord every 500ms, but the broker sends a heartbeat request only every 2000ms.
It's almost impossible for a broker to catch up with the controller LEO if the broker sends a query request every 2000ms while the controller LEO increases every 500ms, so the tests in KRaftClusterTest will fail.
Summary of testing strategy (including rationale)
After this change, the tests in KRaftClusterTest all succeed.
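To make the timing mismatch from the change description concrete, here is a minimal toy simulation. It is not Kafka's code: the class and method names are hypothetical, and it assumes a deliberately simplified model in which the controller LEO grows by one record per no-op interval and the broker reports, at each heartbeat, the controller LEO it observed at the previous heartbeat. Under those assumptions the broker only looks caught up when no record was appended between two consecutive heartbeats.

```java
// Toy model only: names and the catch-up rule are illustrative assumptions,
// not the actual controller or broker implementation.
class CatchUpSimulation {

    /**
     * Returns the time (ms) of the first heartbeat at which the broker's
     * reported offset is >= the controller LEO, or -1 if that never happens
     * within horizonMs.
     */
    static long firstCaughtUpHeartbeatMs(long noOpIntervalMs,
                                         long heartbeatIntervalMs,
                                         long horizonMs) {
        // Assumption: before the first heartbeat the broker knows nothing.
        long lastSeenControllerLeo = -1;
        for (long now = heartbeatIntervalMs; now <= horizonMs; now += heartbeatIntervalMs) {
            // Assumption: exactly one no-op record is appended per interval.
            long controllerLeo = now / noOpIntervalMs;
            if (lastSeenControllerLeo >= controllerLeo) {
                return now; // the broker is considered caught up at this heartbeat
            }
            // The broker learns the current LEO and reports it next time.
            lastSeenControllerLeo = controllerLeo;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Intervals from the description: no-op every 500ms, heartbeat every 2000ms.
        System.out.println(firstCaughtUpHeartbeatMs(500, 2000, 60_000));
        // A no-op interval longer than the heartbeat interval, e.g. 5000ms.
        System.out.println(firstCaughtUpHeartbeatMs(5000, 2000, 60_000));
    }
}
```

In this model the first call never finds a caught-up heartbeat, because four records are appended between any two heartbeats, while the second call succeeds at the second heartbeat. That is consistent with raising the no-op interval above the heartbeat interval as a quick fix, though the real behaviour depends on details this sketch leaves out.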