KAFKA-13955: Fix failing KRaftClusterTest tests#12238

Merged
jsancio merged 1 commit into apache:trunk from dengziming:KIP-835-fix on Jun 3, 2022

Conversation

@dengziming
Member

More detailed description of your change
We generate a NoOpRecord periodically to increase the metadata log end offset (LEO). However, when a broker starts up, we wait until its metadata LEO catches up with the controller's LEO, yet we generate a NoOpRecord every 500 ms and send a heartbeat request only every 2000 ms.

It's almost impossible for a broker to catch up with the controller's LEO when it reports its offset only every 2000 ms while the controller's LEO advances every 500 ms, so the tests in KRaftClusterTest fail.
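The timing mismatch can be sketched with a toy simulation (illustrative only, not Kafka code; the 100 ms replication lag is an assumption): because the controller's committed offset advances every 500 ms, an offset the broker observed even slightly before sending its heartbeat is already stale by the time the controller checks it.

```python
# Toy simulation (not Kafka code): why a broker's reported metadata offset
# almost never matches the controller's committed offset under these timings.
# The 100 ms replication lag is an assumption for illustration.
NO_OP_INTERVAL_MS = 500       # controller appends a NoOpRecord this often
HEARTBEAT_INTERVAL_MS = 2000  # broker reports its offset this often
REPLICATION_LAG_MS = 100      # assumed fetch delay between controller and broker

def committed_offset(t_ms):
    # Committed offset grows by one NoOpRecord per interval.
    return t_ms // NO_OP_INTERVAL_MS

unfenced = False
for t in range(0, 60_000, HEARTBEAT_INTERVAL_MS):
    reported = committed_offset(t - REPLICATION_LAG_MS)  # broker's slightly stale view
    required = committed_offset(t)                       # controller's offset right now
    if reported >= required:
        unfenced = True
        break

print("broker ever unfenced:", unfenced)  # always behind -> False
```

With any positive lag, every heartbeat lands just after another NoOpRecord has been committed, so the strict `reported >= required` check never passes.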

[image: screenshot of the failing KRaftClusterTest results]

Summary of testing strategy (including rationale)
After this change, the tests in KRaftClusterTest all succeed.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@dengziming
Member Author

dengziming commented Jun 2, 2022

cc @jsancio @cmccabe
I also found something odd: KRaftClusterTest didn't fail after we merged #12183, which added NoOpRecord; it only started failing after #12207 was merged, in which we made only a minor adjustment.

@showuon
Member

showuon commented Jun 2, 2022

Interesting bug. If we decide in the end to increase the default metadata.max.idle.interval.ms value, we should note it in the KIP discussion thread.

However, I'm wondering whether setting metadata.max.idle.interval.ms too high will delay how quickly we notice an unavailable controller?

@dengziming
Member Author

@showuon Yes, after we reach a consensus here, we should update the KIP and note it in the KIP discussion thread.

@showuon
Member

showuon commented Jun 2, 2022

I created KAFKA-13955 for these failing tests, to avoid duplicated work by other people. (The failures show up in many PR build results.)

@jsancio
Member

jsancio commented Jun 2, 2022

@dengziming Thanks for looking at this. Did you consider fixing the logic for when the controller allows the broker to unfence?

The current implementation requires that request.currentMetadataOffset() >= lastCommittedOffset. As you point out, this is very difficult to achieve on a busy system, whether because of NoOpRecords or a lot of controller operations.

For example, should we allow the broker to be behind by A records (or some amount of time) and still unfence it? Another solution is to require the broker to only catch up to the last committed offset as of when it last sent a heartbeat. For example:

  1. The broker sends a heartbeat with a current offset of Y. The last committed offset is X. The controller remembers this last committed offset; call it X'.
  2. The broker sends another heartbeat with a current offset of Z. Unfence the broker if Z >= X or Z >= X'. What do you think?
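The two-step proposal above can be sketched roughly as follows (a hypothetical illustration with invented names, not the actual controller code):

```python
# Hypothetical sketch of the proposed unfence check (names invented; not the
# real QuorumController code). The controller remembers the committed offset
# it saw at the broker's previous heartbeat (X') and unfences the broker once
# its reported offset Z satisfies Z >= X (current) or Z >= X' (previous).
class BrokerHeartbeatState:
    def __init__(self):
        self.prev_committed = None  # X' from the previous heartbeat, if any

def may_unfence(state, broker_offset, last_committed):
    ok = broker_offset >= last_committed or (
        state.prev_committed is not None and broker_offset >= state.prev_committed
    )
    state.prev_committed = last_committed  # becomes X' for the next heartbeat
    return ok

state = BrokerHeartbeatState()
print(may_unfence(state, broker_offset=3, last_committed=4))  # False: still behind
print(may_unfence(state, broker_offset=5, last_committed=8))  # True: 5 >= previous X' of 4
```

The key property is that X' is frozen at heartbeat time, so NoOpRecords appended between two heartbeats no longer move the goalposts.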

@showuon
Member

showuon commented Jun 3, 2022

should we allow the broker to be behind by A records (or time) to unfence the broker?

I came up with a similar solution yesterday, but I don't think it's a good one, because it might break the expectation/assumption that when a broker is up, it has already caught up with all the metadata changes. There's a chance the A records/time it is behind contain some important metadata update, right?

Another solution is to require the broker to only catch up to the last committed offset when they last sent the heartbeat.

I like this idea. We can make sure the broker has already caught up to the last committed offset as of when the heartbeat was sent, which means all the metadata changes from before the broker started up have been caught up.
+1 for it!

Thank you.

@cmccabe
Contributor

cmccabe commented Jun 3, 2022

Another solution is to require the broker to only catch up to the last committed offset when they last sent the heartbeat.

I like this solution too. However, there are some complexities here (we'd want to make sure the heartbeat wasn't too long ago).

Bumping the no-op timeout might be a good quick fix until we have time to implement that (although I wonder why we can't use 4 seconds rather than 5?)
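In configuration terms, the quick fix amounts to keeping the no-op interval well above the heartbeat interval, along these lines (values illustrative, not necessarily what the merged change uses; KIP-835 introduced metadata.max.idle.interval.ms, and broker.heartbeat.interval.ms defaults to 2000 ms):

```properties
# Controller appends a NoOpRecord at most this often (KIP-835).
metadata.max.idle.interval.ms=5000
# Brokers send heartbeats this often; keeping this well below the
# no-op interval lets a starting broker's reported offset catch up.
broker.heartbeat.interval.ms=2000
```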

@jsancio
Member

jsancio commented Jun 3, 2022

These are the failing tests:


Build / JDK 8 and Scala 2.12 / org.apache.kafka.connect.integration.RebalanceSourceConnectorsIntegrationTest.testRemovingWorker
Build / JDK 8 and Scala 2.12 / kafka.network.ConnectionQuotasTest.testListenerConnectionRateLimitWhenActualRateAboveLimit()

They look unrelated to KRaft.

@jsancio jsancio self-requested a review June 3, 2022 18:24
@jsancio jsancio added the kraft label Jun 3, 2022
Member

@jsancio jsancio left a comment


LGTM

@jsancio jsancio changed the title KIP-835: MetadataMaxIdleIntervalMs should be much bigger than brokerHeartbeatIntervalMs KAFKA-13955: Fix failing KRaftClusterTest tests Jun 3, 2022
@jsancio jsancio merged commit 0b3ab46 into apache:trunk Jun 3, 2022
@jsancio
Member

jsancio commented Jun 3, 2022

I merged this PR to fix the tests and I created https://issues.apache.org/jira/browse/KAFKA-13959.

@showuon Yes, after we reach a consensus here, we should update the KIP and note it in the KIP discussion thread.

@dengziming @showuon My preference is to fix KAFKA-13959 and revert this commit before we ship 3.3.0.

@dengziming dengziming deleted the KIP-835-fix branch October 8, 2022 12:04