KAFKA-13955: Fix failing KRaftClusterTest tests#12238

Merged
jsancio merged 1 commit into apache:trunk from dengziming:KIP-835-fix on Jun 3, 2022

Conversation

@dengziming
Member

More detailed description of your change
We generate a NoOpRecord periodically to increase the metadata log end offset (LEO). However, when a broker starts up, we wait until its metadata LEO catches up with the controller's LEO, yet we generate a NoOpRecord every 500 ms and send a heartbeat request only every 2000 ms.

It's almost impossible for a broker to catch up with the controller's LEO when it reports its offset only every 2000 ms while the controller's LEO advances every 500 ms, so the tests in KRaftClusterTest fail.
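The timing mismatch can be sketched with a toy simulation (illustrative only, not Kafka code; the 100 ms replication lag is an assumption): because the controller's committed offset advances every 500 ms, an offset the broker observed even slightly before sending its heartbeat is already stale by the time the controller checks it.

```python
# Toy simulation (not Kafka code): why a broker's reported metadata offset
# almost never matches the controller's committed offset under these timings.
# The 100 ms replication lag is an assumption for illustration.
NO_OP_INTERVAL_MS = 500       # controller appends a NoOpRecord this often
HEARTBEAT_INTERVAL_MS = 2000  # broker reports its offset this often
REPLICATION_LAG_MS = 100      # assumed fetch delay between controller and broker

def committed_offset(t_ms):
    # Committed offset grows by one NoOpRecord per interval.
    return t_ms // NO_OP_INTERVAL_MS

unfenced = False
for t in range(0, 60_000, HEARTBEAT_INTERVAL_MS):
    reported = committed_offset(t - REPLICATION_LAG_MS)  # broker's slightly stale view
    required = committed_offset(t)                       # controller's offset right now
    if reported >= required:
        unfenced = True
        break

print("broker ever unfenced:", unfenced)  # always behind -> False
```

With any positive lag, every heartbeat lands just after another NoOpRecord has been committed, so the strict `reported >= required` check never passes.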

[image: screenshot of the failing KRaftClusterTest results]

Summary of testing strategy (including rationale)
After this change, the tests in KRaftClusterTest all succeed.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@dengziming
Member Author

dengziming commented Jun 2, 2022

cc @jsancio @cmccabe
I also found something odd: KRaftClusterTest didn't fail after we merged #12183, which added NoOpRecord; it only started failing after #12207 was merged, in which we made only a minor adjustment.

@showuon
Member

showuon commented Jun 2, 2022

Interesting bug. If we decide in the end to increase the default metadata.max.idle.interval.ms value, we should note it in the KIP discussion thread.

However, I'm wondering whether setting metadata.max.idle.interval.ms too high will delay how quickly we notice an unavailable controller?

@dengziming
Member Author

@showuon Yes, after we reach a consensus here, we should update the KIP and note it in the KIP discussion thread.

@showuon
Member

showuon commented Jun 2, 2022

I created KAFKA-13955 for these failing tests, to avoid duplicated work by other people. (The failures show up in many PR build results.)

@jsancio
Member

jsancio commented Jun 2, 2022

@dengziming Thanks for looking at this. Did you consider fixing the logic for when the controller allows the broker to unfence?

The current implementation requires that request.currentMetadataOffset() >= lastCommittedOffset. As you point out, this is very difficult to achieve on a busy system, whether because of NoOpRecords or a lot of controller operations.

For example, should we allow the broker to be behind by A records (or some amount of time) and still unfence it? Another solution is to require the broker to only catch up to the last committed offset as of when it last sent a heartbeat. For example:

  1. The broker sends a heartbeat with a current offset of Y. The last committed offset is X. The controller remembers this last committed offset; call it X'.
  2. The broker sends another heartbeat with a current offset of Z. Unfence the broker if Z >= X or Z >= X'. What do you think?
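The two-step proposal above can be sketched roughly as follows (a hypothetical illustration with invented names, not the actual controller code):

```python
# Hypothetical sketch of the proposed unfence check (names invented; not the
# real QuorumController code). The controller remembers the committed offset
# it saw at the broker's previous heartbeat (X') and unfences the broker once
# its reported offset Z satisfies Z >= X (current) or Z >= X' (previous).
class BrokerHeartbeatState:
    def __init__(self):
        self.prev_committed = None  # X' from the previous heartbeat, if any

def may_unfence(state, broker_offset, last_committed):
    ok = broker_offset >= last_committed or (
        state.prev_committed is not None and broker_offset >= state.prev_committed
    )
    state.prev_committed = last_committed  # becomes X' for the next heartbeat
    return ok

state = BrokerHeartbeatState()
print(may_unfence(state, broker_offset=3, last_committed=4))  # False: still behind
print(may_unfence(state, broker_offset=5, last_committed=8))  # True: 5 >= previous X' of 4
```

The key property is that X' is frozen at heartbeat time, so NoOpRecords appended between two heartbeats no longer move the goalposts.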

@showuon
Member

showuon commented Jun 3, 2022

should we allow the broker to be behind by A records (or time) to unfence the broker?

I came up with a similar solution yesterday, but I don't think it's a good one, because it might break the expectation/assumption that when a broker is up, it has already caught up with all the metadata changes. There's a chance the A records/time it is behind contain some important metadata update, right?

Another solution is to require the broker to only catch up to the last committed offset when they last sent the heartbeat.

I like this idea. We can make sure the broker has already caught up to the last committed offset as of when the heartbeat was sent, which means all the metadata changes from before the broker started up have been caught up.
+1 for it!

Thank you.

@cmccabe
Contributor

cmccabe commented Jun 3, 2022

Another solution is to require the broker to only catch up to the last committed offset when they last sent the heartbeat.

I like this solution too. However, there are some complexities here (we'd want to make sure the heartbeat wasn't too long ago).

Bumping the no-op timeout might be a good quick fix until we have time to implement that (although I wonder why we can't use 4 seconds rather than 5?)
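In configuration terms, the quick fix amounts to keeping the no-op interval well above the heartbeat interval, along these lines (values illustrative, not necessarily what the merged change uses; KIP-835 introduced metadata.max.idle.interval.ms, and broker.heartbeat.interval.ms defaults to 2000 ms):

```properties
# Controller appends a NoOpRecord at most this often (KIP-835).
metadata.max.idle.interval.ms=5000
# Brokers send heartbeats this often; keeping this well below the
# no-op interval lets a starting broker's reported offset catch up.
broker.heartbeat.interval.ms=2000
```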

@jsancio
Member

jsancio commented Jun 3, 2022

These are the failing tests:


Build / JDK 8 and Scala 2.12 / org.apache.kafka.connect.integration.RebalanceSourceConnectorsIntegrationTest.testRemovingWorker
Build / JDK 8 and Scala 2.12 / kafka.network.ConnectionQuotasTest.testListenerConnectionRateLimitWhenActualRateAboveLimit()

They look unrelated to KRaft.

@jsancio jsancio self-requested a review June 3, 2022 18:24
@jsancio jsancio added the kraft label Jun 3, 2022
Member

@jsancio jsancio left a comment


LGTM

@jsancio jsancio changed the title KIP-835: MetadataMaxIdleIntervalMs should be much bigger than brokerHeartbeatIntervalMs KAFKA-13955: Fix failing KRaftClusterTest tests Jun 3, 2022
@jsancio jsancio merged commit 0b3ab46 into apache:trunk Jun 3, 2022
@jsancio
Member

jsancio commented Jun 3, 2022

I merged this PR to fix the tests and I created https://issues.apache.org/jira/browse/KAFKA-13959.

@showuon Yes, after we reach a consensus here, we should update the KIP and note it in the KIP discussion thread.

@dengziming @showuon My preference is to fix KAFKA-13959 and revert this commit before we ship 3.3.0.

@dengziming dengziming deleted the KIP-835-fix branch October 8, 2022 12:04