KAFKA-7605; Retry async commit failures in integration test cases to fix flaky tests #5890
ijuma merged 6 commits into apache:trunk
|
Well, the one failure in the jdk8 build was the test I'm trying to fix, so just increasing the timeout is not enough 😞. Let me try again to reproduce. |
val numFailuresBeforePoll = commitCallback.failCount
TestUtils.pollUntilTrue(consumer, () => commitCallback.successCount >= count || commitCallback.failCount > numFailuresBeforePoll,
  "Failed to observe commit callback before timeout", waitTimeMs = 10000)
assertEquals("Unexpected async commit failure", numFailuresBeforePoll, commitCallback.failCount)
The flakiness comes from the async commit failing?
That's what is timing out. I was wondering if the commit is actually failing.
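The check discussed above ends the wait as soon as either enough commits have succeeded or a new failure has been recorded, so a failed async commit surfaces as an explicit assertion failure instead of a timeout. Below is a minimal, self-contained sketch of that idea. It does not use Kafka: `CountingCallback`, `pollUntilTrue`, and `awaitCommits` are hypothetical stand-ins written for illustration, not the helpers in the Kafka test utilities.

```scala
object PollSketch {
  // Hypothetical stand-in for a commit callback that only counts outcomes.
  final class CountingCallback {
    var successCount = 0
    var failCount = 0
    def onComplete(exception: Exception): Unit =
      if (exception == null) successCount += 1 else failCount += 1
  }

  // Hypothetical helper: call `poll` repeatedly until `condition` holds
  // or the deadline passes. Returns true if the condition was met in time.
  def pollUntilTrue(poll: () => Unit, condition: () => Boolean, waitTimeMs: Long): Boolean = {
    val deadline = System.currentTimeMillis() + waitTimeMs
    while (!condition() && System.currentTimeMillis() < deadline) poll()
    condition()
  }

  // Waits for `count` successes but stops early if a new failure is recorded.
  // Returns the number of failures observed during the wait.
  def awaitCommits(callback: CountingCallback, count: Int, poll: () => Unit): Int = {
    val failuresBefore = callback.failCount
    val done = pollUntilTrue(poll,
      () => callback.successCount >= count || callback.failCount > failuresBefore,
      waitTimeMs = 1000)
    require(done, "Failed to observe commit callback before timeout")
    callback.failCount - failuresBefore
  }
}
```

A caller would then assert that the returned failure count is zero, which distinguishes "the commit failed" from "the callback never fired".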
|
retest this please |
|
The latest failure confirmed my suspicion: That likely means that an unexpected rebalance has taken place. I will try to enable some additional logging to get to the cause of the rebalance. |
|
retest this please |
|
retest this please |
|
retest this please |
|
retest this please |
|
retest this please |
|
retest this please |
|
retest this please |
|
At long last, the cause is clear: I will add some logic to retry when we get |
Force-pushed from a7d733a to ec64f43
|
retest this please |
|
So this was just a test bug? |
|
@ijuma Yep, the test seemed to have regressed in a recent refactor. We need not name the guilty party though 😝 |
|
retest this please |
|
Three straight successful jdk8 builds. Before, the build was failing nearly every time, with this particular test case failing every couple of runs. |
override def onComplete(offsets: util.Map[TopicPartition, OffsetAndMetadata], exception: Exception): Unit = {
  exception match {
    case null =>
      isComplete = true
Nit: would it be slightly better if we removed this and in the case e clause, changed the last line to error = Option(e)?
Yes, sounds good.
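The nit works because of two Scala behaviors: a typed pattern such as `case e: SomeException` never matches `null`, while a bare `case e` matches anything, and `Option(e)` turns `null` into `None`. A small sketch of that, assuming a hypothetical `RetriableError` class in place of the real Kafka exception type:

```scala
object MatchSketch {
  // Hypothetical stand-in for a retriable commit error; not a Kafka class.
  final class RetriableError(msg: String) extends Exception(msg)

  // Returns (shouldRetry, error). A null exception skips the typed pattern
  // and lands in the catch-all case, where Option(e) yields None.
  def handle(exception: Exception): (Boolean, Option[Exception]) = exception match {
    case _: RetriableError => (true, None)
    case e => (false, Option(e))
  }
}
```

With this shape, a separate `case null` clause is not needed: success and non-retriable failure share one branch and differ only in whether the error is `None` or `Some(e)`.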
|
Java 11 job failure is unrelated: kafka.server.RequestQuotaTest.testResponseThrottleTimeWhenBothFetchAndRequestQuotasViolated. Java 8 job passed. |
|
@hachikuji does this need to be cherry-picked to 2.1 or any other branch? |
|
I merged to trunk, please cherry-pick if needed @hachikuji. |
…fix flaky tests (apache#5890)

We are seeing some timeouts in tests which depend on the awaitCommitCallback (e.g. SaslMultiMechanismConsumerTest.testCoordinatorFailover). After some investigation, we found that it is caused by a disconnect when attempting the async commit. To fix the problem, we have added simple retry logic to the test utility.

Reviewers: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Ismael Juma <ismael@juma.me.uk>
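The retry logic described in the commit message can be illustrated with a self-contained sketch. This is not the code merged in the PR: `FakeConsumer`, `RetriableError`, `RetryCallback`, and `sendAndAwait` are all hypothetical names, and the fake consumer simply fails a fixed number of commits before succeeding, standing in for a transient disconnect.

```scala
object RetrySketch {
  // Hypothetical stand-in for a retriable commit failure such as a disconnect.
  final class RetriableError(msg: String) extends Exception(msg)

  // Fake async client: the first `failures` commits fail with a retriable
  // error, later ones succeed. Callbacks run only when poll() is called.
  final class FakeConsumer(failures: Int) {
    private var attempts = 0
    private val pending = scala.collection.mutable.Queue.empty[() => Unit]
    def attemptCount: Int = attempts

    def commitAsync(callback: Exception => Unit): Unit = {
      attempts += 1
      val result: Exception =
        if (attempts <= failures) new RetriableError("disconnected") else null
      pending.enqueue(() => callback(result))
    }

    def poll(): Unit = if (pending.nonEmpty) pending.dequeue()()
  }

  // Resends the commit on retriable errors; completes on success
  // (null exception) or on any other error.
  final class RetryCallback(consumer: FakeConsumer) extends (Exception => Unit) {
    var isComplete = false
    var error: Option[Exception] = None

    def apply(exception: Exception): Unit = exception match {
      case _: RetriableError => consumer.commitAsync(this)
      case e =>
        isComplete = true
        error = Option(e)
    }
  }

  // Sends one commit and polls until the callback completes
  // or the poll budget runs out.
  def sendAndAwait(consumer: FakeConsumer, maxPolls: Int = 100): RetryCallback = {
    val callback = new RetryCallback(consumer)
    consumer.commitAsync(callback)
    var polls = 0
    while (!callback.isComplete && polls < maxPolls) { consumer.poll(); polls += 1 }
    callback
  }
}
```

A test built this way would assert that `isComplete` is true and `error` is `None` after the wait, so a transient failure is retried while a non-retriable one still fails the test.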