KAFKA-14534: Reduce flakiness in TransactionsExpirationTest (#13036)
showuon merged 2 commits into apache:trunk
Conversation
This test asserts that after a producerId expires and before a transactionId expires, the producerId is reused in a subsequent epoch. The transactionId expiration time used in this test was too short, and a race condition between the two expirations was occasionally causing a new producerId to be returned without an epoch bump. Signed-off-by: Greg Harris <greg.harris@aiven.io>
@jolshan Could you take a look at this test stabilization?
Thanks for looking into this flaky test. I think your change makes sense... if we only have this test in this test suite. But obviously, after your change, it failed other tests:
https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-13036/1/
I'm thinking: if the root cause is that the gap between producerId expiration and transactionId expiration is too small, could we just increase the transactionId expiration from 1 second to maybe 2 seconds? After all, the producerId expiration doesn't cause any test failures here, so we shouldn't change it. WDYT?
Thanks for working on this test. I've been struggling with it (as the author 😅) for a while. I think what Luke says makes sense. The gap between the expiries is too small.
Signed-off-by: Greg Harris <greg.harris@aiven.io>
Thanks, fixed. I was running only the first test locally and didn't check to see what the timeouts in the other test were.
Yes, the CI failures I was seeing were all about the race condition between the test and transactionId expiration. However, while running these tests locally with the CPU throttled to 25-30%, I was also seeing race conditions between the test and the producerId expiration: the test would poll for the producer state, only for it to have already been expired by the broker. I'm lengthening both here so that we don't need a follow-up for that other known flakiness :)
  serverProps.put(KafkaConfig.GroupInitialRebalanceDelayMsProp, "0")
  serverProps.put(KafkaConfig.TransactionsAbortTimedOutTransactionCleanupIntervalMsProp, "200")
- serverProps.put(KafkaConfig.TransactionalIdExpirationMsProp, "1000")
+ serverProps.put(KafkaConfig.TransactionalIdExpirationMsProp, "10000")
Maybe a nit, but was this much of an increase required? (Next time I should write tests that use mock time 😓 )
was this much of an increase required?
The larger the timeouts, the more forgiving the test. I went for the most forgiving timeout that didn't require changing the default 15-second waitUntilTrue timeout. For example, the original 1000 ms timeout may have the following success rates:
- 99% success rate on a developer laptop
- 90% success rate in CI (15 failures out of 150 runs across 25 builds)
- 50% success rate when simulating a 0.3 CPU environment

After changing the timeouts I was seeing a >90% success rate in my 0.3 CPU environment (I didn't run it enough times to confirm precisely), which should hopefully be enough to bring up the CI success rate as well.
Below 0.3 CPU the test flaked out much more severely; for example, the cluster wouldn't come all the way up. If the test logic is as de-flaked as the infrastructure itself, then I consider that good enough.
Next time I should write tests that use mock time
Yes, if possible you should always prefer a mock-time test. When I'm debugging flaky tests, they're nearly always flaky because of time passing strangely. In a mocked-time environment you are completely isolated from the speed of the JVM.
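To illustrate the mock-time idea, here is a minimal sketch in Java. The names (`MockTime`, `ProducerIdEntry`) are illustrative, not Kafka's actual classes: the point is that the clock only advances when the test says so, so an expiration check can never race against real wall-clock time.

```java
// A virtual clock: time advances only when the test tells it to.
class MockTime {
    private long nowMs;

    MockTime(long startMs) { this.nowMs = startMs; }

    long milliseconds() { return nowMs; }

    // "Sleeping" advances virtual time instantly: no real waiting, no scheduler jitter.
    void sleep(long ms) { nowMs += ms; }
}

// A hypothetical expirable entry, checked against the virtual clock.
class ProducerIdEntry {
    final long producerId;
    final long lastUsedMs;

    ProducerIdEntry(long producerId, long lastUsedMs) {
        this.producerId = producerId;
        this.lastUsedMs = lastUsedMs;
    }

    boolean isExpired(MockTime time, long expirationMs) {
        return time.milliseconds() - lastUsedMs >= expirationMs;
    }
}

public class MockTimeDemo {
    public static void main(String[] args) {
        MockTime time = new MockTime(0L);
        ProducerIdEntry entry = new ProducerIdEntry(42L, time.milliseconds());

        // Deterministic: the entry stays alive until the test advances the clock.
        System.out.println(entry.isExpired(time, 10_000L)); // false
        time.sleep(10_000L);
        System.out.println(entry.isExpired(time, 10_000L)); // true
    }
}
```

With this pattern, "wait for expiration" becomes a single `time.sleep(...)` call that is exact regardless of how slowly the JVM is running.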
One last off-topic comment: When writing a test, think about "if someone pressed pause on this part of the test for a long time, would the test still pass?" Because that's what the effect of a CPU limited environment is: threads are de-scheduled and stop executing, effectively pausing in between lines of code.
In this case, the test paused between the producer calls and the producerState call for more than a second, and the expiration on the broker fired first. Now it's much less likely that it will be paused for more than 10 seconds, but there's still always a chance. If this were a mocked-time environment, you could be certain that producerState would execute before the transaction expiration, and there wouldn't be any room for flakiness.
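The wall-clock dependence above can be seen in a waitUntilTrue-style polling helper. This is a sketch, not Kafka's actual `TestUtils.waitUntilTrue`: the helper polls against `System.currentTimeMillis()`, so if the JVM is descheduled between the action under test and the first poll for longer than the broker-side expiration, the state is gone before the condition is ever checked.

```java
import java.util.function.BooleanSupplier;

public class WaitUntil {
    // Polls a condition against the wall clock until it holds or the deadline passes.
    // Any real-time pause (GC, CPU starvation) counts against the deadline.
    static boolean waitUntilTrue(BooleanSupplier condition, long timeoutMs, long pollMs) {
        long deadlineMs = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadlineMs) {
                return false; // deadline passed without the condition holding
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long startMs = System.currentTimeMillis();
        // A condition that becomes true after ~200 ms of wall-clock time.
        boolean ok = waitUntilTrue(() -> System.currentTimeMillis() - startMs >= 200, 5_000, 50);
        System.out.println(ok);
    }
}
```

Widening the expiration from 1 s to 10 s doesn't remove this race; it just shrinks the probability that a pause spans the whole window.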
Feel free to file a jira to make it mocked time 😄
val oldProducerId = pState(0).producerId
var pState: List[ProducerState] = null
TestUtils.waitUntilTrue(() => { pState = producerState; pState.nonEmpty }, "Producer IDs for topic1 did not propagate quickly")
assertEquals(1, pState.size, "Unexpected producer to topic1")
The error message here seems to expect more than one producer?
The message is only printed when the size assertion fails, and the waitUntilTrue has already asserted that the list is non-empty.
The only way the assertion can fail is if there is more than one producer, which is what the error message calls out.
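The reasoning can be sketched in Java (illustrative names, not the test's actual helpers): once the wait has guaranteed a non-empty list, a size mismatch in the follow-up assertion can only mean more than one producer.

```java
import java.util.List;

public class ProducerStateCheck {
    // waitUntilTrue has already guaranteed producerIds is non-empty, so
    // size != 1 here can only mean size > 1: an unexpected extra producer.
    static void assertSingleProducer(List<Long> producerIds) {
        if (producerIds.isEmpty()) {
            throw new AssertionError("unreachable once waitUntilTrue has passed");
        }
        if (producerIds.size() != 1) {
            throw new AssertionError("Unexpected producer to topic1: " + producerIds);
        }
    }

    public static void main(String[] args) {
        assertSingleProducer(List.of(42L)); // exactly one producer: passes silently
        System.out.println("ok");
    }
}
```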
Ah ok -- that makes sense. I guess there's no crazy race where it would expire again now. :)
showuon left a comment
LGTM!
Triggering another CI build to make sure the test is reliable now.
https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-13036/3/
No failed tests in
KAFKA-14534: Reduce flakiness in TransactionsExpirationTest (#13036)

This test asserts that after a producerId expires and before a transactionId expires, the producerId is reused in a subsequent epoch. The transactionId expiration time used in this test was too short, and a race condition between the two expirations was occasionally causing a new producerId to be returned without an epoch bump.

Reviewers: Luke Chen <showuon@gmail.com>, Justine Olshan <jolshan@confluent.io>
Signed-off-by: Greg Harris <greg.harris@aiven.io>