Skip to content

KAFKA-14534: Reduce flakiness in TransactionsExpirationTest#13036

Merged
showuon merged 2 commits intoapache:trunkfrom
gharris1727:kafka-14534-transaction-flake
Dec 23, 2022
Merged

KAFKA-14534: Reduce flakiness in TransactionsExpirationTest#13036
showuon merged 2 commits intoapache:trunkfrom
gharris1727:kafka-14534-transaction-flake

Conversation

@gharris1727
Copy link
Copy Markdown
Contributor

This test asserts that after a producerId expires and before a transactionId expires, the producerId is reused in a subsequent epoch.
The transactionId expiration time used in this test was too short, and a race condition between the two expirations was occasionally causing a new producerId to be returned without an epoch bump.

Signed-off-by: Greg Harris greg.harris@aiven.io

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

This test asserts that after a producerId expires and before a
transactionId expires, the producerId is reused in a subsequent epoch.
The transactionId expiration time used in this test was too short,
and a race condition between the two expirations was occasionally
causing a new producerId to be returned without an epoch bump.

Signed-off-by: Greg Harris <greg.harris@aiven.io>
@gharris1727
Copy link
Copy Markdown
Contributor Author

@jolshan Could you take a look at this test stabilization?

Copy link
Copy Markdown
Member

@showuon showuon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this flaky test. I think your change makes sense... if we only have this test in this test suite. But obviously, after your change, it failed other tests:
https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-13036/1/

I'm thinking, if the root cause is the gap between producerID expiration and transactionID expiration is too small, could we just increase the transactionID from 1 sec to maybe 2 sec? After all, the producerID expiration doesn't cause any test failure here, we shouldn't change it. WDYT?

Comment thread core/src/test/scala/integration/kafka/api/TransactionsExpirationTest.scala Outdated
@jolshan
Copy link
Copy Markdown
Member

jolshan commented Dec 22, 2022

Thanks for working on this test. I've been struggling with it (as the author 😅 ) for a while.

I think what Luke says makes sense. The gap between the expiries is too small.

Signed-off-by: Greg Harris <greg.harris@aiven.io>
@gharris1727
Copy link
Copy Markdown
Contributor Author

gharris1727 commented Dec 22, 2022

after your change, it failed other tests:

Thanks, fixed. I was running only the first test locally and didn't check to see what the timeouts in the other test were.

After all, the producerID expiration doesn't cause any test failure here, we shouldn't change it. WDYT?

Yes, the CI failures I was seeing were all about the race condition between the test and transactionId expiration. However, while I was running these tests under a local CPU slowdown to 25-30%, I was seeing race conditions happening between the test and the producerId expiration. The test would poll for the producer state, only for it to have already been expired by the broker. I'm lengthening both here so that we don't need a follow-up for that other known flakiness :)

@gharris1727 gharris1727 requested a review from showuon December 22, 2022 17:09
serverProps.put(KafkaConfig.GroupInitialRebalanceDelayMsProp, "0")
serverProps.put(KafkaConfig.TransactionsAbortTimedOutTransactionCleanupIntervalMsProp, "200")
serverProps.put(KafkaConfig.TransactionalIdExpirationMsProp, "1000")
serverProps.put(KafkaConfig.TransactionalIdExpirationMsProp, "10000")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a nit, but was this much of an increase required? (Next time I should write tests that use mock time 😓 )

Copy link
Copy Markdown
Contributor Author

@gharris1727 gharris1727 Dec 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this much of an increase required?

The larger the timeouts, the more forgiving the test. I went for the most forgiving timeout that didn't require changing the default 15 second waitUntilTrue timeout. For example the given 1000ms timeout may have the following success rates:

  • 99% success rate on a developer laptop
  • 90% success rate in CI. (15 failures out of 150 runs in 25 builds)
  • 50% success rate when simulating a 0.3 CPU environment

After changing the timeouts I was seeing >90% success rate in my 0.3 CPU environment (i didn't run it too much to confirm), which should hopefully be enough to bring up the CI success rate as well.

Below 0.3 CPU the test flaked out much more severely, like the cluster wouldn't come all the way up, etc. If the test logic is as de-flaked as the infrastructure itself then I consider that good enough.

Next time I should write tests that use mock time

Yes, if possible you should always prefer a mock time test. When I'm debugging flakey tests, they're nearly always flakey because of time passing strangely. In a mocked time environment you are completely isolated from the speed of the JVM.

One last off-topic comment: When writing a test, think about "if someone pressed pause on this part of the test for a long time, would the test still pass?" Because that's what the effect of a CPU limited environment is: threads are de-scheduled and stop executing, effectively pausing in between lines of code.

In this case, the test paused between the producer calls and the producerState call for more than a second, and the expiration on the broker fired first. Now it's much less likely that it will be paused for more than 10 seconds, but there's still always a chance. If this was a mocked time environment, and you could be certain that the producerState would be executed before the transaction expiration, and there wouldn't be any room for flakiness.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to file a jira to make it mocked time 😄

val oldProducerId = pState(0).producerId
var pState : List[ProducerState] = null
TestUtils.waitUntilTrue(() => { pState = producerState; pState.nonEmpty}, "Producer IDs for topic1 did not propagate quickly")
assertEquals(1, pState.size, "Unexpected producer to topic1")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message here seems to expect more than one producer?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message is only printed when the state assertion fails, and the waitUntilTrue has already asserted that it is > 0.
The only case that this could be true is if there are more than one producer, which the error message calls out.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok -- that makes sense. I guess there's no crazy race where it would expire again now. :)

Copy link
Copy Markdown
Member

@showuon showuon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!
Triggering another CI build to make sure it makes test reliable.
https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-13036/3/

@showuon
Copy link
Copy Markdown
Member

showuon commented Dec 23, 2022

No failed tests in TransactionsExpirationTest for 2 CI build in a row.

@showuon showuon merged commit 94c6d64 into apache:trunk Dec 23, 2022
guozhangwang pushed a commit to guozhangwang/kafka that referenced this pull request Jan 25, 2023
…3036)

This test asserts that after a producerId expires and before a transactionId expires, the producerId is reused in a subsequent epoch.
The transactionId expiration time used in this test was too short, and a race condition between the two expirations was occasionally causing a new producerId to be returned without an epoch bump.

Reviewers: Luke Chen <showuon@gmail.com>, Justine Olshan <jolshan@confluent.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants