KAFKA-14534: Reduce flakiness in TransactionsExpirationTest (#13036)
showuon merged 2 commits into apache:trunk
Conversation
This test asserts that after a producerId expires and before a transactionId expires, the producerId is reused in a subsequent epoch. The transactionId expiration time used in this test was too short, and a race condition between the two expirations was occasionally causing a new producerId to be returned without an epoch bump. Signed-off-by: Greg Harris <greg.harris@aiven.io>
@jolshan Could you take a look at this test stabilization?
Thanks for looking into this flaky test. I think your change makes sense... if we only have this test in this test suite. But obviously, after your change, it failed other tests:
https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-13036/1/
I'm thinking: if the root cause is that the gap between producerId expiration and transactionId expiration is too small, could we just increase the transactionId expiration from 1 second to maybe 2 seconds? After all, the producerId expiration doesn't cause any test failures here, so we shouldn't change it. WDYT?
Thanks for working on this test. I've been struggling with it (as the author 😅) for a while. I think what Luke says makes sense. The gap between the expiries is too small.
Signed-off-by: Greg Harris <greg.harris@aiven.io>
Thanks, fixed. I was running only the first test locally and didn't check to see what the timeouts in the other test were.
Yes, the CI failures I was seeing were all about the race condition between the test and transactionId expiration. However, while running these tests locally with the CPU throttled to 25-30%, I was also seeing race conditions between the test and the producerId expiration: the test would poll for the producer state, only for it to have already been expired by the broker. I'm lengthening both here so that we don't need a follow-up for that other known flakiness :)
  serverProps.put(KafkaConfig.GroupInitialRebalanceDelayMsProp, "0")
  serverProps.put(KafkaConfig.TransactionsAbortTimedOutTransactionCleanupIntervalMsProp, "200")
- serverProps.put(KafkaConfig.TransactionalIdExpirationMsProp, "1000")
+ serverProps.put(KafkaConfig.TransactionalIdExpirationMsProp, "10000")
Maybe a nit, but was this much of an increase required? (Next time I should write tests that use mock time 😓 )
was this much of an increase required?
The larger the timeouts, the more forgiving the test. I went for the most forgiving timeout that didn't require changing the default 15-second waitUntilTrue timeout. For example, the original 1000 ms timeout may have the following success rates:
- 99% success rate on a developer laptop
- 90% success rate in CI (15 failures out of 150 runs across 25 builds)
- 50% success rate when simulating a 0.3 CPU environment

After changing the timeouts I was seeing a >90% success rate in my 0.3 CPU environment (I didn't run it enough times to confirm precisely), which should hopefully be enough to bring up the CI success rate as well.
Below 0.3 CPU the test flaked out much more severely; for example, the cluster wouldn't come all the way up. If the test logic is as de-flaked as the infrastructure itself, then I consider that good enough.
Next time I should write tests that use mock time
Yes, if possible you should always prefer a mock-time test. When I'm debugging flaky tests, they're nearly always flaky because of time passing strangely. In a mocked-time environment you are completely isolated from the speed of the JVM.
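To illustrate the mock-time idea, here is a minimal sketch in Java. The names (`MockTime`, `ProducerIdEntry`) are illustrative, not Kafka's actual classes: the point is that the clock only advances when the test says so, so an expiration check can never race against real wall-clock time.

```java
// A virtual clock: time advances only when the test tells it to.
class MockTime {
    private long nowMs;

    MockTime(long startMs) { this.nowMs = startMs; }

    long milliseconds() { return nowMs; }

    // "Sleeping" advances virtual time instantly: no real waiting, no scheduler jitter.
    void sleep(long ms) { nowMs += ms; }
}

// A hypothetical expirable entry, checked against the virtual clock.
class ProducerIdEntry {
    final long producerId;
    final long lastUsedMs;

    ProducerIdEntry(long producerId, long lastUsedMs) {
        this.producerId = producerId;
        this.lastUsedMs = lastUsedMs;
    }

    boolean isExpired(MockTime time, long expirationMs) {
        return time.milliseconds() - lastUsedMs >= expirationMs;
    }
}

public class MockTimeDemo {
    public static void main(String[] args) {
        MockTime time = new MockTime(0L);
        ProducerIdEntry entry = new ProducerIdEntry(42L, time.milliseconds());

        // Deterministic: the entry stays alive until the test advances the clock.
        System.out.println(entry.isExpired(time, 10_000L)); // false
        time.sleep(10_000L);
        System.out.println(entry.isExpired(time, 10_000L)); // true
    }
}
```

With this pattern, "wait for expiration" becomes a single `time.sleep(...)` call that is exact regardless of how slowly the JVM is running.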
One last off-topic comment: When writing a test, think about "if someone pressed pause on this part of the test for a long time, would the test still pass?" Because that's what the effect of a CPU limited environment is: threads are de-scheduled and stop executing, effectively pausing in between lines of code.
In this case, the test paused between the producer calls and the producerState call for more than a second, and the expiration on the broker fired first. Now it's much less likely that it will be paused for more than 10 seconds, but there's still always a chance. If this were a mocked-time environment, you could be certain that producerState would execute before the transaction expiration, and there wouldn't be any room for flakiness.
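The wall-clock dependence above can be seen in a waitUntilTrue-style polling helper. This is a sketch, not Kafka's actual `TestUtils.waitUntilTrue`: the helper polls against `System.currentTimeMillis()`, so if the JVM is descheduled between the action under test and the first poll for longer than the broker-side expiration, the state is gone before the condition is ever checked.

```java
import java.util.function.BooleanSupplier;

public class WaitUntil {
    // Polls a condition against the wall clock until it holds or the deadline passes.
    // Any real-time pause (GC, CPU starvation) counts against the deadline.
    static boolean waitUntilTrue(BooleanSupplier condition, long timeoutMs, long pollMs) {
        long deadlineMs = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadlineMs) {
                return false; // deadline passed without the condition holding
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long startMs = System.currentTimeMillis();
        // A condition that becomes true after ~200 ms of wall-clock time.
        boolean ok = waitUntilTrue(() -> System.currentTimeMillis() - startMs >= 200, 5_000, 50);
        System.out.println(ok);
    }
}
```

Widening the expiration from 1 s to 10 s doesn't remove this race; it just shrinks the probability that a pause spans the whole window.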
Feel free to file a jira to make it mocked time 😄
val oldProducerId = pState(0).producerId
var pState: List[ProducerState] = null
TestUtils.waitUntilTrue(() => { pState = producerState; pState.nonEmpty }, "Producer IDs for topic1 did not propagate quickly")
assertEquals(1, pState.size, "Unexpected producer to topic1")
The error message here seems to expect more than one producer?
The message is only printed when the size assertion fails, and the waitUntilTrue has already asserted that the list is non-empty.
The only way the assertion can fail is if there is more than one producer, which is what the error message calls out.
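The reasoning can be sketched in Java (illustrative names, not the test's actual helpers): once the wait has guaranteed a non-empty list, a size mismatch in the follow-up assertion can only mean more than one producer.

```java
import java.util.List;

public class ProducerStateCheck {
    // waitUntilTrue has already guaranteed producerIds is non-empty, so
    // size != 1 here can only mean size > 1: an unexpected extra producer.
    static void assertSingleProducer(List<Long> producerIds) {
        if (producerIds.isEmpty()) {
            throw new AssertionError("unreachable once waitUntilTrue has passed");
        }
        if (producerIds.size() != 1) {
            throw new AssertionError("Unexpected producer to topic1: " + producerIds);
        }
    }

    public static void main(String[] args) {
        assertSingleProducer(List.of(42L)); // exactly one producer: passes silently
        System.out.println("ok");
    }
}
```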
Ah ok -- that makes sense. I guess there's no crazy race where it would expire again now. :)
showuon left a comment
LGTM!
Triggering another CI build to make sure the test is reliable now.
https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-13036/3/
No failed tests in
KAFKA-14534: Reduce flakiness in TransactionsExpirationTest (#13036)

This test asserts that after a producerId expires and before a transactionId expires, the producerId is reused in a subsequent epoch. The transactionId expiration time used in this test was too short, and a race condition between the two expirations was occasionally causing a new producerId to be returned without an epoch bump.

Reviewers: Luke Chen <showuon@gmail.com>, Justine Olshan <jolshan@confluent.io>
Signed-off-by: Greg Harris <greg.harris@aiven.io>