KAFKA-18035, KAFKA-18306, KAFKA-18092: Address TransactionsTest flaky tests#18264
KAFKA-18035, KAFKA-18306, KAFKA-18092: Address TransactionsTest flaky tests#18264jolshan merged 2 commits intoapache:trunkfrom
Conversation
CalvinLiu7947
left a comment
There was a problem hiding this comment.
Thanks for the change! LGTM
| // Second transaction: abort | ||
| producer.beginTransaction() | ||
| val successfulFuture = producer.send(TestUtils.producerRecordWithExpectedTransactionStatus(topic1, null, "2", "2", willBeCommitted = false)) | ||
| TestUtils.waitUntilTrue(successfulFuture.isDone, "First message of the second transaction did not complete its send.") |
There was a problem hiding this comment.
nit: any reason to use a waitUntilTrue here instead of a future.get()`?
a waitUntilTrue will return even if the send completed exceptionally, whereas presumably you'd want that to fail immediately.
There was a problem hiding this comment.
That's fair. Can make the change.
There was a problem hiding this comment.
I guess the benefit is that future.get could block for a while -- not sure if there is a global test timeout, but maybe the method itself should have a timeout too?
There was a problem hiding this comment.
Yeah, I think for the tests it's more important to recover if the tested code gets stuck, than to be optimal from the perf perspective (obviously .get is more efficient because it would just block the thread, vs. waitUntilTrue would just loop and poll). If we want to fail on send failure (not sure if this is the goal of the test) we can use .get after waitUntilTrue.
There was a problem hiding this comment.
you can make a time-bounded get on the future using TestUtils.DEFAULT_MAX_WAIT_MS.
i think in practice, the producer's future will be bounded to at most the delivery timeout.
There was a problem hiding this comment.
Oh I missed the continued discussion. I just set the wait on the get to 20 seconds like some of the others in the file. The waitUntilTrue default is 15 seconds so I think this is sufficient.
There was a problem hiding this comment.
i think the main benefit of the get() is that if the send does fail with an exception when you otherwise expect it to succeed, the exception gets thrown and fails the test immediately, rather than continuing with incorrect preconditons.
… tests (#18264) A lot of these tests assumed that the commit/abort happened immediately. Spoiler alert -- it does not. For some I ensure that the first send of the next transaction is successful before grabbing the epoch. I also loosened some checks since we don't need to guarantee the exact epoch. I went back and forth with completely deleting testBumpTransactionalEpochWithTV2Enabled since we don't have client side epoch bumps with V2 (which is what the test was originally testing), but I opted to keep it to just confirm the epoch on each transaction -- even in the timeout scenario. Reviewers: Calvin Liu <caliu@confluent.io>, Artem Livshits <alivshits@confluent.io>, Jeff Kim <jeff.kim@confluent.io>, David Mao <dmao@confluent.io>
|
Picked to 4.0 as well. |
…e-old-protocol-versions * apache-github/trunk: KAFKA-18312: Added entityType: topicName to SubscribedTopicNames in ShareGroupHeartbeatRequest.json (apache#18285) HOTFIX: fix incompatible types: Optional<TimestampAndOffset> cannot be converted to Option<TimestampAndOffset> (apache#18284) MINOR Fix some test-catalog issues (apache#18272) KAFKA-18180: Move OffsetResultHolder to storage module (apache#18100) KAFKA-18301; Make coordinator records first class citizen (apache#18261) KAFKA-18262 Remove DefaultPartitioner and UniformStickyPartitioner (apache#18204) KAFKA-18296 Remove deprecated KafkaBasedLog constructor (apache#18257) KAFKA-12829: Remove old Processor and ProcessorSupplier interfaces (apache#18238) KAFKA-18292 Remove deprecated methods of UpdateFeaturesOptions (apache#18245) KAFKA-12829: Remove deprecated Topology#addProcessor of old Processor API (apache#18154) KAFKA-18035, KAFKA-18306, KAFKA-18092: Address TransactionsTest flaky tests (apache#18264) MINOR: change the default linger time in the new coordinator (apache#18274) KAFKA-18305: validate controller.listener.names is not in inter.broker.listener.name for kcontrollers (apache#18222) KAFKA-18207: Serde for handling transaction records (apache#18136) KAFKA-13722: Refactor Kafka Streams store interfaces (apache#18243) KAFKA-17131: Refactor TimeDefinitions (apache#18241) MINOR: Fix MessageFormatters (apache#18266) Mark flaky tests for Dec 18, 2024 (apache#18263)
… tests (apache#18264) A lot of these tests assumed that the commit/abort happened immediately. Spoiler alert -- it does not. For some I ensure that the first send of the next transaction is successful before grabbing the epoch. I also loosened some checks since we don't need to guarantee the exact epoch. I went back and forth with completely deleting testBumpTransactionalEpochWithTV2Enabled since we don't have client side epoch bumps with V2 (which is what the test was originally testing), but I opted to keep it to just confirm the epoch on each transaction -- even in the timeout scenario. Reviewers: Calvin Liu <caliu@confluent.io>, Artem Livshits <alivshits@confluent.io>, Jeff Kim <jeff.kim@confluent.io>, David Mao <dmao@confluent.io>
A lot of these tests assumed that the commit/abort happened immediately. Spoiler alert -- it does not.
For some I ensure that the first send of the next transaction is successful before grabbing the epoch. I also loosened some checks since we don't need to guarantee the exact epoch.
I went back and forth with completely deleting
testBumpTransactionalEpochWithTV2Enabledsince we don't have client side epoch bumps with V2 (which is what the test was originally testing), but I opted to keep it to just confirm the epoch on each transaction -- even in the timeout scenario.