Skip to content

KAFKA-18035, KAFKA-18306, KAFKA-18092: Address TransactionsTest flaky tests#18264

Merged
jolshan merged 2 commits intoapache:trunkfrom
jolshan:KAFKA-18035
Dec 19, 2024
Merged

KAFKA-18035, KAFKA-18306, KAFKA-18092: Address TransactionsTest flaky tests#18264
jolshan merged 2 commits intoapache:trunkfrom
jolshan:KAFKA-18035

Conversation

@jolshan
Copy link
Copy Markdown
Member

@jolshan jolshan commented Dec 18, 2024

A lot of these tests assumed that the commit/abort happened immediately. Spoiler alert -- it does not.

For some I ensure that the first send of the next transaction is successful before grabbing the epoch. I also loosened some checks since we don't need to guarantee the exact epoch.

I went back and forth with completely deleting testBumpTransactionalEpochWithTV2Enabled since we don't have client side epoch bumps with V2 (which is what the test was originally testing), but I opted to keep it to just confirm the epoch on each transaction -- even in the timeout scenario.

@github-actions github-actions Bot added core Kafka Broker tests Test fixes (including flaky tests) small Small PRs labels Dec 18, 2024
Copy link
Copy Markdown
Contributor

@CalvinLiu7947 CalvinLiu7947 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change! LGTM

Copy link
Copy Markdown
Contributor

@artemlivshits artemlivshits left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

// Second transaction: abort
producer.beginTransaction()
val successfulFuture = producer.send(TestUtils.producerRecordWithExpectedTransactionStatus(topic1, null, "2", "2", willBeCommitted = false))
TestUtils.waitUntilTrue(successfulFuture.isDone, "First message of the second transaction did not complete its send.")
Copy link
Copy Markdown
Contributor

@splett2 splett2 Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: any reason to use a waitUntilTrue here instead of a future.get()`?

a waitUntilTrue will return even if the send completed exceptionally, whereas presumably you'd want that to fail immediately.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. Can make the change.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the benefit is that future.get could block for a while -- not sure if there is a global test timeout, but maybe the method itself should have a timeout too?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think for the tests it's more important to recover if the tested code gets stuck, than to be optimal from the perf perspective (obviously .get is more efficient because it would just block the thread, vs. waitUntilTrue would just loop and poll). If we want to fail on send failure (not sure if this is the goal of the test) we can use .get after waitUntilTrue.

Copy link
Copy Markdown
Contributor

@splett2 splett2 Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can make a time-bounded get on the future using TestUtils.DEFAULT_MAX_WAIT_MS.

i think in practice, the producer's future will be bounded to at most the delivery timeout.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I missed the continued discussion. I just set the wait on the get to 20 seconds like some of the others in the file. The waitUntilTrue default is 15 seconds so I think this is sufficient.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the main benefit of the get() is that if the send does fail with an exception when you otherwise expect it to succeed, the exception gets thrown and fails the test immediately, rather than continuing with incorrect preconditons.

@jolshan jolshan merged commit e099fce into apache:trunk Dec 19, 2024
jolshan added a commit that referenced this pull request Dec 19, 2024
… tests (#18264)

A lot of these tests assumed that the commit/abort happened immediately. Spoiler alert -- it does not.

For some I ensure that the first send of the next transaction is successful before grabbing the epoch. I also loosened some checks since we don't need to guarantee the exact epoch.

I went back and forth with completely deleting testBumpTransactionalEpochWithTV2Enabled since we don't have client side epoch bumps with V2 (which is what the test was originally testing), but I opted to keep it to just confirm the epoch on each transaction -- even in the timeout scenario.

Reviewers: Calvin Liu <caliu@confluent.io>, Artem Livshits <alivshits@confluent.io>, Jeff Kim <jeff.kim@confluent.io>, David Mao <dmao@confluent.io>
@jolshan
Copy link
Copy Markdown
Member Author

jolshan commented Dec 19, 2024

Picked to 4.0 as well.

ijuma added a commit to ijuma/kafka that referenced this pull request Dec 20, 2024
…e-old-protocol-versions

* apache-github/trunk:
  KAFKA-18312: Added entityType: topicName to SubscribedTopicNames in ShareGroupHeartbeatRequest.json (apache#18285)
  HOTFIX: fix incompatible types: Optional<TimestampAndOffset> cannot be converted to Option<TimestampAndOffset> (apache#18284)
  MINOR Fix some test-catalog issues (apache#18272)
  KAFKA-18180: Move OffsetResultHolder to storage module (apache#18100)
  KAFKA-18301; Make coordinator records first class citizen (apache#18261)
  KAFKA-18262 Remove DefaultPartitioner and UniformStickyPartitioner (apache#18204)
  KAFKA-18296 Remove deprecated KafkaBasedLog constructor (apache#18257)
  KAFKA-12829: Remove old Processor and ProcessorSupplier interfaces (apache#18238)
  KAFKA-18292 Remove deprecated methods of UpdateFeaturesOptions (apache#18245)
  KAFKA-12829: Remove deprecated Topology#addProcessor of old Processor API (apache#18154)
  KAFKA-18035, KAFKA-18306, KAFKA-18092: Address TransactionsTest flaky tests (apache#18264)
  MINOR: change the default linger time in the new coordinator (apache#18274)
  KAFKA-18305: validate controller.listener.names is not in inter.broker.listener.name for kcontrollers (apache#18222)
  KAFKA-18207: Serde for handling transaction records (apache#18136)
  KAFKA-13722: Refactor Kafka Streams store interfaces (apache#18243)
  KAFKA-17131: Refactor TimeDefinitions (apache#18241)
  MINOR: Fix MessageFormatters (apache#18266)
  Mark flaky tests for Dec 18, 2024 (apache#18263)
tedyu pushed a commit to tedyu/kafka that referenced this pull request Jan 6, 2025
… tests (apache#18264)

A lot of these tests assumed that the commit/abort happened immediately. Spoiler alert -- it does not.

For some I ensure that the first send of the next transaction is successful before grabbing the epoch. I also loosened some checks since we don't need to guarantee the exact epoch.

I went back and forth with completely deleting testBumpTransactionalEpochWithTV2Enabled since we don't have client side epoch bumps with V2 (which is what the test was originally testing), but I opted to keep it to just confirm the epoch on each transaction -- even in the timeout scenario.

Reviewers: Calvin Liu <caliu@confluent.io>, Artem Livshits <alivshits@confluent.io>, Jeff Kim <jeff.kim@confluent.io>, David Mao <dmao@confluent.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Kafka Broker small Small PRs tests Test fixes (including flaky tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants