Skip to content

KAFKA-9802; Increase transaction timeout in system tests to reduce flakiness#8736

Merged
hachikuji merged 2 commits intoapache:trunkfrom
hachikuji:KAFKA-9802
May 28, 2020
Merged

KAFKA-9802; Increase transaction timeout in system tests to reduce flakiness#8736
hachikuji merged 2 commits intoapache:trunkfrom
hachikuji:KAFKA-9802

Conversation

@hachikuji
Copy link
Copy Markdown
Contributor

@hachikuji hachikuji commented May 27, 2020

We have been seeing increased flakiness in transaction system tests. I believe the cause might be due to KIP-537, which increased the default zk session timeout from 6s to 18s and the default replica lag timeout from 10s to 30s. In the system test, we use the default transaction timeout of 10s. However, since the system test involves hard failures, the Produce request could be blocking for as long as the max of these two in order to wait for an ISR shrink. Hence this patch increases the timeout to 30s.

Note this patch also includes a minor logging fix in Partition. Previously we would see messages like the following:

[Broker id=3] Leader output-topic-0 starts at leader epoch 0 from offset 0 with high watermark 0 ISR 3,2,1 addingReplicas  removingReplicas .Previous leader epoch was -1.

This patch fixes the log to print as the following:

[Broker id=3] Leader output-topic-0 starts at leader epoch 0 from offset 0 with high watermark 0 ISR [3,2,1] addingReplicas []  removingReplicas []. Previous leader epoch was -1.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Copy link
Copy Markdown
Member

@ijuma ijuma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks. There were two spaces before addingReplicas in the fixed log output. Is that correct or an issue during copy and paste? Looking at the code, it seemed like there should be a single space.

@hachikuji
Copy link
Copy Markdown
Contributor Author

Thanks, it was just a copy/paste bug I think. I tweaked the output a little.

Copy link
Copy Markdown
Contributor

@bob-barrett bob-barrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix looks good to me, I just had one question about the new timeout value

self.num_seed_messages = 100000
self.transaction_size = 750
self.transaction_timeout = 10000
self.transaction_timeout = 30000
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the producer can block for up to 30 seconds, do we want a slightly longer transaction timeout, just to have some buffer?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. Maybe we can add 5-10s padding.

@hachikuji hachikuji merged commit d9fe30d into apache:trunk May 28, 2020
hachikuji added a commit that referenced this pull request May 28, 2020
…akiness (#8736)

We have been seeing increased flakiness in transaction system tests. I believe the cause might be due to KIP-537, which increased the default zk session timeout from 6s to 18s and the default replica lag timeout from 10s to 30s. In the system test, we use the default transaction timeout of 10s. However, since the system test involves hard failures, the Produce request could be blocking for as long as the max of these two in order to wait for an ISR shrink. Hence this patch increases the timeout to 30s.

Reviewers: Bob Barrett <bob.barrett@confluent.io>, Ismael Juma <github@juma.me.uk>
Kvicii pushed a commit to Kvicii/kafka that referenced this pull request May 30, 2020
* 'trunk' of github.com:apache/kafka: (36 commits)
  Remove redundant `containsKey` call in KafkaProducer (apache#8761)
  KAFKA-9494; Include additional metadata information in DescribeConfig response (KIP-569) (apache#8723)
  KAFKA-10061; Fix flaky `ReassignPartitionsIntegrationTest.testCancellation` (apache#8749)
  KAFKA-9130; KIP-518 Allow listing consumer groups per state (apache#8238)
  KAFKA-9501: convert between active and standby without closing stores (apache#8248)
  KAFKA-10056; Ensure consumer metadata contains new topics on subscription change (apache#8739)
  MINOR: Log the reason for coordinator discovery failure (apache#8747)
  KAFKA-10029; Don't update completedReceives when channels are closed to avoid ConcurrentModificationException (apache#8705)
  MINOR: remove unnecessary timeout for admin request (apache#8738)
  MINOR: Relax Percentiles test (apache#8748)
  MINOR: regression test for task assignor config (apache#8743)
  MINOR: Update documentation.html to refer to 2.6 (apache#8745)
  MINOR: Update documentation.html to refer to 2.5 (apache#8744)
  KAFKA-9673: Filter and Conditional SMTs (apache#8699)
  KAFKA-9971: Error Reporting in Sink Connectors (KIP-610) (apache#8720)
  KAFKA-10052: Harden assertion of topic settings in Connect integration tests (apache#8735)
  MINOR: Slight MetadataCache tweaks to avoid unnecessary work (apache#8728)
  KAFKA-9802; Increase transaction timeout in system tests to reduce flakiness (apache#8736)
  KAFKA-10050: kafka_log4j_appender.py fixed for JDK11 (apache#8731)
  KAFKA-9146: Add option to force delete active members in StreamsResetter (apache#8589)
  ...

# Conflicts:
#	core/src/main/scala/kafka/log/Log.scala
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants