Increase acceptable duration time for ReassignPartitionsClusterTest#shouldExecuteThrottledReassignment#5887
ijuma merged 5 commits into apache:trunk from stanislavkozlovski:shouldExecuteThrottledReassignment-flaky-test
| took > expectedDurationSecs * 0.9 * 1000) | ||
| assertTrue(s"Expected replication to be < ${expectedDurationSecs * 2 * 1000} but was $took", | ||
| assertTrue(s"Expected replication to be < ${expectedDurationSecs * 3 * 1000} but was $took", | ||
| took < expectedDurationSecs * 2 * 1000) |
This is changing the error message, not the actual check :-) In any case, I am not sure this test makes sense any more if we can't even get it to work with double the expected value. We should see if we can change the message size or the number of messages to get it to work consistently with a duration that is between 0.9t and 2t.
Oh whoops, hehe :)
I will try to tweak it a bit, but it is worth mentioning that these slow runs are definitely outliers: fewer than 1 in 50 locally.
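One way to avoid the message and the check drifting apart, as happened in the diff above, is to derive both from the same bounds. The following is only a minimal sketch: `DurationBounds`, `LowerFactor`, `UpperFactor` and `check` are hypothetical names that are not in the test, and `expectedDurationSecs` is assumed to be a `Double` here.

```scala
object DurationBounds {
  // Hypothetical constants; the thread above discusses a window of 0.9t to 2t.
  val LowerFactor = 0.9
  val UpperFactor = 2.0

  /** Returns None when tookMs is inside the window, otherwise a failure message. */
  def check(tookMs: Long, expectedDurationSecs: Double): Option[String] = {
    // Both the comparison and the message use the same computed bound,
    // so changing a factor cannot update one without the other.
    val lowerMs = expectedDurationSecs * LowerFactor * 1000
    val upperMs = expectedDurationSecs * UpperFactor * 1000
    if (tookMs <= lowerMs)
      Some(s"Expected replication to be > $lowerMs but was $tookMs")
    else if (tookMs >= upperMs)
      Some(s"Expected replication to be < $upperMs but was $tookMs")
    else
      None
  }
}
```

In the test this could be used as `assertTrue(msg, cond)` with both arguments taken from the same computed bound.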
I spent a fair bit of time debugging this. I ran the test what feels like a thousand times. One thing I noticed is that in runs that timed out, we would hit the before starting to produce messages in the test. I have no conclusive thoughts on why this is happening. Another interesting thing is that the backoff would be called only once when it was 1000. Now that it is 100, I see it much more frequently in the logs. It gets called when we have no active partitions eligible for fetching - and that is expected if all are throttled. |
@stanislavkozlovski that makes sense right? If the backoff is large enough, it can cause us to wait for too long before restarting fetches given the test workload. The backoff is configurable (replica.fetch.backoff.ms), so maybe we just reduce it to 100ms for this test? |
| saslSslPort: Int = RandomPort, | ||
| rack: Option[String] = None, | ||
| logDirCount: Int = 1, | ||
| replicaFetchBackoff: Int = 1000, |
This is not common enough to be here. I suggest setting the config on the returned Properties instance.
That's fair. updated.
Ah, I see that you have done that in the PR, just left a minor comment. Might be worth explaining why we set the config too. |
| def startBrokers(brokerIds: Seq[Int]) { | ||
| servers = brokerIds.map(i => createBrokerConfig(i, zkConnect, enableControlledShutdown = false, logDirCount = 3)) | ||
| .map(c => createServer(KafkaConfig.fromProps(c))) | ||
| servers = brokerIds.map(i => { |
Style nit: brokerIds.map { i => ...
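For reference, the two review suggestions (set the config on the returned Properties instance, and use the brace form of `map`) might combine roughly as follows. This is a sketch, not the merged code: `BrokerConfigSketch`, `brokerConfigs` and the one-argument `createBrokerConfig` are hypothetical stand-ins so the snippet is self-contained; the real helper takes more parameters.

```scala
import java.util.Properties

object BrokerConfigSketch {
  // Hypothetical stand-in for the test helper that builds a broker config.
  def createBrokerConfig(brokerId: Int): Properties = {
    val props = new Properties()
    props.put("broker.id", brokerId.toString)
    props
  }

  def brokerConfigs(brokerIds: Seq[Int]): Seq[Properties] =
    brokerIds.map { i =>
      val props = createBrokerConfig(i)
      // Reduce the backoff from the 1000 ms discussed above so that fully
      // throttled fetchers retry sooner and the test finishes within its window.
      props.put("replica.fetch.backoff.ms", "100")
      props
    }
}
```

Keeping the override next to the test, with a comment saying why, avoids adding a rarely used parameter to the shared helper.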
…st (#5887) The default backoff of 1000ms when there are no partitions to fetch can cause `shouldExecuteThrottledReassignment` to fail due to it taking too long. So we reduce it to 100ms. Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>, Ismael Juma <ismael@juma.me.uk
Cherry-picked to |
* AK/trunk:
  fix typo (apache#5150)
  MINOR: Reduce replica.fetch.backoff.ms in ReassignPartitionsClusterTest (apache#5887)
  KAFKA-7766: Fail fast PR builds (apache#6059)
  KAFKA-7798: Expose embedded clientIds (apache#6107)
  KAFKA-7641; Introduce "group.max.size" config to limit group sizes (apache#6163)
  KAFKA-7433; Introduce broker options in TopicCommand to use AdminClient (KIP-377)
  MINOR: Fix some field definitions for ListOffsetReponse (apache#6214)
  KAFKA-7873; Always seek to beginning in KafkaBasedLog (apache#6203)
  KAFKA-7719: Improve fairness in SocketServer processors (KIP-402) (apache#6022)
  MINOR: fix checkstyle suppressions for generated RPC code to work on Windows
  KAFKA-7859: Use automatic RPC generation in LeaveGroups (apache#6188)
  KAFKA-7652: Part II; Add single-point query for SessionStore and use for flushing / getter (apache#6161)
  KAFKA-3522: Add RocksDBTimestampedStore (apache#6149)
  KAFKA-3522: Replace RecordConverter with TimestampedBytesStore (apache#6204)
We've seen this test fail in Jenkins (https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/) with 10400ms.
Running locally 50 times, I had two instances where it took 8.2s and 9.3s. Since Jenkins is typically running on a slower machine, I think that it is reasonable to increase the acceptable duration here in order to reduce failed builds due to test flakiness.