MINOR: Reduce intermittent test failures for testMarksPartitionsAsOfflineAndPopulatesUncleanableMetrics and make log cleaner tests more efficient#5836
ijuma merged 2 commits into apache:trunk from stanislavkozlovski:minor-logcleaner-test-fix
…lineAndPopulatesUncleanableMetrics The problem was that the segment size for the logs was too small, triggering ~60 log rolls per test. Sometimes that would exceed the 15s timeout.
ijuma
left a comment
Thanks for the PR. Two minor comments below and one question: how did you validate that existing tests were not relying on the smaller segment size, by the fact that they still passed? It seems like you also checked how many rolls were still occurring, was this done for every test affected?
  private val defaultCompactionLag = 0L
  private val defaultDeleteDelay = 1000
- private val defaultSegmentSize = 256
+ private val defaultSegmentSize = 5120
Why would we make it less if this works?
No strong reason, but the roll logic is an important part of log cleaner integration tests so I didn't want to go too far in the other direction.
  def runCleanerAndCheckCompacted(numKeys: Int): (Log, Seq[(Int, String, Long)]) = {
-   cleaner = makeCleaner(partitions = topicPartitions.take(1), propertyOverrides = logProps, backOffMs = 100L)
+   cleaner = makeCleaner(partitions = topicPartitions.take(1), propertyOverrides = logProps, backOffMs = 100L, segmentSize = 2560)
Do we still need this override if we set 2048 as the default?
Yes, I also looked at how many rolls they did. Some had fewer than 10 rolls even with the small size.
Yes, all now do less than 10.
`testMarksPartitionsAsOfflineAndPopulatesUncleanableMetrics` sometimes fails because the 15 second timeout expires. Inspecting the error message from the build failure, we see that this timeout happens in the writeDups() calls which call roll().

```text
[2018-10-23 15:18:51,018] ERROR Error while flushing log for log-1 in dir /tmp/kafka-8190355063195903574 with offset 74 (kafka.server.LogDirFailureChannel:76)
java.nio.channels.ClosedByInterruptException
...
at kafka.log.Log.roll(Log.scala:1550)
...
at kafka.log.AbstractLogCleanerIntegrationTest.writeDups(AbstractLogCleanerIntegrationTest.scala:132)
...
```

After investigating, I saw that this test would call Log#roll() around 60 times every run. Increasing the segmentSize config to `2048` reduces the number of Log#roll() calls while ensuring that there are multiple rolls still. I saw that most other LogCleaner tests also call roll() ~90 times, so I've changed the default to be `2048`. I've also made the one test which requires a smaller segmentSize to set it via the args.

Reviewers: Ismael Juma <ismael@juma.me.uk>
JDK8 passed, JDK11 failures are unrelated. Merged to trunk and 2.1 branches.
As seen in https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/239/testReport/junit/kafka.log/LogCleanerIntegrationTest/testMarksPartitionsAsOfflineAndPopulatesUncleanableMetrics/
This test sometimes fails because it exceeds the 15 second timeout. Inspecting the error message from the build failure, we see that this timeout happens in the `writeDups()` calls which call `roll()`.

After investigating, I saw that this test would call `Log#roll()` around 60 times every run. Increasing the `segmentSize` config to `5120` reduces the `Log#roll()` calls to 4 per test.

I saw that most other LogCleaner tests also call `roll()` ~90 times, so I've changed the default to be `5120`. I've also made the one test which requires a smaller segmentSize to set it via the args.
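For a rough sense of why the segment size matters, here is a back-of-the-envelope sketch in Scala. It is not code from this PR: `RollEstimate` and `estimatedRolls` are hypothetical names, and the model (roughly one roll per `segmentSize` bytes written, with the old size of 256 taken from the diff) is an assumption that ignores record batch overhead, so the figures are only approximate.

```scala
object RollEstimate {
  // Assumption: the log rolls roughly once per `segmentSize` bytes written.
  def estimatedRolls(totalBytes: Long, segmentSize: Int): Long =
    totalBytes / segmentSize

  def main(args: Array[String]): Unit = {
    // ~60 rolls were observed with the old segment size of 256 bytes,
    // which suggests about 60 * 256 = 15360 bytes written per test run.
    val approxBytesWritten = 60L * 256

    // Compare the old size with the two candidate defaults discussed in the PR.
    for (size <- Seq(256, 2048, 5120))
      println(s"segmentSize=$size -> ~${estimatedRolls(approxBytesWritten, size)} rolls")
  }
}
```

Under this model, 5120 gives about 3 rolls (in line with the 4 measured above) and 2048 gives about 7, so either value keeps the roll logic exercised while cutting the roll count substantially.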