MINOR: Reduce intermittent test failures for testMarksPartitionsAsOfflineAndPopulatesUncleanableMetrics and make log cleaner tests more efficient#5836

Merged
ijuma merged 2 commits into apache:trunk from stanislavkozlovski:minor-logcleaner-test-fix
Oct 27, 2018

Conversation

@stanislavkozlovski
Contributor

As seen in https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/239/testReport/junit/kafka.log/LogCleanerIntegrationTest/testMarksPartitionsAsOfflineAndPopulatesUncleanableMetrics/

This test sometimes fails because the 15-second timeout expires. Inspecting the error message from the build failure, we see that the timeout happens in the writeDups() calls, which call roll().

```text
[2018-10-23 15:18:51,018] ERROR Error while flushing log for log-1 in dir /tmp/kafka-8190355063195903574 with offset 74 (kafka.server.LogDirFailureChannel:76)
java.nio.channels.ClosedByInterruptException
...
	at kafka.log.Log.roll(Log.scala:1550)
...
	at kafka.log.AbstractLogCleanerIntegrationTest.writeDups(AbstractLogCleanerIntegrationTest.scala:132)
...
```

After investigating, I saw that this test calls Log#roll() around 60 times per run. Increasing the segmentSize config to 5120 reduces the Log#roll() calls to 4 per test.
Most other LogCleaner tests also call roll() ~90 times, so I've changed the default to 5120. I've also made the one test that requires a smaller segmentSize set it via the args.
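To see why segmentSize dominates the roll count: the log rolls a new segment roughly every time the active segment fills up, so the number of Log#roll() calls scales with total bytes written divided by the segment size. A back-of-the-envelope sketch (the write volume below is an illustrative assumption, not a value measured from the test):

```python
# Rough estimate: a log rolls a new segment each time the active segment
# fills, so roll count ~ total bytes written // segment size.
def estimated_rolls(total_bytes: int, segment_size: int) -> int:
    return total_bytes // segment_size

total_bytes = 15_000  # hypothetical volume written by writeDups(), for illustration

print(estimated_rolls(total_bytes, 256))   # old default segment size -> 58 rolls
print(estimated_rolls(total_bytes, 5120))  # larger segment size -> 2 rolls
```

With the old 256-byte default this write volume yields dozens of rolls per test, in the same ballpark as the ~60 observed; at 5120 bytes the same volume triggers only a handful.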

…lineAndPopulatesUncleanableMetrics

The problem was that the segment size for the logs was too small, triggering ~60 log rolls per test. Sometimes that would exceed the 15s timeout.
Member

@ijuma ijuma left a comment


Thanks for the PR. Two minor comments below and one question: how did you validate that existing tests were not relying on the smaller segment size? Simply by the fact that they still passed? It seems like you also checked how many rolls were still occurring; was this done for every affected test?

```diff
 private val defaultCompactionLag = 0L
 private val defaultDeleteDelay = 1000
-private val defaultSegmentSize = 256
+private val defaultSegmentSize = 5120
```
Member


Maybe we can make this 2048.

Contributor Author


Why would we make it smaller if this works?

Member


No strong reason, but the roll logic is an important part of log cleaner integration tests so I didn't want to go too far in the other direction.


```diff
 def runCleanerAndCheckCompacted(numKeys: Int): (Log, Seq[(Int, String, Long)]) = {
-  cleaner = makeCleaner(partitions = topicPartitions.take(1), propertyOverrides = logProps, backOffMs = 100L)
+  cleaner = makeCleaner(partitions = topicPartitions.take(1), propertyOverrides = logProps, backOffMs = 100L, segmentSize = 2560)
```
Member


Do we still need this override if we set 2048 as the default?

@stanislavkozlovski
Contributor Author

stanislavkozlovski commented Oct 26, 2018

how did you validate that existing tests were not relying on the smaller segment size, by the fact that they still passed?

Yes, and I also looked at how many rolls they did. Some had fewer than 10 rolls even with the smaller segment size.

It seems like you also checked how many rolls were still occurring, was this done for every test affected?

Yes, all now do fewer than 10.

Member

@ijuma ijuma left a comment


LGTM

@ijuma ijuma merged commit ccfcbfd into apache:trunk Oct 27, 2018
ijuma pushed a commit that referenced this pull request Oct 27, 2018
`testMarksPartitionsAsOfflineAndPopulatesUncleanableMetrics` sometimes fails
because the 15 second timeout expires. Inspecting the error message from the build
failure, we see that this timeout happens in the writeDups() calls which call roll().

```text
[2018-10-23 15:18:51,018] ERROR Error while flushing log for log-1 in dir /tmp/kafka-8190355063195903574 with offset 74 (kafka.server.LogDirFailureChannel:76)
java.nio.channels.ClosedByInterruptException
...
	at kafka.log.Log.roll(Log.scala:1550)
...
	at kafka.log.AbstractLogCleanerIntegrationTest.writeDups(AbstractLogCleanerIntegrationTest.scala:132)
...
```

After investigating, I saw that this test would call Log#roll() around 60 times every run.
Increasing the segmentSize config to `2048` reduces the number of Log#roll() calls
while ensuring that there are multiple rolls still.

I saw that most other LogCleaner tests also call roll() ~90 times, so I've changed the
default to be `2048`. I've also made the one test which requires a smaller segmentSize
to set it via the args.

Reviewers: Ismael Juma <ismael@juma.me.uk>
@ijuma
Member

ijuma commented Oct 27, 2018

JDK8 passed, JDK11 failures are unrelated. Merged to trunk and 2.1 branches.

pengxiaolong pushed a commit to pengxiaolong/kafka that referenced this pull request Jun 14, 2019