KAFKA-14242: use mock managers to avoid duplicated resource allocation by showuon · Pull Request #12639 · apache/kafka

showuon · 2022-09-14T13:23:21Z

Recently, we got a lot of build failed (and terminated) with core:unitTest failure. The failed messages look like this:

FAILURE: Build failed with an exception.
[2022-09-14T09:51:52.190Z] 
[2022-09-14T09:51:52.190Z] * What went wrong:
[2022-09-14T09:51:52.190Z] Execution failed for task ':core:unitTest'.
[2022-09-14T09:51:52.190Z] > Process 'Gradle Test Executor 128' finished with non-zero exit value 1

After investigation, I found one reason of it (maybe there are other reasons). In BrokerMetadataPublisherTest#testReloadUpdatedFilesWithoutConfigChange test, we created logManager twice, but when cleanup, we only close one of them. So, there will be a log cleaner keeping running. But during this time, the temp log dirs are deleted, so it will Exit.halt(1), and got the error we saw in gradle, like this code did when we encounter IOException in all our log dirs:

fatal(s"Shutdown broker because all log dirs in ${logDirs.mkString(", ")} have failed")
Exit.halt(1)

And, why does it sometimes pass, sometimes failed? Because during test cluster close, we shutdown broker first, and then other components. And the log cleaner is triggered in an interval. So, if the cluster can close fast enough, and finish this test, it'll be passed. Otherwise, it'll exit with 1.

Fixed it by mock log manager and other managers in mock publisher to avoid duplicate resource allocation. This change won't change the original test goal since we only want to make sure publisher will invoke reloadUpdatedFilesWithoutConfigChange when necessary.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

showuon · 2022-09-14T13:25:06Z

@hachikuji @jsancio , please take a look. Thanks.

divijvaidya · 2022-09-14T16:36:23Z

Wow! Great deep dive to find the root cause here @showuon 👏

I am curious, how did you narrow down that Gradle Test Executor 128 is related to testReloadUpdatedFilesWithoutConfigChange. Did you check the the logs and search for exception stack traces and managed to get the IOException?

showuon · 2022-09-15T07:21:26Z

@divijvaidya , thanks.

I am curious, how did you narrow down that Gradle Test Executor 128 is related to testReloadUpdatedFilesWithoutConfigChange. Did you check the the logs and search for exception stack traces and managed to get the IOException?

Since the test got terminated, not closed gracefully, so there's no exception stack traces unfortunately, which is why it is not easy to identify the problem. For me, I don't have any trick here. I analyze some of the build logs like here, and found the "core:unitTest" mostly got terminated at ClusterTestExtensionsTest suite. Starting from it, I check the test code, found nothing special, then, find the tests earlier, since I believe there could be some earlier test affect it or somewhat. So, in the end, I found the issue in this PR in the BrokerMetadataPublisherTest test suite. Hope this is the only issue we need to fix!

2022-09-15T01:16:27.075Z] BrokerMetadataPublisherTest > testReloadUpdatedFilesWithoutConfigChange() STARTED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataPublisherTest > testReloadUpdatedFilesWithoutConfigChange() PASSED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateAndClose() STARTED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateAndClose() PASSED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateSnapshot() STARTED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateSnapshot() PASSED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateSnapshotMultipleReasons() STARTED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateSnapshotMultipleReasons() PASSED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testRecordListConsumer() STARTED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testRecordListConsumer() PASSED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] MockConfigRepositoryTest > testEmptyRepository() STARTED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] MockConfigRepositoryTest > testEmptyRepository() PASSED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] MockConfigRepositoryTest > testSetTopicConfig() STARTED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] MockConfigRepositoryTest > testSetTopicConfig() PASSED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] ClusterTestExtensionsTest > testClusterTest(ClusterConfig, ClusterInstance) > kafka.test.ClusterTestExtensionsTest.testClusterTest(ClusterConfig, ClusterInstance)[1] STARTED
[2022-09-15T01:16:28.050Z] 
[2022-09-15T01:16:28.050Z] ClusterTestExtensionsTest > testClusterTest(ClusterConfig, ClusterInstance) > kafka.test.ClusterTestExtensionsTest.testClusterTest(ClusterConfig, ClusterInstance)[1] PASSED
[2022-09-15T01:16:28.050Z] 
[2022-09-15T01:16:28.050Z] ClusterTestExtensionsTest > testNoAutoStart() > kafka.test.ClusterTestExtensionsTest.testNoAutoStart()[1] STARTED
[2022-09-15T01:16:32.108Z] 
[2022-09-15T01:16:32.108Z] ClusterTestExtensionsTest > testNoAutoStart() > kafka.test.ClusterTestExtensionsTest.testNoAutoStart()[1] PASSED
[2022-09-15T01:16:32.108Z] 
[2022-09-15T01:16:32.108Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[1] STARTED
[2022-09-15T01:16:34.943Z] 
[2022-09-15T01:16:34.943Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[1] PASSED
[2022-09-15T01:16:34.943Z] 
[2022-09-15T01:16:34.943Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[2] STARTED
[2022-09-15T01:16:38.147Z] 
[2022-09-15T01:16:38.147Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[2] PASSED
[2022-09-15T01:16:38.147Z] 
[2022-09-15T01:16:38.147Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[3] STARTED
[2022-09-15T01:16:40.319Z] 
[2022-09-15T01:16:40.319Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[3] SKIPPED
[2022-09-15T01:16:41.474Z] 
[2022-09-15T01:16:41.474Z] > Task :core:unitTest FAILED

showuon · 2022-09-15T10:53:45Z

There's been 3 straight builds not get terminated in the :core:unitTest.

showuon · 2022-09-15T12:47:51Z

cc @ijuma , I think we should merge this fix soon to help our build turn red back to yellow/green light.

showuon · 2022-09-18T23:55:53Z

@dajac @hachikuji , FYI

ijuma · 2022-09-19T00:48:32Z

@showuon Great find. I am not sure about the fix though, it seems a bit odd to update the mock in this way. To get things back into a stable state, we should perhaps disable this test while we figure out the right way to fix it.

showuon · 2022-09-19T01:18:28Z

@showuon Great find. I am not sure about the fix though, it seems a bit odd to update the mock in this way. To get things back into a stable state, we should perhaps disable this test while we figure out the right way to fix it.

Sure, I've opened #12658 to disable the test first. Thanks.

C0urante · 2022-09-20T15:43:18Z

@showuon @ijuma I've taken a stab at a safeguard to prevent tests from accidentally terminating the JVM, and to accurately report where those calls for termination come from. If you're interested, the PR can be found here.

…source allocation

showuon · 2022-09-29T12:52:41Z

    Mockito.spy(new BrokerMetadataPublisher(
      conf = broker.config,
      metadataCache = broker.metadataCache,
-      logManager = broker.logManager,
-      replicaManager = broker.replicaManager,
-      groupCoordinator = broker.groupCoordinator,
-      txnCoordinator = broker.transactionCoordinator,


Feed mock publisher with mock log manager and other managers to avoid duplicate resource allocation in initManager

showuon · 2022-09-29T12:53:27Z

@hachikuji , please take a look. Thanks.

showuon · 2022-10-31T07:15:00Z

@hachikuji @mumrah , please take a look. Thanks.

dengziming

Wish this can fix the flakiness.

apache#12639) Recently, we got a lot of build failed (and terminated) with core:unitTest failure. The failed messages look like this: FAILURE: Build failed with an exception. [2022-09-14T09:51:52.190Z] [2022-09-14T09:51:52.190Z] * What went wrong: [2022-09-14T09:51:52.190Z] Execution failed for task ':core:unitTest'. [2022-09-14T09:51:52.190Z] > Process 'Gradle Test Executor 128' finished with non-zero exit value 1 After investigation, I found one reason of it (maybe there are other reasons). In BrokerMetadataPublisherTest#testReloadUpdatedFilesWithoutConfigChange test, we created logManager twice, but when cleanup, we only close one of them. So, there will be a log cleaner keeping running. But during this time, the temp log dirs are deleted, so it will Exit.halt(1), and got the error we saw in gradle, like this code did when we encounter IOException in all our log dirs: fatal(s"Shutdown broker because all log dirs in ${logDirs.mkString(", ")} have failed") Exit.halt(1) And, why does it sometimes pass, sometimes failed? Because during test cluster close, we shutdown broker first, and then other components. And the log cleaner is triggered in an interval. So, if the cluster can close fast enough, and finish this test, it'll be passed. Otherwise, it'll exit with 1. Fixed it by mock log manager and other managers in mock publisher to avoid duplicate resource allocation. This change won't change the original test goal since we only want to make sure publisher will invoke reloadUpdatedFilesWithoutConfigChange when necessary. Reviewers: dengziming <dengziming1993@gmail.com>

KAFKA-14233: do not init managers twice to avoid resource leak

28cde03

Merge branch 'trunk' of https://github.com/apache/kafka into KAFKA-14233

c42a7ce

showuon mentioned this pull request Sep 17, 2022

KAFKA-14238; KRaft metadata log should not delete segment past the latest snapshot #12655

Merged

3 tasks

showuon mentioned this pull request Sep 19, 2022

KAFKA-14240; Validate kraft snapshot state on startup #12653

Merged

3 tasks

showuon changed the title ~~KAFKA-14233: do not init managers twice to avoid resource leak~~ KAFKA-14242: do not init managers twice to avoid resource leak Sep 19, 2022

showuon added 2 commits September 29, 2022 17:57

Merge branch 'trunk' of https://github.com/apache/kafka into KAFKA-14233

0a83ff9

KAFKA-14242: mock logmanager and other managers to avoid duplicate re…

9862871

…source allocation

showuon commented Sep 29, 2022

View reviewed changes

showuon changed the title ~~KAFKA-14242: do not init managers twice to avoid resource leak~~ KAFKA-14242: use mock managers to avoid duplicated resource allocation Sep 29, 2022

dengziming approved these changes Nov 25, 2022

View reviewed changes

showuon merged commit 5166890 into apache:trunk Nov 28, 2022

Conversation

showuon commented Sep 14, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Committer Checklist (excluded from commit message)

Uh oh!

showuon commented Sep 14, 2022

Uh oh!

divijvaidya commented Sep 14, 2022

Uh oh!

showuon commented Sep 15, 2022

Uh oh!

showuon commented Sep 15, 2022

Uh oh!

showuon commented Sep 15, 2022

Uh oh!

showuon commented Sep 18, 2022

Uh oh!

ijuma commented Sep 19, 2022

Uh oh!

showuon commented Sep 19, 2022

Uh oh!

C0urante commented Sep 20, 2022

Uh oh!

showuon Sep 29, 2022

Choose a reason for hiding this comment

Uh oh!

showuon commented Sep 29, 2022

Uh oh!

showuon commented Oct 31, 2022

Uh oh!

dengziming left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

showuon commented Sep 14, 2022 •

edited

Loading