Skip to content

KAFKA-14242: use mock managers to avoid duplicated resource allocation#12639

Merged
showuon merged 4 commits intoapache:trunkfrom
showuon:KAFKA-14233
Nov 28, 2022
Merged

KAFKA-14242: use mock managers to avoid duplicated resource allocation#12639
showuon merged 4 commits intoapache:trunkfrom
showuon:KAFKA-14233

Conversation

@showuon
Copy link
Copy Markdown
Member

@showuon showuon commented Sep 14, 2022

Recently, we got a lot of build failed (and terminated) with core:unitTest failure. The failed messages look like this:

FAILURE: Build failed with an exception.
[2022-09-14T09:51:52.190Z] 
[2022-09-14T09:51:52.190Z] * What went wrong:
[2022-09-14T09:51:52.190Z] Execution failed for task ':core:unitTest'.
[2022-09-14T09:51:52.190Z] > Process 'Gradle Test Executor 128' finished with non-zero exit value 1

After investigation, I found one reason of it (maybe there are other reasons). In BrokerMetadataPublisherTest#testReloadUpdatedFilesWithoutConfigChange test, we created logManager twice, but when cleanup, we only close one of them. So, there will be a log cleaner keeping running. But during this time, the temp log dirs are deleted, so it will Exit.halt(1), and got the error we saw in gradle, like this code did when we encounter IOException in all our log dirs:

fatal(s"Shutdown broker because all log dirs in ${logDirs.mkString(", ")} have failed")
Exit.halt(1)

And, why does it sometimes pass, sometimes failed? Because during test cluster close, we shutdown broker first, and then other components. And the log cleaner is triggered in an interval. So, if the cluster can close fast enough, and finish this test, it'll be passed. Otherwise, it'll exit with 1.

Fixed it by mock log manager and other managers in mock publisher to avoid duplicate resource allocation. This change won't change the original test goal since we only want to make sure publisher will invoke reloadUpdatedFilesWithoutConfigChange when necessary.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@showuon
Copy link
Copy Markdown
Member Author

showuon commented Sep 14, 2022

@hachikuji @jsancio , please take a look. Thanks.

@divijvaidya
Copy link
Copy Markdown
Member

Wow! Great deep dive to find the root cause here @showuon 👏

I am curious, how did you narrow down that Gradle Test Executor 128 is related to testReloadUpdatedFilesWithoutConfigChange. Did you check the the logs and search for exception stack traces and managed to get the IOException?

@showuon
Copy link
Copy Markdown
Member Author

showuon commented Sep 15, 2022

@divijvaidya , thanks.

I am curious, how did you narrow down that Gradle Test Executor 128 is related to testReloadUpdatedFilesWithoutConfigChange. Did you check the the logs and search for exception stack traces and managed to get the IOException?

Since the test got terminated, not closed gracefully, so there's no exception stack traces unfortunately, which is why it is not easy to identify the problem. For me, I don't have any trick here. I analyze some of the build logs like here, and found the "core:unitTest" mostly got terminated at ClusterTestExtensionsTest suite. Starting from it, I check the test code, found nothing special, then, find the tests earlier, since I believe there could be some earlier test affect it or somewhat. So, in the end, I found the issue in this PR in the BrokerMetadataPublisherTest test suite. Hope this is the only issue we need to fix!

2022-09-15T01:16:27.075Z] BrokerMetadataPublisherTest > testReloadUpdatedFilesWithoutConfigChange() STARTED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataPublisherTest > testReloadUpdatedFilesWithoutConfigChange() PASSED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateAndClose() STARTED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateAndClose() PASSED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateSnapshot() STARTED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateSnapshot() PASSED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateSnapshotMultipleReasons() STARTED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testCreateSnapshotMultipleReasons() PASSED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testRecordListConsumer() STARTED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] BrokerMetadataSnapshotterTest > testRecordListConsumer() PASSED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] MockConfigRepositoryTest > testEmptyRepository() STARTED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] MockConfigRepositoryTest > testEmptyRepository() PASSED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] MockConfigRepositoryTest > testSetTopicConfig() STARTED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] MockConfigRepositoryTest > testSetTopicConfig() PASSED
[2022-09-15T01:16:27.075Z] 
[2022-09-15T01:16:27.075Z] ClusterTestExtensionsTest > testClusterTest(ClusterConfig, ClusterInstance) > kafka.test.ClusterTestExtensionsTest.testClusterTest(ClusterConfig, ClusterInstance)[1] STARTED
[2022-09-15T01:16:28.050Z] 
[2022-09-15T01:16:28.050Z] ClusterTestExtensionsTest > testClusterTest(ClusterConfig, ClusterInstance) > kafka.test.ClusterTestExtensionsTest.testClusterTest(ClusterConfig, ClusterInstance)[1] PASSED
[2022-09-15T01:16:28.050Z] 
[2022-09-15T01:16:28.050Z] ClusterTestExtensionsTest > testNoAutoStart() > kafka.test.ClusterTestExtensionsTest.testNoAutoStart()[1] STARTED
[2022-09-15T01:16:32.108Z] 
[2022-09-15T01:16:32.108Z] ClusterTestExtensionsTest > testNoAutoStart() > kafka.test.ClusterTestExtensionsTest.testNoAutoStart()[1] PASSED
[2022-09-15T01:16:32.108Z] 
[2022-09-15T01:16:32.108Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[1] STARTED
[2022-09-15T01:16:34.943Z] 
[2022-09-15T01:16:34.943Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[1] PASSED
[2022-09-15T01:16:34.943Z] 
[2022-09-15T01:16:34.943Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[2] STARTED
[2022-09-15T01:16:38.147Z] 
[2022-09-15T01:16:38.147Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[2] PASSED
[2022-09-15T01:16:38.147Z] 
[2022-09-15T01:16:38.147Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[3] STARTED
[2022-09-15T01:16:40.319Z] 
[2022-09-15T01:16:40.319Z] ClusterTestExtensionsTest > testClusterTests() > kafka.test.ClusterTestExtensionsTest.testClusterTests()[3] SKIPPED
[2022-09-15T01:16:41.474Z] 
[2022-09-15T01:16:41.474Z] > Task :core:unitTest FAILED

@showuon
Copy link
Copy Markdown
Member Author

showuon commented Sep 15, 2022

There's been 3 straight builds not get terminated in the :core:unitTest.

image

@showuon
Copy link
Copy Markdown
Member Author

showuon commented Sep 15, 2022

cc @ijuma , I think we should merge this fix soon to help our build turn red back to yellow/green light.

image

@showuon
Copy link
Copy Markdown
Member Author

showuon commented Sep 18, 2022

@dajac @hachikuji , FYI

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Sep 19, 2022

@showuon Great find. I am not sure about the fix though, it seems a bit odd to update the mock in this way. To get things back into a stable state, we should perhaps disable this test while we figure out the right way to fix it.

@showuon
Copy link
Copy Markdown
Member Author

showuon commented Sep 19, 2022

@showuon Great find. I am not sure about the fix though, it seems a bit odd to update the mock in this way. To get things back into a stable state, we should perhaps disable this test while we figure out the right way to fix it.

Sure, I've opened #12658 to disable the test first. Thanks.

@showuon showuon changed the title KAFKA-14233: do not init managers twice to avoid resource leak KAFKA-14242: do not init managers twice to avoid resource leak Sep 19, 2022
@C0urante
Copy link
Copy Markdown
Contributor

@showuon @ijuma I've taken a stab at a safeguard to prevent tests from accidentally terminating the JVM, and to accurately report where those calls for termination come from. If you're interested, the PR can be found here.

Comment on lines 192 to -189
Mockito.spy(new BrokerMetadataPublisher(
conf = broker.config,
metadataCache = broker.metadataCache,
logManager = broker.logManager,
replicaManager = broker.replicaManager,
groupCoordinator = broker.groupCoordinator,
txnCoordinator = broker.transactionCoordinator,
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feed mock publisher with mock log manager and other managers to avoid duplicate resource allocation in initManager

@showuon showuon changed the title KAFKA-14242: do not init managers twice to avoid resource leak KAFKA-14242: use mock managers to avoid duplicated resource allocation Sep 29, 2022
@showuon
Copy link
Copy Markdown
Member Author

showuon commented Sep 29, 2022

@hachikuji , please take a look. Thanks.

@showuon
Copy link
Copy Markdown
Member Author

showuon commented Oct 31, 2022

@hachikuji @mumrah , please take a look. Thanks.

Copy link
Copy Markdown
Member

@dengziming dengziming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wish this can fix the flakiness.

@showuon showuon merged commit 5166890 into apache:trunk Nov 28, 2022
guozhangwang pushed a commit to guozhangwang/kafka that referenced this pull request Jan 25, 2023
apache#12639)

Recently, we got a lot of build failed (and terminated) with core:unitTest failure. The failed messages look like this:

FAILURE: Build failed with an exception.
[2022-09-14T09:51:52.190Z] 
[2022-09-14T09:51:52.190Z] * What went wrong:
[2022-09-14T09:51:52.190Z] Execution failed for task ':core:unitTest'.
[2022-09-14T09:51:52.190Z] > Process 'Gradle Test Executor 128' finished with non-zero exit value 1

After investigation, I found one reason of it (maybe there are other reasons). In BrokerMetadataPublisherTest#testReloadUpdatedFilesWithoutConfigChange test, we created logManager twice, but when cleanup, we only close one of them. So, there will be a log cleaner keeping running. But during this time, the temp log dirs are deleted, so it will Exit.halt(1), and got the error we saw in gradle, like this code did when we encounter IOException in all our log dirs:

fatal(s"Shutdown broker because all log dirs in ${logDirs.mkString(", ")} have failed")
Exit.halt(1)

And, why does it sometimes pass, sometimes failed? Because during test cluster close, we shutdown broker first, and then other components. And the log cleaner is triggered in an interval. So, if the cluster can close fast enough, and finish this test, it'll be passed. Otherwise, it'll exit with 1.

Fixed it by mock log manager and other managers in mock publisher to avoid duplicate resource allocation. This change won't change the original test goal since we only want to make sure publisher will invoke reloadUpdatedFilesWithoutConfigChange when necessary.

Reviewers: dengziming <dengziming1993@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants