KAFKA-14242: use mock managers to avoid duplicated resource allocation#12639
KAFKA-14242: use mock managers to avoid duplicated resource allocation#12639showuon merged 4 commits intoapache:trunkfrom
Conversation
|
@hachikuji @jsancio , please take a look. Thanks. |
|
Wow! Great deep dive to find the root cause here @showuon 👏 I am curious, how did you narrow down that |
|
@divijvaidya , thanks.
Since the test got terminated, not closed gracefully, so there's no exception stack traces unfortunately, which is why it is not easy to identify the problem. For me, I don't have any trick here. I analyze some of the build logs like here, and found the "core:unitTest" mostly got terminated at |
|
cc @ijuma , I think we should merge this fix soon to help our build turn red back to yellow/green light. |
|
@dajac @hachikuji , FYI |
|
@showuon Great find. I am not sure about the fix though, it seems a bit odd to update the mock in this way. To get things back into a stable state, we should perhaps disable this test while we figure out the right way to fix it. |
| Mockito.spy(new BrokerMetadataPublisher( | ||
| conf = broker.config, | ||
| metadataCache = broker.metadataCache, | ||
| logManager = broker.logManager, | ||
| replicaManager = broker.replicaManager, | ||
| groupCoordinator = broker.groupCoordinator, | ||
| txnCoordinator = broker.transactionCoordinator, |
There was a problem hiding this comment.
Feed mock publisher with mock log manager and other managers to avoid duplicate resource allocation in initManager
|
@hachikuji , please take a look. Thanks. |
|
@hachikuji @mumrah , please take a look. Thanks. |
dengziming
left a comment
There was a problem hiding this comment.
Wish this can fix the flakiness.
apache#12639) Recently, we got a lot of build failed (and terminated) with core:unitTest failure. The failed messages look like this: FAILURE: Build failed with an exception. [2022-09-14T09:51:52.190Z] [2022-09-14T09:51:52.190Z] * What went wrong: [2022-09-14T09:51:52.190Z] Execution failed for task ':core:unitTest'. [2022-09-14T09:51:52.190Z] > Process 'Gradle Test Executor 128' finished with non-zero exit value 1 After investigation, I found one reason of it (maybe there are other reasons). In BrokerMetadataPublisherTest#testReloadUpdatedFilesWithoutConfigChange test, we created logManager twice, but when cleanup, we only close one of them. So, there will be a log cleaner keeping running. But during this time, the temp log dirs are deleted, so it will Exit.halt(1), and got the error we saw in gradle, like this code did when we encounter IOException in all our log dirs: fatal(s"Shutdown broker because all log dirs in ${logDirs.mkString(", ")} have failed") Exit.halt(1) And, why does it sometimes pass, sometimes failed? Because during test cluster close, we shutdown broker first, and then other components. And the log cleaner is triggered in an interval. So, if the cluster can close fast enough, and finish this test, it'll be passed. Otherwise, it'll exit with 1. Fixed it by mock log manager and other managers in mock publisher to avoid duplicate resource allocation. This change won't change the original test goal since we only want to make sure publisher will invoke reloadUpdatedFilesWithoutConfigChange when necessary. Reviewers: dengziming <dengziming1993@gmail.com>


Recently, we got a lot of build failed (and terminated) with core:unitTest failure. The failed messages look like this:
After investigation, I found one reason of it (maybe there are other reasons). In
BrokerMetadataPublisherTest#testReloadUpdatedFilesWithoutConfigChangetest, we created logManager twice, but when cleanup, we only close one of them. So, there will be a log cleaner keeping running. But during this time, the temp log dirs are deleted, so it willExit.halt(1), and got the error we saw in gradle, like this code did when we encounter IOException in all our log dirs:And, why does it sometimes pass, sometimes failed? Because during test cluster close, we shutdown broker first, and then other components. And the log cleaner is triggered in an interval. So, if the cluster can close fast enough, and finish this test, it'll be passed. Otherwise, it'll exit with 1.
Fixed it by mock log manager and other managers in mock publisher to avoid duplicate resource allocation. This change won't change the original test goal since we only want to make sure publisher will invoke
reloadUpdatedFilesWithoutConfigChangewhen necessary.Committer Checklist (excluded from commit message)