Skip to content

MINOR: extract jointly owned parts of BrokerServer and ControllerServer#12837

Merged
cmccabe merged 12 commits intoapache:trunkfrom
cmccabe:jointserver
Dec 2, 2022
Merged

MINOR: extract jointly owned parts of BrokerServer and ControllerServer#12837
cmccabe merged 12 commits intoapache:trunkfrom
cmccabe:jointserver

Conversation

@cmccabe
Copy link
Copy Markdown
Contributor

@cmccabe cmccabe commented Nov 9, 2022

Extract jointly owned parts of BrokerServer and ControllerServer into JointServer. Shut down JointServer when the last component using it shuts down. (But make sure to stop the raft manager before closing the ControllerServer's sockets.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need the controller to be available in order to process controlled shutdown.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll rework this comment a bit.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shutdown jointServer here if broker is null? (Same thing for controller below.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, if broker is null, then jointServer cannot have been started.

However, just to make it simple to understand what is going on, I'll add an unconditional call to jointServer#stopForBroker here

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it's a little clearer if you provide the argument names, at least for the fatal arg. Same for other factories below.

fatal = config.processRoles.contains(ControllerRole)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-existing issue, but I think we also need to close the Metrics instance.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add this to JointServer#stop.

It was previously being done in BrokerServer#shutdown, which handles broker-only and combined mode. But I guess in the case of testing ControllerServer, we leaked this resource.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove newline

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra \ at the end of the line intended?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is weird. I think the IDE inserted it?

It does nothing here. I will remove it

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We end up doing a bunch of stuff that KafkaRaftServer is already doing. Seems like it would be simpler to use KafkaRaftServer directly and get rid of all this logic to manage the lower-level components. That would also get us more consistent shutdown behavior.

Copy link
Copy Markdown
Contributor Author

@cmccabe cmccabe Nov 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not simple at all to use KafkaRaftServer in most of our tests. Let me give an example. If someone shuts down a broker in a test by calling BrokerServer#shutdown, and the broker was a standalone broker, you have to somehow shut down the associated KafkaRaftServer, the associated snapshot generator, and the associated metadata loader. And clear the associated dynamic metadata.

If you maintain KafkaRaftServer more or less the way it is, where it owns a BrokerServer, ControllerServer, and some other stuff, and those owned objects don't have any pointers back to it, this is not really possible. You would have to either rewrite the tests in terms of KafkaRaftServer, which is not really feasible in the time we have available, or just accept that BrokerServer#shutdown is not going to clean up everything. I don't think either course of action really works here.

In general we haven't tested combined mode very much, so we've been able to handwave some of this. Or just accept resource leaks in the tests. But to do it correctly, we should acknowledge that in combined mode there is some joint state. Hence, JointServer.

I think this PR greatly simplifies the test code (and will do so for the other test harnesses we have). We cannot have each test harness manually managing the joint state, it is just too much (and grows over time). This is a clean way to do that: standalone mode = your own JointServer, combined mode = shared JointServer.

Do not use colocation in PlaintextAdminIntegrationTest.testCreateTopicsReturnsConfigs since QuorumTestHarness does not really support colocation

Allow node configurations to be set for controllers
@cmccabe
Copy link
Copy Markdown
Contributor Author

cmccabe commented Nov 30, 2022

There were a few test flakes that don't look related. Will merge in trunk and get another test run, I guess.

val config = jointServer.config
val time = jointServer.time
val metrics = jointServer.metrics
def raftManager: KafkaRaftManager[ApiMessageAndVersion] = jointServer.raftManager
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why not make this a val also?

JavaConverters.asScalaBuffer(Collections.<String>emptyList()).toSeq());
} catch (Throwable e) {
log.error("Error creating broker {}", node.id(), e);
if (broker != null) broker.shutdown();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: similar comment as before. I think JointServer still has resources that need to be cleaned up even if it doesn't get started (e.g. Metrics). Same thing for the controller above.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Metrics is the only one. It looks like its constructor starts a thread and so on.

But yes, we should shut down all that. Fixed.

if (kafkaRaftMetrics != null) {
kafkaRaftMetrics.close();
}
if (memoryPool instanceof BatchMemoryPool) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can let MemoryPool implement Closeable?

By the way, was this a leak? I think the only reference to the pool is in KafkaRaftClient, so does that mean we were leaking KafkaRaftClient references?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been playing whack-a-mole with cases where we leak KafkaRaftClient instances in tests. Most of them seem to be related to the metrics closures dragging in giant objects, and the metrics not getting deleted properly. These are existing issues not caused by this PR.

This is a bit ugly but it stabilizes the build greatly so I think we should leave it in. I didn't want to make it a close() function since I didn't want to start thinking about dealing with a closed state in all of the memory pools.

Can we leave this for now and open a JIRA?

@cmccabe
Copy link
Copy Markdown
Contributor Author

cmccabe commented Nov 30, 2022

Test failures not related

StickyAssignorTest > testLargeAssignmentAndGroupWithNonEqualSubscription() FAILED
CoordinatorTest > testTaskRequestWithOldStartMsGetsUpdated() FAILED
 MirrorConnectorsWithCustomForwardingAdminIntegrationTest > testReplicationIsCreatingTopicsUsingProvidedForwardingAdmin() FAILED
[DefaultTaskExecutorTest > shouldUnassignTaskWhenRequired() FAILED
 SmokeTestDriverIntegrationTest > shouldWorkWithRebalance() FAILED

Copy link
Copy Markdown
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM

@hachikuji
Copy link
Copy Markdown
Contributor

The test failures in the latest build are showing an NPE:

java.lang.NullPointerException: Cannot invoke "kafka.raft.KafkaRaftManager.client()" because the return value of "kafka.server.BrokerServer.raftManager()" is null
	at kafka.server.BrokerServer.startup(BrokerServer.scala:304)
	at kafka.integration.KafkaServerTestHarness.$anonfun$createBrokers$1(KafkaServerTestHarness.scala:346)
	at kafka.integration.KafkaServerTestHarness.$anonfun$createBrokers$1$adapted(KafkaServerTestHarness.scala:342)
	at scala.collection.immutable.Vector.foreach(Vector.scala:1856)
	at kafka.integration.KafkaServerTestHarness.createBrokers(KafkaServerTestHarness.scala:342)
	at kafka.integration.KafkaServerTestHarness.setUp(KafkaServerTestHarness.scala:121)
	at kafka.api.IntegrationTestHarness.doSetup(IntegrationTestHarness.scala:123)
	at kafka.api.IntegrationTestHarness.setUp(IntegrationTestHarness.scala:103)
	at kafka.admin.AddPartitionsTest.setUp(AddPartitionsTest.scala:60)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:727)
	at org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)

* their reference. We opted to use two booleans here rather than a reference count in order to
* make debugging easier and reduce the chance of resource leaks.
*/
class JointServer(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would SharedServer be a better name?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I suppose SharedServer might be better.

@cmccabe
Copy link
Copy Markdown
Contributor Author

cmccabe commented Dec 1, 2022

The test failures in the latest build are showing an NPE:

Yes. raftManager is initialized in JointServer#start so BrokerServer#raftManager needs to be a function, not a val.

- BrokerServer#raftManager must be a def, not val

- rename jointserver -> sharedserver

- fix one more failure case in a test harness where we needed to call stopForController to clean up the metrics object
val config = sharedServer.config
val time = sharedServer.time
val metrics = sharedServer.metrics
def raftManager: KafkaRaftManager[ApiMessageAndVersion] = sharedServer.raftManager
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment here why this needs to be a def?

@cmccabe cmccabe merged commit 5514f37 into apache:trunk Dec 2, 2022
@cmccabe cmccabe deleted the jointserver branch December 2, 2022 08:27
guozhangwang pushed a commit to guozhangwang/kafka that referenced this pull request Jan 25, 2023
…er (apache#12837)

Extract jointly owned parts of BrokerServer and ControllerServer into SharedServer. Shut down
SharedServer when the last component using it shuts down. But make sure to stop the raft manager
before closing the ControllerServer's sockets.

This PR also fixes a memory leak where ReplicaManager was not removing some topic metric callbacks
during shutdown. Finally, we now release memory from the BatchMemoryPool in KafkaRaftClient#close.
These changes should reduce memory consumption while running junit tests.

Reviewers: Jason Gustafson <jason@confluent.io>, Ismael Juma <ismael@juma.me.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants