Skip to content

MINOR: Close ZooKeeperClient if waitUntilConnected fails during construction#5411

Merged
ijuma merged 8 commits intoapache:trunkfrom
omkreddy:close-zk
Jul 22, 2018
Merged

MINOR: Close ZooKeeperClient if waitUntilConnected fails during construction#5411
ijuma merged 8 commits intoapache:trunkfrom
omkreddy:close-zk

Conversation

@omkreddy
Copy link
Copy Markdown
Contributor

@omkreddy omkreddy commented Jul 21, 2018

This has always been an issue, but the recent upgrade to ZooKeeper
3.4.13 means that it is also an issue when an unresolvable ZK
address is used, causing some tests to leak threads.

The change in behaviour in ZK 3.4.13 is that no exception is thrown
from the ZooKeeper constructor in case of an unresolvable address.
Instead, ZooKeeper tries to re-resolve the address hoping it becomes
resolvable again. We eventually throw a
ZooKeeperClientTimeoutException, which is similar to the case
where the the address is resolvable, but ZooKeeper is not
reachable.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@omkreddy
Copy link
Copy Markdown
Contributor Author

omkreddy commented Jul 21, 2018

@ijuma @rajinisivaram

Three is a zookeeper thread leak incase of unknown host address passed to zk connect string. This is due to recent ZK client lib changes. Previously zookeeper fail immediately for unknown address. In new version re-resolves hosts when connection attempts fail.

This is leading to thread leak in ServerShutdownTest.testCleanShutdownAfterFailedStartup test, ZooKeeperClientTest.testUnresolvableConnectString.

Thread leak in ServerShutdownTest.testCleanShutdownAfterFailedStartup

[2018-07-21 16:57:13,212] WARN Session 0x0 for server some.invalid.hostname.foo.bar.local:2181, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:1168)
java.nio.channels.UnresolvedAddressException
	at sun.nio.ch.Net.checkAddress(Net.java:101)
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
	at org.apache.zookeeper.ClientCnxnSocketNIO.registerAndConnect(ClientCnxnSocketNIO.java:277)
	at org.apache.zookeeper.ClientCnxnSocketNIO.connect(ClientCnxnSocketNIO.java:287)
	at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1021)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1064)

Before zookeeper-3.4.13

java.net.UnknownHostException: some.invalid.hostname.foo.bar.local: unknown error
	at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
	at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
	at java.net.InetAddress.getAllByName(InetAddress.java:1192)
	at java.net.InetAddress.getAllByName(InetAddress.java:1126)
	at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
	at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
	at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
	at kafka.zookeeper.ZooKeeperClient.<init>(ZooKeeperClient.scala:86)
	at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:1538)
	at kafka.server.KafkaServer.kafka$server$KafkaServer$$createZkClient$1(KafkaServer.scala:345)
	at kafka.server.KafkaServer.initZkClient(KafkaServer.scala:369)
	at kafka.server.KafkaServer.startup(KafkaServer.scala:202)

With zookeeper-3.4.13

kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING
	at kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply$mcV$sp(ZooKeeperClient.scala:226)
	at kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply(ZooKeeperClient.scala:221)
	at kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply(ZooKeeperClient.scala:221)
	at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
	at kafka.zookeeper.ZooKeeperClient.kafka$zookeeper$ZooKeeperClient$$waitUntilConnected(ZooKeeperClient.scala:221)
	at kafka.zookeeper.ZooKeeperClient.<init>(ZooKeeperClient.scala:95)
	at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:1580)
	at kafka.server.KafkaServer.kafka$server$KafkaServer$$createZkClient$1(KafkaServer.scala:348)
	at kafka.server.KafkaServer.initZkClient(KafkaServer.scala:372)
	at kafka.server.KafkaServer.startup(KafkaServer.scala:202)

Copy link
Copy Markdown
Member

@ijuma ijuma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix @omkreddy. Can we update the test to verify that the underlying zk is closed?

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Jul 21, 2018

This fix doesn't look like the right fix. We want to close zookeeper if this fails during construction only right? In other cases, the caller can decide what to do if waitUntilConnected fails.

@omkreddy
Copy link
Copy Markdown
Contributor Author

Ye, we can close in constructor. Updated the PR.
Facing bit difficult to write a test. underlying ZK is not accessible incase of failure at constructor.

case e: ZooKeeperClientTimeoutException =>
zooKeeper.close()
throw e
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should just be a finally instead of catch, I think.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it some more, we actually need to shutdown the scheduler too.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can call ZooKeeperClient.close()?

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Jul 21, 2018

@omkreddy I agree that writing a test is a bit challenging. The only way I can think of is to subclass ZooKeeperClient and catch the parent constructor exception in the subclass.

close()
throw e
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating, I still think we should use a finally block instead of catching a specific exception.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait. I was wrong, we can't use finally here. But we should probably catch a more general exception?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can user ZooKeeperClientException, which parent is parent class for zk excpetions

Copy link
Copy Markdown
Member

@ijuma ijuma Jul 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, as other exceptions can be thrown too. But to be honest, we might as well catch Throwable since we rethrow he exception anyway. We just want to avoid leaking the resource.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the PR with Throwable . Not sure, if we can catch parent constructor exception in subclass constructor. For now, added a test based on zk thread count.

case e: Throwable =>
close()
throw e
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we don't need the block for try, e.g.

try waitUntilConnected(connectionTimeoutMs, TimeUnit.MILLISECONDS)
catch {
  case e: Throwable =>
    close()
    throw e
}

.filter(_.isAlive)
.map(_.getName)
.count(t => t.contains("SendThread()") || t.contains(hostAddress)) // Verify threadName pattern for unresolvable host zk send threads

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea for a test. Why not simply:

  @Test
  def testUnresolvableConnectString(): Unit = {
    val hostAddress = "some.invalid.hostname.foo.bar.local"
    try {
      new ZooKeeperClient(hostAddress, zkSessionTimeout, connectionTimeoutMs = 10, Int.MaxValue, time,
        "testMetricGroup", "testMetricType")
    } catch {
      case _: ZooKeeperClientTimeoutException =>
        assertEquals("ZooKeeper client threads still running", Set.empty, runningZkSendThreads)
    }
  }

  private def runningZkSendThreads: collection.Set[String] = Thread.getAllStackTraces.keySet.asScala
    .filter(_.isAlive)
    .map(_.getName)
    .filter(_.contains("SendThread()"))

Copy link
Copy Markdown
Member

@ijuma ijuma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates, LGTM.

@ijuma ijuma changed the title MINOR: Close Zookeeper instance incase of timeout in ZooKeeperClient initialization MINOR: Release resources if exception is thrown during ZooKeeperClient construction Jul 22, 2018
@ijuma ijuma changed the title MINOR: Release resources if exception is thrown during ZooKeeperClient construction MINOR: Close ZooKeeperClient if waitUntilConnected fails during construction Jul 22, 2018
@ijuma
Copy link
Copy Markdown
Member

ijuma commented Jul 22, 2018

Merging to trunk and 2.0.

@ijuma ijuma merged commit 5db2f99 into apache:trunk Jul 22, 2018
ijuma pushed a commit that referenced this pull request Jul 22, 2018
…ruction (#5411)

This has always been an issue, but the recent upgrade to ZooKeeper
3.4.13 means it is also an issue when an unresolvable ZK
address is used, causing some tests to leak threads.

The change in behaviour in ZK 3.4.13 is that no exception is thrown
from the ZooKeeper constructor in case of an unresolvable address.
Instead, ZooKeeper tries to re-resolve the address hoping it becomes
resolvable again. We eventually throw a
`ZooKeeperClientTimeoutException`, which is similar to the case
where the the address is resolvable but ZooKeeper is not
reachable.

Reviewers: Ismael Juma <ismael@juma.me.uk>
ijuma added a commit to confluentinc/kafka that referenced this pull request Jul 23, 2018
* apache-github/2.0:
  MINOR: Close ZooKeeperClient if waitUntilConnected fails during construction (apache#5411)
  KAFKA-6897; Prevent KafkaProducer.send from blocking when producer is closed (apache#5027)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants