Skip to content

[Bug Fix] Broker will not wait for its SQL metadata view to fully initialize before starting up, even though set awaitInitializationOnStart true#10779

Merged
suneet-s merged 2 commits intoapache:masterfrom
zhangyue19921010:broker-meta-fresh
Jan 22, 2021

Conversation

@zhangyue19921010
Copy link
Copy Markdown
Contributor

@zhangyue19921010 zhangyue19921010 commented Jan 19, 2021

druid.sql.planner.awaitInitializationOnStart is used for control whether the Broker will wait for its SQL metadata view to fully initialize before starting up.

And PR #6765 was merge to solve the bug that Start up DruidSchema immediately if there are no segments

But this PR brings another bug that Broker will not wait for its SQL metadata view to fully initialize before starting up, even though set druid.sql.planner.awaitInitializationOnStart true.

Description

From the design view, if exceptions threw in

while (!Thread.currentThread().isInterrupted()) {

Broker will do retry in next loop.

Code here


break this design. In next loop, isServerViewInitialized is true and exit the loop directly, but there are still several refresh work need to be done.

Here is the full logs of this bug:

2021-01-19T05:02:52,921 INFO [main] org.eclipse.jetty.util.log - Logging initialized @7838ms to org.eclipse.jetty.util.log.Slf4jLog
2021-01-19T05:02:52,934 INFO [main] org.apache.druid.server.initialization.jetty.JettyServerModule - Creating http connector with port [8082]
2021-01-19T05:02:53,356 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Starting lifecycle [module] stage [INIT]
2021-01-19T05:02:53,357 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Starting lifecycle [module] stage [NORMAL]
2021-01-19T05:02:53,357 INFO [main] org.apache.curator.framework.imps.CuratorFrameworkImpl - Starting
2021-01-19T05:02:53,364 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:zookeeper.version=3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
2021-01-19T05:02:53,364 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:host.name=druid-dev-8-broker-0.druid-dev-8-broker.druid-dev-8.svc.cluster.local
2021-01-19T05:02:53,364 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:java.version=1.8.0_221
2021-01-19T05:02:53,364 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:java.vendor=Oracle Corporation
2021-01-19T05:02:53,364 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:java.home=/usr/java/jdk1.8.0_221-amd64/jre
2021-01-19T05:02:53,364 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:java.class.path=conf/druid/_common:conf/druid/broker:lib/FastInfoset-1.2.15.jar:lib/RoaringBitmap-0.8.11.jar:lib/accessors-smart-1.2.jar:lib/aether-api-0.9.0.M2.jar:lib/aether-connector-file-0.9.0.M2.jar:lib/aether-connector-okhttp-0.0.9.jar:lib/aether-impl-0.9.0.M2.jar:lib/aether-spi-0.9.0.M2.jar:lib/aether-util-0.9.0.M2.jar:lib/aggdesigner-algorithm-6.0.jar:lib/airline-0.7.jar:lib/antlr4-runtime-4.5.1.jar:lib/aopalliance-1.0.jar:lib/asm-7.1.jar:lib/asm-analysis-7.1.jar:lib/asm-commons-7.1.jar:lib/asm-tree-7.1.jar:lib/async-http-client-2.5.3.jar:lib/async-http-client-netty-utils-2.5.3.jar:lib/audience-annotations-0.5.0.jar:lib/avatica-core-1.15.0.jar:lib/avatica-metrics-1.15.0.jar:lib/avatica-server-1.15.0.jar:lib/aws-java-sdk-core-1.11.199.jar:lib/aws-java-sdk-ec2-1.11.199.jar:lib/aws-java-sdk-kms-1.11.199.jar:lib/aws-java-sdk-s3-1.11.199.jar:lib/caffeine-2.8.0.jar:lib/calcite-core-1.21.0.jar:lib/calcite-linq4j-1.21.0.jar:lib/checker-qual-2.5.7.jar:lib/classmate-1.1.0.jar:lib/commons-beanutils-1.9.4.jar:lib/commons-codec-1.13.jar:lib/commons-collections-3.2.2.jar:lib/commons-collections4-4.2.jar:lib/commons-compiler-3.0.11.jar:lib/commons-compress-1.19.jar:lib/commons-dbcp2-2.0.1.jar:lib/commons-io-2.6.jar:lib/commons-lang-2.6.jar:lib/commons-lang3-3.8.1.jar:lib/commons-logging-1.1.1.jar:lib/commons-math3-3.6.1.jar:lib/commons-net-3.6.jar:lib/commons-pool2-2.2.jar:lib/commons-text-1.3.jar:lib/compress-lzf-1.0.4.jar:lib/config-magic-0.9.jar:lib/curator-client-4.1.0.jar:lib/curator-framework-4.1.0.jar:lib/curator-recipes-4.1.0.jar:lib/curator-x-discovery-4.1.0.jar:lib/datasketches-java-1.1.0-incubating.jar:lib/datasketches-memory-1.2.0-incubating.jar:lib/derby-10.14.2.0.jar:lib/derbyclient-10.14.2.0.jar:lib/derbynet-10.14.2.0.jar:lib/disruptor-3.3.6.jar:lib/druid-aws-common-0.17.1.jar:lib/druid-console-0.17.1.jar:lib/druid-core-0.17.1.jar:lib/druid-gcp-common-0.17.1.jar:lib/druid-hll-0.17.1.jar:lib/druid-indexing-hadoop-0.17.1.jar:lib/druid-indexing-service-0.17.1.jar:lib/druid-processing-0.17.1.jar:lib/druid-server-0.17.1.jar:lib/druid-services-0.17.1.jar:lib/druid-sql-0.17.1.jar:lib/error_prone_annotations-2.3.2.jar:lib/esri-geometry-api-2.2.0.jar:lib/extendedset-0.17.1.jar:lib/fastutil-8.2.3.jar:lib/google-api-client-1.22.0.jar:lib/google-http-client-1.22.0.jar:lib/google-http-client-jackson2-1.22.0.jar:lib/google-oauth-client-1.22.0.jar:lib/guava-16.0.1.jar:lib/guice-4.1.0.jar:lib/guice-multibindings-4.1.0.jar:lib/guice-servlet-4.1.0.jar:lib/hibernate-validator-5.2.5.Final.jar:lib/httpclient-4.5.10.jar:lib/httpcore-4.4.11.jar:lib/icu4j-55.1.jar:lib/ion-java-1.0.2.jar:lib/istack-commons-runtime-3.0.7.jar:lib/jackson-annotations-2.10.1.jar:lib/jackson-core-2.10.1.jar:lib/jackson-core-asl-1.9.13.jar:lib/jackson-databind-2.10.1.jar:lib/jackson-dataformat-cbor-2.10.1.jar:lib/jackson-dataformat-smile-2.10.1.jar:lib/jackson-datatype-guava-2.10.1.jar:lib/jackson-datatype-joda-2.10.1.jar:lib/jackson-jaxrs-base-2.10.1.jar:lib/jackson-jaxrs-json-provider-2.10.1.jar:lib/jackson-jaxrs-smile-provider-2.10.1.jar:lib/jackson-jq-0.0.10.jar:lib/jackson-mapper-asl-1.9.13.jar:lib/jackson-module-guice-2.10.1.jar:lib/jackson-module-jaxb-annotations-2.10.1.jar:lib/jakarta.activation-api-1.2.1.jar:lib/jakarta.xml.bind-api-2.3.2.jar:lib/janino-3.0.11.jar:lib/javax.activation-1.2.0.jar:lib/javax.activation-api-1.2.0.jar:lib/javax.el-3.0.0.jar:lib/javax.el-api-3.0.0.jar:lib/javax.inject-1.jar:lib/javax.servlet-api-3.1.0.jar:lib/jaxb-api-2.3.1.jar:lib/jaxb-runtime-2.3.1.jar:lib/jboss-logging-3.2.1.Final.jar:lib/jcl-over-slf4j-1.7.12.jar:lib/jcodings-1.0.43.jar:lib/jdbi-2.63.1.jar:lib/jersey-core-1.19.3.jar:lib/jersey-guice-1.19.3.jar:lib/jersey-server-1.19.3.jar:lib/jersey-servlet-1.19.3.jar:lib/jetty-client-9.4.12.v20180830.jar:lib/jetty-continuation-9.4.12.v20180830.jar:lib/jetty-http-9.4.12.v20180830.jar:lib/jetty-io-9.4.12.v20180830.jar:lib/jetty-proxy-9.4.12.v20180830.jar:lib/jetty-rewrite-9.4.12.v20180830.jar:lib/jetty-security-9.4.12.v20180830.jar:lib/jetty-server-9.4.12.v20180830.jar:lib/jetty-servlet-9.4.12.v20180830.jar:lib/jetty-servlets-9.4.12.v20180830.jar:lib/jetty-util-9.4.12.v20180830.jar:lib/jline-0.9.94.jar:lib/jmespath-java-1.11.199.jar:lib/jna-4.5.1.jar:lib/joda-time-2.10.5.jar:lib/joni-2.1.27.jar:lib/json-path-2.3.0.jar:lib/json-smart-2.3.jar:lib/jsr305-2.0.1.jar:lib/jsr311-api-1.1.1.jar:lib/jvm-attach-api-1.5.jar:lib/log4j-1.2-api-2.8.2.jar:lib/log4j-api-2.8.2.jar:lib/log4j-core-2.8.2.jar:lib/log4j-jul-2.8.2.jar:lib/log4j-slf4j-impl-2.8.2.jar:lib/lz4-java-1.6.0.jar:lib/maven-aether-provider-3.1.1.jar:lib/maven-artifact-3.6.0.jar:lib/maven-model-3.1.1.jar:lib/maven-model-builder-3.1.1.jar:lib/maven-repository-metadata-3.1.1.jar:lib/maven-settings-3.1.1.jar:lib/maven-settings-builder-3.1.1.jar:lib/metrics-core-4.0.0.jar:lib/netty-3.10.6.Final.jar:lib/netty-buffer-4.1.42.Final.jar:lib/netty-codec-4.1.42.Final.jar:lib/netty-codec-dns-4.1.42.Final.jar:lib/netty-codec-http-4.1.42.Final.jar:lib/netty-codec-socks-4.1.42.Final.jar:lib/netty-common-4.1.42.Final.jar:lib/netty-handler-4.1.42.Final.jar:lib/netty-handler-proxy-4.1.42.Final.jar:lib/netty-reactive-streams-2.0.0.jar:lib/netty-resolver-4.1.42.Final.jar:lib/netty-resolver-dns-4.1.42.Final.jar:lib/netty-transport-4.1.42.Final.jar:lib/netty-transport-native-epoll-4.1.42.Final-linux-x86_64.jar:lib/netty-transport-native-unix-common-4.1.42.Final.jar:lib/okhttp-1.0.2.jar:lib/opencsv-4.6.jar:lib/plexus-interpolation-1.19.jar:lib/plexus-utils-3.0.24.jar:lib/protobuf-java-3.11.0.jar:lib/reactive-streams-1.0.2.jar:lib/rhino-1.7.11.jar:lib/shims-0.8.11.jar:lib/sigar-1.6.5.132.jar:lib/slf4j-api-1.7.25.jar:lib/spymemcached-2.12.3.jar:lib/stax-ex-1.8.jar:lib/tesla-aether-0.0.5.jar:lib/txw2-2.3.1.jar:lib/validation-api-1.1.0.Final.jar:lib/wagon-provider-api-2.4.jar:lib/xz-1.8.jar:lib/zookeeper-3.4.14.jar:lib/zstd-jni-1.3.3-1.jar:
2021-01-19T05:02:53,365 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
2021-01-19T05:02:53,365 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:java.io.tmpdir=var/tmp
2021-01-19T05:02:53,365 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:java.compiler=<NA>
2021-01-19T05:02:53,365 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:os.name=Linux
2021-01-19T05:02:53,365 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:os.arch=amd64
2021-01-19T05:02:53,365 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:os.version=4.14.198-152.320.amzn2.x86_64
2021-01-19T05:02:53,365 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:user.name=root
2021-01-19T05:02:53,365 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:user.home=/root
2021-01-19T05:02:53,365 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:user.dir=/opt/apache-druid-0.17.1
2021-01-19T05:02:53,366 INFO [main] org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=druid-dev-8-zookeeper-headless:2181 sessionTimeout=30000 watcher=org.apache.curator.ConnectionState@621624b1
2021-01-19T05:02:53,381 INFO [main] org.apache.curator.framework.imps.CuratorFrameworkImpl - Default schema
2021-01-19T05:02:53,382 INFO [main] org.apache.druid.java.util.emitter.core.ComposingEmitter - Starting Composing Emitter.
2021-01-19T05:02:53,382 INFO [main] org.apache.druid.java.util.emitter.core.ComposingEmitter - Starting emitter org.apache.druid.emitter.statsd.StatsDEmitter.
2021-01-19T05:02:53,383 INFO [main] org.apache.druid.java.util.emitter.core.ComposingEmitter - Starting Composing Emitter.
2021-01-19T05:02:53,383 INFO [main] org.apache.druid.java.util.emitter.core.ComposingEmitter - Starting emitter org.apache.druid.emitter.statsd.StatsDEmitter.
2021-01-19T05:02:53,385 INFO [main-SendThread(druid-dev-8-zookeeper-headless:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server druid-dev-8-zookeeper-headless/10.23.37.215:2181. Will not attempt to authenticate using SASL (unknown error)
2021-01-19T05:02:53,390 INFO [main-SendThread(druid-dev-8-zookeeper-headless:2181)] org.apache.zookeeper.ClientCnxn - Socket connection established to druid-dev-8-zookeeper-headless/10.23.37.215:2181, initiating session
2021-01-19T05:02:53,398 INFO [main-SendThread(druid-dev-8-zookeeper-headless:2181)] org.apache.zookeeper.ClientCnxn - Session establishment complete on server druid-dev-8-zookeeper-headless/10.23.37.215:2181, sessionid = 0x200000192880017, negotiated timeout = 30000
2021-01-19T05:02:53,403 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: CONNECTED
2021-01-19T05:02:53,435 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='druid-dev-8-middle-manager-medium-0.druid-dev-8-middle-manager-medium.druid-dev-8.svc.cluster.local:8104', hostAndPort='druid-dev-8-middle-manager-medium-0.druid-dev-8-middle-manager-medium.druid-dev-8.svc.cluster.local:8104', hostAndTlsPort='null', maxSize=0, tier='_default_tier', type=indexer-executor, priority=0}]
2021-01-19T05:02:53,437 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='druid-dev-8-middle-manager-medium-0.druid-dev-8-middle-manager-medium.druid-dev-8.svc.cluster.local:8102', hostAndPort='druid-dev-8-middle-manager-medium-0.druid-dev-8-middle-manager-medium.druid-dev-8.svc.cluster.local:8102', hostAndTlsPort='null', maxSize=0, tier='_default_tier', type=indexer-executor, priority=0}]
2021-01-19T05:02:53,438 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='druid-dev-8-middle-manager-medium-0.druid-dev-8-middle-manager-medium.druid-dev-8.svc.cluster.local:8103', hostAndPort='druid-dev-8-middle-manager-medium-0.druid-dev-8-middle-manager-medium.druid-dev-8.svc.cluster.local:8103', hostAndTlsPort='null', maxSize=0, tier='_default_tier', type=indexer-executor, priority=0}]
2021-01-19T05:02:53,438 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='druid-dev-8-middle-manager-medium-0.druid-dev-8-middle-manager-medium.druid-dev-8.svc.cluster.local:8100', hostAndPort='druid-dev-8-middle-manager-medium-0.druid-dev-8-middle-manager-medium.druid-dev-8.svc.cluster.local:8100', hostAndTlsPort='null', maxSize=0, tier='_default_tier', type=indexer-executor, priority=0}]
2021-01-19T05:02:53,439 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='druid-dev-8-middle-manager-medium-0.druid-dev-8-middle-manager-medium.druid-dev-8.svc.cluster.local:8101', hostAndPort='druid-dev-8-middle-manager-medium-0.druid-dev-8-middle-manager-medium.druid-dev-8.svc.cluster.local:8101', hostAndTlsPort='null', maxSize=0, tier='_default_tier', type=indexer-executor, priority=0}]
2021-01-19T05:02:53,440 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='druid-dev-8-middle-manager-large-0.druid-dev-8-middle-manager-large.druid-dev-8.svc.cluster.local:8101', hostAndPort='druid-dev-8-middle-manager-large-0.druid-dev-8-middle-manager-large.druid-dev-8.svc.cluster.local:8101', hostAndTlsPort='null', maxSize=0, tier='_default_tier', type=indexer-executor, priority=0}]
2021-01-19T05:02:53,441 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='druid-dev-8-middle-manager-large-0.druid-dev-8-middle-manager-large.druid-dev-8.svc.cluster.local:8100', hostAndPort='druid-dev-8-middle-manager-large-0.druid-dev-8-middle-manager-large.druid-dev-8.svc.cluster.local:8100', hostAndTlsPort='null', maxSize=0, tier='_default_tier', type=indexer-executor, priority=0}]
2021-01-19T05:02:53,441 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='druid-dev-8-historical-0.druid-dev-8-historical.druid-dev-8.svc.cluster.local:8083', hostAndPort='druid-dev-8-historical-0.druid-dev-8-historical.druid-dev-8.svc.cluster.local:8083', hostAndTlsPort='null', maxSize=1900000000000, tier='_default_tier', type=historical, priority=0}]
2021-01-19T05:02:53,442 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='druid-dev-8-middle-manager-large-0.druid-dev-8-middle-manager-large.druid-dev-8.svc.cluster.local:8102', hostAndPort='druid-dev-8-middle-manager-large-0.druid-dev-8-middle-manager-large.druid-dev-8.svc.cluster.local:8102', hostAndTlsPort='null', maxSize=0, tier='_default_tier', type=indexer-executor, priority=0}]
2021-01-19T05:02:56,482 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - Inventory Initialized
2021-01-19T05:02:56,926 INFO [main] org.apache.druid.client.BrokerServerView - BrokerServerView initialized in [3,536] ms.
2021-01-19T05:02:56,983 INFO [main] org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager - Starting CoordinatorPollingBasicAuthenticatorCacheManager.
2021-01-19T05:02:56,984 INFO [main] org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager - Started CoordinatorPollingBasicAuthenticatorCacheManager.
2021-01-19T05:02:56,985 INFO [main] org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager - Starting CoordinatorPollingBasicAuthorizerCacheManager.
2021-01-19T05:02:56,986 INFO [main] org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager - Started CoordinatorPollingBasicAuthorizerCacheManager.
2021-01-19T05:02:56,988 INFO [NodeRoleWatcher[COORDINATOR]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Node[http://druid-dev-8-coordinator-0.druid-dev-8-coordinator.druid-dev-8.svc.cluster.local:8081] of role[coordinator] detected.
2021-01-19T05:02:56,989 INFO [NodeRoleWatcher[COORDINATOR]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Node watcher of role[coordinator] is now initialized.
2021-01-19T05:04:01,863 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='druid-dev-8-historical-1.druid-dev-8-historical.druid-dev-8.svc.cluster.local:8083', hostAndPort='druid-dev-8-historical-1.druid-dev-8-historical.druid-dev-8.svc.cluster.local:8083', hostAndTlsPort='null', maxSize=1900000000000, tier='_default_tier', type=historical, priority=0}]
2021-01-19T05:04:13,336 WARN [DruidSchema-Cache-0] org.apache.druid.server.QueryLifecycle - Exception while processing queryId [13ac5d21-6555-4353-8aa3-fb553e3dbc95] (QueryInterruptedException{msg=org.jboss.netty.channel.ChannelException: Faulty channel in resource pool, code=Unknown exception, class=java.util.concurrent.ExecutionException, host=druid-dev-8-historical-1.druid-dev-8-historical.druid-dev-8.svc.cluster.local:8083})
2021-01-19T05:04:13,336 WARN [DruidSchema-Cache-0] org.apache.druid.sql.calcite.schema.DruidSchema - Metadata refresh failed, trying again soon.
org.apache.druid.java.util.common.RE: Query[d372ab30-009d-40b4-b073-4e870e5194b4] url[http://druid-dev-8-historical-0.druid-dev-8-historical.druid-dev-8.svc.cluster.local:8083/druid/v2/] timed out.
	at org.apache.druid.client.DirectDruidClient$1.dequeue(DirectDruidClient.java:208) ~[druid-server-0.17.1.jar:0.17.1]
	at org.apache.druid.client.DirectDruidClient$1.access$800(DirectDruidClient.java:168) ~[druid-server-0.17.1.jar:0.17.1]
	at org.apache.druid.client.DirectDruidClient$1$2.nextElement(DirectDruidClient.java:287) ~[druid-server-0.17.1.jar:0.17.1]
	at org.apache.druid.client.DirectDruidClient$1$2.nextElement(DirectDruidClient.java:263) ~[druid-server-0.17.1.jar:0.17.1]
	at java.io.SequenceInputStream.nextStream(SequenceInputStream.java:110) ~[?:1.8.0_221]
	at java.io.SequenceInputStream.read(SequenceInputStream.java:211) ~[?:1.8.0_221]
	at com.fasterxml.jackson.dataformat.smile.SmileParser._loadMore(SmileParser.java:248) ~[jackson-dataformat-smile-2.10.1.jar:2.10.1]
	at com.fasterxml.jackson.dataformat.smile.SmileParser.nextToken(SmileParser.java:377) ~[jackson-dataformat-smile-2.10.1.jar:2.10.1]
	at org.apache.druid.client.JsonParserIterator.init(JsonParserIterator.java:160) ~[druid-server-0.17.1.jar:0.17.1]
	at org.apache.druid.client.JsonParserIterator.hasNext(JsonParserIterator.java:95) ~[druid-server-0.17.1.jar:0.17.1]
	at org.apache.druid.java.util.common.guava.BaseSequence.makeYielder(BaseSequence.java:89) ~[druid-core-0.17.1.jar:0.17.1]
	at org.apache.druid.java.util.common.guava.BaseSequence.toYielder(BaseSequence.java:69) ~[druid-core-0.17.1.jar:0.17.1]
	at org.apache.druid.java.util.common.guava.MappedSequence.toYielder(MappedSequence.java:49) ~[druid-core-0.17.1.jar:0.17.1]
	at org.apache.druid.java.util.common.guava.ParallelMergeCombiningSequence$ResultBatch.fromSequence(ParallelMergeCombiningSequence.java:847) ~[druid-core-0.17.1.jar:0.17.1]
	at org.apache.druid.java.util.common.guava.ParallelMergeCombiningSequence$SequenceBatcher.block(ParallelMergeCombiningSequence.java:897) ~[druid-core-0.17.1.jar:0.17.1]
	at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313) ~[?:1.8.0_221]
	at org.apache.druid.java.util.common.guava.ParallelMergeCombiningSequence$SequenceBatcher.getBatchYielder(ParallelMergeCombiningSequence.java:886) ~[druid-core-0.17.1.jar:0.17.1]
	at org.apache.druid.java.util.common.guava.ParallelMergeCombiningSequence$YielderBatchedResultsCursor.initialize(ParallelMergeCombiningSequence.java:993) ~[druid-core-0.17.1.jar:0.17.1]
	at org.apache.druid.java.util.common.guava.ParallelMergeCombiningSequence$PrepareMergeCombineInputsAction.compute(ParallelMergeCombiningSequence.java:702) ~[druid-core-0.17.1.jar:0.17.1]
	at java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189) ~[?:1.8.0_221]
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) ~[?:1.8.0_221]
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) ~[?:1.8.0_221]
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) ~[?:1.8.0_221]
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) ~[?:1.8.0_221]
	Suppressed: org.apache.druid.java.util.common.RE: Query[d372ab30-009d-40b4-b073-4e870e5194b4] url[http://druid-dev-8-historical-0.druid-dev-8-historical.druid-dev-8.svc.cluster.local:8083/druid/v2/] timed out.
		at org.apache.druid.client.DirectDruidClient$1.checkQueryTimeout(DirectDruidClient.java:415) ~[druid-server-0.17.1.jar:0.17.1]
		at org.apache.druid.client.DirectDruidClient$1.access$500(DirectDruidClient.java:168) ~[druid-server-0.17.1.jar:0.17.1]
		at org.apache.druid.client.DirectDruidClient$1$2.hasMoreElements(DirectDruidClient.java:270) ~[druid-server-0.17.1.jar:0.17.1]
		at java.io.SequenceInputStream.nextStream(SequenceInputStream.java:109) ~[?:1.8.0_221]
		at java.io.SequenceInputStream.close(SequenceInputStream.java:232) ~[?:1.8.0_221]
		at com.fasterxml.jackson.dataformat.smile.SmileParser._closeInput(SmileParser.java:309) ~[jackson-dataformat-smile-2.10.1.jar:2.10.1]
		at com.fasterxml.jackson.dataformat.smile.SmileParserBase.close(SmileParserBase.java:384) ~[jackson-dataformat-smile-2.10.1.jar:2.10.1]
		at org.apache.druid.client.JsonParserIterator.close(JsonParserIterator.java:182) ~[druid-server-0.17.1.jar:0.17.1]
		at org.apache.druid.java.util.common.guava.CloseQuietly.close(CloseQuietly.java:39) ~[druid-core-0.17.1.jar:0.17.1]
		at org.apache.druid.client.DirectDruidClient$3.cleanup(DirectDruidClient.java:504) ~[druid-server-0.17.1.jar:0.17.1]
		at org.apache.druid.client.DirectDruidClient$3.cleanup(DirectDruidClient.java:486) ~[druid-server-0.17.1.jar:0.17.1]
		at org.apache.druid.java.util.common.guava.BaseSequence.toYielder(BaseSequence.java:73) ~[druid-core-0.17.1.jar:0.17.1]
		at org.apache.druid.java.util.common.guava.MappedSequence.toYielder(MappedSequence.java:49) ~[druid-core-0.17.1.jar:0.17.1]
		at org.apache.druid.java.util.common.guava.ParallelMergeCombiningSequence$ResultBatch.fromSequence(ParallelMergeCombiningSequence.java:847) ~[druid-core-0.17.1.jar:0.17.1]
		at org.apache.druid.java.util.common.guava.ParallelMergeCombiningSequence$SequenceBatcher.block(ParallelMergeCombiningSequence.java:897) ~[druid-core-0.17.1.jar:0.17.1]
		at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313) ~[?:1.8.0_221]
		at org.apache.druid.java.util.common.guava.ParallelMergeCombiningSequence$SequenceBatcher.getBatchYielder(ParallelMergeCombiningSequence.java:886) ~[druid-core-0.17.1.jar:0.17.1]
		at org.apache.druid.java.util.common.guava.ParallelMergeCombiningSequence$YielderBatchedResultsCursor.initialize(ParallelMergeCombiningSequence.java:993) ~[druid-core-0.17.1.jar:0.17.1]
		at org.apache.druid.java.util.common.guava.ParallelMergeCombiningSequence$PrepareMergeCombineInputsAction.compute(ParallelMergeCombiningSequence.java:702) ~[druid-core-0.17.1.jar:0.17.1]
		at java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189) ~[?:1.8.0_221]
		at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) ~[?:1.8.0_221]
		at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) ~[?:1.8.0_221]
		at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) ~[?:1.8.0_221]
		at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) ~[?:1.8.0_221]

2021-01-19T05:04:13,525 INFO [main] org.apache.druid.sql.calcite.schema.DruidSchema - DruidSchema initialized in [76,537] ms.
2021-01-19T05:04:13,525 INFO [main] org.apache.druid.sql.calcite.schema.MetadataSegmentView - MetadataSegmentView Started.
2021-01-19T05:04:13,529 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='druid-dev-8-middle-manager-medium-0.druid-dev-8-middle-manager-medium.druid-dev-8.svc.cluster.local:8104', hostAndPort='druid-dev-8-middle-manager-medium-0.druid-dev-8-middle-manager-medium.druid-dev-8.svc.cluster.local:8104', hostAndTlsPort='null', maxSize=0, tier='_default_tier', type=indexer-executor, priority=0}]
2021-01-19T05:04:13,529 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='druid-dev-8-historical-1.druid-dev-8-historical.druid-dev-8.svc.cluster.local:8083', hostAndPort='druid-dev-8-historical-1.druid-dev-8-historical.druid-dev-8.svc.cluster.local:8083', hostAndTlsPort='null', maxSize=1900000000000, tier='_default_tier', type=historical, priority=0}]
2021-01-19T05:04:13,530 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='druid-dev-8-middle-manager-medium-0.druid-dev-8-middle-manager-medium.druid-dev-8.svc.cluster.local:8102', hostAndPort='druid-dev-8-middle-manager-medium-0.druid-dev-8-middle-manager-medium.druid-dev-8.svc.cluster.local:8102', hostAndTlsPort='null', maxSize=0, tier='_default_tier', type=indexer-executor, priority=0}]
2021-01-19T05:04:13,530 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='druid-dev-8-middle-manager-medium-0.druid-dev-8-middle-manager-medium.druid-dev-8.svc.cluster.local:8103', hostAndPort='druid-dev-8-middle-manager-medium-0.druid-dev-8-middle-manager-medium.druid-dev-8.svc.cluster.local:8103', hostAndTlsPort='null', maxSize=0, tier='_default_tier', type=indexer-executor, priority=0}]
2021-01-19T05:04:13,531 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='druid-dev-8-middle-manager-medium-0.druid-dev-8-middle-manager-medium.druid-dev-8.svc.cluster.local:8100', hostAndPort='druid-dev-8-middle-manager-medium-0.druid-dev-8-middle-manager-medium.druid-dev-8.svc.cluster.local:8100', hostAndTlsPort='null', maxSize=0, tier='_default_tier', type=indexer-executor, priority=0}]
2021-01-19T05:04:13,531 INFO [NodeRoleWatcher[OVERLORD]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Node[http://druid-dev-8-overlord-0.druid-dev-8-overlord.druid-dev-8.svc.cluster.local:8090] of role[overlord] detected.
2021-01-19T05:04:13,531 INFO [NodeRoleWatcher[OVERLORD]] org.apache.druid.curator.discovery.CuratorDruidNodeDiscoveryProvider$NodeRoleWatcher - Node watcher of role[overlord] is now initialized.
2021-01-19T05:04:13,532 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='druid-dev-8-middle-manager-medium-0.druid-dev-8-middle-manager-medium.druid-dev-8.svc.cluster.local:8101', hostAndPort='druid-dev-8-middle-manager-medium-0.druid-dev-8-middle-manager-medium.druid-dev-8.svc.cluster.local:8101', hostAndTlsPort='null', maxSize=0, tier='_default_tier', type=indexer-executor, priority=0}]
2021-01-19T05:04:13,532 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='druid-dev-8-middle-manager-large-0.druid-dev-8-middle-manager-large.druid-dev-8.svc.cluster.local:8101', hostAndPort='druid-dev-8-middle-manager-large-0.druid-dev-8-middle-manager-large.druid-dev-8.svc.cluster.local:8101', hostAndTlsPort='null', maxSize=0, tier='_default_tier', type=indexer-executor, priority=0}]
2021-01-19T05:04:13,532 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='druid-dev-8-middle-manager-large-0.druid-dev-8-middle-manager-large.druid-dev-8.svc.cluster.local:8100', hostAndPort='druid-dev-8-middle-manager-large-0.druid-dev-8-middle-manager-large.druid-dev-8.svc.cluster.local:8100', hostAndTlsPort='null', maxSize=0, tier='_default_tier', type=indexer-executor, priority=0}]
2021-01-19T05:04:13,533 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='druid-dev-8-historical-0.druid-dev-8-historical.druid-dev-8.svc.cluster.local:8083', hostAndPort='druid-dev-8-historical-0.druid-dev-8-historical.druid-dev-8.svc.cluster.local:8083', hostAndTlsPort='null', maxSize=1900000000000, tier='_default_tier', type=historical, priority=0}]
2021-01-19T05:04:13,533 INFO [ServerInventoryView-0] org.apache.druid.client.BatchServerInventoryView - New Server[DruidServerMetadata{name='druid-dev-8-middle-manager-large-0.druid-dev-8-middle-manager-large.druid-dev-8.svc.cluster.local:8102', hostAndPort='druid-dev-8-middle-manager-large-0.druid-dev-8-middle-manager-large.druid-dev-8.svc.cluster.local:8102', hostAndTlsPort='null', maxSize=0, tier='_default_tier', type=indexer-executor, priority=0}]
2021-01-19T05:04:13,550 INFO [main] org.apache.druid.query.lookup.LookupReferencesManager - Starting lookup lading process.
2021-01-19T05:04:13,550 INFO [main] org.apache.druid.query.lookup.LookupReferencesManager - Round of attempts #1, [5] lookups

…

As you can see, DruidSchema initialized in [76,537] ms. DruidSchema initialized is completed immediately, even though there are exceptions which need a retry. What's worse, Broker is started but queries served by this broker will fail.

This PR add a new judgment condition to fix it.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Key changed/added classes in this PR
  • DruidSchema.java

if (isServerViewInitialized) {
// lastFailure != 0L means exceptions happened before and there're some refresh work was not completed.
// so that even ServerView is initialized, we can't let broker complete initialization.
if (isServerViewInitialized && lastFailure == 0L) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is saying that we will mark as initialized when all of these are true:

(a) no refresh is needed
(b) the server view is initialized
(c) there wasn't a refresh that just failed

Sounds good to me - thanks for the fix!

Do you think you could add a test for this case too? Maybe something like DruidSchemaTest, but that creates a DruidSchema using a QueryLifecycleFactory where the first query fails fail, and then after that they start succeeding. We'd want to make sure that when awaitInitialization() returns, things are really initialized.

@zhangyue19921010
Copy link
Copy Markdown
Contributor Author

zhangyue19921010 commented Jan 20, 2021

Hi @gianm Thanks for your review. I just add a UT named testSchemaInit as you suggested.
If condition is if (isServerViewInitialized), this UT will fail with errors

java.lang.AssertionError: 
Expected :2
Actual   :0

If condition is isServerViewInitialized && lastFailure == 0L, this UT will success and DruidSchema is initialized as expected.
Also I test this pr in our dev cluster. It works fine.

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM after tests pass.

@suneet-s suneet-s merged commit 8c6153d into apache:master Jan 22, 2021
@zhangyue19921010 zhangyue19921010 deleted the broker-meta-fresh branch January 22, 2021 17:19
@zhangyue19921010
Copy link
Copy Markdown
Contributor Author

@gianm @suneet-s Thanks for your review and merge.

@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
sachinsagare added a commit to sachinsagare/druid that referenced this pull request Oct 17, 2022
Summary:
Pull upstream bug fix for broker not waiting for Druid schema to init

Context: For large Druid clusters with a lot of data sources, we see SQL queries hit exceptions of 'unkonwn Druid table ABC' right after the broker host is restarted or new host is up, this is because there is a bug that broker hosts don't wait for Druid schema to init before declaring itself as ready.

This PR has two changes:

Pull an upstream bug fix apache#10779 [Bug Fix] Broker will not wait for its SQL metadata view to fully initialize before starting up, even though set awaitInitializationOnStart true
This fix will make sure broker doesn't declare itself as ready before it init the schema for all the Druid tables: /status/health REST api will not prematurely return true and broker won't prematurely register itself in ZK.

Add a config to control the number of segments per query when doing metadata refresh.
Note that in the Druid schema init stage above, it does not guarantee to wait for metadata of ALL segments are refreshed. Rather, it waits for the first batch of segments' metadata for each data source is refreshed to make sure we have the latest schema ready then declares the Druid schema init as finished. The metadata refreshment of the remaining segments will be done asyncly afterwards. The metadata refreshment is also done from the newest segment to the oldest segment, so after the Druid schema init, we will have the schema for the latest segments, which is usually enough. The default batch size 15000 causes frequent time out from data nodes which delays Druid schema init and broker bootstrap. When a data source has 20000 segments, 15000 is not that different from 5000 if we only care about the latest schemas so it's OK to decrease.

Test Plan:
Tested on a canary host in Insights cluster,

The broker waits for all Druid data source's latest schemas to be ready before declaring itself as ready.
Changed the METADATA_REFRESH_MAX_SEGMENTS_PER_QUERY from 15000 to 5000, the Druid schema init decreased from 3.3 minutes to 18 seconds
Reviewers: O1139 Druid, itallam, yyang

Reviewed By: O1139 Druid, itallam, yyang

Subscribers: shawncao, realtime-analytics

Differential Revision: https://phabricator.pinadmin.com/D921875

Commit: 46de8df5fd30d2f15684da62278d616823d5dfcb
Signed-off-by: ssagare <ssagare@pinterest.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants