Scan query gracefully closing connections which results in partial data read from datasource #11422

@bharadwajrembar

Affected Version

Druid 0.19.1

Description

The scenario arises when we issue a Scan query to read all rows for a given day (typically around 1 million records; the query takes around 10-15 minutes to complete) from the datasource.

We observe that if there are N records for that given day, the query sometimes returns fewer than N records and throws no errors or timeouts, resulting in a data mismatch. We believe Druid is closing the connection gracefully. We have also set a very high query timeout (around 60 minutes) on all of our components.
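For reference, a minimal sketch of the kind of native Scan query we run; the datasource name and interval below are placeholders, and the context timeout mirrors our druid.server.http.defaultQueryTimeout of 3600000 ms:

```python
import json

# Placeholder datasource and day interval -- substitute your own.
scan_query = {
    "queryType": "scan",
    "dataSource": "my_datasource",
    "intervals": ["2021-06-01T00:00:00Z/2021-06-02T00:00:00Z"],
    "resultFormat": "compactedList",
    "batchSize": 20480,
    # No "limit" key: we want every row for the day.
    "context": {"timeout": 3600000},  # 60 min, matching our server config
}

# This JSON body is POSTed to the broker/router at /druid/v2
print(json.dumps(scan_query, indent=2))
```

Counting the rows streamed back from this query is how we detect the mismatch: the received total is sometimes less than the known N for the day.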

Cluster size

6 brokers - m5.2xlarge
15 historicals - i3.2xlarge
2 routers - t3.small

Configurations

Broker

druid.broker.balancer.type=connectionCount
druid.broker.http.maxQueuedBytes=10000000
druid.broker.http.numConnections=50
druid.broker.http.readTimeout=PT40M
druid.broker.http.unusedConnectionTimeout=PT35M
druid.broker.retryPolicy.numTries=2
druid.broker.segment.watchedTiers=["ingestion_tier"]
druid.broker.select.tier=custom
druid.broker.select.tier.custom.priorities=[1]
druid.plaintextPort=11025
druid.processing.buffer.sizeBytes=750000000
druid.processing.numMergeBuffers=6
druid.processing.numThreads=1
druid.query.groupBy.maxOnDiskStorage=10000000000
druid.server.http.defaultQueryTimeout=3600000
druid.server.http.maxIdleTime=PT30M
druid.server.http.numThreads=60
druid.service=druid/broker-ingestion
druid.processing.tmpDir=/mnt/data/processing

Historical

druid.plaintextPort=11026
druid.processing.buffer.sizeBytes=500000000
druid.processing.numMergeBuffers=4
druid.processing.numThreads=15
druid.query.groupBy.maxOnDiskStorage=10000000000
druid.segmentCache.numBootstrapThreads=30
druid.segmentCache.numLoadingThreads=4
druid.server.http.defaultQueryTimeout=3600000
druid.server.http.maxIdleTime=PT30M
druid.server.http.numThreads=250
druid.server.maxSize=1850000000000
druid.server.priority=1
druid.server.tier=ingestion_tier
druid.service=historical-ingestion
druid.processing.tmpDir=/mnt/data/processing
druid.segmentCache.locations=[{"path":"/mnt/data/segment-cache","maxSize":1850000000000 }]

Router

druid.plaintextPort=11024
druid.router.coordinatorServiceName=druid/coordinator
druid.router.defaultBrokerServiceName=druid/broker-ingestion
druid.router.http.numConnections=100
druid.router.http.numMaxThreads=200
druid.router.http.readTimeout=PT55M
druid.router.managementProxy.enabled=true
druid.router.tierToBrokerMap={"ingestion_tier":"druid/broker-ingestion"}
druid.server.http.numThreads=200
druid.service=druid/router-ingestion

I've posted more context in the comments below about how we stress-tested Druid by running parallel Scan queries over a time interval.
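The stress test can be sketched as splitting the day into sub-intervals and scanning them in parallel, then comparing the summed row count against the expected N. This is an illustrative outline, not our exact harness; fetch_rows is a stub standing in for the actual HTTP call to the broker:

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta

def split_interval(start, end, pieces):
    """Split [start, end) into `pieces` equal ISO-8601 sub-intervals."""
    step = (end - start) / pieces
    bounds = [start + step * i for i in range(pieces + 1)]
    return [f"{a.isoformat()}Z/{b.isoformat()}Z" for a, b in zip(bounds, bounds[1:])]

def fetch_rows(interval):
    # Stub: in the real test this POSTs a Scan query for `interval`
    # to the broker and returns the number of rows received.
    raise NotImplementedError

day = datetime(2021, 6, 1)
intervals = split_interval(day, day + timedelta(days=1), 24)  # hourly chunks

# Real harness (requires a live cluster):
# with ThreadPoolExecutor(max_workers=8) as ex:
#     total = sum(ex.map(fetch_rows, intervals))
# assert total == expected_row_count_for_day
```

Running many such scans concurrently is how we reproduce the partial-read behavior under load.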
