Scan query gracefully closing connections which results in partial data read from datasource #11422

@bharadwajrembar

Affected Version

Druid 0.19.1

Description

The scenario arises when we issue a Scan query to read all rows for a given day (typically around 1 million records; the query takes around 10-15 minutes to complete) from the datasource.

We observe that if there are N records for that given day, the query sometimes returns fewer than N records and throws no errors or timeouts, resulting in a data mismatch. We believe Druid is closing the connection gracefully. We have also set a very high query timeout (around 60 minutes) on all of our components.
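For reference, a minimal sketch of the kind of native Scan query we run; the datasource name and interval below are placeholders, and the context timeout mirrors our druid.server.http.defaultQueryTimeout of 3600000 ms:

```python
import json

# Placeholder datasource and day interval -- substitute your own.
scan_query = {
    "queryType": "scan",
    "dataSource": "my_datasource",
    "intervals": ["2021-06-01T00:00:00Z/2021-06-02T00:00:00Z"],
    "resultFormat": "compactedList",
    "batchSize": 20480,
    # No "limit" key: we want every row for the day.
    "context": {"timeout": 3600000},  # 60 min, matching our server config
}

# This JSON body is POSTed to the broker/router at /druid/v2
print(json.dumps(scan_query, indent=2))
```

Counting the rows streamed back from this query is how we detect the mismatch: the received total is sometimes less than the known N for the day.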

Cluster size

6 brokers - m5.2xlarge
15 historicals - i3.2xlarge
2 routers - t3.small

Configurations

Broker

druid.broker.balancer.type=connectionCount
druid.broker.http.maxQueuedBytes=10000000
druid.broker.http.numConnections=50
druid.broker.http.readTimeout=PT40M
druid.broker.http.unusedConnectionTimeout=PT35M
druid.broker.retryPolicy.numTries=2
druid.broker.segment.watchedTiers=["ingestion_tier"]
druid.broker.select.tier=custom
druid.broker.select.tier.custom.priorities=[1]
druid.plaintextPort=11025
druid.processing.buffer.sizeBytes=750000000
druid.processing.numMergeBuffers=6
druid.processing.numThreads=1
druid.query.groupBy.maxOnDiskStorage=10000000000
druid.server.http.defaultQueryTimeout=3600000
druid.server.http.maxIdleTime=PT30M
druid.server.http.numThreads=60
druid.service=druid/broker-ingestion
druid.processing.tmpDir=/mnt/data/processing

Historical

druid.plaintextPort=11026
druid.processing.buffer.sizeBytes=500000000
druid.processing.numMergeBuffers=4
druid.processing.numThreads=15
druid.query.groupBy.maxOnDiskStorage=10000000000
druid.segmentCache.numBootstrapThreads=30
druid.segmentCache.numLoadingThreads=4
druid.server.http.defaultQueryTimeout=3600000
druid.server.http.maxIdleTime=PT30M
druid.server.http.numThreads=250
druid.server.maxSize=1850000000000
druid.server.priority=1
druid.server.tier=ingestion_tier
druid.service=historical-ingestion
druid.processing.tmpDir=/mnt/data/processing
druid.segmentCache.locations=[{"path":"/mnt/data/segment-cache","maxSize":1850000000000 }]

Router

druid.plaintextPort=11024
druid.router.coordinatorServiceName=druid/coordinator
druid.router.defaultBrokerServiceName=druid/broker-ingestion
druid.router.http.numConnections=100
druid.router.http.numMaxThreads=200
druid.router.http.readTimeout=PT55M
druid.router.managementProxy.enabled=true
druid.router.tierToBrokerMap={"ingestion_tier":"druid/broker-ingestion"}
druid.server.http.numThreads=200
druid.service=druid/router-ingestion

I've posted more context in the comments below about how we stress-tested Druid by running parallel Scan queries over a time interval.
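The stress test can be sketched as splitting the day into sub-intervals and scanning them in parallel, then comparing the summed row count against the expected N. This is an illustrative outline, not our exact harness; fetch_rows is a stub standing in for the actual HTTP call to the broker:

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta

def split_interval(start, end, pieces):
    """Split [start, end) into `pieces` equal ISO-8601 sub-intervals."""
    step = (end - start) / pieces
    bounds = [start + step * i for i in range(pieces + 1)]
    return [f"{a.isoformat()}Z/{b.isoformat()}Z" for a, b in zip(bounds, bounds[1:])]

def fetch_rows(interval):
    # Stub: in the real test this POSTs a Scan query for `interval`
    # to the broker and returns the number of rows received.
    raise NotImplementedError

day = datetime(2021, 6, 1)
intervals = split_interval(day, day + timedelta(days=1), 24)  # hourly chunks

# Real harness (requires a live cluster):
# with ThreadPoolExecutor(max_workers=8) as ex:
#     total = sum(ex.map(fetch_rows, intervals))
# assert total == expected_row_count_for_day
```

Running many such scans concurrently is how we reproduce the partial-read behavior under load.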
