Decouple segment storage and serving on Historicals #8575

@leventov

Background

I've explored the proverbial decoupling of compute and storage in the context of time series data stores (read: Druid) in this post. The conclusion is that no technology currently available in the cloud landscape could serve as the storage layer while supporting the latency, cost, and throughput requirements of big Druid clusters.

Recently, @nishantmonu51 proposed cloud NFS such as Amazon EFS, Google Filestore, or Elastifile as the Storage. This type of technology comes really close but still doesn't quite hit the mark. The weak side of cloud NFS is download throughput.

Another fundamental problem with any external Storage implementation is its lack of insight into how Druid's data is typically accessed. For example, we would like the segments which are partitions within the same interval in the same data source to all be stored on different disks/"chunkservers" (or distributed among the disks as evenly as possible, if there are more partitions than disks) in the guts of the Storage, to maximize the effective throughput.

This insight has led us to the idea which is halfway between complete decoupling of compute and storage and the current architecture of Druid where there is a one-to-one correspondence between the disk storage and "compute" (memory and CPU) on Historicals.

Description

Implementing #4773 is a mandatory prerequisite.

Rather than having one logical "Historical => segments" mapping, Druid has two:

  • Storage "Historical => segments"
  • Serving "Historical => segments"

If a Historical stores a segment, it would usually be reasonable for that Historical to also serve the segment, but this is not required. The opposite does not hold: a Historical serving a segment may not store it. In principle, storage and serving are independent.
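Conceptually, the two mappings could be sketched like this (the class and field names below are hypothetical illustrations, not existing Druid APIs):

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the two logical mappings that Brokers would hold in memory.
final class SegmentAssignments
{
  // Historical host -> ids of segments it stores on local disk
  final Map<String, Set<String>> storage;
  // Historical host -> ids of segments it serves (possibly without storing them)
  final Map<String, Set<String>> serving;

  SegmentAssignments(Map<String, Set<String>> storage, Map<String, Set<String>> serving)
  {
    this.storage = storage;
    this.serving = serving;
  }

  /** True if the host can answer queries for the segment without a remote download. */
  boolean storesLocally(String host, String segmentId)
  {
    return storage.getOrDefault(host, Set.of()).contains(segmentId);
  }
}
```

A serving-only Historical would appear in the second map but not the first for the segments it merely serves.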

Brokers load both mappings into their memory, meaning a ~2x increase in their heap requirements. This definitely means that issues like #8165 would need to receive attention.

When CachingClusteredClient determines which serving Historicals to send subqueries to, it also determines the Historicals storing the queried segments and passes this information down to the serving Historicals in the query context.

When a Historical receives a query for a segment which it doesn't also store, it consults the query context for the Historicals which store the needed segment and sends a download query to one of them with the list of needed columns. Upon downloading, it may cache the columns in local memory as usual, or skip caching, according to the strategy (see #4773).
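The lookup on the serving side could look roughly like this (the context key "segmentStorageLocations" and the resolver class are made-up names for illustration):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of how a serving Historical could resolve which storing
// Historical to download a segment's columns from, using the query context.
final class DownloadSourceResolver
{
  /** Picks a random storing host for the segment, or null if none is known. */
  static String pickStoringHost(Map<String, Object> queryContext, String segmentId)
  {
    @SuppressWarnings("unchecked")
    Map<String, List<String>> locations =
        (Map<String, List<String>>) queryContext.get("segmentStorageLocations");
    if (locations == null) {
      return null;
    }
    List<String> hosts = locations.get(segmentId);
    if (hosts == null || hosts.isEmpty()) {
      return null;
    }
    // Random choice spreads download load across the storing replicas.
    return hosts.get(ThreadLocalRandom.current().nextInt(hosts.size()));
  }
}
```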

To facilitate segment downloads, very simple servers run alongside Historical Java processes (sidecars) which know only how to serve specific columns of segments. To avoid implementing Druid segment parsing in a native language and keeping it in sync with the Java implementation, the download queries may simply specify the offsets and lengths of the necessary columns within the segment files, assuming that the Historicals which serve the segment always cache the segment metadata/header.
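Since the serving Historical already knows each column's byte offset and length from the cached segment header, a download query reduces to plain byte ranges, which even an off-the-shelf HTTP server can satisfy. A minimal sketch (the class name is hypothetical) that renders such ranges as a standard HTTP Range header:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: a download query names byte ranges of columns within a
// segment file, so the native sidecar never needs to parse the Druid segment format.
final class ColumnRangeRequest
{
  final long offset;
  final long length;

  ColumnRangeRequest(long offset, long length)
  {
    this.offset = offset;
    this.length = length;
  }

  /** Renders the ranges as an HTTP Range header value (RFC 7233 byte ranges). */
  static String toRangeHeader(List<ColumnRangeRequest> ranges)
  {
    return "bytes=" + ranges.stream()
        .map(r -> r.offset + "-" + (r.offset + r.length - 1))
        .collect(Collectors.joining(","));
  }
}
```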

The sidecar server may be nginx, perhaps with a custom module if nginx doesn't already support what we need out of the box. Or it may be a really simple server written in C or Rust. In general, we want a sidecar server in a native language, rather than serving the download queries from the Historical Java process, to reduce the latency of download queries and improve reliability: problems in Historical Java processes should not affect the download layer, and vice versa.

Segment downloads must bypass the file cache: memory caching is the responsibility of the serving side, not the storage side.

Advantages

Decoupling segment storage and serving on Historicals would allow spinning up additional, serving-only Historicals at peak hours and scaling them down at low-traffic times: nights and weekends. Depending on the Druid usage pattern, the data, and the queries, this may cut a Druid cluster's TCO by about 50%.

This decoupling may also enable some interesting possibilities around broadcast segments (they may be served on all Historicals in the cluster but actually stored on just a few of them) and true distributed JOIN support, but I didn't explore this in detail.

Disadvantages

  • Segment balancing may become even more complex than it is now, because both storage and serving need to be balanced.
  • Doubling memory requirements on Brokers.
  • The need to support some non-Java code: a sidecar download server on Historicals.

FYI @drcrallen @gianm
