Decouple segment storage and serving on Historicals #8575

@leventov

Background

I've explored the proverbial decoupling of compute and storage in the context of time series data stores (read: Druid) in this post. The conclusion is that no technology currently available in the cloud landscape could serve as the storage layer while supporting the latency, cost, and throughput requirements of big Druid clusters.

Recently, @nishantmonu51 proposed cloud NFS such as Amazon EFS, Google Filestore, or Elastifile as the Storage. This type of technology comes really close but still doesn't quite hit the mark. The weak side of cloud NFS is download throughput.

Another fundamental problem with any external Storage implementation is its lack of insight into how Druid's data is typically accessed. For example, we would like the segments which are partitions within the same interval in the same data source to all be stored on different disks/"chunkservers" (or distributed among the disks as evenly as possible, if there are more partitions than disks) in the guts of the Storage, to maximize the effective throughput.

This insight has led us to the idea which is halfway between complete decoupling of compute and storage and the current architecture of Druid where there is a one-to-one correspondence between the disk storage and "compute" (memory and CPU) on Historicals.

Description

Implementing #4773 is a mandatory prerequisite.

Rather than having one logical "Historical => segments" mapping, Druid has two:

  • Storage "Historical => segments"
  • Serving "Historical => segments"

If a Historical stores a segment, it would usually be reasonable for that Historical to also serve the segment, but this is not required. The opposite does not hold: a Historical serving a segment may not store it. In principle, storage and serving are independent.
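Conceptually, the two mappings could be sketched like this (the class and field names below are hypothetical illustrations, not existing Druid APIs):

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the two logical mappings that Brokers would hold in memory.
final class SegmentAssignments
{
  // Historical host -> ids of segments it stores on local disk
  final Map<String, Set<String>> storage;
  // Historical host -> ids of segments it serves (possibly without storing them)
  final Map<String, Set<String>> serving;

  SegmentAssignments(Map<String, Set<String>> storage, Map<String, Set<String>> serving)
  {
    this.storage = storage;
    this.serving = serving;
  }

  /** True if the host can answer queries for the segment without a remote download. */
  boolean storesLocally(String host, String segmentId)
  {
    return storage.getOrDefault(host, Set.of()).contains(segmentId);
  }
}
```

A serving-only Historical would appear in the second map but not the first for the segments it merely serves.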

Brokers load both mappings into their memory, meaning a ~2x increase in their heap requirements. This definitely means that issues like #8165 would need to receive attention.

When CachingClusteredClient determines which serving Historicals to send subqueries to, it also determines the Historicals storing the queried segments and passes this information down to the serving Historicals in the query context.

When a Historical receives a query for a segment which it doesn't also store, it consults the query context for the Historicals which store the needed segment and sends a download query to one of them with the list of needed columns. Upon downloading, it may cache the columns in local memory as usual, or skip caching, according to the strategy (see #4773).
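The lookup on the serving side could look roughly like this (the context key "segmentStorageLocations" and the resolver class are made-up names for illustration):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of how a serving Historical could resolve which storing
// Historical to download a segment's columns from, using the query context.
final class DownloadSourceResolver
{
  /** Picks a random storing host for the segment, or null if none is known. */
  static String pickStoringHost(Map<String, Object> queryContext, String segmentId)
  {
    @SuppressWarnings("unchecked")
    Map<String, List<String>> locations =
        (Map<String, List<String>>) queryContext.get("segmentStorageLocations");
    if (locations == null) {
      return null;
    }
    List<String> hosts = locations.get(segmentId);
    if (hosts == null || hosts.isEmpty()) {
      return null;
    }
    // Random choice spreads download load across the storing replicas.
    return hosts.get(ThreadLocalRandom.current().nextInt(hosts.size()));
  }
}
```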

To facilitate segment downloads, very simple servers run alongside Historical Java processes (sidecars) which know only how to serve specific columns of segments. To avoid implementing Druid segment parsing in a native language and keeping it in sync with the Java implementation, the download queries may simply specify the offsets and lengths of the necessary columns within the segment files, assuming that the Historicals which serve the segment always cache the segment metadata/header.
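Since the serving Historical already knows each column's byte offset and length from the cached segment header, a download query reduces to plain byte ranges, which even an off-the-shelf HTTP server can satisfy. A minimal sketch (the class name is hypothetical) that renders such ranges as a standard HTTP Range header:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: a download query names byte ranges of columns within a
// segment file, so the native sidecar never needs to parse the Druid segment format.
final class ColumnRangeRequest
{
  final long offset;
  final long length;

  ColumnRangeRequest(long offset, long length)
  {
    this.offset = offset;
    this.length = length;
  }

  /** Renders the ranges as an HTTP Range header value (RFC 7233 byte ranges). */
  static String toRangeHeader(List<ColumnRangeRequest> ranges)
  {
    return "bytes=" + ranges.stream()
        .map(r -> r.offset + "-" + (r.offset + r.length - 1))
        .collect(Collectors.joining(","));
  }
}
```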

The sidecar server may be nginx, perhaps with a custom module if nginx doesn't already support what we need out of the box. Or it may be a really simple server written in C or Rust. In general, we want a sidecar server in a native language, rather than serving the download queries from the Historical Java process, to reduce the latency of download queries and improve reliability: problems in Historical Java processes should not affect the download layer, and vice versa.

Segment downloads must bypass the file cache: memory caching is the responsibility of the serving side, not the storage side.

Advantages

Decoupling segment storage and serving on Historicals would allow spinning up additional, serving-only Historicals at peak hours and scaling them down at low-traffic times: nights and weekends. Depending on the Druid usage pattern, the data, and the queries, this may cut a Druid cluster's TCO by about 50%.

This decoupling may also enable some interesting possibilities around broadcast segments (they may be served on all Historicals in the cluster but actually stored on just a few of them) and true distributed JOIN support, but I didn't explore this in detail.

Disadvantages

  • Segment balancing may become even more complex than it is now, because both storage and serving need to be balanced.
  • Doubling memory requirements on Brokers.
  • The need to support some non-Java code: a sidecar download server on Historicals.

FYI @drcrallen @gianm
