diff --git a/CHANGELOG.md b/CHANGELOG.md index 059f3039b2f..cc58a87b22f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -16,10 +16,15 @@ * [FEATURE] Added `global` ingestion rate limiter strategy. Deprecated `-distributor.limiter-reload-period` flag. #1766 * [FEATURE] Added support for Microsoft Azure blob storage to be used for storing chunk data. #1913 * [FEATURE] Added readiness probe endpoint`/ready` to queriers. #1934 +* [FEATURE] EXPERIMENTAL: Added `/series` API endpoint support with TSDB blocks storage. #1830 * [ENHANCEMENT] Added `password` and `enable_tls` options to redis cache configuration. Enables usage of Microsoft Azure Cache for Redis service. * [BUGFIX] Fixed unnecessary CAS operations done by the HA tracker when the jitter is enabled. #1861 * [BUGFIX] Fixed #1904 ingesters getting stuck in a LEAVING state after coming up from an ungraceful exit. #1921 - +* [BUGFIX] TSDB: Fixed handling of out of order/bound samples in ingesters with the experimental TSDB blocks storage. #1864 +* [BUGFIX] TSDB: Fixed querying ingesters in `LEAVING` state with the experimental TSDB blocks storage. #1854 +* [BUGFIX] TSDB: Fixed error handling in the series to chunks conversion with the experimental TSDB blocks storage. #1837 +* [BUGFIX] TSDB: Fixed TSDB creation conflict with blocks transfer in a `JOINING` ingester with the experimental TSDB blocks storage. #1818 + ## 0.4.0 / 2019-12-02 * [CHANGE] The frontend component has been refactored to be easier to re-use. When upgrading the frontend, cache entries will be discarded and re-created with the new protobuf schema. #1734 @@ -39,6 +44,7 @@ * [FEATURE] EXPERIMENTAL: Use TSDB in the ingesters & flush blocks to S3/GCS ala Thanos. This will let us use an Object Store more efficiently and reduce costs. #1695 * [FEATURE] Allow Query Frontend to log slow queries with `frontend.log-queries-longer-than`. 
#1744 * [FEATURE] Add HTTP handler to trigger ingester flush & shutdown - used when running as a stateful set with the WAL enabled. #1746 +* [FEATURE] EXPERIMENTAL: Added GCS support to TSDB blocks storage. #1772 * [ENHANCEMENT] Reduce memory allocations in the write path. #1706 * [ENHANCEMENT] Consul client now follows recommended practices for blocking queries wrt returned Index value. #1708 * [ENHANCEMENT] Consul client can optionally rate-limit itself during Watch (used e.g. by ring watchers) and WatchPrefix (used by HA feature) operations. Rate limiting is disabled by default. New flags added: `--consul.watch-rate-limit`, and `--consul.watch-burst-size`. #1708 @@ -48,6 +54,7 @@ * [BUGFIX] Fix bug where duplicate labels can be returned through metadata APIs. #1790 * [BUGFIX] Fix reading of old, v3 chunk data. #1779 * [BUGFIX] Now support IAM roles in service accounts in AWS EKS. #1803 +* [BUGFIX] Fixed duplicated series returned when querying both ingesters and store with the experimental TSDB blocks storage. #1778 In this release we updated the following dependencies: - gRPC v1.25.0 (resulted in a drop of 30% CPU usage when compression is on) diff --git a/docs/architecture.md b/docs/architecture.md index 58db734c9f0..a1bf5d4affb 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -1,6 +1,6 @@ --- title: "Cortex Architecture" -linkTitle: "Cortex Architecture" +linkTitle: "Architecture" weight: 4 slug: architecture --- @@ -15,139 +15,214 @@ Prometheus instances scrape samples from various targets and then push them to C Cortex requires that each HTTP request bear a header specifying a tenant ID for the request. Request authentication and authorization are handled by an external reverse proxy. -Incoming samples (writes from Prometheus) are handled by the [distributor](#distributor) while incoming reads (PromQL queries) are handled by the [query frontend](#query-frontend). 
+Incoming samples (writes from Prometheus) are handled by the [distributor](#distributor) while incoming reads (PromQL queries) are handled by the [querier](#querier) or optionally by the [query frontend](#query-frontend). + +## Storage + +Cortex currently supports two storage engines to store and query the time series: + +- Chunks (default, stable) +- Blocks (experimental) + +The two engines mostly share the same Cortex architecture, with a few differences outlined in the rest of the document. + +### Chunks storage (default) + +The chunks storage stores each single time series in a separate object called a _Chunk_. Each Chunk contains the samples for a given period (defaults to 12 hours). Chunks are then indexed by time range and labels, in order to provide a fast lookup across many (potentially millions of) Chunks. + +For this reason, the chunks storage consists of: + +* An index for the Chunks. This index can be backed by: + * [Amazon DynamoDB](https://aws.amazon.com/dynamodb) + * [Google Bigtable](https://cloud.google.com/bigtable) + * [Apache Cassandra](https://cassandra.apache.org) +* An object store for the Chunk data itself, which can be: + * [Amazon DynamoDB](https://aws.amazon.com/dynamodb) + * [Google Bigtable](https://cloud.google.com/bigtable) + * [Apache Cassandra](https://cassandra.apache.org) + * [Amazon S3](https://aws.amazon.com/s3) + * [Google Cloud Storage](https://cloud.google.com/storage/) + +Internally, access to the chunks storage relies on a unified interface called the "chunks store". Unlike other Cortex components, the chunk store is not a separate service, but rather a library embedded in the services that need to access the long-term storage: [ingester](#ingester), [querier](#querier) and [ruler](#ruler). + +The chunk and index formats are versioned; this allows Cortex operators to upgrade the cluster to take advantage of new features and improvements.
This strategy enables changes in the storage format without requiring any downtime or complex procedures to rewrite the stored data. A set of schemas is used to map the version while reading and writing time series belonging to a specific period of time. + +The current schema recommendation is the **v10 schema** (v11 is still experimental). For more information about the schema, please check out the [Schema](guides/running.md#schema) documentation. + +### Blocks storage (experimental) + +The blocks storage is based on [Prometheus TSDB](https://prometheus.io/docs/prometheus/latest/storage/): it stores each tenant's time series into its own TSDB, which writes out the series to an on-disk Block (defaults to 2h block range periods). Each Block is composed of a few files storing the chunks and the block index. + +The TSDB chunk files contain the samples for multiple series. The series inside the Chunks are then indexed by a per-block index, which indexes metric names and labels to time series in the chunk files. + +The blocks storage doesn't require a dedicated storage backend for the index. The only requirement is an object store for the Block files, which can be: + +* [Amazon S3](https://aws.amazon.com/s3) +* [Google Cloud Storage](https://cloud.google.com/storage/) + +For more information, please check out the [Blocks storage](operations/blocks-storage.md) documentation. ## Services -Cortex has a service-based architecture, in which the overall system is split up into a variety of components that perform specific tasks and run separately (and potentially in parallel). +Cortex has a service-based architecture, in which the overall system is split up into a variety of components that each perform a specific task and run separately, in parallel. Cortex can alternatively run in a single process mode, where all components are executed within a single process. The single process mode is particularly handy for local testing and development.
Cortex is, for the most part, a shared-nothing system. Each layer of the system can run multiple instances of each component and they don't coordinate or communicate with each other within that layer. +The Cortex services are: + +- [Distributor](#distributor) +- [Ingester](#ingester) +- [Querier](#querier) +- [Query frontend](#query-frontend) (optional) +- [Ruler](#ruler) (optional) +- [Alertmanager](#alertmanager) (optional) + ### Distributor -The **distributor** service is responsible for handling samples written by Prometheus. It's essentially the "first stop" in the write path for Prometheus samples. Once the distributor receives samples from Prometheus, it splits them into batches and then sends them to multiple [ingesters](#ingester) in parallel. +The **distributor** service is responsible for handling incoming samples from Prometheus. It's the first stop in the write path for series samples. Once the distributor receives samples from Prometheus, each sample is validated for correctness and checked against the configured tenant limits, falling back to the defaults if limits have not been overridden for the specific tenant. Valid samples are then split into batches and sent to multiple [ingesters](#ingester) in parallel. -Distributors communicate with ingesters via [gRPC](https://grpc.io). They are stateless and can be scaled up and down as needed. +The validation done by the distributor includes: -If the HA Tracker is enabled, the Distributor will deduplicate incoming samples that contain both a cluster and replica label. It talks to a KVStore to store state about which replica per cluster it's accepting samples from for a given user ID. Samples with one or neither of these labels will be accepted by default.
+- The metric label names are formally correct +- The configured max number of labels per metric is respected +- The configured max length of a label name and value is respected +- The timestamp is not older/newer than the configured min/max time range -#### Hashing +Distributors are **stateless** and can be scaled up and down as needed. + +#### High Availability Tracker + +The distributor features a **High Availability (HA) Tracker**. When enabled, the distributor deduplicates incoming samples from redundant Prometheus servers. This allows you to run multiple HA replicas of the same Prometheus servers, all writing the same series to Cortex, and have the Cortex distributor deduplicate these series. -Distributors use consistent hashing, in conjunction with the (configurable) replication factor, to determine *which* instances of the ingester service receive each sample. +The HA Tracker deduplicates incoming samples based on a cluster and replica label. The cluster label uniquely identifies the cluster of redundant Prometheus servers for a given tenant, while the replica label uniquely identifies the replica within the Prometheus cluster. Incoming samples are considered duplicated - and thus dropped - if received by any replica which is not the current primary within a cluster. -The hash itself is based on one of two schemes: +The HA Tracker requires a key-value (KV) store to coordinate which replica is currently elected. The distributor will only accept samples from the current leader. Samples carrying only one or neither of the two labels are accepted by default and never deduplicated. -1. The metric name and tenant ID +The supported KV stores for the HA tracker are: -2. All the series labels and tenant ID -The trade-off associated with the latter is that writes are more balanced but they must involve every ingester in each query.
+* [Consul](https://www.consul.io) +* [Etcd](https://etcd.io) + +For more information, please refer to [config for sending HA pairs data to Cortex](guides/ha-pair-handling.md) in the documentation. + +#### Hashing -> This hashing scheme was originally chosen to reduce the number of required ingesters on the query path. The trade-off, however, is that the write load on the ingesters is less even. +Distributors use consistent hashing, in conjunction with a configurable replication factor, to determine which ingester instance(s) should receive a given series. + +Cortex supports two hashing strategies: + +1. Hash the metric name and tenant ID (default) +2. Hash the metric name, labels and tenant ID (enabled with `-distributor.shard-by-all-labels=true`) + +The trade-off associated with the latter is that writes are more balanced across ingesters, but each query needs to talk to every ingester, since a metric could be spread across multiple ingesters given different label sets. #### The hash ring -A consistent hash ring is stored in [Consul](https://www.consul.io/) as a single key-value pair, with the ring data structure also encoded as a [Protobuf](https://developers.google.com/protocol-buffers/) message. The consistent hash ring consists of a list of tokens and ingesters. Hashed values are looked up in the ring; the replication set is built for the closest unique ingesters by token. One of the benefits of this system is that adding and remove ingesters results in only 1/_N_ of the series being moved (where _N_ is the number of ingesters). +A hash ring - stored in a key-value (KV) store - is used to achieve consistent hashing for the series sharding and replication across the ingesters. All [ingesters](#ingester) register themselves into the hash ring with a set of tokens they own; each token is a random unsigned 32-bit number.
Each incoming series is [hashed](#hashing) in the distributor and then pushed to the ingester owning the token range for the series hash, plus the N-1 subsequent ingesters in the ring, where N is the replication factor. + +To do the hash lookup, distributors find the smallest appropriate token whose value is larger than the [hash of the series](#hashing). When the replication factor is larger than 1, the next subsequent tokens (clockwise in the ring) that belong to different ingesters will also be included in the result. + +The effect of this hash setup is that each token that an ingester owns is responsible for a range of hashes. If there are three tokens with values 0, 25, and 50, then a hash of 3 would be given to the ingester that owns the token 25; the ingester owning token 25 is responsible for the hash range of 1-25. + +The supported KV stores for the hash ring are: + +* [Consul](https://www.consul.io) +* [Etcd](https://etcd.io) +* Gossip [memberlist](https://github.com/hashicorp/memberlist) (experimental) #### Quorum consistency -All distributors share access to the same hash ring, which means that write requests can be sent to any distributor. +Since all distributors share access to the same hash ring, write requests can be sent to any distributor and you can set up a stateless load balancer in front of them. -To ensure consistent query results, Cortex uses [Dynamo](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf)-style quorum consistency on reads and writes. +To ensure consistent query results, Cortex uses [Dynamo-style](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) quorum consistency on reads and writes.
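The token lookup described for the hash ring can be sketched in a few lines. This is an illustrative Python sketch only (not Cortex code; the function and ingester names are hypothetical), using the tokens 0, 25 and 50 from the example above:

```python
import bisect

def replication_set(series_hash, ring, replication_factor):
    """Return the ingesters that own `series_hash`.

    `ring` is a list of (token, ingester) pairs sorted by token;
    tokens are random unsigned 32-bit numbers.
    """
    tokens = [token for token, _ in ring]
    # Find the smallest token larger than the series hash,
    # wrapping around the ring if needed.
    start = bisect.bisect_right(tokens, series_hash) % len(ring)
    ingesters = []
    for i in range(len(ring)):
        _, ingester = ring[(start + i) % len(ring)]
        # Walk clockwise, collecting distinct ingesters only.
        if ingester not in ingesters:
            ingesters.append(ingester)
        if len(ingesters) == replication_factor:
            break
    return ingesters

ring = sorted([(0, "ingester-1"), (25, "ingester-2"), (50, "ingester-3")])
print(replication_set(3, ring, 2))  # ['ingester-2', 'ingester-3']
```

With a replication factor of 2, hash 3 lands on the owner of token 25 and the write is also replicated to the next ingester clockwise in the ring.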
This means that the distributor will wait for a positive response from at least one half plus one of the ingesters the sample was sent to, before successfully responding to the Prometheus write request. #### Load balancing across distributors -We recommend randomly load balancing write requests across distributor instances, ideally by running the distributors as a Kubernetes [Service](https://kubernetes.io/docs/concepts/services-networking/service/). +We recommend randomly load balancing write requests across distributor instances. For example, if you're running Cortex in a Kubernetes cluster, you could run the distributors as a Kubernetes [Service](https://kubernetes.io/docs/concepts/services-networking/service/). ### Ingester -The **ingester** service is responsible for writing sample data to long-term storage backends (DynamoDB, S3, Cassandra, etc.). +The **ingester** service is responsible for writing incoming series to a [long-term storage backend](#storage) on the write path and returning in-memory series samples for queries on the read path. -Samples from each timeseries are built up in "chunks" in memory inside each ingester, then flushed to the [chunk store](#chunk-store). By default each chunk is up to 12 hours long. +Incoming series are not immediately written to the storage but kept in memory and periodically flushed to the storage (by default, 12 hours for the chunks storage and 2 hours for the experimental blocks storage). For this reason, the [queriers](#querier) may need to fetch samples both from ingesters and long-term storage while executing a query on the read path. -If an ingester process crashes or exits abruptly, all the data that has not yet been flushed will be lost. Cortex is usually configured to hold multiple (typically 3) replicas of each timeseries to mitigate this risk. +Ingesters contain a **lifecycler** which manages the lifecycle of an ingester and stores the **ingester state** in the [hash ring](#the-hash-ring).
Each ingester can be in one of the following states: -A [hand-over process](ingester-handover.md) manages the state when ingesters are added, removed or replaced. +1. `PENDING` is an ingester's state when it has just started and is waiting for a hand-over from another ingester that is `LEAVING`. If no hand-over occurs within the configured timeout period, the ingester will join the ring with a new set of random tokens (e.g. during a scale-up). -#### Write de-amplification +2. `JOINING` is an ingester's state when it is currently inserting its tokens into the ring and initializing itself. While in this state, the ingester may receive series and tokens from another `LEAVING` ingester, as part of the hand-over process. -Ingesters store the last 12 hours worth of samples in order to perform **write de-amplification**, i.e. batching and compressing samples for the same series and flushing them out to the [chunk store](#chunk-store). Under normal operations, there should be *many* orders of magnitude fewer operations per second (OPS) worth of writes to the chunk store than to the ingesters. +3. `ACTIVE` is an ingester's state when it is fully initialized. It may receive both write and read requests for tokens it owns. -Write de-amplification is the main source of Cortex's low total cost of ownership (TCO). +4. `LEAVING` is an ingester's state when it is shutting down. It no longer receives write requests, but it may still receive read requests for series it has in memory. While in this state, the ingester may look for a `JOINING` ingester with which to engage in a hand-over process, used to transfer the `LEAVING` ingester's state to a `JOINING` one (e.g. during a rolling update). -### Ruler +5. `UNHEALTHY` is an ingester's state when it has failed to heartbeat to the ring's KV Store. While in this state, distributors skip the ingester while building the replication set for incoming series and the ingester does not receive write or read requests.
-Ruler executes PromQL queries for Recording Rules and Alerts. Ruler -is configured from a database, so that different rules can be set for -each tenant. +For more information about the hand-over process, please check out the [Ingester hand-over](guides/ingester-handover.md) documentation. -All the rules for one instance are executed as a group, then -rescheduled to be executed again 15 seconds later. Execution is done -by a 'worker' running on a goroutine - if you don't have enough -workers then the ruler will lag behind. +Ingesters are **semi-stateful**. -Ruler can be scaled horizontally. #### Ingesters failure and data loss -### AlertManager +If an ingester process crashes or exits abruptly, all the in-memory series that have not yet been flushed to the long-term storage will be lost. There are two main ways to mitigate this failure mode: -AlertManager is responsible for accepting alert notifications from -Ruler, grouping them, and passing on to a notification channel such as -email, PagerDuty, etc. +1. Replication +2. Write-ahead log (WAL) -Like the Ruler, AlertManager is configured per-tenant in a database. +The **replication** is used to hold multiple (typically 3) replicas of each time series in the ingesters. If the Cortex cluster loses an ingester, the in-memory series held by the lost ingester are also replicated to at least one other ingester. In the event of a single ingester failure no time series samples will be lost, while in the event of a multiple-ingester failure time series may be lost if the failure affects all the ingesters holding the replicas of a specific time series. -[Upstream Docs](https://prometheus.io/docs/alerting/alertmanager/). +The **write-ahead log** (WAL) is used to write all incoming series samples to a persistent local disk until they're flushed to the long-term storage. In the event of an ingester failure, a subsequent process restart will replay the WAL and recover the in-memory series samples.
-### Query frontend +Unlike replication alone, and provided the persistent local disk data is not lost, in the event of a multiple-ingester failure each ingester will recover the in-memory series samples from the WAL upon a subsequent restart. Replication is still recommended, in order to avoid temporary failures on the read path in the event of a single ingester failure. -The **query frontend** is an optional service that accepts HTTP requests, queues them by tenant ID, and retries in case of errors. +The WAL for the chunks storage is an experimental feature (disabled by default), while it's always enabled for the blocks storage. -> The query frontend is completely optional; you can use queriers directly. To use the query frontend, direct incoming authenticated reads at them and set the `-querier.frontend-address` flag on the queriers. +#### Ingesters write de-amplification -#### Queueing +Ingesters store recently received samples in-memory in order to perform write de-amplification. If the ingesters immediately wrote received samples to the long-term storage, the system would be very difficult to scale due to the very high pressure on the storage. For this reason, the ingesters batch and compress samples in-memory and periodically flush them out to the storage. -Queuing performs a number of functions for the query frontend: +Write de-amplification is the main source of Cortex's low total cost of ownership (TCO). -* It ensures that large queries that cause an out-of-memory (OOM) error in the querier will be retried. This allows administrators to under-provision memory for queries, or optimistically run more small queries in parallel, which helps to reduce TCO. -* It prevents multiple large requests from being convoyed on a single querier by distributing them first-in/first-out (FIFO) across all queriers. -* It prevents a single tenant from denial-of-service-ing (DoSing) other tenants by fairly scheduling queries between tenants.
+### Querier -#### Splitting +The **querier** service handles queries using the [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/) query language. -The query frontend splits multi-day queries into multiple single-day queries, executing these queries in parallel on downstream queriers and stitching the results back together again. This prevents large, multi-day queries from OOMing a single querier and helps them execute faster. +Queriers fetch series samples both from the ingesters and long-term storage: the ingesters hold the in-memory series which have not yet been flushed to the long-term storage. Because of the replication factor, it is possible that the querier may receive duplicated samples; to resolve this, for a given time series the querier internally **deduplicates** samples with an identical timestamp. -#### Caching +Queriers are **stateless** and can be scaled up and down as needed. -The query frontend caches query results and reuses them on subsequent queries. If the cached results are incomplete, the query frontend calculates the required subqueries and executes them in parallel on downstream queriers. The query frontend can optionally align queries with their step parameter to improve the cacheability of the query results. +### Query frontend -#### Parallelism +The **query frontend** is an **optional service** providing the querier's API endpoints and can be used to accelerate the read path. When the query frontend is in place, incoming query requests should be directed to the query frontend instead of the queriers. The querier service will still be required within the cluster, in order to execute the actual queries. -The query frontend job accepts gRPC streaming requests from the queriers, which then "pull" requests from the frontend. For high availability it's recommended that you run multiple frontends; the queriers will connect to—and pull requests from—all of them.
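The deduplication performed by the querier can be illustrated with a minimal Python sketch (illustrative only, not Cortex code): replicas hold copies of the same series, so merging their results only needs to keep one sample per timestamp.

```python
def deduplicate(replica_results):
    """Merge the samples returned by multiple ingester replicas for
    one series, keeping a single sample per timestamp.

    `replica_results` is a list of lists of (timestamp_ms, value) pairs.
    """
    merged = {}
    for samples in replica_results:
        for timestamp, value in samples:
            # Replicas hold copies of the same data, so the first
            # sample seen for a given timestamp wins.
            merged.setdefault(timestamp, value)
    return sorted(merged.items())

replica_a = [(1000, 1.0), (2000, 2.0)]
replica_b = [(2000, 2.0), (3000, 3.0)]
print(deduplicate([replica_a, replica_b]))
# [(1000, 1.0), (2000, 2.0), (3000, 3.0)]
```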
To reap the benefit of fair scheduling, it is recommended that you run fewer frontends than queriers. Two should suffice in most cases. +The query frontend internally performs some query adjustments and holds queries in an internal queue. In this setup, queriers act as workers which pull jobs from the queue, execute them, and return them to the query-frontend for aggregation. Queriers need to be configured with the query frontend address - via the `-querier.frontend-address` CLI flag - in order to allow them to connect to the query frontends. -### Querier +Query frontends are **stateless**. However, due to how the internal queue works, it's recommended to run a few query frontend replicas to reap the benefit of fair scheduling. Two replicas should suffice in most cases. -The **querier** service handles the actual [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/) evaluation of samples stored in long-term storage. +#### Queueing + +The query frontend queuing mechanism is used to: -It embeds the chunk store client code for fetching data from long-term storage and communicates with [ingesters](#ingester) for more recent data. +* Ensure that large queries - that could cause an out-of-memory (OOM) error in the querier - will be retried on failure. This allows administrators to under-provision memory for queries, or optimistically run more small queries in parallel, which helps to reduce the TCO. +* Prevent multiple large requests from being convoyed on a single querier by distributing them across all queriers using a first-in/first-out queue (FIFO). +* Prevent a single tenant from denial-of-service-ing (DOSing) other tenants by fairly scheduling queries between tenants. -## Chunk store +#### Splitting -The **chunk store** is Cortex's long-term data store, designed to support interactive querying and sustained writing without the need for background maintenance tasks. 
It consists of: +The query frontend splits multi-day queries into multiple single-day queries, executing these queries in parallel on downstream queriers and stitching the results back together again. This prevents large (multi-day) queries from causing out of memory issues in a single querier and helps to execute them faster. -* An index for the chunks. This index can be backed by [DynamoDB from Amazon Web Services](https://aws.amazon.com/dynamodb), [Bigtable from Google Cloud Platform](https://cloud.google.com/bigtable), [Apache Cassandra](https://cassandra.apache.org). -* A key-value (KV) store for the chunk data itself, which can be DynamoDB, Bigtable, Cassandra again, or an object store such as [Amazon S3](https://aws.amazon.com/s3) +#### Caching -> Unlike the other core components of Cortex, the chunk store is not a separate service, job, or process, but rather a library embedded in the three services that need to access Cortex data: the [ingester](#ingester), [querier](#querier), and [ruler](#ruler). +The query frontend supports caching query results and reuses them on subsequent queries. If the cached results are incomplete, the query frontend calculates the required subqueries and executes them in parallel on downstream queriers. The query frontend can optionally align queries with their step parameter to improve the cacheability of the query results. The result cache is compatible with any cortex caching backend (currently memcached, redis, and an in-memory cache). -The chunk store relies on a unified interface to the "[NoSQL](https://en.wikipedia.org/wiki/NoSQL)" stores—DynamoDB, Bigtable, and Cassandra—that can be used to back the chunk store index. This interface assumes that the index is a collection of entries keyed by: +### Ruler -* A **hash key**. This is required for *all* reads and writes. -* A **range key**. This is required for writes and can be omitted for reads, which can be queried by prefix or range. 
+The **ruler** is an **optional service** executing PromQL queries for recording rules and alerts. The ruler requires a database storing the recording rules and alerts for each tenant. -The interface works somewhat differently across the supported databases: +Ruler can be scaled horizontally. -* DynamoDB supports range and hash keys natively. Index entries are thus modelled directly as DynamoDB entries, with the hash key as the distribution key and the range as the range key. -* For Bigtable and Cassandra, index entries are modelled as individual column values. The hash key becomes the row key and the range key becomes the column key. +### Alertmanager -A set of schemas are used to map the matchers and label sets used on reads and writes to the chunk store into appropriate operations on the index. Schemas have been added as Cortex has evolved, mainly in an attempt to better load balance writes and improve query performance. +The **alertmanager** is an **optional service** responsible for accepting alert notifications from the [ruler](#ruler), deduplicating and grouping them, and routing them to the correct notification channel, such as email, PagerDuty or OpsGenie. -> The current schema recommendation is the **v10 schema**. v11 schema is an experimental schema. +The Cortex alertmanager is built on top of the [Prometheus Alertmanager](https://prometheus.io/docs/alerting/alertmanager/), adding multi-tenancy support. Like the [ruler](#ruler), the alertmanager requires a database storing the per-tenant configuration. 
diff --git a/docs/operations/_index.md b/docs/operations/_index.md new file mode 100644 index 00000000000..7243bf928e8 --- /dev/null +++ b/docs/operations/_index.md @@ -0,0 +1,6 @@ +--- +title: "Operating Cortex" +linkTitle: "Operations" +weight: 7 +menu: +--- \ No newline at end of file diff --git a/docs/operations/blocks-storage.md b/docs/operations/blocks-storage.md new file mode 100644 index 00000000000..8b02d770135 --- /dev/null +++ b/docs/operations/blocks-storage.md @@ -0,0 +1,227 @@ +--- +title: "Blocks storage (experimental)" +linkTitle: "Blocks storage (experimental)" +weight: 1 +slug: blocks-storage +--- + +The blocks storage is an **experimental** Cortex storage engine based on [Prometheus TSDB](https://prometheus.io/docs/prometheus/latest/storage/): it stores each tenant's time series into its own TSDB, which writes out the series to an on-disk Block (defaults to 2h block range periods). Each Block is composed of chunk files - containing the timestamp-value pairs for multiple series - and an index, which indexes metric names and labels to time series in the chunk files. + +The supported backends for the blocks storage are: + +* [Amazon S3](https://aws.amazon.com/s3) +* [Google Cloud Storage](https://cloud.google.com/storage/) + +_Internally, this storage engine is based on [Thanos](https://thanos.io), but no Thanos knowledge is required in order to run it._ + +The rest of the document assumes you have read the [Cortex architecture](../architecture.md) documentation. + +## How it works + +When the blocks storage is used, each **ingester** creates a per-tenant TSDB and ships the TSDB Blocks - which by default are cut every 2 hours - to the long-term storage. + +**Queriers** periodically iterate over the storage bucket to discover recently uploaded Blocks and - for each Block - download a subset of the block index - called "index header" - which is kept in memory and used to provide fast lookups.
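Since Blocks are cut on fixed block range periods (2 hours by default), the Block a given sample belongs to can be computed directly from its timestamp. A minimal Python sketch, assuming range-aligned blocks as in Prometheus TSDB (illustrative only, not Cortex code):

```python
BLOCK_RANGE_MS = 2 * 60 * 60 * 1000  # default 2h block range period

def block_range(timestamp_ms):
    """Return the [start, end) boundaries of the Block that a sample
    with the given timestamp falls into."""
    start = (timestamp_ms // BLOCK_RANGE_MS) * BLOCK_RANGE_MS
    return start, start + BLOCK_RANGE_MS

# A sample taken 3 hours after the epoch falls into the second 2h block:
start, end = block_range(3 * 60 * 60 * 1000)
print(start, end)  # 7200000 14400000
```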
+
+### The write path
+
+**Ingesters** receive incoming samples from the distributors. Each push request belongs to a tenant, and the ingester appends the received samples to the specific per-tenant TSDB. The received samples are both kept in-memory and written to a write-ahead log (WAL) stored on the local disk, used to recover the in-memory series in case the ingester abruptly terminates. The per-tenant TSDB is lazily created in each ingester as soon as the first push request is received for that tenant.
+
+The in-memory samples are periodically flushed to disk - and the WAL truncated - when a new TSDB Block is cut, which by default occurs every 2 hours. Each newly cut Block is then uploaded to the long-term storage and kept in the ingester for some more time, in order to give queriers enough time to discover the new Block from the storage and download its index header.
+
+
+In order to effectively use the **WAL** and be able to recover the in-memory series after an abrupt ingester termination, the WAL needs to be stored on a persistent local disk which can survive an ingester failure (e.g. an AWS EBS volume or GCP persistent disk when running in the cloud). For example, if you're running the Cortex cluster in Kubernetes, you may use a StatefulSet with a persistent volume claim for the ingesters.
+
+### The read path
+
+**Queriers** - at startup - iterate over the entire storage bucket to discover all tenants' Blocks and - for each of them - download the index header. During this initial synchronization phase, a querier is not ready to handle incoming queries yet and its `/ready` readiness probe endpoint will fail.
+
+Queriers also periodically re-iterate over the storage bucket to discover newly uploaded Blocks (by the ingesters) and detect Blocks deleted in the meantime, as an effect of an optional retention policy.
+
+The block chunks and the entire index are never fully downloaded by the queriers.
In the read path, a querier looks up the series label names and values using the in-memory index header and then downloads the required segments of the index and chunks for the matching series directly from the long-term storage, using byte-range requests.
+
+The index header is also stored on the local disk, in order to avoid re-downloading it on subsequent restarts of a querier. For this reason, it's recommended - but not required - to run the querier with a persistent local disk. For example, if you're running the Cortex cluster in Kubernetes, you may use a StatefulSet with a persistent volume claim for the queriers.
+
+### Sharding and Replication
+
+The series sharding and replication don't change based on the storage engine, so the general overview provided by the "[Cortex architecture](../architecture.md)" documentation applies to the blocks storage as well.
+
+It's important to note that - unlike the [chunks storage](../architecture.md#chunks-storage-default) - time series are effectively written N times to the long-term storage, where N is the replication factor (typically 3). This may lead to N times higher storage utilization than the chunks storage, but it's mitigated by the [compactor](#compactor) service (see "vertical compaction").
+
+### Compactor
+
+The **compactor** is an optional - but highly recommended - service which compacts multiple Blocks of a given tenant into a single optimized larger Block. The compactor has two main benefits:
+
+1. Vertically compact Blocks uploaded by all ingesters for the same time range
+2. Horizontally compact Blocks with small time ranges into a single larger Block
+
+The **vertical compaction** compacts all the Blocks of a tenant uploaded by any ingester for the same Block range period (defaults to 2 hours) into a single Block, de-duplicating samples that were originally written to N Blocks as an effect of the replication.
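+
+As an illustrative sketch (ingester names, block IDs and times are hypothetical), with a replication factor of 3 the vertical compaction collapses the three Blocks covering the same 2 hours range into a single one:
+
+```
+# Before vertical compaction - same tenant, same 10:00-12:00 range:
+#   block-A  uploaded by ingester-1
+#   block-B  uploaded by ingester-2  (mostly duplicated samples)
+#   block-C  uploaded by ingester-3  (mostly duplicated samples)
+#
+# After vertical compaction:
+#   block-D  covering 10:00-12:00, with duplicated samples removed
+```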
+
+The **horizontal compaction** triggers after the vertical compaction and compacts several Blocks belonging to adjacent small range periods (2 hours) into a single larger Block. Although the total size of the block chunks doesn't change after this compaction, it may significantly reduce the size of the index and of the index header kept in memory by queriers.
+
+The compactor is **stateless**.
+
+## Configuration
+
+The general [configuration documentation](../configuration/_index.md) also applies to a Cortex cluster running the blocks storage, with a few differences:
+
+- [`storage_config`](#storage-config)
+- [`tsdb_config`](#tsdb-config)
+- [`compactor_config`](#compactor-config)
+
+### `storage_config`
+
+The `storage_config` block configures the storage engine.
+
+```yaml
+storage:
+  # The storage engine to use. Use "tsdb" for the blocks storage.
+  # CLI flag: -store.engine
+  engine: tsdb
+```
+
+### `tsdb_config`
+
+The `tsdb_config` block configures the blocks storage engine (based on TSDB).
+
+```yaml
+tsdb:
+  # Backend storage to use. Either "s3" or "gcs".
+  # CLI flag: -experimental.tsdb.backend
+  backend: <string>
+
+  # Local directory to store TSDBs in the ingesters.
+  # CLI flag: -experimental.tsdb.dir
+  [dir: <string> | default = "tsdb"]
+
+  # TSDB blocks range period.
+  # CLI flag: -experimental.tsdb.block-ranges-period
+  [block_ranges_period: <list of duration> | default = [2h]]
+
+  # TSDB blocks retention in the ingester before a block is removed. This
+  # should be larger than the block_ranges_period and large enough to give
+  # queriers enough time to discover newly uploaded blocks.
+  # CLI flag: -experimental.tsdb.retention-period
+  [retention_period: <duration> | default = 6h]
+
+  # The frequency at which the ingester shipper looks for unshipped TSDB blocks
+  # and starts uploading them to the long-term storage.
+  # CLI flag: -experimental.tsdb.ship-interval
+  [ship_interval: <duration> | default = 30s]
+
+  # The bucket store configuration applies to queriers and configures how
+  # queriers interact with the long-term storage backend.
+  bucket_store:
+    # Directory to store synced TSDB index headers.
+    # CLI flag: -experimental.tsdb.bucket-store.sync-dir
+    [sync_dir: <string> | default = "tsdb-sync"]
+
+    # Size - in bytes - of a per-tenant in-memory index cache used to speed up
+    # blocks index lookups.
+    # CLI flag: -experimental.tsdb.bucket-store.index-cache-size-bytes
+    [index_cache_size_bytes: <int> | default = 262144000]
+
+    # Max size - in bytes - of a per-tenant chunk pool, used to reduce memory
+    # allocations.
+    # CLI flag: -experimental.tsdb.bucket-store.max-chunk-pool-bytes
+    [max_chunk_pool_bytes: <int> | default = 2147483648]
+
+    # Max number of samples per query when loading series from the long-term
+    # storage. 0 disables the limit.
+    # CLI flag: -experimental.tsdb.bucket-store.max-sample-count
+    [max_sample_count: <int> | default = 0]
+
+    # Max number of concurrent queries to execute against the long-term storage
+    # on a per-tenant basis.
+    # CLI flag: -experimental.tsdb.bucket-store.max-concurrent
+    [max_concurrent: <int> | default = 20]
+
+    # Number of goroutines, per tenant, to use when syncing blocks from the
+    # long-term storage.
+    # CLI flag: -experimental.tsdb.bucket-store.block-sync-concurrency
+    [block_sync_concurrency: <int> | default = 20]
+
+  # Configures the S3 storage backend.
+  # Required only when the "s3" backend has been selected.
+  s3:
+    # S3 bucket name.
+    # CLI flag: -experimental.tsdb.s3.bucket-name
+    bucket_name: <string>
+
+    # S3 access key ID. If empty, falls back to the AWS SDK default logic.
+    # CLI flag: -experimental.tsdb.s3.access-key-id
+    [access_key_id: <string>]
+
+    # S3 secret access key. If empty, falls back to the AWS SDK default logic.
+    # CLI flag: -experimental.tsdb.s3.secret-access-key
+    [secret_access_key: <string>]
+
+    # S3 endpoint without scheme.
+    # By default, the AWS S3 endpoint is used.
+    # CLI flag: -experimental.tsdb.s3.endpoint
+    [endpoint: <string> | default = ""]
+
+    # If enabled, use http:// for the S3 endpoint instead of https://.
+    # This could be useful in local dev/test environments while using
+    # an S3-compatible backend storage, like Minio.
+    # CLI flag: -experimental.tsdb.s3.insecure
+    [insecure: <bool> | default = false]
+
+  # Configures the GCS storage backend.
+  # Required only when the "gcs" backend has been selected.
+  gcs:
+    # GCS bucket name.
+    # CLI flag: -experimental.tsdb.gcs.bucket-name
+    bucket_name: <string>
+
+    # JSON representing either a Google Developers Console client_credentials.json
+    # file or a Google Developers service account key file. If empty, falls back
+    # to the Google SDK default logic.
+    # CLI flag: -experimental.tsdb.gcs.service-account
+    [service_account: <string>]
+```
+
+### `compactor_config`
+
+The `compactor_config` block configures the optional compactor service.
+
+```yaml
+compactor:
+  # List of compaction time ranges.
+  # CLI flag: -compactor.block-ranges
+  [block_ranges: <list of duration> | default = [2h,12h,24h]]
+
+  # Number of goroutines to use when syncing block metadata from the long-term
+  # storage.
+  # CLI flag: -compactor.block-sync-concurrency
+  [block_sync_concurrency: <int> | default = 20]
+
+  # Minimum age of fresh (non-compacted) blocks before they are processed, in
+  # order to skip blocks that are still being uploaded by ingesters. Malformed
+  # blocks older than the maximum of consistency-delay and 30m will be removed.
+  # CLI flag: -compactor.consistency-delay
+  [consistency_delay: <duration> | default = 30m]
+
+  # Directory in which to cache downloaded blocks to compact and to store the
+  # newly created block during the compaction process.
+  # CLI flag: -compactor.data-dir
+  [data_dir: <string> | default = "./data"]
+
+  # The frequency at which the compactor should look for new blocks eligible for
+  # compaction and trigger their compaction.
+  # CLI flag: -compactor.compaction-interval
+  [compaction_interval: <duration> | default = 1h]
+
+  # How many times to retry a failed compaction during a single compaction
+  # interval.
+  # CLI flag: -compactor.compaction-retries
+  [compaction_retries: <int> | default = 3]
+```
+
+## Known issues
+
+### Horizontal scalability
+
+The compactor currently doesn't support horizontal scalability and only one replica of the compactor should run at any given time within a Cortex cluster.
+
+### Migrating from the chunks to the blocks storage
+
+Currently, no smooth migration path is provided to migrate from the chunks to the blocks storage. For this reason, the blocks storage can only be enabled in new Cortex clusters.
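+
+For a new cluster, putting together the configuration blocks documented above, a minimal sketch of a configuration enabling the blocks storage with the S3 backend could look as follows (the bucket name is a placeholder to replace with your own value):
+
+```yaml
+storage:
+  # Enable the experimental blocks storage engine.
+  engine: tsdb
+
+tsdb:
+  backend: s3
+  dir: /data/tsdb
+  bucket_store:
+    sync_dir: /data/tsdb-sync
+  s3:
+    # Placeholder bucket name.
+    bucket_name: example-cortex-blocks
+```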