There are two primary ways of agreeing on metadata within a Riak cluster:
- The ring:
  - For non-typed buckets, their properties are stored on the ring.
  - Otherwise the ring is reserved for information about the allocation of partitions within the cluster (and the current known status of cluster members).
  - The ring is gossiped (using `riak_core_gossip`) and persisted on changes, and conflicting changes are merged.
  - Although changes to cluster membership are made through the claimant, changes to bucket properties bypass the claimant and are made directly.
  - There may be potential issues with changing bucket properties while cluster membership changes are being planned.
  - When the ring is updated, the bucket properties are flushed and re-written to an ETS table, from where they are read.
  - After 90s without a ring change, read access to the latest ring is optimised using `riak_core_mochiglobal` (essentially compiling the ring as a module; a sketch of this technique follows this list) - but requests for bucket properties only ever use the ETS table.
  - Each node sends its ring to another random node every 60s, even if there are no changes, so every node should eventually find out about a new version of the ring, even if it was down when a ring change was gossiped.
- Cluster metadata:
  - For typed buckets and bucket types, their properties are stored in cluster metadata.
  - Configuration for `riak_core_security` is also stored in cluster metadata. There are no other uses of cluster metadata.
  - Cluster metadata is entirely separate from the ring. It is gossiped using `riak_core_broadcast`, and stored within a DETS file, with a read-only version promoted to an ETS table.
  - Reading bucket properties for typed and non-typed buckets follows two different code paths (one is a read from the ring, the other a read from cluster metadata) - but both code paths ultimately lead to an `ets:lookup/2`, although in the case of a typed bucket two lookups are required (see the second sketch below).
  - If cluster metadata gets out of sync between nodes, the differences are detected by active anti-entropy (using the `riak_core_metadata_hashtree` module, which in turn uses `hashtree` and `eleveldb`).
  - For managing eventual consistency in cluster metadata the `dvvset` module is used. This is the only use of `dvvset` within riak_core.
  - Although typed bucket property changes do not use the ring, they are channeled via the claimant node.
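The `riak_core_mochiglobal` trick mentioned above compiles a term into a throwaway module, so a read becomes a call to a constant function and nothing is copied onto the caller's heap. A minimal sketch of that technique, with illustrative module/function names rather than riak_core's actual code:

```erlang
-module(global_term_sketch).
-export([put_term/2, get_term/1]).

%% Compile Term into Module as a constant zero-arity function, using
%% syntax_tools (erl_syntax) to build the abstract forms.
put_term(Module, Term) when is_atom(Module) ->
    Trees =
        [erl_syntax:attribute(erl_syntax:atom(module),
                              [erl_syntax:atom(Module)]),
         erl_syntax:attribute(erl_syntax:atom(export),
                              [erl_syntax:list(
                                 [erl_syntax:arity_qualifier(
                                    erl_syntax:atom(value),
                                    erl_syntax:integer(0))])]),
         erl_syntax:function(erl_syntax:atom(value),
                             [erl_syntax:clause([], none,
                                                [erl_syntax:abstract(Term)])])],
    Forms = [erl_syntax:revert(T) || T <- Trees],
    {ok, Module, Bin} = compile:forms(Forms, [report_errors]),
    {module, Module} = code:load_binary(Module, atom_to_list(Module), Bin),
    ok.

%% A read is a local call returning constant-pool data: nothing is
%% copied onto the caller's heap, unlike an ets:lookup/2.
get_term(Module) ->
    Module:value().
```

e.g. `put_term(cached_ring, Ring)` once the ring has been stable, then `get_term(cached_ring)` on the read path.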
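Similarly, a minimal sketch of how typed and non-typed bucket-property reads might both bottom out in `ets:lookup/2`, with a typed bucket needing two lookups. The table name, key shapes and merge rule are assumptions for illustration, not riak_core's actual schema:

```erlang
-module(bucket_props_sketch).
-export([get_props/1]).

-define(TAB, bucket_props_cache).  %% illustrative table name

%% Typed bucket: one lookup for the type's properties, a second for
%% any bucket-specific overrides.
get_props({Type, Bucket}) ->
    case ets:lookup(?TAB, {type, Type}) of
        [{_, TypeProps}] ->
            case ets:lookup(?TAB, {bucket, {Type, Bucket}}) of
                [{_, Overrides}] -> {ok, merge(Overrides, TypeProps)};
                [] -> {ok, TypeProps}
            end;
        [] ->
            {error, no_type}
    end;
%% Non-typed bucket: a single lookup against the ring-fed cache.
get_props(Bucket) ->
    case ets:lookup(?TAB, {bucket, Bucket}) of
        [{_, Props}] -> {ok, Props};
        [] -> {error, not_found}
    end.

%% Bucket-level properties win over type-level defaults.
merge(Overrides, Defaults) ->
    lists:ukeymerge(1, lists:ukeysort(1, Overrides), lists:ukeysort(1, Defaults)).
```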
Some notes:
- Both the use of an ETS table as a read-only cache and the use of `riak_core_mochiglobal` seem anachronistic in the context of Erlang's `persistent_term`.
- The most obvious change is to replace the use of `riak_core_mochiglobal` with `persistent_term` for storing the ring (and the ring epoch) - see the first sketch after these notes. There is already the delayed protection of `promote_ring` to prevent over-frequent updates.
- If `riak_core_security` is not used, and buckets are either generally un-typed or limited in number, the `riak_core_metadata` approach seems excessive.
- `riak_core_metadata` was added for a reason though - is `riak_core_metadata`/`riak_core_broadcast` a more robust and scalable approach than `riak_core_ring`/`riak_core_gossip`?
- The need for AAE in `riak_core_metadata` creates a dependency on `eleveldb`. `eleveldb` itself can be removed, but that doesn't eliminate the complexity of the process. Also, this PR changes the underlying store for `hashtree` - and so has an impact on those using AAE for KV anti-entropy.
- Riak core metadata is subject to hard limits, with the 2GB maximum size of DETS table files.
- The need for AAE is to ensure eventual consistency (if a node was down and missed broadcasts, it must discover the discrepancy via AAE), but given a 2GB maximum size, perhaps simpler methods than `hashtree` could be used for comparisons, e.g. iterating over keys and comparing directly (see the second sketch after these notes).
- There is the additional complexity of fixups (currently used only by riak_repl); fixups allow an application to register a need to filter all bucket property changes before they are applied (in the case of riak_repl, this is used to ensure that the riak_repl post-commit hook is applied on every bucket).
- The net effect of the different methods for holding cluster metadata is confusing.
- There are no known issues with the current setup; it may be confusing, but it works.
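On the `persistent_term` suggestion: a minimal sketch of storing the ring and an epoch under a single key. The key and the `{Epoch, Ring}` shape are assumptions for illustration, not an existing riak_core API:

```erlang
-module(ring_pt_sketch).
-export([promote_ring/1, get_ring/0]).

-define(KEY, {?MODULE, ring}).  %% illustrative persistent_term key

%% Updating an existing persistent_term key forces a scan of all
%% processes (to copy the old value into any heap still using it), so -
%% as with promote_ring today - promotion should be rate-limited rather
%% than performed on every gossiped change.
promote_ring(Ring) ->
    Epoch = erlang:unique_integer([monotonic]),
    persistent_term:put(?KEY, {Epoch, Ring}).

%% Reads are constant-time and copy-free - the same property the
%% mochiglobal module trick provides, without compiling code at runtime.
get_ring() ->
    case persistent_term:get(?KEY, undefined) of
        undefined -> {error, no_ring};
        {Epoch, Ring} -> {ok, Epoch, Ring}
    end.
```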
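On the simpler-than-`hashtree` comparison idea: with the whole store capped at 2GB, two nodes could exchange sorted per-key hashes and diff them directly. A minimal sketch, assuming the metadata can be folded into a map (the store shape and function names are illustrative):

```erlang
-module(md_compare_sketch).
-export([local_digest/1, diff/2]).

%% Build a sorted list of {Key, Hash} pairs for every local entry.
local_digest(Store) when is_map(Store) ->
    lists:sort(
      maps:fold(fun(K, V, Acc) -> [{K, erlang:phash2(V)} | Acc] end,
                [], Store)).

%% Compare two sorted digests, returning every key that differs or is
%% missing on one side; those keys are then repaired by direct exchange.
diff(DigestA, DigestB) ->
    diff(DigestA, DigestB, []).

diff([], [], Acc) -> lists:reverse(Acc);
diff([{K, _} | RestA], [], Acc) -> diff(RestA, [], [K | Acc]);
diff([], [{K, _} | RestB], Acc) -> diff([], RestB, [K | Acc]);
diff([{K, H} | RestA], [{K, H} | RestB], Acc) ->
    diff(RestA, RestB, Acc);                    %% same key, same value
diff([{K, _} | RestA], [{K, _} | RestB], Acc) ->
    diff(RestA, RestB, [K | Acc]);              %% same key, values differ
diff([{KA, _} | _] = A, [{KB, _} | RestB], Acc) when KA > KB ->
    diff(A, RestB, [KB | Acc]);                 %% KB missing from A
diff([{KA, _} | RestA], B, Acc) ->
    diff(RestA, B, [KA | Acc]).                 %% KA missing from B
```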
Should there be a long-term plan to change this? To unify or simplify the methods of storing and gossiping information required across the cluster? Is there sufficient potential reward to justify the risk of any change (particularly with respect to the overheads of testing backwards compatibility)?