There are two primary ways of agreeing on metadata within a Riak cluster:
- The ring:
  - For non-typed buckets, their properties are stored on the ring.
  - Otherwise the ring is reserved for information about the allocation of partitions within the cluster (and the current known status of cluster members).
  - The ring is gossiped (using `riak_core_gossip`) and persisted on changes, and conflicting changes are merged.
  - Although changes to cluster membership are made through the claimant, changes to bucket properties bypass the claimant and are made directly.
  - There may be potential issues with changing bucket properties while cluster membership changes are being planned.
  - When the ring is updated, the bucket properties are flushed and re-written to an ETS table, from where they are read.
  - After 90s without a ring change, read access to the latest ring is optimised using `riak_core_mochiglobal` (essentially compiling the ring as a module; a sketch of this technique follows this list) - but requests for bucket properties only ever use the ETS table.
  - Each node sends its ring to another random node every 60s, even if there are no changes, so every node should eventually find out about a new version of the ring, even if it was down when a ring change was gossiped.
- Cluster metadata:
  - For typed buckets and bucket types, their properties are stored in cluster metadata.
  - Configuration for `riak_core_security` is also stored in cluster metadata. There are no other uses of cluster metadata.
  - Cluster metadata is entirely separate from the ring. It is gossiped using `riak_core_broadcast`, and stored within a DETS file, with a read-only version promoted to an ETS table.
  - Reading bucket properties for typed and non-typed buckets follows two different code paths (one is a read from the ring, the other a read from cluster metadata) - but both code paths ultimately lead to an `ets:lookup/2`, although in the case of a typed bucket two lookups are required (see the second sketch below).
  - If cluster metadata gets out of sync between nodes, the differences are detected by active anti-entropy (using the `riak_core_metadata_hashtree` module, which in turn uses `hashtree` and `eleveldb`).
  - For managing eventual consistency in cluster metadata the `dvvset` module is used. This is the only use of `dvvset` within riak_core.
  - Although typed bucket property changes do not use the ring, they are channeled via the claimant node.
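The `riak_core_mochiglobal` trick mentioned above compiles a term into a throwaway module, so a read becomes a call to a constant function and nothing is copied onto the caller's heap. A minimal sketch of that technique, with illustrative module/function names rather than riak_core's actual code:

```erlang
-module(global_term_sketch).
-export([put_term/2, get_term/1]).

%% Compile Term into Module as a constant zero-arity function, using
%% syntax_tools (erl_syntax) to build the abstract forms.
put_term(Module, Term) when is_atom(Module) ->
    Trees =
        [erl_syntax:attribute(erl_syntax:atom(module),
                              [erl_syntax:atom(Module)]),
         erl_syntax:attribute(erl_syntax:atom(export),
                              [erl_syntax:list(
                                 [erl_syntax:arity_qualifier(
                                    erl_syntax:atom(value),
                                    erl_syntax:integer(0))])]),
         erl_syntax:function(erl_syntax:atom(value),
                             [erl_syntax:clause([], none,
                                                [erl_syntax:abstract(Term)])])],
    Forms = [erl_syntax:revert(T) || T <- Trees],
    {ok, Module, Bin} = compile:forms(Forms, [report_errors]),
    {module, Module} = code:load_binary(Module, atom_to_list(Module), Bin),
    ok.

%% A read is a local call returning constant-pool data: nothing is
%% copied onto the caller's heap, unlike an ets:lookup/2.
get_term(Module) ->
    Module:value().
```

e.g. `put_term(cached_ring, Ring)` once the ring has been stable, then `get_term(cached_ring)` on the read path.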
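Similarly, a minimal sketch of how typed and non-typed bucket-property reads might both bottom out in `ets:lookup/2`, with a typed bucket needing two lookups. The table name, key shapes and merge rule are assumptions for illustration, not riak_core's actual schema:

```erlang
-module(bucket_props_sketch).
-export([get_props/1]).

-define(TAB, bucket_props_cache).  %% illustrative table name

%% Typed bucket: one lookup for the type's properties, a second for
%% any bucket-specific overrides.
get_props({Type, Bucket}) ->
    case ets:lookup(?TAB, {type, Type}) of
        [{_, TypeProps}] ->
            case ets:lookup(?TAB, {bucket, {Type, Bucket}}) of
                [{_, Overrides}] -> {ok, merge(Overrides, TypeProps)};
                [] -> {ok, TypeProps}
            end;
        [] ->
            {error, no_type}
    end;
%% Non-typed bucket: a single lookup against the ring-fed cache.
get_props(Bucket) ->
    case ets:lookup(?TAB, {bucket, Bucket}) of
        [{_, Props}] -> {ok, Props};
        [] -> {error, not_found}
    end.

%% Bucket-level properties win over type-level defaults.
merge(Overrides, Defaults) ->
    lists:ukeymerge(1, lists:ukeysort(1, Overrides), lists:ukeysort(1, Defaults)).
```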
Some notes:
- Both the use of an ETS table as a read-only cache and the use of `riak_core_mochiglobal` seem anachronistic in the context of Erlang's `persistent_term`.
- The most obvious change is to replace the use of `riak_core_mochiglobal` with `persistent_term` for storing the ring (and the ring epoch) - see the first sketch after these notes. There is already the delayed protection of `promote_ring` to prevent over-frequent updates.
- If `riak_core_security` is not used, and buckets are either generally un-typed or limited in number, the `riak_core_metadata` approach seems excessive.
- `riak_core_metadata` was added for a reason though - is `riak_core_metadata`/`riak_core_broadcast` a more robust and scalable approach than `riak_core_ring`/`riak_core_gossip`?
- The need for AAE in `riak_core_metadata` creates a dependency on `eleveldb`. `eleveldb` itself can be removed, but that doesn't eliminate the complexity of the process. Also, this PR changes the underlying store for `hashtree` - and so has an impact on those using AAE for KV anti-entropy.
- Riak core metadata is subject to hard limits, with the 2GB maximum size of DETS table files.
- The need for AAE is to ensure eventual consistency (if a node was down and missed broadcasts, it must discover the discrepancy via AAE), but given a 2GB maximum size, perhaps simpler methods than `hashtree` could be used for comparisons, e.g. iterating over keys and comparing directly (see the second sketch after these notes).
- There is the additional complexity of fixups (currently used only by riak_repl); fixups allow an application to register a need to filter all bucket property changes before they are applied (in the case of riak_repl, this is used to ensure that the riak_repl post-commit hook is applied on every bucket).
- The net effect of the different methods for holding cluster metadata is confusing.
- There are no known issues with the current setup; it may be confusing, but it works.
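On the `persistent_term` suggestion: a minimal sketch of storing the ring and an epoch under a single key. The key and the `{Epoch, Ring}` shape are assumptions for illustration, not an existing riak_core API:

```erlang
-module(ring_pt_sketch).
-export([promote_ring/1, get_ring/0]).

-define(KEY, {?MODULE, ring}).  %% illustrative persistent_term key

%% Updating an existing persistent_term key forces a scan of all
%% processes (to copy the old value into any heap still using it), so -
%% as with promote_ring today - promotion should be rate-limited rather
%% than performed on every gossiped change.
promote_ring(Ring) ->
    Epoch = erlang:unique_integer([monotonic]),
    persistent_term:put(?KEY, {Epoch, Ring}).

%% Reads are constant-time and copy-free - the same property the
%% mochiglobal module trick provides, without compiling code at runtime.
get_ring() ->
    case persistent_term:get(?KEY, undefined) of
        undefined -> {error, no_ring};
        {Epoch, Ring} -> {ok, Epoch, Ring}
    end.
```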
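On the simpler-than-`hashtree` comparison idea: with the whole store capped at 2GB, two nodes could exchange sorted per-key hashes and diff them directly. A minimal sketch, assuming the metadata can be folded into a map (the store shape and function names are illustrative):

```erlang
-module(md_compare_sketch).
-export([local_digest/1, diff/2]).

%% Build a sorted list of {Key, Hash} pairs for every local entry.
local_digest(Store) when is_map(Store) ->
    lists:sort(
      maps:fold(fun(K, V, Acc) -> [{K, erlang:phash2(V)} | Acc] end,
                [], Store)).

%% Compare two sorted digests, returning every key that differs or is
%% missing on one side; those keys are then repaired by direct exchange.
diff(DigestA, DigestB) ->
    diff(DigestA, DigestB, []).

diff([], [], Acc) -> lists:reverse(Acc);
diff([{K, _} | RestA], [], Acc) -> diff(RestA, [], [K | Acc]);
diff([], [{K, _} | RestB], Acc) -> diff([], RestB, [K | Acc]);
diff([{K, H} | RestA], [{K, H} | RestB], Acc) ->
    diff(RestA, RestB, Acc);                    %% same key, same value
diff([{K, _} | RestA], [{K, _} | RestB], Acc) ->
    diff(RestA, RestB, [K | Acc]);              %% same key, values differ
diff([{KA, _} | _] = A, [{KB, _} | RestB], Acc) when KA > KB ->
    diff(A, RestB, [KB | Acc]);                 %% KB missing from A
diff([{KA, _} | RestA], B, Acc) ->
    diff(RestA, B, [KA | Acc]).                 %% KA missing from B
```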
Should there be a long-term plan to change this? To unify or simplify the methods of storing and gossiping information required across the cluster? Is there sufficient potential reward to justify the risk of any change (particularly with respect to the overheads of testing backwards compatibility)?