New integrated integrity checking for previously committed data#37
Up to this point, riak_ensemble has not included any unit tests, relying instead upon high-level riak_test system tests that indirectly test riak_ensemble via Riak. This commit changes this. This commit adds support for test-only dependencies and adds riak_test as one of those dependencies to get access to the intercept logic included with riak_test. This commit also adds the `test/run.sh` runner script, which is designed to run individual eunit tests separately for isolation -- preventing test environment woes. This is necessary because most interesting riak_ensemble tests are inherently stateful, and correctly cleaning up after a stateful test has historically proven challenging (one reason why riak_ensemble has preferred riak_tests up to this point). This commit also adds the `drop_write_test`, which uses an intercept to lie about storing a peer write to stable storage. Against the current design of riak_ensemble, this can result in a notfound being returned to the user for a value we previously acknowledged as written. This is a serious bug that will be addressed in a later commit.
It is expected that orddict_delta returns the delta list in sorted order. However, the current implementation incorrectly mixes a list generated in reverse with a list generated in normal order, leading to inconsistent ordering. This commit fixes that issue so that orddict_delta always returns results in proper order.
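The intended behavior can be illustrated with a small sketch (in Python purely for illustration -- the real `orddict_delta` is Erlang and its tuple format differs, and `MISSING` is an invented stand-in for a unique sentinel): a single in-order merge walk over two sorted assoc lists appends deltas as it goes, so the output is sorted by construction and no reversed list ever needs mixing in.

```python
MISSING = None  # a real implementation would use a unique sentinel value

def orddict_delta(d1, d2):
    """Walk two key-sorted assoc lists in lockstep, emitting
    (key, (left_value, right_value)) for every key whose values differ.
    Results are appended in key order, so the output is already sorted."""
    out = []
    i = j = 0
    while i < len(d1) and j < len(d2):
        (k1, v1), (k2, v2) = d1[i], d2[j]
        if k1 < k2:
            out.append((k1, (v1, MISSING))); i += 1
        elif k1 > k2:
            out.append((k2, (MISSING, v2))); j += 1
        else:
            if v1 != v2:
                out.append((k1, (v1, v2)))
            i += 1; j += 1
    out.extend((k, (v, MISSING)) for k, v in d1[i:])
    out.extend((k, (MISSING, v)) for k, v in d2[j:])
    return out
```

Because both inputs are consumed front-to-back and results are only ever appended, there is no reversal step to get wrong.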
The new synctree module implements a self-validating hashtree that will be used as riak_ensemble's new data integrity mechanism (in a future commit). The tree's design is similar to the hash trees used in file systems such as ZFS, with parent nodes containing the hashes of their children. Thus, during traversal a tree path can be completely verified from the root node to the endpoint. The tree is designed to be replicated on multiple nodes and provides built-in exchange logic that efficiently determines the differences between two trees -- thus trees can exchange with peers and heal missing/corrupted data. The design is conceptually similar to hashtree.erl from riak_core. For further details, see the module-level edoc in synctree. The synctree design allows for pluggable backends. This commit provides the synctree_orddict and synctree_ets backends.
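The self-validating idea can be sketched with a toy model (Python purely for illustration -- `ToyTree`, `h`, and `verify` are invented names, the real synctree is Erlang with pluggable backends, and this toy assumes a power-of-two leaf count): each parent stores its children's hashes, so any root-to-leaf path can be re-derived and checked on access.

```python
import hashlib

def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

class ToyTree:
    """Toy self-validating hash tree: level 0 holds leaf hashes, and each
    higher level hashes pairs of children, up to a single root."""
    def __init__(self, values):
        assert values and len(values) & (len(values) - 1) == 0  # power of two
        self.levels = [[h(v) for v in values]]
        while len(self.levels[-1]) > 1:
            prev = self.levels[-1]
            self.levels.append([h(prev[i], prev[i + 1])
                                for i in range(0, len(prev), 2)])
        self.root = self.levels[-1][0]

    def verify(self, index, value):
        """Walk from leaf `index` up to the root, checking the recomputed
        hash against the stored node at every level. Corruption anywhere
        along the path breaks the chain and is detected."""
        node = h(value)
        for level in self.levels[:-1]:
            if node != level[index]:
                return False
            sibling = level[index ^ 1]
            node = h(node, sibling) if index % 2 == 0 else h(sibling, node)
            index //= 2
        return node == self.root
```

Tampering with either a leaf value or a stored interior hash causes `verify` to fail, which mirrors how a corrupted synctree path would be rejected rather than trusted.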
The synctree_leveldb module provides a synctree backend built on LevelDB. This is the first persistent synctree backend and will be the backend used for riak_ensemble peers (in a future commit).
This commit completely changes how riak_ensemble deals with backend data loss. Traditionally, once data was committed to a given backend, riak_ensemble assumed the data would never go missing. Of course, this is an invalid assumption, so the notion of backend trust was previously added to riak_ensemble along with the notion of backend syncing. Untrusted peers were required to sync with a majority of peers to become trusted. What exactly syncing meant was a backend-specific detail. While this handled certain data loss scenarios, it was far from perfect. First, it assumed backend syncing was valid. For the Riak K/V backend, this was not true: the K/V backend relied upon AAE exchange, which was never designed to be 100% authoritative. Second, as an optimization, we only checked a peer's trust at peer startup time (or after a backend-specific process crashed). This does not help if a backend loses data during normal operation.

This commit completely removes the notion of backend trust and syncing from riak_ensemble. Instead, riak_ensemble now implements an embedded integrity checking protocol based on self-validating hashtrees (using the new synctree module). Writing a key now inserts appropriate metadata into this hash tree, which is always validated against the key when read. The synctree is considered the authoritative source. If the backend disagrees, the data is considered corrupted and is repaired from data on other replicas. Like before, peers that restart are considered untrusted and must do a synctree exchange with a trusted majority (which can require more than a simple quorum during byzantine situations). This establishes the initial trust of the synctree at peer startup. However, since the synctree is self-validating, every operation that touches the tree validates the traversed path. Thus, any corruption of either the on-disk or in-memory components of the synctree will be detected, causing the tree to once again become untrusted and require a new round of syncing with a trusted majority. As designed, a given peer is always able to validate the entirety of its backend data against the synctree metadata, as well as validate the synctree itself and repair it via exchange with sibling replicas as needed.
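The read-side check described above can be sketched as follows (hypothetical Python, not the riak_ensemble code -- `obj_hash`, `verified_get`, and the dict-based `tree`, `backend`, and `peers` are all illustrative stand-ins): the tree's stored hash is authoritative, and a disagreeing backend (wrong value or unexpected notfound) triggers repair from sibling replicas.

```python
import hashlib

def obj_hash(value: bytes) -> bytes:
    return hashlib.sha256(value).digest()

def verified_get(key, backend, tree, peers):
    """Sketch of a read under the new scheme. `tree` maps key -> expected
    hash (the authoritative synctree metadata); `backend` may be lying or
    missing data; `peers` are sibling replicas to repair from."""
    expected = tree.get(key)          # authoritative metadata
    if expected is None:
        return None                   # never written per the tree
    value = backend.get(key)          # possibly corrupt / missing
    if value is not None and obj_hash(value) == expected:
        return value                  # backend agrees with metadata
    # Corruption or unexpected notfound: repair from sibling replicas.
    for peer in peers:
        candidate = peer.get(key)
        if candidate is not None and obj_hash(candidate) == expected:
            backend[key] = candidate  # read repair of the local backend
            return candidate
    raise RuntimeError("no replica holds a valid copy of %r" % key)
```

Note how a backend notfound is no longer taken at face value: the tree says the key exists, so the notfound is itself treated as corruption.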
src/synctree.erl
typo: use need instead of needed
Since we can now be in the leading state while the sync tree is not yet ready, I imagine we will need to indicate that clearly in the aae-status command in riak_kv so people know why their requests are failing. I suppose it will generally be a brief period of time, but I thought adding a reminder here to do that wouldn't hurt.
By default, each peer synctree maps to a unique on-disk LevelDB database. This commit enables ensemble backends to override this behavior, supporting M:N mappings where multiple trees are stored in the same on-disk instance. This change is predominantly aimed at Riak, which will use this feature to map all peers associated with a single vnode to a common per-vnode synctree LevelDB instance. Riak AAE already does something similar in its hashtree code. Using fewer instances results in better usage of file handles, write buffers, and the block/file cache.
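The M:N mapping can be pictured with a small sketch (illustrative Python only -- `SharedTreeStore` is an invented name, and a real LevelDB instance would namespace keys with a binary tree-id prefix rather than a tuple): many logical trees share one physical store by prefixing every key with the owning tree's id.

```python
class SharedTreeStore:
    """Toy M:N tree-to-instance mapping: many logical synctrees share one
    flat store (standing in for a single LevelDB instance) by namespacing
    every key with the tree's id."""
    def __init__(self):
        self.db = {}  # one shared "LevelDB instance"

    def put(self, tree_id, key, value):
        self.db[(tree_id, key)] = value

    def get(self, tree_id, key):
        return self.db.get((tree_id, key))
```

Since all trees for a vnode live in one instance, the process pays for one set of file handles, write buffers, and cache space instead of one per peer.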
This commit cleans up the object hashing code path used by peers to compute and verify hashes used for integrity checking. This commit also sets the stage to extend the hashing mechanism in the future to directly include object hashes as part of the synctree metadata.
The current behavior of asynchronously updating remote synctrees after a put works well in practice. However, for maximum byzantine tolerance synctree updates should be synchronous -- occurring on the put path itself and requiring a valid quorum response before considering a put successful. This commit adds synchronous synctree updates and makes the choice user configurable. The default remains asynchronous.
This commit adds a single byte to the data encoding for bucket hashes to enable easy future extension of the data format to support different hash functions.
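A version-byte scheme of this general shape might look like the following (hypothetical Python sketch -- `HASH_SHA1`, `encode_hash`, and `decode_hash` are invented names, and the actual riak_ensemble wire format and hash function are defined by the code, not here): a single leading byte identifies the hash function, so future releases can add functions while still decoding old entries.

```python
import hashlib

# Hypothetical tag value; the real on-disk format is defined by riak_ensemble.
HASH_SHA1 = 0

def encode_hash(data: bytes, version: int = HASH_SHA1) -> bytes:
    """Prefix a single version byte so the hash function can be changed
    in a future release without breaking old entries."""
    if version == HASH_SHA1:
        return bytes([version]) + hashlib.sha1(data).digest()
    raise ValueError("unknown hash version %d" % version)

def decode_hash(blob: bytes):
    """Dispatch on the leading version byte to interpret the digest."""
    version, digest = blob[0], blob[1:]
    if version == HASH_SHA1:
        return version, digest
    raise ValueError("unknown hash version %d" % version)
```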
- Improve various comments / fix typos
- Add missing types and specs for synctree.erl
- Consistently use orddict_find in synctree.erl
- Add missing erlang:trunc calls in synctree.erl
This commit makes it possible for users to override the default behavior of never trusting synctrees after a peer restart. Doing so provides weaker guarantees against byzantine faults, but it allows ensembles to operate with simple majorities in all circumstances rather than needing byzantine-tolerant trusted majorities. This provides a knob similar to the previous low vs. medium trust setting, which is no longer available under the new integrity checking approach. This commit also updates the peer ping_quorum command to include the leader's tree_ready state. This is used in Riak as part of an updated ensemble_status command.
I've finally pushed out the sibling… I've also pushed commits to this branch for all remaining work. I believe all issues mentioned in review comments on this pull-request, as well as those discussed in backchannels, are covered. Please look at the individual commits for more details. I decided to punt on adding object hashes to the synctree metadata for 2.0. Instead, I just did the minimal work necessary to make the data format easy to extend in the future. We can do the more detailed work in 2.0.1 or 2.1. Also, we talked earlier this week about optimizing…
The recent change to synctree_leveldb to allow sharing LevelDB instances introduced a necessary ETS table that was not being created in the synctree_pure test. This commit fixes the test. While investigating the above failure, it became apparent that the corrupt_exchange_test was somewhat flaky and would fail on occasion. This commit improves the test by replacing a timeout with an explicit test condition.
Include {error, <<"failed">>} as an allowed failure so that the test passes
with changes for basho/riak_ensemble#37 and basho/riak_kv#1002
👍 b1f9030 ensemble_interleave is failing, but a fix is in basho/riak_test#657
Allow {error, <<"failed">>} as an error response in ensemble_sync. Fixes
the test with basho/riak_ensemble#37 and basho/riak_kv#1002
Merge integration branch for #37

Reviewed-by: andrewjstone
This pull-request completely changes how `riak_ensemble` guarantees the integrity of committed backend data, even in the presence of byzantine faults.

Prior to this pull-request, the approach used was to assume byzantine faults could only occur when a peer restarts, and to use backend-specific syncing/verification logic to ensure data integrity and vouch for a given peer. This approach is discussed at length in basho/riak_kv#909 (along with the related paranoia setting pull-request and linked gist).

This approach has two main problems. First, it assumes data loss can only occur between peer restarts -- which is fundamentally unsound. Second, it pushes all data integrity guarantees to the backend implementation and assumes the backend correctly guarantees data integrity. This is a risky assumption.
For example, for Riak the `riak_kv_ensemble_backend` relies upon traditional AAE syncing to ensure data integrity. However, this is fundamentally unsound since the AAE trees are best effort and are not designed to be canonical sources of truth. Additionally, the AAE trees impose a significant operational burden on the user. Specifically, Riak K/V ensembles will be completely unavailable (eg. fail user requests) if AAE is disabled, or if the AAE trees are in the process of being built (which can take hours to days depending on data set size) when a majority of peers restarts. Thus, this approach burdens the user and leads to potentially long periods of unavailability, yet doesn't provide strong integrity guarantees.

In the end, the hash trees used for AAE are the wrong tool for guaranteeing data integrity. For traditional AP/Riak, the backend itself is the canonical source of truth. The AAE trees are simply best-effort projections of this truth, and are rebuilt weekly by scanning over the backend data directly.
For `riak_ensemble`, we need a metadata source with stronger guarantees -- where the metadata is always trusted, and where the backend is considered wrong (corrupted, lying, etc) if it ever differs from this trusted metadata.

This pull-request integrates backend integrity checking directly into `riak_ensemble` itself, removing the notion of backend trust / syncing. Instead, individual `riak_ensemble` peers now maintain an integrated self-validating hash tree (known as `synctree` in the code). The hash for a given node in the tree is stored in the node's parent, allowing all traversals through the tree to verify the integrity of the entire path -- from the root node down to the destination node. This is similar in spirit to the design used in the ZFS filesystem. (See this research paper for a high-level discussion of the ZFS design, as well as how it holds up to various evaluated data loss/corruption scenarios.)

The `synctree` is maintained on each peer in an ensemble and is always kept up-to-date. As new data is written, the `synctree` hash is updated on the leader after seeing a quorum accept the write. This hash tree update is then replicated to followers. (Note: this replication is currently asynchronous. We should make synchronous replication a configurable option for the highest level of paranoia/data integrity.)

For all K/V operations, data returned from the backend is verified against the metadata stored in the hash tree. If the backend has incorrect data, the peer will detect the issue and attempt to repair the value from sibling peers.
As with the `hashtree` module from `riak_core`, which is used for AAE, the `synctree` module is designed to perform a tree exchange with remote replicas to detect and repair any missing data -- in this case, object integrity metadata.

At startup, all peers are considered untrusted and are required to perform a tree exchange with a majority of trusted peers (or all peers if a trusted majority is unavailable). This requires more than a simple majority of online/reachable nodes. After performing this exchange, a given peer's tree is assumed to remain trusted. We make this assumption because, in addition to the on-disk tree itself, we also maintain a copy of the top hash of the tree in memory. The assumption is that any combination of on-disk or in-memory corruption that occurs will not result in corrupted hashes that pass the tree verification check.
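The exchange idea -- descend only into subtrees whose hashes disagree -- can be sketched as follows (toy Python over power-of-two leaf counts; `build` and `exchange` are invented names, and the real synctree exchange runs between Erlang peers over its pluggable backends): comparing hashes top-down narrows a whole tree's differences to a handful of leaf positions without scanning every leaf.

```python
import hashlib

def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

def build(leaves):
    """Build hash levels bottom-up over a power-of-two list of values:
    level 0 is leaf hashes, each higher level hashes pairs of children."""
    levels = [[h(v) for v in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i], prev[i + 1])
                       for i in range(0, len(prev), 2)])
    return levels

def exchange(a, b):
    """Top-down exchange between two trees of identical shape: starting
    at the roots, descend only into subtrees whose hashes differ, and
    return the leaf indexes that actually disagree."""
    depth = len(a) - 1
    suspects = [0]  # node indexes still under suspicion, top level first
    for level in range(depth, 0, -1):
        nxt = []
        for i in suspects:
            if a[level][i] != b[level][i]:
                nxt.extend([2 * i, 2 * i + 1])  # recurse into both children
        suspects = nxt
    return [i for i in suspects if a[0][i] != b[0][i]]
```

If the roots match, the exchange stops immediately; otherwise only the differing branches are walked, which is what makes replica healing cheap when the trees mostly agree.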
All operations on the tree (reads, writes, and exchange-related reads) verify all encountered tree paths. Thus, corruption in the tree itself will be detected. If detected, the tree discards the corrupted data, becomes untrusted, and must once again perform an exchange with a majority of trusted peers. This design therefore ensures the integrity of the object metadata at all times, enabling a peer to always correctly detect an invalid object (or unexpected notfound) from the backend.
The specific data loss / byzantine faults this pull-request is designed to handle include both undetected data loss and data unexpectedly traveling back in time.
An example of undetected data loss would be a backend that lies about writing an object + CRC to disk. Thus, in the future the backend could return a notfound without tripping any integrity check (since the CRC itself is also missing). The same situation can occur when a given backend does detect data corruption that is "repaired" by deleting corrupted keys (eg. LevelDB repair does this, as does Postgres's repair logic, and many, many others) without the higher-level application (`riak_ensemble` in this case) realizing those keys were deleted.

An example of data going back in time would be a user restoring K/V data (eg. LevelDB or Bitcask data) from a backup without also restoring the `riak_ensemble` data from the same backup. Thus, as far as `riak_ensemble` is concerned the peer is at epoch Y (with the assumption that the data is also at epoch Y), but in reality the data is at epoch X (where X < Y). Traditionally, there is no way for the consensus system to detect this case. However, the new integrated metadata trees in `riak_ensemble` (which are also at epoch Y) would immediately detect the invalid data on access.

This approach does nothing to detect intentionally crafted data corruption (eg. malicious users). If users can already access/modify your Riak data files, you've already lost (at least, with Riak's design today).
In addition to implementing this new integrity logic, this pull-request also removes the notion of backend trust/syncing from `riak_ensemble`. For Riak itself, this means that Riak-level AAE is no longer necessary to use strongly consistent operations, removing the operational burden discussed above. However, this new integrity checking approach focuses solely on syncing metadata between peers -- not data. This metadata is enough to detect corruption and trigger repair as data is accessed (eg. read repair), but does not sync cold data. Enabling Riak-level AAE is therefore still useful to detect/repair corrupted cold data. (_Note_: Technically not just yet -- that's one of the remaining pull-request items that needs to be done, and will require a PR against `riak_kv`. Coming soon!)

For more details, please see the commit messages for each of the commits in this pull-request, and look at the module-level edoc for `synctree.erl`.