New integrated integrity checking for previously committed data#37

Merged
borshop merged 15 commits into develop from jdb-integrity
Jul 12, 2014

Conversation


@jtuple jtuple commented Jul 3, 2014

This pull-request completely changes how riak_ensemble guarantees the integrity of committed backend data, even in the presence of byzantine faults. Prior to this pull-request, the approach used was to assume byzantine faults could only occur when a peer restarts and to use backend-specific syncing/verification logic to ensure data integrity and vouch for a given peer. This approach is discussed at length in basho/riak_kv#909 (along with the related paranoia setting pull-request and linked gist).

This approach has two main problems. First, it assumes data loss can only occur between peer restarts -- which is fundamentally unsound. Second, it pushes all data integrity guarantees to the backend implementation and assumes the backend correctly guarantees data integrity. This is a risky assumption.

For example, for Riak the riak_kv_ensemble_backend relies upon traditional AAE syncing to ensure data integrity. However, this is fundamentally unsound since the AAE trees are best effort and are not designed to be canonical sources of truth. Additionally, the AAE trees impose a significant operational burden on the user. Specifically, Riak K/V ensembles will be completely unavailable (eg. fail user requests) if AAE is disabled or if AAE trees are in the process of being built (which can take hours to days depending on data set size) when a majority of peers restarts. Thus, this approach burdens the user, leads to potentially long periods of unavailability, yet doesn't provide strong integrity guarantees.

In the end, the hash trees used for AAE are the wrong tool for guaranteeing data integrity. For traditional AP/Riak, the backend itself is the canonical source of truth. The AAE trees are simply best-effort projections of this truth, and are rebuilt weekly by scanning over the backend data directly.

For riak_ensemble, we need a metadata source with stronger guarantees -- where the metadata is always trusted, and where the backend is considered wrong (corrupted, lying, etc) if it ever differs from this trusted metadata.

This pull-request integrates backend integrity checking directly into riak_ensemble itself, removing the notion of backend trust / syncing. Instead, individual riak_ensemble peers now maintain an integrated self-validating hash tree (known as synctree in the code). The hash for a given node in the tree is stored in the node's parent, allowing all traversals through the tree to verify the integrity of the entire path -- from the root node down to the destination node. This is similar in spirit to the design used in the ZFS filesystem. (See this research paper for a high level discussion of the ZFS design as well as how it holds up to various evaluated data loss/corruption scenarios).
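The core idea, in which each parent stores the hash of its children so that any root-to-leaf traversal verifies the whole path, can be sketched in Python. The node layout and names here are illustrative, not the actual synctree.erl representation:

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Each node stores the expected hashes of its children, so verifying
# any root-to-leaf path also verifies the integrity of that path.
# nodes: id -> {"children": {child_id: expected_hash}, "payload": bytes}

def node_hash(nodes, node_id):
    node = nodes[node_id]
    child_part = "".join(
        f"{cid}:{chash}" for cid, chash in sorted(node["children"].items())
    )
    return h(node["payload"] + child_part.encode())

def verify_path(nodes, path, root_hash):
    """Verify each node along a root-to-leaf path against the hash
    stored in its parent (and the root against a trusted root hash)."""
    expected = root_hash
    for i, node_id in enumerate(path):
        if node_hash(nodes, node_id) != expected:
            return False  # corruption detected somewhere on this path
        if i + 1 < len(path):
            expected = nodes[node_id]["children"][path[i + 1]]
    return True
```

Because the trusted root hash covers every child hash transitively, tampering with any node or payload along a traversed path is detected without scanning the whole tree.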

The synctree is maintained on each peer in an ensemble and is always kept up-to-date. As new data is written, the synctree hash is updated on the leader after seeing a quorum accept the write. This hash tree update is then replicated to followers. (Note: this replication is currently asynchronous. We should make synchronous replication a configurable option for the highest level of paranoia/data integrity).

For all K/V operations, data returned from the backend is verified against the metadata stored in the hash tree. If the backend has incorrect data, the peer will detect the issue and attempt to repair the value from sibling peers.
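A minimal sketch of this verified read path, using plain dicts in place of the real backend, tree, and peer processes (all names here are illustrative, not riak_ensemble's API):

```python
import hashlib

def h(value: bytes) -> str:
    return hashlib.sha256(value).hexdigest()

def verified_get(backend: dict, tree: dict, peers: list, key):
    """The synctree metadata is trusted; the backend is considered
    wrong whenever it disagrees with that metadata."""
    expected = tree.get(key)                 # trusted hash, or None
    if expected is None:
        return None                          # a notfound the tree agrees with
    value = backend.get(key)
    if value is not None and h(value) == expected:
        return value
    # Backend is corrupted or returned an unexpected notfound:
    # repair from sibling replicas that match the trusted hash.
    for peer in peers:
        candidate = peer.get(key)
        if candidate is not None and h(candidate) == expected:
            backend[key] = candidate         # heal the local copy
            return candidate
    raise RuntimeError(f"no replica holds a valid copy of {key!r}")
```

Note that an unexpected notfound and a corrupted value are handled identically: both disagree with the trusted metadata and both trigger repair.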

As with the hashtree module from riak_core, which is used for AAE, the synctree module is designed to perform a tree exchange with remote replicas to detect and repair any missing data -- in this case, object integrity metadata.
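The exchange idea, comparing summary hashes first and descending only into portions of the tree that disagree, might be sketched as follows. This uses a flat two-level tree with illustrative names, not the actual synctree exchange protocol:

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def bucket_hash(bucket):
    # Summary hash over a sorted bucket of key -> leaf-hash pairs.
    return h("".join(f"{k}:{v}" for k, v in sorted(bucket.items())).encode())

def exchange_diff(local, remote):
    """Compare two trees (dict: bucket_id -> {key: leaf_hash}) and
    return the keys whose hashes differ, comparing individual keys
    only inside buckets whose summary hashes disagree."""
    diffs = []
    for bucket_id in set(local) | set(remote):
        lb = local.get(bucket_id, {})
        rb = remote.get(bucket_id, {})
        if bucket_hash(lb) == bucket_hash(rb):
            continue  # identical bucket: skip without comparing keys
        for key in set(lb) | set(rb):
            if lb.get(key) != rb.get(key):
                diffs.append(key)
    return sorted(diffs)
```

In the real module the tree has multiple levels, so an exchange between mostly-identical replicas touches only a logarithmic number of segments.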

At startup, all peers are considered untrusted and are required to perform a tree exchange with a majority of trusted peers (or all peers if a trusted majority is unavailable). This can require more than a simple majority of online/reachable nodes. After performing this exchange, a given peer's tree is assumed to remain trusted. We make this assumption because in addition to the on-disk tree itself, we also maintain a copy of the top hash of the tree in memory. The assumption is that any combination of on-disk or in-memory corruption that occurs will not result in corrupted hashes that pass the tree verification check.

All operations on the tree (reads, writes, and exchange-related reads) verify all encountered tree paths. Thus, corruption in the tree itself will be detected. If detected, the tree discards the corrupted data, becomes untrusted, and must once again perform an exchange with a majority of trusted peers. This design therefore ensures the integrity of the object metadata at all times, enabling a peer to always be able to correctly detect an invalid object (or unexpected notfound) from the backend.

The specific data loss / byzantine faults this pull-request is designed to handle include both undetected data loss as well as data unexpectedly traveling back in time.

An example of undetected data loss would be a backend that lies about writing an object + CRC to disk. Thus, in the future the backend could return a notfound without firing any integrity check (since the CRC itself is also missing). The same situation can occur in cases where a given backend does detect data corruption that is "repaired" by deleting corrupted keys (eg. LevelDB repair does this, as does Postgres's repair logic, and many, many others) without the higher level application (riak_ensemble in this case) realizing those keys were deleted.

An example of data going back in time would be a user restoring K/V data (eg. LevelDB or Bitcask data) from a backup without also restoring the riak_ensemble data from the same backup. Thus, as far as riak_ensemble is concerned the peer is at epoch Y (with the assumption data is also at epoch Y), but in reality the data is at epoch X (where X < Y). Traditionally, there is no way for the consensus system to detect this case. However, the new integrated metadata trees in riak_ensemble (which are also at epoch Y) would immediately detect invalid data on access.

This approach does nothing to detect intentionally crafted data corruption (eg. malicious users). If users can already access/modify your Riak data files, you've already lost (at least, with Riak's design today).

In addition to implementing this new integrity logic, this pull-request also removes the notion of backend trust/syncing from riak_ensemble. For Riak itself, this means that Riak-level AAE is no longer necessary to use strongly consistent operations, removing the operational burden discussed above. However, this new integrity checking approach focuses solely on syncing metadata between peers -- not data. This metadata is enough to detect corruption and trigger repair as data is accessed (eg. read repair), but does not sync cold data. Enabling Riak-level AAE is still therefore useful to detect/repair corrupted cold data. (_Note_: Technically not just yet -- that's one of the remaining pull-request items that needs to be done, and will require a PR against riak_kv. Coming soon!).

For more details, please see the commit messages for each of the commits in this pull-request, as well as the module level edoc for synctree.erl.

jtuple added 6 commits July 3, 2014 00:18
Up to this point, riak_ensemble has not had any included unit tests,
relying instead upon high-level riak_test system tests that indirectly
test riak_ensemble via Riak. This commit changes that.

This commit adds support for test-only dependencies and adds riak_test
as one of those dependencies to get access to the intercept logic
included with riak_test.

This commit also adds the 'test/run.sh' runner script which is designed
to run individual eunit tests separately for isolation -- preventing
test environment woes. This is necessary because most interesting
riak_ensemble tests are inherently stateful and correctly cleaning up
after a stateful test has historically proven challenging (one reason
why riak_ensemble has preferred riak_tests up to this point).

This commit also adds the `drop_write_test` which uses an intercept to
lie about storing a peer-write to stable storage. Against the current
design of riak_ensemble, this can result in a notfound being returned
to the user for a value we previously acknowledged as written. This is
a serious bug that will be addressed in a later commit.
It is expected that orddict_delta returns the delta list in sorted
order. However, the current implementation incorrectly mixes a list
generated in reverse with a list generated in normal order, leading
to inconsistent ordering.

This commit fixes that issue, leading to orddict_delta always returning
results in proper order.
The new synctree module implements a self-validating hashtree that is
going to be used as riak_ensemble's new data integrity mechanism (in a
future commit). This tree is designed similar to the hashtree used in
file systems such as ZFS, with parent nodes containing the hash of
their children. Thus, during traversal a tree path can be completely
verified from the root node to the endpoint.

The tree is designed to be replicated on multiple nodes and provides
built-in exchange logic that efficiently determines the differences between
two trees -- thus trees can exchange with peers and heal missing/corrupted
data.

The design is conceptually similar to hashtree.erl from riak_core. For
further details, look at the module level edoc in synctree.

The synctree design allows for pluggable backends. This commit provides
the synctree_orddict and synctree_ets backends.
The synctree_leveldb module provides a synctree backend built on LevelDB.
This is the first persistent synctree backend and will be the backend used
for riak_ensemble peers (in a future commit).
This commit completely changes how riak_ensemble deals with backend
data loss situations.

Traditionally, once data was committed to a given backend, riak_ensemble
assumed the data would never go missing. Of course, this is an invalid
assumption, so the notion of backend trust was previously added to
riak_ensemble along with the notion of backend syncing. Untrusted peers
were required to sync with a majority of peers to become trusted. What
exactly syncing meant was a backend-specific detail.

While this handled certain cases of data loss scenarios, it was far
from perfect.

First, it assumed backend syncing was valid. For the Riak K/V backend,
this was not true. The K/V backend relied upon AAE exchange, which
was never designed to be 100% authoritative.

Second, as an optimization, we only checked a peer's trust at peer
startup time (or after a backend-specific process crashed). This
does not help if a backend loses data while running in normal
operation.

This commit completely removes the notion of backend trust and syncing
from riak_ensemble. Instead, riak_ensemble now implements an embedded
integrity checking protocol based on self-validating hashtrees (using
the new synctree module).

Writing of keys now inserts appropriate metadata into this hash tree,
which is always validated against the keys when read. The synctree
is considered the authoritative source. If the backend disagrees, the
data is considered corrupted and is repaired from data on other
replicas.

Like before, peers that restart are considered untrusted and must do
a synctree exchange with a trusted majority (which can require more
than a simple quorum during byzantine situations). This establishes
the initial trust of the synctree at peer startup. However, since
the synctree is self-validating, every operation that touches the
tree validates the traversed path. Thus, any corruption to either the
on-disk or in-memory components of the synctree will be detected,
causing the tree to once again become untrusted and require a new
round of syncing with a trusted majority.

Thus, as designed, a given peer is always able to validate the
entirety of its backend data against the synctree metadata, as well
as validate the synctree itself and repair it via exchange with
sibling replicas as needed.
src/synctree.erl Outdated
typo: use need instead of needed

Since we can now be in the leading state while the synctree may not be ready, I imagine we will need to indicate that clearly in the aae-status command in riak_kv so people know why their requests are failing. I suppose it will generally be a brief period of time, but I thought adding a reminder here to do that wouldn't hurt.

jtuple added 6 commits July 10, 2014 19:58
By default, each peer synctree maps to a unique on-disk LevelDB
database. This commit enables ensemble backends to override this
behavior, supporting M:N mappings where multiple trees are stored
in the same on-disk instance.

This change is predominantly aimed at Riak, which will use this
feature to map all peers associated with a single vnode to a common
per-vnode synctree LevelDB instance. Riak AAE already does a similar
thing in its hashtree code. Using fewer instances results in better
usage of file handles, write buffers, and the block/file cache.
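The M:N mapping amounts to namespacing keys by tree id within a shared store. A minimal sketch, with a plain dict standing in for the shared LevelDB instance and all names hypothetical:

```python
class SharedStore:
    """Many logical synctrees share one physical KV store by prefixing
    every key with the owning tree's id, so opening N trees costs one
    set of file handles, write buffers, and caches instead of N."""

    def __init__(self):
        self.db = {}  # stand-in for a single LevelDB instance

    def _k(self, tree_id: str, key: bytes) -> bytes:
        return tree_id.encode() + b"/" + key

    def put(self, tree_id: str, key: bytes, value: bytes):
        self.db[self._k(tree_id, key)] = value

    def get(self, tree_id: str, key: bytes):
        return self.db.get(self._k(tree_id, key))
```

With a real LevelDB instance the prefix also keeps each tree's keys contiguous in sorted order, so per-tree scans remain efficient.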
This commit cleans up the object hashing code path used by peers to
compute and verify hashes used for integrity checking. This commit
also sets the stage to extend the hashing mechanism in the future to
directly include object hashes as part of the synctree metadata.
The current behavior of asynchronously updating remote synctrees after
a put works well in practice. However, for maximum byzantine tolerance
synctree updates should be synchronous -- occurring on the put path
itself and requiring a valid quorum response before considering a put
successful.

This commit adds synchronous synctree updates and makes the choice
user configurable. The default remains asynchronous.
This commit adds a single byte to the data encoding for bucket hashes to
enable easy future extension of the data format to support different
hash functions.
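The idea of a leading version byte selecting the hash function can be sketched as follows. The byte-to-algorithm mapping here is hypothetical, not the actual encoding this commit uses:

```python
import hashlib

# Hypothetical registry: the leading byte selects the hash function,
# so new algorithms can be added without breaking old encodings.
HASHES = {0: hashlib.sha1, 1: hashlib.sha256}

def encode_hash(version: int, data: bytes) -> bytes:
    """Prepend a version byte to the digest of `data`."""
    return bytes([version]) + HASHES[version](data).digest()

def verify_hash(encoded: bytes, data: bytes) -> bool:
    """Re-hash `data` with whichever algorithm the version byte names."""
    version, digest = encoded[0], encoded[1:]
    return HASHES[version](data).digest() == digest
```

Old stored hashes remain verifiable after a new algorithm is added, since each encoded value carries the identity of the function that produced it.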
jtuple added 2 commits July 11, 2014 06:13
-- Improve various comments / fix typos
-- Add missing types and specs for synctree.erl
-- Consistently use orddict_find in synctree.erl
-- Add missing erlang:trunc calls in synctree.erl
This commit makes it so users can override the default behavior of
never trusting synctrees after a peer restart. Doing so provides fewer
guarantees against byzantine faults, but allows ensembles to
operate with simple majorities in all circumstances rather than
needing byzantine-tolerant trusted majorities.

This provides a knob similar to the previous low vs medium trust
setting which is no longer available under the new integrity checking
approach.

This commit also updates the peer ping_quorum command to include the
leader's tree_ready state. This is used in Riak as part of an updated
ensemble_status command.

jtuple commented Jul 11, 2014

I've finally pushed out the sibling riak_kv pull-request (basho/riak_kv#1002) that deletes all the no-longer-needed code plus updates ensemble_status (including adding the Leader ready info) as requested.

I've also pushed commits to this branch for all remaining work. I believe all issues mentioned in review comments in this pull-request as well as discussed in backchannels are covered. Please look at the individual commits for more details.

I decided to punt on adding object hashes to the synctree metadata for 2.0. Instead, I just did the minimal work necessary to make the data format easy to extend in the future. We can do the more detailed work in 2.0.1 or 2.1.

Also, we talked earlier this week about optimizing synctree by having the upper levels of the tree cached in memory outside of LevelDB. I agree that's important, but am punting to post 2.0.0. I might test a prototype while doing final 2.0 perf testing, but that work is now also a 2.0.1 or later task.

The recent change to synctree_leveldb to allow sharing LevelDB instances
introduced a necessary ETS table that was not being created in the
synctree_pure test. This commit fixes the test.

While investigating the above failure, it became apparent that the
corrupt_exchange_test was somewhat flaky and would fail on occasion.
This commit improves the test by replacing a timeout with an explicit
test condition.

andrewjstone added a commit to basho/riak_test that referenced this pull request Jul 11, 2014
Include {error, <<"failed">>} as allowed failure so that test passes
with changes for basho/riak_ensemble#37 and basho/riak_kv#1002
@andrewjstone

👍 b1f9030

ensemble_interleave is failing, but a fix is in basho/riak_test#657
ensemble_sync is also failing for the same reason. A PR is coming.

andrewjstone added a commit to basho/riak_test that referenced this pull request Jul 11, 2014
Allow {error, <<"failed">>} as an error response in ensemble_sync. Fixes
the test with basho/riak_ensemble#37 and basho/riak_kv#1002
jtuple added a commit that referenced this pull request Jul 11, 2014
jtuple added a commit that referenced this pull request Jul 11, 2014
borshop added a commit that referenced this pull request Jul 11, 2014
Merge integration branch for #37

Reviewed-by: andrewjstone
@borshop borshop merged commit b1f9030 into develop Jul 12, 2014
@seancribbs seancribbs deleted the jdb-integrity branch April 1, 2015 23:15