New integrated integrity checking for previously committed data#37

Merged
borshop merged 15 commits into develop from jdb-integrity
Jul 12, 2014

Conversation


@jtuple jtuple commented Jul 3, 2014

This pull-request completely changes how riak_ensemble guarantees the integrity of committed backend data, even in the presence of byzantine faults. Prior to this pull-request, the approach used was to assume byzantine faults could only occur when a peer restarts and to use backend-specific syncing/verification logic to ensure data integrity and vouch for a given peer. This approach is discussed at length in basho/riak_kv#909 (along with the related paranoia setting pull-request and linked gist).

This approach has two main problems. First, it assumes data loss can only occur between peer restarts -- which is fundamentally unsound. Second, it pushes all data integrity guarantees to the backend implementation and assumes the backend correctly guarantees data integrity. This is a risky assumption.

For example, for Riak the riak_kv_ensemble_backend relies upon traditional AAE syncing to ensure data integrity. However, this is fundamentally unsound since the AAE trees are best effort and are not designed to be canonical sources of truth. Additionally, the AAE trees impose a significant operational burden on the user. Specifically, Riak K/V ensembles will be completely unavailable (eg. fail user requests) if AAE is disabled or if AAE trees are in the process of being built (which can take hours to days depending on data set size) when a majority of peers restarts. Thus, this approach burdens the user, leads to potentially long periods of unavailability, yet doesn't provide strong integrity guarantees.

In the end, the hash trees used for AAE are the wrong tool for guaranteeing data integrity. For traditional AP/Riak, the backend itself is the canonical source of truth. The AAE trees are simply best-effort projections of this truth, and are rebuilt weekly by scanning over the backend data directly.

For riak_ensemble, we need a metadata source with stronger guarantees -- where the metadata is always trusted, and where the backend is considered wrong (corrupted, lying, etc) if it ever differs from this trusted metadata.

This pull-request integrates backend integrity checking directly into riak_ensemble itself, removing the notion of backend trust / syncing. Instead, individual riak_ensemble peers now maintain an integrated self-validating hash tree (known as synctree in the code). The hash for a given node in the tree is stored in the node's parent, allowing all traversals through the tree to verify the integrity of the entire path -- from the root node down to the destination node. This is similar in spirit to the design used in the ZFS filesystem. (See this research paper for a high level discussion of the ZFS design as well as how it holds up to various evaluated data loss/corruption scenarios).
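The core idea, in which each parent stores the hash of its children so that any root-to-leaf traversal verifies the whole path, can be sketched in Python. The node layout and names here are illustrative, not the actual synctree.erl representation:

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Each node stores the expected hashes of its children, so verifying
# any root-to-leaf path also verifies the integrity of that path.
# nodes: id -> {"children": {child_id: expected_hash}, "payload": bytes}

def node_hash(nodes, node_id):
    node = nodes[node_id]
    child_part = "".join(
        f"{cid}:{chash}" for cid, chash in sorted(node["children"].items())
    )
    return h(node["payload"] + child_part.encode())

def verify_path(nodes, path, root_hash):
    """Verify each node along a root-to-leaf path against the hash
    stored in its parent (and the root against a trusted root hash)."""
    expected = root_hash
    for i, node_id in enumerate(path):
        if node_hash(nodes, node_id) != expected:
            return False  # corruption detected somewhere on this path
        if i + 1 < len(path):
            expected = nodes[node_id]["children"][path[i + 1]]
    return True
```

Because the trusted root hash covers every child hash transitively, tampering with any node or payload along a traversed path is detected without scanning the whole tree.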

The synctree is maintained on each peer in an ensemble and is always kept up-to-date. As new data is written, the synctree hash is updated on the leader after seeing a quorum accept the write. This hash tree update is then replicated to followers. (Note: this replication is currently asynchronous. We should make synchronous replication a configurable option for the highest level of paranoia/data integrity).

For all K/V operations, data returned from the backend is verified against the metadata stored in the hash tree. If the backend has incorrect data, the peer will detect the issue and attempt to repair the value from sibling peers.
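A minimal sketch of this verified read path, using plain dicts in place of the real backend, tree, and peer processes (all names here are illustrative, not riak_ensemble's API):

```python
import hashlib

def h(value: bytes) -> str:
    return hashlib.sha256(value).hexdigest()

def verified_get(backend: dict, tree: dict, peers: list, key):
    """The synctree metadata is trusted; the backend is considered
    wrong whenever it disagrees with that metadata."""
    expected = tree.get(key)                 # trusted hash, or None
    if expected is None:
        return None                          # a notfound the tree agrees with
    value = backend.get(key)
    if value is not None and h(value) == expected:
        return value
    # Backend is corrupted or returned an unexpected notfound:
    # repair from sibling replicas that match the trusted hash.
    for peer in peers:
        candidate = peer.get(key)
        if candidate is not None and h(candidate) == expected:
            backend[key] = candidate         # heal the local copy
            return candidate
    raise RuntimeError(f"no replica holds a valid copy of {key!r}")
```

Note that an unexpected notfound and a corrupted value are handled identically: both disagree with the trusted metadata and both trigger repair.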

As with the hashtree module from riak_core, which is used for AAE, the synctree module is designed to perform a tree exchange with remote replicas to detect and repair any missing data -- in this case, object integrity metadata.
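The exchange idea, comparing summary hashes first and descending only into portions of the tree that disagree, might be sketched as follows. This uses a flat two-level tree with illustrative names, not the actual synctree exchange protocol:

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def bucket_hash(bucket):
    # Summary hash over a sorted bucket of key -> leaf-hash pairs.
    return h("".join(f"{k}:{v}" for k, v in sorted(bucket.items())).encode())

def exchange_diff(local, remote):
    """Compare two trees (dict: bucket_id -> {key: leaf_hash}) and
    return the keys whose hashes differ, comparing individual keys
    only inside buckets whose summary hashes disagree."""
    diffs = []
    for bucket_id in set(local) | set(remote):
        lb = local.get(bucket_id, {})
        rb = remote.get(bucket_id, {})
        if bucket_hash(lb) == bucket_hash(rb):
            continue  # identical bucket: skip without comparing keys
        for key in set(lb) | set(rb):
            if lb.get(key) != rb.get(key):
                diffs.append(key)
    return sorted(diffs)
```

In the real module the tree has multiple levels, so an exchange between mostly-identical replicas touches only a logarithmic number of segments.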

At startup, all peers are considered untrusted and are required to perform a tree exchange with a majority of trusted peers (or all peers if a trusted majority is unavailable). This can require more than a simple majority of online/reachable nodes. After performing this exchange, a given peer's tree is assumed to remain trusted. We make this assumption because in addition to the on-disk tree itself, we also maintain a copy of the top hash of the tree in memory. The assumption is that any combination of on-disk or in-memory corruption that occurs will not result in corrupted hashes that pass the tree verification check.

All operations on the tree (reads, writes, and exchange-related reads) verify all encountered tree paths. Thus, corruption in the tree itself will be detected. If detected, the tree discards the corrupted data, becomes untrusted, and must once again perform an exchange with a majority of trusted peers. This design therefore ensures the integrity of the object metadata at all times, enabling a peer to always be able to correctly detect an invalid object (or unexpected notfound) from the backend.

The specific data loss / byzantine faults this pull-request is designed to handle include both undetected data loss as well as data unexpectedly traveling back in time.

An example of undetected data loss would be a backend that lies about writing an object + CRC to disk. Thus, in the future the backend could return a notfound without firing any integrity check (since the CRC itself is also missing). The same situation can occur in cases where a given backend does detect data corruption that is "repaired" by deleting corrupted keys (eg. LevelDB repair does this, as does Postgres's repair logic, and many, many others) without the higher level application (riak_ensemble in this case) realizing those keys were deleted.

An example of data going back in time would be a user restoring K/V data (eg. LevelDB or Bitcask data) from a backup without also restoring the riak_ensemble data from the same backup. Thus, as far as riak_ensemble is concerned the peer is at epoch Y (with the assumption data is also at epoch Y), but in reality the data is at epoch X (where X < Y). Traditionally, there is no way for the consensus system to detect this case. However, the new integrated metadata trees in riak_ensemble (which are also at epoch Y) would immediately detect invalid data on access.

This approach does nothing to detect intentionally crafted data corruption (eg. malicious users). If users can already access/modify your Riak data files, you've already lost (at least, with Riak's design today).

In addition to implementing this new integrity logic, this pull-request also removes the notion of backend trust/syncing from riak_ensemble. For Riak itself, this means that Riak-level AAE is no longer necessary to use strongly consistent operations, removing the operational burden discussed above. However, this new integrity checking approach focuses solely on syncing metadata between peers -- not data. This metadata is enough to detect corruption and trigger repair as data is accessed (eg. read repair), but does not sync cold data. Enabling Riak-level AAE is still therefore useful to detect/repair corrupted cold data. (_Note_: Technically not just yet -- that's one of the remaining pull-request items that needs to be done, and will require a PR against riak_kv. Coming soon!).

For more details, please see the commit messages for each of the commits in this pull-request, as well as the module level edoc for synctree.erl.

jtuple added 6 commits July 3, 2014 00:18
Up to this point, riak_ensemble has not had any included unit tests,
relying instead upon high-level riak_test system tests that indirectly
test riak_ensemble via Riak. This commit changes that.

This commit adds support for test-only dependencies and adds riak_test
as one of those dependencies to get access to the intercept logic
included with riak_test.

This commit also adds the 'test/run.sh' runner script which is designed
to run individual eunit tests separately for isolation -- preventing
test environment woes. This is necessary because most interesting
riak_ensemble tests are inherently stateful and correctly cleaning up
after a stateful test has historically proven challenging (one reason
why riak_ensemble has preferred riak_tests up to this point).

This commit also adds the `drop_write_test` which uses an intercept to
lie about storing a peer-write to stable storage. Against the current
design of riak_ensemble, this can result in a notfound being returned
to the user for a value we previously acknowledged as written. This is
a serious bug that will be addressed in a later commit.
It is expected that orddict_delta returns the delta list in sorted
order. However, the current implementation incorrectly mixes a list
generated in reverse with a list generated in normal order, leading
to inconsistent ordering.

This commit fixes that issue, leading to orddict_delta always returning
results in proper order.
The new synctree module implements a self-validating hashtree that is
going to be used as riak_ensemble's new data integrity mechanism (in a
future commit). This tree is designed similar to the hashtree used in
file systems such as ZFS, with parent nodes containing the hash of
their children. Thus, during traversal a tree path can be completely
verified from the root node to the endpoint.

The tree is designed to be replicated on multiple nodes and provides
built-in exchange logic that efficiently determines the differences between
two trees -- thus trees can exchange with peers and heal missing/corrupted
data.

The design is conceptually similar to hashtree.erl from riak_core. For
further details, look at the module level edoc in synctree.

The synctree design allows for pluggable backends. This commit provides
the synctree_orddict and synctree_ets backends.
The synctree_leveldb module provides a synctree backend built on LevelDB.
This is the first persistent synctree backend and will be the backend used
for riak_ensemble peers (in a future commit).
This commit completely changes how riak_ensemble deals with backend
data loss situations.

Traditionally, once data was committed to a given backend, riak_ensemble
assumed the data would never go missing. Of course, this is an invalid
assumption, so the notion of backend trust was previously added to
riak_ensemble along with the notion of backend syncing. Untrusted peers
were required to sync with a majority of peers to become trusted. What
exactly syncing meant was a backend-specific detail.

While this handled certain cases of data loss scenarios, it was far
from perfect.

First, it assumed backend syncing was valid. For the Riak K/V backend,
this was not true. The K/V backend relied upon AAE exchange, which
was never designed to be 100% authoritative.

Second, as an optimization, we only checked a peer's trust at peer
startup time (or after a backend-specific process crashed). This
does not help if a backend loses data while running in normal
operation.

This commit completely removes the notion of backend trust and syncing
from riak_ensemble. Instead, riak_ensemble now implements an embedded
integrity checking protocol based on self-validating hashtrees (using
the new synctree module).

Writing of keys now inserts appropriate metadata into this hash tree,
which is always validated against the keys when read. The synctree
is considered the authoritative source. If the backend disagrees, the
data is considered corrupted and is repaired from data on other
replicas.

Like before, peers that restart are considered untrusted and must do
a synctree exchange with a trusted majority (which can require more
than a simple quorum during byzantine situations). This establishes
the initial trust of the synctree at peer startup. However, since
the synctree is self-validating, every operation that touches the
tree validates the traversed path. Thus, any corruption to either the
on-disk or in-memory components of the synctree will be detected,
causing the tree to once again become untrusted and require a new
round of syncing with a trusted majority.

Thus, as designed, a given peer is always able to validate the
entirety of its backend data against the synctree metadata, as well
as validate the synctree itself and repair it via exchange with
sibling replicas as needed.
src/synctree.erl Outdated
typo: use need instead of needed

Since we can now be in the leading state while the synctree may not be ready, I imagine we will need to indicate that clearly in the aae-status command in riak_kv so people know why their requests are failing. I suppose it will generally be a brief period of time, but I thought adding a reminder here to do that wouldn't hurt.

jtuple added 6 commits July 10, 2014 19:58
By default, each peer synctree maps to a unique on-disk LevelDB
database. This commit enables ensemble backends to override this
behavior, supporting M:N mappings where multiple trees are stored
in the same on-disk instance.

This change is predominantly aimed at Riak, which will use this
feature to map all peers associated with a single vnode to a common
per-vnode synctree LevelDB instance. Riak AAE already does a similar
thing in its hashtree code. Using fewer instances results in better
usage of file handles, write buffers, and the block/file cache.
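The M:N mapping amounts to namespacing keys by tree id within a shared store. A minimal sketch, with a plain dict standing in for the shared LevelDB instance and all names hypothetical:

```python
class SharedStore:
    """Many logical synctrees share one physical KV store by prefixing
    every key with the owning tree's id, so opening N trees costs one
    set of file handles, write buffers, and caches instead of N."""

    def __init__(self):
        self.db = {}  # stand-in for a single LevelDB instance

    def _k(self, tree_id: str, key: bytes) -> bytes:
        return tree_id.encode() + b"/" + key

    def put(self, tree_id: str, key: bytes, value: bytes):
        self.db[self._k(tree_id, key)] = value

    def get(self, tree_id: str, key: bytes):
        return self.db.get(self._k(tree_id, key))
```

With a real LevelDB instance the prefix also keeps each tree's keys contiguous in sorted order, so per-tree scans remain efficient.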
This commit cleans up the object hashing code path used by peers to
compute and verify hashes used for integrity checking. This commit
also sets the stage to extend the hashing mechanism in the future to
directly include object hashes as part of the synctree metadata.
The current behavior of asynchronously updating remote synctrees after
a put works well in practice. However, for maximum byzantine tolerance
synctree updates should be synchronous -- occurring on the put path
itself and requiring a valid quorum response before considering a put
successful.

This commit adds synchronous synctree updates and makes the choice
user configurable. The default remains asynchronous.
This commit adds a single byte to the data encoding for bucket hashes to
enable easy future extension of the data format to support different
hash functions.
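The idea of a leading version byte selecting the hash function can be sketched as follows. The byte-to-algorithm mapping here is hypothetical, not the actual encoding this commit uses:

```python
import hashlib

# Hypothetical registry: the leading byte selects the hash function,
# so new algorithms can be added without breaking old encodings.
HASHES = {0: hashlib.sha1, 1: hashlib.sha256}

def encode_hash(version: int, data: bytes) -> bytes:
    """Prepend a version byte to the digest of `data`."""
    return bytes([version]) + HASHES[version](data).digest()

def verify_hash(encoded: bytes, data: bytes) -> bool:
    """Re-hash `data` with whichever algorithm the version byte names."""
    version, digest = encoded[0], encoded[1:]
    return HASHES[version](data).digest() == digest
```

Old stored hashes remain verifiable after a new algorithm is added, since each encoded value carries the identity of the function that produced it.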
jtuple added 2 commits July 11, 2014 06:13
-- Improve various comments / fix typos
-- Add missing types and specs for synctree.erl
-- Consistently use orddict_find in synctree.erl
-- Add missing erlang:trunc calls in synctree.erl
This commit makes it so users can override the default behavior of
never trusting synctrees after a peer restart. Doing so provides fewer
guarantees against byzantine faults, but allows ensembles to
operate with simple majorities in all circumstances rather than
needing byzantine-tolerant trusted majorities.

This provides a knob similar to the previous low vs medium trust
setting which is no longer available under the new integrity checking
approach.

This commit also updates the peer ping_quorum command to include the
leader's tree_ready state. This is used in Riak as part of an updated
ensemble_status command.

jtuple commented Jul 11, 2014

I've finally pushed out the sibling riak_kv pull-request (basho/riak_kv#1002) that deletes all the no-longer-needed code plus updates ensemble_status (including adding the Leader ready info) as requested.

I've also pushed commits to this branch for all remaining work. I believe all issues mentioned in review comments in this pull-request as well as discussed in backchannels are covered. Please look at the individual commits for more details.

I decided to punt on adding object hashes to the synctree metadata for 2.0. Instead, I just did the minimal work necessary to make the data format easy to extend in the future. We can do the more detailed work in 2.0.1 or 2.1.

Also, we talked earlier this week about optimizing synctree by having the upper levels of the tree cached in memory outside of LevelDB. I agree that's important, but am punting to post 2.0.0. I might test a prototype while doing final 2.0 perf testing, but that work is now also a 2.0.1 or later task.

The recent change to synctree_leveldb to allow sharing LevelDB instances
introduced a necessary ETS table that was not being created in the
synctree_pure test. This commit fixes the test.

While investigating the above failure, it became apparent that the
corrupt_exchange_test was somewhat flaky and would fail on occasion.
This commit improves the test by replacing a timeout with an explicit
test condition.

andrewjstone added a commit to basho/riak_test that referenced this pull request Jul 11, 2014
Include {error, <<"failed">>} as allowed failure so that test passes
with changes for basho/riak_ensemble#37 and basho/riak_kv#1002
@andrewjstone

👍 b1f9030

ensemble_interleave is failing, but a fix is in basho/riak_test#657
ensemble_sync is also failing for the same reason. A PR is coming.

andrewjstone added a commit to basho/riak_test that referenced this pull request Jul 11, 2014
Allow {error, <<"failed">>} as an error response in ensemble_sync. Fixes
the test with basho/riak_ensemble#37 and basho/riak_kv#1002
jtuple added a commit that referenced this pull request Jul 11, 2014
jtuple added a commit that referenced this pull request Jul 11, 2014
borshop added a commit that referenced this pull request Jul 11, 2014
Merge integration branch for #37

Reviewed-by: andrewjstone
@borshop borshop merged commit b1f9030 into develop Jul 12, 2014
@seancribbs seancribbs deleted the jdb-integrity branch April 1, 2015 23:15