Fix numerous issues with leader leases#41
Conversation
By design, riak_ensemble uses leader leases which enable the leader to reply to read requests without contacting other peers in the ensemble. This is similar to the design discussed in "Paxos Made Live" [1]. To safely use leader leases, a consensus protocol must: 1. Use strong leaders. A new leader should never be able to be elected while followers are supporting an existing leader. 2. Ensure that no peers (current followers or otherwise) can be become a leader before the current leader's lease expires. The protocol used by riak_ensemble supports strong leaders, thus satisfying the first condition. The second condition is challenging, because there are very few guarantees one can make about system clocks; and even fewer that one can make between clocks on different machines. This commit hardens the leader leasing logic in riak_ensemble: -- Leases are now explicitly tracked (rather than just implicitly tracked as part of the leader tick/step_down transition). -- Worker processes now check that a lease is still valid after performing a read (rather than just before). This ensures that slow reads that complete after a lease expires are rejected. -- The new lease tracking logic verifies a lease against both Erlang corrected time [2] (using receive timeouts) as well as against the OS monotonic clock. This is the strongest guarantee we can provide without special clock hardware (eg. GPS clocks). -- Users can now set 'trust_lease = false' to force riak_ensemble to never trust leases and always perform quorum operations. -- A new quorum operation was added (check_epoch) that solely verifies the legitimacy of the current leader. Previously, not trusting the lease performed a full quorum read which is unnecessary. There is no reason not to trust the leader's local data if the leader is legitimate. [1] http://dl.acm.org/citation.cfm?id=1281103 [2] http://www.erlang.org/doc/apps/erts/time_correction.html
src/riak_ensemble_lease.erl
Outdated
There was a problem hiding this comment.
Little typo here in case you modify this file again.
|
I agree that checking the lease after a read is more reliable than what we had before, and making the lease handling more explicit makes sense. I'm not sure what we are buying by having (and maintaining) our own monotonic clock NIFs since Erlang already handles correcting for time on receives. |
There was a problem hiding this comment.
You can improve on this by doing integer divide on timebase_info.denom and also calculating the remainder, then multiplying separately by the numerator. Then you can separately compute and add (time div denom) * numer + (remainder * number div denom). The term with the remainder is less likely to overflow and can be checked separately.
There was a problem hiding this comment.
This code is copied from the linked Apple technical note (https://developer.apple.com/library/mac/qa/qa1398/_index.html). As used by riak_ensemble, integer overflow is a non-issue. Is it really worth it to worry about this at all?
There was a problem hiding this comment.
I suppose it's just an OSX thing, but it does look sloppy. I'm fine with improving later if an issue arises. If you do, numer and denom are 32 bit unsigned values, so the remainder term above would never overflow if you take care of casting to 64 bits first. Apparently this is mostly just 1 these days (numer == denom) according to the source in the boost chrono library.
There was a problem hiding this comment.
Anyway, for future reference, this is how I think it should be done to make it more robust to overflow:
timeNano = (time / timebase_info.denom * timebase_info.numer) +
(time % timebase_info.denom) * timebase_info.numer / timebase_info.denom;
|
Concerning the comment about the benefit of checking the monotonic time: Checking the monotonic time buys us additional safety. Erlang time correction is a blackbox with various implementations on different platforms that have unknown guarantees. I'd rather not manually verify the time correction implementation in BEAM on multiple platforms to see how strong the guarantees actually are. Does Erlang immediately correct for a clock jump, or is there a delay before timer correction kicks in? Eg. will a subset of timers actually be delivered too early/late before we fix the situation? Unclear. Likewise, the Erlang timer correction is trying to track real time. The Erlang clock may run slower or faster to skew towards real time. Having the leader move faster than the followers if fine, but having the leader move slower could make the leader think a lease extends longer than it should. In practice, this shouldn't be an issue since Erlang only skews by 1% and our defaults set the follower timeout to 4x the lease time. But, nonetheless something else to consider. By requiring both Erlang corrected time and the direct monotonic clocks to agree, we have stronger guarantees and are less susceptible to either approach being wrong/buggy. The goal is to be at least as good as other systems (that just use monotonic clocks), if not better (because we can also exploit Erlang's independent time correction logic). In the end, the lease checking code is an optimization. Either both Erlang corrected time and the direct monotonic clock agree that a lease is still valid, or they don't and we force a quorum lease check. Adding the monotonic clock check can't make things worse, and (hopefully) makes things better. |
Move OS X timebase_info initialization in riak_ensemble_clock.c into the NIF initialization function to avoid issues with multiple threads. Correct several comment typos. Rename variables in the riak_ensemble_peer check_epoch code to make the code more clear.
|
Pushed a commit that should address all review comments made thus far. |
There was a problem hiding this comment.
I guess I don't understand why we don't simply give up if the lease is up. At this point things might be ugly, something is preventing the lease from being acquired, so maybe the leader shouldn't try to block hoping to verify it's still boss. You are more likely to be in the middle of an election here, etc. Maybe just let it fall back to the get_latest_obj (normal quorum get) path?
There was a problem hiding this comment.
The leader dosen't block, this is being performed in an async FSM worker.
This code path is used both when using leases and when not (eg. user set trust lease to false). Remember, leasing is just an optimization.
Without leases, the correct/desired behavior is local leader read + quorum epoch check. As explained in the commit message, quorum reads are unnecessary - if the epoch is valid, the leader has enough info locally so let's not burden the followers with useless read I/O.
Leasing is then just an optimization that uses the same code path. If lease is trusted and valid, skip the quorum epoch check. Otherwise, perform as usual.
There was a problem hiding this comment.
Ok yes. It makes sense. The leader blocking thought was just a brain fart.
|
I'm done reviewing this code. It has some merge conflicts now and also a possibly unrelated test failure (corrupt_exchange_test) that we should retry again when the PR is re-made against develop after merging jdb-integrity. |
|
Awesome! I fixed an issue with the |
Merge integration branch for #41 Reviewed-by: engelsanchez
|
Merged to develop via integration pull-request #43 |
This pull-request builds on top of #37, so I've made the pull-request against that branch in order to keep things clean for review. After that branch merges and this review is complete, I'll open a dummy pull-request against
developfor the final Bors checking/mergeBy design,
riak_ensembleuses leader leases which enable the leader to reply to read requests without contacting other peers in the ensemble.This is similar to the design discussed in "Paxos Made Live" [1].
To safely use leader leases, a consensus protocol must:
The protocol used by
riak_ensemblesupports strong leaders, thus satisfying the first condition.The second condition is challenging, because there are very few guarantees one can make about system clocks; and even fewer that one can make between clocks on different machines.
This commit hardens the leader leasing logic in
riak_ensemble:trust_lease = falseto forceriak_ensembleto never trust leases and always perform quorum operations.[1] http://dl.acm.org/citation.cfm?id=1281103
[2] http://www.erlang.org/doc/apps/erts/time_correction.html
/cc basho/riak#536