Skip to content

Fix numerous issues with leader leases#41

Closed
jtuple wants to merge 2 commits intojdb-integrityfrom
jdb-lease
Closed

Fix numerous issues with leader leases#41
jtuple wants to merge 2 commits intojdb-integrityfrom
jdb-lease

Conversation

@jtuple
Copy link
Copy Markdown
Contributor

@jtuple jtuple commented Jul 10, 2014

This pull-request builds on top of #37, so I've made the pull-request against that branch in order to keep things clean for review. After that branch merges and this review is complete, I'll open a dummy pull-request against develop for the final Bors checking/merge


By design, riak_ensemble uses leader leases which enable the leader to reply to read requests without contacting other peers in the ensemble.

This is similar to the design discussed in "Paxos Made Live" [1].

To safely use leader leases, a consensus protocol must:

  1. Use strong leaders. A new leader should never be able to be elected while followers are supporting an existing leader.
  2. Ensure that no peers (current followers or otherwise) can be become a leader before the current leader's lease expires.

The protocol used by riak_ensemble supports strong leaders, thus satisfying the first condition.

The second condition is challenging, because there are very few guarantees one can make about system clocks; and even fewer that one can make between clocks on different machines.

This commit hardens the leader leasing logic in riak_ensemble:

  • Leases are now explicitly tracked (rather than just implicitly tracked as part of the leader tick/step_down transition).
  • Worker processes now check that a lease is still valid after performing a read (rather than just before). This ensures that slow reads that complete after a lease expires are rejected.
  • The new lease tracking logic verifies a lease against both Erlang corrected time [2] as well as against the OS monotonic clock. This is the strongest guarantee we can provide without special clock hardware (eg. GPS clocks).
  • Users can now set trust_lease = false to force riak_ensemble to never trust leases and always perform quorum operations.
  • A new quorum operation was added (check_epoch) that solely verifies the legitimacy of the current leader. Previously, not trusting the lease performed a full quorum read which is unnecessary. There is no reason not to trust the leader's local data if the leader is legitimate.

[1] http://dl.acm.org/citation.cfm?id=1281103
[2] http://www.erlang.org/doc/apps/erts/time_correction.html

/cc basho/riak#536

By design, riak_ensemble uses leader leases which enable the leader to
reply to read requests without contacting other peers in the ensemble.

This is similar to the design discussed in "Paxos Made Live" [1].

To safely use leader leases, a consensus protocol must:

1. Use strong leaders. A new leader should never be able to be
   elected while followers are supporting an existing leader.

2. Ensure that no peers (current followers or otherwise) can be
   become a leader before the current leader's lease expires.

The protocol used by riak_ensemble supports strong leaders, thus
satisfying the first condition.

The second condition is challenging, because there are very few
guarantees one can make about system clocks; and even fewer that
one can make between clocks on different machines.

This commit hardens the leader leasing logic in riak_ensemble:

-- Leases are now explicitly tracked (rather than just implicitly
   tracked as part of the leader tick/step_down transition).

-- Worker processes now check that a lease is still valid after
   performing a read (rather than just before). This ensures that
   slow reads that complete after a lease expires are rejected.

-- The new lease tracking logic verifies a lease against both
   Erlang corrected time [2] (using receive timeouts) as well as
   against the OS monotonic clock. This is the strongest guarantee
   we can provide without special clock hardware (eg. GPS clocks).

-- Users can now set 'trust_lease = false' to force riak_ensemble
   to never trust leases and always perform quorum operations.

-- A new quorum operation was added (check_epoch) that solely verifies
   the legitimacy of the current leader. Previously, not trusting the
   lease performed a full quorum read which is unnecessary. There is
   no reason not to trust the leader's local data if the leader is
   legitimate.

[1] http://dl.acm.org/citation.cfm?id=1281103
[2] http://www.erlang.org/doc/apps/erts/time_correction.html
@jtuple jtuple added this to the 2.0-RC milestone Jul 10, 2014
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Little typo here in case you modify this file again.

@engelsanchez
Copy link
Copy Markdown
Contributor

I agree that checking the lease after a read is more reliable than what we had before, and making the lease handling more explicit makes sense. I'm not sure what we are buying by having (and maintaining) our own monotonic clock NIFs since Erlang already handles correcting for time on receives.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can improve on this by doing integer divide on timebase_info.denom and also calculating the remainder, then multiplying separately by the numerator. Then you can separately compute and add (time div denom) * numer + (remainder * number div denom). The term with the remainder is less likely to overflow and can be checked separately.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is copied from the linked Apple technical note (https://developer.apple.com/library/mac/qa/qa1398/_index.html). As used by riak_ensemble, integer overflow is a non-issue. Is it really worth it to worry about this at all?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it's just an OSX thing, but it does look sloppy. I'm fine with improving later if an issue arises. If you do, numer and denom are 32 bit unsigned values, so the remainder term above would never overflow if you take care of casting to 64 bits first. Apparently this is mostly just 1 these days (numer == denom) according to the source in the boost chrono library.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, for future reference, this is how I think it should be done to make it more robust to overflow:

timeNano =  (time / timebase_info.denom * timebase_info.numer) + 
     (time % timebase_info.denom) * timebase_info.numer / timebase_info.denom;

@jtuple
Copy link
Copy Markdown
Contributor Author

jtuple commented Jul 10, 2014

Concerning the comment about the benefit of checking the monotonic time:

Checking the monotonic time buys us additional safety. Erlang time correction is a blackbox with various implementations on different platforms that have unknown guarantees. I'd rather not manually verify the time correction implementation in BEAM on multiple platforms to see how strong the guarantees actually are.

Does Erlang immediately correct for a clock jump, or is there a delay before timer correction kicks in? Eg. will a subset of timers actually be delivered too early/late before we fix the situation? Unclear.

Likewise, the Erlang timer correction is trying to track real time. The Erlang clock may run slower or faster to skew towards real time. Having the leader move faster than the followers if fine, but having the leader move slower could make the leader think a lease extends longer than it should. In practice, this shouldn't be an issue since Erlang only skews by 1% and our defaults set the follower timeout to 4x the lease time. But, nonetheless something else to consider.

By requiring both Erlang corrected time and the direct monotonic clocks to agree, we have stronger guarantees and are less susceptible to either approach being wrong/buggy. The goal is to be at least as good as other systems (that just use monotonic clocks), if not better (because we can also exploit Erlang's independent time correction logic).

In the end, the lease checking code is an optimization. Either both Erlang corrected time and the direct monotonic clock agree that a lease is still valid, or they don't and we force a quorum lease check. Adding the monotonic clock check can't make things worse, and (hopefully) makes things better.

Move OS X timebase_info initialization in riak_ensemble_clock.c into
the NIF initialization function to avoid issues with multiple threads.

Correct several comment typos.

Rename variables in the riak_ensemble_peer check_epoch code to make
the code more clear.
@jtuple
Copy link
Copy Markdown
Contributor Author

jtuple commented Jul 11, 2014

Pushed a commit that should address all review comments made thus far.

@engelsanchez engelsanchez self-assigned this Jul 11, 2014
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I don't understand why we don't simply give up if the lease is up. At this point things might be ugly, something is preventing the lease from being acquired, so maybe the leader shouldn't try to block hoping to verify it's still boss. You are more likely to be in the middle of an election here, etc. Maybe just let it fall back to the get_latest_obj (normal quorum get) path?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The leader dosen't block, this is being performed in an async FSM worker.

This code path is used both when using leases and when not (eg. user set trust lease to false). Remember, leasing is just an optimization.

Without leases, the correct/desired behavior is local leader read + quorum epoch check. As explained in the commit message, quorum reads are unnecessary - if the epoch is valid, the leader has enough info locally so let's not burden the followers with useless read I/O.

Leasing is then just an optimization that uses the same code path. If lease is trusted and valid, skip the quorum epoch check. Otherwise, perform as usual.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok yes. It makes sense. The leader blocking thought was just a brain fart.

@engelsanchez
Copy link
Copy Markdown
Contributor

I'm done reviewing this code. It has some merge conflicts now and also a possibly unrelated test failure (corrupt_exchange_test) that we should retry again when the PR is re-made against develop after merging jdb-integrity.

@jtuple
Copy link
Copy Markdown
Contributor Author

jtuple commented Jul 11, 2014

Awesome! I fixed an issue with the corrupt_exchange_test on the jdb-integrity branch so hopefully all is well after the merge. I'll handle the merge after that pull-request finishes up.

jtuple added a commit that referenced this pull request Jul 12, 2014
borshop added a commit that referenced this pull request Jul 12, 2014
Merge integration branch for #41

Reviewed-by: engelsanchez
@jtuple
Copy link
Copy Markdown
Contributor Author

jtuple commented Jul 12, 2014

Merged to develop via integration pull-request #43

@jtuple jtuple closed this Jul 12, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants