Skip to content

RATIS-1866. Maintain leader lease after AppendEntries#898

Merged
SzyWilliam merged 5 commits intoapache:feature/leaderleasefrom
SzyWilliam:jira1866
Sep 22, 2023
Merged

RATIS-1866. Maintain leader lease after AppendEntries#898
SzyWilliam merged 5 commits intoapache:feature/leaderleasefrom
SzyWilliam:jira1866

Conversation

@SzyWilliam
Copy link
Member

@SzyWilliam SzyWilliam commented Aug 4, 2023

What is a Leader Lease

In Raft, the leader is responsible for processing and coordinating client requests, replicating data among other followers, and maintaining the distributed state machine.
Vanilla Raft requires the leader to obtain majority acknowledgements before serving every read requests. During normal operations, this prerequisite leads to unnecessary rpcs exchanged among the cluster, diminished read throughput and increased latency.
The leader lease is a concept that allows the leader to maintain its leadership without obtaining majority acknowledgements for a certain period of time (lease duration), during which it can directly serve client read requests.

How to extend lease during normal operations

Prerequisite

  • Suppose the CPU clocks are perfectly synchronized among cluster machines.
  • Let each AppendEntries request AE(i) contain a send time T(i)

Initialize

Once a leader is elected and its authority being comfirmed by majorities through successfully replicating its first no-op log, the leader gains the lease. The lease validity starts from T(0).

Renewal

As long as the leader continues to send heartbeats and receives acknowledgments from a majority of other nodes, it can renew its lease. Theoretically, if the most recent acknowledged heartbeat was sent at time T(n), the validity of the new lease commences at T(n).
In practice, rather than updating the lease with every heartbeat, we opt for a more efficient approach by lazily updating the leader's lease upon each query. Here's how it works:
At time T(n), when the leader is questioned about its authority, it first collects the send times of the last replied AppendEntries from each of its followers, denoted as TR(1), TR(2), ..., TR(2n), sorted in descending order.
Next, it selects the maximum timestamp at when the majority of followers are known to be active, that is, TR(n).
If TR(n) falls within the time range [T(n), T(n) + LeaseTimeoutDuration], then the lease can be successfully renewed.

Revoke

If the lease is expired and the leader cannot renew it, it loses the lease and stops serving read-only requests directly.

How to handle lease during configuration changes

During the configuration changes, the lease can only be renewed if acknowledgments be received by both the old group and the new group. It is the same to leader election restrictions during reconfiguration.

What to do when forced step down

When a leader is forced down, its lease should be effectively revoked.

How to handle CPU drifts

We can lower the ratio allowed for lease timeouts. If the CPU drifts are unbound, better not to use lease read :)

See https://issues.apache.org/jira/browse/RATIS-1866.

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SzyWilliam , thanks a lot for working on this! Please see the comments inlined.

@SzyWilliam
Copy link
Member Author

@szetszwo @OneSizeFitsQuorum Thanks a lot for this detailed review! I will address these issues a bit later ;)

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SzyWilliam , thanks a lot for working on this! I have some questions and comments inlined. The changing leader case is tricky.

Comment on lines 252 to 253
return Stream.concat(current.stream(),
Optional.ofNullable(old).map(List::stream).orElse(Stream.empty()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should deduplicate the peers.

@SzyWilliam
Copy link
Member Author

@szetszwo Thanks very much for the detailed review! I'll elaborate the leader changing process. (may be in next PR).


class LeaderLease {

private final long leaseTimeoutMs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make it static?

getFollower().updateLastRpcSendTime(request.getEntriesCount() == 0);
final AppendEntriesReplyProto r = getServerRpc().appendEntries(request);
getFollower().updateLastRpcResponseTime();
getFollower().updateLastRespondedAppendEntriesSendTime(sendTime);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why still update LastRespondedAppendEntriesSendTime at the time of sending?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getServerRpc().appendEntries(request) is a blocking operation, and once this call returns, the response for the current AppendEntries request has been received. Therefore, we can update its(LastRespondedAppendEntries)sendTime.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it~

@SzyWilliam
Copy link
Member Author

Made changes on code. @szetszwo @OneSizeFitsQuorum PTAL, thanks!

Copy link
Contributor

@OneSizeFitsQuorum OneSizeFitsQuorum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SzyWilliam , thanks for the update! The change looks good. Just a comment inlined.

Comment on lines 80 to 91
private Timestamp getMaxTimestampWithMajorityAck(List<FollowerInfo> peers) {
if (peers == null || peers.isEmpty()) {
return Timestamp.currentTime();
}

final List<Timestamp> lastRespondedAppendEntriesSendTimes = peers.stream()
.map(FollowerInfo::getLastRespondedAppendEntriesSendTime)
.sorted()
.collect(Collectors.toList());

return lastRespondedAppendEntriesSendTimes.get(lastRespondedAppendEntriesSendTimes.size() / 2);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the leader is not in the peer list, we should use (lastRespondedAppendEntriesSendTimes.size() - 1)/ 2:

  • 1 or 2 followers: use index 0
  • 3 or 4 followers: use index 1

Instead of creating a list, we may use limit and skip as below.

  private Timestamp getMaxTimestampWithMajorityAck(List<FollowerInfo> followers) {
    if (followers == null || followers.isEmpty()) {
      return Timestamp.currentTime();
    }

    final int mid = (followers.size() - 1) / 2;
    return followers.stream()
        .map(FollowerInfo::getLastRespondedAppendEntriesSendTime)
        .sorted()
        .limit(mid + 1)
        .skip(mid)
        .iterator()
        .next();
  }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • 1 or 2 followers: use index 0
  • 3 or 4 followers: use index 1

Oops, the timestamps are sorted in ascending order but not descending order. Then it should be

  • 1 follower: use index 0
  • 2 or 3 followers: use index 1
  • 4 or 5 followers: use index 2

You formula actually is correct!

final int mid = followers.size() / 2;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the reviews! Didn't know we can use limit and skip. Now the code is more light-weighted!

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 the change looks good.

@SzyWilliam SzyWilliam merged commit a13d81e into apache:feature/leaderlease Sep 22, 2023
@SzyWilliam
Copy link
Member Author

@szetszwo @OneSizeFitsQuorum Thanks a lot for your careful and thorough reviews!

RexXiong pushed a commit to apache/celeborn that referenced this pull request May 30, 2024
### What changes were proposed in this pull request?

Bump Ratis version from 2.5.1 to 3.0.1. Address incompatible changes:

- RATIS-589. Eliminate buffer copying in SegmentedRaftLogOutputStream.(apache/ratis#964)
- RATIS-1677. Do not auto format RaftStorage in RECOVER.(apache/ratis#718)
- RATIS-1710. Refactor metrics api and implementation to separated modules. (apache/ratis#749)

### Why are the changes needed?

Bump Ratis version from 2.5.1 to 3.0.1. Ratis has released v3.0.0, v3.0.1, which release note refers to [3.0.0](https://ratis.apache.org/post/3.0.0.html), [3.0.1](https://ratis.apache.org/post/3.0.1.html). The 3.0.x version include new features like pluggable metrics and lease read, etc, some improvements and bugfixes including:

- 3.0.0: Change list of ratis 3.0.0 In total, there are roughly 100 commits diffing from 2.5.1 including:
   - Incompatible Changes
      - RaftStorage Auto-Format
      - RATIS-1677. Do not auto format RaftStorage in RECOVER. (apache/ratis#718)
      - RATIS-1694. Fix the compatibility issue of RATIS-1677. (apache/ratis#731)
      - RATIS-1871. Auto format RaftStorage when there is only one directory configured. (apache/ratis#903)
      - Pluggable Ratis-Metrics (RATIS-1688)
      - RATIS-1689. Remove the use of the thirdparty Gauge. (apache/ratis#728)
      - RATIS-1692. Remove the use of the thirdparty Counter. (apache/ratis#732)
      - RATIS-1693. Remove the use of the thirdparty Timer. (apache/ratis#734)
      - RATIS-1703. Move MetricsReporting and JvmMetrics to impl. (apache/ratis#741)
      - RATIS-1704. Fix SuppressWarnings(“VisibilityModifier”) in RatisMetrics. (apache/ratis#742)
      - RATIS-1710. Refactor metrics api and implementation to separated modules. (apache/ratis#749)
      - RATIS-1712. Add a dropwizard 3 implementation of ratis-metrics-api. (apache/ratis#751)
      - RATIS-1391. Update library dropwizard.metrics version to 4.x (apache/ratis#632)
      - RATIS-1601. Use the shaded dropwizard metrics and remove the dependency (apache/ratis#671)
      - Streaming Protocol Change
      - RATIS-1569. Move the asyncRpcApi.sendForward(..) call to the client side. (apache/ratis#635)
   - New Features
      - Leader Lease (RATIS-1864)
      - RATIS-1865. Add leader lease bound ratio configuration (apache/ratis#897)
      - RATIS-1866. Maintain leader lease after AppendEntries (apache/ratis#898)
      - RATIS-1894. Implement ReadOnly based on leader lease (apache/ratis#925)
      - RATIS-1882. Support read-after-write consistency (apache/ratis#913)
      - StateMachine API
      - RATIS-1874. Add notifyLeaderReady function in IStateMachine (apache/ratis#906)
      - RATIS-1897. Make TransactionContext available in DataApi.write(..). (apache/ratis#930)
      - New Configuration Properties
      - RATIS-1862. Add the parameter whether to take Snapshot when stopping to adapt to different services (apache/ratis#896)
      - RATIS-1930. Add a conf for enable/disable majority-add. (apache/ratis#961)
      - RATIS-1918. Introduces parameters that separately control the shutdown of RaftServerProxy by JVMPauseMonitor. (apache/ratis#950)
      - RATIS-1636. Support re-config ratis properties (apache/ratis#800)
      - RATIS-1860. Add ratis-shell cmd to generate a new raft-meta.conf. (apache/ratis#901)
   - Improvements & Bug Fixes
      - Netty
         - RATIS-1898. Netty should use EpollEventLoopGroup by default (apache/ratis#931)
         - RATIS-1899. Use EpollEventLoopGroup for Netty Proxies (apache/ratis#932)
         - RATIS-1921. Shared worker group in WorkerGroupGetter should be closed. (apache/ratis#955)
         - RATIS-1923. Netty: atomic operations require side-effect-free functions. (apache/ratis#956)
      - RaftServer
         - RATIS-1924. Increase the default of raft.server.log.segment.size.max. (apache/ratis#957)
         - RATIS-1892. Unify the lifetime of the RaftServerProxy thread pool (apache/ratis#923)
         - RATIS-1889. NoSuchMethodError: RaftServerMetricsImpl.addNumPendingRequestsGauge apache/ratis#922 (apache/ratis#922)
         - RATIS-761. Handle writeStateMachineData failure in leader. (apache/ratis#927)
         - RATIS-1902. The snapshot index is set incorrectly in InstallSnapshotReplyProto. (apache/ratis#933)
         - RATIS-1912. Fix infinity election when perform membership change. (apache/ratis#954)
         - RATIS-1858. Follower keeps logging first election timeout. (apache/ratis#894)

- 3.0.1:This is a bugfix release. See the [changes between 3.0.0 and 3.0.1](apache/ratis@ratis-3.0.0...ratis-3.0.1) releases.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Cluster manual test.

Closes #2480 from SteNicholas/CELEBORN-1400.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants