RATIS-1866. Maintain leader lease after AppendEntries #898
SzyWilliam merged 5 commits into apache:feature/leaderlease
szetszwo left a comment
@SzyWilliam, thanks a lot for working on this! Please see the comments inlined.
@szetszwo @OneSizeFitsQuorum Thanks a lot for this detailed review! I will address these issues a bit later ;)
szetszwo left a comment
@SzyWilliam, thanks a lot for working on this! I have some questions and comments inlined. The changing leader case is tricky.
    return Stream.concat(current.stream(),
        Optional.ofNullable(old).map(List::stream).orElse(Stream.empty()));
We should deduplicate the peers.
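One possible way to do this, as a minimal sketch rather than the code that was merged: `Stream.distinct()` drops repeated elements while keeping the first occurrence. The class name `PeerMergeSketch` and the use of plain `String` ids in place of `FollowerInfo` objects are assumptions made for illustration; with real `FollowerInfo` objects, `distinct()` only helps if equality is defined in terms of the peer id.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch: merge the current and old peer lists without duplicates. */
class PeerMergeSketch {
  /** Peers are plain ids here; the real code works on FollowerInfo objects. */
  static List<String> mergePeers(List<String> current, List<String> old) {
    return Stream.concat(
            current.stream(),
            // old is null when no configuration change is in progress
            Optional.ofNullable(old).map(List::stream).orElse(Stream.empty()))
        .distinct() // keeps the first occurrence and preserves encounter order
        .collect(Collectors.toList());
  }
}
```

A peer that belongs to both the old and the new configuration then appears only once in the merged list.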
@szetszwo Thanks very much for the detailed review! I'll elaborate on the leader-changing process (maybe in the next PR).
    class LeaderLease {

      private final long leaseTimeoutMs;

    getFollower().updateLastRpcSendTime(request.getEntriesCount() == 0);
    final AppendEntriesReplyProto r = getServerRpc().appendEntries(request);
    getFollower().updateLastRpcResponseTime();
    getFollower().updateLastRespondedAppendEntriesSendTime(sendTime);
Why still update LastRespondedAppendEntriesSendTime at the time of sending?
getServerRpc().appendEntries(request) is a blocking operation: once this call returns, the response to the current AppendEntries request has been received. Therefore, we can update LastRespondedAppendEntriesSendTime with the send time of that request.
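The reasoning in this reply can be sketched as follows. This is an illustration under assumptions, not the Ratis code: `BlockingAppendSketch`, `sendBlocking` and the injected `clock` are hypothetical names, and the RPC is modelled as a `Supplier` that returns only once the reply has arrived.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical sketch: record the send time of the last request that got a reply. */
class BlockingAppendSketch {
  private final LongSupplier clock; // assumption: an injectable clock, e.g. System::currentTimeMillis
  private volatile long lastRespondedSendTime = Long.MIN_VALUE;

  BlockingAppendSketch(LongSupplier clock) {
    this.clock = clock;
  }

  /** Runs a blocking RPC and, once it returns, records when that request was sent. */
  <R> R sendBlocking(Supplier<R> blockingRpc) {
    final long sendTime = clock.getAsLong(); // captured before the request goes out
    final R reply = blockingRpc.get();       // returns only after the reply has arrived
    // The reply proves the follower answered a request sent at sendTime,
    // so the send time (not the receive time) is what gets recorded.
    lastRespondedSendTime = sendTime;
    return reply;
  }

  long getLastRespondedSendTime() {
    return lastRespondedSendTime;
  }
}
```

If the blocking call throws, the recorded send time is left untouched, so an unanswered request never advances it.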
Made changes to the code. @szetszwo @OneSizeFitsQuorum PTAL, thanks!
szetszwo left a comment
@SzyWilliam, thanks for the update! The change looks good. Just a comment inlined.
    private Timestamp getMaxTimestampWithMajorityAck(List<FollowerInfo> peers) {
      if (peers == null || peers.isEmpty()) {
        return Timestamp.currentTime();
      }

      final List<Timestamp> lastRespondedAppendEntriesSendTimes = peers.stream()
          .map(FollowerInfo::getLastRespondedAppendEntriesSendTime)
          .sorted()
          .collect(Collectors.toList());

      return lastRespondedAppendEntriesSendTimes.get(lastRespondedAppendEntriesSendTimes.size() / 2);
    }
Since the leader is not in the peer list, we should use (lastRespondedAppendEntriesSendTimes.size() - 1) / 2:
- 1 or 2 followers: use index 0
- 3 or 4 followers: use index 1
Instead of creating a list, we may use limit and skip as below.
    private Timestamp getMaxTimestampWithMajorityAck(List<FollowerInfo> followers) {
      if (followers == null || followers.isEmpty()) {
        return Timestamp.currentTime();
      }
      final int mid = (followers.size() - 1) / 2;
      return followers.stream()
          .map(FollowerInfo::getLastRespondedAppendEntriesSendTime)
          .sorted()
          .limit(mid + 1)
          .skip(mid)
          .iterator()
          .next();
    }

- 1 or 2 followers: use index 0
- 3 or 4 followers: use index 1
Oops, the timestamps are sorted in ascending order, not descending order. Then it should be
- 1 follower: use index 0
- 2 or 3 followers: use index 1
- 4 or 5 followers: use index 2
Your formula is actually correct!
    final int mid = followers.size() / 2;

Thanks a lot for the reviews! Didn't know we could use limit and skip. Now the code is more lightweight!
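To check the index table above against concrete numbers, here is a minimal sketch of the limit/skip selection with `mid = size / 2`. It is an illustration under assumptions: plain `long` values stand in for `Timestamp`, and the `now` parameter replaces `Timestamp.currentTime()` for the case with no followers.

```java
import java.util.List;

/** Hypothetical sketch of picking the send time acknowledged by a majority. */
class MajorityAckSketch {
  /**
   * followerSendTimes holds, per follower, the send time of the last
   * AppendEntries it replied to. The leader is not in the list but counts
   * towards the majority.
   */
  static long maxTimestampWithMajorityAck(List<Long> followerSendTimes, long now) {
    if (followerSendTimes == null || followerSendTimes.isEmpty()) {
      return now; // no followers: the leader alone is the majority
    }
    final int mid = followerSendTimes.size() / 2;
    return followerSendTimes.stream()
        .mapToLong(Long::longValue)
        .sorted()        // ascending order
        .limit(mid + 1)  // keep indices 0..mid
        .skip(mid)       // drop indices 0..mid-1, leaving only index mid
        .findFirst()
        .getAsLong();
  }
}
```

With four followers whose send times are 10, 20, 30 and 40, index 2 is selected, that is 30: the followers at 30 and 40 plus the leader form 3 of 5 nodes.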
szetszwo left a comment
+1 the change looks good.
@szetszwo @OneSizeFitsQuorum Thanks a lot for your careful and thorough reviews!
### What changes were proposed in this pull request?

Bump Ratis version from 2.5.1 to 3.0.1. Address incompatible changes:

- RATIS-589. Eliminate buffer copying in SegmentedRaftLogOutputStream. (apache/ratis#964)
- RATIS-1677. Do not auto format RaftStorage in RECOVER. (apache/ratis#718)
- RATIS-1710. Refactor metrics api and implementation to separated modules. (apache/ratis#749)

### Why are the changes needed?

Bump Ratis version from 2.5.1 to 3.0.1. Ratis has released v3.0.0, v3.0.1, which release note refers to [3.0.0](https://ratis.apache.org/post/3.0.0.html), [3.0.1](https://ratis.apache.org/post/3.0.1.html). The 3.0.x version include new features like pluggable metrics and lease read, etc, some improvements and bugfixes including:

- 3.0.0: Change list of ratis 3.0.0
  In total, there are roughly 100 commits diffing from 2.5.1 including:
  - Incompatible Changes
    - RaftStorage Auto-Format
      - RATIS-1677. Do not auto format RaftStorage in RECOVER. (apache/ratis#718)
      - RATIS-1694. Fix the compatibility issue of RATIS-1677. (apache/ratis#731)
      - RATIS-1871. Auto format RaftStorage when there is only one directory configured. (apache/ratis#903)
    - Pluggable Ratis-Metrics (RATIS-1688)
      - RATIS-1689. Remove the use of the thirdparty Gauge. (apache/ratis#728)
      - RATIS-1692. Remove the use of the thirdparty Counter. (apache/ratis#732)
      - RATIS-1693. Remove the use of the thirdparty Timer. (apache/ratis#734)
      - RATIS-1703. Move MetricsReporting and JvmMetrics to impl. (apache/ratis#741)
      - RATIS-1704. Fix SuppressWarnings(“VisibilityModifier”) in RatisMetrics. (apache/ratis#742)
      - RATIS-1710. Refactor metrics api and implementation to separated modules. (apache/ratis#749)
      - RATIS-1712. Add a dropwizard 3 implementation of ratis-metrics-api. (apache/ratis#751)
      - RATIS-1391. Update library dropwizard.metrics version to 4.x (apache/ratis#632)
      - RATIS-1601. Use the shaded dropwizard metrics and remove the dependency (apache/ratis#671)
    - Streaming Protocol Change
      - RATIS-1569. Move the asyncRpcApi.sendForward(..) call to the client side. (apache/ratis#635)
  - New Features
    - Leader Lease (RATIS-1864)
      - RATIS-1865. Add leader lease bound ratio configuration (apache/ratis#897)
      - RATIS-1866. Maintain leader lease after AppendEntries (apache/ratis#898)
      - RATIS-1894. Implement ReadOnly based on leader lease (apache/ratis#925)
      - RATIS-1882. Support read-after-write consistency (apache/ratis#913)
    - StateMachine API
      - RATIS-1874. Add notifyLeaderReady function in IStateMachine (apache/ratis#906)
      - RATIS-1897. Make TransactionContext available in DataApi.write(..). (apache/ratis#930)
    - New Configuration Properties
      - RATIS-1862. Add the parameter whether to take Snapshot when stopping to adapt to different services (apache/ratis#896)
      - RATIS-1930. Add a conf for enable/disable majority-add. (apache/ratis#961)
      - RATIS-1918. Introduces parameters that separately control the shutdown of RaftServerProxy by JVMPauseMonitor. (apache/ratis#950)
      - RATIS-1636. Support re-config ratis properties (apache/ratis#800)
      - RATIS-1860. Add ratis-shell cmd to generate a new raft-meta.conf. (apache/ratis#901)
  - Improvements & Bug Fixes
    - Netty
      - RATIS-1898. Netty should use EpollEventLoopGroup by default (apache/ratis#931)
      - RATIS-1899. Use EpollEventLoopGroup for Netty Proxies (apache/ratis#932)
      - RATIS-1921. Shared worker group in WorkerGroupGetter should be closed. (apache/ratis#955)
      - RATIS-1923. Netty: atomic operations require side-effect-free functions. (apache/ratis#956)
    - RaftServer
      - RATIS-1924. Increase the default of raft.server.log.segment.size.max. (apache/ratis#957)
      - RATIS-1892. Unify the lifetime of the RaftServerProxy thread pool (apache/ratis#923)
      - RATIS-1889. NoSuchMethodError: RaftServerMetricsImpl.addNumPendingRequestsGauge apache/ratis#922 (apache/ratis#922)
      - RATIS-761. Handle writeStateMachineData failure in leader. (apache/ratis#927)
      - RATIS-1902. The snapshot index is set incorrectly in InstallSnapshotReplyProto. (apache/ratis#933)
      - RATIS-1912. Fix infinity election when perform membership change. (apache/ratis#954)
      - RATIS-1858. Follower keeps logging first election timeout. (apache/ratis#894)
- 3.0.1: This is a bugfix release. See the [changes between 3.0.0 and 3.0.1](apache/ratis@ratis-3.0.0...ratis-3.0.1) releases.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Cluster manual test.

Closes #2480 from SteNicholas/CELEBORN-1400.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>

What is a Leader Lease
In Raft, the leader is responsible for processing and coordinating client requests, replicating data among other followers, and maintaining the distributed state machine.
Vanilla Raft requires the leader to obtain majority acknowledgements before serving every read request. During normal operations, this prerequisite leads to unnecessary RPCs exchanged within the cluster, diminished read throughput, and increased latency.
The leader lease is a concept that allows the leader to maintain its leadership without obtaining majority acknowledgements for a certain period of time (lease duration), during which it can directly serve client read requests.
How to extend lease during normal operations
Prerequisite
Initialize
Once a leader is elected and its authority is confirmed by a majority through successfully replicating its first no-op log entry, the leader gains the lease. The lease validity starts from T(0).
Renewal
As long as the leader continues to send heartbeats and receives acknowledgments from a majority of other nodes, it can renew its lease. Theoretically, if the most recent acknowledged heartbeat was sent at time T(n), the validity of the new lease commences at T(n).
In practice, rather than updating the lease with every heartbeat, we opt for a more efficient approach by lazily updating the leader's lease upon each query. Here's how it works:
At time T(now), when the leader is asked about its authority, it first collects, from each of its 2n followers, the send time of the last AppendEntries that the follower replied to. These are denoted as TR(1), TR(2), ..., TR(2n), sorted in descending order.
Next, it selects the latest timestamp that a majority of the group (the leader itself plus n followers) is known to have acknowledged, that is, TR(n).
If T(now) falls within the time range [TR(n), TR(n) + LeaseTimeoutDuration], then the lease is successfully renewed, with its validity starting from TR(n).
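The lazy renewal steps above can be sketched as follows. This is a simplified illustration under assumptions, not the Ratis `LeaderLease` class: `LazyLeaseSketch` and `checkAndRenew` are hypothetical names, times are plain milliseconds, and the follower send times are passed in by the caller. The list is sorted in ascending order here, so index `size / 2` corresponds to TR(n) in the descending notation of the text.

```java
import java.util.List;

/** Hypothetical sketch of a lazily renewed leader lease; all times are in milliseconds. */
class LazyLeaseSketch {
  private final long leaseTimeoutMs;
  private long leaseStartMs; // TR(n) of the most recent renewal

  LazyLeaseSketch(long leaseTimeoutMs, long initialLeaseStartMs) {
    this.leaseTimeoutMs = leaseTimeoutMs;
    this.leaseStartMs = initialLeaseStartMs;
  }

  /** The lease is valid while less than leaseTimeoutMs has elapsed since its start. */
  synchronized boolean isValid(long nowMs) {
    return nowMs - leaseStartMs < leaseTimeoutMs;
  }

  /** Called on a query: try to renew only when the current lease has run out. */
  synchronized boolean checkAndRenew(long nowMs, List<Long> followerSendTimes) {
    if (!isValid(nowMs) && !followerSendTimes.isEmpty()) {
      final long[] sorted =
          followerSendTimes.stream().mapToLong(Long::longValue).sorted().toArray();
      final long majorityAck = sorted[sorted.length / 2]; // ascending order
      leaseStartMs = Math.max(leaseStartMs, majorityAck); // never move the lease backwards
    }
    return isValid(nowMs); // false means the lease could not be renewed
  }
}
```

With a 100 ms timeout, a query at time 150 with follower send times 120 and 90 renews the lease from 120. A query at time 300 with the same send times cannot renew it, which is the revoke case.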
Revoke
If the lease has expired and the leader cannot renew it, the leader loses the lease and stops serving read-only requests directly.
How to handle lease during configuration changes
During configuration changes, the lease can only be renewed if acknowledgments are received from both the old group and the new group. This is the same restriction that applies to leader election during reconfiguration.
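One way to express this rule, as a sketch under assumptions: compute the majority-acknowledged send time for each group separately and start the lease from the earlier of the two, so that both groups have acknowledged it. The names `JointLeaseSketch`, `majorityAck` and `jointLeaseStart` are hypothetical, times are plain milliseconds, and the sketch assumes the leader is a member of both groups.

```java
import java.util.List;

/** Hypothetical sketch: lease start time while an old and a new group coexist. */
class JointLeaseSketch {
  /** Send time acknowledged by a majority of one group (leader counted implicitly). */
  static long majorityAck(List<Long> followerSendTimes) {
    if (followerSendTimes.isEmpty()) {
      throw new IllegalArgumentException("no followers");
    }
    return followerSendTimes.stream()
        .mapToLong(Long::longValue)
        .sorted() // ascending order
        .skip(followerSendTimes.size() / 2)
        .findFirst()
        .getAsLong();
  }

  /** The lease may only start at a time that both groups have acknowledged. */
  static long jointLeaseStart(List<Long> currentGroup, List<Long> oldGroup) {
    final long current = majorityAck(currentGroup);
    if (oldGroup == null || oldGroup.isEmpty()) {
      return current; // no configuration change in progress
    }
    return Math.min(current, majorityAck(oldGroup));
  }
}
```

Taking the minimum is the conservative choice: the lease never starts later than what the slower of the two majorities has confirmed.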
What to do when forced step down
When a leader is forced to step down, its lease should be effectively revoked.
How to handle CPU drifts
We can lower the ratio allowed for lease timeouts. If the CPU drifts are unbounded, it is better not to use lease read :)
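As a rough illustration of what lowering the ratio does, the sketch below derives the lease timeout as a fraction of the minimum election timeout. That relation, the class name `LeaseBoundSketch` and the parameter names are assumptions made for this example; the actual configuration keys are not shown on this page.

```java
/** Hypothetical sketch: derive the lease timeout from a bound ratio. */
class LeaseBoundSketch {
  /**
   * Assumption: the lease timeout is the minimum election timeout scaled by
   * a ratio in (0, 1]. A smaller ratio leaves more slack for clock drift.
   */
  static long leaseTimeoutMs(long electionTimeoutMinMs, double ratio) {
    if (ratio <= 0 || ratio > 1) {
      throw new IllegalArgumentException("ratio must be in (0, 1]: " + ratio);
    }
    // Truncation rounds down, which errs on the side of a shorter lease.
    return (long) (electionTimeoutMinMs * ratio);
  }
}
```

For example, a 1000 ms minimum election timeout with a ratio of 0.5 gives a 500 ms lease, leaving the other 500 ms as a margin against drift.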
See https://issues.apache.org/jira/browse/RATIS-1866.