HDDS-4063. Fix InstallSnapshot in OM HA #1294
bharatviswa504
left a comment
+1 LGTM.
Tested it on an OM HA cluster; installSnapshot now works when logs are missing.
Scenario tried:
- Stopped one of the follower OMs.
- Ran freon.
- Deleted logs from the leader and the other follower.
Restarted the stopped OM and confirmed that its transaction info is up to date.
Log Snippet:
2020-08-05 23:30:35,660 INFO org.apache.ratis.server.impl.RaftServerImpl: om2@group-9F198C4C3682: inconsistency entries. Reply:om3<-om2#1:FAIL,INCONSISTENCY,nextIndex:21994,term:31,followerCommit:21992
2020-08-05 23:30:35,675 INFO org.apache.ratis.server.impl.RaftServerImpl: om2@group-9F198C4C3682: receive installSnapshot: om3->om2#0-t31,notify:(t:28, i:45162)
2020-08-05 23:30:35,675 DEBUG org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Latest Snapshot Info 27#21988
2020-08-05 23:30:35,675 INFO org.apache.ratis.server.impl.RaftServerImpl: om2@group-9F198C4C3682: notifyInstallSnapshot: nextIndex is 21994 but the leader's first available index is 45162.
2020-08-05 23:30:35,676 INFO org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Received install snapshot notification from OM leader: om3 with term index: (t:28, i:45162)
2020-08-05 23:30:35,677 INFO org.apache.hadoop.ozone.om.OzoneManager: Downloading checkpoint from leader OM om3 and reloading state from the checkpoint.
2020-08-05 23:30:35,677 INFO org.apache.hadoop.ozone.om.snapshot.OzoneManagerSnapshotProvider: Downloading latest checkpoint from Leader OM om3. Checkpoint URL: https://bv-oz-3.bv-oz.root.hwx.site:9875/dbCheckpoint?flushBeforeCheckpoint=true
2020-08-05 23:30:35,680 INFO org.apache.ratis.server.impl.RaftServerImpl: om2@group-9F198C4C3682: reply installSnapshot: om3<-om2#0:FAIL-t31,IN_PROGRESS
2020-08-05 23:30:35,681 INFO org.apache.ratis.grpc.server.GrpcServerProtocolService: om2: Completed INSTALL_SNAPSHOT, lastRequest: om3->om2#0-t31,notify:(t:28, i:45162)
2020-08-05 23:30:35,731 INFO org.apache.ratis.server.impl.RaftServerImpl: om2@group-9F198C4C3682: receive installSnapshot: om3->om2#0-t31,notify:(t:28, i:45162)
2020-08-05 23:30:35,731 INFO org.apache.ratis.server.impl.RaftServerImpl: om2@group-9F198C4C3682: Snapshot Installation by StateMachine is in progress.
2020-08-05 23:30:35,731 INFO org.apache.ratis.server.impl.RaftServerImpl: om2@group-9F198C4C3682: reply installSnapshot: om3<-om2#0:FAIL-t31,IN_PROGRESS
2020-08-05 23:30:35,731 INFO org.apache.ratis.grpc.server.GrpcServerProtocolService: om2: Completed INSTALL_SNAPSHOT, lastRequest: om3->om2#0-t31,notify:(t:28, i:45162)
2020-08-05 23:30:35,733 INFO org.apache.ratis.server.impl.RaftServerImpl: om2@group-9F198C4C3682: receive installSnapshot: om3->om2#0-t31,notify:(t:28, i:45162)
2020-08-05 23:30:35,733 INFO org.apache.ratis.server.impl.RaftServerImpl: om2@group-9F198C4C3682: Snapshot Installation by StateMachine is in progress.
2020-08-05 23:30:35,733 INFO org.apache.ratis.server.impl.RaftServerImpl: om2@group-9F198C4C3682: reply installSnapshot: om3<-om2#0:FAIL-t31,IN_PROGRESS
2020-08-05 23:30:35,734 INFO org.apache.ratis.grpc.server.GrpcServerProtocolService: om2: Completed INSTALL_SNAPSHOT, lastRequest: om3->om2#0-t31,notify:(t:28, i:45162)
2020-08-05 23:30:36,087 INFO org.apache.hadoop.ozone.om.snapshot.OzoneManagerSnapshotProvider: Successfully downloaded latest checkpoint from leader OM: om3
2020-08-05 23:30:36,089 INFO org.apache.hadoop.ozone.om.OzoneManager: Downloaded checkpoint from Leader om3 to the location /var/lib/hadoop-ozone/om/ratis/snapshot/om.db-om3-1596670235677
2020-08-05 23:30:36,175 INFO org.apache.hadoop.ozone.om.OzoneManager: Installing checkpoint with OMTransactionInfo org.apache.hadoop.ozone.om.ratis.OMTransactionInfo@e19e
Thank you @hanishakoneru for the contribution.
(cherry picked from commit cc5901f)
What changes were proposed in this pull request?
OzoneManagerStateMachine#notifyInstallSnapshotFromLeader() checks the incoming roleInfoProto and proceeds with the install snapshot request only if the role is Leader. This check is wrong: the roleInfoProto contains the receiving node's own ID and role, not the leader's.
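To illustrate why the old check could never pass on a follower, here is a minimal sketch. Everything in it is a hypothetical stand-in: `RoleInfo`, `Role`, `shouldInstallOld` and `shouldInstallAlternative` are not the Ratis or Ozone classes, and the alternative check is only one plausible correction, not necessarily what this patch does.

```java
// Hypothetical model of the check; NOT the real Ratis/Ozone API.
public class InstallSnapshotCheckSketch {

    enum Role { LEADER, FOLLOWER, CANDIDATE }

    /** Simplified role info: it describes the node that RECEIVES the notification. */
    static final class RoleInfo {
        final String selfId;
        final Role role;

        RoleInfo(String selfId, Role role) {
            this.selfId = selfId;
            this.role = role;
        }
    }

    /** The check as described above: proceed only if the role info says LEADER. */
    static boolean shouldInstallOld(RoleInfo info) {
        // On a follower this is always false, so the snapshot is never installed.
        return info.role == Role.LEADER;
    }

    /**
     * Assumed alternative: ignore the receiver's own role and only require
     * that the notification names a leader other than this node.
     */
    static boolean shouldInstallAlternative(RoleInfo info, String leaderId) {
        return leaderId != null && !leaderId.equals(info.selfId);
    }

    public static void main(String[] args) {
        // om2 is the lagging follower, om3 the leader, as in the log snippet.
        RoleInfo om2 = new RoleInfo("om2", Role.FOLLOWER);
        System.out.println("old check: " + shouldInstallOld(om2));
        System.out.println("alternative check: " + shouldInstallAlternative(om2, "om3"));
    }
}
```

With the follower `om2` and leader `om3` from the log snippet, the old check rejects the request because the role info reports `om2`'s own FOLLOWER role.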
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-4063
How was this patch tested?
Tested manually on a docker cluster.