HDDS-13281. Disable Ratis metadata write to Raft Log on OM & SCM. #8637
Pull Request Overview
This PR disables commit metadata writes to the Raft log for both OzoneManager and SCM, reducing disk IO and improving performance.
- Disable Raft log metadata writes in OzoneManager for performance.
- Disable Raft log metadata writes in SCM with a similar configuration change.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerRatisServer.java | Disabled commit metadata writes to improve performance. |
| hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/RatisUtil.java | Disabled commit metadata writes; comment needs updating for SCM instead of OzoneManager. |
Comments suppressed due to low confidence (1)
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/RatisUtil.java:201
- The comment in this SCM file still references OzoneManager, which may confuse readers. Consider updating it to accurately reflect that the configuration is for SCM.
// commit index even if a majority of servers are dead. We don't need this for OzoneManager,
szetszwo
left a comment
+1 the change looks good.
@nandakumar131 Thanks for the patch. LGTM +1.
Before RATIS-2109 and this patch, each updateCommit would create a single metadata log entry, so there could be an almost 1:1 ratio between normal Raft log entries and metadata entries.
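To illustrate the ratio described above, here is a toy model, not Ratis code. `ToyRaftLog`, `EntryType`, and the `metadataEnabled` switch are hypothetical names made up for this sketch. It assumes the worst case of one commit update per transaction.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only; it does not use or reproduce the Ratis log implementation. */
final class ToyRaftLog {
  enum EntryType { TRANSACTION, METADATA }

  // Hypothetical switch standing in for "write commit metadata to the log".
  private final boolean metadataEnabled;
  private final List<EntryType> entries = new ArrayList<>();

  ToyRaftLog(boolean metadataEnabled) {
    this.metadataEnabled = metadataEnabled;
  }

  /** Appends a normal log entry. */
  void appendTransaction() {
    entries.add(EntryType.TRANSACTION);
  }

  /** Records a commit update; writes a metadata entry only when enabled. */
  void updateCommit() {
    if (metadataEnabled) {
      entries.add(EntryType.METADATA);
    }
  }

  long count(EntryType type) {
    return entries.stream().filter(e -> e == type).count();
  }

  /** Assumes one commit update per transaction (the worst case). */
  static ToyRaftLog simulate(boolean metadataEnabled, int transactions) {
    ToyRaftLog log = new ToyRaftLog(metadataEnabled);
    for (int i = 0; i < transactions; i++) {
      log.appendTransaction();
      log.updateCommit();
    }
    return log;
  }

  public static void main(String[] args) {
    for (boolean enabled : new boolean[] {true, false}) {
      ToyRaftLog log = simulate(enabled, 1000);
      System.out.printf("metadataEnabled=%b transactions=%d metadata=%d%n",
          enabled, log.count(EntryType.TRANSACTION), log.count(EntryType.METADATA));
    }
  }
}
```

In this model, the log holds twice as many entries with the switch on as with it off. A real server may batch commit updates, so the actual ratio can be lower.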
Thanks @szetszwo and @ivandika3 for the review. We actually make use of commit index in OzoneManager here. So, we cannot go ahead with this change for now.
@nandakumar131, Good catch! We should change it to use
+++ b/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/upgrade/OMPrepareRequest.java
@@ -185,7 +185,6 @@ private static long waitForLogIndex(long minOMDBFlushIndex,
// If we purge logs without waiting for this index, it may not make it to
// the RocksDB snapshot, and then the log entry is lost on this OM.
long minRatisStateMachineIndex = minOMDBFlushIndex + 1; // for the ratis-metadata transaction
- long lastRatisCommitIndex = RaftLog.INVALID_LOG_INDEX;
// Wait OM state machine to apply the given index.
long lastOMDBFlushIndex = RaftLog.INVALID_LOG_INDEX;
@@ -202,11 +201,10 @@ private static long waitForLogIndex(long minOMDBFlushIndex,
lastOMDBFlushIndex);
// Check ratis state machine.
- lastRatisCommitIndex = stateMachine.getLastNotifiedTermIndex().getIndex();
- ratisStateMachineApplied = (lastRatisCommitIndex >=
- minRatisStateMachineIndex);
+ final long lastRatisAppliedIndex = stateMachine.getLastAppliedTermIndex().getIndex();
+ ratisStateMachineApplied = lastRatisAppliedIndex >= minRatisStateMachineIndex;
LOG.debug("{} Current Ratis state machine transaction index {}.",
- om.getOMNodeId(), lastRatisCommitIndex);
+ om.getOMNodeId(), lastRatisAppliedIndex);
if (!(omDBFlushed && ratisStateMachineApplied)) {
Thread.sleep(flushCheckInterval.toMillis());
Thanks @szetszwo for the suggestion, updated the PR accordingly.
ratisStateMachineApplied = lastRatisAppliedIndex >= minRatisStateMachineIndex;
LOG.debug("{} Current Ratis state machine transaction index {}.",
- om.getOMNodeId(), lastRatisCommitIndex);
+ om.getOMNodeId(), ratisStateMachineApplied);
@nandakumar131 , It should print lastRatisAppliedIndex.
Updated. Also, lastRatisCommitIndex was referenced towards the end of the same method as well; replaced that with lastRatisAppliedIndex.
@@ -185,7 +185,6 @@ private static long waitForLogIndex(long minOMDBFlushIndex,
// If we purge logs without waiting for this index, it may not make it to
// the RocksDB snapshot, and then the log entry is lost on this OM.
long minRatisStateMachineIndex = minOMDBFlushIndex + 1; // for the ratis-metadata transaction
@nandakumar131, just found that we should also remove minRatisStateMachineIndex and just use minOMDBFlushIndex.
Is it because we are not writing the commit index to Raft Log anymore, so there is no need to add 1 to minOMDBFlushIndex?
Do we really need the lastRatisAppliedIndex check?
Can't we just rely on om.getRatisSnapshotIndex() and make sure that the given minOMDBFlushIndex is present in the snapshot? If minOMDBFlushIndex is present in the snapshot, then it will definitely be applied in the Ratis' state machine.
> ... so there is no need to add 1 to minOMDBFlushIndex?

Yes.

> Can't we just rely on om.getRatisSnapshotIndex() and make sure that the given minOMDBFlushIndex is present in the snapshot?

In OMPrepareRequest.validateAndUpdateCache(..), it waits and then takes snapshot. So it should wait for lastRatisAppliedIndex.
Thanks for the explanation. Updated the patch accordingly.
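The wait discussed in this thread can be sketched as a polling loop that returns only once both the DB flush index and the state machine's applied index have reached the same minimum index, with no `+ 1`. This is a minimal sketch, not the actual `OMPrepareRequest.waitForLogIndex` code. `WaitForIndexSketch`, the `LongSupplier` parameters, and the timeout handling are assumptions made for illustration.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

/** Illustrative only; the class and parameter names are hypothetical. */
final class WaitForIndexSketch {
  private WaitForIndexSketch() { }

  /**
   * Polls until both indices reach minIndex, then returns the applied index.
   * Throws IllegalStateException on timeout or interruption.
   */
  static long waitForIndex(long minIndex,
      LongSupplier flushedIndex,   // stands in for the last OM DB flush index
      LongSupplier appliedIndex,   // stands in for the last applied index
      Duration checkInterval,
      Duration timeout) {
    final long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      final long flushed = flushedIndex.getAsLong();
      final long applied = appliedIndex.getAsLong();
      // Both conditions use the same minimum index: no extra metadata entry
      // is expected after the last transaction in this sketch.
      if (flushed >= minIndex && applied >= minIndex) {
        return applied;
      }
      if (System.nanoTime() - deadline >= 0) {
        throw new IllegalStateException("Timed out: flushed=" + flushed
            + ", applied=" + applied + ", required=" + minIndex);
      }
      try {
        Thread.sleep(checkInterval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while waiting", e);
      }
    }
  }
}
```

Passing the index sources as suppliers keeps the sketch independent of OM classes. The real method reads these values from the OM and its Ratis state machine.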
Some integration tests seem to be timing out:
Thanks for the ping @adoroszlai, I'm looking at the test timeouts.
@szetszwo, the SCM State machine was not updating LastAppliedTermIndex due to which the tests were timing out.
szetszwo
left a comment
+1 the new change looks good.
@nandakumar131, This PR currently is in "draft" state. Is it ready?
@szetszwo, there are test failures in the CI run. Trying to check if they are related to this change, will make the PR ready once the test failures are addressed.
@szetszwo @ivandika3
szetszwo
left a comment
@nandakumar131 , Thanks for the update!
For the inFlightSnapshotCount, the current workaround is fine. We should file a separate Ozone JIRA for fixing it. The idea should be similar to acquiring a lock or opening a file -- we should have something similar to try-finally for incrementing/decrementing the count. Cc @peterxcli
The code change looks good. I suggest updating the java comment for the inFlightSnapshotCount workaround.
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotManager.java
int result = inFlightSnapshotCount.decrementAndGet();
if (result < 0) {
resetInFlightSnapshotCount();
}
The workaround is fine. How about updating the comment to below?
// TODO this is a work around for the accounting logic of `inFlightSnapshotCount`.
// - It incorrectly assumes that LeaderReady means that there are no inflight snapshot requests.
// We may consider fixing it by waiting all the pending requests in notifyLeaderReady().
// - Also, it seems to have another bug that the PrepareState could disallow snapshot requests.
// In such case, `inFlightSnapshotCount` won't be decremented. We should file a separate Ozone JIRA for fixing it.
Created HDDS-13357 to fix resetInFlightSnapshotCount logic.
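The try-finally idea suggested in the review can be sketched as follows. This is not the `OmSnapshotManager` code. `InFlightCounterSketch` and `track` are hypothetical names, and the sketch assumes the increment and decrement can be placed around a single call, which may not hold for how OM snapshot requests are actually processed.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Illustrative only; names are hypothetical and not part of Ozone. */
final class InFlightCounterSketch {
  private final AtomicInteger inFlight = new AtomicInteger();

  /**
   * Runs the request while it is counted as in flight. The finally block
   * pairs every increment with exactly one decrement, even if the request
   * throws, so the count cannot go negative or leak.
   */
  <T> T track(Supplier<T> request) {
    inFlight.incrementAndGet();
    try {
      return request.get();
    } finally {
      inFlight.decrementAndGet();
    }
  }

  int inFlightCount() {
    return inFlight.get();
  }
}
```

With this pairing, a reset on a negative count would no longer be needed in the sketch, since a decrement can only follow its own increment.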
@nandakumar131, this change turns out to be not straightforward. How about we separate the OM and the SCM changes into two JIRAs? I am fine if we want to keep everything in one JIRA.
Thanks @szetszwo for the review.
szetszwo
left a comment
+1 the change looks good.
Thanks @szetszwo & @ivandika3 for the review!
What changes were proposed in this pull request?
Disable Ratis metadata write to Raft Log on OM & SCM
For OM & SCM, we don't have to write the commit index to the Raft Log. That write is meant for scenarios where the commit index must be recovered after a majority failure, which is not required for OM & SCM.
Disabling the commit index write to the Raft Log will improve the performance of Ratis, as we will make one less disk IO for each transaction.
For the SCM StateMachine, a call to updateLastAppliedTermIndex(appliedTermIndex); is made to update the same transaction index (a bug).
What is the link to the Apache JIRA
HDDS-13281
How was this patch tested?
CI Run: https://github.com/nandakumar131/ozone/actions/runs/15940431628