
Conversation

@nandakumar131 (Contributor) commented on Jun 16, 2025

What changes were proposed in this pull request?

Disable Ratis metadata write to Raft Log on OM & SCM

For OM & SCM we don't have to write the commit index to the Raft Log; that write exists for scenarios where the commit index must be recovered after a majority failure, which is not required for OM & SCM.

Disabling the commit index write to the Raft Log will improve Ratis performance, since it saves one disk IO for each transaction.
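
For illustration only, the change amounts to flipping one boolean in the RaftProperties used to build the Ratis server. The property key below is a placeholder, not the real name; the actual key is the one introduced by RATIS-2109 and used in this PR's diff:

import org.apache.ratis.conf.RaftProperties;

public final class RatisMetadataConfigSketch {
  // Sketch only: "raft.server.log.metadata.enabled" is an illustrative placeholder,
  // not the actual Ratis key. Disabling it skips the per-transaction commit-index
  // (metadata) entry in the Raft log, saving one disk IO per transaction.
  static RaftProperties newServerProperties() {
    RaftProperties properties = new RaftProperties();
    properties.setBoolean("raft.server.log.metadata.enabled", false);
    return properties;
  }
}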

In the SCM StateMachine, the call updateLastAppliedTermIndex(appliedTermIndex) updates to the same transaction index (a bug).

What is the link to the Apache JIRA

HDDS-13281

How was this patch tested?

CI Run: https://github.com/nandakumar131/ozone/actions/runs/15940431628

Copilot AI left a comment

Pull Request Overview

This PR disables commit metadata writes to the Raft log for both OzoneManager and SCM, reducing disk IO and improving performance.

  • Disable Raft log metadata writes in OzoneManager for performance.
  • Disable Raft log metadata writes in SCM with a similar configuration change.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

  • hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerRatisServer.java: Disabled commit metadata writes to improve performance.
  • hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/RatisUtil.java: Disabled commit metadata writes; comment needs updating for SCM instead of OzoneManager.
Comments suppressed due to low confidence (1)

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/RatisUtil.java:201

  • The comment in this SCM file still references OzoneManager, which may confuse readers. Consider updating it to accurately reflect that the configuration is for SCM.
// commit index even if a majority of servers are dead. We don't need this for OzoneManager,

@szetszwo (Contributor) left a comment

+1 the change looks good.

@ivandika3 (Contributor) left a comment

@nandakumar131 Thanks for the patch. LGTM +1.

Before RATIS-2109 and this patch, each updateCommit would create a single metadata log entry, so there could be almost a 1:1 ratio between normal Raft log entries and metadata entries.

@nandakumar131 (Contributor, Author) commented

Thanks @szetszwo and @ivandika3 for the review.

We actually make use of the commit index in OzoneManager here. So we cannot go ahead with this change for now.

@szetszwo (Contributor) commented

We actually make use of commit index in OzoneManager here. ...

@nandakumar131, good catch! We should change it to use getLastAppliedTermIndex().

+++ b/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/upgrade/OMPrepareRequest.java
@@ -185,7 +185,6 @@ private static long waitForLogIndex(long minOMDBFlushIndex,
     // If we purge logs without waiting for this index, it may not make it to
     // the RocksDB snapshot, and then the log entry is lost on this OM.
     long minRatisStateMachineIndex = minOMDBFlushIndex + 1; // for the ratis-metadata transaction
-    long lastRatisCommitIndex = RaftLog.INVALID_LOG_INDEX;
 
     // Wait OM state machine to apply the given index.
     long lastOMDBFlushIndex = RaftLog.INVALID_LOG_INDEX;
@@ -202,11 +201,10 @@ private static long waitForLogIndex(long minOMDBFlushIndex,
           lastOMDBFlushIndex);
 
       // Check ratis state machine.
-      lastRatisCommitIndex = stateMachine.getLastNotifiedTermIndex().getIndex();
-      ratisStateMachineApplied = (lastRatisCommitIndex >=
-          minRatisStateMachineIndex);
+      final long lastRatisAppliedIndex = stateMachine.getLastAppliedTermIndex().getIndex();
+      ratisStateMachineApplied = lastRatisAppliedIndex >= minRatisStateMachineIndex;
       LOG.debug("{} Current Ratis state machine transaction index {}.",
-          om.getOMNodeId(), lastRatisCommitIndex);
+          om.getOMNodeId(), lastRatisAppliedIndex);
 
       if (!(omDBFlushed && ratisStateMachineApplied)) {
         Thread.sleep(flushCheckInterval.toMillis());

@nandakumar131 (Contributor, Author) commented

Thanks @szetszwo for the suggestion, updated the PR accordingly.

      ratisStateMachineApplied = lastRatisAppliedIndex >= minRatisStateMachineIndex;
      LOG.debug("{} Current Ratis state machine transaction index {}.",
-         om.getOMNodeId(), lastRatisCommitIndex);
+         om.getOMNodeId(), ratisStateMachineApplied);
Contributor:

@nandakumar131, it should print lastRatisAppliedIndex.

Contributor Author:

Updated. lastRatisCommitIndex was also referenced towards the end of the same method; replaced that with lastRatisAppliedIndex as well.

@@ -185,7 +185,6 @@ private static long waitForLogIndex(long minOMDBFlushIndex,
// If we purge logs without waiting for this index, it may not make it to
// the RocksDB snapshot, and then the log entry is lost on this OM.
long minRatisStateMachineIndex = minOMDBFlushIndex + 1; // for the ratis-metadata transaction
Contributor:

@nandakumar131, just found that we should also remove minRatisStateMachineIndex and just use minOMDBFlushIndex.

Contributor Author:

Is it because we are not writing the commit index to Raft Log anymore, so there is no need to add 1 to minOMDBFlushIndex?

Contributor Author:

Do we really need the lastRatisAppliedIndex check?
Can't we just rely on om.getRatisSnapshotIndex() and make sure that the given minOMDBFlushIndex is present in the snapshot? If minOMDBFlushIndex is present in the snapshot, then it will definitely be applied in the Ratis' state machine.

Contributor:

... so there is no need to add 1 to minOMDBFlushIndex?

Yes.

Can't we just rely on om.getRatisSnapshotIndex() and make sure that the given minOMDBFlushIndex is present in the snapshot?

In OMPrepareRequest.validateAndUpdateCache(..), it waits and then takes a snapshot. So it should wait for lastRatisAppliedIndex.
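
A rough sketch of the simplified wait after dropping minRatisStateMachineIndex; this is not the actual patch, the parameter list is invented for the sketch (the flush-index read is abstracted behind a LongSupplier), and only the applied-index check follows the diff above:

import java.time.Duration;
import java.util.function.LongSupplier;
import org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine;

// Sketch only: wait until both the OM DB flush index and the Ratis applied index
// reach minOMDBFlushIndex; the +1 for the ratis-metadata transaction is no longer
// needed once the metadata write is disabled.
static void waitForLogIndex(long minOMDBFlushIndex, LongSupplier omDBFlushIndex,
    OzoneManagerStateMachine stateMachine, Duration flushCheckInterval)
    throws InterruptedException {
  boolean omDBFlushed = false;
  boolean ratisStateMachineApplied = false;
  while (!(omDBFlushed && ratisStateMachineApplied)) {
    // OM DB flush check: how the flush index is read is left abstract here.
    omDBFlushed = omDBFlushIndex.getAsLong() >= minOMDBFlushIndex;

    // Ratis state machine check, as in the diff above.
    long lastRatisAppliedIndex = stateMachine.getLastAppliedTermIndex().getIndex();
    ratisStateMachineApplied = lastRatisAppliedIndex >= minOMDBFlushIndex;

    if (!(omDBFlushed && ratisStateMachineApplied)) {
      Thread.sleep(flushCheckInterval.toMillis());
    }
  }
}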

Contributor Author:

Thanks for the explanation. Updated the patch accordingly.

@peterxcli self-requested a review on June 19, 2025 02:06
@nandakumar131 marked this pull request as ready for review on June 19, 2025 02:28
@adoroszlai (Contributor) commented

Some integration tests seem to be timing out:

  • TestBlockDeletion
  • TestSCMInstallSnapshotWithHA

@nandakumar131 (Contributor, Author) commented

Thanks for the ping @adoroszlai, I'm looking at the test timeouts.

@nandakumar131 (Contributor, Author) commented

@szetszwo, the SCM StateMachine was not updating LastAppliedTermIndex, which is why the tests were timing out.
Fixed that in the latest commit.
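
For illustration, the kind of fix described might look like the following in a Ratis state machine (a sketch based on the Ratis BaseStateMachine API, not the actual SCMStateMachine change; process() is a placeholder):

// Sketch: derive the applied term/index from the transaction being applied and
// record it, so getLastAppliedTermIndex() advances with every applied transaction.
@Override
public CompletableFuture<Message> applyTransaction(TransactionContext trx) {
  final TermIndex appliedTermIndex = TermIndex.valueOf(trx.getLogEntry());
  final Message result = process(trx);  // placeholder for the real transaction handling
  updateLastAppliedTermIndex(appliedTermIndex);
  return CompletableFuture.completedFuture(result);
}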

@szetszwo (Contributor) left a comment

+1 the new change looks good.

@adoroszlai marked this pull request as draft on June 20, 2025 10:15
@nandakumar131 marked this pull request as ready for review on June 23, 2025 06:48
@nandakumar131 marked this pull request as draft on June 25, 2025 07:17
@szetszwo (Contributor) commented

@nandakumar131, this PR is currently in "draft" state. Is it ready?

@nandakumar131 (Contributor, Author) commented

@szetszwo, there are test failures in the CI run. I'm checking whether they are related to this change and will mark the PR ready once the failures are addressed.

@nandakumar131 (Contributor, Author) commented

@szetszwo @ivandika3
I have fixed all the test failures, can you please take another look at the changes? Thanks in advance!

@nandakumar131 marked this pull request as ready for review on June 28, 2025 05:55
@jojochuang added the snapshot (https://issues.apache.org/jira/browse/HDDS-6517) label on Jun 30, 2025
@szetszwo (Contributor) left a comment

@nandakumar131 , Thanks for the update!

For the inFlightSnapshotCount, the current workaround is fine. We should file a separate Ozone JIRA for fixing it. The idea should be similar to acquiring a lock or opening a file: we should have something like try-finally for incrementing/decrementing the count. Cc @peterxcli

The code change looks good. I suggest updating the Java comment for the inFlightSnapshotCount workaround.

Comment on lines +935 to +938
int result = inFlightSnapshotCount.decrementAndGet();
if (result < 0) {
  resetInFlightSnapshotCount();
}
Contributor:

The workaround is fine. How about updating the comment to the following?

// TODO this is a workaround for the accounting logic of `inFlightSnapshotCount`.
//    - It incorrectly assumes that LeaderReady means that there are no inflight snapshot requests.
//      We may consider fixing it by waiting for all the pending requests in notifyLeaderReady().
//    - Also, it seems to have another bug where the PrepareState could disallow snapshot requests.
//      In such a case, `inFlightSnapshotCount` won't be decremented.

We should file a separate Ozone JIRA for fixing it.
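
For illustration, the try-finally shaped accounting described above could look roughly like this (names are hypothetical, not the actual Ozone code):

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: pair every increment with a guaranteed decrement so the
// counter cannot leak when a snapshot request is rejected or fails midway.
final class InFlightSnapshotGuard implements AutoCloseable {
  private final AtomicInteger inFlightSnapshotCount;

  InFlightSnapshotGuard(AtomicInteger inFlightSnapshotCount) {
    this.inFlightSnapshotCount = inFlightSnapshotCount;
    this.inFlightSnapshotCount.incrementAndGet();
  }

  @Override
  public void close() {
    inFlightSnapshotCount.decrementAndGet();
  }
}

// Usage sketch:
// try (InFlightSnapshotGuard guard = new InFlightSnapshotGuard(inFlightSnapshotCount)) {
//   takeSnapshot();  // hypothetical snapshot call
// }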

Contributor Author:

Created HDDS-13357 to fix resetInFlightSnapshotCount logic.

@szetszwo (Contributor) commented

@nandakumar131, this change turns out not to be straightforward. How about we separate the OM and the SCM changes into two JIRAs? I am fine if we want to keep everything in one JIRA.

@nandakumar131 (Contributor, Author) commented

Thanks @szetszwo for the review.
Since all the required code changes are already done here, let's keep everything in this PR itself.
If you have more comments beyond updating the TODO comment, I will split the change into two PRs.

@szetszwo (Contributor) left a comment

+1 the change looks good.

@nandakumar131 merged commit d79ea9c into apache:master on Jul 2, 2025
41 checks passed
@nandakumar131 (Contributor, Author) commented

Thanks @szetszwo & @ivandika3 for the review!


Labels: performance, snapshot (https://issues.apache.org/jira/browse/HDDS-6517)
