
Conversation

@nandakumar131 (Contributor) commented on Jun 16, 2025

What changes were proposed in this pull request?

Disable Ratis metadata write to Raft Log on OM & SCM

For OM & SCM we don't have to write the commit index to the Raft Log; that write exists for scenarios where the commit index must be recovered after a majority failure, which is not required for OM & SCM.

Disabling the commit index write to the Raft Log will improve Ratis performance, since it saves one disk IO for each transaction.
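
For illustration only, the change amounts to flipping one boolean in the RaftProperties used to build the Ratis server. The property key below is a placeholder, not the real name; the actual key is the one introduced by RATIS-2109 and used in this PR's diff:

import org.apache.ratis.conf.RaftProperties;

public final class RatisMetadataConfigSketch {
  // Sketch only: "raft.server.log.metadata.enabled" is an illustrative placeholder,
  // not the actual Ratis key. Disabling it skips the per-transaction commit-index
  // (metadata) entry in the Raft log, saving one disk IO per transaction.
  static RaftProperties newServerProperties() {
    RaftProperties properties = new RaftProperties();
    properties.setBoolean("raft.server.log.metadata.enabled", false);
    return properties;
  }
}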

In the SCM StateMachine, the call updateLastAppliedTermIndex(appliedTermIndex) updates to the same transaction index (a bug).

What is the link to the Apache JIRA

HDDS-13281

How was this patch tested?

CI Run: https://github.com/nandakumar131/ozone/actions/runs/15940431628

Copilot AI left a comment

Pull Request Overview

This PR disables commit metadata writes to the Raft log for both OzoneManager and SCM, reducing disk IO and improving performance.

  • Disable Raft log metadata writes in OzoneManager for performance.
  • Disable Raft log metadata writes in SCM with a similar configuration change.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

  • hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerRatisServer.java: Disabled commit metadata writes to improve performance.
  • hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/RatisUtil.java: Disabled commit metadata writes; comment needs updating for SCM instead of OzoneManager.
Comments suppressed due to low confidence (1)

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/RatisUtil.java:201

  • The comment in this SCM file still references OzoneManager, which may confuse readers. Consider updating it to accurately reflect that the configuration is for SCM.
// commit index even if a majority of servers are dead. We don't need this for OzoneManager,

@szetszwo (Contributor) left a comment

+1 the change looks good.

@ivandika3 (Contributor) left a comment

@nandakumar131 Thanks for the patch. LGTM +1.

Before RATIS-2109 and this patch, each updateCommit would create a single metadata log entry, so there could be almost a 1:1 ratio between normal Raft log entries and metadata entries.

@nandakumar131 (Contributor, Author) commented

Thanks @szetszwo and @ivandika3 for the review.

We actually make use of the commit index in OzoneManager here. So we cannot go ahead with this change for now.

@szetszwo (Contributor) commented

We actually make use of commit index in OzoneManager here. ...

@nandakumar131, good catch! We should change it to use getLastAppliedTermIndex().

+++ b/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/upgrade/OMPrepareRequest.java
@@ -185,7 +185,6 @@ private static long waitForLogIndex(long minOMDBFlushIndex,
     // If we purge logs without waiting for this index, it may not make it to
     // the RocksDB snapshot, and then the log entry is lost on this OM.
     long minRatisStateMachineIndex = minOMDBFlushIndex + 1; // for the ratis-metadata transaction
-    long lastRatisCommitIndex = RaftLog.INVALID_LOG_INDEX;
 
     // Wait OM state machine to apply the given index.
     long lastOMDBFlushIndex = RaftLog.INVALID_LOG_INDEX;
@@ -202,11 +201,10 @@ private static long waitForLogIndex(long minOMDBFlushIndex,
           lastOMDBFlushIndex);
 
       // Check ratis state machine.
-      lastRatisCommitIndex = stateMachine.getLastNotifiedTermIndex().getIndex();
-      ratisStateMachineApplied = (lastRatisCommitIndex >=
-          minRatisStateMachineIndex);
+      final long lastRatisAppliedIndex = stateMachine.getLastAppliedTermIndex().getIndex();
+      ratisStateMachineApplied = lastRatisAppliedIndex >= minRatisStateMachineIndex;
       LOG.debug("{} Current Ratis state machine transaction index {}.",
-          om.getOMNodeId(), lastRatisCommitIndex);
+          om.getOMNodeId(), lastRatisAppliedIndex);
 
       if (!(omDBFlushed && ratisStateMachineApplied)) {
         Thread.sleep(flushCheckInterval.toMillis());

@nandakumar131 (Contributor, Author) commented

Thanks @szetszwo for the suggestion, updated the PR accordingly.

      ratisStateMachineApplied = lastRatisAppliedIndex >= minRatisStateMachineIndex;
      LOG.debug("{} Current Ratis state machine transaction index {}.",
-         om.getOMNodeId(), lastRatisCommitIndex);
+         om.getOMNodeId(), ratisStateMachineApplied);
Contributor:

@nandakumar131, it should print lastRatisAppliedIndex.

Contributor Author:

Updated. lastRatisCommitIndex was also referenced towards the end of the same method; replaced that with lastRatisAppliedIndex as well.

@@ -185,7 +185,6 @@ private static long waitForLogIndex(long minOMDBFlushIndex,
// If we purge logs without waiting for this index, it may not make it to
// the RocksDB snapshot, and then the log entry is lost on this OM.
long minRatisStateMachineIndex = minOMDBFlushIndex + 1; // for the ratis-metadata transaction
Contributor:

@nandakumar131, just found that we should also remove minRatisStateMachineIndex and just use minOMDBFlushIndex.

Contributor Author:

Is it because we are not writing the commit index to Raft Log anymore, so there is no need to add 1 to minOMDBFlushIndex?

Contributor Author:

Do we really need the lastRatisAppliedIndex check?
Can't we just rely on om.getRatisSnapshotIndex() and make sure that the given minOMDBFlushIndex is present in the snapshot? If minOMDBFlushIndex is present in the snapshot, then it will definitely be applied in the Ratis' state machine.

Contributor:

... so there is no need to add 1 to minOMDBFlushIndex?

Yes.

Can't we just rely on om.getRatisSnapshotIndex() and make sure that the given minOMDBFlushIndex is present in the snapshot?

In OMPrepareRequest.validateAndUpdateCache(..), it waits and then takes a snapshot. So it should wait for lastRatisAppliedIndex.
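
A rough sketch of the simplified wait after dropping minRatisStateMachineIndex; this is not the actual patch, the parameter list is invented for the sketch (the flush-index read is abstracted behind a LongSupplier), and only the applied-index check follows the diff above:

import java.time.Duration;
import java.util.function.LongSupplier;
import org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine;

// Sketch only: wait until both the OM DB flush index and the Ratis applied index
// reach minOMDBFlushIndex; the +1 for the ratis-metadata transaction is no longer
// needed once the metadata write is disabled.
static void waitForLogIndex(long minOMDBFlushIndex, LongSupplier omDBFlushIndex,
    OzoneManagerStateMachine stateMachine, Duration flushCheckInterval)
    throws InterruptedException {
  boolean omDBFlushed = false;
  boolean ratisStateMachineApplied = false;
  while (!(omDBFlushed && ratisStateMachineApplied)) {
    // OM DB flush check: how the flush index is read is left abstract here.
    omDBFlushed = omDBFlushIndex.getAsLong() >= minOMDBFlushIndex;

    // Ratis state machine check, as in the diff above.
    long lastRatisAppliedIndex = stateMachine.getLastAppliedTermIndex().getIndex();
    ratisStateMachineApplied = lastRatisAppliedIndex >= minOMDBFlushIndex;

    if (!(omDBFlushed && ratisStateMachineApplied)) {
      Thread.sleep(flushCheckInterval.toMillis());
    }
  }
}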

Contributor Author:

Thanks for the explanation. Updated the patch accordingly.

@peterxcli self-requested a review on June 19, 2025 02:06
@nandakumar131 marked this pull request as ready for review on June 19, 2025 02:28
@adoroszlai (Contributor) commented

Some integration tests seem to be timing out:

  • TestBlockDeletion
  • TestSCMInstallSnapshotWithHA

@nandakumar131 (Contributor, Author) commented

Thanks for the ping @adoroszlai, I'm looking at the test timeouts.

@nandakumar131 (Contributor, Author) commented

@szetszwo, the SCM StateMachine was not updating LastAppliedTermIndex, which is why the tests were timing out.
Fixed that in the latest commit.
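
For illustration, the kind of fix described might look like the following in a Ratis state machine (a sketch based on the Ratis BaseStateMachine API, not the actual SCMStateMachine change; process() is a placeholder):

// Sketch: derive the applied term/index from the transaction being applied and
// record it, so getLastAppliedTermIndex() advances with every applied transaction.
@Override
public CompletableFuture<Message> applyTransaction(TransactionContext trx) {
  final TermIndex appliedTermIndex = TermIndex.valueOf(trx.getLogEntry());
  final Message result = process(trx);  // placeholder for the real transaction handling
  updateLastAppliedTermIndex(appliedTermIndex);
  return CompletableFuture.completedFuture(result);
}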

@szetszwo (Contributor) left a comment

+1 the new change looks good.

@adoroszlai marked this pull request as draft on June 20, 2025 10:15
@nandakumar131 marked this pull request as ready for review on June 23, 2025 06:48
@nandakumar131 marked this pull request as draft on June 25, 2025 07:17
@szetszwo (Contributor) commented

@nandakumar131, this PR is currently in "draft" state. Is it ready?

@nandakumar131 (Contributor, Author) commented

@szetszwo, there are test failures in the CI run. I'm checking whether they are related to this change and will mark the PR ready once the failures are addressed.

@nandakumar131 (Contributor, Author) commented

@szetszwo @ivandika3
I have fixed all the test failures, can you please take another look at the changes? Thanks in advance!

@nandakumar131 marked this pull request as ready for review on June 28, 2025 05:55
@jojochuang added the snapshot (https://issues.apache.org/jira/browse/HDDS-6517) label on Jun 30, 2025
@szetszwo (Contributor) left a comment

@nandakumar131 , Thanks for the update!

For the inFlightSnapshotCount, the current workaround is fine. We should file a separate Ozone JIRA for fixing it. The idea should be similar to acquiring a lock or opening a file: we should have something like try-finally for incrementing/decrementing the count. Cc @peterxcli

The code change looks good. I suggest updating the Java comment for the inFlightSnapshotCount workaround.

Comment on lines +935 to +938
int result = inFlightSnapshotCount.decrementAndGet();
if (result < 0) {
  resetInFlightSnapshotCount();
}
Contributor:

The workaround is fine. How about updating the comment to the following?

// TODO this is a workaround for the accounting logic of `inFlightSnapshotCount`.
//    - It incorrectly assumes that LeaderReady means that there are no inflight snapshot requests.
//      We may consider fixing it by waiting for all the pending requests in notifyLeaderReady().
//    - Also, it seems to have another bug where the PrepareState could disallow snapshot requests.
//      In such a case, `inFlightSnapshotCount` won't be decremented.

We should file a separate Ozone JIRA for fixing it.
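
For illustration, the try-finally shaped accounting described above could look roughly like this (names are hypothetical, not the actual Ozone code):

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: pair every increment with a guaranteed decrement so the
// counter cannot leak when a snapshot request is rejected or fails midway.
final class InFlightSnapshotGuard implements AutoCloseable {
  private final AtomicInteger inFlightSnapshotCount;

  InFlightSnapshotGuard(AtomicInteger inFlightSnapshotCount) {
    this.inFlightSnapshotCount = inFlightSnapshotCount;
    this.inFlightSnapshotCount.incrementAndGet();
  }

  @Override
  public void close() {
    inFlightSnapshotCount.decrementAndGet();
  }
}

// Usage sketch:
// try (InFlightSnapshotGuard guard = new InFlightSnapshotGuard(inFlightSnapshotCount)) {
//   takeSnapshot();  // hypothetical snapshot call
// }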

Contributor Author:

Created HDDS-13357 to fix resetInFlightSnapshotCount logic.

@szetszwo (Contributor) commented

@nandakumar131, this change turns out not to be straightforward. How about we separate the OM and the SCM changes into two JIRAs? I am fine if we want to keep everything in one JIRA.

@nandakumar131 (Contributor, Author) commented

Thanks @szetszwo for the review.
Since all the required code changes are already done here, let's keep everything in this PR itself.
If you have more comments beyond updating the TODO comment, I will split the change into two PRs.

@szetszwo (Contributor) left a comment

+1 the change looks good.

@nandakumar131 merged commit d79ea9c into apache:master on Jul 2, 2025
41 checks passed
@nandakumar131 (Contributor, Author) commented

Thanks @szetszwo & @ivandika3 for the review!


Labels: performance, snapshot (https://issues.apache.org/jira/browse/HDDS-6517)
