HDDS-14121. Parallelize NSSummary Tree rebuild. #9473

ArafatKhan2198 · 2025-12-10T08:28:36Z

What changes were proposed in this pull request?

Earlier, Recon rebuilt the NSSummary tree using a single thread and wrote directly to the DB, which was very slow for large namespaces. This change makes the rebuild parallel, faster, and safer.

During a rebuild, Recon splits the OM DB tables into ranges and processes them in parallel using multiple iterator and worker threads. Workers scan records and build in-memory summary updates but never read from Recon DB, keeping them fast and avoiding contention.
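A minimal sketch of the range-splitting idea, not the code from this patch. It assumes a sorted in-memory map stands in for an OM DB table; `RangeScanSketch`, `FileRecord`, `split`, and `scan` are hypothetical names. Each worker reads only its own key range and accumulates into a thread-local map, so no shared state is touched while scanning.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: scan a sorted table in parallel key ranges. */
final class RangeScanSketch {

  /** One record of the stand-in table: parent directory id and file size. */
  static final class FileRecord {
    final long parentId;
    final long size;

    FileRecord(long parentId, long size) {
      this.parentId = parentId;
      this.size = size;
    }
  }

  /** Splits the sorted table into at most {@code parts} contiguous key ranges. */
  static List<NavigableMap<String, FileRecord>> split(
      NavigableMap<String, FileRecord> table, int parts) {
    // Assumption: keys fit in memory; a real table would use sampled boundary keys.
    List<String> keys = new ArrayList<>(table.keySet());
    List<NavigableMap<String, FileRecord>> ranges = new ArrayList<>();
    if (keys.isEmpty()) {
      return ranges;
    }
    int chunk = (keys.size() + parts - 1) / parts;
    for (int start = 0; start < keys.size(); start += chunk) {
      int end = Math.min(start + chunk, keys.size());
      ranges.add(table.subMap(keys.get(start), true, keys.get(end - 1), true));
    }
    return ranges;
  }

  /** Scans each range on its own thread and returns total file size per parent. */
  static Map<Long, Long> scan(NavigableMap<String, FileRecord> table, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Map<Long, Long>>> futures = new ArrayList<>();
      for (NavigableMap<String, FileRecord> range : split(table, threads)) {
        futures.add(pool.submit(() -> {
          // Thread-local accumulation: no reads or writes to shared state.
          Map<Long, Long> local = new HashMap<>();
          for (FileRecord r : range.values()) {
            local.merge(r.parentId, r.size, Long::sum);
          }
          return local;
        }));
      }
      // Merge the per-range results on the calling thread.
      Map<Long, Long> total = new HashMap<>();
      for (Future<Map<Long, Long>> f : futures) {
        f.get().forEach((k, v) -> total.merge(k, v, Long::sum));
      }
      return total;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    } finally {
      pool.shutdown();
    }
  }
}
```

In this sketch the merge happens on the caller; in the design described here, workers hand their updates to the flusher instead.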

When workers accumulate enough updates, they send batches to a single background async flusher through a bounded queue. The flusher is the only component that writes to Recon DB. It merges updates, propagates file sizes and counts up the directory tree, and commits everything using batched DB writes.
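The single-writer pattern can be sketched as follows. This is an illustration under stated assumptions, not the `NSSummaryAsyncFlusher` from the patch: `FlusherSketch` is a hypothetical class, a `HashMap` stands in for Recon DB, a batch is a map from directory id to a size delta, and `parentOf` is an assumed acyclic directory-to-parent lookup.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch: one flusher thread drains a bounded queue of batches. */
final class FlusherSketch {
  /** Sentinel batch (compared by identity) that tells the flusher to stop. */
  private static final Map<Long, Long> POISON = new HashMap<>();

  private final BlockingQueue<Map<Long, Long>> queue;
  private final Map<Long, Long> parentOf;
  /** Stand-in for the DB; only the flusher thread touches it while running. */
  private final Map<Long, Long> store = new HashMap<>();
  private final Thread flusher = new Thread(this::flushLoop, "flusher-sketch");

  private FlusherSketch(int capacity, Map<Long, Long> parentOf) {
    this.queue = new ArrayBlockingQueue<>(capacity);
    this.parentOf = parentOf;
  }

  static FlusherSketch start(int capacity, Map<Long, Long> parentOf) {
    FlusherSketch f = new FlusherSketch(capacity, parentOf);
    f.flusher.start();
    return f;
  }

  /** Called by workers; blocks while the queue is full (back-pressure). */
  void submit(Map<Long, Long> batch) {
    try {
      queue.put(batch);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  private void flushLoop() {
    try {
      while (true) {
        Map<Long, Long> batch = queue.take();
        if (batch == POISON) {
          return;
        }
        for (Map.Entry<Long, Long> e : batch.entrySet()) {
          // Apply the delta to the directory and to every ancestor.
          Long dir = e.getKey();
          while (dir != null) {
            store.merge(dir, e.getValue(), Long::sum);
            dir = parentOf.get(dir);
          }
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  /** Waits until all queued batches are applied, then returns the totals. */
  Map<Long, Long> close() {
    submit(POISON);
    try {
      flusher.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return store;
  }
}
```

The bounded queue is what keeps memory in check: when the flusher falls behind, `submit` blocks the workers rather than letting pending batches grow without limit.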

For FSO, the rebuild runs in two phases: first the directory phase to build the directory structure, then the file phase to apply file size and count updates. Each phase has its own flusher so file updates never depend on missing directories.
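The ordering guarantee can be illustrated with a small single-threaded sketch. Everything here is hypothetical (`TwoPhaseSketch`, `Summary`, and the array-based record layout are not from the patch); the point is only that the file phase runs against a directory map that is already complete, so a missing parent is a real error rather than a timing artifact.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: build directories first, then apply file updates. */
final class TwoPhaseSketch {

  /** Simplified stand-in for a per-directory summary. */
  static final class Summary {
    long parentId = -1;
    long numFiles;
    long sizeOfFiles;
    final Set<Long> childDirs = new HashSet<>();
  }

  /** Phase 1: each row is {dirId, parentId}; links every directory to its parent. */
  static Map<Long, Summary> directoryPhase(long[][] dirs) {
    Map<Long, Summary> tree = new HashMap<>();
    for (long[] d : dirs) {
      tree.computeIfAbsent(d[0], k -> new Summary()).parentId = d[1];
      tree.computeIfAbsent(d[1], k -> new Summary()).childDirs.add(d[0]);
    }
    return tree;
  }

  /** Phase 2: each row is {parentDirId, size}; requires the parent to exist. */
  static void filePhase(Map<Long, Summary> tree, long[][] files) {
    for (long[] f : files) {
      Summary parent = tree.get(f[0]);
      if (parent == null) {
        // Cannot happen for valid input once phase 1 has fully completed.
        throw new IllegalStateException("Missing directory " + f[0]);
      }
      parent.numFiles++;
      parent.sizeOfFiles += f[1];
    }
  }
}
```

A caller would run `directoryPhase` to completion (including its flush) before invoking `filePhase` on the result.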

If a DB write fails, the flusher immediately marks itself as failed. Workers detect this quickly and stop processing, new batches are rejected, and the original error is propagated so the task fails cleanly.
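A rough sketch of the fail-fast behaviour, assuming a hypothetical `FailFastSketch` class and `DbWriter` interface (neither is from the patch). For brevity the write happens inline instead of on a separate flusher thread; the parts being illustrated are the recorded first error, the health check before each record, the rejection of new batches, and `close()` rethrowing the original exception.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch: record the first write error and stop all further work. */
final class FailFastSketch {

  /** Stand-in for a batched DB write that may fail. */
  interface DbWriter {
    void write(String batch) throws IOException;
  }

  private final AtomicReference<IOException> failure = new AtomicReference<>();
  private final DbWriter writer;

  FailFastSketch(DbWriter writer) {
    this.writer = writer;
  }

  /** True once any write has failed; safe to call from any thread. */
  boolean isFailed() {
    return failure.get() != null;
  }

  /** Rejects the batch after a failure; otherwise writes it (inline in this sketch). */
  boolean submit(String batch) {
    if (isFailed()) {
      return false;
    }
    try {
      writer.write(batch);
    } catch (IOException e) {
      // Keep only the first error so the root cause is not overwritten.
      failure.compareAndSet(null, e);
    }
    return true;
  }

  /** Worker loop: checks flusher health before each record; returns records handled. */
  int processRecords(List<String> records) {
    int processed = 0;
    for (String record : records) {
      if (isFailed()) {
        break;
      }
      submit(record);
      processed++;
    }
    return processed;
  }

  /** Propagates the original write error, if any, so the caller's task fails. */
  void close() throws IOException {
    IOException e = failure.get();
    if (e != null) {
      throw e;
    }
  }
}
```

Storing the exception in an `AtomicReference` gives workers a cheap, lock-free health check and lets `close()` surface the same exception object that the failed write produced.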

Overall, this approach significantly reduces rebuild time for large namespaces while keeping DB writes controlled, consistent, and correct.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-14121

How was this patch tested?

Tested locally. The results below compare sequential iteration (old approach) with parallel iteration (new approach):

| Phase | Sequential | Parallel (Optimized) | Time Saved | Improvement |
|---|---|---|---|---|
| FSO Directory | 7m 12s | 5m 52s | -1m 20s | 19% faster |
| FSO File | 17m 44s | 9m 56s | -7m 48s | 44% faster |
| FSO TOTAL | 24m 56s | 15m 48s | -9m 8s | 37% faster |

sumitagrawl · 2026-01-05T07:24:15Z

hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/tasks/NSSummaryAsyncFlusher.java

+        break;
+      } catch (Exception e) {
+        LOG.error("{}: Error in flush loop", taskName, e);
+        // Continue processing other batches


The error or task failure is not reported here. We need a mechanism to report the failure if the DB has an issue.

Hi @sumitagrawl,
The async flusher now tracks a FAILED state: on a DB write error or any other error, it records the exception and stops processing.

Worker threads check flusher health before processing each record and stop within milliseconds if a failure is detected.

The queue also rejects new batches immediately after failure, and close() propagates the original DB exception so the main task fails cleanly.

Result: no wasted work, fast failure detection, a protected queue, and clear errors that preserve the original DB exception.

sumitagrawl

LGTM

adoroszlai · 2026-01-12T12:52:31Z

hadoop-hdds/common/src/main/resources/ozone-default.xml

    </description>
  </property>
-
+    


Please avoid whitespace-only changes.

HDDS-14121. Parallelize NSSummaryTask tree rebuild.

9b4b595

ArafatKhan2198 force-pushed the HDDS-14121 branch from 241d7e8 to 9b4b595 Compare December 10, 2025 19:12

ArafatKhan2198 added 4 commits December 11, 2025 12:16

Made some changes to the test code

8c791bb

Reverted back to the old methods and created new methods for the repr…

22ce3e9

…ocess

Fixed reprocess

9fbe17e

Fixed reprocess

8e505d4

jojochuang requested review from devmadhuu and sumitagrawl and removed request for devmadhuu December 15, 2025 18:35

jojochuang added the recon label Dec 15, 2025

jojochuang requested a review from ChenSammi December 15, 2025 18:37

ArafatKhan2198 added 11 commits December 16, 2025 14:37

Made the async flush thread code simple

1806451

Removed commented code

b3d00d9

Fixed the size and count bug

15c36c6

Removed comments

6402dfc

Added parallelization for nssummarytaskWithOBS

f72b924

Fixed the failing tests

301f8d2

Minor changes

d09bec1

Fixed checkstyle issues

3c7ff7e

Fixed findbugs issue

dad1fd2

Refactored variable declaration code

853546d

Removed unused code

9b56da2

sumitagrawl reviewed Jan 5, 2026

ArafatKhan2198 added 3 commits January 7, 2026 12:58

Fixed the directory iteration performance

ee181a9

Improved error handling code

d357492

Merge branch 'master' into HDDS-14121

1f7de77

ArafatKhan2198 marked this pull request as ready for review January 7, 2026 08:12

ArafatKhan2198 marked this pull request as draft January 7, 2026 08:12

sumitagrawl approved these changes Jan 7, 2026

ArafatKhan2198 marked this pull request as ready for review January 9, 2026 08:34

ArafatKhan2198 marked this pull request as draft January 9, 2026 08:39

ArafatKhan2198 added 2 commits January 12, 2026 12:44

Fixed the failing tests and build

b96515f

Fixed queue size

88eff5e

ArafatKhan2198 marked this pull request as ready for review January 12, 2026 07:22

ArafatKhan2198 merged commit b9ba495 into apache:master Jan 12, 2026
56 checks passed

adoroszlai reviewed Jan 12, 2026

ArafatKhan2198 changed the title ~~HDDS-14121. Parallelize NSSummaryTask tree rebuild.~~ HDDS-14121. Parallelize NSSummary Tree rebuild. Jan 28, 2026

4 participants
