@ArafatKhan2198 ArafatKhan2198 commented Dec 10, 2025

What changes were proposed in this pull request?

Earlier, Recon rebuilt the NSSummary tree using a single thread and wrote directly to the DB, which was very slow for large namespaces. This change makes the rebuild parallel, faster, and safer.

During a rebuild, Recon splits the OM DB tables into ranges and processes them in parallel using multiple iterator and worker threads. Workers scan records and build in-memory summary updates but never read from Recon DB, keeping them fast and avoiding contention.
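The range-splitting idea above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class name `RangeScanSketch`, the `{parentId, size}` row layout, and the in-memory list standing in for an OM DB table are all assumptions made for the example. Each worker scans only its own range and aggregates into a private map, so nothing is shared or read back while scanning.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: a List<long[]> of {parentId, size} rows stands in for an OM DB table.
class RangeScanSketch {

  // Split the index space [0, n) into at most `parts` contiguous ranges (parts must be >= 1).
  static List<int[]> splitRanges(int n, int parts) {
    List<int[]> ranges = new ArrayList<>();
    int chunk = (n + parts - 1) / parts;
    for (int start = 0; start < n; start += chunk) {
      ranges.add(new int[] {start, Math.min(n, start + chunk)});
    }
    return ranges;
  }

  // Returns parentId -> {fileCount, totalSize}.
  static Map<Long, long[]> summarize(List<long[]> table, int threads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Map<Long, long[]>>> futures = new ArrayList<>();
      for (int[] range : splitRanges(table.size(), threads)) {
        futures.add(pool.submit(() -> {
          // Worker-local map: no shared state and no store reads while scanning.
          Map<Long, long[]> local = new HashMap<>();
          for (int i = range[0]; i < range[1]; i++) {
            long[] row = table.get(i);
            long[] agg = local.computeIfAbsent(row[0], k -> new long[2]);
            agg[0]++;
            agg[1] += row[1];
          }
          return local;
        }));
      }
      // Merge the per-range results on the calling thread.
      Map<Long, long[]> merged = new HashMap<>();
      for (Future<Map<Long, long[]>> f : futures) {
        for (Map.Entry<Long, long[]> e : f.get().entrySet()) {
          long[] agg = merged.computeIfAbsent(e.getKey(), k -> new long[2]);
          agg[0] += e.getValue()[0];
          agg[1] += e.getValue()[1];
        }
      }
      return merged;
    } finally {
      pool.shutdown();
    }
  }
}
```

In this sketch the merge happens on the calling thread once all ranges finish; the PR instead hands accumulated updates to a background flusher.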

When workers accumulate enough updates, they send batches to a single background async flusher through a bounded queue. The flusher is the only component that writes to Recon DB. It merges updates, propagates file sizes and counts up the directory tree, and commits everything using batched DB writes.
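A single-writer flusher behind a bounded queue might look something like the sketch below. The names (`AsyncFlusherSketch`, `parentOf`, `store`) are hypothetical, a `HashMap` stands in for Recon DB, and the propagation rule (each directory total includes its descendants) is an assumption for illustration only; the actual batching and propagation logic in this PR may differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of a single-writer flusher fed through a bounded queue.
class AsyncFlusherSketch {
  // Sentinel batch, compared by identity, that tells the flusher to stop.
  private static final Map<Long, Long> POISON = new HashMap<>();

  private final BlockingQueue<Map<Long, Long>> queue;
  private final Map<Long, Long> parentOf;                 // dirId -> parent dirId
  private final Map<Long, Long> store = new HashMap<>();  // stand-in for Recon DB
  private final Thread flusher;

  AsyncFlusherSketch(int capacity, Map<Long, Long> parentOf) {
    this.queue = new ArrayBlockingQueue<>(capacity);
    this.parentOf = parentOf;
    this.flusher = new Thread(this::flushLoop, "flusher");
    this.flusher.start();
  }

  // Workers call this; put() blocks when the queue is full, which applies backpressure.
  void submit(Map<Long, Long> sizeByDir) throws InterruptedException {
    queue.put(sizeByDir);
  }

  private void flushLoop() {
    try {
      while (true) {
        Map<Long, Long> batch = queue.take();
        if (batch == POISON) {
          return;
        }
        for (Map.Entry<Long, Long> e : batch.entrySet()) {
          // Add the size to the directory and to every ancestor up to the root.
          Long dir = e.getKey();
          while (dir != null) {
            store.merge(dir, e.getValue(), Long::sum);
            dir = parentOf.get(dir);
          }
        }
      }
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    }
  }

  // Drains the queue, stops the flusher, and returns the final totals.
  Map<Long, Long> close() throws InterruptedException {
    queue.put(POISON);
    flusher.join();
    return store;
  }
}
```

Because only the flusher thread touches `store`, the sketch needs no locking around writes; callers read the result only after `close()` has joined the thread.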

For FSO, the rebuild runs in two phases: first the directory phase to build the directory structure, then the file phase to apply file size and count updates. Each phase has its own flusher so file updates never depend on missing directories.
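The ordering guarantee can be illustrated with a deliberately single-threaded sketch. `TwoPhaseRebuildSketch`, its `Summary` class, and the array-based inputs are hypothetical and only show why finishing the directory phase first means the file phase always finds its parent; the PR runs each phase in parallel with its own flusher.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, single-threaded sketch of the two-phase ordering only.
class TwoPhaseRebuildSketch {
  static final class Summary {
    long numFiles;
    long sizeOfFiles;
    final List<Long> childDirs = new ArrayList<>();
  }

  // dirs rows are {dirId, parentId}; files rows are {parentDirId, size}.
  static Map<Long, Summary> rebuild(long rootId, long[][] dirs, long[][] files) {
    Map<Long, Summary> store = new HashMap<>();
    store.put(rootId, new Summary());

    // Phase 1: create every directory entry and link it to its parent.
    for (long[] d : dirs) {
      store.computeIfAbsent(d[0], k -> new Summary());
      store.computeIfAbsent(d[1], k -> new Summary()).childDirs.add(d[0]);
    }

    // Phase 2: starts only after phase 1 has completed, so each parent must exist.
    for (long[] f : files) {
      Summary s = store.get(f[0]);
      if (s == null) {
        throw new IllegalStateException("file references unknown directory " + f[0]);
      }
      s.numFiles++;
      s.sizeOfFiles += f[1];
    }
    return store;
  }
}
```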

If a DB write fails, the flusher immediately marks itself as failed. Workers detect this quickly and stop processing, new batches are rejected, and the original error is propagated so the task fails cleanly.
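One way to get this fail-fast behaviour is sketched below. It is not taken from the PR: `FailFastFlusherSketch`, `BatchWriter`, `isFailed()`, the queue capacity of 4, and the 10 ms polling interval are all assumptions for illustration. The flusher records the first error, producers see it on their next `submit`, and `close()` rethrows with the original exception as the cause.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of a flusher that fails fast and keeps the original error.
class FailFastFlusherSketch {
  interface BatchWriter {
    void write(List<String> batch) throws IOException;
  }

  // Sentinel batch, compared by identity, that tells the flusher to stop.
  private static final List<String> POISON = new ArrayList<>();

  private final BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(4);
  private final AtomicReference<Throwable> failure = new AtomicReference<>();
  private final Thread flusher;

  FailFastFlusherSketch(BatchWriter writer) {
    flusher = new Thread(() -> {
      try {
        while (true) {
          List<String> batch = queue.take();
          if (batch == POISON) {
            return;
          }
          writer.write(batch);
        }
      } catch (Throwable t) {
        failure.compareAndSet(null, t); // keep only the first error
        queue.clear();                  // drop pending batches; nothing will write them
      }
    }, "flusher");
    flusher.start();
  }

  // Workers can poll this between records to stop early.
  boolean isFailed() {
    return failure.get() != null;
  }

  // Rejects new batches once the flusher has failed; never blocks forever on a dead flusher.
  void submit(List<String> batch) throws InterruptedException {
    while (true) {
      if (isFailed()) {
        throw new IllegalStateException("flusher failed", failure.get());
      }
      if (queue.offer(batch, 10, TimeUnit.MILLISECONDS)) {
        return;
      }
    }
  }

  // Stops the flusher and surfaces the original error, if any, to the caller.
  void close() throws IOException, InterruptedException {
    while (flusher.isAlive() && !queue.offer(POISON, 10, TimeUnit.MILLISECONDS)) {
      // retry until the sentinel is queued or the flusher has already exited
    }
    flusher.join();
    Throwable t = failure.get();
    if (t != null) {
      throw new IOException("rebuild failed", t);
    }
  }
}
```

Using a timed `offer` rather than `put` in `submit` is what lets a producer notice a failure even when the queue is full and no longer being drained.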

Overall, this approach significantly reduces rebuild time for large namespaces while keeping DB writes controlled, consistent, and correct.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-14121

How was this patch tested?

Tested locally. The results below compare sequential iteration (old approach) with parallel iteration (new approach):

| Phase | Sequential | Parallel (Optimized) | Time Saved | Improvement |
|---|---|---|---|---|
| FSO Directory | 7m 12s | 5m 52s | -1m 20s | 19% faster |
| FSO File | 17m 44s | 9m 56s | -7m 48s | 44% faster |
| FSO TOTAL | 24m 56s | 15m 48s | -9m 8s | 37% faster |

@jojochuang jojochuang requested review from devmadhuu and sumitagrawl and removed request for devmadhuu December 15, 2025 18:35
@jojochuang jojochuang requested a review from ChenSammi December 15, 2025 18:37
break;
} catch (Exception e) {
LOG.error("{}: Error in flush loop", taskName, e);
// Continue processing other batches

The error or task failure is not reported ... we need a mechanism to report the failure if the DB has an issue.

@ArafatKhan2198 ArafatKhan2198 Jan 7, 2026

Hi @sumitagrawl,
The async flusher now tracks a FAILED state: on a DB write error, or any other error, it records the exception and stops processing.

Worker threads check flusher health before processing each record and stop within milliseconds if a failure is detected.

The queue also rejects new batches immediately after failure, and close() propagates the original DB exception so the main task fails cleanly.

Result: no wasted work, fast failure detection, a protected queue, and clear errors that preserve the original DB exception.

@ArafatKhan2198 ArafatKhan2198 marked this pull request as ready for review January 7, 2026 08:12
@ArafatKhan2198 ArafatKhan2198 marked this pull request as draft January 7, 2026 08:12
@sumitagrawl sumitagrawl left a comment

LGTM

@ArafatKhan2198 ArafatKhan2198 marked this pull request as ready for review January 9, 2026 08:34
@ArafatKhan2198 ArafatKhan2198 marked this pull request as draft January 9, 2026 08:39
@ArafatKhan2198 ArafatKhan2198 marked this pull request as ready for review January 12, 2026 07:22
@ArafatKhan2198 ArafatKhan2198 merged commit b9ba495 into apache:master Jan 12, 2026
56 checks passed
</description>
</property>

Please avoid whitespace-only changes.

@ArafatKhan2198 ArafatKhan2198 changed the title from "HDDS-14121. Parallelize NSSummaryTask tree rebuild." to "HDDS-14121. Parallelize NSSummary Tree rebuild." Jan 28, 2026