
Conversation

@ashishkumar50
Contributor

What changes were proposed in this pull request?

Control the number of delete-block requests sent to each DN in a cycle. With a per-DN cap in place, the total delete-block limit in SCM can safely be increased for each cycle, since the requests are distributed across more DNs.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13410

How was this patch tested?

New unit test

@ivandika3 ivandika3 requested review from sumitagrawl and xichen01 July 9, 2025 10:37
@xichen01
Contributor

xichen01 commented Jul 9, 2025

@ashishkumar50 Thanks for your patch. Could you share the background of this PR? What problem is this limit trying to solve?

@ashishkumar50
Contributor Author

@xichen01 Thanks for looking into it.
Currently the block deletion limit from SCM is just 100K per interval by default, which is quite a conservative value. We want to raise this to something much higher, like 2M or even more, but at the same time we don't want any particular DN to be overloaded with too many delete-block requests just because the overall SCM limit was raised. The per-DN cap lets the delete-block requests spread across many DNs.
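
Roughly, the idea is a per-DN counter that is checked before a transaction's blocks are handed out; here is a minimal sketch (the class, field and method names are illustrative only, not the actual patch):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: cap how many delete blocks each DN is asked to
// handle within one SCM deletion cycle.
class PerDatanodeDeleteLimiter {
  private final int maxBlocksPerDatanode;                 // hypothetical per-DN cap
  private final Map<String, Integer> assignedBlocks = new HashMap<>();

  PerDatanodeDeleteLimiter(int maxBlocksPerDatanode) {
    this.maxBlocksPerDatanode = maxBlocksPerDatanode;
  }

  // Returns true only if every replica's DN still has room for blockCount more
  // blocks; in that case the per-DN counters are updated and the transaction is
  // assigned, otherwise it is skipped for this cycle.
  boolean tryAssign(List<String> replicaDatanodeIds, int blockCount) {
    boolean allHaveRoom = replicaDatanodeIds.stream().allMatch(
        dn -> assignedBlocks.getOrDefault(dn, 0) + blockCount <= maxBlocksPerDatanode);
    if (allHaveRoom) {
      replicaDatanodeIds.forEach(dn -> assignedBlocks.merge(dn, blockCount, Integer::sum));
    }
    return allHaveRoom;
  }
}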

Contributor

@xichen01 xichen01 left a comment


@ashishkumar50 Thanks for your patch, I have left a few comments.

Map<DatanodeID, Map<Long, CmdStatus>> commandStatus) {

// Check if all replicas satisfy the maxBlocksPerDatanode condition
if (!replicas.stream().allMatch(replica -> {
Contributor


If the deleted replicas are not distributed evenly, this return may make the blockDeletionLimit hard to reach (for example, in an expanded cluster the deleted data may all live on the old DNs).
Maybe we can set a maximum number of loop iterations, e.g. end the while loop in DeletedBlockLogImpl#getTransactions once it exceeds 3 * (DN count) * maxDeleteBlocksPerDatanode.
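
Something along these lines, as a sketch only (whether the cap counts transactions or blocks is a detail, and all names below are illustrative rather than actual Ozone code):

import java.util.Iterator;

// Illustrative sketch: stop scanning the deleted-block table for this cycle
// once roughly 3 * (DN count) * maxDeleteBlocksPerDatanode blocks have been
// examined, even if blockDeletionLimit was never reached.
final class ScanCapSketch {
  static long collectWithCap(Iterator<Integer> blocksPerTxn, long blockDeletionLimit,
      int datanodeCount, int maxDeleteBlocksPerDatanode) {
    long maxScannedBlocks = 3L * datanodeCount * maxDeleteBlocksPerDatanode;
    long scanned = 0;
    long collected = 0;
    while (blocksPerTxn.hasNext() && collected < blockDeletionLimit) {
      int txnBlocks = blocksPerTxn.next();
      scanned += txnBlocks;
      if (scanned > maxScannedBlocks) {
        break;  // give up for this cycle instead of walking the whole table
      }
      // In the real code this would only count blocks that pass the per-DN check.
      collected += txnBlocks;
    }
    return collected;
  }
}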

Contributor Author


Thanks for the suggestion, breaking out of the while loop sounds fine to me. But the problem is that 3 * (DN count) * maxDeleteBlocksPerDatanode may not always be the right bound: with under-replicated containers the replica count will be less than 3, and it also doesn't hold for EC containers, which may have 5-8 replicas or so.

Assume we set the total limit to 2M and the per-datanode limit to 100K.
It will iterate over the whole table only when the cluster is small, maybe 15-20 nodes, and we then expand the cluster.
The problem arises only when the gap between the total limit and the per-datanode limit is far too large compared to the number of DNs in the cluster.

Map<DatanodeID, Map<Long, CmdStatus>> commandStatus) {

// Check if all replicas satisfy the maxBlocksPerDatanode condition
if (!replicas.stream().allMatch(replica -> {
Contributor


And if ((DN count) * maxDeleteBlocksPerDatanode) < blockDeletionLimit, DeletedBlockLogImpl#getTransactions will iterate over all the data in the table.
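For example (made-up numbers): with 10 DNs and maxDeleteBlocksPerDatanode = 100K, the per-cycle ceiling is 1M blocks, so a blockDeletionLimit of 2M can never be reached and the loop ends up walking the entire table.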

Contributor

@aryangupta1998 aryangupta1998 left a comment


Thanks, @ashishkumar50, for the patch. It looks good, I have a few suggestions.

One suggestion regarding the default value of ozone.scm.block.deletion.max.blocks.per.dn: it's currently hardcoded to 100k, which is safe and conservative but may not scale well or distribute load efficiently in larger clusters.

We already have another config, hdds.scm.block.deletion.per-interval.max (default: 500k), which controls the total number of blocks SCM can process per cycle. If both values are set to 100k, a single DN could potentially receive the full deletion load, defeating the goal of balanced distribution.

Consider a case where hdds.scm.block.deletion.per-interval.max = 2M. With ozone.scm.block.deletion.max.blocks.per.dn = 100k, we would ideally need 20 DNs to fully utilize the interval. If the cluster has fewer than 20 DNs, the load won't be evenly distributed — some DNs may hit the cap, while others remain underutilized.

To improve flexibility and balance, we could consider deriving the default value dynamically, for example:

ozone.scm.block.deletion.max.blocks.per.dn = hdds.scm.block.deletion.per-interval.max / (number of DNs / 2)

This ensures better load spreading in large clusters, while still avoiding overload in smaller ones. The /2 factor acts as a safety buffer but can be tuned based on experimentation.
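
A minimal sketch of that derivation, purely for illustration (the method and parameter names are made up, not the merged code):

// Illustrative sketch: derive the per-DN cap from the per-interval total and
// the number of DNs, with the /2 factor as a safety buffer.
final class PerDatanodeLimitSketch {
  static int derivePerDatanodeLimit(int perIntervalMax, int datanodeCount) {
    int effectiveDnCount = Math.max(1, datanodeCount / 2);  // guard small clusters against divide-by-zero
    return perIntervalMax / effectiveDnCount;
  }
}

With perIntervalMax = 2,000,000 and 20 DNs this gives 200K per DN; with only 4 DNs it gives 1M, so each DN is still capped at half the interval total.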

@ashishkumar50
Contributor Author

Thanks @aryangupta1998 for the review, I have updated the patch with the dynamically derived per-DN value.

Contributor

@aryangupta1998 aryangupta1998 left a comment


Thanks @ashishkumar50, for updating the patch, LGTM!

@aryangupta1998 aryangupta1998 merged commit 64a4958 into apache:master Jul 21, 2025
81 of 82 checks passed
@aryangupta1998
Contributor

Thanks @ashishkumar50 for the contribution and @xichen01 for the review!

errose28 added a commit to errose28/ozone that referenced this pull request Jul 30, 2025
* master: (730 commits)
  HDDS-13083. Handle cases where block deletion generates tree file before scanner (apache#8565)
  HDDS-12982. Reduce log level for snapshot validation failure (apache#8851)
  HDDS-13396. Documentation: Improve the top-level overview page for new users. (apache#8753)
  HDDS-13176. containerIds table value format change to proto from string (apache#8589)
  HDDS-13449. Incorrect Interrupt Handling for DirectoryDeletingService and KeyDeletingService (apache#8817)
  HDDS-2453. Add Freon tests for S3 MPU Keys (apache#8803)
  HDDS-13237. Container data checksum should contain block IDs. (apache#8773)
  HDDS-13489. Fix SCMBlockdeleting unnecessary iteration in corner case. (apache#8847)
  HDDS-13464. Make ozone.snapshot.filtering.service.interval reconfigurable (apache#8825)
  HDDS-13473. Amend validation for OZONE_OM_SNAPSHOT_DB_MAX_OPEN_FILES (apache#8829)
  HDDS-13435. Add an OzoneManagerAuthorizer interface (apache#8840)
  HDDS-8565. Recon memory leak in NSSummary (apache#8823).
  HDDS-12852. Implement a sliding window counter utility (apache#8498)
  HDDS-12000. Add unit test for RatisContainerSafeModeRule and ECContainerSafeModeRule (apache#8801)
  HDDS-13092. Container scanner should trigger volume scan when marking a container unhealthy (apache#8603)
  HDDS-13070. OM Follower changes to create and place sst files from hardlink file. (apache#8761)
  HDDS-13482. Mark testWriteStateMachineDataIdempotencyWithClosedContainer as flaky
  HDDS-13481. Fix success latency metric in SCM panels of deletion grafana dashboard (apache#8835)
  HDDS-13468. Update default value of ozone.scm.ha.dbtransactionbuffer.flush.interval. (apache#8834)
  HDDS-13410. Control block deletion for each DN from SCM. (apache#8767)
  ...

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Jul 31, 2025