HDDS-13410. Control block deletion for each DN from SCM. #8767
Conversation
@ashishkumar50 Thanks for your patch. Could I know the background of this PR? What problem is this limit trying to solve?

@xichen01 Thanks for looking into it.
xichen01
left a comment
@ashishkumar50 Thanks for your patch, I have left a few comments.
Map<DatanodeID, Map<Long, CmdStatus>> commandStatus) {
  ...
  // Check if all replicas satisfy the maxBlocksPerDatanode condition
  if (!replicas.stream().allMatch(replica -> {
If the deleted replicas are not distributed evenly, this return may make the blockDeletionLimit hard to reach (for example, in a recently expanded cluster the data to be deleted may sit mostly on the old DNs).
Maybe we can set a maximum number of loop iterations, e.g. end the while loop in DeletedBlockLogImpl#getTransactions once it exceeds 3 * (DN count) * maxDeleteBlocksPerDatanode.
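A minimal sketch of that loop cap, using a plain iterator over transaction IDs as a stand-in for the actual table iterator in DeletedBlockLogImpl (the class name, counters, and values below are illustrative assumptions, not the merged code):

```java
import java.util.Iterator;
import java.util.List;

// Illustrative sketch of the suggested scan cap: stop walking the
// deleted-block log after a bounded number of entries, even if the overall
// block deletion limit has not been reached yet.
public final class BoundedTxnScanSketch {
  public static void main(String[] args) {
    int datanodeCount = 20;                    // assumed: live DN count known to SCM
    int maxDeleteBlocksPerDatanode = 100_000;  // assumed per-DN cap
    long maxScanned = 3L * datanodeCount * maxDeleteBlocksPerDatanode;

    // Stand-in for the iterator over deleted-block transactions.
    Iterator<Long> txnIds = List.of(1L, 2L, 3L).iterator();

    long scanned = 0;
    while (txnIds.hasNext() && scanned < maxScanned) {
      Long txnId = txnIds.next();
      scanned++;
      // ... existing per-DN / per-interval filtering of txnId would go here ...
    }
    System.out.println("Scanned " + scanned + " transactions");
  }
}
```

The factor of 3 mirrors the typical Ratis replication factor and, as the reply below points out, is only a heuristic.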
Thanks for the suggestion; breaking out of the while loop looks fine to me. The problem is that 3 * (DN count) * maxDeleteBlocksPerDatanode may not always hold: with under-replicated containers the replica count will be less than 3, and it does not hold for EC containers either, which may have 5-8 replicas or so.
Assume we set the total limit to 2M and the per-datanode limit to 100K.
It will iterate the whole table only when the cluster size is very small, maybe 15-20 nodes, and we then expand the cluster.
The problem arises only when the gap between the total limit and the per-datanode limit is far too large compared to the number of DNs in the cluster.
And if ((DN count) * maxDeleteBlocksPerDatanode) < blockDeletionLimit, DeletedBlockLogImpl#getTransactions will iterate over all the data in the table.
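One way to avoid that full-table scan, sketched under the assumption that the effective per-interval limit can simply be clamped to the aggregate per-DN capacity (the class and method names are hypothetical, not the actual fix):

```java
// Illustrative sketch: once every DN has hit its cap, scanning further rows
// cannot produce more work, so the effective deletion limit can be clamped
// to dnCount * maxDeleteBlocksPerDatanode.
public final class EffectiveDeletionLimitSketch {
  static int effectiveDeletionLimit(int blockDeletionLimit, int dnCount,
      int maxDeleteBlocksPerDatanode) {
    long aggregatePerDnCapacity = (long) dnCount * maxDeleteBlocksPerDatanode;
    return (int) Math.min(blockDeletionLimit, aggregatePerDnCapacity);
  }

  public static void main(String[] args) {
    // 10 DNs * 100K per DN = 1M, so a 2M per-interval limit is clamped to 1M.
    System.out.println(effectiveDeletionLimit(2_000_000, 10, 100_000));
  }
}
```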
aryangupta1998
left a comment
Thanks, @ashishkumar50, for the patch. It looks good, I have a few suggestions.
One suggestion regarding the default value of ozone.scm.block.deletion.max.blocks.per.dn: it's currently hardcoded to 100k, which is safe and conservative but may not scale well or distribute load efficiently in larger clusters.
We already have another config, hdds.scm.block.deletion.per-interval.max (default: 500k), which controls the total number of blocks SCM can process per cycle. If both values are set to 100k, a single DN could potentially receive the full deletion load, defeating the goal of balanced distribution.
Consider a case where hdds.scm.block.deletion.per-interval.max = 2M. With ozone.scm.block.deletion.max.blocks.per.dn = 100k, we would ideally need 20 DNs to fully utilize the interval. If the cluster has fewer than 20 DNs, the load won't be evenly distributed — some DNs may hit the cap, while others remain underutilized.
To improve flexibility and balance, we could consider deriving the default value dynamically, for example:
ozone.scm.block.deletion.max.blocks.per.dn = hdds.scm.block.deletion.per-interval.max / (number of DNs / 2)
This ensures better load spreading in large clusters, while still avoiding overload in smaller ones. The /2 factor acts as a safety buffer but can be tuned based on experimentation.
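A rough sketch of that derivation, assuming the per-DN cap is recomputed from the per-interval maximum and the number of healthy DNs (the method name, the fallback for an empty cluster, and the example values are illustrative assumptions, not the code that was merged):

```java
// Illustrative sketch: derive the per-DN block deletion cap from the
// per-interval total, spreading the budget over roughly half the DNs
// (the /2 safety buffer suggested above).
public final class PerDatanodeLimitSketch {
  static int derivePerDatanodeLimit(int perIntervalMax, int healthyDatanodeCount) {
    if (healthyDatanodeCount <= 0) {
      return perIntervalMax;               // no DN info yet: fall back to the global cap
    }
    int divisor = Math.max(1, healthyDatanodeCount / 2);  // guard 1-node clusters
    return Math.max(1, perIntervalMax / divisor);
  }

  public static void main(String[] args) {
    // per-interval.max = 2M with 40 healthy DNs -> 100K blocks per DN per cycle.
    System.out.println(derivePerDatanodeLimit(2_000_000, 40));
  }
}
```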
(Three resolved inline review comments on hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/block/DeletedBlockLogImpl.java, two of them marked outdated.)
Thanks @aryangupta1998 for the review, updated the patch with the dynamically derived value for each DN.
aryangupta1998
left a comment
Thanks @ashishkumar50, for updating the patch, LGTM!
Thanks @ashishkumar50 for the contribution and @xichen01 for the review!
Merge of master (730 commits), most recent commits:
- HDDS-13083. Handle cases where block deletion generates tree file before scanner (apache#8565)
- HDDS-12982. Reduce log level for snapshot validation failure (apache#8851)
- HDDS-13396. Documentation: Improve the top-level overview page for new users. (apache#8753)
- HDDS-13176. containerIds table value format change to proto from string (apache#8589)
- HDDS-13449. Incorrect Interrupt Handling for DirectoryDeletingService and KeyDeletingService (apache#8817)
- HDDS-2453. Add Freon tests for S3 MPU Keys (apache#8803)
- HDDS-13237. Container data checksum should contain block IDs. (apache#8773)
- HDDS-13489. Fix SCMBlockdeleting unnecessary iteration in corner case. (apache#8847)
- HDDS-13464. Make ozone.snapshot.filtering.service.interval reconfigurable (apache#8825)
- HDDS-13473. Amend validation for OZONE_OM_SNAPSHOT_DB_MAX_OPEN_FILES (apache#8829)
- HDDS-13435. Add an OzoneManagerAuthorizer interface (apache#8840)
- HDDS-8565. Recon memory leak in NSSummary (apache#8823)
- HDDS-12852. Implement a sliding window counter utility (apache#8498)
- HDDS-12000. Add unit test for RatisContainerSafeModeRule and ECContainerSafeModeRule (apache#8801)
- HDDS-13092. Container scanner should trigger volume scan when marking a container unhealthy (apache#8603)
- HDDS-13070. OM Follower changes to create and place sst files from hardlink file. (apache#8761)
- HDDS-13482. Mark testWriteStateMachineDataIdempotencyWithClosedContainer as flaky
- HDDS-13481. Fix success latency metric in SCM panels of deletion grafana dashboard (apache#8835)
- HDDS-13468. Update default value of ozone.scm.ha.dbtransactionbuffer.flush.interval. (apache#8834)
- HDDS-13410. Control block deletion for each DN from SCM. (apache#8767)
- ...

Files listed with the merge:
- hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/container/ContainerReplicaInfo.java
- hadoop-ozone/cli-admin/src/main/java/org/apache/hadoop/hdds/scm/cli/container/ReconcileSubcommand.java
- hadoop-ozone/cli-admin/src/test/java/org/apache/hadoop/hdds/scm/cli/container/TestReconcileSubcommand.java
What changes were proposed in this pull request?
Control the number of delete-block requests sent to each DN in a cycle. With this in place, the total delete-block limit in SCM can safely be increased for each cycle, since SCM can distribute the requests across more DNs.
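For context, a hedged sketch of how the two related settings discussed in the review could be raised together once the per-DN cap is in place (the values are examples, and the OzoneConfiguration usage is illustrative rather than taken from this patch):

```java
import org.apache.hadoop.hdds.conf.OzoneConfiguration;

// Illustrative: raise the per-interval total together with a per-DN cap so the
// larger budget is spread across DNs instead of landing on a single one.
public final class DeletionLimitsConfigSketch {
  public static void main(String[] args) {
    OzoneConfiguration conf = new OzoneConfiguration();
    // Total blocks SCM may pick up per deletion interval (example value).
    conf.set("hdds.scm.block.deletion.per-interval.max", "2000000");
    // Per-DN cap per interval; with ~40 DNs this keeps the load evenly spread.
    conf.set("ozone.scm.block.deletion.max.blocks.per.dn", "100000");
    System.out.println(conf.get("ozone.scm.block.deletion.max.blocks.per.dn"));
  }
}
```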
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13410
How was this patch tested?
New unit test