
@sodonnel

What changes were proposed in this pull request?

When the Replication Manager (RM) selects replicas to delete, it sorts them by datanode UUID and then iterates over the list. If a node is overloaded when its replica is selected for deletion, RM skips that replica and tries the next one in the list, rather than holding the delete for later. This can result in non-deterministic delete selection, which we want to avoid.

This PR changes that, so that the original replica is no longer skipped, but will be tried again on the next iteration.
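As a rough illustration of the difference, here is a minimal sketch that is not taken from the patch. The class name, method names, and the `overloaded` predicate are all hypothetical, and it assumes that "tried again on the next iteration" means the selection stops at the overloaded node so the same replica is the first candidate on the next RM pass.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;
import java.util.function.Predicate;

// Hypothetical sketch; none of these names come from the Ozone code base.
final class DeleteSelectionSketch {

  private DeleteSelectionSketch() { }

  // Old behaviour: an overloaded node is skipped and the next replica is tried,
  // so the chosen replica depends on which nodes happen to be busy right now.
  static List<UUID> selectSkipping(List<UUID> replicaNodes, int excess,
      Predicate<UUID> overloaded) {
    List<UUID> sorted = new ArrayList<>(replicaNodes);
    sorted.sort(Comparator.naturalOrder()); // order by datanode UUID
    List<UUID> toDelete = new ArrayList<>();
    for (UUID dn : sorted) {
      if (toDelete.size() == excess) {
        break;
      }
      if (overloaded.test(dn)) {
        continue; // skip and move on to the next replica
      }
      toDelete.add(dn);
    }
    return toDelete;
  }

  // Assumed new behaviour: stop at the overloaded node instead of skipping it,
  // so the same replica is the first candidate again on the next pass.
  static List<UUID> selectHolding(List<UUID> replicaNodes, int excess,
      Predicate<UUID> overloaded) {
    List<UUID> sorted = new ArrayList<>(replicaNodes);
    sorted.sort(Comparator.naturalOrder()); // order by datanode UUID
    List<UUID> toDelete = new ArrayList<>();
    for (UUID dn : sorted) {
      if (toDelete.size() == excess) {
        break;
      }
      if (overloaded.test(dn)) {
        break; // hold the delete; retry this replica later
      }
      toDelete.add(dn);
    }
    return toDelete;
  }
}
```

With three replicas and one excess copy, the skipping variant picks the second node whenever the first is overloaded, while the holding variant picks nothing until the first node can accept the command, so the eventual choice does not depend on transient load.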

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-12115

How was this patch tested?

A couple of new mis-replication tests were added, and a test that started failing after the change was modified to reflect the new intended behavior.


@siddhantsangwan siddhantsangwan left a comment


LGTM!

@siddhantsangwan siddhantsangwan merged commit efd8adc into apache:master Jan 25, 2025
42 checks passed
@siddhantsangwan

Thanks for the patch.

sodonnel added a commit to sodonnel/hadoop-ozone that referenced this pull request Jan 30, 2025
nandakumar131 pushed a commit to nandakumar131/ozone that referenced this pull request Feb 10, 2025
ptlrs pushed a commit to ptlrs/ozone that referenced this pull request Mar 8, 2025
* CDPD-78092. HDDS-12114. Prevent delete commands running after a long lock wait and send ICR earlier (apache#7726)

(cherry picked from commit b6cc4af)

 Conflicts:
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java
	hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/keyvalue/TestKeyValueHandler.java

Change-Id: I62ffb7203f2af5be2901ef923f333de53bbc3656

* CDPD-78149. HDDS-12115. RM selects replicas to delete non-deterministically if nodes are overloaded (apache#7728)

(cherry picked from commit efd8adc)

 Conflicts:
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestRatisOverReplicationHandler.java

Change-Id: Ia3d54917c7c488a9b706f6ce941e7f466746d3bd

* CDPD-78286. HDDS-12135. Set RM default deadline to 12 minutes and datanode offset to 6 minutes (apache#7747)

(cherry picked from commit d7616ec)
Change-Id: I36f237705f5a94d453bcec72c32056c2be8f38ba

* CDPD-78213. HDDS-12127. RM should not expire pending deletes, but retry until delete is confirmed or node is dead (apache#7746)

(cherry picked from commit 04f6255)

 Conflicts:
	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ContainerReplicaPendingOps.java
	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/balancer/TestMoveManager.java
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestContainerReplicaPendingOps.java
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestECContainerReplicaCount.java
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestReplicationManager.java
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestReplicationManagerScenarios.java
	hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/TestBlockDeletion.java

Change-Id: Ic01591f72706f2473c63dd2e44c3f2a94fb70d43

---------

Co-authored-by: Stephen O'Donnell <stephen.odonnell@gmail.com>
Cyrill pushed a commit to Cyrill/ozone that referenced this pull request Nov 10, 2025
Cyrill pushed a commit to Cyrill/ozone that referenced this pull request Nov 25, 2025

2 participants