@sodonnel
Contributor

What changes were proposed in this pull request?

We recently found that delete commands can run for a long time once picked off the queue. With the default 10 minute deadline on SCM, and a datanode deadline that is 30 seconds shorter, commands that are still running on a datanode can be seen as expired by SCM.

This PR makes the defaults less aggressive, giving an SCM / RM timeout of 12 minutes and a datanode timeout of 6 minutes. That way, commands have longer to be processed before RM resends them.
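To make the arithmetic concrete, here is a minimal sketch of how the two values relate, assuming the datanode deadline is simply the SCM deadline minus the datanode offset (which matches the old "30 seconds less" default and the new 12 minute / 6 minute pair). The class and method names are hypothetical and are not taken from the Ozone code base.

```java
import java.time.Duration;

public class DeadlineSketch {

    // Hypothetical helper, not an Ozone API: assumes the datanode deadline
    // is the SCM deadline minus the configured datanode offset.
    static Duration datanodeDeadline(Duration scmDeadline, Duration datanodeOffset) {
        if (datanodeOffset.compareTo(scmDeadline) >= 0) {
            throw new IllegalArgumentException(
                "datanode offset must be smaller than the SCM deadline");
        }
        return scmDeadline.minus(datanodeOffset);
    }

    public static void main(String[] args) {
        // Old defaults: 10 minute SCM deadline, 30 second offset.
        Duration oldDn = datanodeDeadline(Duration.ofMinutes(10), Duration.ofSeconds(30));
        // New defaults from this PR: 12 minute SCM deadline, 6 minute offset.
        Duration newDn = datanodeDeadline(Duration.ofMinutes(12), Duration.ofMinutes(6));

        System.out.println("old datanode deadline: " + oldDn); // PT9M30S
        System.out.println("new datanode deadline: " + newDn); // PT6M
    }
}
```

Under this assumption, the window between the datanode deadline and the SCM deadline grows from 30 seconds to 6 minutes, so a command picked off the queue just before its datanode deadline has much longer to finish before SCM treats it as expired.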

With the throttling that RM employs, there should not be a large number of commands on the queue anyway, as the goal of RM is to schedule only the number of commands which can be processed in a heartbeat or two.

Related Jiras: HDDS-12127, HDDS-12115, HDDS-12114

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-12135

How was this patch tested?

Simple config change. No new tests added or modified.

@adoroszlai merged commit d7616ec into apache:master Jan 26, 2025
42 checks passed
@adoroszlai
Contributor

Thanks @sodonnel for the patch.

sodonnel added a commit to sodonnel/hadoop-ozone that referenced this pull request Jan 30, 2025
nandakumar131 pushed a commit to nandakumar131/ozone that referenced this pull request Feb 10, 2025
ptlrs pushed a commit to ptlrs/ozone that referenced this pull request Mar 8, 2025
* CDPD-78092. HDDS-12114. Prevent delete commands running after a long lock wait and send ICR earlier (apache#7726)

(cherry picked from commit b6cc4af)

 Conflicts:
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java
	hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/keyvalue/TestKeyValueHandler.java

Change-Id: I62ffb7203f2af5be2901ef923f333de53bbc3656

* CDPD-78149. HDDS-12115. RM selects replicas to delete non-deterministically if nodes are overloaded (apache#7728)

(cherry picked from commit efd8adc)

 Conflicts:
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestRatisOverReplicationHandler.java

Change-Id: Ia3d54917c7c488a9b706f6ce941e7f466746d3bd

* CDPD-78286. HDDS-12135. Set RM default deadline to 12 minutes and datanode offset to 6 minutes (apache#7747)

(cherry picked from commit d7616ec)
Change-Id: I36f237705f5a94d453bcec72c32056c2be8f38ba

* CDPD-78213. HDDS-12127. RM should not expire pending deletes, but retry until delete is confirmed or node is dead (apache#7746)

(cherry picked from commit 04f6255)

 Conflicts:
	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ContainerReplicaPendingOps.java
	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/balancer/TestMoveManager.java
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestContainerReplicaPendingOps.java
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestECContainerReplicaCount.java
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestReplicationManager.java
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestReplicationManagerScenarios.java
	hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/TestBlockDeletion.java

Change-Id: Ic01591f72706f2473c63dd2e44c3f2a94fb70d43

---------

Co-authored-by: Stephen O'Donnell <stephen.odonnell@gmail.com>
Cyrill pushed a commit to Cyrill/ozone that referenced this pull request Nov 25, 2025