@sodonnel
Contributor

What changes were proposed in this pull request?

We have seen some instances where delete container commands are picked from the command queue within the SCM-defined deadline but then run for a very long time in the handler. This causes SCM to think the delete has been dropped or has failed, when it is actually still running.

The causes of the slow running command could be:

  1. Something else has a lock on the container for a long time, blocking the delete operation
  2. Slow disk causing the removal of the container files to take a very long time.

To compound this problem, the incremental container report (ICR) confirming the delete is not sent until the very last stage of the delete process.

To combat this, two changes are included in this PR:

  1. Introduce a lock timeout of 60 seconds. If acquiring the lock and completing the pre-checks takes longer than this, the container delete is skipped.
  2. Move the ICR to immediately after the point where the container is removed from the container set. At this stage, there is no way to recover the container without a datanode (DN) restart, so it makes sense to inform SCM that the container is logically removed as soon as possible.
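
The two changes above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: the class `DeleteContainerSketch`, its methods, the `Outcome` enum, and the event strings are all hypothetical names, and the in-memory map only stands in for the datanode's container set. It assumes the container is guarded by a read/write lock, takes the write lock with a timeout, skips the delete if the lock wait plus pre-checks exceed the timeout, and records the ICR before the (possibly slow) file removal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: none of these names come from the Ozone code base.
class DeleteContainerSketch {

  enum Outcome { DELETED, SKIPPED_LOCK_TIMEOUT, NOT_FOUND }

  // Stand-in for the datanode's container set: container ID -> container lock.
  private final Map<Long, ReentrantReadWriteLock> containerSet =
      new ConcurrentHashMap<>();

  // Ordered record of what happened, standing in for real reports and file I/O.
  final List<String> events = new ArrayList<>();

  void addContainer(long id) {
    containerSet.put(id, new ReentrantReadWriteLock());
  }

  ReentrantReadWriteLock lockFor(long id) {
    return containerSet.get(id);
  }

  Outcome delete(long id, long timeoutMs) {
    ReentrantReadWriteLock lock = containerSet.get(id);
    if (lock == null) {
      return Outcome.NOT_FOUND;
    }
    long startNanos = System.nanoTime();
    boolean locked;
    try {
      // Wait for the write lock, but only up to the timeout.
      locked = lock.writeLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      // In this sketch an interrupted wait is treated like a skipped delete.
      Thread.currentThread().interrupt();
      return Outcome.SKIPPED_LOCK_TIMEOUT;
    }
    if (!locked) {
      return Outcome.SKIPPED_LOCK_TIMEOUT;
    }
    try {
      // Pre-checks would run here; lock wait and pre-checks share one budget.
      long elapsedMs =
          TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
      if (elapsedMs > timeoutMs) {
        return Outcome.SKIPPED_LOCK_TIMEOUT;
      }
      // Logical removal: after this the container cannot be recovered
      // without a restart, so the report is recorded immediately.
      containerSet.remove(id);
      events.add("ICR_SENT:" + id);
      // The slow part (removing files from disk) happens only afterwards.
      events.add("FILES_REMOVED:" + id);
      return Outcome.DELETED;
    } finally {
      lock.writeLock().unlock();
    }
  }

  public static void main(String[] args) {
    DeleteContainerSketch sketch = new DeleteContainerSketch();
    sketch.addContainer(1L);
    System.out.println(sketch.delete(1L, 60_000L) + " " + sketch.events);
  }
}
```

In this sketch a delete that cannot get the lock in time changes nothing, so a later retry of the same command starts from a clean state, and the report is always recorded before the file removal step.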

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-12114

How was this patch tested?

New unit test added.

@sodonnel sodonnel changed the title from "Hdds 12114" to "HDDS-12114. Prevent delete commands running after a long lock wait and send ICR earlier" Jan 20, 2025
@ivandika3 ivandika3 requested a review from xichen01 January 21, 2025 06:30
Contributor

@errose28 errose28 left a comment

Thanks @sodonnel, overall this looks good; just some minor comments.

@sodonnel
Contributor Author

@errose28 I believe I have addressed the comments. Please take another look.

Contributor

@errose28 errose28 left a comment

Thanks for the quick improvement @sodonnel. Optionally, you could fix the minor whitespace diff before merging:

public static final String
OZONE_RECOVERING_CONTAINER_TIMEOUT_DEFAULT = "20m";


Contributor

nit. whitespace diff

@sodonnel sodonnel merged commit b6cc4af into apache:master Jan 24, 2025
42 checks passed
sodonnel added a commit to sodonnel/hadoop-ozone that referenced this pull request Jan 30, 2025
…d send ICR earlier (apache#7726)

(cherry picked from commit b6cc4af)

 Conflicts:
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java
	hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/keyvalue/TestKeyValueHandler.java
nandakumar131 pushed a commit to nandakumar131/ozone that referenced this pull request Feb 10, 2025
ptlrs pushed a commit to ptlrs/ozone that referenced this pull request Mar 8, 2025
* CDPD-78092. HDDS-12114. Prevent delete commands running after a long lock wait and send ICR earlier (apache#7726)

(cherry picked from commit b6cc4af)

 Conflicts:
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java
	hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/keyvalue/TestKeyValueHandler.java

Change-Id: I62ffb7203f2af5be2901ef923f333de53bbc3656

* CDPD-78149. HDDS-12115. RM selects replicas to delete non-deterministically if nodes are overloaded (apache#7728)

(cherry picked from commit efd8adc)

 Conflicts:
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestRatisOverReplicationHandler.java

Change-Id: Ia3d54917c7c488a9b706f6ce941e7f466746d3bd

* CDPD-78286. HDDS-12135. Set RM default deadline to 12 minutes and datanode offset to 6 minutes (apache#7747)

(cherry picked from commit d7616ec)
Change-Id: I36f237705f5a94d453bcec72c32056c2be8f38ba

* CDPD-78213. HDDS-12127. RM should not expire pending deletes, but retry until delete is confirmed or node is dead (apache#7746)

(cherry picked from commit 04f6255)

 Conflicts:
	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ContainerReplicaPendingOps.java
        hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/balancer/TestMoveManager.java
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestContainerReplicaPendingOps.java
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestECContainerReplicaCount.java
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestReplicationManager.java
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestReplicationManagerScenarios.java
	hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/TestBlockDeletion.java

Change-Id: Ic01591f72706f2473c63dd2e44c3f2a94fb70d43

---------

Co-authored-by: Stephen O'Donnell <stephen.odonnell@gmail.com>
Cyrill pushed a commit to Cyrill/ozone that referenced this pull request Nov 10, 2025
Cyrill pushed a commit to Cyrill/ozone that referenced this pull request Nov 25, 2025
…d send ICR earlier (apache#7726)

(cherry picked from commit b6cc4af)

 Conflicts:
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java
	hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/keyvalue/TestKeyValueHandler.java