HDDS-11714. resetDeletedBlockRetryCount with --all may fail and can cause long db lock in large cluster #7665
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/block/DeletedBlockLogImpl.java
    } while (!batch.isEmpty());
    } else {
    // Process txIDs provided by the user in batches
The user-provided list of txIDs reaches SCM via an RPC call, so it's OK to process it in a single go.
Done
…ause long db lock in large cluster
e2df43e to 88722d6
...hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMClientProtocolServer.java
...e/integration-test/src/test/java/org/apache/hadoop/hdds/scm/TestStorageContainerManager.java
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/block/DeletedBlockLogImpl.java
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/block/DeletedBlockLogImpl.java
@aryangupta1998 the test failure seems related to this change, can you take a look at it?
Thanks @nandakumar131, fixed the test case!
sadanand48
left a comment
LGTM
Thanks @aryangupta1998 for the contribution. Thanks @sadanand48 for the review.
* master: (168 commits)
  HDDS-12112. Fix interval used for Chunk Read/Write Dashboard (apache#7724)
  HDDS-12212. Fix grammar in decommissioning and observability documentation (apache#7815)
  HDDS-12195. Implement skip() in OzoneFSInputStream (apache#7801)
  HDDS-12200. Fix grammar in OM HA, EC and Snapshot doc (apache#7806)
  HDDS-12202. OpsCreate and OpsAppend metrics not incremented (apache#7811)
  HDDS-12203. Initialize block length before skip (apache#7809)
  HDDS-12183. Reuse cluster across safe test classes (apache#7793)
  HDDS-11714. resetDeletedBlockRetryCount with --all may fail and can cause long db lock in large cluster. (apache#7665)
  HDDS-12186. (addendum) Avoid array allocation for table iterator (apache#7799)
  HDDS-12186. Avoid array allocation for table iterator. (apache#7797)
  HDDS-11508. Decouple delete batch limits from Ratis request size for DirectoryDeletingService. (apache#7365)
  HDDS-12073. Don't show Source Bucket and Volume if null in DU metadata (apache#7760)
  HDDS-12142. Save logs from build check (apache#7782)
  HDDS-12163. Reduce number of individual getCapacity/getAvailable/getUsedSpace calls (apache#7790)
  HDDS-12176. Trivial dependency cleanup.(apache#7787)
  HDDS-12181. Bump jline to 3.29.0 (apache#7789)
  HDDS-12165. Refactor VolumeInfoMetrics to use getCurrentUsage (apache#7784)
  HDDS-12085. Add manual refresh button for DU page (apache#7780)
  HDDS-12132. Parameterize testUpdateTransactionInfoTable for SCM (apache#7768)
  HDDS-11277. Remove dependency on hadoop-hdfs in Ozone client (apache#7781)
  ...
Conflicts:
  hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java
  hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java
  hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/keyvalue/TestKeyValueHandler.java
  hadoop-ozone/dist/src/main/smoketest/admincli/container.robot
  hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/freon/ClosedContainerReplicator.java
…ause long db lock in large cluster. (apache#7665)
What changes were proposed in this pull request?
When resetDeletedBlockRetryCount is run with the --all option, SCM takes a lock, fetches all transactions that have reached the maximum retry count, and then updates the DB to set their retry count to 0. In large-scale environments the number of such transactions can be huge, which leads to multiple problems:
i) The lock is held long enough to block all other normal operations.
ii) The update is passed through Ratis, and the request fails because of the message size.
Instead, this operation should be done in batches, which avoids both the long lock and the Ratis message size failure.
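The batching idea described above could be sketched roughly as follows. This is a minimal illustration, not the code in DeletedBlockLogImpl: the class `BatchedRetryReset`, its in-memory `TreeMap` standing in for the SCM transaction table, the `maxRetry` threshold check, and the `batchSize` parameter are all hypothetical names chosen for the example. The point it shows is that the lock is taken and released once per batch, so other operations can interleave between batches and each update stays bounded in size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a deleted-block transaction log; illustration only. */
final class BatchedRetryReset {
  private final ReentrantLock lock = new ReentrantLock();
  // txID -> retry count. Assumption: a plain in-memory map replaces the real DB table.
  private final TreeMap<Long, Integer> retryCounts = new TreeMap<>();
  private final int maxRetry;

  public BatchedRetryReset(int maxRetry) {
    this.maxRetry = maxRetry;
  }

  public void put(long txId, int retryCount) {
    retryCounts.put(txId, retryCount);
  }

  public int get(long txId) {
    return retryCounts.get(txId);
  }

  /** Collects at most batchSize txIDs greater than startAfter that reached maxRetry. */
  private List<Long> nextBatch(long startAfter, int batchSize) {
    List<Long> batch = new ArrayList<>();
    for (Map.Entry<Long, Integer> e : retryCounts.tailMap(startAfter, false).entrySet()) {
      if (e.getValue() >= maxRetry) {
        batch.add(e.getKey());
        if (batch.size() == batchSize) {
          break;
        }
      }
    }
    return batch;
  }

  /** Resets every exhausted transaction to 0, holding the lock for one batch at a time. */
  public int resetAll(int batchSize) {
    int total = 0;
    long cursor = Long.MIN_VALUE; // assumes real txIDs are greater than Long.MIN_VALUE
    List<Long> batch;
    do {
      lock.lock();
      try {
        batch = nextBatch(cursor, batchSize);
        for (long txId : batch) {
          retryCounts.put(txId, 0); // in a real system this would be one bounded-size update
        }
      } finally {
        lock.unlock(); // released between batches so other operations can proceed
      }
      if (!batch.isEmpty()) {
        cursor = batch.get(batch.size() - 1);
        total += batch.size();
      }
    } while (!batch.isEmpty());
    return total;
  }

  public static void main(String[] args) {
    BatchedRetryReset log = new BatchedRetryReset(3);
    for (long id = 1; id <= 5; id++) {
      log.put(id, 3);
    }
    System.out.println("reset " + log.resetAll(2) + " transactions");
  }
}
```

The cursor keeps each scan from revisiting transactions that earlier batches already handled, and the loop ends when a scan returns an empty batch, mirroring the `while (!batch.isEmpty())` condition visible in the reviewed diff.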
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-11714
How was this patch tested?
Tested manually.