-
Notifications
You must be signed in to change notification settings - Fork 594
HDDS-12078. Improve container reconciliation CLIs #7944
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…prove-cli * HDDS-10239-container-reconciliation: (113 commits) HDDS-12361. Mark testGetSnapshotDiffReportJob as flaky HDDS-12375. Random object created and used only once (apache#7933) HDDS-12335. Fix ozone admin namespace summary to give complete output (apache#7908) HDDS-12364. Require Override annotation for overridden methods (apache#7923) HDDS-11530. Support listMultipartUploads max uploads and markers (apache#7817) HDDS-11867. Remove code paths for non-Ratis SCM. (apache#7911) HDDS-12363. Add import options to IntelliJ IDEA style settings (apache#7921) HDDS-12215. Mark testContainerStateMachineRestartWithDNChangePipeline as flaky HDDS-12331. BlockOutputStream.failedServers is not thread-safe (apache#7885) HDDS-12188. Move server-only upgrade classes from hdds-common to hdds-server-framework (apache#7903) HDDS-12362. Remove temporary checkstyle suppression file (apache#7920) HDDS-12343. Fix spotbugs warnings in Recon (apache#7902) HDDS-12286. Fix license headers and imports for ozone-tools (apache#7919) HDDS-12284. Fix license headers and imports for ozone-s3-secret-store (apache#7917) HDDS-12275. Fix license headers and imports for ozone-integration-test (apache#7904) HDDS-12164. Rename and deprecate DFSConfigKeysLegacy config keys (apache#7803) HDDS-12283. Fix license headers and imports for ozone-recon-codegen (apache#7916) HDDS-12330. Convert Volume, Bucket and Key count to use comma separated numbers (apache#7881) HDDS-12281. Fix license headers and imports for ozone-filesystem-hadoop3 (apache#7913) HDDS-12285. Fix license headers and imports for ozone-s3gateway (apache#7918) ... Conflicts: hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/container/ContainerReplicaInfo.java hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/server/JsonUtils.java hadoop-hdds/tools/src/main/java/org/apache/hadoop/hdds/scm/cli/container/ReconcileSubcommand.java
kerneltime
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer robot tests over integration tests. Overall seems fine, will go over a few more times.
...p-hdds/tools/src/main/java/org/apache/hadoop/hdds/scm/cli/container/ReconcileSubcommand.java
Outdated
Show resolved
Hide resolved
...p-hdds/tools/src/main/java/org/apache/hadoop/hdds/scm/cli/container/ReconcileSubcommand.java
Outdated
Show resolved
Hide resolved
| System.out.println("Use \"ozone admin container reconcile --status " + containerID + "\" to see the checksums of " + | ||
| "each container replica"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have a good way to indicate when reconciliation is completed... Maybe we add a reconciliation count as a metric to the container info?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, some sort of feedback that it finished would be good. I was planning to revisit this with the SCM integration. For example, SCM may end up tracking the known in-flight reconciliations and could report the results.
Are you suggesting adding a jmx metric or just a counter in the code? I'm not sure exactly how this work, but I imagine the counter would get persisted on the DN side and SCM would see the updates. Users would then look for the count to increase when they send in a command.
...p-hdds/tools/src/main/java/org/apache/hadoop/hdds/scm/cli/container/ReconcileSubcommand.java
Outdated
Show resolved
Hide resolved
Yes I will add robot tests as well. The current tests are unit tests without a mini ozone cluster. The scm client is mocked to return various combinations of replicas that would otherwise be hard to create in a mini ozone or docker cluster. |
|
HDDS-10239-container-reconciliation is merged into |
* master: (730 commits) HDDS-13083. Handle cases where block deletion generates tree file before scanner (apache#8565) HDDS-12982. Reduce log level for snapshot validation failure (apache#8851) HDDS-13396. Documentation: Improve the top-level overview page for new users. (apache#8753) HDDS-13176. containerIds table value format change to proto from string (apache#8589) HDDS-13449. Incorrect Interrupt Handling for DirectoryDeletingService and KeyDeletingService (apache#8817) HDDS-2453. Add Freon tests for S3 MPU Keys (apache#8803) HDDS-13237. Container data checksum should contain block IDs. (apache#8773) HDDS-13489. Fix SCMBlockdeleting unnecessary iteration in corner case. (apache#8847) HDDS-13464. Make ozone.snapshot.filtering.service.interval reconfigurable (apache#8825) HDDS-13473. Amend validation for OZONE_OM_SNAPSHOT_DB_MAX_OPEN_FILES (apache#8829) HDDS-13435. Add an OzoneManagerAuthorizer interface (apache#8840) HDDS-8565. Recon memory leak in NSSummary (apache#8823). HDDS-12852. Implement a sliding window counter utility (apache#8498) HDDS-12000. Add unit test for RatisContainerSafeModeRule and ECContainerSafeModeRule (apache#8801) HDDS-13092. Container scanner should trigger volume scan when marking a container unhealthy (apache#8603) HDDS-13070. OM Follower changes to create and place sst files from hardlink file. (apache#8761) HDDS-13482. Mark testWriteStateMachineDataIdempotencyWithClosedContainer as flaky HDDS-13481. Fix success latency metric in SCM panels of deletion grafana dashboard (apache#8835) HDDS-13468. Update default value of ozone.scm.ha.dbtransactionbuffer.flush.interval. (apache#8834) HDDS-13410. Control block deletion for each DN from SCM. (apache#8767) ... hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/scm/container/ContainerReplicaInfo.java hadoop-ozone/cli-admin/src/main/java/org/apache/hadoop/hdds/scm/cli/container/ReconcileSubcommand.java hadoop-ozone/cli-admin/src/test/java/org/apache/hadoop/hdds/scm/cli/container/TestReconcileSubcommand.java
Generated-By: Cursor
* HDDS-12078-improve-cli-ai: Use common mixin for container ID list
Generated-By: Cursor
This reverts commit 25c1b55.
Generated-By: Cursor
Generated-By: Cursor
|
The latest updates should have all functionality and unit tests. Just acceptance tests need to be updated. I've added container ID validation in the common |
Generated-By: Cursor
|
@aswinshakil I've updated the acceptance tests for the new CLI, although most of the testing is still happening in the unit test. PTAL |
...ne/cli-admin/src/main/java/org/apache/hadoop/hdds/scm/cli/container/ReconcileSubcommand.java
Outdated
Show resolved
Hide resolved
...ne/cli-admin/src/main/java/org/apache/hadoop/hdds/scm/cli/container/ReconcileSubcommand.java
Outdated
Show resolved
Hide resolved
...ne/cli-admin/src/main/java/org/apache/hadoop/hdds/scm/cli/container/ReconcileSubcommand.java
Outdated
Show resolved
Hide resolved
.../cli-admin/src/main/java/org/apache/hadoop/hdds/scm/cli/container/ContainerIDParameters.java
Outdated
Show resolved
Hide resolved
...ne/cli-admin/src/main/java/org/apache/hadoop/hdds/scm/cli/container/ReconcileSubcommand.java
Show resolved
Hide resolved
* master: (55 commits) HDDS-13525. Rename configuration property to ozone.om.compaction.service.enabled (apache#8928) HDDS-13519. Reconciliation should continue if a peer datanode is unreachable (apache#8908) HDDS-13566. Fix incorrect authorizer class in ACL documentation (apache#8931) HDDS-13084. Trigger on-demand container scan when a container moves from open to unhealthy. (apache#8904) HDDS-13432. Accelerating Namespace Usage Calculation in Recon using - Materialised Approach (apache#8797) HDDS-13557. Bump jline to 3.30.5 (apache#8920) HDDS-13556. Bump assertj-core to 3.27.4 (apache#8919) HDDS-13543. [Docs] Design doc for OM bootstrapping process with snapshots. (apache#8900) HDDS-13541. Bump sonar-maven-plugin to 5.1.0.4751 (apache#8911) HDDS-13101. Remove duplicate information in datanode list output (apache#8523) HDDS-13528. Handle null paths when the NSSummary is initializing (apache#8901) HDDS-12990. (addendum) Generate tree from metadata when it does not exist during getContainerChecksumInfo call (apache#8881) HDDS-13086. Block duplicate reconciliation requests for the same container and datanode within the datanode. (apache#8905) HDDS-12990. Generate tree from metadata when it doesn't exist during getContainerChecksumInfo call (apache#8881) HDDS-12824. Optimize container checksum read during datanode startup (apache#8604) HDDS-13522. Rename axisLabel for No. of delete request received (apache#8879) HDDS-12196. Document ozone repair cli (apache#8849) HDDS-13514. Intermittent failure in TestNSSummaryMemoryLeak (apache#8889) HDDS-13423. Log reason for triggering on-demand container scan (apache#8854) HDDS-13466. Disable flaky TestOmSnapshotFsoWithNativeLibWithLinkedBuckets ...
aswinshakil
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @errose28 . The changes look good to me. Let's wait for the CI to pass.
* master: HDDS-13553. Recon Staging DB for OM full db reprocess (apache#8917) HDDS-13138. [Docs] Update Topology Awareness user doc. (apache#8528) HDDS-11944. Usability improvements for Ozone tools. (apache#7597) HDDS-12197. Update documentation for all ozone debug tools (apache#8868)
|
I found more issues with the output after trying things manually, which should be fixed now. I also updated the jq queries used in acceptance tests that were previously failing. |
|
Overall, the changes look good to me. I was wondering if we should have a limit on the number list statuses so that we don't overwhelm the system. |
|
Each status is a separate call to SCM and the json is streamed as it is generated so it won't create a giant proto to send over the network or fill up client memory. The CLI to query status of multiple containers is a shorthand for running something like |
|
Thanks @aswinshakil for the review. |
What changes were proposed in this pull request?
Summary
Currently
ozone admin container reconcileis used to trigger reconciliation for a specific container, but the resulting checksums are only shown inozone admin container info --jsonwhich can be verbose and usually requiresjqfiltering to get the desired information.This PR adds new functionality to the
ozone admin container reconcile command:ozone admin container reconcile 1 2 3 4ozone admin container info:cat <id-file> | ozone admin container reconcile ---statusflag that prints replica and checksum information instead of sending reconcile commands:ozone admin container reconcile --status 1 2 3 4replicasMatchfield is set totrueorfalseon the client side, which can be used to quickly filter out containers with mismatched replicas.No proto changes are made. The same
ScmClient#getContainerandScmClient#getContainerReplicasmethods used byozone admin container infoare re-used forozone admin container reconcile --status. This command just provides a more filtered view of the results specific to reconciliation and adds thereplicasMatchfield.Examples
Querying status of one container:
Querying the status of multiple containers (json list output):
Given a list of container IDs in a file, output the IDs of all the ones with mismatched replicas:
Same as above, but now send the IDs of mismatched containers back in for reconciliation:
What is the link to the Apache JIRA
HDDS-12078
How was this patch tested?
--statusflag instead ofozone admin container info -json.