diff --git a/hadoop-hdds/docs/content/tools/Repair.md b/hadoop-hdds/docs/content/tools/Repair.md
new file mode 100644
index 000000000000..002b163773c7
--- /dev/null
+++ b/hadoop-hdds/docs/content/tools/Repair.md
@@ -0,0 +1,252 @@
+---
+title: "Ozone Repair"
+date: 2025-07-22
+summary: Advanced tool to repair Ozone.
+---
+
+
+Ozone Repair (`ozone repair`) is an advanced tool to repair Ozone. The nodes being repaired must be stopped before the tool is run.
+Note: All repair commands support a `--dry-run` option, which lets a user see what repair the command would perform without making any changes to the cluster.
+Use the `--force` flag to override the running service check in false-positive cases.
+
+```bash
+Usage: ozone repair [-hV] [--verbose] [-conf=]
+                    [-D=]... [COMMAND]
+Advanced tool to repair Ozone. The nodes being repaired must be stopped before
+the tool is run.
+      -conf=
+
+  -D, --set=
+
+  -h, --help      Show this help message and exit.
+  -V, --version   Print version information and exit.
+      --verbose   More verbose output. Show the stack trace of the errors.
+Commands:
+  datanode  Tools to repair Datanode
+  ldb       Operational tool to repair ldb.
+  om        Operational tool to repair OM.
+  scm       Operational tool to repair SCM.
+```
+For more detailed usage, see the output of `--help` for each of the subcommands.
+
+## ozone repair datanode
+Operational tool to repair datanode.
+
+### upgrade-container-schema
+Upgrade all schema V2 containers to schema V3 for a datanode in offline mode.
+Optionally takes the `--volume` option to specify which volume needs the upgrade.
+
+## ozone repair ldb
+Operational tool to repair ldb.
+
+### compact
+Compact a column family in the DB to clean up tombstones while the service is offline.
+```bash
+Usage: ozone repair ldb compact [-hV] [--dry-run] [--force] [--verbose]
+                                --cf= --db=
+CLI to compact a column-family in the DB while the service is offline.
+Note: If om.db is compacted with this tool then it will negatively impact the
+Ozone Manager's efficient snapshot diff.
+      --cf, --column-family, --column_family=
+                  Column family name
+      --db=       Database File Path
+```
+
+## ozone repair om
+Operational tool to repair OM.
+
+#### Subcommands under OM
+- fso-tree
+- snapshot
+- update-transaction
+- quota
+- compact
+- skip-ratis-transaction
+
+### fso-tree
+Identify and repair a disconnected FSO tree by marking unreferenced entries for deletion.
+Reports the reachable, unreachable (pending delete) and unreferenced (orphaned) directories and files.
+OM should be stopped while this tool is run.
+```bash
+Usage: ozone repair om fso-tree [-hV] [--dry-run] [--force] [--verbose]
+                                [-b=] --db=
+                                [-v=]
+Identify and repair a disconnected FSO tree by marking unreferenced entries for
+deletion. OM should be stopped while this tool is run.
+  -b, --bucket=
+                  Filter by bucket name
+      --db=       Path to OM RocksDB
+  -v, --volume=
+                  Filter by volume name. Add '/' before the volume name.
+```
+
+### snapshot
+Subcommand for all snapshot-related repairs.
+
+#### chain
+Update the global and path previous snapshots for a snapshot in case the snapshot chain is corrupted.
+```bash
+Usage: ozone repair om snapshot chain [-hV] [--dry-run] [--force] [--verbose]
+                                      --db=
+                                      --gp=
+                                      --pp=
+
+CLI to update global and path previous snapshot for a snapshot in case snapshot
+chain is corrupted.
+                  URI of the bucket (format: volume/bucket).
+                  Snapshot name to update
+      --db=       Database File Path
+      --gp, --global-previous=
+                  Global previous snapshotId to set for the given snapshot
+      --pp, --path-previous=
+                  Path previous snapshotId to set for the given snapshot
+```
+
+### update-transaction
+To avoid modifying Ratis logs and only update the latest applied transaction, use the `update-transaction` command.
+This updates the highest transaction index in the OM transaction info table.
+```bash
+Usage: ozone repair om update-transaction [-hV] [--dry-run] [--force]
+       [--verbose] --db= --index=
+       --term=
+CLI to update the highest index in transaction info table.
+      --db=       Database File Path
+      --index=
+                  Highest index to set. The input should be non-zero long
+                    integer.
+      --term=
+                  Highest term to set. The input should be non-zero long
+                    integer.
+```
+
+### quota
+Operational tool to repair quota in OM DB.
+
+#### start
+To trigger quota repair, use the `start` command.
+```bash
+Usage: ozone repair om quota start [-hV] [--dry-run] [--force] [--verbose]
+                                   [--buckets=]
+                                   [--service-host=]
+                                   [--service-id=]
+CLI to trigger quota repair.
+      --buckets=   start quota repair for specific buckets. Input will
+                     be list of uri separated by comma as
+                     //[,...]
+      --service-host=
+                   Ozone Manager Host. If OM HA is enabled, use
+                     --service-id instead. If you must use
+                     --service-host with OM HA, this must point
+                     directly to the leader OM. This option is
+                     required when --service-id is not provided or
+                     when HA is not enabled.
+      --service-id, --om-service-id=
+                   Ozone Manager Service ID
+```
+
+#### status
+Get the status of the last triggered quota repair.
+```bash
+Usage: ozone repair om quota status [-hV] [--verbose] [--service-host=]
+                                    [--service-id=]
+CLI to get the status of last trigger quota repair if available.
+      --service-host=
+                  Ozone Manager Host. If OM HA is enabled, use --service-id
+                    instead. If you must use --service-host with OM HA, this
+                    must point directly to the leader OM. This option is
+                    required when --service-id is not provided or when HA is
+                    not enabled.
+      --service-id, --om-service-id=
+                  Ozone Manager Service ID
+```
+
+### compact
+Compact a column family in the OM DB to clean up tombstones. The compaction happens asynchronously. Requires admin privileges.
+```bash
+Usage: ozone repair om compact [-hV] [--dry-run] [--force] [--verbose]
+                               --cf= [--node-id=]
+                               [--service-id=]
+CLI to compact a column family in the om.db. The compaction happens
+asynchronously. Requires admin privileges.
+      --cf, --column-family, --column_family=
+                  Column family name
+      --node-id=  NodeID of the OM for which db needs to be compacted.
+      --service-id, --om-service-id=
+                  Ozone Manager Service ID
+```
+
+### skip-ratis-transaction, srt
+Omit a raft log in a ratis segment file by replacing the specified index with a dummy EchoOM command.
+This is an offline tool meant to be used only when all 3 OMs crash on the same transaction.
+If the issue is isolated to one OM, manually copy the DB from a healthy OM instead.
+```bash
+Usage: ozone repair om skip-ratis-transaction [-hV] [--dry-run] [--force]
+       [--verbose] -b= --index= (-s= |
+       -d=)
+CLI to omit a raft log in a ratis segment file. The raft log at the index
+specified is replaced with an EchoOM command (which is a dummy command). It is
+an offline command i.e., doesn't require OM to be running. The command should
+be run for the same transaction on all 3 OMs only when all the OMs are crashing
+while applying the same transaction. If only one OM is crashing and the other
+OMs have executed the log successfully, then the DB should be manually copied
+from one of the good OMs to the crashing OM instead.
+  -b, --backup=   Directory to put the backup of the original
+                    repaired segment file before the repair.
+  -d, --ratis-log-dir=
+                  Path of the ratis log directory
+      --index=    Index of the failing transaction that should be
+                    removed
+  -s, --segment-path=
+                  Path of the input segment file
+```
+
+## ozone repair scm
+Operational tool to repair SCM.
+
+#### Subcommands under SCM
+- cert
+- update-transaction
+
+### cert
+Subcommand for all certificate-related repairs on SCM.
+
+#### recover
+Recover a deleted SCM certificate from RocksDB.
+```bash
+Usage: ozone repair scm cert recover [-hV] [--dry-run] [--force] [--verbose]
+                                     --db=
+Recover Deleted SCM Certificate from RocksDB
+      --db=       SCM DB Path
+```
+
+### update-transaction
+To avoid modifying Ratis logs and only update the latest applied transaction, use the `update-transaction` command.
+This updates the highest transaction index in the SCM transaction info table.
+```bash
+Usage: ozone repair scm update-transaction [-hV] [--dry-run] [--force]
+       [--verbose] --db= --index=
+       --term=
+CLI to update the highest index in transaction info table.
+      --db=       Database File Path
+      --index=
+                  Highest index to set. The input should be non-zero long
+                    integer.
+      --term=
+                  Highest term to set. The input should be non-zero long
+                    integer.
+```