From 763f5261a8c4a0ccb6c76a84b0e3ba63b7191cc3 Mon Sep 17 00:00:00 2001
From: Xintao
Date: Thu, 5 Aug 2021 16:11:01 +0800
Subject: [PATCH 1/4] Print the bad sst files and related information

Signed-off-by: Xintao
---
 tikv-control.md | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/tikv-control.md b/tikv-control.md
index fd70acaf1b163..69958f6c0e669 100644
--- a/tikv-control.md
+++ b/tikv-control.md
@@ -518,3 +518,29 @@ Type "I consent" to continue, anything else to exit: I consent
 > **Note**
 >
 > The command will expose data encryption keys as plaintext. In production, DO NOT redirect the output to a file. Even deleting the output file afterward may not cleanly wipe out the content from disk.
+
+### Print the bad sst files and related information
+
+Sometimes the TiKV process will panic because some sst files are damaged. You can use the `bad-ssts` command to print information about bad sst files. Before running this command, stop the running TiKV instance.
+
+```bash
+$ tikv-ctl bad-ssts --db --pd
+--------------------------------------------------------
+corruption info:
+data/tikv-21107/db/000014.sst: Corruption: Bad table magic number: expected 9863518390377041911, found 759105309091689679 in data/tikv-21107/db/000014.sst
+
+sst meta:
+14:552997[1 .. 5520]['0101' seq:1, type:1 .. '7A7480000000000000FF0F5F728000000000FF0002160000000000FAFA13AB33020BFFFA' seq:2032, type:1] at level 0 for Column family "default" (ID 0)
+it isn't easy to handle local data, start key:0101
+
+overlap region:
+RegionInfo { region: id: 4 end_key: 7480000000000000FF0500000000000000F8 region_epoch { conf_ver: 1 version: 2 } peers { id: 5 store_id: 1 }, leader: Some(id: 5 store_id: 1) }
+
+suggested operations:
+tikv-ctl ldb --db=data/tikv-21107/db unsafe_remove_sst_file "data/tikv-21107/db/000014.sst"
+tikv-ctl --db=data/tikv-21107/db tombstone -r 4 --pd
+--------------------------------------------------------
+corruption analysis has completed
+```
+
+The above output is an example. The command print corruption sst information first, and then print related meta information. Take the above output as an example: 14 means sst number, 552997 means file size, followed by the smallest and largest seqno and other meta information. This command will also try to get the region involved through PD server. Finally, you can clean up the bad ssts according to the suggested operations and restart the TiKV instance.

From cf72fe88d97f543bbea4ce479df90fcfa178a366 Mon Sep 17 00:00:00 2001
From: Xintao
Date: Mon, 9 Aug 2021 09:57:58 +0800
Subject: [PATCH 2/4] address comments

Signed-off-by: Xintao
---
 tikv-control.md | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/tikv-control.md b/tikv-control.md
index 69958f6c0e669..16744692d8122 100644
--- a/tikv-control.md
+++ b/tikv-control.md
@@ -519,9 +519,9 @@ Type "I consent" to continue, anything else to exit: I consent
 >
 > The command will expose data encryption keys as plaintext. In production, DO NOT redirect the output to a file. Even deleting the output file afterward may not cleanly wipe out the content from disk.
 
-### Print the bad sst files and related information
+### Print information related to damaged SST files
 
-Sometimes the TiKV process will panic because some sst files are damaged. You can use the `bad-ssts` command to print information about bad sst files. Before running this command, stop the running TiKV instance.
+Damaged SST files in TiKV might cause the TiKV process to panic. To clean up the damaged SST files, you will need the information of these files. To get the information, you can execute the `bad-ssts` command in TiKV Control. The needed information is shown in the output. The following is an example command and output.
 
 ```bash
 $ tikv-ctl bad-ssts --db --pd
@@ -543,4 +543,7 @@ tikv-ctl --db=data/tikv-21107/db tombstone -r 4 --pd
 corruption analysis has completed
 ```
 
-The above output is an example. The command print corruption sst information first, and then print related meta information. Take the above output as an example: 14 means sst number, 552997 means file size, followed by the smallest and largest seqno and other meta information. This command will also try to get the region involved through PD server. Finally, you can clean up the bad ssts according to the suggested operations and restart the TiKV instance.
+From the output above, you can see that the information of the damaged SST file is printed first and then the meta-information is printed.
++ In the `sst meta` part, `14` means the SST file number; `552997` means the file size, followed by the smallest and largest sequence numbers and other meta-information.
++ The `overlap region` part shows the information of the Region involved. This information is obtained through the PD server.
++ The `suggested operations` part provides you suggestion to clean up the damaged SST file. You can take the suggestion to clean up files and restart the TiKV instance.

From f16f9c4e9d3a34d52134c6c0c263e98c5f2bbbd8 Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Mon, 9 Aug 2021 10:17:19 +0800
Subject: [PATCH 3/4] separate command from output

---
 tikv-control.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tikv-control.md b/tikv-control.md
index 16744692d8122..03ec1c3bb1e2d 100644
--- a/tikv-control.md
+++ b/tikv-control.md
@@ -525,6 +525,9 @@ Damaged SST files in TiKV might cause the TiKV process to panic. To clean up the
 
 ```bash
 $ tikv-ctl bad-ssts --db --pd
+```
+
+```bash
 --------------------------------------------------------
 corruption info:
 data/tikv-21107/db/000014.sst: Corruption: Bad table magic number: expected 9863518390377041911, found 759105309091689679 in data/tikv-21107/db/000014.sst

From 88e9b916e7c3345197ee98348d98bdc4bf4de6de Mon Sep 17 00:00:00 2001
From: Xintao
Date: Mon, 9 Aug 2021 10:17:38 +0800
Subject: [PATCH 4/4] address comments

Signed-off-by: Xintao
---
 tikv-control.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tikv-control.md b/tikv-control.md
index 16744692d8122..1aa3b03356007 100644
--- a/tikv-control.md
+++ b/tikv-control.md
@@ -544,6 +544,7 @@ corruption analysis has completed
 ```
 
 From the output above, you can see that the information of the damaged SST file is printed first and then the meta-information is printed.
+
 + In the `sst meta` part, `14` means the SST file number; `552997` means the file size, followed by the smallest and largest sequence numbers and other meta-information.
 + The `overlap region` part shows the information of the Region involved. This information is obtained through the PD server.
 + The `suggested operations` part provides you suggestion to clean up the damaged SST file. You can take the suggestion to clean up files and restart the TiKV instance.
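
Note on the `sst meta` line: the documented text says the line starts with the SST file number and the file size, followed by the smallest and largest sequence numbers and other meta-information. The following is a minimal Python sketch that splits the sample line into those fields. It is not part of `tikv-ctl`; the helper name `parse_sst_meta`, the field names, and the regular expression are assumptions inferred only from the single sample line in this patch, so other output shapes may not match. Reading the two quoted hex strings as the smallest and largest keys is also an assumption.

```python
import re

# Hypothetical pattern, inferred from the one sample `sst meta` line in this patch.
SST_META = re.compile(
    r"^(?P<file_number>\d+):(?P<file_size>\d+)"
    r"\[(?P<smallest_seqno>\d+) \.\. (?P<largest_seqno>\d+)\]"
    r"\['(?P<smallest_key>[0-9A-Fa-f]*)' seq:\d+, type:\d+ \.\. "
    r"'(?P<largest_key>[0-9A-Fa-f]*)' seq:\d+, type:\d+\] "
    r'at level (?P<level>\d+) for Column family "(?P<cf>[^"]+)"'
)

INT_FIELDS = ("file_number", "file_size", "smallest_seqno", "largest_seqno", "level")


def parse_sst_meta(line):
    """Return the fields of an `sst meta` line as a dict, or None if it does not match."""
    match = SST_META.match(line.strip())
    if match is None:
        return None
    info = match.groupdict()
    for name in INT_FIELDS:
        info[name] = int(info[name])
    return info


# The sample line from the example output in this patch.
sample = (
    "14:552997[1 .. 5520]['0101' seq:1, type:1 .. "
    "'7A7480000000000000FF0F5F728000000000FF0002160000000000FAFA13AB33020BFFFA' "
    'seq:2032, type:1] at level 0 for Column family "default" (ID 0)'
)

meta = parse_sst_meta(sample)
print(meta["file_number"], meta["file_size"], meta["cf"])
```

Returning `None` for a non-matching line keeps the sketch conservative: a caller can fall back to reading the raw output instead of acting on a partial parse.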