From f1b196b59831e6a7880a1febf063b5c3f0ef2b1d Mon Sep 17 00:00:00 2001
From: shichun-0415
Date: Tue, 29 Nov 2022 14:47:24 +0800
Subject: [PATCH 1/9] br: backup checkpoint

---
 TOC.md                  |  1 +
 br/checkpoint-backup.md | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)
 create mode 100644 br/checkpoint-backup.md

diff --git a/TOC.md b/TOC.md
index e6f0701fbf313..ad7dd6feeb29c 100644
--- a/TOC.md
+++ b/TOC.md
@@ -479,6 +479,7 @@
 - BR Features
   - [Auto Tune](/br/br-auto-tune.md)
   - [Batch Create Table](/br/br-batch-create-table.md)
+  - [Checkpoint Backup](/br/checkpoint-backup.md)
 - References
   - [BR Design Principles](/br/backup-and-restore-design.md)
   - [BR Command-line](/br/use-br-command-line-tool.md)
diff --git a/br/checkpoint-backup.md b/br/checkpoint-backup.md
new file mode 100644
index 0000000000000..493500d41a56c
--- /dev/null
+++ b/br/checkpoint-backup.md
@@ -0,0 +1,40 @@
+---
+title: Checkpoint Backup
+summary: Learn about the checkpoint backup feature, including its application scenarios, usage, and implementation details.
+---
+
+# Checkpoint Backup New in v6.5.0
+
+Snapshot backup may end in advance due to recoverable errors, such as disk exhaustion and node down. Before TiDB v6.5.0, data that is backed up before the interruption would be invalidated after addressing the error, and you need to start the backup again. For large clusters, this results in noticeable extra cost.
+
+Since TiDB v6.5.0, Backup & Restore (BR) introduces checkpoint backup feature to allow continuing an interrupted backup. This feature is enabled by default. After this feature is enabled, most data of the interrupted backup is retained after an unexpected exit.
+
+## Application scenarios
+
+If your TiDB cluster is large and cannot tolerate backup again after a failure, you can enable the checkpoint backup feature. After this feature is enabled, br command-line tool (hereinafter referred to as `br`) periodically records the shards that have been backed up.
In this way, the next backup retry can use the backup progress close to the abnormal exit.
+
+## Usage limitations
+
+During the backup, `br` periodically updates the `gc-safe-point` of the backup snapshot in PD to avoid data being garbage collected. When `br` exits, the `gc-safe-point` cannot be updated in time. As a result, before the next retry backup, the data might have been garbage collected.
+
+To avoid this situation, `br` keeps the `gc-safe-point` for about one hour by default when `gcttl` is not specified. If you need to extend this time, you can set the `gcttl` parameter.
+
+The following example sets `gcttl` to 15 hours to extend the retention period of `gc-safe-point`:
+
+```shell
+br backup full \
+--storage local:///br_data/ --pd "${PD_IP}:2379" \
+--gcttl 54000
+```
+
+> **Note:**
+>
+> `gc-safe-point` created before backup is deleted after the snapshot backup is completed and you do not need to delete it manually.
+
+## Implementation details
+
+During snapshot backup, `br` encodes the tables into the corresponding key space, and generates backup RPC requests before sending them to TiKV nodes. After receiving the backup request, TiKV nodes back up the data within the requested range. Every time a TiKV node finishes backing up data of a Region, it returns the backup information of this range to `br`.
+
+By recording the information returned by TiKV nodes, `br` gets informed of the key ranges that have been backed up. The checkpoint backup feature periodically uploads the new backup information to external storage so that the key ranges that have been backed up can be persisted.
+
+When `br` retries the backup, it reads the key ranges that have been backed up from external storage, and compares them with the key ranges of the backup task. The differential data helps `br` to determine the data that still needs to be backed up in checkpoint backup.
From 21a5e79fcd1ba39d6ed95a8717b2e956118a5ec2 Mon Sep 17 00:00:00 2001
From: shichun-0415
Date: Tue, 29 Nov 2022 14:53:28 +0800
Subject: [PATCH 2/9] refine

---
 br/checkpoint-backup.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/br/checkpoint-backup.md b/br/checkpoint-backup.md
index 493500d41a56c..241529c85c074 100644
--- a/br/checkpoint-backup.md
+++ b/br/checkpoint-backup.md
@@ -5,9 +5,9 @@ summary: Learn about the checkpoint backup feature, including its application sc
 
 # Checkpoint Backup New in v6.5.0
 
-Snapshot backup may end in advance due to recoverable errors, such as disk exhaustion and node down. Before TiDB v6.5.0, data that is backed up before the interruption would be invalidated after addressing the error, and you need to start the backup again. For large clusters, this results in noticeable extra cost.
+Snapshot backup may end in advance due to recoverable errors, such as disk exhaustion and node down. Before TiDB v6.5.0, data that is backed up before the interruption would be invalidated after the error is addressed, and you need to start the backup again. For large clusters, this results in noticeable extra cost.
 
-Since TiDB v6.5.0, Backup & Restore (BR) introduces checkpoint backup feature to allow continuing an interrupted backup. This feature is enabled by default. After this feature is enabled, most data of the interrupted backup is retained after an unexpected exit.
+In TiDB v6.5.0, Backup & Restore (BR) introduces checkpoint backup feature to allow continuing an interrupted backup. This feature is enabled by default. After this feature is enabled, most data of the interrupted backup is retained after an unexpected exit.
 
 ## Application scenarios
 
@@ -15,7 +15,7 @@ If your TiDB cluster is large and cannot tolerate backup again after a failure,
 
 ## Usage limitations
 
-During the backup, `br` periodically updates the `gc-safe-point` of the backup snapshot in PD to avoid data being garbage collected.
When `br` exits, the `gc-safe-point` cannot be updated in time. As a result, before the next retry backup, the data might have been garbage collected.
+During the backup, `br` periodically updates the `gc-safe-point` of the backup snapshot in PD to avoid data being garbage collected. When `br` exits, the `gc-safe-point` cannot be updated in time. As a result, before the next backup retry, the data might have been garbage collected.
 
 To avoid this situation, `br` keeps the `gc-safe-point` for about one hour by default when `gcttl` is not specified. If you need to extend this time, you can set the `gcttl` parameter.
 
@@ -33,8 +33,8 @@ br backup full \
 
 ## Implementation details
 
-During snapshot backup, `br` encodes the tables into the corresponding key space, and generates backup RPC requests before sending them to TiKV nodes. After receiving the backup request, TiKV nodes back up the data within the requested range. Every time a TiKV node finishes backing up data of a Region, it returns the backup information of this range to `br`.
+During a snapshot backup, `br` encodes the tables into the corresponding key space, and generates backup RPC requests before sending them to TiKV nodes. After receiving the backup request, TiKV nodes back up the data within the requested range. Every time a TiKV node finishes backing up data of a Region, it returns the backup information of this range to `br`.
 
-By recording the information returned by TiKV nodes, `br` gets informed of the key ranges that have been backed up. The checkpoint backup feature periodically uploads the new backup information to external storage so that the key ranges that have been backed up can be persisted.
+`br` records the information returned by TiKV nodes, which helps `br` get informed of the key ranges that have been backed up. The checkpoint backup feature periodically uploads the new backup information to external storage so that the key ranges that have been backed up can be persisted.
 
 When `br` retries the backup, it reads the key ranges that have been backed up from external storage, and compares them with the key ranges of the backup task. The differential data helps `br` to determine the data that still needs to be backed up in checkpoint backup.
From f86ff1949bec21acce3d7b8917a8e1b6fab02854 Mon Sep 17 00:00:00 2001
From: shichun-0415
Date: Tue, 29 Nov 2022 17:37:11 +0800
Subject: [PATCH 3/9] fix gc-safepoint

---
 TOC.md                                       | 2 +-
 br/{checkpoint-backup.md => br-checkpointmd} | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)
 rename br/{checkpoint-backup.md => br-checkpointmd} (78%)

diff --git a/TOC.md b/TOC.md
index ad7dd6feeb29c..1d17ad5c70a35 100644
--- a/TOC.md
+++ b/TOC.md
@@ -479,7 +479,7 @@
 - BR Features
   - [Auto Tune](/br/br-auto-tune.md)
   - [Batch Create Table](/br/br-batch-create-table.md)
-  - [Checkpoint Backup](/br/checkpoint-backup.md)
+  - [Checkpoint Backup](/br/br-checkpoint.md)
 - References
   - [BR Design Principles](/br/backup-and-restore-design.md)
   - [BR Command-line](/br/use-br-command-line-tool.md)
diff --git a/br/checkpoint-backup.md b/br/br-checkpointmd
similarity index 78%
rename from br/checkpoint-backup.md
rename to br/br-checkpointmd
index 241529c85c074..7fd0303c3ef25 100644
--- a/br/checkpoint-backup.md
+++ b/br/br-checkpointmd
@@ -15,11 +15,11 @@ If your TiDB cluster is large and cannot tolerate backup again after a failure,
 
 ## Usage limitations
 
-During the backup, `br` periodically updates the `gc-safe-point` of the backup snapshot in PD to avoid data being garbage collected. When `br` exits, the `gc-safe-point` cannot be updated in time. As a result, before the next backup retry, the data might have been garbage collected.
+During the backup, `br` periodically updates the `gc-safepoint` of the backup snapshot in PD to avoid data being garbage collected. When `br` exits, the `gc-safepoint` cannot be updated in time. As a result, before the next backup retry, the data might have been garbage collected.
 
-To avoid this situation, `br` keeps the `gc-safe-point` for about one hour by default when `gcttl` is not specified. If you need to extend this time, you can set the `gcttl` parameter.
+To avoid this situation, `br` keeps the `gc-safepoint` for about one hour by default when `gcttl` is not specified. If you need to extend this time, you can set the `gcttl` parameter.
 
-The following example sets `gcttl` to 15 hours to extend the retention period of `gc-safe-point`:
+The following example sets `gcttl` to 15 hours to extend the retention period of `gc-safepoint`:
 
 ```shell
 br backup full \
@@ -29,7 +29,7 @@ br backup full \
 
 > **Note:**
 >
-> `gc-safe-point` created before backup is deleted after the snapshot backup is completed and you do not need to delete it manually.
+> `gc-safepoint` created before backup is deleted after the snapshot backup is completed and you do not need to delete it manually.
 
 ## Implementation details
 
From fc0dbfbf93871da665996fda32cd04834909eb17 Mon Sep 17 00:00:00 2001
From: shichun-0415
Date: Tue, 29 Nov 2022 19:01:06 +0800
Subject: [PATCH 4/9] fix file name

---
 br/{br-checkpointmd => br-checkpoint.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename br/{br-checkpointmd => br-checkpoint.md} (100%)

diff --git a/br/br-checkpointmd b/br/br-checkpoint.md
similarity index 100%
rename from br/br-checkpointmd
rename to br/br-checkpoint.md
From 1935c1f412e8560d17cdd4e38b4f7056363dee35 Mon Sep 17 00:00:00 2001
From: shichun-0415
Date: Tue, 29 Nov 2022 19:06:08 +0800
Subject: [PATCH 5/9] address comment

---
 br/br-checkpoint.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/br/br-checkpoint.md b/br/br-checkpoint.md
index 7fd0303c3ef25..2657d92f5376a 100644
--- a/br/br-checkpoint.md
+++ b/br/br-checkpoint.md
@@ -5,13 +5,13 @@ summary: Learn about the checkpoint backup feature, including its application sc
 
 # Checkpoint Backup New in v6.5.0
 
-Snapshot backup may end in advance due to recoverable errors, such as disk
exhaustion and node down. Before TiDB v6.5.0, data that is backed up before the interruption would be invalidated after the error is addressed, and you need to start the backup again. For large clusters, this results in noticeable extra cost.
+Snapshot backup might end in advance due to recoverable errors, such as disk exhaustion and node crash. Before TiDB v6.5.0, data that is backed up before the interruption would be invalidated after the error is addressed, and you need to start the backup from scratch. For large clusters, this results in considerable extra cost.
 
 In TiDB v6.5.0, Backup & Restore (BR) introduces checkpoint backup feature to allow continuing an interrupted backup. This feature is enabled by default. After this feature is enabled, most data of the interrupted backup is retained after an unexpected exit.
 
 ## Application scenarios
 
-If your TiDB cluster is large and cannot tolerate backup again after a failure, you can enable the checkpoint backup feature. After this feature is enabled, br command-line tool (hereinafter referred to as `br`) periodically records the shards that have been backed up. In this way, the next backup retry can use the backup progress close to the abnormal exit.
+If your TiDB cluster is large and cannot afford to back up again after a failure, you can enable the checkpoint backup feature. After this feature is enabled, br command-line tool (hereinafter referred to as `br`) periodically records the shards that have been backed up. In this way, the next backup retry can use the backup progress close to the abnormal exit.
 
 ## Usage limitations
 
@@ -19,7 +19,7 @@ During the backup, `br` periodically updates the `gc-safepoint` of the backup sn
 
 To avoid this situation, `br` keeps the `gc-safepoint` for about one hour by default when `gcttl` is not specified. If you need to extend this time, you can set the `gcttl` parameter.
 
-The following example sets `gcttl` to 15 hours to extend the retention period of `gc-safepoint`:
+The following example sets `gcttl` to 15 hours (54000 seconds) to extend the retention period of `gc-safepoint`:
 
 ```shell
 br backup full \
@@ -29,7 +29,7 @@ br backup full \
 
 > **Note:**
 >
-> `gc-safepoint` created before backup is deleted after the snapshot backup is completed and you do not need to delete it manually.
+> `gc-safepoint` created before backup is deleted after the snapshot backup is completed. You do not need to delete it manually.
 
 ## Implementation details
 
From b8fac10487aac9af1133bc6a6dd0e6dd317af1c6 Mon Sep 17 00:00:00 2001
From: shichun-0415 <89768198+shichun-0415@users.noreply.github.com>
Date: Wed, 30 Nov 2022 11:27:04 +0800
Subject: [PATCH 6/9] Apply suggestions from code review

Co-authored-by: xixirangrang
---
 br/br-checkpoint.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/br/br-checkpoint.md b/br/br-checkpoint.md
index 2657d92f5376a..8fc72aace86cf 100644
--- a/br/br-checkpoint.md
+++ b/br/br-checkpoint.md
@@ -5,19 +5,19 @@ summary: Learn about the checkpoint backup feature, including its application sc
 
 # Checkpoint Backup New in v6.5.0
 
-Snapshot backup might end in advance due to recoverable errors, such as disk exhaustion and node crash. Before TiDB v6.5.0, data that is backed up before the interruption would be invalidated after the error is addressed, and you need to start the backup from scratch. For large clusters, this results in considerable extra cost.
+Snapshot backup might be interrupted due to recoverable errors, such as disk exhaustion and node crash. Before TiDB v6.5.0, data that is backed up before the interruption would be invalidated even after the error is addressed, and you need to start the backup from scratch. For large clusters, this incurs considerable extra cost.
 
-In TiDB v6.5.0, Backup & Restore (BR) introduces checkpoint backup feature to allow continuing an interrupted backup.
This feature is enabled by default. After this feature is enabled, most data of the interrupted backup is retained after an unexpected exit.
+In TiDB v6.5.0, Backup & Restore (BR) introduces the checkpoint backup feature to allow continuing an interrupted backup. This feature is enabled by default. After this feature is enabled, most data of the interrupted backup can be retained.
 
 ## Application scenarios
 
-If your TiDB cluster is large and cannot afford to back up again after a failure, you can enable the checkpoint backup feature. After this feature is enabled, br command-line tool (hereinafter referred to as `br`) periodically records the shards that have been backed up. In this way, the next backup retry can use the backup progress close to the abnormal exit.
+If your TiDB cluster is large and cannot afford to back up again after a failure, you can use the checkpoint backup feature. The br command-line tool (hereinafter referred to as `br`) periodically records the shards that have been backed up. In this way, the next backup retry can use the backup progress close to the abnormal exit.
 
 ## Usage limitations
 
 During the backup, `br` periodically updates the `gc-safepoint` of the backup snapshot in PD to avoid data being garbage collected. When `br` exits, the `gc-safepoint` cannot be updated in time. As a result, before the next backup retry, the data might have been garbage collected.
 
-To avoid this situation, `br` keeps the `gc-safepoint` for about one hour by default when `gcttl` is not specified. If you need to extend this time, you can set the `gcttl` parameter.
+To avoid this situation, `br` keeps the `gc-safepoint` for about one hour by default when `gcttl` is not specified. You can set the `gcttl` parameter to extend the retention period if needed .
 
 The following example sets `gcttl` to 15 hours (54000 seconds) to extend the retention period of `gc-safepoint`:
 
@@ -35,6 +35,6 @@ br backup full \
 
 During a snapshot backup, `br` encodes the tables into the corresponding key space, and generates backup RPC requests before sending them to TiKV nodes. After receiving the backup request, TiKV nodes back up the data within the requested range. Every time a TiKV node finishes backing up data of a Region, it returns the backup information of this range to `br`.
 
-`br` records the information returned by TiKV nodes, which helps `br` get informed of the key ranges that have been backed up. The checkpoint backup feature periodically uploads the new backup information to external storage so that the key ranges that have been backed up can be persisted.
+`br` records the information returned by TiKV nodes, which helps `br` get the key ranges that have been backed up. The checkpoint backup feature periodically uploads the new backup information to external storage so that the key ranges that have been backed up can be persisted.
 
-When `br` retries the backup, it reads the key ranges that have been backed up from external storage, and compares them with the key ranges of the backup task. The differential data helps `br` to determine the data that still needs to be backed up in checkpoint backup.
+When `br` retries the backup, it reads the key ranges that have been backed up from external storage, and compares them with the key ranges of the backup task. The differential data helps `br` to determine the key range that still needs to be backed up in checkpoint backup.
From aca640b5e14632e312fd641b41160c2057875761 Mon Sep 17 00:00:00 2001
From: shichun-0415
Date: Thu, 1 Dec 2022 13:11:50 +0800
Subject: [PATCH 7/9] Update br-checkpoint.md

---
 br/br-checkpoint.md | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/br/br-checkpoint.md b/br/br-checkpoint.md
index 8fc72aace86cf..653fb22d8c7f7 100644
--- a/br/br-checkpoint.md
+++ b/br/br-checkpoint.md
@@ -3,7 +3,7 @@ title: Checkpoint Backup
 summary: Learn about the checkpoint backup feature, including its application scenarios, usage, and implementation details.
 ---
 
-# Checkpoint Backup New in v6.5.0
+# Checkpoint Backup
 
 Snapshot backup might be interrupted due to recoverable errors, such as disk exhaustion and node crash. Before TiDB v6.5.0, data that is backed up before the interruption would be invalidated even after the error is addressed, and you need to start the backup from scratch. For large clusters, this incurs considerable extra cost.
 
@@ -15,6 +15,10 @@ If your TiDB cluster is large and cannot afford to back up again after a failure
 
 ## Usage limitations
 
+Checkpoint backup relies on the GC mechanism and cannot recover all data that has been backed up. The following sections provide the details.
+
+### Backup retry must be prior to GC
+
 During the backup, `br` periodically updates the `gc-safepoint` of the backup snapshot in PD to avoid data being garbage collected. When `br` exits, the `gc-safepoint` cannot be updated in time. As a result, before the next backup retry, the data might have been garbage collected.
 
 To avoid this situation, `br` keeps the `gc-safepoint` for about one hour by default when `gcttl` is not specified. You can set the `gcttl` parameter to extend the retention period if needed .
@@ -31,6 +35,14 @@ br backup full \
 >
 > `gc-safepoint` created before backup is deleted after the snapshot backup is completed. You do not need to delete it manually.
 
+### Some data needs to be backed up again
+
+When `br` retries backup, some data that has been backed up might need to be backed up again, including the data being backed up and the data not recorded by the checkpoint.
+
+- If the interruption is caused by an error, `br` will persist the meta information of the data backed up before exit. In this case, only the data being backed up needs to be backed up again in the next retry.
+
+- If the `br` process is interrupted by the system, `br` cannot persist the meta information of the data backed up to the external storage. Since `br` persists the meta information every 30 seconds, data backed up in the last 30 seconds before interruption cannot be persisted and need to be backed up again in the next retry.
+
 ## Implementation details
 
 During a snapshot backup, `br` encodes the tables into the corresponding key space, and generates backup RPC requests before sending them to TiKV nodes. After receiving the backup request, TiKV nodes back up the data within the requested range. Every time a TiKV node finishes backing up data of a Region, it returns the backup information of this range to `br`.
From 78e2f8360a9672dab4b8c309df7e33df57af4532 Mon Sep 17 00:00:00 2001
From: shichun-0415
Date: Thu, 1 Dec 2022 16:41:06 +0800
Subject: [PATCH 8/9] wording

---
 br/br-checkpoint.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/br/br-checkpoint.md b/br/br-checkpoint.md
index 653fb22d8c7f7..b2ba12f80dfeb 100644
--- a/br/br-checkpoint.md
+++ b/br/br-checkpoint.md
@@ -33,7 +33,7 @@ br backup full \
 
 > **Note:**
 >
-> `gc-safepoint` created before backup is deleted after the snapshot backup is completed. You do not need to delete it manually.
+> The `gc-safepoint` created before backup is deleted after the snapshot backup is completed. You do not need to delete it manually.
 
 ### Some data needs to be backed up again
 
From f5fdd54b9bfca2e62919e56b94d263cf918e1da1 Mon Sep 17 00:00:00 2001
From: shichun-0415 <89768198+shichun-0415@users.noreply.github.com>
Date: Thu, 1 Dec 2022 21:46:47 +0800
Subject: [PATCH 9/9] Apply suggestions from code review

Co-authored-by: xixirangrang
---
 br/br-checkpoint.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/br/br-checkpoint.md b/br/br-checkpoint.md
index b2ba12f80dfeb..a5ce426455703 100644
--- a/br/br-checkpoint.md
+++ b/br/br-checkpoint.md
@@ -41,7 +41,7 @@ When `br` retries backup, some data that has been backed up might need to be bac
 
 - If the interruption is caused by an error, `br` will persist the meta information of the data backed up before exit. In this case, only the data being backed up needs to be backed up again in the next retry.
 
-- If the `br` process is interrupted by the system, `br` cannot persist the meta information of the data backed up to the external storage. Since `br` persists the meta information every 30 seconds, data backed up in the last 30 seconds before interruption cannot be persisted and need to be backed up again in the next retry.
+- If the `br` process is interrupted by the system, `br` cannot persist the meta information of the data backed up to the external storage. Since `br` persists the meta information every 30 seconds, data backed up in the last 30 seconds before interruption cannot be persisted and needs to be backed up again in the next retry.
 
 ## Implementation details
 
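
Reviewer note on the "Implementation details" section: the doc says that on retry `br` compares the key ranges already persisted to external storage with the key ranges of the backup task, and backs up only the difference. The snippet below is a minimal sketch of that comparison, not `br`'s actual code. The function name `remaining_ranges`, the `KeyRange` alias, and the use of bounded, half-open `[start, end)` byte ranges are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: a half-open key range [start, end).
KeyRange = Tuple[bytes, bytes]


def remaining_ranges(task: KeyRange, done: List[KeyRange]) -> List[KeyRange]:
    """Return the parts of `task` that no range in `done` covers."""
    start, end = task
    gaps: List[KeyRange] = []
    cursor = start  # everything in [start, cursor) is already accounted for
    for d_start, d_end in sorted(done):
        if d_end <= cursor:
            continue  # this checkpointed range lies entirely behind the cursor
        if d_start >= end:
            break  # this and all later ranges start past the task range
        if d_start > cursor:
            gaps.append((cursor, d_start))  # uncovered gap before this range
        cursor = max(cursor, d_end)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end))  # uncovered tail of the task range
    return gaps


if __name__ == "__main__":
    checkpointed = [(b"c", b"f"), (b"a", b"b"), (b"e", b"k")]
    print(remaining_ranges((b"a", b"z"), checkpointed))
```

Sorting the checkpointed ranges first lets a single cursor sweep handle overlapping and out-of-order entries. A real implementation would also need to handle unbounded ranges, which this sketch deliberately leaves out.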