From 49b38d948464067e0288f413a57ace668b2a4fbf Mon Sep 17 00:00:00 2001
From: Fendy
Date: Tue, 14 Sep 2021 12:52:56 +0800
Subject: [PATCH 01/17] update interruption-EN

---
 ticdc/manage-ticdc.md       |   2 +-
 ticdc/troubleshoot-ticdc.md | 147 ++++++++++++++++++++----------------
 2 files changed, 84 insertions(+), 65 deletions(-)

diff --git a/ticdc/manage-ticdc.md b/ticdc/manage-ticdc.md
index 18c6682d30d7f..b596a8c4d8bc4 100644
--- a/ticdc/manage-ticdc.md
+++ b/ticdc/manage-ticdc.md
@@ -787,4 +787,4 @@ In the output of the above command, if the value of `sort-engine` is "unified",
 > + If your servers use mechanical hard drives or other storage devices that have high latency or limited bandwidth, use the unified sorter with caution.
 > + The total free capacity of hard drives must be greater than or equal to 500G. If you need to replicate a large amount of historical data, make sure that the free capacity on each node is greater than or equal to the size of the incremental data that needs to be replicated.
 > + Unified sorter is enabled by default. If your servers do not match the above requirements and you want to disable the unified sorter, you need to manually set `sort-engine` to `memory` for the changefeed.
-> + To enable Unified Sorter on an existing changefeed, see the methods provided in [How do I handle the OOM that occurs after TiCDC is restarted after a task interruption?](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-the-oom-that-occurs-after-ticdc-is-restarted-after-a-task-interruption).
+> + To enable Unified Sorter on an existing changefeed, see the methods provided in [How do I handle the OOM that occurs after TiCDC is restarted after a task interruption?](/ticdc/troubleshoot-ticdc.md#what-should-i-do-to-handle-the-oom-that-occurs-after-ticdc-is-restarted-after-a-task-interruption)
diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index 8452381f18b90..35df462b6bd9d 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -24,13 +24,74 @@ If you do not specify `start-ts`, or specify `start-ts` as `0`, when a replicati

 When you execute `cdc cli changefeed create` to create a replication task, TiCDC checks whether the upstream tables meet the [replication restrictions](/ticdc/ticdc-overview.md#restrictions). If some tables do not meet the restrictions, `some tables are not eligible to replicate` is returned with a list of ineligible tables. You can choose `Y` or `y` to continue creating the task, and all updates on these tables are automatically ignored during the replication. If you choose an input other than `Y` or `y`, the replication task is not created.

-## How do I handle replication interruption?
+## How do I view the state of TiCDC replication tasks?
+
+You can use `cdc cli` to view the state of TiCDC replication tasks. For example:
+
+{{< copyable "shell-regular" >}}
+
+```shell
+cdc cli changefeed list --pd=http://10.0.10.25:2379
+```
+
+The expected output is as follows:
+
+```json
+[{
+    "id": "4e24dde6-53c1-40b6-badf-63620e4940dc",
+    "summary": {
+      "state": "normal",
+      "tso": 417886179132964865,
+      "checkpoint": "2020-07-07 16:07:44.881",
+      "error": null
+    }
+}]
+```
+
+* `checkpoint`: TiCDC has replicated all data before this timestamp to downstream.
+* `state`: The state of this replication task:
+    * `normal`: The task runs normally.
+    * `stopped`: The task is stopped manually or encounters an error.
+    * `removed`: The task is removed.
+
+> **Note:**
+>
+> This feature is introduced in TiCDC version 4.0.3.
+
+## TiCDC replication interruptions
+
+### How do I know whether a TiCDC replication task is interrupted?
+
+- Check the `changefeed checkpoint` monitoring metric of the replication task (choose the right `changefeed id`) in the Grafana dashboard. If the metric value stays unchanged, or the `checkpoint lag` metric keeps increasing, the replication task might be interrupted.
+- Check the `exit error count` monitoring metric. If the metric value is greater than `0`, an error has occurred in the replication task.
+- Execute `cdc cli changefeed list` and `cdc cli changefeed query` to check the status of the replication task. `stopped` means the task has stopped and the `error` item provides the detailed error information. After the error occurs, you can search `error on running processor` in the TiCDC server log to see the error stack for troubleshooting.
+- In some extreme cases, the TiCDC service is restarted. You can search the `FATAL` level log in the TiCDC server log for troubleshooting.
+
+### How do I know whether the replication task is stopped manually?
+
+You can know whether the replication task is stopped manually by using `cdc cli`. For example:
+
+{{< copyable "shell-regular" >}}
+
+```shell
+cdc cli changefeed query --pd=http://10.0.10.25:2379 --changefeed-id 28c43ffc-2316-4f4f-a70b-d1a7c59ba79f
+```
+
+In the output of the above command, `admin-job-type` shows the state of this replication task:
+
+* `0`: In progress, which means that the task is not stopped manually.
+* `1`: Paused. When the task is paused, all replicated `processor`s exit. The configuration and the replication status of the task are retained, so you can resume the task from `checkpiont-ts`.
+* `2`: Resumed. The replication task resumes from `checkpoint-ts`.
+* `3`: Removed. When the task is removed, all replicated `processor`s are ended, and the configuration information of the replication task is cleared up. Only the replication status is retained for later queries.
+
+### How do I handle replication interruptions?

 A replication task might be interrupted in the following known scenarios:

 - The downstream continues to be abnormal, and TiCDC still fails after many retries.

     - In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`.
+
     - Handling method: You can resume the replication task via the HTTP interface after the downstream is back to normal.

 - Replication cannot continue because of incompatible SQL statement(s) in the downstream.
@@ -41,35 +102,45 @@ A replication task might be interrupted in the following known scenarios:
         2. Use the new task configuration file and add the `ignore-txn-start-ts` parameter to skip the transaction corresponding to the specified `start-ts`.
         3. Stop the old replication task via HTTP API. Execute `cdc cli changefeed create` to create a new task and specify the new task configuration file. Specify `checkpoint-ts` recorded in step 1 as the `start-ts` and start a new task to resume the replication.

-## How do I know whether a TiCDC replication task is interrupted?
+- In TiCDC v4.0.13 and earlier versions, the replication partition table may cause a replication interruption.

-- Check the `changefeed checkpoint` monitoring metric of the replication task (choose the right `changefeed id`) in the Grafana dashboard. If the metric value stays unchanged, or the `checkpoint lag` metric keeps increasing, the replication task might be interrupted.
-- Check the `exit error count` monitoring metric. If the metric value is greater than `0`, an error has occurred in the replication task.
-- Execute `cdc cli changefeed list` and `cdc cli changefeed query` to check the status of the replication task. `stopped` means the task has stopped and the `error` item provides the detailed error information. After the error occurs, you can search `error on running processor` in the TiCDC server log to see the error stack for troubleshooting.
-- In some extreme cases, the TiCDC service is restarted. You can search the `FATAL` level log in the TiCDC server log for troubleshooting.
+    - In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`.
+    - Handling procedures:
+        1. Pause the replication task through `cdc cli changefeed pause -c `.
+        2. Wait for about one munite and then resume the replication task through `cdc cli changefeed resume -c `.

-## What is `gc-ttl` in TiCDC?
+### What should I do to handle the OOM that occurs after TiCDC is restarted after a task interruption?

-Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data smaller than this GC safepoint is not cleaned by GC. Enabling this feature in TiCDC ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC when the replication task is unavailable or interrupted.
+- Update TiDB cluster and TiCDC cluster to their latest versions. The OOM problem has already been resolved in **v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the newest versions**.

-When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint through `gc-ttl`, which means the longest time that data is retained within the GC safepoint. This value is set by TiCDC in PD, which is 86,400 seconds by default.
+- In above updated versions, you can enable the Unified Sorter to help you sort data in the disk when the system memory is insufficient. To enable this function, you can pass `--sort-engine=unified` to the `cdc cli` command when creating a replication task. For example:

-## How do I handle the OOM that occurs after TiCDC is restarted after a task interruption?
+{{< copyable "shell-regular" >}}

-If the replication task is interrupted for a long time and a large volume of new data has been written to TiDB, Out of Memory (OOM) might occur when TiCDC is restarted. In this situation, you can enable unified sorter, TiCDC's experimental sorting engine. This engine sorts data in the disk when the memory is insufficient. To enable this feature, pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example:
+```shell
+cdc cli changefeed update -c --sort-engine="unified" --pd=http://10.0.10.25:2379
+```
+If you fail to update your cluster to above new versions, the Unified Sorter can still be enabled in **previous versions**. You can pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example:

 {{< copyable "shell-regular" >}}

 ```shell
-cdc cli changefeed update -c [changefeed-id] --sort-engine="unified" --sort-dir="/data/cdc/sort" --pd=http://10.0.10.25:2379
+cdc cli changefeed update -c --sort-engine="unified" --sort-dir="/data/cdc/sort" --pd=http://10.0.10.25:2379
 ```

 > **Note:**
 >
 > + Since v4.0.9, TiCDC supports the unified sorter engine.
 > + TiCDC (the 4.0 version) does not support dynamically modifying the sorting engine yet. Make sure that the changefeed has stopped before modifying the sorter settings.
+> + `sort-dir` has different behaviors in different versions, please refer to [Compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatiblity-notes-for-sort-dir-and-data-dir), and configures it with caution.
 > + Currently, the unified sorter is an experimental feature. When the number of tables is too large (>=100), the unified sorter might cause performance issues and affect replication throughput. Therefore, it is not recommended to use it in a production environment. Before you enable the unified sorter, make sure that the machine of each TiCDC node has enough disk capacity. If the total size of unprocessed data changes might exceed 1 TB, it is not recommend to use TiCDC for replication.

+## What is `gc-ttl` in TiCDC?
+
+Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data smaller than this GC safepoint is not cleaned by GC. Enabling this feature in TiCDC ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC when the replication task is unavailable or interrupted.
+
+When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint through `gc-ttl`, which means the longest time that data is retained within the GC safepoint. This value is set by TiCDC in PD, which is 86,400 seconds by default.
+
 ## What is the complete behavior of TiCDC garbage collection (GC) safepoint?

 If a replication task starts after the TiCDC service starts, the TiCDC owner updates the PD service GC safepoint with the smallest value of `checkpoint-ts` among all replication tasks. The service GC safepoint ensures that TiCDC does not delete data generated at that time and after that time. If the replication task is interrupted, the `checkpoint-ts` of this task does not change and PD's corresponding service GC safepoint is not updated either. The Time-To-Live (TTL) that TiCDC sets for a service GC safepoint is 24 hours, which means that the GC mechanism does not delete any data if the TiCDC service can be recovered within 24 hours after it is interrupted.
@@ -176,58 +247,6 @@ cdc cli changefeed create --pd=http://10.0.10.25:2379 --sink-uri="kafka://127.0.

 For more information, refer to [Create a replication task](/ticdc/manage-ticdc.md#create-a-replication-task).
-## How do I view the status of TiCDC replication tasks?
-
-To view the status of TiCDC replication tasks, use `cdc cli`. For example:
-
-{{< copyable "shell-regular" >}}
-
-```shell
-cdc cli changefeed list --pd=http://10.0.10.25:2379
-```
-
-The expected output is as follows:
-
-```json
-[{
-    "id": "4e24dde6-53c1-40b6-badf-63620e4940dc",
-    "summary": {
-      "state": "normal",
-      "tso": 417886179132964865,
-      "checkpoint": "2020-07-07 16:07:44.881",
-      "error": null
-    }
-}]
-```
-
-* `checkpoint`: TiCDC has replicated all data before this timestamp to downstream.
-* `state`: The state of the replication task:
-
-    * `normal`: The task runs normally.
-    * `stopped`: The task is stopped manually or encounters an error.
-    * `removed`: The task is removed.
-
-> **Note:**
->
-> This feature is introduced in TiCDC 4.0.3.
-
-## How do I know whether the replication task is stopped manually?
-
-You can know whether the replication task is stopped manually by using `cdc cli`. For example:
-
-{{< copyable "shell-regular" >}}
-
-```shell
-cdc cli changefeed query --pd=http://10.0.10.25:2379 --changefeed-id 28c43ffc-2316-4f4f-a70b-d1a7c59ba79f
-```
-
-In the output of this command, `admin-job-type` shows the state of the replication task:
-
-* `0`: In progress, which means that the task is not stopped manually.
-* `1`: Paused. When the task is paused, all replicated `processor`s exit. The configuration and the replication status of the task are retained, so you can resume the task from `checkpiont-ts`.
-* `2`: Resumed. The replication task resumes from `checkpoint-ts`.
-* `3`: Removed. When the task is removed, all replicated `processor`s are ended, and the configuration information of the replication task is cleared up. Only the replication status is retained for later queries.
-
 ## Why does the latency from TiCDC to Kafka become higher and higher?

 * Check [how do I view the status of TiCDC replication tasks](#how-do-i-view-the-status-of-ticdc-replication-tasks).
From c2a316be2c8310477713e875b5bf97653335dce3 Mon Sep 17 00:00:00 2001
From: Fendy
Date: Tue, 14 Sep 2021 13:38:25 +0800
Subject: [PATCH 02/17] Update ticdc troubleshooting EN

---
 ticdc/troubleshoot-ticdc.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index 35df462b6bd9d..a79dff18badc2 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -132,7 +132,7 @@ cdc cli changefeed update -c --sort-engine="unified" --sort-dir=
 >
 > + Since v4.0.9, TiCDC supports the unified sorter engine.
 > + TiCDC (the 4.0 version) does not support dynamically modifying the sorting engine yet. Make sure that the changefeed has stopped before modifying the sorter settings.
-> + `sort-dir` has different behaviors in different versions, please refer to [Compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatiblity-notes-for-sort-dir-and-data-dir), and configures it with caution.
+> + `sort-dir` has different behaviors in different versions, please refer to [Compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatibility-notes-for-sort-dir-and-data-dir), and configures it with caution.
 > + Currently, the unified sorter is an experimental feature. When the number of tables is too large (>=100), the unified sorter might cause performance issues and affect replication throughput. Therefore, it is not recommended to use it in a production environment. Before you enable the unified sorter, make sure that the machine of each TiCDC node has enough disk capacity. If the total size of unprocessed data changes might exceed 1 TB, it is not recommend to use TiCDC for replication.

 ## What is `gc-ttl` in TiCDC?
@@ -249,7 +249,7 @@ For more information, refer to [Create a replication task](/ticdc/manage-ticdc.m

 ## Why does the latency from TiCDC to Kafka become higher and higher?
-* Check [how do I view the status of TiCDC replication tasks](#how-do-i-view-the-status-of-ticdc-replication-tasks).
+* Check [how do I view the state of TiCDC replication tasks](#how-do-i-view-the-state-of-ticdc-replication-tasks).
 * Adjust the following parameters of Kafka:

     * Increase the `message.max.bytes` value in `server.properties` to `1073741824` (1 GB).
From f516209ae879d5b94ef92b5441fb0b78a1fa1106 Mon Sep 17 00:00:00 2001
From: Fendy
Date: Tue, 14 Sep 2021 13:46:07 +0800
Subject: [PATCH 03/17] Update ticdc troubleshooting - EN

---
 ticdc/troubleshoot-ticdc.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index a79dff18badc2..210a429ab422f 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -122,6 +122,7 @@ cdc cli changefeed update -c --sort-engine="unified" --pd=http:/
 ```
 If you fail to update your cluster to above new versions, the Unified Sorter can still be enabled in **previous versions**. You can pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example:
+

 {{< copyable "shell-regular" >}}

 ```shell
From cdb5081f0ed6bac666eee52d52c6af08d18531df Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Tue, 14 Sep 2021 14:25:34 +0800
Subject: [PATCH 04/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index 210a429ab422f..843bba2b5d5cc 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -26,7 +26,7 @@ When you execute `cdc cli changefeed create` to create a replication task, TiCDC

 ## How do I view the state of TiCDC replication tasks?

-You can use `cdc cli` to view the state of TiCDC replication tasks. For example:
+To view the status of TiCDC replication tasks, use `cdc cli`. For example:

 {{< copyable "shell-regular" >}}

From 4fa08c0afe2789270d2fd7fb99218a18cbfa00c8 Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Tue, 14 Sep 2021 14:25:54 +0800
Subject: [PATCH 05/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index 843bba2b5d5cc..0494df4d23042 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -56,7 +56,7 @@ The expected output is as follows:

 > **Note:**
 >
-> This feature is introduced in TiCDC version 4.0.3.
+> This feature is introduced in TiCDC 4.0.3.

 ## TiCDC replication interruptions

From e6c4f4a50ee619988f494fa72d268bdf7e17160e Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Tue, 14 Sep 2021 14:26:40 +0800
Subject: [PATCH 06/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index 0494df4d23042..78feaeef3ca18 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -64,7 +64,7 @@ The expected output is as follows:

 - Check the `changefeed checkpoint` monitoring metric of the replication task (choose the right `changefeed id`) in the Grafana dashboard. If the metric value stays unchanged, or the `checkpoint lag` metric keeps increasing, the replication task might be interrupted.
 - Check the `exit error count` monitoring metric. If the metric value is greater than `0`, an error has occurred in the replication task.
-- Execute `cdc cli changefeed list` and `cdc cli changefeed query` to check the status of the replication task. `stopped` means the task has stopped and the `error` item provides the detailed error information. After the error occurs, you can search `error on running processor` in the TiCDC server log to see the error stack for troubleshooting.
+- Execute `cdc cli changefeed list` and `cdc cli changefeed query` to check the status of the replication task. `stopped` means the task has stopped, and the `error` item provides the detailed error message. After the error occurs, you can search `error on running processor` in the TiCDC server log to see the error stack for troubleshooting.
 - In some extreme cases, the TiCDC service is restarted. You can search the `FATAL` level log in the TiCDC server log for troubleshooting.

 ### How do I know whether the replication task is stopped manually?
From afc7b78ec0ba87b160cccfcf451b8d762e614926 Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Tue, 14 Sep 2021 14:26:54 +0800
Subject: [PATCH 07/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index 78feaeef3ca18..e8fca6b1f00b2 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -69,7 +69,7 @@ The expected output is as follows:

 ### How do I know whether the replication task is stopped manually?

-You can know whether the replication task is stopped manually by using `cdc cli`. For example:
+You can know whether the replication task is stopped manually by executing `cdc cli`. For example:

 {{< copyable "shell-regular" >}}

From f852b6fd95b924dbf99d0c9d2c17b20ed6fe6022 Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Tue, 14 Sep 2021 14:27:37 +0800
Subject: [PATCH 08/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index e8fca6b1f00b2..fd8f3224d2c8d 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -82,7 +82,7 @@ In the output of the above command, `admin-job-type` shows the state of this rep
 * `0`: In progress, which means that the task is not stopped manually.
 * `1`: Paused. When the task is paused, all replicated `processor`s exit. The configuration and the replication status of the task are retained, so you can resume the task from `checkpiont-ts`.
 * `2`: Resumed. The replication task resumes from `checkpoint-ts`.
-* `3`: Removed. When the task is removed, all replicated `processor`s are ended, and the configuration information of the replication task is cleared up. Only the replication status is retained for later queries.
+* `3`: Removed. When the task is removed, all replicated `processor`s are ended, and the configuration information of the replication task is cleared up. The replication status is retained only for later queries.

 ### How do I handle replication interruptions?
From 56c1a9a7984408470c6bbdfa7faea334082a0d7c Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Tue, 14 Sep 2021 14:28:11 +0800
Subject: [PATCH 09/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index fd8f3224d2c8d..d63a081bba102 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -113,7 +113,7 @@ A replication task might be interrupted in the following known scenarios:

 - Update TiDB cluster and TiCDC cluster to their latest versions. The OOM problem has already been resolved in **v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the newest versions**.

-- In above updated versions, you can enable the Unified Sorter to help you sort data in the disk when the system memory is insufficient. To enable this function, you can pass `--sort-engine=unified` to the `cdc cli` command when creating a replication task. For example:
+- In the above updated versions, you can enable the Unified Sorter to help you sort data in the disk when the system memory is insufficient. To enable this function, you can pass `--sort-engine=unified` to the `cdc cli` command when creating a replication task. For example:

 {{< copyable "shell-regular" >}}

From 312fd0dbfc6fdf0d0e91776371ac1736a9f7b5cb Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Tue, 14 Sep 2021 14:29:24 +0800
Subject: [PATCH 10/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index d63a081bba102..ed780e8230e58 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -121,7 +121,7 @@ A replication task might be interrupted in the following known scenarios:
 cdc cli changefeed update -c --sort-engine="unified" --pd=http://10.0.10.25:2379
 ```

-If you fail to update your cluster to above new versions, the Unified Sorter can still be enabled in **previous versions**. You can pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example:
+If you fail to update your cluster to the above new versions, you can still enable Unified Sorter in **previous versions**. You can pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example:

 {{< copyable "shell-regular" >}}

From f595a1ac213901c2c4342b0c4acb4ac653a709bb Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Tue, 14 Sep 2021 14:30:50 +0800
Subject: [PATCH 11/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index ed780e8230e58..e37f78c617ba6 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -133,7 +133,7 @@ cdc cli changefeed update -c --sort-engine="unified" --sort-dir=
 >
 > + Since v4.0.9, TiCDC supports the unified sorter engine.
 > + TiCDC (the 4.0 version) does not support dynamically modifying the sorting engine yet. Make sure that the changefeed has stopped before modifying the sorter settings.
-> + `sort-dir` has different behaviors in different versions, please refer to [Compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatibility-notes-for-sort-dir-and-data-dir), and configures it with caution.
+> + `sort-dir` has different behaviors in different versions. Refer to [compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatibility-notes-for-sort-dir-and-data-dir), and configure it with caution.
 > + Currently, the unified sorter is an experimental feature. When the number of tables is too large (>=100), the unified sorter might cause performance issues and affect replication throughput. Therefore, it is not recommended to use it in a production environment. Before you enable the unified sorter, make sure that the machine of each TiCDC node has enough disk capacity. If the total size of unprocessed data changes might exceed 1 TB, it is not recommend to use TiCDC for replication.

 ## What is `gc-ttl` in TiCDC?
From f171d2d6a80b94fd8d91dd6de4f2dac17d24e8d1 Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Tue, 14 Sep 2021 16:25:02 +0800
Subject: [PATCH 12/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index e37f78c617ba6..d8b73eacec1d0 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -106,7 +106,7 @@ A replication task might be interrupted in the following known scenarios:

     - In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`.
     - Handling procedures:
-        1. Pause the replication task through `cdc cli changefeed pause -c `.
+        1. Pause the replication task by executing `cdc cli changefeed pause -c `.
         2. Wait for about one munite and then resume the replication task through `cdc cli changefeed resume -c `.

 ### What should I do to handle the OOM that occurs after TiCDC is restarted after a task interruption?
From 4d2bf077fd026b4f023ac5bb7b41f90eb4879a63 Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Tue, 14 Sep 2021 16:25:26 +0800
Subject: [PATCH 13/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index d8b73eacec1d0..b66ed7359de3d 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -107,7 +107,7 @@ A replication task might be interrupted in the following known scenarios:
     - In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`.
     - Handling procedures:
         1. Pause the replication task by executing `cdc cli changefeed pause -c `.
-        2. Wait for about one munite and then resume the replication task through `cdc cli changefeed resume -c `.
+        2. Wait for about one munite, and then resume the replication task by executing `cdc cli changefeed resume -c `.

 ### What should I do to handle the OOM that occurs after TiCDC is restarted after a task interruption?
From ea757fbda9e492b4b75096ce28d16a4795872dc6 Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 16:26:21 +0800 Subject: [PATCH 14/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index b66ed7359de3d..57b08cc70a46d 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -111,7 +111,7 @@ A replication task might be interrupted in the following known scenarios: ### What should I do to handle the OOM that occurs after TiCDC is restarted after a task interruption? -- Update TiDB cluster and TiCDC cluster to their latest versions. The OOM problem has already been resolved in **v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the newest versions**. +- Update your TiDB cluster and TiCDC cluster to the latest versions. The OOM problem has already been resolved in **v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the latest versions**. - In the above updated versions, you can enable the Unified Sorter to help you sort data in the disk when the system memory is insufficient. To enable this function, you can pass `--sort-engine=unified` to the `cdc cli` command when creating a replication task. 
 For example:

From 3bfcb40021bb3744e7894c8a295000424d4c4089 Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Mon, 8 Nov 2021 11:01:05 +0800
Subject: [PATCH 15/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index 57b08cc70a46d..ded239fed4e13 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -138,7 +138,7 @@ cdc cli changefeed update -c --sort-engine="unified" --sort-dir=

 ## What is `gc-ttl` in TiCDC?

-Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data smaller than this GC safepoint is not cleaned by GC. Enabling this feature in TiCDC ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC when the replication task is unavailable or interrupted.
+Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data later than this GC safepoint is not cleaned by GC. When the replication task is unavailable or interrupted, this feature ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC.

 When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint through `gc-ttl`, which means the longest time that data is retained within the GC safepoint. This value is set by TiCDC in PD, which is 86,400 seconds by default.
From d46d200f94371c8ab8b22e426dd0899b1892bf14 Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Mon, 8 Nov 2021 11:01:40 +0800
Subject: [PATCH 16/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index ded239fed4e13..7b8ca93647a95 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -140,7 +140,7 @@ cdc cli changefeed update -c --sort-engine="unified" --sort-dir=

 Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data later than this GC safepoint is not cleaned by GC. When the replication task is unavailable or interrupted, this feature ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC.

-When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint through `gc-ttl`, which means the longest time that data is retained within the GC safepoint. This value is set by TiCDC in PD, which is 86,400 seconds by default.
+When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint by configuring `gc-ttl`, which means the longest time that data is retained within the GC safepoint. This value is set by TiCDC in PD, which is 86,400 seconds by default.

 ## What is the complete behavior of TiCDC garbage collection (GC) safepoint?
From fd3d51124d1fa8eee307ed0a66adc0cc3405d5d5 Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Mon, 8 Nov 2021 11:02:20 +0800
Subject: [PATCH 17/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index 7b8ca93647a95..cf5cb42545f52 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -102,7 +102,7 @@ A replication task might be interrupted in the following known scenarios:
         2. Use the new task configuration file and add the `ignore-txn-start-ts` parameter to skip the transaction corresponding to the specified `start-ts`.
         3. Stop the old replication task via HTTP API. Execute `cdc cli changefeed create` to create a new task and specify the new task configuration file. Specify `checkpoint-ts` recorded in step 1 as the `start-ts` and start a new task to resume the replication.

-- In TiCDC v4.0.13 and earlier versions, the replication partition table may cause a replication interruption.
+- In TiCDC v4.0.13 and earlier versions, when TiCDC replicates the partitioned table, it might encounter an error that leads to replication interruption.

     - In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`.
     - Handling procedures: