From 04dcec7828ed31d618a7c266e59109930d9fa941 Mon Sep 17 00:00:00 2001
From: Fendy
Date: Tue, 14 Sep 2021 12:52:56 +0800
Subject: [PATCH 01/17] update interruption-EN

---
 ticdc/manage-ticdc.md       |   2 +-
 ticdc/troubleshoot-ticdc.md | 147 ++++++++++++++++++++----------------
 2 files changed, 84 insertions(+), 65 deletions(-)

diff --git a/ticdc/manage-ticdc.md b/ticdc/manage-ticdc.md
index d3fac7256af0c..5f67788a7a972 100644
--- a/ticdc/manage-ticdc.md
+++ b/ticdc/manage-ticdc.md
@@ -874,4 +874,4 @@ In the output of the above command, if the value of `sort-engine` is "unified",
 > + If your servers use mechanical hard drives or other storage devices that have high latency or limited bandwidth, use the unified sorter with caution.
 > + The total free capacity of hard drives must be greater than or equal to 500G. If you need to replicate a large amount of historical data, make sure that the free capacity on each node is greater than or equal to the size of the incremental data that needs to be replicated.
 > + Unified sorter is enabled by default. If your servers do not match the above requirements and you want to disable the unified sorter, you need to manually set `sort-engine` to `memory` for the changefeed.
-> + To enable Unified Sorter on an existing changefeed, see the methods provided in [How do I handle the OOM that occurs after TiCDC is restarted after a task interruption?](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-the-oom-that-occurs-after-ticdc-is-restarted-after-a-task-interruption).
+> + To enable Unified Sorter on an existing changefeed, see the methods provided in [How do I handle the OOM that occurs after TiCDC is restarted after a task interruption?](/ticdc/troubleshoot-ticdc.md#what-should-i-do-to-handle-the-oom-that-occurs-after-ticdc-is-restarted-after-a-task-interruption)
diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index 4ff37794110f4..6d40892a528bd 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -24,13 +24,74 @@ If you do not specify `start-ts`, or specify `start-ts` as `0`, when a replicati
 
 When you execute `cdc cli changefeed create` to create a replication task, TiCDC checks whether the upstream tables meet the [replication restrictions](/ticdc/ticdc-overview.md#restrictions). If some tables do not meet the restrictions, `some tables are not eligible to replicate` is returned with a list of ineligible tables. You can choose `Y` or `y` to continue creating the task, and all updates on these tables are automatically ignored during the replication. If you choose an input other than `Y` or `y`, the replication task is not created.
 
-## How do I handle replication interruption?
+## How do I view the state of TiCDC replication tasks?
+
+You can use `cdc cli` to view the state of TiCDC replication tasks. For example:
+
+{{< copyable "shell-regular" >}}
+
+```shell
+cdc cli changefeed list --pd=http://10.0.10.25:2379
+```
+
+The expected output is as follows:
+
+```json
+[{
+    "id": "4e24dde6-53c1-40b6-badf-63620e4940dc",
+    "summary": {
+      "state": "normal",
+      "tso": 417886179132964865,
+      "checkpoint": "2020-07-07 16:07:44.881",
+      "error": null
+    }
+}]
+```
+
+* `checkpoint`: TiCDC has replicated all data before this timestamp to downstream.
+* `state`: The state of this replication task:
+    * `normal`: The task runs normally.
+    * `stopped`: The task is stopped manually or encounters an error.
+    * `removed`: The task is removed.
+
+> **Note:**
+>
+> This feature is introduced in TiCDC version 4.0.3.
+
+## TiCDC replication interruptions
+
+### How do I know whether a TiCDC replication task is interrupted?
+
+- Check the `changefeed checkpoint` monitoring metric of the replication task (choose the right `changefeed id`) in the Grafana dashboard. If the metric value stays unchanged, or the `checkpoint lag` metric keeps increasing, the replication task might be interrupted.
+- Check the `exit error count` monitoring metric. If the metric value is greater than `0`, an error has occurred in the replication task.
+- Execute `cdc cli changefeed list` and `cdc cli changefeed query` to check the status of the replication task. `stopped` means the task has stopped and the `error` item provides the detailed error information. After the error occurs, you can search `error on running processor` in the TiCDC server log to see the error stack for troubleshooting.
+- In some extreme cases, the TiCDC service is restarted. You can search the `FATAL` level log in the TiCDC server log for troubleshooting.
+
+### How do I know whether the replication task is stopped manually?
+
+You can know whether the replication task is stopped manually by using `cdc cli`. For example:
+
+{{< copyable "shell-regular" >}}
+
+```shell
+cdc cli changefeed query --pd=http://10.0.10.25:2379 --changefeed-id 28c43ffc-2316-4f4f-a70b-d1a7c59ba79f
+```
+
+In the output of the above command, `admin-job-type` shows the state of this replication task:
+
+* `0`: In progress, which means that the task is not stopped manually.
+* `1`: Paused. When the task is paused, all replicated `processor`s exit. The configuration and the replication status of the task are retained, so you can resume the task from `checkpiont-ts`.
+* `2`: Resumed. The replication task resumes from `checkpoint-ts`.
+* `3`: Removed. When the task is removed, all replicated `processor`s are ended, and the configuration information of the replication task is cleared up. Only the replication status is retained for later queries.
+
+### How do I handle replication interruptions?
 
 A replication task might be interrupted in the following known scenarios:
 
 - The downstream continues to be abnormal, and TiCDC still fails after many retries.
 
     - In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`.
+
     - Handling method: You can resume the replication task via the HTTP interface after the downstream is back to normal.
 
 - Replication cannot continue because of incompatible SQL statement(s) in the downstream.
@@ -41,35 +102,45 @@ A replication task might be interrupted in the following known scenarios:
         2. Use the new task configuration file and add the `ignore-txn-start-ts` parameter to skip the transaction corresponding to the specified `start-ts`.
         3. Stop the old replication task via HTTP API. Execute `cdc cli changefeed create` to create a new task and specify the new task configuration file. Specify `checkpoint-ts` recorded in step 1 as the `start-ts` and start a new task to resume the replication.
 
-## How do I know whether a TiCDC replication task is interrupted?
+- In TiCDC v4.0.13 and earlier versions, the replication partition table may cause a replication interruption.
 
-- Check the `changefeed checkpoint` monitoring metric of the replication task (choose the right `changefeed id`) in the Grafana dashboard. If the metric value stays unchanged, or the `checkpoint lag` metric keeps increasing, the replication task might be interrupted.
-- Check the `exit error count` monitoring metric. If the metric value is greater than `0`, an error has occurred in the replication task.
-- Execute `cdc cli changefeed list` and `cdc cli changefeed query` to check the status of the replication task. `stopped` means the task has stopped and the `error` item provides the detailed error information. After the error occurs, you can search `error on running processor` in the TiCDC server log to see the error stack for troubleshooting.
-- In some extreme cases, the TiCDC service is restarted. You can search the `FATAL` level log in the TiCDC server log for troubleshooting.
+    - In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`.
+    - Handling procedures:
+        1. Pause the replication task through `cdc cli changefeed pause -c `.
+        2. Wait for about one munite and then resume the replication task through `cdc cli changefeed resume -c `.
 
-## What is `gc-ttl` in TiCDC?
+### What should I do to handle the OOM that occurs after TiCDC is restarted after a task interruption?
 
-Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data smaller than this GC safepoint is not cleaned by GC. Enabling this feature in TiCDC ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC when the replication task is unavailable or interrupted.
+- Update TiDB cluster and TiCDC cluster to their latest versions. The OOM problem has already been resolved in **v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the newest versions**.
 
-When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint through `gc-ttl`, which means the longest time that data is retained within the GC safepoint. This value is set by TiCDC in PD, which is 86,400 seconds by default.
+- In above updated versions, you can enable the Unified Sorter to help you sort data in the disk when the system memory is insufficient. To enable this function, you can pass `--sort-engine=unified` to the `cdc cli` command when creating a replication task. For example:
 
-## How do I handle the OOM that occurs after TiCDC is restarted after a task interruption?
+{{< copyable "shell-regular" >}}
 
-If the replication task is interrupted for a long time and a large volume of new data has been written to TiDB, Out of Memory (OOM) might occur when TiCDC is restarted. In this situation, you can enable unified sorter, TiCDC's experimental sorting engine. This engine sorts data in the disk when the memory is insufficient. To enable this feature, pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example:
+```shell
+cdc cli changefeed update -c --sort-engine="unified" --pd=http://10.0.10.25:2379
+```
 
+If you fail to update your cluster to above new versions, the Unified Sorter can still be enabled in **previous versions**. You can pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example:
 {{< copyable "shell-regular" >}}
 
 ```shell
-cdc cli changefeed update -c [changefeed-id] --sort-engine="unified" --sort-dir="/data/cdc/sort" --pd=http://10.0.10.25:2379
+cdc cli changefeed update -c --sort-engine="unified" --sort-dir="/data/cdc/sort" --pd=http://10.0.10.25:2379
 ```
 
 > **Note:**
 >
 > + Since v4.0.9, TiCDC supports the unified sorter engine.
 > + TiCDC (the 4.0 version) does not support dynamically modifying the sorting engine yet. Make sure that the changefeed has stopped before modifying the sorter settings.
+> + `sort-dir` has different behaviors in different versions, please refer to [Compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatiblity-notes-for-sort-dir-and-data-dir), and configures it with caution.
 > + Currently, the unified sorter is an experimental feature. When the number of tables is too large (>=100), the unified sorter might cause performance issues and affect replication throughput. Therefore, it is not recommended to use it in a production environment. Before you enable the unified sorter, make sure that the machine of each TiCDC node has enough disk capacity. If the total size of unprocessed data changes might exceed 1 TB, it is not recommend to use TiCDC for replication.
 
+## What is `gc-ttl` in TiCDC?
+
+Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data smaller than this GC safepoint is not cleaned by GC. Enabling this feature in TiCDC ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC when the replication task is unavailable or interrupted.
+
+When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint through `gc-ttl`, which means the longest time that data is retained within the GC safepoint. This value is set by TiCDC in PD, which is 86,400 seconds by default.
+
 ## What is the complete behavior of TiCDC garbage collection (GC) safepoint?
 
 If a replication task starts after the TiCDC service starts, the TiCDC owner updates the PD service GC safepoint with the smallest value of `checkpoint-ts` among all replication tasks. The service GC safepoint ensures that TiCDC does not delete data generated at that time and after that time. If the replication task is interrupted, the `checkpoint-ts` of this task does not change and PD's corresponding service GC safepoint is not updated either. The Time-To-Live (TTL) that TiCDC sets for a service GC safepoint is 24 hours, which means that the GC mechanism does not delete any data if the TiCDC service can be recovered within 24 hours after it is interrupted.
@@ -176,58 +247,6 @@ cdc cli changefeed create --pd=http://10.0.10.25:2379 --sink-uri="kafka://127.0.
 
 For more information, refer to [Create a replication task](/ticdc/manage-ticdc.md#create-a-replication-task).
 
-## How do I view the status of TiCDC replication tasks?
-
-To view the status of TiCDC replication tasks, use `cdc cli`. For example:
-
-{{< copyable "shell-regular" >}}
-
-```shell
-cdc cli changefeed list --pd=http://10.0.10.25:2379
-```
-
-The expected output is as follows:
-
-```json
-[{
-    "id": "4e24dde6-53c1-40b6-badf-63620e4940dc",
-    "summary": {
-      "state": "normal",
-      "tso": 417886179132964865,
-      "checkpoint": "2020-07-07 16:07:44.881",
-      "error": null
-    }
-}]
-```
-
-* `checkpoint`: TiCDC has replicated all data before this timestamp to downstream.
-* `state`: The state of the replication task:
-
-    * `normal`: The task runs normally.
-    * `stopped`: The task is stopped manually or encounters an error.
-    * `removed`: The task is removed.
-
-> **Note:**
->
-> This feature is introduced in TiCDC 4.0.3.
-
-## How do I know whether the replication task is stopped manually?
-
-You can know whether the replication task is stopped manually by using `cdc cli`. For example:
-
-{{< copyable "shell-regular" >}}
-
-```shell
-cdc cli changefeed query --pd=http://10.0.10.25:2379 --changefeed-id 28c43ffc-2316-4f4f-a70b-d1a7c59ba79f
-```
-
-In the output of this command, `admin-job-type` shows the state of the replication task:
-
-* `0`: In progress, which means that the task is not stopped manually.
-* `1`: Paused. When the task is paused, all replicated `processor`s exit. The configuration and the replication status of the task are retained, so you can resume the task from `checkpiont-ts`.
-* `2`: Resumed. The replication task resumes from `checkpoint-ts`.
-* `3`: Removed. When the task is removed, all replicated `processor`s are ended, and the configuration information of the replication task is cleared up. Only the replication status is retained for later queries.
-
 ## Why does the latency from TiCDC to Kafka become higher and higher?
 
 * Check [how do I view the status of TiCDC replication tasks](#how-do-i-view-the-status-of-ticdc-replication-tasks).
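
The state check documented in PATCH 01/17 can be scripted. The following is a minimal sketch, assuming the JSON shape shown in the patch's expected output of `cdc cli changefeed list`; the helper name `find_unhealthy_changefeeds` and the `SAMPLE_OUTPUT` string are illustrative and not part of TiCDC.

```python
import json

# Sample text copied from the expected output in the patch; a real run would
# use the captured stdout of `cdc cli changefeed list --pd=...` instead.
SAMPLE_OUTPUT = """
[{
    "id": "4e24dde6-53c1-40b6-badf-63620e4940dc",
    "summary": {
      "state": "normal",
      "tso": 417886179132964865,
      "checkpoint": "2020-07-07 16:07:44.881",
      "error": null
    }
}]
"""


def find_unhealthy_changefeeds(raw_json):
    """Return (id, state, error) for each changefeed whose state is not "normal".

    Hypothetical helper: it only inspects the `state` and `error` fields that
    the patch documents, and makes no other assumption about the output.
    """
    unhealthy = []
    for feed in json.loads(raw_json):
        summary = feed.get("summary", {})
        state = summary.get("state")
        if state != "normal":
            unhealthy.append((feed.get("id"), state, summary.get("error")))
    return unhealthy


if __name__ == "__main__":
    for feed_id, state, error in find_unhealthy_changefeeds(SAMPLE_OUTPUT):
        print(f"changefeed {feed_id} is {state}: {error}")
```

With the sample above, the helper returns an empty list because the only changefeed is `normal`. A `stopped` entry would be returned together with its `error` item, which is the point at which the patch suggests searching the TiCDC server log.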
From 42644f9a5be1b585042d6f7d09d1580171f903b6 Mon Sep 17 00:00:00 2001 From: Fendy Date: Tue, 14 Sep 2021 13:38:25 +0800 Subject: [PATCH 02/17] Update ticdc troubleshooting EN --- ticdc/troubleshoot-ticdc.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 6d40892a528bd..e19bd4d8c1566 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -132,7 +132,7 @@ cdc cli changefeed update -c --sort-engine="unified" --sort-dir= > > + Since v4.0.9, TiCDC supports the unified sorter engine. > + TiCDC (the 4.0 version) does not support dynamically modifying the sorting engine yet. Make sure that the changefeed has stopped before modifying the sorter settings. -> + `sort-dir` has different behaviors in different versions, please refer to [Compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatiblity-notes-for-sort-dir-and-data-dir), and configures it with caution. +> + `sort-dir` has different behaviors in different versions, please refer to [Compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatibility-notes-for-sort-dir-and-data-dir), and configures it with caution. > + Currently, the unified sorter is an experimental feature. When the number of tables is too large (>=100), the unified sorter might cause performance issues and affect replication throughput. Therefore, it is not recommended to use it in a production environment. Before you enable the unified sorter, make sure that the machine of each TiCDC node has enough disk capacity. If the total size of unprocessed data changes might exceed 1 TB, it is not recommend to use TiCDC for replication. ## What is `gc-ttl` in TiCDC? @@ -249,7 +249,7 @@ For more information, refer to [Create a replication task](/ticdc/manage-ticdc.m ## Why does the latency from TiCDC to Kafka become higher and higher? 
-* Check [how do I view the status of TiCDC replication tasks](#how-do-i-view-the-status-of-ticdc-replication-tasks). +* Check [how do I view the state of TiCDC replication tasks](#how-do-i-view-the-state-of-ticdc-replication-tasks). * Adjust the following parameters of Kafka: * Increase the `message.max.bytes` value in `server.properties` to `1073741824` (1 GB). From a329940f4b09272fad4b9594172b2465745e065f Mon Sep 17 00:00:00 2001 From: Fendy Date: Tue, 14 Sep 2021 13:46:07 +0800 Subject: [PATCH 03/17] Update ticdc troubleshooting - EN --- ticdc/troubleshoot-ticdc.md | 1 + 1 file changed, 1 insertion(+) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index e19bd4d8c1566..3bb01597d6c00 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -122,6 +122,7 @@ cdc cli changefeed update -c --sort-engine="unified" --pd=http:/ ``` If you fail to update your cluster to above new versions, the Unified Sorter can still be enabled in **previous versions**. You can pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example: + {{< copyable "shell-regular" >}} ```shell From 99a4ecd89161a7bde8a123cff6b92e475297454d Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:25:34 +0800 Subject: [PATCH 04/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 3bb01597d6c00..7213a61baf994 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -26,7 +26,7 @@ When you execute `cdc cli changefeed create` to create a replication task, TiCDC ## How do I view the state of TiCDC replication tasks? -You can use `cdc cli` to view the state of TiCDC replication tasks. 
For example: +To view the status of TiCDC replication tasks, use `cdc cli`. For example: {{< copyable "shell-regular" >}} From 65e396b190d407a7637fc16d7a1ffc6831f9852a Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:25:54 +0800 Subject: [PATCH 05/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 7213a61baf994..51d0f7b5d9a56 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -56,7 +56,7 @@ The expected output is as follows: > **Note:** > -> This feature is introduced in TiCDC version 4.0.3. +> This feature is introduced in TiCDC 4.0.3. ## TiCDC replication interruptions From e5248cacbf315d4a143b7b4c47311b3c99321e8d Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:26:40 +0800 Subject: [PATCH 06/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 51d0f7b5d9a56..36d59d280ee95 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -64,7 +64,7 @@ The expected output is as follows: - Check the `changefeed checkpoint` monitoring metric of the replication task (choose the right `changefeed id`) in the Grafana dashboard. If the metric value stays unchanged, or the `checkpoint lag` metric keeps increasing, the replication task might be interrupted. - Check the `exit error count` monitoring metric. If the metric value is greater than `0`, an error has occurred in the replication task. -- Execute `cdc cli changefeed list` and `cdc cli changefeed query` to check the status of the replication task. 
`stopped` means the task has stopped and the `error` item provides the detailed error information. After the error occurs, you can search `error on running processor` in the TiCDC server log to see the error stack for troubleshooting. +- Execute `cdc cli changefeed list` and `cdc cli changefeed query` to check the status of the replication task. `stopped` means the task has stopped, and the `error` item provides the detailed error message. After the error occurs, you can search `error on running processor` in the TiCDC server log to see the error stack for troubleshooting. - In some extreme cases, the TiCDC service is restarted. You can search the `FATAL` level log in the TiCDC server log for troubleshooting. ### How do I know whether the replication task is stopped manually? From 212aa97031919db741c25c9f1008e279297ee7cb Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:26:54 +0800 Subject: [PATCH 07/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 36d59d280ee95..6652ba3044733 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -69,7 +69,7 @@ The expected output is as follows: ### How do I know whether the replication task is stopped manually? -You can know whether the replication task is stopped manually by using `cdc cli`. For example: +You can know whether the replication task is stopped manually by executing `cdc cli`. 
For example: {{< copyable "shell-regular" >}} From 535330906bae99480aa6612a414e339f7affac17 Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:27:37 +0800 Subject: [PATCH 08/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 6652ba3044733..d3351d6fc3049 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -82,7 +82,7 @@ In the output of the above command, `admin-job-type` shows the state of this rep * `0`: In progress, which means that the task is not stopped manually. * `1`: Paused. When the task is paused, all replicated `processor`s exit. The configuration and the replication status of the task are retained, so you can resume the task from `checkpiont-ts`. * `2`: Resumed. The replication task resumes from `checkpoint-ts`. -* `3`: Removed. When the task is removed, all replicated `processor`s are ended, and the configuration information of the replication task is cleared up. Only the replication status is retained for later queries. +* `3`: Removed. When the task is removed, all replicated `processor`s are ended, and the configuration information of the replication task is cleared up. The replication status is retained only for later queries. ### How do I handle replication interruptions? 
From a3b4145d892778948c9de37bfe76a9403f3fd6d4 Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:28:11 +0800 Subject: [PATCH 09/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index d3351d6fc3049..946c33cf833a2 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -113,7 +113,7 @@ A replication task might be interrupted in the following known scenarios: - Update TiDB cluster and TiCDC cluster to their latest versions. The OOM problem has already been resolved in **v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the newest versions**. -- In above updated versions, you can enable the Unified Sorter to help you sort data in the disk when the system memory is insufficient. To enable this function, you can pass `--sort-engine=unified` to the `cdc cli` command when creating a replication task. For example: +- In the above updated versions, you can enable the Unified Sorter to help you sort data in the disk when the system memory is insufficient. To enable this function, you can pass `--sort-engine=unified` to the `cdc cli` command when creating a replication task. 
For example: {{< copyable "shell-regular" >}} From 3ea14940c142abf508402a7bcb08f74cd1d5be65 Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:29:24 +0800 Subject: [PATCH 10/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 946c33cf833a2..7d806b8bf148a 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -121,7 +121,7 @@ A replication task might be interrupted in the following known scenarios: cdc cli changefeed update -c --sort-engine="unified" --pd=http://10.0.10.25:2379 ``` -If you fail to update your cluster to above new versions, the Unified Sorter can still be enabled in **previous versions**. You can pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example: +If you fail to update your cluster to the above new versions, you can still enable Unified Sorter in **previous versions**. You can pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example: {{< copyable "shell-regular" >}} From 07f9d9496fa467d65cd04c86944c235e54be7e49 Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:30:50 +0800 Subject: [PATCH 11/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 7d806b8bf148a..9cd06f93e3b23 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -133,7 +133,7 @@ cdc cli changefeed update -c --sort-engine="unified" --sort-dir= > > + Since v4.0.9, TiCDC supports the unified sorter engine. 
> + TiCDC (the 4.0 version) does not support dynamically modifying the sorting engine yet. Make sure that the changefeed has stopped before modifying the sorter settings. -> + `sort-dir` has different behaviors in different versions, please refer to [Compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatibility-notes-for-sort-dir-and-data-dir), and configures it with caution. +> + `sort-dir` has different behaviors in different versions. Refer to [compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatibility-notes-for-sort-dir-and-data-dir), and configure it with caution. > + Currently, the unified sorter is an experimental feature. When the number of tables is too large (>=100), the unified sorter might cause performance issues and affect replication throughput. Therefore, it is not recommended to use it in a production environment. Before you enable the unified sorter, make sure that the machine of each TiCDC node has enough disk capacity. If the total size of unprocessed data changes might exceed 1 TB, it is not recommend to use TiCDC for replication. ## What is `gc-ttl` in TiCDC? From 7e45b6fe9dae8f1790d2bc104f8254b320deac8c Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 16:25:02 +0800 Subject: [PATCH 12/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 9cd06f93e3b23..484c3af714326 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -106,7 +106,7 @@ A replication task might be interrupted in the following known scenarios: - In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`. - Handling procedures: - 1. 
Pause the replication task through `cdc cli changefeed pause -c `. + 1. Pause the replication task by executing `cdc cli changefeed pause -c `. 2. Wait for about one munite and then resume the replication task through `cdc cli changefeed resume -c `. ### What should I do to handle the OOM that occurs after TiCDC is restarted after a task interruption? From 91543688ab3db9b20b5448bdfd5c81d220a997bf Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 16:25:26 +0800 Subject: [PATCH 13/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 484c3af714326..f4b20c5036c74 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -107,7 +107,7 @@ A replication task might be interrupted in the following known scenarios: - In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`. - Handling procedures: 1. Pause the replication task by executing `cdc cli changefeed pause -c `. - 2. Wait for about one munite and then resume the replication task through `cdc cli changefeed resume -c `. + 2. Wait for about one munite, and then resume the replication task by executing `cdc cli changefeed resume -c `. ### What should I do to handle the OOM that occurs after TiCDC is restarted after a task interruption? 
From acdc25314e9d92ffeb62a0cf0b54bdc37c902f3d Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 16:26:21 +0800 Subject: [PATCH 14/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index f4b20c5036c74..27f1dd1e38c1c 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -111,7 +111,7 @@ A replication task might be interrupted in the following known scenarios: ### What should I do to handle the OOM that occurs after TiCDC is restarted after a task interruption? -- Update TiDB cluster and TiCDC cluster to their latest versions. The OOM problem has already been resolved in **v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the newest versions**. +- Update your TiDB cluster and TiCDC cluster to the latest versions. The OOM problem has already been resolved in **v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the latest versions**. - In the above updated versions, you can enable the Unified Sorter to help you sort data in the disk when the system memory is insufficient. To enable this function, you can pass `--sort-engine=unified` to the `cdc cli` command when creating a replication task. 
For example:

From 61d4e58b3a798a4e465a5423e039d3e5b83999dc Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Mon, 8 Nov 2021 11:01:05 +0800
Subject: [PATCH 15/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index 27f1dd1e38c1c..11fee6e4bfeb0 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -138,7 +138,7 @@ cdc cli changefeed update -c --sort-engine="unified" --sort-dir=

 ## What is `gc-ttl` in TiCDC?

-Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data smaller than this GC safepoint is not cleaned by GC. Enabling this feature in TiCDC ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC when the replication task is unavailable or interrupted.
+Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data later than this GC safepoint is not cleaned by GC. When the replication task is unavailable or interrupted, this feature ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC.

 When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint through `gc-ttl`, which means the longest time that data is retained within the GC safepoint. This value is set by TiCDC in PD, which is 86,400 seconds by default.

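The `gc-ttl` arithmetic described in the hunk above can be illustrated with a short sketch. It assumes, for illustration only, that the TTL window is counted from the moment the service GC safepoint was last refreshed; the function names and the sample timestamps are hypothetical, and only the 86,400-second default comes from the patch text.

```python
from datetime import datetime, timedelta, timezone

# Default gc-ttl stated in the document: 86,400 seconds (24 hours).
DEFAULT_GC_TTL_SECONDS = 86_400


def retention_deadline(
    safepoint_refreshed_at: datetime,
    gc_ttl_seconds: int = DEFAULT_GC_TTL_SECONDS,
) -> datetime:
    """Return the latest time until which data is expected to be retained.

    Assumption: the TTL is measured from the last safepoint refresh.
    """
    if gc_ttl_seconds <= 0:
        raise ValueError("gc_ttl_seconds must be positive")
    return safepoint_refreshed_at + timedelta(seconds=gc_ttl_seconds)


def is_within_gc_ttl(
    now: datetime,
    safepoint_refreshed_at: datetime,
    gc_ttl_seconds: int = DEFAULT_GC_TTL_SECONDS,
) -> bool:
    """Check whether `now` still falls inside the retention window."""
    return now <= retention_deadline(safepoint_refreshed_at, gc_ttl_seconds)
```

With the default value, a task interrupted at 12:00 UTC on one day would, under this assumption, need to be resumed by 12:00 UTC the next day for its data to still be protected from GC.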
From a5eaacecd9dbd7eceaa163c098a90604595f5c0d Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Mon, 8 Nov 2021 11:01:40 +0800
Subject: [PATCH 16/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index 11fee6e4bfeb0..31404da0ca165 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -140,7 +140,7 @@ cdc cli changefeed update -c --sort-engine="unified" --sort-dir=

 Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data later than this GC safepoint is not cleaned by GC. When the replication task is unavailable or interrupted, this feature ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC.

-When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint through `gc-ttl`, which means the longest time that data is retained within the GC safepoint. This value is set by TiCDC in PD, which is 86,400 seconds by default.
+When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint by configuring `gc-ttl`, which means the longest time that data is retained within the GC safepoint. This value is set by TiCDC in PD, which is 86,400 seconds by default.

 ## What is the complete behavior of TiCDC garbage collection (GC) safepoint?

From 959cff531326708c3e91e483042e9ea512454e80 Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Mon, 8 Nov 2021 11:02:20 +0800
Subject: [PATCH 17/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index 31404da0ca165..25ec85bfb9da2 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -102,7 +102,7 @@ A replication task might be interrupted in the following known scenarios:
         2. Use the new task configuration file and add the `ignore-txn-start-ts` parameter to skip the transaction corresponding to the specified `start-ts`.
         3. Stop the old replication task via HTTP API. Execute `cdc cli changefeed create` to create a new task and specify the new task configuration file. Specify `checkpoint-ts` recorded in step 1 as the `start-ts` and start a new task to resume the replication.

-- In TiCDC v4.0.13 and earlier versions, the replication partition table may cause a replication interruption.
+- In TiCDC v4.0.13 and earlier versions, when TiCDC replicates the partitioned table, it might encounter an error that leads to replication interruption.

     - In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`.
     - Handling procedures: