From 9fb27955ea48f74a9a17e1462ad378083ef59847 Mon Sep 17 00:00:00 2001 From: Fendy Date: Tue, 14 Sep 2021 12:52:56 +0800 Subject: [PATCH 01/17] update interruption-EN --- ticdc/manage-ticdc.md | 2 +- ticdc/troubleshoot-ticdc.md | 147 ++++++++++++++++++++---------------- 2 files changed, 84 insertions(+), 65 deletions(-) diff --git a/ticdc/manage-ticdc.md b/ticdc/manage-ticdc.md index 573f211a079cb..621da6a6df0d3 100644 --- a/ticdc/manage-ticdc.md +++ b/ticdc/manage-ticdc.md @@ -788,4 +788,4 @@ In the output of the above command, if the value of `sort-engine` is "unified", > + If your servers use mechanical hard drives or other storage devices that have high latency or limited bandwidth, use the unified sorter with caution. > + The total free capacity of hard drives must be greater than or equal to 500G. If you need to replicate a large amount of historical data, make sure that the free capacity on each node is greater than or equal to the size of the incremental data that needs to be replicated. > + Unified sorter is enabled by default. If your servers do not match the above requirements and you want to disable the unified sorter, you need to manually set `sort-engine` to `memory` for the changefeed. -> + To enable Unified Sorter on an existing changefeed, see the methods provided in [How do I handle the OOM that occurs after TiCDC is restarted after a task interruption?](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-the-oom-that-occurs-after-ticdc-is-restarted-after-a-task-interruption). 
+> + To enable Unified Sorter on an existing changefeed, see the methods provided in [How do I handle the OOM that occurs after TiCDC is restarted after a task interruption?](/ticdc/troubleshoot-ticdc.md#what-should-i-do-to-handle-the-oom-that-occurs-after-ticdc-is-restarted-after-a-task-interruption) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index be47a53d9caf9..9b0e629468b0e 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -25,13 +25,74 @@ If you do not specify `start-ts`, or specify `start-ts` as `0`, when a replicati When you execute `cdc cli changefeed create` to create a replication task, TiCDC checks whether the upstream tables meet the [replication restrictions](/ticdc/ticdc-overview.md#restrictions). If some tables do not meet the restrictions, `some tables are not eligible to replicate` is returned with a list of ineligible tables. You can choose `Y` or `y` to continue creating the task, and all updates on these tables are automatically ignored during the replication. If you choose an input other than `Y` or `y`, the replication task is not created. -## How do I handle replication interruption? +## How do I view the state of TiCDC replication tasks? + +You can use `cdc cli` to view the state of TiCDC replication tasks. For example: + +{{< copyable "shell-regular" >}} + +```shell +cdc cli changefeed list --pd=http://10.0.10.25:2379 +``` + +The expected output is as follows: + +```json +[{ + "id": "4e24dde6-53c1-40b6-badf-63620e4940dc", + "summary": { + "state": "normal", + "tso": 417886179132964865, + "checkpoint": "2020-07-07 16:07:44.881", + "error": null + } +}] +``` + +* `checkpoint`: TiCDC has replicated all data before this timestamp to downstream. +* `state`: The state of this replication task: + * `normal`: The task runs normally. + * `stopped`: The task is stopped manually or encounters an error. + * `removed`: The task is removed. 
+ +> **Note:** +> +> This feature is introduced in TiCDC version 4.0.3. + +## TiCDC replication interruptions + +### How do I know whether a TiCDC replication task is interrupted? + +- Check the `changefeed checkpoint` monitoring metric of the replication task (choose the right `changefeed id`) in the Grafana dashboard. If the metric value stays unchanged, or the `checkpoint lag` metric keeps increasing, the replication task might be interrupted. +- Check the `exit error count` monitoring metric. If the metric value is greater than `0`, an error has occurred in the replication task. +- Execute `cdc cli changefeed list` and `cdc cli changefeed query` to check the status of the replication task. `stopped` means the task has stopped and the `error` item provides the detailed error information. After the error occurs, you can search `error on running processor` in the TiCDC server log to see the error stack for troubleshooting. +- In some extreme cases, the TiCDC service is restarted. You can search the `FATAL` level log in the TiCDC server log for troubleshooting. + +### How do I know whether the replication task is stopped manually? + +You can know whether the replication task is stopped manually by using `cdc cli`. For example: + +{{< copyable "shell-regular" >}} + +```shell +cdc cli changefeed query --pd=http://10.0.10.25:2379 --changefeed-id 28c43ffc-2316-4f4f-a70b-d1a7c59ba79f +``` + +In the output of the above command, `admin-job-type` shows the state of this replication task: + +* `0`: In progress, which means that the task is not stopped manually. +* `1`: Paused. When the task is paused, all replicated `processor`s exit. The configuration and the replication status of the task are retained, so you can resume the task from `checkpiont-ts`. +* `2`: Resumed. The replication task resumes from `checkpoint-ts`. +* `3`: Removed. When the task is removed, all replicated `processor`s are ended, and the configuration information of the replication task is cleared up. 
Only the replication status is retained for later queries. + +### How do I handle replication interruptions? A replication task might be interrupted in the following known scenarios: - The downstream continues to be abnormal, and TiCDC still fails after many retries. - In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`. + - Handling method: You can resume the replication task via the HTTP interface after the downstream is back to normal. - Replication cannot continue because of incompatible SQL statement(s) in the downstream. @@ -42,35 +103,45 @@ A replication task might be interrupted in the following known scenarios: 2. Use the new task configuration file and add the `ignore-txn-start-ts` parameter to skip the transaction corresponding to the specified `start-ts`. 3. Stop the old replication task via HTTP API. Execute `cdc cli changefeed create` to create a new task and specify the new task configuration file. Specify `checkpoint-ts` recorded in step 1 as the `start-ts` and start a new task to resume the replication. -## How do I know whether a TiCDC replication task is interrupted? +- In TiCDC v4.0.13 and earlier versions, the replication partition table may cause a replication interruption. -- Check the `changefeed checkpoint` monitoring metric of the replication task (choose the right `changefeed id`) in the Grafana dashboard. If the metric value stays unchanged, or the `checkpoint lag` metric keeps increasing, the replication task might be interrupted. -- Check the `exit error count` monitoring metric. If the metric value is greater than `0`, an error has occurred in the replication task. -- Execute `cdc cli changefeed list` and `cdc cli changefeed query` to check the status of the replication task. `stopped` means the task has stopped and the `error` item provides the detailed error information. 
After the error occurs, you can search `error on running processor` in the TiCDC server log to see the error stack for troubleshooting. -- In some extreme cases, the TiCDC service is restarted. You can search the `FATAL` level log in the TiCDC server log for troubleshooting. + - In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`. + - Handling procedures: + 1. Pause the replication task through `cdc cli changefeed pause -c `. + 2. Wait for about one munite and then resume the replication task through `cdc cli changefeed resume -c `. -## What is `gc-ttl` in TiCDC? +### What should I do to handle the OOM that occurs after TiCDC is restarted after a task interruption? -Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data smaller than this GC safepoint is not cleaned by GC. Enabling this feature in TiCDC ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC when the replication task is unavailable or interrupted. +- Update TiDB cluster and TiCDC cluster to their latest versions. The OOM problem has already been resolved in **v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the newest versions**. -When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint through `gc-ttl`, which means the longest time that data is retained within the GC safepoint. This value is set by TiCDC in PD, which is 86,400 seconds by default. +- In above updated versions, you can enable the Unified Sorter to help you sort data in the disk when the system memory is insufficient. To enable this function, you can pass `--sort-engine=unified` to the `cdc cli` command when creating a replication task. 
For example: -## How do I handle the OOM that occurs after TiCDC is restarted after a task interruption? +{{< copyable "shell-regular" >}} -If the replication task is interrupted for a long time and a large volume of new data has been written to TiDB, Out of Memory (OOM) might occur when TiCDC is restarted. In this situation, you can enable unified sorter, TiCDC's experimental sorting engine. This engine sorts data in the disk when the memory is insufficient. To enable this feature, pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example: +```shell +cdc cli changefeed update -c --sort-engine="unified" --pd=http://10.0.10.25:2379 +``` +If you fail to update your cluster to above new versions, the Unified Sorter can still be enabled in **previous versions**. You can pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example: {{< copyable "shell-regular" >}} ```shell -cdc cli changefeed update -c [changefeed-id] --sort-engine="unified" --sort-dir="/data/cdc/sort" --pd=http://10.0.10.25:2379 +cdc cli changefeed update -c --sort-engine="unified" --sort-dir="/data/cdc/sort" --pd=http://10.0.10.25:2379 ``` > **Note:** > > + Since v4.0.9, TiCDC supports the unified sorter engine. > + TiCDC (the 4.0 version) does not support dynamically modifying the sorting engine yet. Make sure that the changefeed has stopped before modifying the sorter settings. +> + `sort-dir` has different behaviors in different versions, please refer to [Compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatiblity-notes-for-sort-dir-and-data-dir), and configures it with caution. > + Currently, the unified sorter is an experimental feature. When the number of tables is too large (>=100), the unified sorter might cause performance issues and affect replication throughput. 
Therefore, it is not recommended to use it in a production environment. Before you enable the unified sorter, make sure that the machine of each TiCDC node has enough disk capacity. If the total size of unprocessed data changes might exceed 1 TB, it is not recommend to use TiCDC for replication. +## What is `gc-ttl` in TiCDC? + +Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data smaller than this GC safepoint is not cleaned by GC. Enabling this feature in TiCDC ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC when the replication task is unavailable or interrupted. + +When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint through `gc-ttl`, which means the longest time that data is retained within the GC safepoint. This value is set by TiCDC in PD, which is 86,400 seconds by default. + ## What is the complete behavior of TiCDC garbage collection (GC) safepoint? If a replication task starts after the TiCDC service starts, the TiCDC owner updates the PD service GC safepoint with the smallest value of `checkpoint-ts` among all replication tasks. The service GC safepoint ensures that TiCDC does not delete data generated at that time and after that time. If the replication task is interrupted, the `checkpoint-ts` of this task does not change and PD's corresponding service GC safepoint is not updated either. The Time-To-Live (TTL) that TiCDC sets for a service GC safepoint is 24 hours, which means that the GC mechanism does not delete any data if the TiCDC service can be recovered within 24 hours after it is interrupted. @@ -177,58 +248,6 @@ cdc cli changefeed create --pd=http://10.0.10.25:2379 --sink-uri="kafka://127.0. For more information, refer to [Create a replication task](/ticdc/manage-ticdc.md#create-a-replication-task). 
-## How do I view the status of TiCDC replication tasks? - -To view the status of TiCDC replication tasks, use `cdc cli`. For example: - -{{< copyable "shell-regular" >}} - -```shell -cdc cli changefeed list --pd=http://10.0.10.25:2379 -``` - -The expected output is as follows: - -```json -[{ - "id": "4e24dde6-53c1-40b6-badf-63620e4940dc", - "summary": { - "state": "normal", - "tso": 417886179132964865, - "checkpoint": "2020-07-07 16:07:44.881", - "error": null - } -}] -``` - -* `checkpoint`: TiCDC has replicated all data before this timestamp to downstream. -* `state`: The state of the replication task: - - * `normal`: The task runs normally. - * `stopped`: The task is stopped manually or encounters an error. - * `removed`: The task is removed. - -> **Note:** -> -> This feature is introduced in TiCDC 4.0.3. - -## How do I know whether the replication task is stopped manually? - -You can know whether the replication task is stopped manually by using `cdc cli`. For example: - -{{< copyable "shell-regular" >}} - -```shell -cdc cli changefeed query --pd=http://10.0.10.25:2379 --changefeed-id 28c43ffc-2316-4f4f-a70b-d1a7c59ba79f -``` - -In the output of this command, `admin-job-type` shows the state of the replication task: - -* `0`: In progress, which means that the task is not stopped manually. -* `1`: Paused. When the task is paused, all replicated `processor`s exit. The configuration and the replication status of the task are retained, so you can resume the task from `checkpiont-ts`. -* `2`: Resumed. The replication task resumes from `checkpoint-ts`. -* `3`: Removed. When the task is removed, all replicated `processor`s are ended, and the configuration information of the replication task is cleared up. Only the replication status is retained for later queries. - ## Why does the latency from TiCDC to Kafka become higher and higher? * Check [how do I view the status of TiCDC replication tasks](#how-do-i-view-the-status-of-ticdc-replication-tasks). 
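The pause-and-resume procedure that this patch adds for partition-table interruptions (TiCDC v4.0.13 and earlier) could be scripted roughly as below. This is a minimal sketch, not part of the patch: the function name `restart_changefeed` and the `WAIT_SECONDS` variable are hypothetical, and it assumes the `cdc` binary is on `PATH` and that `--pd` is passed the same way as in the other `cdc cli changefeed` examples in the document.

```shell
#!/usr/bin/env bash
# Sketch: pause a changefeed, wait about one minute, then resume it.
# Hypothetical names (not from the patch): restart_changefeed, WAIT_SECONDS.

restart_changefeed() {
    local changefeed_id="$1"   # ID of the changefeed to restart
    local pd_addr="$2"         # PD endpoint, for example http://10.0.10.25:2379
    local wait_seconds="${WAIT_SECONDS:-60}"   # "about one minute" by default

    if [ -z "$changefeed_id" ] || [ -z "$pd_addr" ]; then
        echo "usage: restart_changefeed <changefeed-id> <pd-address>" >&2
        return 2
    fi

    # Step 1: pause the replication task.
    cdc cli changefeed pause -c "$changefeed_id" --pd="$pd_addr" || return 1

    # Step 2: wait, then resume the replication task.
    sleep "$wait_seconds"
    cdc cli changefeed resume -c "$changefeed_id" --pd="$pd_addr" || return 1
}
```

After sourcing the file, a call such as `restart_changefeed 28c43ffc-2316-4f4f-a70b-d1a7c59ba79f http://10.0.10.25:2379` would run both steps in order. The function stops early if the pause fails, so a changefeed that could not be paused is never resumed blindly.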
From 023457cb3c7db9a508fb50f145522b206afc4eb4 Mon Sep 17 00:00:00 2001 From: Fendy Date: Tue, 14 Sep 2021 13:38:25 +0800 Subject: [PATCH 02/17] Update ticdc troubleshooting EN --- ticdc/troubleshoot-ticdc.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 9b0e629468b0e..de6336a2c389b 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -133,7 +133,7 @@ cdc cli changefeed update -c --sort-engine="unified" --sort-dir= > > + Since v4.0.9, TiCDC supports the unified sorter engine. > + TiCDC (the 4.0 version) does not support dynamically modifying the sorting engine yet. Make sure that the changefeed has stopped before modifying the sorter settings. -> + `sort-dir` has different behaviors in different versions, please refer to [Compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatiblity-notes-for-sort-dir-and-data-dir), and configures it with caution. +> + `sort-dir` has different behaviors in different versions, please refer to [Compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatibility-notes-for-sort-dir-and-data-dir), and configures it with caution. > + Currently, the unified sorter is an experimental feature. When the number of tables is too large (>=100), the unified sorter might cause performance issues and affect replication throughput. Therefore, it is not recommended to use it in a production environment. Before you enable the unified sorter, make sure that the machine of each TiCDC node has enough disk capacity. If the total size of unprocessed data changes might exceed 1 TB, it is not recommend to use TiCDC for replication. ## What is `gc-ttl` in TiCDC? @@ -250,7 +250,7 @@ For more information, refer to [Create a replication task](/ticdc/manage-ticdc.m ## Why does the latency from TiCDC to Kafka become higher and higher? 
-* Check [how do I view the status of TiCDC replication tasks](#how-do-i-view-the-status-of-ticdc-replication-tasks). +* Check [how do I view the state of TiCDC replication tasks](#how-do-i-view-the-state-of-ticdc-replication-tasks). * Adjust the following parameters of Kafka: * Increase the `message.max.bytes` value in `server.properties` to `1073741824` (1 GB). From e32b483cd41bd97d4f749272ef8d21b329e37d18 Mon Sep 17 00:00:00 2001 From: Fendy Date: Tue, 14 Sep 2021 13:46:07 +0800 Subject: [PATCH 03/17] Update ticdc troubleshooting - EN --- ticdc/troubleshoot-ticdc.md | 1 + 1 file changed, 1 insertion(+) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index de6336a2c389b..6424bad9ec21f 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -123,6 +123,7 @@ cdc cli changefeed update -c --sort-engine="unified" --pd=http:/ ``` If you fail to update your cluster to above new versions, the Unified Sorter can still be enabled in **previous versions**. You can pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example: + {{< copyable "shell-regular" >}} ```shell From f57ebea72b7abc21f5859405d1887f3753edc0d1 Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:25:34 +0800 Subject: [PATCH 04/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 6424bad9ec21f..136f3ca6950cb 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -27,7 +27,7 @@ When you execute `cdc cli changefeed create` to create a replication task, TiCDC ## How do I view the state of TiCDC replication tasks? -You can use `cdc cli` to view the state of TiCDC replication tasks. 
For example: +To view the status of TiCDC replication tasks, use `cdc cli`. For example: {{< copyable "shell-regular" >}} From 188e5e087d368f4277cefb94f4c2b1436f77151b Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:25:54 +0800 Subject: [PATCH 05/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 136f3ca6950cb..d28295ca17502 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -57,7 +57,7 @@ The expected output is as follows: > **Note:** > -> This feature is introduced in TiCDC version 4.0.3. +> This feature is introduced in TiCDC 4.0.3. ## TiCDC replication interruptions From 8cfbee0425715f948a39de3fbe949ed71097a2ba Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:26:40 +0800 Subject: [PATCH 06/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index d28295ca17502..5ed4c6f89381c 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -65,7 +65,7 @@ The expected output is as follows: - Check the `changefeed checkpoint` monitoring metric of the replication task (choose the right `changefeed id`) in the Grafana dashboard. If the metric value stays unchanged, or the `checkpoint lag` metric keeps increasing, the replication task might be interrupted. - Check the `exit error count` monitoring metric. If the metric value is greater than `0`, an error has occurred in the replication task. -- Execute `cdc cli changefeed list` and `cdc cli changefeed query` to check the status of the replication task. 
`stopped` means the task has stopped and the `error` item provides the detailed error information. After the error occurs, you can search `error on running processor` in the TiCDC server log to see the error stack for troubleshooting. +- Execute `cdc cli changefeed list` and `cdc cli changefeed query` to check the status of the replication task. `stopped` means the task has stopped, and the `error` item provides the detailed error message. After the error occurs, you can search `error on running processor` in the TiCDC server log to see the error stack for troubleshooting. - In some extreme cases, the TiCDC service is restarted. You can search the `FATAL` level log in the TiCDC server log for troubleshooting. ### How do I know whether the replication task is stopped manually? From 97cf50355a7c91f8dc1806bffdb3845757be5b20 Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:26:54 +0800 Subject: [PATCH 07/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 5ed4c6f89381c..ad4aa9ef5d9f3 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -70,7 +70,7 @@ The expected output is as follows: ### How do I know whether the replication task is stopped manually? -You can know whether the replication task is stopped manually by using `cdc cli`. For example: +You can know whether the replication task is stopped manually by executing `cdc cli`. 
For example: {{< copyable "shell-regular" >}} From b851e1333bdb694b53e913a9fb343910d42a4941 Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:27:37 +0800 Subject: [PATCH 08/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index ad4aa9ef5d9f3..8dad5af7b01b0 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -83,7 +83,7 @@ In the output of the above command, `admin-job-type` shows the state of this rep * `0`: In progress, which means that the task is not stopped manually. * `1`: Paused. When the task is paused, all replicated `processor`s exit. The configuration and the replication status of the task are retained, so you can resume the task from `checkpiont-ts`. * `2`: Resumed. The replication task resumes from `checkpoint-ts`. -* `3`: Removed. When the task is removed, all replicated `processor`s are ended, and the configuration information of the replication task is cleared up. Only the replication status is retained for later queries. +* `3`: Removed. When the task is removed, all replicated `processor`s are ended, and the configuration information of the replication task is cleared up. The replication status is retained only for later queries. ### How do I handle replication interruptions? 
From 9fe9eeca4364812096c2cf0ec4e83df02f8bd2b2 Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:28:11 +0800 Subject: [PATCH 09/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 8dad5af7b01b0..92651f5f5337a 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -114,7 +114,7 @@ A replication task might be interrupted in the following known scenarios: - Update TiDB cluster and TiCDC cluster to their latest versions. The OOM problem has already been resolved in **v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the newest versions**. -- In above updated versions, you can enable the Unified Sorter to help you sort data in the disk when the system memory is insufficient. To enable this function, you can pass `--sort-engine=unified` to the `cdc cli` command when creating a replication task. For example: +- In the above updated versions, you can enable the Unified Sorter to help you sort data in the disk when the system memory is insufficient. To enable this function, you can pass `--sort-engine=unified` to the `cdc cli` command when creating a replication task. 
For example: {{< copyable "shell-regular" >}} From 5e3ebb733553e2062d26f6fe257cac793c5d1d46 Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:29:24 +0800 Subject: [PATCH 10/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 92651f5f5337a..b107607924c4c 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -122,7 +122,7 @@ A replication task might be interrupted in the following known scenarios: cdc cli changefeed update -c --sort-engine="unified" --pd=http://10.0.10.25:2379 ``` -If you fail to update your cluster to above new versions, the Unified Sorter can still be enabled in **previous versions**. You can pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example: +If you fail to update your cluster to the above new versions, you can still enable Unified Sorter in **previous versions**. You can pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example: {{< copyable "shell-regular" >}} From 35b8b7eae0515cb4e7396c3d49be84d9f10d2f47 Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:30:50 +0800 Subject: [PATCH 11/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index b107607924c4c..814749635bbb6 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -134,7 +134,7 @@ cdc cli changefeed update -c --sort-engine="unified" --sort-dir= > > + Since v4.0.9, TiCDC supports the unified sorter engine. 
> + TiCDC (the 4.0 version) does not support dynamically modifying the sorting engine yet. Make sure that the changefeed has stopped before modifying the sorter settings. -> + `sort-dir` has different behaviors in different versions, please refer to [Compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatibility-notes-for-sort-dir-and-data-dir), and configures it with caution. +> + `sort-dir` has different behaviors in different versions. Refer to [compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatibility-notes-for-sort-dir-and-data-dir), and configure it with caution. > + Currently, the unified sorter is an experimental feature. When the number of tables is too large (>=100), the unified sorter might cause performance issues and affect replication throughput. Therefore, it is not recommended to use it in a production environment. Before you enable the unified sorter, make sure that the machine of each TiCDC node has enough disk capacity. If the total size of unprocessed data changes might exceed 1 TB, it is not recommend to use TiCDC for replication. ## What is `gc-ttl` in TiCDC? From 079827acc4b932ad6e8c1ec568b3bf00b8164998 Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 16:25:02 +0800 Subject: [PATCH 12/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 814749635bbb6..a503cae2e7007 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -107,7 +107,7 @@ A replication task might be interrupted in the following known scenarios: - In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`. - Handling procedures: - 1. 
Pause the replication task through `cdc cli changefeed pause -c `. + 1. Pause the replication task by executing `cdc cli changefeed pause -c `. 2. Wait for about one munite and then resume the replication task through `cdc cli changefeed resume -c `. ### What should I do to handle the OOM that occurs after TiCDC is restarted after a task interruption? From 3bbfdaa6710bca26cd2fd5125a0c219b53b3f09d Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 16:25:26 +0800 Subject: [PATCH 13/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index a503cae2e7007..7a497620de548 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -108,7 +108,7 @@ A replication task might be interrupted in the following known scenarios: - In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`. - Handling procedures: 1. Pause the replication task by executing `cdc cli changefeed pause -c `. - 2. Wait for about one munite and then resume the replication task through `cdc cli changefeed resume -c `. + 2. Wait for about one munite, and then resume the replication task by executing `cdc cli changefeed resume -c `. ### What should I do to handle the OOM that occurs after TiCDC is restarted after a task interruption? 
From 9592460b37244205a970997c703f289eac3e2179 Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 16:26:21 +0800 Subject: [PATCH 14/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 7a497620de548..65543696a2a6e 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -112,7 +112,7 @@ A replication task might be interrupted in the following known scenarios: ### What should I do to handle the OOM that occurs after TiCDC is restarted after a task interruption? -- Update TiDB cluster and TiCDC cluster to their latest versions. The OOM problem has already been resolved in **v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the newest versions**. +- Update your TiDB cluster and TiCDC cluster to the latest versions. The OOM problem has already been resolved in **v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the latest versions**. - In the above updated versions, you can enable the Unified Sorter to help you sort data in the disk when the system memory is insufficient. To enable this function, you can pass `--sort-engine=unified` to the `cdc cli` command when creating a replication task. 
For example:

From 34d50de32fe191878affe37a084364c68a06f5a4 Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Mon, 8 Nov 2021 11:01:05 +0800
Subject: [PATCH 15/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index 65543696a2a6e..5e4a373e1eb74 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -139,7 +139,7 @@ cdc cli changefeed update -c --sort-engine="unified" --sort-dir=

 ## What is `gc-ttl` in TiCDC?

-Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data smaller than this GC safepoint is not cleaned by GC. Enabling this feature in TiCDC ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC when the replication task is unavailable or interrupted.
+Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data later than this GC safepoint is not cleaned by GC. When the replication task is unavailable or interrupted, this feature ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC.

 When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint through `gc-ttl`, which means the longest time that data is retained within the GC safepoint. This value is set by TiCDC in PD, which is 86,400 seconds by default.
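
To make the default `gc-ttl` value in the paragraph above concrete, the following sketch converts 86,400 seconds to hours and prints (without executing) a server start command that sets the value explicitly. It rests on assumptions: the `--gc-ttl` spelling of the option on `cdc server` should be checked against `cdc server --help` for your version, the PD address is a placeholder, and `ttl_hours` and `GC_TTL_SECONDS` are hypothetical names used only here.

```shell
#!/bin/sh
set -eu

# Default from the text above; override to try other values.
GC_TTL_SECONDS="${GC_TTL_SECONDS:-86400}"

# Convert a TTL given in seconds to whole hours.
ttl_hours() {
    echo $(( $1 / 3600 ))
}

echo "gc-ttl=${GC_TTL_SECONDS}s is $(ttl_hours "$GC_TTL_SECONDS") hours"

# Printed only, not executed; the flag name is an assumption to verify.
echo "cdc server --pd=http://10.0.10.25:2379 --gc-ttl=${GC_TTL_SECONDS}"
```

With the default value, the first line reports 24 hours, which is how long data after the task checkpoint is protected from GC while a task is interrupted.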
From 131b0a8140bcf2fbdc0120185354025c15689aee Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Mon, 8 Nov 2021 11:01:40 +0800
Subject: [PATCH 16/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index 5e4a373e1eb74..e10ea30c6f530 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -141,7 +141,7 @@ cdc cli changefeed update -c --sort-engine="unified" --sort-dir=

 Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data later than this GC safepoint is not cleaned by GC. When the replication task is unavailable or interrupted, this feature ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC.

-When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint through `gc-ttl`, which means the longest time that data is retained within the GC safepoint. This value is set by TiCDC in PD, which is 86,400 seconds by default.
+When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint by configuring `gc-ttl`, which means the longest time that data is retained within the GC safepoint. This value is set by TiCDC in PD, which is 86,400 seconds by default.

 ## What is the complete behavior of TiCDC garbage collection (GC) safepoint?
From 7b808ac42d75f6669e85e4bf824531d0ebdb0ec6 Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Mon, 8 Nov 2021 11:02:20 +0800
Subject: [PATCH 17/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index e10ea30c6f530..ad91a303aafbd 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -103,7 +103,7 @@ A replication task might be interrupted in the following known scenarios:
         2. Use the new task configuration file and add the `ignore-txn-start-ts` parameter to skip the transaction corresponding to the specified `start-ts`.
         3. Stop the old replication task via HTTP API. Execute `cdc cli changefeed create` to create a new task and specify the new task configuration file. Specify `checkpoint-ts` recorded in step 1 as the `start-ts` and start a new task to resume the replication.

-- In TiCDC v4.0.13 and earlier versions, the replication partition table may cause a replication interruption.
+- In TiCDC v4.0.13 and earlier versions, when TiCDC replicates the partitioned table, it might encounter an error that leads to replication interruption.

     - In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`.
     - Handling procedures: