From 44c0dd6ac735b6f7cabf4243fc02e98c51e333a0 Mon Sep 17 00:00:00 2001 From: Fendy Date: Tue, 14 Sep 2021 12:52:56 +0800 Subject: [PATCH 01/17] update interruption-EN --- ticdc/manage-ticdc.md | 2 +- ticdc/troubleshoot-ticdc.md | 147 ++++++++++++++++++++---------------- 2 files changed, 84 insertions(+), 65 deletions(-) diff --git a/ticdc/manage-ticdc.md b/ticdc/manage-ticdc.md index 5fa72ce018bd7..ebc5aa8b5aada 100644 --- a/ticdc/manage-ticdc.md +++ b/ticdc/manage-ticdc.md @@ -860,4 +860,4 @@ In the output of the above command, if the value of `sort-engine` is "unified", > + If your servers use mechanical hard drives or other storage devices that have high latency or limited bandwidth, use the unified sorter with caution. > + The total free capacity of hard drives must be greater than or equal to 500G. If you need to replicate a large amount of historical data, make sure that the free capacity on each node is greater than or equal to the size of the incremental data that needs to be replicated. > + Unified sorter is enabled by default. If your servers do not match the above requirements and you want to disable the unified sorter, you need to manually set `sort-engine` to `memory` for the changefeed. -> + To enable Unified Sorter on an existing changefeed, see the methods provided in [How do I handle the OOM that occurs after TiCDC is restarted after a task interruption?](/ticdc/troubleshoot-ticdc.md#how-do-i-handle-the-oom-that-occurs-after-ticdc-is-restarted-after-a-task-interruption). 
+> + To enable Unified Sorter on an existing changefeed, see the methods provided in [How do I handle the OOM that occurs after TiCDC is restarted after a task interruption?](/ticdc/troubleshoot-ticdc.md#what-should-i-do-to-handle-the-oom-that-occurs-after-ticdc-is-restarted-after-a-task-interruption) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index efdcc4c330295..5e3c78e86837d 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -24,13 +24,74 @@ If you do not specify `start-ts`, or specify `start-ts` as `0`, when a replicati When you execute `cdc cli changefeed create` to create a replication task, TiCDC checks whether the upstream tables meet the [replication restrictions](/ticdc/ticdc-overview.md#restrictions). If some tables do not meet the restrictions, `some tables are not eligible to replicate` is returned with a list of ineligible tables. You can choose `Y` or `y` to continue creating the task, and all updates on these tables are automatically ignored during the replication. If you choose an input other than `Y` or `y`, the replication task is not created. -## How do I handle replication interruption? +## How do I view the state of TiCDC replication tasks? + +You can use `cdc cli` to view the state of TiCDC replication tasks. For example: + +{{< copyable "shell-regular" >}} + +```shell +cdc cli changefeed list --pd=http://10.0.10.25:2379 +``` + +The expected output is as follows: + +```json +[{ + "id": "4e24dde6-53c1-40b6-badf-63620e4940dc", + "summary": { + "state": "normal", + "tso": 417886179132964865, + "checkpoint": "2020-07-07 16:07:44.881", + "error": null + } +}] +``` + +* `checkpoint`: TiCDC has replicated all data before this timestamp to downstream. +* `state`: The state of this replication task: + * `normal`: The task runs normally. + * `stopped`: The task is stopped manually or encounters an error. + * `removed`: The task is removed. 
+ +> **Note:** +> +> This feature is introduced in TiCDC version 4.0.3. + +## TiCDC replication interruptions + +### How do I know whether a TiCDC replication task is interrupted? + +- Check the `changefeed checkpoint` monitoring metric of the replication task (choose the right `changefeed id`) in the Grafana dashboard. If the metric value stays unchanged, or the `checkpoint lag` metric keeps increasing, the replication task might be interrupted. +- Check the `exit error count` monitoring metric. If the metric value is greater than `0`, an error has occurred in the replication task. +- Execute `cdc cli changefeed list` and `cdc cli changefeed query` to check the status of the replication task. `stopped` means the task has stopped and the `error` item provides the detailed error information. After the error occurs, you can search `error on running processor` in the TiCDC server log to see the error stack for troubleshooting. +- In some extreme cases, the TiCDC service is restarted. You can search the `FATAL` level log in the TiCDC server log for troubleshooting. + +### How do I know whether the replication task is stopped manually? + +You can know whether the replication task is stopped manually by using `cdc cli`. For example: + +{{< copyable "shell-regular" >}} + +```shell +cdc cli changefeed query --pd=http://10.0.10.25:2379 --changefeed-id 28c43ffc-2316-4f4f-a70b-d1a7c59ba79f +``` + +In the output of the above command, `admin-job-type` shows the state of this replication task: + +* `0`: In progress, which means that the task is not stopped manually. +* `1`: Paused. When the task is paused, all replicated `processor`s exit. The configuration and the replication status of the task are retained, so you can resume the task from `checkpiont-ts`. +* `2`: Resumed. The replication task resumes from `checkpoint-ts`. +* `3`: Removed. When the task is removed, all replicated `processor`s are ended, and the configuration information of the replication task is cleared up. 
Only the replication status is retained for later queries. + +### How do I handle replication interruptions? A replication task might be interrupted in the following known scenarios: - The downstream continues to be abnormal, and TiCDC still fails after many retries. - In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`. + - Handling method: You can resume the replication task via the HTTP interface after the downstream is back to normal. - Replication cannot continue because of incompatible SQL statement(s) in the downstream. @@ -41,35 +102,45 @@ A replication task might be interrupted in the following known scenarios: 2. Use the new task configuration file and add the `ignore-txn-start-ts` parameter to skip the transaction corresponding to the specified `start-ts`. 3. Stop the old replication task via HTTP API. Execute `cdc cli changefeed create` to create a new task and specify the new task configuration file. Specify `checkpoint-ts` recorded in step 1 as the `start-ts` and start a new task to resume the replication. -## How do I know whether a TiCDC replication task is interrupted? +- In TiCDC v4.0.13 and earlier versions, the replication partition table may cause a replication interruption. -- Check the `changefeed checkpoint` monitoring metric of the replication task (choose the right `changefeed id`) in the Grafana dashboard. If the metric value stays unchanged, or the `checkpoint lag` metric keeps increasing, the replication task might be interrupted. -- Check the `exit error count` monitoring metric. If the metric value is greater than `0`, an error has occurred in the replication task. -- Execute `cdc cli changefeed list` and `cdc cli changefeed query` to check the status of the replication task. `stopped` means the task has stopped and the `error` item provides the detailed error information. 
After the error occurs, you can search `error on running processor` in the TiCDC server log to see the error stack for troubleshooting. -- In some extreme cases, the TiCDC service is restarted. You can search the `FATAL` level log in the TiCDC server log for troubleshooting. + - In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`. + - Handling procedures: + 1. Pause the replication task through `cdc cli changefeed pause -c `. + 2. Wait for about one munite and then resume the replication task through `cdc cli changefeed resume -c `. -## What is `gc-ttl` in TiCDC? +### What should I do to handle the OOM that occurs after TiCDC is restarted after a task interruption? -Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data smaller than this GC safepoint is not cleaned by GC. Enabling this feature in TiCDC ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC when the replication task is unavailable or interrupted. +- Update TiDB cluster and TiCDC cluster to their latest versions. The OOM problem has already been resolved in **v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the newest versions**. -When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint through `gc-ttl`, which means the longest time that data is retained within the GC safepoint. This value is set by TiCDC in PD, which is 86,400 seconds by default. +- In above updated versions, you can enable the Unified Sorter to help you sort data in the disk when the system memory is insufficient. To enable this function, you can pass `--sort-engine=unified` to the `cdc cli` command when creating a replication task. 
For example: -## How do I handle the OOM that occurs after TiCDC is restarted after a task interruption? +{{< copyable "shell-regular" >}} -If the replication task is interrupted for a long time and a large volume of new data has been written to TiDB, Out of Memory (OOM) might occur when TiCDC is restarted. In this situation, you can enable unified sorter, TiCDC's experimental sorting engine. This engine sorts data in the disk when the memory is insufficient. To enable this feature, pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example: +```shell +cdc cli changefeed update -c --sort-engine="unified" --pd=http://10.0.10.25:2379 +``` +If you fail to update your cluster to above new versions, the Unified Sorter can still be enabled in **previous versions**. You can pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example: {{< copyable "shell-regular" >}} ```shell -cdc cli changefeed update -c [changefeed-id] --sort-engine="unified" --sort-dir="/data/cdc/sort" --pd=http://10.0.10.25:2379 +cdc cli changefeed update -c --sort-engine="unified" --sort-dir="/data/cdc/sort" --pd=http://10.0.10.25:2379 ``` > **Note:** > > + Since v4.0.9, TiCDC supports the unified sorter engine. > + TiCDC (the 4.0 version) does not support dynamically modifying the sorting engine yet. Make sure that the changefeed has stopped before modifying the sorter settings. +> + `sort-dir` has different behaviors in different versions, please refer to [Compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatiblity-notes-for-sort-dir-and-data-dir), and configures it with caution. > + Currently, the unified sorter is an experimental feature. When the number of tables is too large (>=100), the unified sorter might cause performance issues and affect replication throughput. 
Therefore, it is not recommended to use it in a production environment. Before you enable the unified sorter, make sure that the machine of each TiCDC node has enough disk capacity. If the total size of unprocessed data changes might exceed 1 TB, it is not recommend to use TiCDC for replication. +## What is `gc-ttl` in TiCDC? + +Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data smaller than this GC safepoint is not cleaned by GC. Enabling this feature in TiCDC ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC when the replication task is unavailable or interrupted. + +When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint through `gc-ttl`, which means the longest time that data is retained within the GC safepoint. This value is set by TiCDC in PD, which is 86,400 seconds by default. + ## What is the complete behavior of TiCDC garbage collection (GC) safepoint? If a replication task starts after the TiCDC service starts, the TiCDC owner updates the PD service GC safepoint with the smallest value of `checkpoint-ts` among all replication tasks. The service GC safepoint ensures that TiCDC does not delete data generated at that time and after that time. If the replication task is interrupted, the `checkpoint-ts` of this task does not change and PD's corresponding service GC safepoint is not updated either. The Time-To-Live (TTL) that TiCDC sets for a service GC safepoint is 24 hours, which means that the GC mechanism does not delete any data if the TiCDC service can be recovered within 24 hours after it is interrupted. @@ -176,58 +247,6 @@ cdc cli changefeed create --pd=http://10.0.10.25:2379 --sink-uri="kafka://127.0. For more information, refer to [Create a replication task](/ticdc/manage-ticdc.md#create-a-replication-task). 
-## How do I view the status of TiCDC replication tasks? - -To view the status of TiCDC replication tasks, use `cdc cli`. For example: - -{{< copyable "shell-regular" >}} - -```shell -cdc cli changefeed list --pd=http://10.0.10.25:2379 -``` - -The expected output is as follows: - -```json -[{ - "id": "4e24dde6-53c1-40b6-badf-63620e4940dc", - "summary": { - "state": "normal", - "tso": 417886179132964865, - "checkpoint": "2020-07-07 16:07:44.881", - "error": null - } -}] -``` - -* `checkpoint`: TiCDC has replicated all data before this timestamp to downstream. -* `state`: The state of the replication task: - - * `normal`: The task runs normally. - * `stopped`: The task is stopped manually or encounters an error. - * `removed`: The task is removed. - -> **Note:** -> -> This feature is introduced in TiCDC 4.0.3. - -## How do I know whether the replication task is stopped manually? - -You can know whether the replication task is stopped manually by using `cdc cli`. For example: - -{{< copyable "shell-regular" >}} - -```shell -cdc cli changefeed query --pd=http://10.0.10.25:2379 --changefeed-id 28c43ffc-2316-4f4f-a70b-d1a7c59ba79f -``` - -In the output of this command, `admin-job-type` shows the state of the replication task: - -* `0`: In progress, which means that the task is not stopped manually. -* `1`: Paused. When the task is paused, all replicated `processor`s exit. The configuration and the replication status of the task are retained, so you can resume the task from `checkpiont-ts`. -* `2`: Resumed. The replication task resumes from `checkpoint-ts`. -* `3`: Removed. When the task is removed, all replicated `processor`s are ended, and the configuration information of the replication task is cleared up. Only the replication status is retained for later queries. - ## Why does the latency from TiCDC to Kafka become higher and higher? * Check [how do I view the status of TiCDC replication tasks](#how-do-i-view-the-status-of-ticdc-replication-tasks). 
From 046ca976a0fb846c05bc9de81165a84970104eb9 Mon Sep 17 00:00:00 2001 From: Fendy Date: Tue, 14 Sep 2021 13:38:25 +0800 Subject: [PATCH 02/17] Update ticdc troubleshooting EN --- ticdc/troubleshoot-ticdc.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 5e3c78e86837d..bcabba96fbd3a 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -132,7 +132,7 @@ cdc cli changefeed update -c --sort-engine="unified" --sort-dir= > > + Since v4.0.9, TiCDC supports the unified sorter engine. > + TiCDC (the 4.0 version) does not support dynamically modifying the sorting engine yet. Make sure that the changefeed has stopped before modifying the sorter settings. -> + `sort-dir` has different behaviors in different versions, please refer to [Compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatiblity-notes-for-sort-dir-and-data-dir), and configures it with caution. +> + `sort-dir` has different behaviors in different versions, please refer to [Compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatibility-notes-for-sort-dir-and-data-dir), and configures it with caution. > + Currently, the unified sorter is an experimental feature. When the number of tables is too large (>=100), the unified sorter might cause performance issues and affect replication throughput. Therefore, it is not recommended to use it in a production environment. Before you enable the unified sorter, make sure that the machine of each TiCDC node has enough disk capacity. If the total size of unprocessed data changes might exceed 1 TB, it is not recommend to use TiCDC for replication. ## What is `gc-ttl` in TiCDC? @@ -249,7 +249,7 @@ For more information, refer to [Create a replication task](/ticdc/manage-ticdc.m ## Why does the latency from TiCDC to Kafka become higher and higher? 
-* Check [how do I view the status of TiCDC replication tasks](#how-do-i-view-the-status-of-ticdc-replication-tasks). +* Check [how do I view the state of TiCDC replication tasks](#how-do-i-view-the-state-of-ticdc-replication-tasks). * Adjust the following parameters of Kafka: * Increase the `message.max.bytes` value in `server.properties` to `1073741824` (1 GB). From cdefde1bde6f0c03ca1d3d52fd6ef7761d440a5a Mon Sep 17 00:00:00 2001 From: Fendy Date: Tue, 14 Sep 2021 13:46:07 +0800 Subject: [PATCH 03/17] Update ticdc troubleshooting - EN --- ticdc/troubleshoot-ticdc.md | 1 + 1 file changed, 1 insertion(+) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index bcabba96fbd3a..92612508894c9 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -122,6 +122,7 @@ cdc cli changefeed update -c --sort-engine="unified" --pd=http:/ ``` If you fail to update your cluster to above new versions, the Unified Sorter can still be enabled in **previous versions**. You can pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example: + {{< copyable "shell-regular" >}} ```shell From f7d07e55e1b3dfee14372b9574772dd01e0f3009 Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:25:34 +0800 Subject: [PATCH 04/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 92612508894c9..88d845d0d438d 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -26,7 +26,7 @@ When you execute `cdc cli changefeed create` to create a replication task, TiCDC ## How do I view the state of TiCDC replication tasks? -You can use `cdc cli` to view the state of TiCDC replication tasks. 
For example: +To view the status of TiCDC replication tasks, use `cdc cli`. For example: {{< copyable "shell-regular" >}} From 739a7aa8e3168513e8f51c5434cebc1f77321dfb Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:25:54 +0800 Subject: [PATCH 05/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 88d845d0d438d..48deb11fe144b 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -56,7 +56,7 @@ The expected output is as follows: > **Note:** > -> This feature is introduced in TiCDC version 4.0.3. +> This feature is introduced in TiCDC 4.0.3. ## TiCDC replication interruptions From 6c50ff77e22be9a1ced10374b17ec1b1281f80af Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:26:40 +0800 Subject: [PATCH 06/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 48deb11fe144b..aff720aa5b62d 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -64,7 +64,7 @@ The expected output is as follows: - Check the `changefeed checkpoint` monitoring metric of the replication task (choose the right `changefeed id`) in the Grafana dashboard. If the metric value stays unchanged, or the `checkpoint lag` metric keeps increasing, the replication task might be interrupted. - Check the `exit error count` monitoring metric. If the metric value is greater than `0`, an error has occurred in the replication task. -- Execute `cdc cli changefeed list` and `cdc cli changefeed query` to check the status of the replication task. 
`stopped` means the task has stopped and the `error` item provides the detailed error information. After the error occurs, you can search `error on running processor` in the TiCDC server log to see the error stack for troubleshooting. +- Execute `cdc cli changefeed list` and `cdc cli changefeed query` to check the status of the replication task. `stopped` means the task has stopped, and the `error` item provides the detailed error message. After the error occurs, you can search `error on running processor` in the TiCDC server log to see the error stack for troubleshooting. - In some extreme cases, the TiCDC service is restarted. You can search the `FATAL` level log in the TiCDC server log for troubleshooting. ### How do I know whether the replication task is stopped manually? From f3af3d357e5c197688cda1498298fe233fccd228 Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:26:54 +0800 Subject: [PATCH 07/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index aff720aa5b62d..bbed8b6eda082 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -69,7 +69,7 @@ The expected output is as follows: ### How do I know whether the replication task is stopped manually? -You can know whether the replication task is stopped manually by using `cdc cli`. For example: +You can know whether the replication task is stopped manually by executing `cdc cli`. 
For example: {{< copyable "shell-regular" >}} From dec2afef250f5192ce0092fba3dc1a83b94a99cc Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:27:37 +0800 Subject: [PATCH 08/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index bbed8b6eda082..875fec3322ea6 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -82,7 +82,7 @@ In the output of the above command, `admin-job-type` shows the state of this rep * `0`: In progress, which means that the task is not stopped manually. * `1`: Paused. When the task is paused, all replicated `processor`s exit. The configuration and the replication status of the task are retained, so you can resume the task from `checkpiont-ts`. * `2`: Resumed. The replication task resumes from `checkpoint-ts`. -* `3`: Removed. When the task is removed, all replicated `processor`s are ended, and the configuration information of the replication task is cleared up. Only the replication status is retained for later queries. +* `3`: Removed. When the task is removed, all replicated `processor`s are ended, and the configuration information of the replication task is cleared up. The replication status is retained only for later queries. ### How do I handle replication interruptions? 
From 3c8c53322b690e4a2257827b2cd0204cdc204033 Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:28:11 +0800 Subject: [PATCH 09/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 875fec3322ea6..23d601d6c8a07 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -113,7 +113,7 @@ A replication task might be interrupted in the following known scenarios: - Update TiDB cluster and TiCDC cluster to their latest versions. The OOM problem has already been resolved in **v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the newest versions**. -- In above updated versions, you can enable the Unified Sorter to help you sort data in the disk when the system memory is insufficient. To enable this function, you can pass `--sort-engine=unified` to the `cdc cli` command when creating a replication task. For example: +- In the above updated versions, you can enable the Unified Sorter to help you sort data in the disk when the system memory is insufficient. To enable this function, you can pass `--sort-engine=unified` to the `cdc cli` command when creating a replication task. 
For example: {{< copyable "shell-regular" >}} From 07e5a0a117b3775b3897ca033fb4dfc9987ac7fa Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:29:24 +0800 Subject: [PATCH 10/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 23d601d6c8a07..aebe517164e3b 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -121,7 +121,7 @@ A replication task might be interrupted in the following known scenarios: cdc cli changefeed update -c --sort-engine="unified" --pd=http://10.0.10.25:2379 ``` -If you fail to update your cluster to above new versions, the Unified Sorter can still be enabled in **previous versions**. You can pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example: +If you fail to update your cluster to the above new versions, you can still enable Unified Sorter in **previous versions**. You can pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example: {{< copyable "shell-regular" >}} From 744766881bad522bdb717349ed788a8ee5a94eda Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 14:30:50 +0800 Subject: [PATCH 11/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index aebe517164e3b..151c1d0eca7fa 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -133,7 +133,7 @@ cdc cli changefeed update -c --sort-engine="unified" --sort-dir= > > + Since v4.0.9, TiCDC supports the unified sorter engine. 
> + TiCDC (the 4.0 version) does not support dynamically modifying the sorting engine yet. Make sure that the changefeed has stopped before modifying the sorter settings. -> + `sort-dir` has different behaviors in different versions, please refer to [Compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatibility-notes-for-sort-dir-and-data-dir), and configures it with caution. +> + `sort-dir` has different behaviors in different versions. Refer to [compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatibility-notes-for-sort-dir-and-data-dir), and configure it with caution. > + Currently, the unified sorter is an experimental feature. When the number of tables is too large (>=100), the unified sorter might cause performance issues and affect replication throughput. Therefore, it is not recommended to use it in a production environment. Before you enable the unified sorter, make sure that the machine of each TiCDC node has enough disk capacity. If the total size of unprocessed data changes might exceed 1 TB, it is not recommend to use TiCDC for replication. ## What is `gc-ttl` in TiCDC? From c01de9089f57b5f819af7db288ac3bacf08534c3 Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 16:25:02 +0800 Subject: [PATCH 12/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index 151c1d0eca7fa..f1c7c0dfc5171 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -106,7 +106,7 @@ A replication task might be interrupted in the following known scenarios: - In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`. - Handling procedures: - 1. 
Pause the replication task through `cdc cli changefeed pause -c `. + 1. Pause the replication task by executing `cdc cli changefeed pause -c `. 2. Wait for about one munite and then resume the replication task through `cdc cli changefeed resume -c `. ### What should I do to handle the OOM that occurs after TiCDC is restarted after a task interruption? From 39d115e359266c8c4b9e677f00b9ce6342381cf9 Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 16:25:26 +0800 Subject: [PATCH 13/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index f1c7c0dfc5171..c936a0d0a7a72 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -107,7 +107,7 @@ A replication task might be interrupted in the following known scenarios: - In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`. - Handling procedures: 1. Pause the replication task by executing `cdc cli changefeed pause -c `. - 2. Wait for about one munite and then resume the replication task through `cdc cli changefeed resume -c `. + 2. Wait for about one munite, and then resume the replication task by executing `cdc cli changefeed resume -c `. ### What should I do to handle the OOM that occurs after TiCDC is restarted after a task interruption? 
From df8c73f96c181da7da2a9d943516a252c2f7b634 Mon Sep 17 00:00:00 2001 From: Fendy <40378371+septemberfd@users.noreply.github.com> Date: Tue, 14 Sep 2021 16:26:21 +0800 Subject: [PATCH 14/17] Update ticdc/troubleshoot-ticdc.md Co-authored-by: Ran --- ticdc/troubleshoot-ticdc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md index c936a0d0a7a72..d34e9b58d17d2 100644 --- a/ticdc/troubleshoot-ticdc.md +++ b/ticdc/troubleshoot-ticdc.md @@ -111,7 +111,7 @@ A replication task might be interrupted in the following known scenarios: ### What should I do to handle the OOM that occurs after TiCDC is restarted after a task interruption? -- Update TiDB cluster and TiCDC cluster to their latest versions. The OOM problem has already been resolved in **v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the newest versions**. +- Update your TiDB cluster and TiCDC cluster to the latest versions. The OOM problem has already been resolved in **v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the latest versions**. - In the above updated versions, you can enable the Unified Sorter to help you sort data in the disk when the system memory is insufficient. To enable this function, you can pass `--sort-engine=unified` to the `cdc cli` command when creating a replication task. 
For example:

From 5c9f4962bf3213de5f48dd11746bac424afa9fb2 Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Mon, 8 Nov 2021 11:01:05 +0800
Subject: [PATCH 15/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index d34e9b58d17d2..d1b5125b0650f 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -138,7 +138,7 @@ cdc cli changefeed update -c --sort-engine="unified" --sort-dir=

 ## What is `gc-ttl` in TiCDC?

-Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data smaller than this GC safepoint is not cleaned by GC. Enabling this feature in TiCDC ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC when the replication task is unavailable or interrupted.
+Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data later than this GC safepoint is not cleaned by GC. When the replication task is unavailable or interrupted, this feature ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC.

 When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint through `gc-ttl`, which means the longest time that data is retained within the GC safepoint. This value is set by TiCDC in PD, which is 86,400 seconds by default.

From 95cb6878db8c887441e38410abc99d201e386788 Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Mon, 8 Nov 2021 11:01:40 +0800
Subject: [PATCH 16/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index d1b5125b0650f..676b6b3a6825b 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -140,7 +140,7 @@ cdc cli changefeed update -c --sort-engine="unified" --sort-dir=

 Since v4.0.0-rc.1, PD supports external services in setting the service-level GC safepoint. Any service can register and update its GC safepoint. PD ensures that the key-value data later than this GC safepoint is not cleaned by GC. When the replication task is unavailable or interrupted, this feature ensures that the data to be consumed by TiCDC is retained in TiKV without being cleaned by GC.

-When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint through `gc-ttl`, which means the longest time that data is retained within the GC safepoint. This value is set by TiCDC in PD, which is 86,400 seconds by default.
+When starting the TiCDC server, you can specify the Time To Live (TTL) duration of GC safepoint by configuring `gc-ttl`, which means the longest time that data is retained within the GC safepoint. This value is set by TiCDC in PD, which is 86,400 seconds by default.

 ## What is the complete behavior of TiCDC garbage collection (GC) safepoint?

From 44c127122f6bc31223a15a2db171f78e0792fc7b Mon Sep 17 00:00:00 2001
From: Fendy <40378371+septemberfd@users.noreply.github.com>
Date: Mon, 8 Nov 2021 11:02:20 +0800
Subject: [PATCH 17/17] Update ticdc/troubleshoot-ticdc.md

Co-authored-by: Ran
---
 ticdc/troubleshoot-ticdc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index 676b6b3a6825b..aaf7eb4206395 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -102,7 +102,7 @@ A replication task might be interrupted in the following known scenarios:
 2. Use the new task configuration file and add the `ignore-txn-start-ts` parameter to skip the transaction corresponding to the specified `start-ts`.
 3. Stop the old replication task via HTTP API. Execute `cdc cli changefeed create` to create a new task and specify the new task configuration file. Specify `checkpoint-ts` recorded in step 1 as the `start-ts` and start a new task to resume the replication.

-- In TiCDC v4.0.13 and earlier versions, the replication partition table may cause a replication interruption.
+- In TiCDC v4.0.13 and earlier versions, when TiCDC replicates the partitioned table, it might encounter an error that leads to replication interruption.

 - In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`.
 - Handling procedures: