From 04dd7ff9522f3cd23a570309da7ce411ef8bce10 Mon Sep 17 00:00:00 2001 From: toutdesuite Date: Fri, 24 Jul 2020 11:55:53 +0800 Subject: [PATCH 1/5] update both zh and en --- en/error-handling.md | 65 ++++++++++++++++++++++++++++++++++++++++++++ zh/error-handling.md | 18 ++++++++++-- 2 files changed, 80 insertions(+), 3 deletions(-) diff --git a/en/error-handling.md b/en/error-handling.md index 2885f5abf..935389cf1 100644 --- a/en/error-handling.md +++ b/en/error-handling.md @@ -165,3 +165,68 @@ For binlog replication processing units, manually recover replication using the For database related passwords in all the DM configuration files, use the passwords encrypted by `dmctl`. If a database password is empty, it is unnecessary to encrypt it. For how to encrypt the plaintext password, see [Encrypt the upstream MySQL user password using dmctl](deploy-a-dm-cluster-using-ansible.md#encrypt-the-upstream-mysql-user-password-using-dmctl). In addition, the user of the upstream and downstream databases must have the corresponding read and write privileges. Data Migration also [prechecks the corresponding privileges automatically](precheck.md) while starting the data replication task. + +### The replication task is interrupted and contains a `driver: bad connection` error + +When a `driver: bad connection` error occurs, it usually means that the database connection between the DM and the downstream TiDB is abnormal (such as network failure, TiDB restart, etc.) and the currently requested data cannot be sent to TiDB temporarily. + +The current version of DM will automatically retry. If it is not automatically retried due to version problems, etc., you can use `stop-task` to stop the task and then use `start-task` to restart the task. 
+ +### The relay processing unit reports an error `event from * in * diff from passed-in event *` or the replication task is interrupted and contains binlogs such as `get binlog error ERROR 1236 (HY000)`, `binlog checksum mismatch, data may be corrupted`, etc. Get or parse failed error + +In the process of DM's relay log pull and incremental replication, if you encounter an upstream binlog file exceeding 4GB, these two errors may occur. + +The reason is that the DM needs to verify the event based on the binlog position and file size when writing the relay log, and needs to save the synchronized binlog position information as a checkpoint. However, the official definition of MySQL binlog position uses uint32 storage, so the offset value of binlog position exceeding 4G will overflow, and the above error will occur. + +For the relay processing unit, you can manually restore it through the following steps: + +1. The size of the corresponding binlog file when the error is confirmed upstream exceeds 4GB. +2. Stop DM-worker. +3. Copy the binlog file corresponding to the upstream to the relay log directory as the relay log file. +4. Update the corresponding `relay.meta` file in the relay log directory to start pulling from the next binlog. If DM worker has enabled `enable_gtid`, then when modifying the `relay.meta` file, you also need to modify the GTID corresponding to the next binlog. If `enable_gtid` is not enabled, there is no need to modify the GTID. + + For example: when the error is reported, there are `binlog-name = "mysql-bin.004451"` and `binlog-pos = 2453`, then update them to `binlog-name = "mysql-bin.004452"` and `binlog-pos = 4` respectively, and update `binlog-gtid = "f0e914ef-54cf-11e7-813d-6c92bf2fa791:1-138218058"` at the same time. +5. Restart DM-worker. + +For the binlog replication processing unit, you can manually restore it through the following steps: + +1. The size of the corresponding binlog file when the error is confirmed upstream exceeds 4GB. 
+2. Stop the replication task by `stop-task`. +3. Update the global checkpoint in the downstream `dm_meta` database and the `binlog_name` in the checkpoint of each table to the error binlog file, and update the `binlog_pos` to a synchronized legal position value, such as 4. + + For example: the error task name is `dm_test`, the corresponding `source-id` is `replica-1`, and the corresponding binlog file is `mysql-bin|000001.004451` when the error occurs, then execute `UPDATE dm_test_syncer_checkpoint SET binlog_name='mysql-bin|000001.004451', binlog_pos = 4 WHERE id='replica-1';`. +4. Set `safe-mode: true` for the `syncers` part in the replication task configuration to ensure reentrant execution. +5. Start the replication task with `start-task`. +6. Observe the status of the replication task through `query-status`. When the relay log file that caused the error is synchronized, you can restore the `safe-mode` to the original value and restart the replication task. + +### When executing `query-status` or viewing the log, `Access denied for user 'root'@'172.31.43.27' (using password: YES)` + +In all DM configuration files, database-related passwords must use ciphertext encrypted by dmctl (if the database password is empty, no encryption is required). For details on how to use dmctl to encrypt plaintext passwords, see [Use dmctl to encrypt upstream MySQL user password](deploy-a-dm-cluster-using-ansible.md#encrypt-the-upstream-mysql-user-password-using-dmctl). + +In addition, during DM operation, users of upstream and downstream databases must have corresponding read and write permissions. In the process of starting the replication task, the DM will automatically perform the pre-check of the corresponding permissions. For details, see [Upstream MySQL instance configuration pre-check](precheck.md). + +### The load processing unit reports the error `packet for query is too large. 
Try adjusting the 'max_allowed_packet' variable` + +#### Reasons: + +* Both MySQL client and MySQL/TiDB Server have the quota limit for `max_allowed_packet`. If any `max_allowed_packet` is outside the normal range, the client receives the error message. Currently, for the latest version of DM and TiDB Server, the default value of `max_allowed_packet` is `64M`. + +* The full data import processing unit in DM does not support splitting the SQL file exported by the Dump processing unit in DM. + +#### Solutions: + +* It is recommended to set the `statement-size` option of `extra-args` for the Dump processing unit: + + According to the default `--statement-size` setting, the default size of `Insert Statement` generated by the Dump processing unit is about `1M`. With this default setting, the Load processing unit does not report the error `packet for query is too large. Try adjusting the 'max_allowed_packet' variable` in most cases. + + Sometimes you might receive the following `WARN` log during the data dump. This `WARN` log does not affect the dump process. This only means that wide tables are dumped. + + ``` + Row bigger than statement_size for xxx + ``` + +* If the single row of the wide table exceeds `64M`, you need to modify the following configurations and make sure the configurations take effect. + + * Execute `set @@global.max_allowed_packet=134217728` (`134217728 = 128M`) in the TiDB Server. + + * First add the `max-allowed-packet: 134217728` (128M) configuration to `target-database` in the DM task configuration file. Next, execute the `stop-task` command and then execute the `start-task` command. 
diff --git a/zh/error-handling.md b/zh/error-handling.md index 5eb9c6ad5..a810b6f31 100644 --- a/zh/error-handling.md +++ b/zh/error-handling.md @@ -102,8 +102,12 @@ aliases: ['/docs-cn/tidb-data-migration/dev/troubleshoot-dm/','/docs-cn/tidb-dat ### 同步任务中断并包含 `invalid connection` 错误 +#### 原因 + 发生 `invalid connection` 错误时,通常表示 DM 到下游 TiDB 的数据库连接出现了异常(如网络故障、TiDB 重启、TiKV busy 等)且当前请求已有部分数据发送到了 TiDB。 +#### 解决方案 + 由于 DM 中存在同步任务并发向下游复制数据的特性,因此在任务中断时可能同时包含多个错误(可通过 `query-status` 或 `query-error` 查询当前错误)。 - 如果错误中仅包含 `invalid connection` 类型的错误且当前处于增量复制阶段,则 DM 会自动进行重试。 @@ -111,16 +115,24 @@ aliases: ['/docs-cn/tidb-data-migration/dev/troubleshoot-dm/','/docs-cn/tidb-dat ### 同步任务中断并包含 `driver: bad connection` 错误 +#### 原因 + 发生 `driver: bad connection` 错误时,通常表示 DM 到下游 TiDB 的数据库连接出现了异常(如网络故障、TiDB 重启等)且当前请求的数据暂时未能发送到 TiDB。 +#### 解决方案 + 当前版本 DM 会自动进行重试,如果由于版本问题等未自动重试,可先使用 `stop-task` 停止任务后再使用 `start-task` 重启任务。 ### relay 处理单元报错 `event from * in * diff from passed-in event *` 或同步任务中断并包含 `get binlog error ERROR 1236 (HY000)`、`binlog checksum mismatch, data may be corrupted` 等 binlog 获取或解析失败错误 +#### 原因 + 在 DM 进行 relay log 拉取与增量同步过程中,如果遇到了上游超过 4GB 的 binlog 文件,就可能出现这两个错误。 原因是 DM 在写 relay log 时需要依据 binlog position 及文件大小对 event 进行验证,且需要保存同步的 binlog position 信息作为 checkpoint。但是 MySQL binlog position 官方定义使用 uint32 存储,所以超过 4G 部分的 binlog position 的 offset 值会溢出,进而出现上面的错误。 +#### 解决方案 + 对于 relay 处理单元,可通过以下步骤手动恢复: 1. 在上游确认出错时对应的 binlog 文件的大小超出了 4GB。 @@ -150,7 +162,7 @@ aliases: ['/docs-cn/tidb-data-migration/dev/troubleshoot-dm/','/docs-cn/tidb-dat ### load 处理单元报错 `packet for query is too large. 
Try adjusting the 'max_allowed_packet' variable` -出现该报错的主要原因包括以下两点: +#### 原因 * MySQL client 和 MySQL/TiDB Server 都有 `max_allowed_packet` 配额的限制,如果在使用过程中违反其中任何一个 `max_allowed_packet` 配额,客户端程序就会收到对应的报错。目前最新版本的 DM 和 TiDB Server 的默认 `max_allowed_packet` 配额都为 `64M`。 @@ -162,7 +174,7 @@ aliases: ['/docs-cn/tidb-data-migration/dev/troubleshoot-dm/','/docs-cn/tidb-dat * 性能的极大降低 -解决方案为: +#### 解决方案 * 推荐在 DM 的 dump 处理单元提供的配置 `extra-args` 中设置 `statement-size`: @@ -178,4 +190,4 @@ aliases: ['/docs-cn/tidb-data-migration/dev/troubleshoot-dm/','/docs-cn/tidb-dat * 在 TiDB Server 执行 `set @@global.max_allowed_packet=134217728` (`134217728 = 128M`) - * 根据实际情况为 DM 的任务配置文件中的 `target-database` 增加配置 `max-allowed-packet: 134217728`(128M),执行 `stop-task` 后再重新 `start-task`。 + * 根据实际情况为 DM 的任务配置文件中的 `target-database` 增加配置 `max-allowed-packet: 134217728` (128M),执行 `stop-task` 后再重新 `start-task`。 From 1bac3244d22dc0f9028752b17810222ce601eaaf Mon Sep 17 00:00:00 2001 From: toutdesuite Date: Fri, 24 Jul 2020 12:05:28 +0800 Subject: [PATCH 2/5] Update error-handling.md --- en/error-handling.md | 55 +++++++++++--------------------------------- 1 file changed, 14 insertions(+), 41 deletions(-) diff --git a/en/error-handling.md b/en/error-handling.md index 935389cf1..c03705706 100644 --- a/en/error-handling.md +++ b/en/error-handling.md @@ -105,8 +105,12 @@ However, you need to reset the data replication task in some cases. For details, ### What can I do when a replication task is interrupted with the `invalid connection` error returned? +#### Reason + The `invalid connection` error indicates that anomalies have occurred in the connection between DM and the downstream TiDB database (such as network failure, TiDB restart, TiKV busy and so on) and that a part of the data for the current request has been sent to TiDB. +#### Solutions + Because DM has the feature of concurrently replicating data to the downstream in replication tasks, several errors might occur when a task is interrupted. 
You can check these errors by using `query-status` or `query-error`. - If only the `invalid connection` error occurs during the incremental replication process, DM retries the task automatically. @@ -114,16 +118,24 @@ Because DM has the feature of concurrently replicating data to the downstream in ### A replication task is interrupted with the `driver: bad connection` error returned +#### Reason + The `driver: bad connection` error indicates that anomalies have occurred in the connection between DM and the downstream TiDB database (such as network failure, TiDB restart and so on) and that the data of the current request has not yet been sent to TiDB at that moment. +#### Solution + The current version of DM automatically retries on error. If you use a previous version that does not support automatic retry, you can execute the `stop-task` command to stop the task. Then execute `start-task` to restart the task. ### The relay unit throws error `event from * in * diff from passed-in event *` or a replication task is interrupted with failing to get or parse binlog errors like `get binlog error ERROR 1236 (HY000)` and `binlog checksum mismatch, data may be corrupted` returned +#### Reason + During the DM process of relay log pulling or incremental replication, these two errors might occur if the size of the upstream binlog file exceeds **4 GB**. **Cause:** When writing relay logs, DM needs to perform event verification based on binlog positions and the size of the binlog file, and store the replicated binlog positions as checkpoints. However, the official MySQL uses `uint32` to store binlog positions. This means the binlog position for a binlog file over 4 GB overflows, and then the errors above occur. +#### Solutions + For relay units, manually recover replication using the following solution: 1. Identify in the upstream that the size of the corresponding binlog file has exceeded 4GB when the error occurs. 
@@ -166,54 +178,15 @@ For database related passwords in all the DM configuration files, use the passwo In addition, the user of the upstream and downstream databases must have the corresponding read and write privileges. Data Migration also [prechecks the corresponding privileges automatically](precheck.md) while starting the data replication task. -### The replication task is interrupted and contains a `driver: bad connection` error - -When a `driver: bad connection` error occurs, it usually means that the database connection between the DM and the downstream TiDB is abnormal (such as network failure, TiDB restart, etc.) and the currently requested data cannot be sent to TiDB temporarily. - -The current version of DM will automatically retry. If it is not automatically retried due to version problems, etc., you can use `stop-task` to stop the task and then use `start-task` to restart the task. - -### The relay processing unit reports an error `event from * in * diff from passed-in event *` or the replication task is interrupted and contains binlogs such as `get binlog error ERROR 1236 (HY000)`, `binlog checksum mismatch, data may be corrupted`, etc. Get or parse failed error - -In the process of DM's relay log pull and incremental replication, if you encounter an upstream binlog file exceeding 4GB, these two errors may occur. - -The reason is that the DM needs to verify the event based on the binlog position and file size when writing the relay log, and needs to save the synchronized binlog position information as a checkpoint. However, the official definition of MySQL binlog position uses uint32 storage, so the offset value of binlog position exceeding 4G will overflow, and the above error will occur. - -For the relay processing unit, you can manually restore it through the following steps: - -1. The size of the corresponding binlog file when the error is confirmed upstream exceeds 4GB. -2. Stop DM-worker. -3. 
Copy the binlog file corresponding to the upstream to the relay log directory as the relay log file. -4. Update the corresponding `relay.meta` file in the relay log directory to start pulling from the next binlog. If DM worker has enabled `enable_gtid`, then when modifying the `relay.meta` file, you also need to modify the GTID corresponding to the next binlog. If `enable_gtid` is not enabled, there is no need to modify the GTID. - - For example: when the error is reported, there are `binlog-name = "mysql-bin.004451"` and `binlog-pos = 2453`, then update them to `binlog-name = "mysql-bin.004452"` and `binlog-pos = 4` respectively, and update `binlog-gtid = "f0e914ef-54cf-11e7-813d-6c92bf2fa791:1-138218058"` at the same time. -5. Restart DM-worker. - -For the binlog replication processing unit, you can manually restore it through the following steps: - -1. The size of the corresponding binlog file when the error is confirmed upstream exceeds 4GB. -2. Stop the replication task by `stop-task`. -3. Update the global checkpoint in the downstream `dm_meta` database and the `binlog_name` in the checkpoint of each table to the error binlog file, and update the `binlog_pos` to a synchronized legal position value, such as 4. - - For example: the error task name is `dm_test`, the corresponding `source-id` is `replica-1`, and the corresponding binlog file is `mysql-bin|000001.004451` when the error occurs, then execute `UPDATE dm_test_syncer_checkpoint SET binlog_name='mysql-bin|000001.004451', binlog_pos = 4 WHERE id='replica-1';`. -4. Set `safe-mode: true` for the `syncers` part in the replication task configuration to ensure reentrant execution. -5. Start the replication task with `start-task`. -6. Observe the status of the replication task through `query-status`. When the relay log file that caused the error is synchronized, you can restore the `safe-mode` to the original value and restart the replication task. 
- -### When executing `query-status` or viewing the log, `Access denied for user 'root'@'172.31.43.27' (using password: YES)` - -In all DM configuration files, database-related passwords must use ciphertext encrypted by dmctl (if the database password is empty, no encryption is required). For details on how to use dmctl to encrypt plaintext passwords, see [Use dmctl to encrypt upstream MySQL user password](deploy-a-dm-cluster-using-ansible.md#encrypt-the-upstream-mysql-user-password-using-dmctl). - -In addition, during DM operation, users of upstream and downstream databases must have corresponding read and write permissions. In the process of starting the replication task, the DM will automatically perform the pre-check of the corresponding permissions. For details, see [Upstream MySQL instance configuration pre-check](precheck.md). - ### The load processing unit reports the error `packet for query is too large. Try adjusting the 'max_allowed_packet' variable` -#### Reasons: +#### Reasons * Both MySQL client and MySQL/TiDB Server have the quota limit for `max_allowed_packet`. If any `max_allowed_packet` is outside the normal range, the client receives the error message. Currently, for the latest version of DM and TiDB Server, the default value of `max_allowed_packet` is `64M`. * The full data import processing unit in DM does not support splitting the SQL file exported by the Dump processing unit in DM. 
-#### Solutions: +#### Solutions * It is recommended to set the `statement-size` option of `extra-args` for the Dump processing unit: From 088582b91daf877701978ac964e20c75d260729c Mon Sep 17 00:00:00 2001 From: toutdesuite Date: Tue, 28 Jul 2020 17:19:24 +0800 Subject: [PATCH 3/5] Apply suggestions from code review Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- en/error-handling.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/en/error-handling.md b/en/error-handling.md index c03705706..c8bd2c44c 100644 --- a/en/error-handling.md +++ b/en/error-handling.md @@ -178,7 +178,7 @@ For database related passwords in all the DM configuration files, use the passwo In addition, the user of the upstream and downstream databases must have the corresponding read and write privileges. Data Migration also [prechecks the corresponding privileges automatically](precheck.md) while starting the data replication task. -### The load processing unit reports the error `packet for query is too large. Try adjusting the 'max_allowed_packet' variable` +### The `load` processing unit reports the error `packet for query is too large. Try adjusting the 'max_allowed_packet' variable` #### Reasons @@ -190,7 +190,7 @@ In addition, the user of the upstream and downstream databases must have the cor * It is recommended to set the `statement-size` option of `extra-args` for the Dump processing unit: - According to the default `--statement-size` setting, the default size of `Insert Statement` generated by the Dump processing unit is about `1M`. With this default setting, the Load processing unit does not report the error `packet for query is too large. Try adjusting the 'max_allowed_packet' variable` in most cases. + According to the default `--statement-size` setting, the default size of `Insert Statement` generated by the Dump processing unit is about `1M`. 
With this default setting, the load processing unit does not report the error `packet for query is too large. Try adjusting the 'max_allowed_packet' variable` in most cases. Sometimes you might receive the following `WARN` log during the data dump. This `WARN` log does not affect the dump process. This only means that wide tables are dumped. @@ -200,6 +200,6 @@ In addition, the user of the upstream and downstream databases must have the cor * If the single row of the wide table exceeds `64M`, you need to modify the following configurations and make sure the configurations take effect. - * Execute `set @@global.max_allowed_packet=134217728` (`134217728 = 128M`) in the TiDB Server. + * Execute `set @@global.max_allowed_packet=134217728` (`134217728` = 128 MB) in the TiDB server. - * First add the `max-allowed-packet: 134217728` (128M) configuration to `target-database` in the DM task configuration file. Next, execute the `stop-task` command and then execute the `start-task` command. + * First add the `max-allowed-packet: 134217728` (128 MB) to the `target-database` section in the DM task configuration file. Then, execute the `stop-task` command and execute the `start-task` command. From 165ca15b7ade3df98108862796dc1930fccf8be7 Mon Sep 17 00:00:00 2001 From: toutdesuite Date: Tue, 28 Jul 2020 19:00:39 +0800 Subject: [PATCH 4/5] Update en/error-handling.md Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com> --- en/error-handling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/en/error-handling.md b/en/error-handling.md index c8bd2c44c..fe5a7f356 100644 --- a/en/error-handling.md +++ b/en/error-handling.md @@ -182,7 +182,7 @@ In addition, the user of the upstream and downstream databases must have the cor #### Reasons -* Both MySQL client and MySQL/TiDB Server have the quota limit for `max_allowed_packet`. If any `max_allowed_packet` is outside the normal range, the client receives the error message. 
Currently, for the latest version of DM and TiDB Server, the default value of `max_allowed_packet` is `64M`. +* Both MySQL client and MySQL/TiDB server have the quota limits for `max_allowed_packet`. If any `max_allowed_packet` exceeds a limit, the client receives the error message. Currently, for the latest version of DM and TiDB server, the default value of `max_allowed_packet` is `64M`. * The full data import processing unit in DM does not support splitting the SQL file exported by the Dump processing unit in DM. From 71cca72764a4f1c50e8f161b5f2a0408a58b7b82 Mon Sep 17 00:00:00 2001 From: toutdesuite Date: Tue, 28 Jul 2020 19:02:11 +0800 Subject: [PATCH 5/5] Update error-handling.md --- zh/error-handling.md | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/zh/error-handling.md b/zh/error-handling.md index a810b6f31..0fa3ee4ee 100644 --- a/zh/error-handling.md +++ b/zh/error-handling.md @@ -166,13 +166,7 @@ aliases: ['/docs-cn/tidb-data-migration/dev/troubleshoot-dm/','/docs-cn/tidb-dat * MySQL client 和 MySQL/TiDB Server 都有 `max_allowed_packet` 配额的限制,如果在使用过程中违反其中任何一个 `max_allowed_packet` 配额,客户端程序就会收到对应的报错。目前最新版本的 DM 和 TiDB Server 的默认 `max_allowed_packet` 配额都为 `64M`。 -* DM 的全量数据导入处理模块不支持对 dump 处理模块导出的 SQL 文件进行切分。因为 DM 的 dump 处理单元采用了最简单的编码实现,如果在 DM 实现文件切分,需要在 `TiDB Parser` 基础上实现一个完备的解析器才能正确的处理数据切分。但是随之会带来以下的问题: - - * 工作量大 - - * 复杂度高,不容易保证正确性 - - * 性能的极大降低 +* DM 的全量数据导入处理模块不支持对 dump 处理模块导出的 SQL 文件进行切分。 #### 解决方案
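
The `uint32` overflow that the patched documentation gives as the cause of the 4 GB binlog errors can be illustrated with a minimal Python sketch. The helper name `stored_binlog_pos` and the sample offsets are hypothetical and chosen only for illustration; the sketch models a 32-bit position field wrapping around and is not DM's or MySQL's actual code. It also checks the `134217728` = 128 MB arithmetic used for `max_allowed_packet`.

```python
UINT32_MODULUS = 2 ** 32  # a uint32 field can address at most 4 GiB of offsets


def stored_binlog_pos(real_offset: int) -> int:
    """Return the value a uint32 position field would hold for a byte offset.

    Hypothetical helper: it only models 32-bit wraparound.
    """
    return real_offset % UINT32_MODULUS


# An offset just below 4 GiB is stored unchanged.
below_limit = UINT32_MODULUS - 1
# An offset 1000 bytes past 4 GiB wraps around to a small value.
above_limit = UINT32_MODULUS + 1000

print(stored_binlog_pos(below_limit))
print(stored_binlog_pos(above_limit))

# The max_allowed_packet value from the documentation is exactly 128 MiB.
print(134217728 == 128 * 1024 * 1024)
```

Because the wrapped value no longer matches the real file offset, a position-based check against the file size would fail, which is consistent with the recovery steps that reset the position to `4` at the start of the next binlog file.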