This repository was archived by the owner on Jan 3, 2025. It is now read-only.
Merged
38 changes: 38 additions & 0 deletions en/error-handling.md
@@ -105,25 +105,37 @@ However, you need to reset the data replication task in some cases. For details,

### What can I do when a replication task is interrupted with the `invalid connection` error returned?

#### Reason

The `invalid connection` error indicates that anomalies have occurred in the connection between DM and the downstream TiDB database (such as network failure, TiDB restart, TiKV busy and so on) and that a part of the data for the current request has been sent to TiDB.

#### Solutions

Because DM replicates data to the downstream concurrently within a replication task, several errors might be reported when a task is interrupted. You can check these errors by using `query-status` or `query-error`.

- If only the `invalid connection` error occurs during the incremental replication process, DM retries the task automatically.
- If DM does not retry automatically (for example, because of version problems) or the automatic retry fails, use `stop-task` to stop the task and then use `start-task` to restart it.
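
The decision described in the two items above can be sketched in a few lines of Python. This is only an illustration, assuming the error messages have already been collected (for example, by reading the output of `query-status`); the function name `expects_auto_retry`, its parameters, and the constant `MANUAL_RESTART` are hypothetical and are not part of DM or dmctl.

```python
from typing import Iterable, Tuple

# The dmctl commands this section names for a manual restart, in order.
MANUAL_RESTART: Tuple[str, str] = ("stop-task", "start-task")


def expects_auto_retry(error_messages: Iterable[str], incremental: bool) -> bool:
    """Return True when the section says DM retries automatically.

    Hypothetical helper: the condition is that the task is in the
    incremental replication phase and every reported error is an
    `invalid connection` error.
    """
    messages = list(error_messages)
    if not messages or not incremental:
        return False
    return all("invalid connection" in message for message in messages)


if __name__ == "__main__":
    errors = ["invalid connection", "driver: bad connection"]
    if expects_auto_retry(errors, incremental=True):
        print("wait for the automatic retry")
    else:
        print("inspect the errors; if needed, run:", " then ".join(MANUAL_RESTART))
```

The helper returns `False` for a mixed set of errors on purpose: in that case the other errors need to be inspected before deciding whether a manual restart is appropriate.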

### A replication task is interrupted with the `driver: bad connection` error returned

#### Reason

The `driver: bad connection` error indicates that anomalies have occurred in the connection between DM and the downstream TiDB database (such as network failure, TiDB restart and so on) and that the data of the current request has not yet been sent to TiDB at that moment.

#### Solution

The current version of DM automatically retries on this error. If you use an earlier version that does not support automatic retry, execute the `stop-task` command to stop the task, and then execute `start-task` to restart it.

### The relay unit throws error `event from * in * diff from passed-in event *` or a replication task is interrupted with failing to get or parse binlog errors like `get binlog error ERROR 1236 (HY000)` and `binlog checksum mismatch, data may be corrupted` returned

#### Reason

During relay log pulling or incremental replication in DM, these errors might occur if the size of an upstream binlog file exceeds **4 GB**.

**Cause:** When writing relay logs, DM needs to perform event verification based on binlog positions and the size of the binlog file, and store the replicated binlog positions as checkpoints. However, the official MySQL uses `uint32` to store binlog positions. This means the binlog position for a binlog file over 4 GB overflows, and then the errors above occur.
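
The overflow can be illustrated with a small arithmetic sketch. It assumes only what the paragraph above states, namely that the position is stored in an unsigned 32-bit field; the function name `stored_position` is hypothetical and is not taken from DM or MySQL.

```python
UINT32_MODULUS = 2 ** 32  # a uint32 holds values from 0 to 2**32 - 1


def stored_position(actual_offset: int) -> int:
    """Return the value a uint32 field holds for a given byte offset.

    Offsets at or beyond 4 GiB wrap around, so the stored position no
    longer matches the real offset in the binlog file.
    """
    if actual_offset < 0:
        raise ValueError("offset must be non-negative")
    return actual_offset % UINT32_MODULUS


if __name__ == "__main__":
    four_gib = 4 * 1024 ** 3             # 4294967296, exactly 2**32
    actual = four_gib + 1000             # an event 1000 bytes past 4 GiB
    print(stored_position(actual))       # 1000
    print(stored_position(four_gib - 1)) # 4294967295, the last safe offset
```

An event located 1000 bytes past the 4 GiB boundary is therefore recorded at position 1000, which is why a check based on the position and the file size fails.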

#### Solutions

For relay units, manually recover replication using the following solution:

1. Identify in the upstream that the size of the corresponding binlog file has exceeded 4GB when the error occurs.
@@ -165,3 +177,29 @@ For binlog replication processing units, manually recover replication using the
For database related passwords in all the DM configuration files, use the passwords encrypted by `dmctl`. If a database password is empty, it is unnecessary to encrypt it. For how to encrypt the plaintext password, see [Encrypt the upstream MySQL user password using dmctl](deploy-a-dm-cluster-using-ansible.md#encrypt-the-upstream-mysql-user-password-using-dmctl).

In addition, the user of the upstream and downstream databases must have the corresponding read and write privileges. Data Migration also [prechecks the corresponding privileges automatically](precheck.md) while starting the data replication task.

### The `load` processing unit reports the error `packet for query is too large. Try adjusting the 'max_allowed_packet' variable`

#### Reasons

* Both the MySQL client and the MySQL/TiDB server enforce a `max_allowed_packet` limit. If a packet exceeds either limit, the client receives this error message. Currently, in the latest versions of DM and TiDB server, the default value of `max_allowed_packet` is `64M`.

* The full data import processing unit in DM does not support splitting the SQL file exported by the Dump processing unit in DM.

#### Solutions

* It is recommended to set the `statement-size` option of `extra-args` for the Dump processing unit:

According to the default `--statement-size` setting, the default size of `Insert Statement` generated by the Dump processing unit is about `1M`. With this default setting, the load processing unit does not report the error `packet for query is too large. Try adjusting the 'max_allowed_packet' variable` in most cases.

Sometimes you might receive the following `WARN` log during the data dump. This `WARN` log does not affect the dump process. This only means that wide tables are dumped.

```
Row bigger than statement_size for xxx
```

* If a single row of a wide table exceeds `64M`, modify the following configurations and make sure that they take effect.

* Execute `set @@global.max_allowed_packet=134217728` (`134217728` = 128 MB) in the TiDB server.

    * Add `max-allowed-packet: 134217728` (128 MB) to the `target-database` section in the DM task configuration file. Then execute the `stop-task` command followed by the `start-task` command.
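
The byte values used above can be checked with a short sketch. It assumes that `64M` and `128 MB` mean 64 and 128 mebibytes (which matches `134217728`), and it compares raw sizes only, ignoring any protocol overhead; the names `packet_limit_bytes` and `fits_in_packet` are hypothetical helpers, not part of DM, MySQL, or TiDB.

```python
MIB = 1024 * 1024  # bytes in one mebibyte


def packet_limit_bytes(mebibytes: int) -> int:
    """Convert a limit given in MiB to the byte value used in the settings."""
    return mebibytes * MIB


def fits_in_packet(statement_size: int, limit_bytes: int) -> bool:
    """Return True if a statement of this size stays within the limit.

    Simplification: only the raw statement size is compared.
    """
    return statement_size <= limit_bytes


if __name__ == "__main__":
    default_limit = packet_limit_bytes(64)    # 67108864
    raised_limit = packet_limit_bytes(128)    # 134217728
    wide_row = default_limit + 1              # one byte over the default
    print(fits_in_packet(wide_row, default_limit))  # False
    print(fits_in_packet(wide_row, raised_limit))   # True
```

A row just over the default limit fails with `64M` but fits once both the TiDB server and the `target-database` setting are raised to `134217728`.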
26 changes: 16 additions & 10 deletions zh/error-handling.md
@@ -102,25 +102,37 @@ aliases: ['/docs-cn/tidb-data-migration/dev/troubleshoot-dm/','/docs-cn/tidb-dat

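### A replication task is interrupted with the `invalid connection` error

#### Reason

The `invalid connection` error usually indicates that the database connection between DM and the downstream TiDB has become abnormal (for example, network failure, TiDB restart, or TiKV busy) and that part of the data for the current request has already been sent to TiDB.

#### Solutions

Because DM replicates data to the downstream concurrently within a replication task, an interrupted task might report several errors at once. You can check the current errors by using `query-status` or `query-error`.

- If the errors include only `invalid connection` errors and the task is in the incremental replication phase, DM retries automatically.
- If DM does not retry automatically (for example, because of version problems) or the automatic retry fails, use `stop-task` to stop the task and then use `start-task` to restart it.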

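### A replication task is interrupted with the `driver: bad connection` error

#### Reason

The `driver: bad connection` error usually indicates that the database connection between DM and the downstream TiDB has become abnormal (for example, network failure or TiDB restart) and that the data of the current request has not yet been sent to TiDB.

#### Solution

The current version of DM retries automatically. If DM does not retry automatically (for example, because of version problems), use `stop-task` to stop the task and then use `start-task` to restart it.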

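### The relay processing unit reports the error `event from * in * diff from passed-in event *`, or a replication task is interrupted with errors about failing to get or parse the binlog, such as `get binlog error ERROR 1236 (HY000)` and `binlog checksum mismatch, data may be corrupted`

#### Reason

During relay log pulling or incremental replication in DM, these two errors might occur if DM encounters an upstream binlog file larger than 4 GB.

The cause is that, when writing relay logs, DM needs to verify events based on the binlog position and the file size, and needs to store the replicated binlog position as a checkpoint. However, the official MySQL definition stores the binlog position as a uint32, so the offset of a binlog position beyond 4 GB overflows, which leads to the errors above.

#### Solutions

For the relay processing unit, recover manually with the following steps:

1. Confirm in the upstream that the size of the corresponding binlog file exceeded 4 GB when the error occurred.
@@ -150,19 +162,13 @@ aliases: ['/docs-cn/tidb-data-migration/dev/troubleshoot-dm/','/docs-cn/tidb-dat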

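### The load processing unit reports the error `packet for query is too large. Try adjusting the 'max_allowed_packet' variable`

The main reasons for this error are the following two points:
#### Reasons

* Both the MySQL client and the MySQL/TiDB server have a `max_allowed_packet` quota limit. If either `max_allowed_packet` quota is violated during use, the client program receives the corresponding error. Currently, the default `max_allowed_packet` quota in the latest versions of DM and TiDB server is `64M`.

* The full data import processing module in DM does not support splitting the SQL files exported by the dump processing module. This is because the DM dump processing unit uses the simplest encoding implementation; to implement file splitting in DM, a complete parser would have to be built on top of `TiDB Parser` to split the data correctly. That, however, would bring the following problems:

    * Heavy workload

    * High complexity, which makes correctness hard to guarantee

    * A significant drop in performance
* The full data import processing module in DM does not support splitting the SQL files exported by the dump processing module.

The solutions are:
#### Solutions

* It is recommended to set `statement-size` in the `extra-args` configuration provided by the DM dump processing unit:

@@ -178,4 +184,4 @@ aliases: ['/docs-cn/tidb-data-migration/dev/troubleshoot-dm/','/docs-cn/tidb-dat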

    * Execute `set @@global.max_allowed_packet=134217728` (`134217728 = 128M`) in the TiDB server.

    * Depending on your situation, add `max-allowed-packet: 134217728`(128M) to `target-database` in the DM task configuration file, and then execute `stop-task` followed by `start-task`.
    * Depending on your situation, add `max-allowed-packet: 134217728` (128M) to `target-database` in the DM task configuration file, and then execute `stop-task` followed by `start-task`.