Skip to content

Lightning cannot Import CSV files with UTF-8 BOM #40744

@dsdashun

Description

@dsdashun

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Generate a CSV file and the schema file. In the CSV file, the head of the file contains a UTF-8 BOM (0xEF, 0xBB, 0xBF).
    Here's an example of the CSV file:
"id","val1"
1,"aaa01"
2,"aaa01"
3,"aaa02"
4,"aaa02"
5,"aaa05"

Here's the binary representation of the file:

$ xxd src-data/bom-data/mytest.testtbl.csv
# result: 
# 00000000: efbb bf22 6964 222c 2276 616c 3122 0a31  ..."id","val1".1
# 00000010: 2c22 6161 6130 3122 0a32 2c22 6161 6130  ,"aaa01".2,"aaa0
# 00000020: 3122 0a33 2c22 6161 6130 3222 0a34 2c22  1".3,"aaa02".4,"
# 00000030: 6161 6130 3222 0a35 2c22 6161 6130 3522  aaa02".5,"aaa05"
# 00000040: 0a
  1. Import data using Lightning

2. What did you expect to see? (Required)

The import should succeed

3. What did you see instead (Required)

The import failed. Failure message:
tidb lightning encountered error: [Lightning:Restore:ErrEncodeKV]encode kv error in file mytest.testtbl.csv:0 at offset 4: syntax error: cannot have consecutive fields without separator

4. What is your TiDB version? (Required)

Release Version: v6.5.0
Edition: Community
Git Commit Hash: 706c3fa3c526cdba5b3e9f066b1a568fb96c56e3
Git Branch: heads/refs/tags/v6.5.0
UTC Build Time: 2022-12-27 03:42:38
GoVersion: go1.19.3
Race Enabled: false
TiKV Min Version: 6.2.0-alpha
Check Table Before Drop: false
Store: tikv

Metadata

Metadata

Assignees

Labels

affects-6.5This bug affects the 6.5.x(LTS) versions.component/lightningThis issue is related to Lightning of TiDB.severity/minortype/bugThe issue is confirmed as a bug.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions