*: change default charset and collation from 'utf8 utf8_bin' to 'utf8mb4 utf8mb4_bin'#7965

Merged: zz-jason merged 17 commits into pingcap:master from winkyao:charset_mb4 on Nov 9, 2018

winkyao (Contributor) commented Oct 19, 2018

What problem does this PR solve?

Fixes #7920.

Change TiDB's default charset and collation to "utf8mb4 utf8mb4_bin". TiDB actually treats all data as utf8mb4, but the previous default charset was "utf8". Inserting a 4-byte Unicode string into TiDB succeeds, but if we use mysqldump to restore that data into MySQL, the column charset is utf8 and MySQL reports an error: ERROR 1366 (HY000): Incorrect string value: '\xF0\xA4\x8B\xAE' for column 'v' at row 1.
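To illustrate why that restore fails: MySQL's "utf8" charset stores at most 3 bytes per character, while the byte sequence in the error message is a single 4-byte UTF-8 character. The following is a minimal sketch using only the Go standard library; the helper names maxBytesPerRune and fitsMySQLUTF8 are made up for this example and are not part of TiDB.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// maxBytesPerRune returns the widest UTF-8 encoding, in bytes, of any rune in s.
func maxBytesPerRune(s string) int {
	widest := 0
	for _, r := range s {
		if n := utf8.RuneLen(r); n > widest {
			widest = n
		}
	}
	return widest
}

// fitsMySQLUTF8 reports whether s is valid UTF-8 and every character needs at
// most 3 bytes, which is the limit of MySQL's "utf8" charset.
// (Hypothetical helper for illustration only.)
func fitsMySQLUTF8(s string) bool {
	return utf8.ValidString(s) && maxBytesPerRune(s) <= 3
}

func main() {
	v := "\xF0\xA4\x8B\xAE" // the byte sequence from the ERROR 1366 message
	fmt.Println(maxBytesPerRune(v), fitsMySQLUTF8(v))           // prints: 4 false
	fmt.Println(maxBytesPerRune("TiDB"), fitsMySQLUTF8("TiDB")) // prints: 1 true
}
```

A value like this needs a utf8mb4 column on the MySQL side, which is what the dump declares once the default charset is utf8mb4.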

How does it work?

  1. Change mysql.DefaultCharset from UTF8Charset to UTF8MB4Charset.
  2. Change mysql.DefaultCollationName from UTF8DefaultCollation to UTF8MB4DefaultCollation.
  3. Find all places that use charset.CharsetUTF8 and CollationUTF8, and change them to charset.CharsetUTF8MB4 or mysql.DefaultCharset.

Then fix the corresponding test cases.
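The steps above can be sketched roughly as follows. This is a standalone illustration, not the actual TiDB source: the constants are local stand-ins named after the identifiers in the description, their string values are taken from the PR title, and getDefaultCharsetAndCollate is a hypothetical helper shaped like the charset.GetDefaultCharsetAndCollate() call suggested in review.

```go
package main

import "fmt"

// Local stand-ins for the identifiers named in the PR description.
// The real definitions live in TiDB's packages and may differ.
const (
	UTF8Charset             = "utf8"
	UTF8DefaultCollation    = "utf8_bin"
	UTF8MB4Charset          = "utf8mb4"
	UTF8MB4DefaultCollation = "utf8mb4_bin"
)

// Steps 1 and 2: the defaults now point at the utf8mb4 values
// instead of UTF8Charset and UTF8DefaultCollation.
const (
	DefaultCharset       = UTF8MB4Charset
	DefaultCollationName = UTF8MB4DefaultCollation
)

// getDefaultCharsetAndCollate is a hypothetical helper: call sites that ask
// for "the default" go through it instead of naming a concrete charset.
func getDefaultCharsetAndCollate() (string, string) {
	return DefaultCharset, DefaultCollationName
}

func main() {
	cs, cl := getDefaultCharsetAndCollate()
	fmt.Println(cs, cl) // prints: utf8mb4 utf8mb4_bin
}
```

With the defaults defined in one place, step 3 becomes a matter of deciding, per call site, whether the code means "the default" or specifically utf8mb4.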

Check List

Tests

  • Existing test cases

Code changes

  • Has exported variable/field changes

Related changes

  • Need to update the documentation
  • Need to be included in the release note

winkyao added the release-note label Oct 19, 2018
winkyao mentioned this pull request Oct 19, 2018
3 tasks
Comment thread types/field_type.go Outdated
  tp.Decimal = UnspecifiedLength
- tp.Charset = mysql.DefaultCharset
- tp.Collate = mysql.DefaultCollationName
+ tp.Charset = charset.CharsetUTF8MB4
Member

Can we use GetDefaultCharsetAndCollate?

Contributor

This change makes sense to me for this PR. My PR will change it back though :)

gregwebs (Contributor) left a comment

LGTM. However, I think we are treading on very dangerous ground with this change until we just make utf8mb4 the default: #7757

morgo mentioned this pull request Oct 19, 2018
71 tasks
winkyao (Contributor Author) commented Oct 22, 2018

@shenli PTAL

winkyao (Contributor Author) commented Nov 5, 2018

The CI will be fixed after pingcap/parser#13 is merged.

winkyao removed the status/DNM label Nov 5, 2018
winkyao (Contributor Author) commented Nov 8, 2018

/run-all-tests

winkyao (Contributor Author) commented Nov 8, 2018

/run-all-tests -tidb-test=pr/646

winkyao (Contributor Author) commented Nov 9, 2018

/run-integration-ddl-test -tidb-test=pr/646

winkyao (Contributor Author) commented Nov 9, 2018

/run-sqllogic-test -tidb-test=pr/646

winkyao (Contributor Author) commented Nov 9, 2018

/run-integration-ddl-test -tidb-test=pr/646

XuHuaiyu (Contributor) left a comment

LGTM

Comment thread expression/aggregation/descriptor.go Outdated
- a.RetTp.Charset = charset.CharsetUTF8
- a.RetTp.Collate = charset.CollationUTF8
+ a.RetTp.Charset = charset.CharsetUTF8MB4
+ a.RetTp.Collate = charset.CollationUTF8MB4
Member

s/charset.CharsetUTF8MB4/mysql.DefaultCharset/
s/charset.CollationUTF8MB4/mysql.DefaultCollationName/

Member

or: charset.GetDefaultCharsetAndCollate()

Comment thread expression/builtin.go Outdated
  fieldType.Charset, fieldType.Collate = charset.CharsetBin, charset.CollationBin
  } else {
- fieldType.Charset, fieldType.Collate = charset.CharsetUTF8, charset.CharsetUTF8
+ fieldType.Charset, fieldType.Collate = mysql.DefaultCharset, mysql.DefaultCollationName
Member

how about: charset.GetDefaultCharsetAndCollate()

Comment thread expression/builtin.go Outdated
  }
  if len(b.tp.Charset) <= 0 {
- b.tp.Charset, b.tp.Collate = charset.CharsetUTF8, charset.CollationUTF8
+ b.tp.Charset, b.tp.Collate = mysql.DefaultCharset, mysql.DefaultCollationName
Member

ditto

Comment thread expression/builtin_cast.go Outdated
- Charset: charset.CharsetUTF8,
- Collate: charset.CollationUTF8,
+ Charset: charset.CharsetUTF8MB4,
+ Collate: charset.CollationUTF8MB4,
Member

ditto

Comment thread expression/builtin_cast.go Outdated
  }
  tp := types.NewFieldType(mysql.TypeVarString)
- tp.Charset, tp.Collate = charset.CharsetUTF8, charset.CollationUTF8
+ tp.Charset, tp.Collate = charset.CharsetUTF8MB4, charset.CollationUTF8MB4
Member

ditto

Comment thread expression/builtin.go
  Decimal: 0,
- Charset: charset.CharsetUTF8,
- Collate: charset.CollationUTF8,
+ Charset: mysql.DefaultCharset,
Contributor

Why not charset.CharsetUTF8MB4?

Comment thread expression/builtin.go
- Charset: charset.CharsetUTF8,
- Collate: charset.CollationUTF8,
+ Charset: mysql.DefaultCharset,
+ Collate: mysql.DefaultCollationName,
Contributor

Why not charset.CollationUTF8MB4?

Comment thread expression/builtin.go Outdated
  fieldType.Charset, fieldType.Collate = charset.CharsetBin, charset.CollationBin
  } else {
- fieldType.Charset, fieldType.Collate = charset.CharsetUTF8, charset.CharsetUTF8
+ fieldType.Charset, fieldType.Collate = mysql.DefaultCharset, mysql.DefaultCollationName
Contributor

ditto

Comment thread expression/builtin.go Outdated
  }
  if len(b.tp.Charset) <= 0 {
- b.tp.Charset, b.tp.Collate = charset.CharsetUTF8, charset.CollationUTF8
+ b.tp.Charset, b.tp.Collate = mysql.DefaultCharset, mysql.DefaultCollationName
Contributor

ditto

  types.SetBinChsClnFlag(resultFieldType)
  } else {
- resultFieldType.Charset, resultFieldType.Collate, resultFieldType.Flag = charset.CharsetUTF8, charset.CollationUTF8, 0
+ resultFieldType.Charset, resultFieldType.Collate, resultFieldType.Flag = mysql.DefaultCharset, mysql.DefaultCollationName, 0
Contributor

ditto

  fieldTp.Decimal, fieldTp.Flen = decimal, flen
  if fieldTp.EvalType().IsStringKind() && !isBinaryStr {
- fieldTp.Charset, fieldTp.Collate = mysql.DefaultCharset, mysql.DefaultCollationName
+ fieldTp.Charset, fieldTp.Collate = charset.CharsetUTF8MB4, charset.CollationUTF8MB4
Contributor

/cc @XuHuaiyu, should we use mysql.DefaultCharset or charset.CharsetUTF8MB4?

Contributor

Will mysql.DefaultCharset change? If so, we should use utf8mb4.

Member

@XuHuaiyu, DefaultCharset is better. It hides the implementation details; if we change the default charset again, the code changes will be minimal.

Contributor Author

No, this place should be utf8mb4. Where the charset has not been set, it can fall back to the default charset, but here it should definitely be utf8mb4.
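The distinction argued in this thread, falling back to the default when nothing is set versus pinning utf8mb4 explicitly, can be sketched as below. This is an illustration under assumptions, not TiDB code: fieldType, fillUnsetCharset, and pinUTF8MB4 are hypothetical names, and the default is modeled as a variable only so the example can simulate it changing later.

```go
package main

import "fmt"

const (
	charsetUTF8MB4   = "utf8mb4"
	collationUTF8MB4 = "utf8mb4_bin"
)

// Hypothetical stand-ins for the project-wide defaults. They are variables
// here only so that main can simulate a future change of the default.
var (
	defaultCharset   = charsetUTF8MB4
	defaultCollation = collationUTF8MB4
)

// fieldType is a reduced, hypothetical version of a field type descriptor.
type fieldType struct {
	Charset string
	Collate string
}

// fillUnsetCharset is the fallback case: only an unset charset is replaced,
// and it follows whatever the default currently is.
func fillUnsetCharset(tp *fieldType) {
	if len(tp.Charset) == 0 {
		tp.Charset, tp.Collate = defaultCharset, defaultCollation
	}
}

// pinUTF8MB4 is the explicit case: the value is utf8mb4 no matter what the
// default is.
func pinUTF8MB4(tp *fieldType) {
	tp.Charset, tp.Collate = charsetUTF8MB4, collationUTF8MB4
}

func main() {
	// Simulate a later change of the default (latin1 is purely illustrative).
	defaultCharset, defaultCollation = "latin1", "latin1_bin"

	a, b := &fieldType{}, &fieldType{}
	fillUnsetCharset(a)
	pinUTF8MB4(b)
	fmt.Println(a.Charset, b.Charset) // prints: latin1 utf8mb4
}
```

Both styles produce the same result today; they diverge only if the default changes, which is why the choice matters per call site.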

winkyao (Contributor Author) commented Nov 9, 2018

@zz-jason @crazycs520 PTAL

Comment thread executor/show.go Outdated
  charsetName := tb.Meta().Charset
  if len(charsetName) == 0 {
- charsetName = charset.CharsetUTF8
+ charsetName = charset.CharsetUTF8MB4
Member

s/charset.CharsetUTF8MB4/mysql.DefaultCharset/

  }
  if types.IsNonBinaryStr(lhs) && !types.IsBinaryStr(rhs) {
- resultFieldType.Charset, resultFieldType.Collate, resultFieldType.Flag = charset.CharsetUTF8, charset.CollationUTF8, 0
+ resultFieldType.Charset, resultFieldType.Collate, resultFieldType.Flag = charset.CharsetUTF8MB4, charset.CollationUTF8MB4, 0
Member

use mysql.DefaultCharset, mysql.DefaultCollationName instead.

- cs = mysql.DefaultCharset
- cl = mysql.DefaultCollationName
+ cs = charset.CharsetUTF8MB4
+ cl = charset.CollationUTF8MB4
Member

cs, cl = charset.GetDefaultCharsetAndCollate()

Contributor Author

ditto

Comment thread infoschema/tables.go
- mCharset = mysql.DefaultCharset
- mCollation = mysql.DefaultCollationName
+ mCharset = charset.CharsetUTF8MB4
+ mCollation = charset.CollationUTF8MB4
Member

charset.GetDefaultCharsetAndCollate()

Contributor Author

This should definitely be utf8mb4.

Comment thread server/server.go Outdated
  var err error
  if cfg.Socket != "" {
  if s.listener, err = net.Listen("unix", cfg.Socket); err == nil {
+ // job.SnapshotVer == 0 means
Contributor

?

crazycs520 (Contributor) left a comment

Rest LGTM

zz-jason (Member) left a comment

LGTM

zz-jason added the status/LGT2 label Nov 9, 2018
zz-jason merged commit 29f14d4 into pingcap:master Nov 9, 2018
winkyao deleted the charset_mb4 branch November 9, 2018 11:03
winkyao added a commit that referenced this pull request Dec 6, 2018
Labels

release-note: Denotes a PR that will be considered when it comes time to generate release notes.
status/LGT2: Indicates that a PR has LGTM 2.

Linked issue

Change default charset and collation from utf8 to utf8mb4

6 participants