Update character-set-and-collation.md #3402
/label size/smal,special-week,translation/from-docs-cn,needs-cherry-pick-4.0
These labels are not found
/cc TomShawn, yikeke
This document introduces the character set and collation supported by TiDB.

## Concepts

A character set is a set of symbols and encodings.

A collation is a set of rules for comparing characters in a character set.
May I suggest the following as an intro:
A character set is a set of symbols and encodings. The default character set in TiDB is utf8mb4, which matches the default in MySQL 8.0 and above. UTF-8 accounts for between 83% and 100% of webpages, depending on the language and country.
A collation is a set of rules for comparing characters in a character set and for determining their sort order. For example, in a binary collation, A and a do not compare as equal:
{{< copyable "sql" >}}
SET NAMES utf8mb4 COLLATE utf8mb4_bin;
SELECT 'A' = 'a';
SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
SELECT 'A' = 'a';
mysql> SELECT 'A' = 'a';
+-----------+
| 'A' = 'a' |
+-----------+
|         0 |
+-----------+
1 row in set (0.00 sec)
mysql> SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT 'A' = 'a';
+-----------+
| 'A' = 'a' |
+-----------+
|         1 |
+-----------+
1 row in set (0.00 sec)
TiDB defaults to using a binary collation. This differs from MySQL, which uses a case-insensitive collation by default.
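For readers who want to see the same distinction outside SQL, here is a minimal Python sketch. It is only an analogy and not how TiDB implements collations: comparing encoded bytes stands in for a binary collation such as utf8mb4_bin, and comparing case-folded strings loosely stands in for a case-insensitive one such as utf8mb4_general_ci. The function names are hypothetical.

```python
def binary_equal(a: str, b: str) -> bool:
    """Compare the UTF-8 bytes directly, roughly what a binary collation does."""
    return a.encode("utf-8") == b.encode("utf-8")


def case_insensitive_equal(a: str, b: str) -> bool:
    """Compare after case folding; a loose stand-in for a *_ci collation."""
    return a.casefold() == b.casefold()


letters = ["b", "A", "a", "B"]

# Equality differs between the two comparison rules.
print(binary_equal("A", "a"))            # False
print(case_insensitive_equal("A", "a"))  # True

# Sort order differs as well: code point order versus case-folded order.
print(sorted(letters))                    # ['A', 'B', 'a', 'b']
print(sorted(letters, key=str.casefold))  # ['A', 'a', 'b', 'B']
```

The second sort keeps `A` before `a` only because Python's sort is stable; a real case-insensitive collation treats the two as equal and makes no promise about their relative order.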
In TiDB, utf8 and utf8mb4 behave identically, and utf8 is not restricted to a maximum of 3 bytes as in MySQL.
In fact, TiDB checks whether a character is longer than 3 bytes unless check-mb4-value-in-utf8 is disabled.
@nullnotnil Do you need to modify your suggested intro according to #3402 (comment)?
I've updated it.
@nullnotnil Do you need to modify your suggested intro according to #3402 (comment)?
Yes, I verified and my earlier comment was incorrect. I have updated it to describe the check-mb4 option. Edit: I will add a section on utf8mb4 vs utf8 which will make this clearer. It's useful to describe, but doesn't have to be in the intro.
Co-authored-by: Null not nil <67764674+nullnotnil@users.noreply.github.com>
Co-authored-by: Keke Yi <40977455+yikeke@users.noreply.github.com>
PTAL again @nullnotnil @wjhuang2016
@@ -51,11 +107,11 @@ SHOW COLLATION WHERE Charset = 'utf8mb4';
2 rows in set (0.00 sec)
```
utf8 and utf8mb4 in MySQL
In MySQL, the character set utf8 is limited to a maximum of three bytes. This is sufficient to store characters in the Basic Multilingual Plane (BMP), but not enough to store characters such as emojis. For this, it is recommended to use the character set utf8mb4 instead.
By default, TiDB provides the same 3-byte limit on utf8 to ensure that data created in TiDB can still safely be restored in MySQL. This can be disabled by changing the value of check-mb4-value-in-utf8 to FALSE in your TiDB configuration file.
The following demonstrates the default behavior when inserting a 4-byte emoji character into a table. The INSERT statement fails for the utf8 character set, but succeeds for utf8mb4:
mysql> CREATE TABLE utf8_test (
-> c char(1) NOT NULL
-> ) CHARACTER SET utf8;
Query OK, 0 rows affected (0.09 sec)
mysql> CREATE TABLE utf8m4_test (
-> c char(1) NOT NULL
-> ) CHARACTER SET utf8mb4;
Query OK, 0 rows affected (0.09 sec)
mysql> INSERT INTO utf8_test VALUES ('😉');
ERROR 1366 (HY000): incorrect utf8 value f09f9889(😉) for column c
mysql> INSERT INTO utf8m4_test VALUES ('😉');
Query OK, 1 row affected (0.02 sec)
mysql> SELECT char_length(c), length(c), c FROM utf8_test;
Empty set (0.01 sec)
mysql> SELECT char_length(c), length(c), c FROM utf8m4_test;
+----------------+-----------+------+
| char_length(c) | length(c) | c    |
+----------------+-----------+------+
|              1 |         4 | 😉   |
+----------------+-----------+------+
1 row in set (0.00 sec)
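The byte count behind this error can be cross-checked outside the database. The following is a minimal Python sketch, assuming only that the limit in question is "at most 3 bytes per character in UTF-8". The helper names are hypothetical and this is not TiDB's actual check; it just shows why 😉 (the f09f9889 in the error message above) does not fit.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_three_byte_utf8(text: str) -> bool:
    """True if every character needs at most 3 bytes, i.e. lies in the BMP."""
    return all(utf8_len(ch) <= 3 for ch in text)


emoji = "\U0001F609"  # the winking face used in the INSERT statements above

print(emoji.encode("utf-8").hex())          # f09f9889
print(utf8_len(emoji))                      # 4
print(fits_three_byte_utf8("héllo"))        # True
print(fits_three_byte_utf8("hi " + emoji))  # False
```

The 4 printed here is the same value that length(c) reports for the row stored in the utf8mb4 table.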
May I suggest changing the header to "`utf8` and `utf8mb4` in TiDB" to keep the topic focused on TiDB? @nullnotnil
Co-authored-by: Keke Yi <40977455+yikeke@users.noreply.github.com>
PTAL again @nullnotnil @wjhuang2016
Co-authored-by: Keke Yi <40977455+yikeke@users.noreply.github.com>
LGTM
@nullnotnil, thanks for your review. However, LGTM is restricted to Reviewers or higher roles. See the corresponding SIG page for more information. Related SIGs: docs (slack).
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
cherry pick to release-4.0 in PR #3540
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
Co-authored-by: ireneontheway <48651140+ireneontheway@users.noreply.github.com>