From 9634febef5fa5607b3619af28ae51ff462ab06b1 Mon Sep 17 00:00:00 2001 From: TomShawn <1135243111@qq.com> Date: Mon, 20 Apr 2020 22:02:54 +0800 Subject: [PATCH 1/9] reference: update documents for new collation --- TOC.md | 2 +- .../tidb-server/configuration-file.md | 6 + reference/mysql-compatibility.md | 9 - ...r-set.md => characterset-and-collation.md} | 254 ++++++++++++++++-- reference/sql/statements/alter-database.md | 2 +- reference/sql/statements/create-database.md | 2 +- reference/sql/statements/set-names.md | 2 +- .../system-databases/information-schema.md | 2 +- reference/tools/syncer.md | 2 +- 9 files changed, 239 insertions(+), 42 deletions(-) rename reference/sql/{character-set.md => characterset-and-collation.md} (53%) diff --git a/TOC.md b/TOC.md index a4f4eaca54ddf..587ea693933e5 100644 --- a/TOC.md +++ b/TOC.md @@ -245,7 +245,7 @@ - [Constraints](/reference/sql/constraints.md) - [Generated Columns](/reference/sql/generated-columns.md) - [Partitioning](/reference/sql/partitioning.md) - - [Character Set](/reference/sql/character-set.md) + - [Character Set and Collation](/reference/sql/characterset-and-collation.md) - [SQL Mode](/reference/sql/sql-mode.md) - [SQL Diagnosis](/reference/system-databases/sql-diagnosis.md) - [Views](/reference/sql/views.md) diff --git a/reference/configuration/tidb-server/configuration-file.md b/reference/configuration/tidb-server/configuration-file.md index efaaf11779b4d..6a1e3d2880d25 100644 --- a/reference/configuration/tidb-server/configuration-file.md +++ b/reference/configuration/tidb-server/configuration-file.md @@ -157,6 +157,12 @@ Configuration items related to log. - Default value: `4096` - When the length of the statement is longer than `query-log-max-len`, the statement is truncated to output. +### `new_collations_enabled_on_first_bootstrap` + ++ Enables or disables the new Collation support. 
++ Default value: `false` ++ Note: This configuration takes effect only for the TiDB cluster that has just been initialized. After the initialization, you cannot use this configuration item to enable and disable the new Collation support. When a TiDB cluster is upgraded to v4.0, because the cluster is initialized before, both `true` and `false` values of this configuration item are taken as `false`. + ### `max-server-connections` - The maximum number of concurrent client connections allowed in TiDB. It is used to control resources. diff --git a/reference/mysql-compatibility.md b/reference/mysql-compatibility.md index 480c85bf99cdf..4bad2332665d6 100644 --- a/reference/mysql-compatibility.md +++ b/reference/mysql-compatibility.md @@ -25,7 +25,6 @@ However, TiDB does not support some of MySQL features or behaves differently fro + `FOREIGN KEY` constraints + `FULLTEXT`/`SPATIAL` functions and indexes + Character sets other than `utf8`, `utf8mb4`, `ascii`, `latin1` and `binary` -+ Collations other than `BINARY` + Add/drop primary key + SYS schema + Optimizer trace @@ -247,14 +246,6 @@ Because they are built-in, named time zones in TiDB might behave slightly differ It is not recommended to unset the `NO_ZERO_DATE` and `NO_ZERO_IN_DATE` SQL modes, which are enabled by default in TiDB as in MySQL. While TiDB supports operating with these modes disabled, the TiKV coprocessor does not. Executing certain statements that push down date and time processing functions to TiKV might result in a statement error. -#### Handling of space at the end of string line - -Currently, when inserting data, TiDB keeps the space at the end of the line for the `VARCHAR` type, and truncate the space for the `CHAR` type. In case there is no index, TiDB behaves exactly the same as MySQL. 
- -If there is a `UNIQUE` index on the `VARCHAR` data, MySQL truncates the space at the end of the `VARCHAR` line before determining whether the data is duplicated, which is similar to the processing of the `CHAR` type, while TiDB keeps the space. - -When making a comparison, MySQL first truncates the constant and the space at the end of the column, while TiDB keeps them to enable exact comparison. - ### Type system differences The following column types are supported by MySQL, but not by TiDB: diff --git a/reference/sql/character-set.md b/reference/sql/characterset-and-collation.md similarity index 53% rename from reference/sql/character-set.md rename to reference/sql/characterset-and-collation.md index dcaebba2a2599..2962b7765ee54 100644 --- a/reference/sql/character-set.md +++ b/reference/sql/characterset-and-collation.md @@ -1,18 +1,23 @@ --- -title: Character Set Support +title: Character Set and Collation summary: Learn about the supported character sets in TiDB. category: reference - +aliases: ['/docs/dev/reference/sql/character-set/'] --- -# Character Set Support +# Character Set and Collation A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Currently, TiDB supports the following character sets: +{{< copyable "sql" >}} + ```sql -mysql> SHOW CHARACTER SET; +SHOW CHARACTER SET; +``` + +``` +---------|---------------|-------------------|--------+ | Charset | Description | Default collation | Maxlen | +---------|---------------|-------------------|--------+ @@ -27,15 +32,19 @@ mysql> SHOW CHARACTER SET; > **Note:** > -> + In `TiDB`, `utf8` is treated as `utf8mb4`. -> + Each character set corresponds to only one default collation. +> Each character set corresponds to only one default collation. ## Collation support TiDB only supports binary collations. 
This means that unlike MySQL, in TiDB string comparisons are both case sensitive and accent sensitive: +{{< copyable "sql" >}} + ```sql -mysql> SELECT * FROM information_schema.collations; +SELECT * FROM information_schema.collations; +``` + +``` +----------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +----------------+--------------------+------+------------+-------------+---------+ @@ -46,8 +55,15 @@ mysql> SELECT * FROM information_schema.collations; | utf8_bin | utf8 | 83 | Yes | Yes | 1 | +----------------+--------------------+------+------------+-------------+---------+ 5 rows in set (0.00 sec) +``` -mysql> SHOW COLLATION WHERE Charset = 'utf8mb4'; +{{< copyable "sql" >}} + +```sql +SHOW COLLATION WHERE Charset = 'utf8mb4'; +``` + +``` +-------------+---------+------+---------+----------+---------+ | Collation | Charset | Id | Default | Compiled | Sortlen | +-------------+---------+------+---------+----------+---------+ @@ -58,25 +74,58 @@ mysql> SHOW COLLATION WHERE Charset = 'utf8mb4'; For compatibility with MySQL, TiDB will allow other collation names to be used: +{{< copyable "sql" >}} + ```sql -mysql> CREATE TABLE t1 (a INT NOT NULL PRIMARY KEY AUTO_INCREMENT, b VARCHAR(10)) COLLATE utf8mb4_unicode_520_ci; +CREATE TABLE t1 (a INT NOT NULL PRIMARY KEY AUTO_INCREMENT, b VARCHAR(10)) COLLATE utf8mb4_unicode_520_ci; +``` + +``` Query OK, 0 rows affected (0.00 sec) +``` + +{{< copyable "sql" >}} + +```sql +INSERT INTO t1 VALUES (1, 'a'); +``` -mysql> INSERT INTO t1 VALUES (1, 'a'); +``` Query OK, 1 row affected (0.00 sec) +``` -mysql> SELECT * FROM t1 WHERE b = 'a'; +{{< copyable "sql" >}} + +```sql +SELECT * FROM t1 WHERE b = 'a'; +``` + +``` +---+------+ | a | b | +---+------+ | 1 | a | +---+------+ 1 row in set (0.00 sec) +``` -mysql> SELECT * FROM t1 WHERE b = 'A'; +{{< copyable "sql" >}} + +```sql +SELECT * FROM t1 WHERE b = 'A'; +``` + +``` 
Empty set (0.00 sec) +``` -mysql> SHOW CREATE TABLE t1\G +{{< copyable "sql" >}} + +```sql +SHOW CREATE TABLE t1\G +``` + +``` *************************** 1. row *************************** Table: t1 Create Table: CREATE TABLE `t1` ( @@ -107,32 +156,74 @@ ALTER DATABASE db_name `DATABASE` can be replaced with `SCHEMA` here. -Different databases can use different character sets and collations. Use the `character_set_database` and `collation_database` to see the character set and collation of the current database: +Different databases can use different character sets and collations. Use the `character_set_database` and `collation_database` to see the character set and collation of the current database: + +{{< copyable "sql" >}} ```sql -mysql> create schema test1 character set utf8 COLLATE uft8_general_ci; +create schema test1 character set utf8 COLLATE uft8_general_ci; +``` + +``` Query OK, 0 rows affected (0.09 sec) +``` -mysql> use test1; +{{< copyable "sql" >}} + +```sql +use test1; +``` + +``` Database changed -mysql> SELECT @@character_set_database, @@collation_database; +``` + +{{< copyable "sql" >}} + +``` +SELECT @@character_set_database, @@collation_database; +``` + +``` +--------------------------|----------------------+ | @@character_set_database | @@collation_database | +--------------------------|----------------------+ -| utf8 | uft8_general_ci | +| utf8mb4 | uft8mb4_general_ci | +--------------------------|----------------------+ 1 row in set (0.00 sec) +``` -mysql> create schema test2 character set latin1 COLLATE latin1_general_ci; +{{< copyable "sql" >}} + +``` +create schema test2 character set latin1 COLLATE latin1_bin; +``` + +``` Query OK, 0 rows affected (0.09 sec) +``` + +{{< copyable "sql" >}} -mysql> use test2; +``` +use test2; +``` + +``` Database changed -mysql> SELECT @@character_set_database, @@collation_database; +``` + +{{< copyable "sql" >}} + +``` +SELECT @@character_set_database, @@collation_database; +``` + +``` 
+--------------------------|----------------------+ | @@character_set_database | @@collation_database | +--------------------------|----------------------+ -| latin1 | latin1_general_ci | +| latin1 | latin1_bin | +--------------------------|----------------------+ 1 row in set (0.00 sec) ``` @@ -160,8 +251,13 @@ ALTER TABLE tbl_name For example: +{{< copyable "sql" >}} + ```sql -mysql> CREATE TABLE t1(a int) CHARACTER SET utf8 COLLATE utf8_general_ci; +CREATE TABLE t1(a int) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci; +``` + +``` Query OK, 0 rows affected (0.08 sec) ``` @@ -197,8 +293,8 @@ Example: ```sql SELECT 'string'; -SELECT _latin1'string'; -SELECT _latin1'string' COLLATE latin1_danish_ci; +SELECT _utf8mb4'string'; +SELECT _utf8mb4'string' COLLATE utf8mb4_general_ci; ``` Rules: @@ -217,7 +313,7 @@ You can use the following statement to specify a particular collation that is re + `SET NAMES 'charset_name' [COLLATE 'collation_name']` - `SET NAMES` indicates what character set the client will use to send SQL statements to the server. `SET NAMES utf8` indicates that all the requests from the client use utf8, as well as the results from the server. + `SET NAMES` indicates what character set the client will use to send SQL statements to the server. `SET NAMES utf8mb4` indicates that all the requests from the client use utf8mb4, as well as the results from the server. 
The `SET NAMES 'charset_name'` statement is equivalent to the following statement combination: @@ -241,7 +337,7 @@ You can use the following statement to specify a particular collation that is re ## Optimization levels of character sets and collations -String => Column => Table => Database => Server => Cluster +String > Column > Table > Database > Server > Cluster ## General rules on selecting character sets and collation @@ -256,3 +352,107 @@ For the specified `utf8` or `utf8mb4` character set, TiDB only supports the vali To disable this error reporting, use `set @@tidb_skip_utf8_check=1;` to skip the character check. For more information, see [Connection Character Sets and Collations in MySQL](https://dev.mysql.com/doc/refman/5.7/en/charset-connection.html). + +## Collation support + +The syntax support and semantic support for the collation is affected by the [`new_collation_enable`](/reference/configuration/tidb-server/configuration-file.md#new_collations_enabled_on_first_bootstrap) configuration item. The syntax support and semantic support are different. The former indicates that TiDB can parse and set collations. The latter inicates that TiDB can correctly use collations when comparing strings. + +Before v4.0, TiDB only supports syntactically parsing most of the MySQL collations but semantically takes all collations as binary collations, which is the [old collation support framework](#old-collation-support-framework). + +Since v4.0, TiDB supports semantically parsing different collations and strictly following the collations when comparing strings, which is the [new collation support framework](#new-collation-support-framework). + +### Old collation support framework + +Before v4.0, you can specify most of the MySQL collations in TiDB, and these collations are processed according to the default collation, which means that the byte order determines the character order. 
Different from MySQL, TiDB fills in spaces according to Collation's `PADDING` attribute before comparing characters, which causes the following behavior differences: + +{{< copyable "sql" >}} + +```sql +create table t(a varchar(20) charset utf8mb4 collate utf8mb4_general_ci primary key); +Query OK, 0 rows affected +insert into t values ('A'); +Query OK, 1 row affected +insert into t values ('a'); +Query OK, 1 row affected # In MySQL, because utf8mb4_general_ci is case-insensitive, the `Duplicate entry 'a'` error is reported. +insert into t values ('a '); +Query OK, 1 row affected # In MySQL, because comparison is performed after the spaces are filled in, the `Duplicate entry 'a '` error is returned. +``` + +### New collation support framework + +In TiDB 4.0, a complete collation support framework is introduced that semantically supports collations and adds the `new_collation_enabled_on_first_boostrap` configuration item to decide whether to enable the new collation support framework when a cluster is initialized. If you initialize the cluster after the configuration item is enabled, you can check whether the new collation is enabled through the `new_collation_enabled` variable in the `mysql`.`tidb` table: + +{{< copyable "sql" >}} + +```sql +select VARIABLE_VALUE from mysql.tidb where VARIABLE_NAME='new_collation_enabled'; +``` + +``` ++----------------+ +| VARIABLE_VALUE | ++----------------+ +| True | ++----------------+ +1 row in set (0.00 sec) +``` + +Under the new collation support framework, TiDB support the `utf8_general_ci` and `utf8mb4_general_ci` collations, which is compatible with MySQL. + +When `utf8_general_ci` or `utf8mb4_general_ci` is used, the string comparison is case-insensitive and accent-insensitive. 
At the same time, TiDB also corrects the collation's `PADDING` behavior: + +{{< copyable "sql" >}} + +```sql +create table t(a varchar(20) charset utf8mb4 collate utf8mb4_general_ci primary key); +Query OK, 0 rows affected (0.00 sec) +insert into t values ('A'); +Query OK, 1 row affected (0.00 sec) +insert into t values ('a'); +ERROR 1062 (23000): Duplicate entry 'a' for key 'PRIMARY' +insert into t values ('a '); +ERROR 1062 (23000): Duplicate entry 'a ' for key 'PRIMARY' +``` + +> **Note:** +> +> The implementation of padding in TiDB is different from that in MySQL. In MySQL, padding is implemented by filling in spaces. In TiDB, padding is implemented by cutting out the spaces at the end. The two approaches are the same in most cases. The only exception is when the end of the string contains characters that are less than spaces (0x20). For example, the result of `'a' < 'a\t'` in TiDB is `1`, but in MySQL, `'a' < 'a\t'` is equivalent to `'a ' < 'a\t'`, and the result is `0`. + +## Coercibility values of collations in expressions + +If an expression involves multiple clauses of different collations, you need to deduce the collation used in the calculation. The rules are as follows: + ++ The coercibility value of the explicit `COLLATE` clause is `0`. ++ If the collations of two strings are incompatible, the coercibility values of the concatenation of two strings with different collations is `1`. ++ The column's collation has a coercibility value of `2`. ++ The system constant (the string returned by `USER ()` or `VERSION ()`) has a coercibility value of `3`. ++ The coercibility value of the constant is `4`. ++ The coercibility value of numbers or intermediate variables is `5`. ++ `NULL` or expressions derived from `NULL` have a coercibility value of `6`. + +When inferring collations, TiDB prefers using the collation of expressions with lower coercibility values (the same as MySQL). 
If the coercibility values of two clauses are the same, the collation is determined according to the following priority: + +binary > utf8mb4_bin > utf8mb4_general_ci > utf8_bin > utf8_general_ci > latin1_bin > ascii_bin + +If the collations of two clauses are different and the coercibility value of both clauses is `0`, TiDB cannot infer the collation and reports an error. + +## `COLLATE` clause + +TiDB supports using the `COLLATE` clause to specify the collation of an expression. The coercibility value of this expression is `0`, which has the highest priority. See the following example: + +{{< copyable "sql" >}} + +```sql +select 'a' = 'A' collate utf8mb4_general_ci; +``` + +``` ++--------------------------------------+ +| 'a' = 'A' collate utf8mb4_general_ci | ++--------------------------------------+ +| 1 | ++--------------------------------------+ +1 row in set (0.00 sec) +``` + +For more details, see [Connection Character Sets and Collations](https://dev.mysql.com/doc/refman/5.7/en/charset-connection.html). diff --git a/reference/sql/statements/alter-database.md b/reference/sql/statements/alter-database.md index dd776cb84c265..58ec328275301 100644 --- a/reference/sql/statements/alter-database.md +++ b/reference/sql/statements/alter-database.md @@ -18,7 +18,7 @@ alter_specification: | [DEFAULT] COLLATE [=] collation_name ``` -The `alter_specification` option specifies the `CHARACTER SET` and `COLLATE` of a specified database. Currently, TiDB only supports some character sets and collations. See [Character Set Support](/reference/sql/character-set.md) for details. +The `alter_specification` option specifies the `CHARACTER SET` and `COLLATE` of a specified database. Currently, TiDB only supports some character sets and collations. See [Character Set Support](/reference/sql/characterset-and-collation.md) for details. 
## See also diff --git a/reference/sql/statements/create-database.md b/reference/sql/statements/create-database.md index 974f478394e0b..ec7f822998fda 100644 --- a/reference/sql/statements/create-database.md +++ b/reference/sql/statements/create-database.md @@ -45,7 +45,7 @@ create_specification: If you create an existing database and does not specify `IF NOT EXISTS`, an error is displayed. -The `create_specification` option is used to specify the specific `CHARACTER SET` and `COLLATE` in the database. Currently, TiDB only supports some of the character sets and collations. For details, see [Character Set Support](/reference/sql/character-set.md). +The `create_specification` option is used to specify the specific `CHARACTER SET` and `COLLATE` in the database. Currently, TiDB only supports some of the character sets and collations. For details, see [Character Set and Collation Supports](/reference/sql/characterset-and-collation.md). ## Examples diff --git a/reference/sql/statements/set-names.md b/reference/sql/statements/set-names.md index 3c7b769ffdd14..8eedf4e19f858 100644 --- a/reference/sql/statements/set-names.md +++ b/reference/sql/statements/set-names.md @@ -77,4 +77,4 @@ This statement is understood to be fully compatible with MySQL. 
Any compatibilit * [SHOW \[GLOBAL|SESSION\] VARIABLES](/reference/sql/statements/show-variables.md) * [SET ](/reference/sql/statements/set-variable.md) -* [Character Set Support](/reference/sql/character-set.md) +* [Character Set and Collation Supports](/reference/sql/characterset-and-collation.md) diff --git a/reference/system-databases/information-schema.md b/reference/system-databases/information-schema.md index 9205582e17f8d..214317114083b 100644 --- a/reference/system-databases/information-schema.md +++ b/reference/system-databases/information-schema.md @@ -36,7 +36,7 @@ select * from `ANALYZE_STATUS`; ### CHARACTER_SETS table -The `CHARACTER_SETS` table provides information about [character sets](/reference/sql/character-set.md). Currently, TiDB only supports some of the character sets. +The `CHARACTER_SETS` table provides information about [character sets](/reference/sql/characterset-and-collation.md). Currently, TiDB only supports some of the character sets. {{< copyable "sql" >}} diff --git a/reference/tools/syncer.md b/reference/tools/syncer.md index 6842a6cb3763c..5ef431d74bce1 100644 --- a/reference/tools/syncer.md +++ b/reference/tools/syncer.md @@ -462,7 +462,7 @@ Before replicating data using Syncer, check the following items: 6. Check the Character Set. - TiDB differs from MySQL in [Character Set](/reference/sql/character-set.md). + TiDB differs from MySQL in [Character Set](/reference/sql/characterset-and-collation.md). 7. Check whether the table to be replicated has a primary key or a unique index. 
From 08de2c222c3786a1885324f07a4425b7ffbcf670 Mon Sep 17 00:00:00 2001 From: TomShawn <1135243111@qq.com> Date: Mon, 20 Apr 2020 22:07:53 +0800 Subject: [PATCH 2/9] fix title --- reference/sql/characterset-and-collation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/reference/sql/characterset-and-collation.md b/reference/sql/characterset-and-collation.md index 2962b7765ee54..7fd1c63274569 100644 --- a/reference/sql/characterset-and-collation.md +++ b/reference/sql/characterset-and-collation.md @@ -353,7 +353,7 @@ To disable this error reporting, use `set @@tidb_skip_utf8_check=1;` to skip the For more information, see [Connection Character Sets and Collations in MySQL](https://dev.mysql.com/doc/refman/5.7/en/charset-connection.html). -## Collation support +## Collation support framework The syntax support and semantic support for the collation is affected by the [`new_collation_enable`](/reference/configuration/tidb-server/configuration-file.md#new_collations_enabled_on_first_bootstrap) configuration item. The syntax support and semantic support are different. The former indicates that TiDB can parse and set collations. The latter inicates that TiDB can correctly use collations when comparing strings. 
From da213ff4a38f26e0b4b25cca4affd7ef7763f5a0 Mon Sep 17 00:00:00 2001 From: TomShawn <1135243111@qq.com> Date: Tue, 21 Apr 2020 20:20:53 +0800 Subject: [PATCH 3/9] refine language --- .../tidb-server/configuration-file.md | 4 +- reference/mysql-compatibility.md | 1 - reference/sql/characterset-and-collation.md | 72 ++++++++++--------- 3 files changed, 40 insertions(+), 37 deletions(-) diff --git a/reference/configuration/tidb-server/configuration-file.md b/reference/configuration/tidb-server/configuration-file.md index 8ab2da08ea253..e75fa780b4d15 100644 --- a/reference/configuration/tidb-server/configuration-file.md +++ b/reference/configuration/tidb-server/configuration-file.md @@ -106,9 +106,9 @@ The TiDB configuration file supports more options than command-line parameters. ### `new_collations_enabled_on_first_bootstrap` -- Enables or disables the new Collation support. +- Enables or disables the new collation support. - Default value: `false` -- Note: This configuration takes effect only for the TiDB cluster that has just been initialized. After the initialization, you cannot use this configuration item to enable and disable the new Collation support. When a TiDB cluster is upgraded to v4.0, because the cluster is initialized before, both `true` and `false` values of this configuration item are taken as `false`. +- Note: This configuration takes effect only for the TiDB cluster that is first initialized. After the initialization, you cannot use this configuration item to enable or disable the new collation support. When a TiDB cluster is upgraded to v4.0, because the cluster has been initialized before, both `true` and `false` values of this configuration item are taken as `false`. 
### `max-server-connections` diff --git a/reference/mysql-compatibility.md b/reference/mysql-compatibility.md index 4bad2332665d6..cc435396fd5cf 100644 --- a/reference/mysql-compatibility.md +++ b/reference/mysql-compatibility.md @@ -98,7 +98,6 @@ In TiDB DDL does not block reads or writes to tables while in operation. However - Does not support lossy changes, such as from `BIGINT` to `INTEGER` or `VARCHAR(255)` to `VARCHAR(10)`. - Does not support modifying the precision of `DECIMAL` data types. - Does not support changing the `UNSIGNED` attribute. - - Only supports changing the `CHARACTER SET` attribute from `utf8` to `utf8mb4`. + `LOCK [=] {DEFAULT|NONE|SHARED|EXCLUSIVE}`: the syntax is supported, but is not applicable to TiDB. All DDL changes that are supported do not lock the table. + `ALGORITHM [=] {DEFAULT|INSTANT|INPLACE|COPY}`: the syntax for `ALGORITHM=INSTANT` and `ALGORITHM=INPLACE` is fully supported, but it works differently from MySQL because some operations that are `INPLACE` in MySQL are `INSTANT` in TiDB. The syntax `ALGORITHM=COPY` is not applicable to TIDB and returns a warning. + Multiple operations cannot be completed in a single `ALTER TABLE` statement. For example, it's not possible to add multiple columns or indexes in a single statement. diff --git a/reference/sql/characterset-and-collation.md b/reference/sql/characterset-and-collation.md index 7fd1c63274569..f8ce549f22ba3 100644 --- a/reference/sql/characterset-and-collation.md +++ b/reference/sql/characterset-and-collation.md @@ -17,7 +17,7 @@ Currently, TiDB supports the following character sets: SHOW CHARACTER SET; ``` -``` +```sql +---------|---------------|-------------------|--------+ | Charset | Description | Default collation | Maxlen | +---------|---------------|-------------------|--------+ @@ -44,7 +44,7 @@ TiDB only supports binary collations. 
This means that unlike MySQL, in TiDB stri SELECT * FROM information_schema.collations; ``` -``` +```sql +----------------+--------------------+------+------------+-------------+---------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | +----------------+--------------------+------+------------+-------------+---------+ @@ -57,13 +57,15 @@ SELECT * FROM information_schema.collations; 5 rows in set (0.00 sec) ``` +At least one collation corresponds to a character set. You can use the following statement to view the collation (under the [new supporting framework for collations](#new-supporting-framework-for-collations)) that corresponds to the character set. + {{< copyable "sql" >}} ```sql SHOW COLLATION WHERE Charset = 'utf8mb4'; ``` -``` +```sql +-------------+---------+------+---------+----------+---------+ | Collation | Charset | Id | Default | Compiled | Sortlen | +-------------+---------+------+---------+----------+---------+ @@ -72,6 +74,8 @@ SHOW COLLATION WHERE Charset = 'utf8mb4'; 1 row in set (0.00 sec) ``` +Each character set has a default collation. For example, the default collation for `utf8mb4` is `utf8mb4_bin`. 
+ For compatibility with MySQL, TiDB will allow other collation names to be used: {{< copyable "sql" >}} @@ -80,7 +84,7 @@ For compatibility with MySQL, TiDB will allow other collation names to be used: CREATE TABLE t1 (a INT NOT NULL PRIMARY KEY AUTO_INCREMENT, b VARCHAR(10)) COLLATE utf8mb4_unicode_520_ci; ``` -``` +```sql Query OK, 0 rows affected (0.00 sec) ``` @@ -90,7 +94,7 @@ Query OK, 0 rows affected (0.00 sec) INSERT INTO t1 VALUES (1, 'a'); ``` -``` +```sql Query OK, 1 row affected (0.00 sec) ``` @@ -100,7 +104,7 @@ Query OK, 1 row affected (0.00 sec) SELECT * FROM t1 WHERE b = 'a'; ``` -``` +```sql +---+------+ | a | b | +---+------+ @@ -115,7 +119,7 @@ SELECT * FROM t1 WHERE b = 'a'; SELECT * FROM t1 WHERE b = 'A'; ``` -``` +```sql Empty set (0.00 sec) ``` @@ -125,7 +129,7 @@ Empty set (0.00 sec) SHOW CREATE TABLE t1\G ``` -``` +```sql *************************** 1. row *************************** Table: t1 Create Table: CREATE TABLE `t1` ( @@ -161,10 +165,10 @@ Different databases can use different character sets and collations. 
Use the `ch {{< copyable "sql" >}} ```sql -create schema test1 character set utf8 COLLATE uft8_general_ci; +create schema test1 character set utf8mb4 COLLATE utf8mb4_general_ci; ``` -``` +```sql Query OK, 0 rows affected (0.09 sec) ``` @@ -174,17 +178,17 @@ Query OK, 0 rows affected (0.09 sec) use test1; ``` -``` +```sql Database changed ``` {{< copyable "sql" >}} -``` +```sql SELECT @@character_set_database, @@collation_database; ``` -``` +```sql +--------------------------|----------------------+ | @@character_set_database | @@collation_database | +--------------------------|----------------------+ @@ -195,31 +199,31 @@ SELECT @@character_set_database, @@collation_database; {{< copyable "sql" >}} -``` +```sql create schema test2 character set latin1 COLLATE latin1_bin; ``` -``` +```sql Query OK, 0 rows affected (0.09 sec) ``` {{< copyable "sql" >}} -``` +```sql use test2; ``` -``` +```sql Database changed ``` {{< copyable "sql" >}} -``` +```sql SELECT @@character_set_database, @@collation_database; ``` -``` +```sql +--------------------------|----------------------+ | @@character_set_database | @@collation_database | +--------------------------|----------------------+ @@ -257,7 +261,7 @@ For example: CREATE TABLE t1(a int) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci; ``` -``` +```sql Query OK, 0 rows affected (0.08 sec) ``` @@ -355,15 +359,15 @@ For more information, see [Connection Character Sets and Collations in MySQL](ht ## Collation support framework -The syntax support and semantic support for the collation is affected by the [`new_collation_enable`](/reference/configuration/tidb-server/configuration-file.md#new_collations_enabled_on_first_bootstrap) configuration item. The syntax support and semantic support are different. The former indicates that TiDB can parse and set collations. The latter inicates that TiDB can correctly use collations when comparing strings. 
+The syntax support and semantic support for the collation are influenced by the [`new_collations_enabled_on_first_bootstrap`](/reference/configuration/tidb-server/configuration-file.md#new_collations_enabled_on_first_bootstrap) configuration item. The syntax support and semantic support are different. The former indicates that TiDB can parse and set collations. The latter indicates that TiDB can correctly use collations when comparing strings. -Before v4.0, TiDB only supports syntactically parsing most of the MySQL collations but semantically takes all collations as binary collations, which is the [old collation support framework](#old-collation-support-framework). +Before v4.0, TiDB only supports syntactically parsing most of the MySQL collations but semantically takes all collations as binary collations, which is the [old supporting framework for collations](#old-supporting-framework-for-collations). -Since v4.0, TiDB supports semantically parsing different collations and strictly following the collations when comparing strings, which is the [new collation support framework](#new-collation-support-framework). +Since v4.0, TiDB supports semantically parsing different collations and strictly following the collations when comparing strings, which is the [new supporting framework for collations](#new-supporting-framework-for-collations). -### Old collation support framework +### Old supporting framework for collations -Before v4.0, you can specify most of the MySQL collations in TiDB, and these collations are processed according to the default collation, which means that the byte order determines the character order. Different from MySQL, TiDB fills in spaces according to Collation's `PADDING` attribute before comparing characters, which causes the following behavior differences: +Before v4.0, you can specify most of the MySQL collations in TiDB, and these collations are processed according to the default collations, which means that the byte order determines the character order. 
Different from MySQL, TiDB fills in spaces according to collation's `PADDING` attribute before comparing characters, which causes the following behavior differences: {{< copyable "sql" >}} @@ -378,9 +382,9 @@ insert into t1 values ('a '); Query OK, 1 row affected # In MySQL, because comparison is performed after the spaces are filled in, the `Duplicate entry 'a '` error is returned. ``` -### New collation support framework +### New supporting framework for collations -In TiDB 4.0, a complete collation support framework is introduced that semantically supports collations and adds the `new_collation_enabled_on_first_boostrap` configuration item to decide whether to enable the new collation support framework when a cluster is initialized. If you initialize the cluster after the configuration item is enabled, you can check whether the new collation is enabled through the `new_collation_enabled` variable in the `mysql`.`tidb` table: +In TiDB 4.0, a complete supporting framework for collations is introduced. This new framework supports semantically parsing collations and introduces the `new_collation_enabled_on_first_boostrap` configuration item to decide whether to enable the new framework when a cluster is first initialized. If you initialize the cluster after the configuration item is enabled, you can check whether the new collation is enabled through the `new_collation_enabled` variable in the `mysql`.`tidb` table: {{< copyable "sql" >}} @@ -388,7 +392,7 @@ In TiDB 4.0, a complete collation support framework is introduced that semantica select VARIABLE_VALUE from mysql.tidb where VARIABLE_NAME='new_collation_enabled'; ``` -``` +```sql +----------------+ | VARIABLE_VALUE | +----------------+ @@ -397,7 +401,7 @@ select VARIABLE_VALUE from mysql.tidb where VARIABLE_NAME='new_collation_enabled 1 row in set (0.00 sec) ``` -Under the new collation support framework, TiDB support the `utf8_general_ci` and `utf8mb4_general_ci` collations, which is compatible with MySQL. 
+Under the new supporting framework, TiDB support the `utf8_general_ci` and `utf8mb4_general_ci` collations which are compatible with MySQL. When `utf8_general_ci` or `utf8mb4_general_ci` is used, the string comparison is case-insensitive and accent-insensitive. At the same time, TiDB also corrects the collation's `PADDING` behavior: @@ -420,15 +424,15 @@ ERROR 1062 (23000): Duplicate entry 'a ' for key 'PRIMARY' ## Coercibility values of collations in expressions -If an expression involves multiple clauses of different collations, you need to deduce the collation used in the calculation. The rules are as follows: +If an expression involves multiple clauses of different collations, you need to infer the collation used in the calculation. The rules are as follows: + The coercibility value of the explicit `COLLATE` clause is `0`. -+ If the collations of two strings are incompatible, the coercibility values of the concatenation of two strings with different collations is `1`. ++ If the collations of two strings are incompatible, the coercibility value of the concatenation of two strings with different collations is `1`. Currently, all implemented collations are compatible with each other. + The column's collation has a coercibility value of `2`. + The system constant (the string returned by `USER ()` or `VERSION ()`) has a coercibility value of `3`. -+ The coercibility value of the constant is `4`. ++ The coercibility value of constants is `4`. + The coercibility value of numbers or intermediate variables is `5`. -+ `NULL` or expressions derived from `NULL` have a coercibility value of `6`. ++ `NULL` and expressions derived from `NULL` have a coercibility value of `6`. When inferring collations, TiDB prefers using the collation of expressions with lower coercibility values (the same as MySQL). 
If the coercibility values of two clauses are the same, the collation is determined according to the following priority: @@ -446,7 +450,7 @@ TiDB supports using the `COLLATE` clause to specify the collation of an expressi select 'a' = 'A' collate utf8mb4_general_ci; ``` -``` +```sql +--------------------------------------+ | 'a' = 'A' collate utf8mb4_general_ci | +--------------------------------------+ From bed2fe1d95d5f15eac12f116b2216c26fe92e167 Mon Sep 17 00:00:00 2001 From: TomShawn <1135243111@qq.com> Date: Sun, 26 Apr 2020 15:02:16 +0800 Subject: [PATCH 4/9] removing outdated content based on the Chinese version --- reference/sql/characterset-and-collation.md | 120 +++----------------- 1 file changed, 16 insertions(+), 104 deletions(-) diff --git a/reference/sql/characterset-and-collation.md b/reference/sql/characterset-and-collation.md index f8ce549f22ba3..05471c8ebb942 100644 --- a/reference/sql/characterset-and-collation.md +++ b/reference/sql/characterset-and-collation.md @@ -32,32 +32,9 @@ SHOW CHARACTER SET; > **Note:** > -> Each character set corresponds to only one default collation. +> Each character set might correspond to multiple collations, but by default each character set corresponds to only one collation. -## Collation support - -TiDB only supports binary collations. 
This means that unlike MySQL, in TiDB string comparisons are both case sensitive and accent sensitive: - -{{< copyable "sql" >}} - -```sql -SELECT * FROM information_schema.collations; -``` - -```sql -+----------------+--------------------+------+------------+-------------+---------+ -| COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | -+----------------+--------------------+------+------------+-------------+---------+ -| utf8mb4_bin | utf8mb4 | 46 | Yes | Yes | 1 | -| latin1_bin | latin1 | 47 | Yes | Yes | 1 | -| binary | binary | 63 | Yes | Yes | 1 | -| ascii_bin | ascii | 65 | Yes | Yes | 1 | -| utf8_bin | utf8 | 83 | Yes | Yes | 1 | -+----------------+--------------------+------+------------+-------------+---------+ -5 rows in set (0.00 sec) -``` - -At least one collation corresponds to a character set. You can use the following statement to view the collation (under the [new supporting framework for collations](#new-supporting-framework-for-collations)) that corresponds to the character set. +You can use the following statement to view the collation (under the [new framework for collations](#new-framework-for-collations)) that corresponds to the character set. {{< copyable "sql" >}} @@ -66,78 +43,13 @@ SHOW COLLATION WHERE Charset = 'utf8mb4'; ``` ```sql -+-------------+---------+------+---------+----------+---------+ -| Collation | Charset | Id | Default | Compiled | Sortlen | -+-------------+---------+------+---------+----------+---------+ -| utf8mb4_bin | utf8mb4 | 46 | Yes | Yes | 1 | -+-------------+---------+------+---------+----------+---------+ -1 row in set (0.00 sec) -``` - -Each character set has a default collation. For example, the default collation for `utf8mb4` is `utf8mb4_bin`. 
- -For compatibility with MySQL, TiDB will allow other collation names to be used: - -{{< copyable "sql" >}} - -```sql -CREATE TABLE t1 (a INT NOT NULL PRIMARY KEY AUTO_INCREMENT, b VARCHAR(10)) COLLATE utf8mb4_unicode_520_ci; -``` - -```sql -Query OK, 0 rows affected (0.00 sec) -``` - -{{< copyable "sql" >}} - -```sql -INSERT INTO t1 VALUES (1, 'a'); -``` - -```sql -Query OK, 1 row affected (0.00 sec) -``` - -{{< copyable "sql" >}} - -```sql -SELECT * FROM t1 WHERE b = 'a'; -``` - -```sql -+---+------+ -| a | b | -+---+------+ -| 1 | a | -+---+------+ -1 row in set (0.00 sec) -``` - -{{< copyable "sql" >}} - -```sql -SELECT * FROM t1 WHERE b = 'A'; -``` - -```sql -Empty set (0.00 sec) -``` - -{{< copyable "sql" >}} - -```sql -SHOW CREATE TABLE t1\G -``` - -```sql -*************************** 1. row *************************** - Table: t1 -Create Table: CREATE TABLE `t1` ( - `a` int(11) NOT NULL AUTO_INCREMENT, - `b` varchar(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL, - PRIMARY KEY (`a`) -) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci AUTO_INCREMENT=30002 -1 row in set (0.00 sec) ++--------------------+---------+------+---------+----------+---------+ +| Collation | Charset | Id | Default | Compiled | Sortlen | ++--------------------+---------+------+---------+----------+---------+ +| utf8mb4_bin | utf8mb4 | 46 | Yes | Yes | 1 | +| utf8mb4_general_ci | utf8mb4 | 45 | | Yes | 1 | ++--------------------+---------+------+---------+----------+---------+ +2 rows in set (0.00 sec) ``` ## Cluster character set and collation @@ -232,7 +144,7 @@ SELECT @@character_set_database, @@collation_database; 1 row in set (0.00 sec) ``` -You can also see the two values in INFORMATION_SCHEMA: +You can also see the two values in `INFORMATION_SCHEMA`: ```sql SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME @@ -361,11 +273,11 @@ For more information, see [Connection Character Sets and Collations in MySQL](ht The syntax support and semantic 
support for the collation are influenced by the [`new_collation_enable`](/reference/configuration/tidb-server/configuration-file.md#new_collations_enabled_on_first_bootstrap) configuration item. The syntax support and semantic support are different. The former indicates that TiDB can parse and set collations. The latter indicates that TiDB can correctly use collations when comparing strings. -Before v4.0, TiDB only supports syntactically parsing most of the MySQL collations but semantically takes all collations as binary collations, which is the [old supporting framework for collations](#old-supporting-framework-for-collations). +Before v4.0, TiDB only supports syntactically parsing most of the MySQL collations but semantically takes all collations as binary collations, which is the [old framework for collations](#old-framework-for-collations). -Since v4.0, TiDB supports semantically parsing different collations and strictly following the collations when comparing strings, which is the [new supporting framework for collations](#new-supporting-framework-for-collations). +Since v4.0, TiDB supports semantically parsing different collations and strictly following the collations when comparing strings, which is the [new framework for collations](#new-framework-for-collations). -### Old supporting framework for collations +### Old framework for collations Before v4.0, you can specify most of the MySQL collations in TiDB, and these collations are processed according to the default collations, which means that the byte order determines the character order. Different from MySQL, TiDB fills in spaces according to collation's `PADDING` attribute before comparing characters, which causes the following behavior differences: @@ -382,9 +294,9 @@ insert into t1 values ('a '); Query OK, 1 row affected # In MySQL, because comparison is performed after the spaces are filled in, the `Duplicate entry 'a '` error is returned. 
``` -### New supporting framework for collations +### New framework for collations -In TiDB 4.0, a complete supporting framework for collations is introduced. This new framework supports semantically parsing collations and introduces the `new_collation_enabled_on_first_boostrap` configuration item to decide whether to enable the new framework when a cluster is first initialized. If you initialize the cluster after the configuration item is enabled, you can check whether the new collation is enabled through the `new_collation_enabled` variable in the `mysql`.`tidb` table: +In TiDB 4.0, a complete framework for collations is introduced. This new framework supports semantically parsing collations and introduces the `new_collation_enabled_on_first_boostrap` configuration item to decide whether to enable the new framework when a cluster is first initialized. If you initialize the cluster after the configuration item is enabled, you can check whether the new collation is enabled through the `new_collation_enabled` variable in the `mysql`.`tidb` table: {{< copyable "sql" >}} @@ -401,7 +313,7 @@ select VARIABLE_VALUE from mysql.tidb where VARIABLE_NAME='new_collation_enabled 1 row in set (0.00 sec) ``` -Under the new supporting framework, TiDB support the `utf8_general_ci` and `utf8mb4_general_ci` collations which are compatible with MySQL. +Under the new framework, TiDB supports the `utf8_general_ci` and `utf8mb4_general_ci` collations, which are compatible with MySQL. When `utf8_general_ci` or `utf8mb4_general_ci` is used, the string comparison is case-insensitive and accent-insensitive. 
At the same time, TiDB also corrects the collation's `PADDING` behavior: From b9aa23a006698dbdd7b8046f746e82d136175fa8 Mon Sep 17 00:00:00 2001 From: TomShawn <1135243111@qq.com> Date: Tue, 28 Apr 2020 23:02:21 +0800 Subject: [PATCH 5/9] update a sentence --- reference/sql/characterset-and-collation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/reference/sql/characterset-and-collation.md b/reference/sql/characterset-and-collation.md index 05471c8ebb942..ddf5f48603441 100644 --- a/reference/sql/characterset-and-collation.md +++ b/reference/sql/characterset-and-collation.md @@ -279,7 +279,7 @@ Since v4.0, TiDB supports semantically parsing different collations and strictly ### Old framework for collations -Before v4.0, you can specify most of the MySQL collations in TiDB, and these collations are processed according to the default collations, which means that the byte order determines the character order. Different from MySQL, TiDB fills in spaces according to collation's `PADDING` attribute before comparing characters, which causes the following behavior differences: +Before v4.0, you can specify most of the MySQL collations in TiDB, and these collations are processed according to the default collations, which means that the byte order determines the character order. 
Different from MySQL, TiDB deletes the space at the end of the character according to the `PADDING` attribute of the collation before comparing characters, which causes the following behavior differences: {{< copyable "sql" >}} From 5218c72ff17fa31925716d8a8bc1c2da47f3ed5c Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Tue, 12 May 2020 11:59:08 +0800 Subject: [PATCH 6/9] correct parameter name from #2802 --- reference/sql/characterset-and-collation.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/reference/sql/characterset-and-collation.md b/reference/sql/characterset-and-collation.md index ddf5f48603441..d4f109dc00535 100644 --- a/reference/sql/characterset-and-collation.md +++ b/reference/sql/characterset-and-collation.md @@ -271,7 +271,7 @@ For more information, see [Connection Character Sets and Collations in MySQL](ht ## Collation support framework -The syntax support and semantic support for the collation are influenced by the [`new_collation_enable`](/reference/configuration/tidb-server/configuration-file.md#new_collations_enabled_on_first_bootstrap) configuration item. The syntax support and semantic support are different. The former indicates that TiDB can parse and set collations. The latter indicates that TiDB can correctly use collations when comparing strings. +The syntax support and semantic support for the collation are influenced by the [`new_collations_enabled_on_first_bootstrap`](/reference/configuration/tidb-server/configuration-file.md#new_collations_enabled_on_first_bootstrap) configuration item. The syntax support and semantic support are different. The former indicates that TiDB can parse and set collations. The latter indicates that TiDB can correctly use collations when comparing strings. 
Before v4.0, TiDB only supports syntactically parsing most of the MySQL collations but semantically takes all collations as binary collations, which is the [old framework for collations](#old-framework-for-collations). @@ -296,7 +296,7 @@ Query OK, 1 row affected # In MySQL, because comparison is performed after the s ### New framework for collations -In TiDB 4.0, a complete framework for collations is introduced. This new framework supports semantically parsing collations and introduces the `new_collation_enabled_on_first_boostrap` configuration item to decide whether to enable the new framework when a cluster is first initialized. If you initialize the cluster after the configuration item is enabled, you can check whether the new collation is enabled through the `new_collation_enabled` variable in the `mysql`.`tidb` table: +In TiDB 4.0, a complete framework for collations is introduced. This new framework supports semantically parsing collations and introduces the `new_collations_enabled_on_first_bootstrap` configuration item to decide whether to enable the new framework when a cluster is first initialized. 
If you initialize the cluster after the configuration item is enabled, you can check whether the new collation is enabled through the `new_collation_enabled` variable in the `mysql`.`tidb` table: {{< copyable "sql" >}} From 48680f293633f59304111f4ba32ec5cdcfbe6e68 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Tue, 12 May 2020 12:02:47 +0800 Subject: [PATCH 7/9] Update date-and-time.md --- reference/sql/data-types/date-and-time.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/reference/sql/data-types/date-and-time.md b/reference/sql/data-types/date-and-time.md index e147c9526cf7b..d30f59ba1b28c 100644 --- a/reference/sql/data-types/date-and-time.md +++ b/reference/sql/data-types/date-and-time.md @@ -76,6 +76,8 @@ Different types of zero value are shown in the following table: Invalid `DATE`, `DATETIME`, `TIMESTAMP` values are automatically converted to the corresponding type of zero value ( '0000-00-00' or '0000-00-00 00:00:00' ) if the SQL mode permits such usage. + + ### Automatic initialization and update of `TIMESTAMP` and `DATETIME` Columns with `TIMESTAMP` or `DATETIME` value type can be automatically initialized or updated to the current time. 
From d73ef292328c71164784c8d25a90053a5d610f87 Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Wed, 13 May 2020 15:09:47 +0800 Subject: [PATCH 8/9] address comments from coco --- reference/sql/characterset-and-collation.md | 42 ++++++++++++--------- reference/sql/statements/alter-database.md | 2 +- reference/sql/statements/create-database.md | 2 +- reference/sql/statements/set-names.md | 2 +- reference/tools/syncer.md | 4 +- 5 files changed, 29 insertions(+), 23 deletions(-) diff --git a/reference/sql/characterset-and-collation.md b/reference/sql/characterset-and-collation.md index d4f109dc00535..a4b32f5d5cd9f 100644 --- a/reference/sql/characterset-and-collation.md +++ b/reference/sql/characterset-and-collation.md @@ -1,6 +1,6 @@ --- title: Character Set and Collation -summary: Learn about the supported character sets in TiDB. +summary: Learn about the supported character sets and collations in TiDB. category: reference aliases: ['/docs/dev/reference/sql/character-set/'] --- @@ -34,7 +34,7 @@ SHOW CHARACTER SET; > > Each character set might correspond to multiple collations, but by default each character set corresponds to only one collation. -You can use the following statement to view the collation (under the [new framework for collations](#new-framework-for-collations)) that corresponds to the character set. +You can use the following statement to view the collations (under the [new framework for collations](#new-framework-for-collations)) that correspond to the character set. 
{{< copyable "sql" >}} @@ -146,6 +146,8 @@ SELECT @@character_set_database, @@collation_database; You can also see the two values in `INFORMATION_SCHEMA`: +{{< copyable "sql" >}} + ```sql SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'db_name'; @@ -177,11 +179,11 @@ CREATE TABLE t1(a int) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci; Query OK, 0 rows affected (0.08 sec) ``` -The database character set and collation are used as the default values for table definitions if the table character set and collation are not specified in individual column definitions. +If the table character set and collation are not specified, the database character set and collation are used as their default values. ## Column character set and collation -See the following table for the character set and collation syntax for columns: +You can use the following statement to specify the character set and collation for columns: ```sql col_name {CHAR | VARCHAR | TEXT} (col_length) @@ -193,11 +195,11 @@ col_name {ENUM | SET} (val_list) [COLLATE collation_name] ``` -The table character set and collation are used as the default values for column definitions if the column character set and collation are not specified in individual column definitions. +If the column character set and collation are not specified, the table character set and collation are used as their default values. ## String character sets and collation -Each character literal in a string has a character set and a collation. When you use a string, this option is available: +Each string corresponds to a character set and a collation. When you use a string, this option is available: {{< copyable "sql" >}} @@ -207,6 +209,8 @@ Each character literal in a string has a character set and a collation. 
When you Example: +{{< copyable "sql" >}} + ```sql SELECT 'string'; SELECT _utf8mb4'string'; @@ -215,17 +219,21 @@ SELECT _utf8mb4'string' COLLATE utf8mb4_general_ci; Rules: -+ Rule 1: If you specify `CHARACTER SET charset_name` and `COLLATE collation_name`, then `CHARACTER SET charset_name` and `COLLATE collation_name` are used directly. -+ Rule 2: If you specify `CHARACTER SET charset_name` but do not specify `COLLATE collation_name`, `CHARACTER SET charset_name` and the default collation of `CHARACTER SET charset_name` are used. ++ Rule 1: If you specify `CHARACTER SET charset_name` and `COLLATE collation_name`, then `charset_name` and `collation_name` are used directly. ++ Rule 2: If you specify `CHARACTER SET charset_name` but do not specify `COLLATE collation_name`, `charset_name` and the default collation of `charset_name` are used. + Rule 3: If you specify neither `CHARACTER SET charset_name` nor `COLLATE collation_name`, the character set and collation given by the system variables `character_set_connection` and `collation_connection` are used. -## Connection character sets and collations +## Client connection character set and collation + The server character set and collation are the values of the `character_set_server` and `collation_server` system variables. -+ The character set and collation of the default database are the values of the `character_set_database` and `collation_database` system variables. You can use `character_set_connection` and `collation_connection` to specify the character set and collation for each connection. The `character_set_client` variable is to set the client character set. Before returning the result, the `character_set_results` system variable indicates the character set in which the server returns query results to the client, including the metadata of the result. ++ The character set and collation of the default database are the values of the `character_set_database` and `collation_database` system variables. 
-You can use the following statement to specify a particular collation that is related to the client: +You can use `character_set_connection` and `collation_connection` to specify the character set and collation for each connection. The `character_set_client` variable is used to set the client character set. + +Before returning the result, the `character_set_results` system variable indicates the character set in which the server returns query results to the client, including the metadata of the result. + +You can use the following statement to set the character set and collation that are related to the client: + `SET NAMES 'charset_name' [COLLATE 'collation_name']` @@ -257,25 +265,23 @@ String > Column > Table > Database > Server > Cluster ## General rules on selecting character sets and collation -+ Rule 1: If you specify `CHARACTER SET charset_name` and `COLLATE collation_name`, then `CHARACTER SET charset_name` and `COLLATE collation_name` are used directly. -+ Rule 2: If you specify `CHARACTER SET charset_name` and do not specify `COLLATE collation_name`, then `CHARACTER SET charset_name` and the default comparison collation of `CHARACTER SET charset_name` are used. ++ Rule 1: If you specify `CHARACTER SET charset_name` and `COLLATE collation_name`, then `charset_name` and `collation_name` are used directly. ++ Rule 2: If you specify `CHARACTER SET charset_name` and do not specify `COLLATE collation_name`, then `charset_name` and the default collation of `charset_name` are used. + Rule 3: If you specify neither `CHARACTER SET charset_name` nor `COLLATE collation_name`, the character set and collation with higher optimization levels are used. ## Validity check of characters -For the specified `utf8` or `utf8mb4` character set, TiDB only supports the valid `utf8` character, and reports the `incorrect utf8 value` error when the character is invalid. 
This validity check of characters in TiDB is compatible with MySQL 8.0 but incompatible with MySQL 5.7 or earlier versions. +If the specified character set is `utf8` or `utf8mb4`, TiDB only supports the valid `utf8` characters. For invalid characters, TiDB reports the `incorrect utf8 value` error. This validity check of characters in TiDB is compatible with MySQL 8.0 but incompatible with MySQL 5.7 or earlier versions. To disable this error reporting, use `set @@tidb_skip_utf8_check=1;` to skip the character check. -For more information, see [Connection Character Sets and Collations in MySQL](https://dev.mysql.com/doc/refman/5.7/en/charset-connection.html). - ## Collation support framework The syntax support and semantic support for the collation are influenced by the [`new_collations_enabled_on_first_bootstrap`](/reference/configuration/tidb-server/configuration-file.md#new_collations_enabled_on_first_bootstrap) configuration item. The syntax support and semantic support are different. The former indicates that TiDB can parse and set collations. The latter indicates that TiDB can correctly use collations when comparing strings. -Before v4.0, TiDB only supports syntactically parsing most of the MySQL collations but semantically takes all collations as binary collations, which is the [old framework for collations](#old-framework-for-collations). +Before v4.0, TiDB provides only the [old framework for collations](#old-framework-for-collations). In this framework, TiDB supports syntactically parsing most of the MySQL collations but semantically takes all collations as binary collations. -Since v4.0, TiDB supports semantically parsing different collations and strictly following the collations when comparing strings, which is the [new framework for collations](#new-framework-for-collations). +Since v4.0, TiDB supports a [new framework for collations](#new-framework-for-collations). 
In this framework, TiDB semantically parses different collations and strictly follows the collations when comparing strings. ### Old framework for collations diff --git a/reference/sql/statements/alter-database.md b/reference/sql/statements/alter-database.md index 58ec328275301..4d0a47e8f4f94 100644 --- a/reference/sql/statements/alter-database.md +++ b/reference/sql/statements/alter-database.md @@ -18,7 +18,7 @@ alter_specification: | [DEFAULT] COLLATE [=] collation_name ``` -The `alter_specification` option specifies the `CHARACTER SET` and `COLLATE` of a specified database. Currently, TiDB only supports some character sets and collations. See [Character Set Support](/reference/sql/characterset-and-collation.md) for details. +The `alter_specification` option specifies the `CHARACTER SET` and `COLLATE` of a specified database. Currently, TiDB only supports some character sets and collations. See [Character Set and Collation Support](/reference/sql/characterset-and-collation.md) for details. ## See also diff --git a/reference/sql/statements/create-database.md b/reference/sql/statements/create-database.md index c6840530e14ff..9d08da6fd0666 100644 --- a/reference/sql/statements/create-database.md +++ b/reference/sql/statements/create-database.md @@ -45,7 +45,7 @@ create_specification: If you create an existing database and does not specify `IF NOT EXISTS`, an error is displayed. -The `create_specification` option is used to specify the specific `CHARACTER SET` and `COLLATE` in the database. Currently, TiDB only supports some of the character sets and collations. For details, see [Character Set and Collation Supports](/reference/sql/characterset-and-collation.md). +The `create_specification` option is used to specify the specific `CHARACTER SET` and `COLLATE` in the database. Currently, TiDB only supports some of the character sets and collations. For details, see [Character Set and Collation Support](/reference/sql/characterset-and-collation.md). 
## Examples diff --git a/reference/sql/statements/set-names.md b/reference/sql/statements/set-names.md index d18acef99ef8d..fee1bd5e390dd 100644 --- a/reference/sql/statements/set-names.md +++ b/reference/sql/statements/set-names.md @@ -77,4 +77,4 @@ This statement is understood to be fully compatible with MySQL. Any compatibilit * [SHOW \[GLOBAL|SESSION\] VARIABLES](/reference/sql/statements/show-variables.md) * [SET ](/reference/sql/statements/set-variable.md) -* [Character Set and Collation Supports](/reference/sql/characterset-and-collation.md) +* [Character Set and Collation Support](/reference/sql/characterset-and-collation.md) diff --git a/reference/tools/syncer.md b/reference/tools/syncer.md index 5ef431d74bce1..b065754912672 100644 --- a/reference/tools/syncer.md +++ b/reference/tools/syncer.md @@ -38,7 +38,7 @@ binlog-gtid = "2bfabd22-fff7-11e6-97f7-f02fa73bcb01:1-23,61ccbb5d-c82d-11e6-ac2e > **Note:** > > - The `syncer.meta` file only needs to be configured when it is first used. The position is automatically updated when the new subsequent binlog is replicated. -> - If you use the binlog position to replicate, you only need to configure `binlog-name` and `binlog-pos`; if you use `binlog-gtid` to replacate, you need to configure `binlog-gtid` and set `--enable-gtid` when starting Syncer. +> - If you use the binlog position to replicate, you only need to configure `binlog-name` and `binlog-pos`; if you use `binlog-gtid` to replicate, you need to configure `binlog-gtid` and set `--enable-gtid` when starting Syncer. ### 2. Start Syncer @@ -462,7 +462,7 @@ Before replicating data using Syncer, check the following items: 6. Check the Character Set. - TiDB differs from MySQL in [Character Set](/reference/sql/characterset-and-collation.md). + TiDB differs from MySQL in [character sets](/reference/sql/characterset-and-collation.md). 7. Check whether the table to be replicated has a primary key or a unique index. 
From ed8027a7724ba719fb5b02700ad2d1d8af1a53ac Mon Sep 17 00:00:00 2001 From: TomShawn <41534398+TomShawn@users.noreply.github.com> Date: Wed, 13 May 2020 16:14:02 +0800 Subject: [PATCH 9/9] Apply suggestions from code review Co-authored-by: Keke Yi <40977455+yikeke@users.noreply.github.com> --- reference/sql/characterset-and-collation.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/reference/sql/characterset-and-collation.md b/reference/sql/characterset-and-collation.md index a4b32f5d5cd9f..858685e1d719b 100644 --- a/reference/sql/characterset-and-collation.md +++ b/reference/sql/characterset-and-collation.md @@ -219,8 +219,8 @@ SELECT _utf8mb4'string' COLLATE utf8mb4_general_ci; Rules: -+ Rule 1: If you specify `CHARACTER SET charset_name` and `COLLATE collation_name`, then `charset_name` and `collation_name` are used directly. -+ Rule 2: If you specify `CHARACTER SET charset_name` but do not specify `COLLATE collation_name`, `charset_name` and the default collation of `charset_name` are used. ++ Rule 1: If you specify `CHARACTER SET charset_name` and `COLLATE collation_name`, then the `charset_name` character set and the `collation_name` collation are used directly. ++ Rule 2: If you specify `CHARACTER SET charset_name` but do not specify `COLLATE collation_name`, the `charset_name` character set and the default collation of `charset_name` are used. + Rule 3: If you specify neither `CHARACTER SET charset_name` nor `COLLATE collation_name`, the character set and collation given by the system variables `character_set_connection` and `collation_connection` are used. ## Client connection character set and collation @@ -265,8 +265,8 @@ String > Column > Table > Database > Server > Cluster ## General rules on selecting character sets and collation -+ Rule 1: If you specify `CHARACTER SET charset_name` and `COLLATE collation_name`, then `charset_name` and `collation_name` are used directly. 
-+ Rule 2: If you specify `CHARACTER SET charset_name` and do not specify `COLLATE collation_name`, then `charset_name` and the default collation of `charset_name` are used. ++ Rule 1: If you specify `CHARACTER SET charset_name` and `COLLATE collation_name`, then the `charset_name` character set and the `collation_name` collation are used directly. ++ Rule 2: If you specify `CHARACTER SET charset_name` and do not specify `COLLATE collation_name`, then the `charset_name` character set and the default collation of `charset_name` are used. + Rule 3: If you specify neither `CHARACTER SET charset_name` nor `COLLATE collation_name`, the character set and collation with higher optimization levels are used. ## Validity check of characters