reference: update documents for new collation #2350

Merged
sre-bot merged 12 commits into pingcap:master from TomShawn:collations
May 13, 2020

@TomShawn
Contributor

@TomShawn TomShawn commented Apr 20, 2020

What is changed, added or deleted? (Required)

Update documents for new collation.

Which TiDB version(s) do your changes apply to? (Required)

  • master (the latest development version)
  • v4.0 (TiDB 4.0 versions)
  • v3.1 (TiDB 3.1 versions)
  • v3.0 (TiDB 3.0 versions)
  • v2.1 (TiDB 2.1 versions)

If you select two or more versions from above, to trigger the bot to cherry-pick this PR to your desired release version branch(es), you must add corresponding labels such as needs-cherry-pick-4.0, needs-cherry-pick-3.1, needs-cherry-pick-3.0, and needs-cherry-pick-2.1.

What is the related PR or file link(s)?

@TomShawn TomShawn added the translation/from-docs-cn, v4.0, size/large, status/WIP, and needs-cherry-pick-4.0 labels Apr 20, 2020
@TomShawn TomShawn requested a review from wjhuang2016 April 20, 2020 14:05
@TomShawn TomShawn added the status/PTAL label and removed the status/WIP label Apr 21, 2020
@TomShawn TomShawn requested a review from yikeke April 21, 2020 12:23
Member

@wjhuang2016 wjhuang2016 left a comment

LGTM

@TomShawn TomShawn added the status/WIP label Apr 24, 2020
@yikeke yikeke removed the status/PTAL label Apr 30, 2020
@pingcap pingcap deleted a comment from sre-bot Apr 30, 2020
@pingcap pingcap deleted a comment from sre-bot Apr 30, 2020
@pingcap pingcap deleted a comment from sre-bot Apr 30, 2020
@yikeke
Contributor

yikeke commented May 12, 2020

Sorry, I'll review this PR later today.

@TomShawn TomShawn removed the status/WIP label May 12, 2020

The `alter_specification` option specifies the `CHARACTER SET` and `COLLATE` of a specified database. Currently, TiDB only supports some character sets and collations. See [Character Set Support](/reference/sql/character-set.md) for details.
The `alter_specification` option specifies the `CHARACTER SET` and `COLLATE` of a specified database. Currently, TiDB only supports some character sets and collations. See [Character Set Support](/reference/sql/characterset-and-collation.md) for details.
Contributor

Suggested change
The `alter_specification` option specifies the `CHARACTER SET` and `COLLATE` of a specified database. Currently, TiDB only supports some character sets and collations. See [Character Set Support](/reference/sql/characterset-and-collation.md) for details.
The `alter_specification` option specifies the `CHARACTER SET` and `COLLATE` of a specified database. Currently, TiDB only supports some character sets and collations. See [Character Set and Collation Support](/reference/sql/characterset-and-collation.md) for details.

If you try to create a database that already exists and do not specify `IF NOT EXISTS`, an error is displayed.

The `create_specification` option is used to specify the specific `CHARACTER SET` and `COLLATE` in the database. Currently, TiDB only supports some of the character sets and collations. For details, see [Character Set Support](/reference/sql/character-set.md).
The `create_specification` option is used to specify the specific `CHARACTER SET` and `COLLATE` in the database. Currently, TiDB only supports some of the character sets and collations. For details, see [Character Set and Collation Supports](/reference/sql/characterset-and-collation.md).
Contributor

Suggested change
The `create_specification` option is used to specify the specific `CHARACTER SET` and `COLLATE` in the database. Currently, TiDB only supports some of the character sets and collations. For details, see [Character Set and Collation Supports](/reference/sql/characterset-and-collation.md).
The `create_specification` option is used to specify the specific `CHARACTER SET` and `COLLATE` in the database. Currently, TiDB only supports some of the character sets and collations. For details, see [Character Set and Collation Support](/reference/sql/characterset-and-collation.md).

Comment thread reference/sql/statements/set-names.md Outdated
* [SHOW \[GLOBAL|SESSION\] VARIABLES](/reference/sql/statements/show-variables.md)
* [SET <variable>](/reference/sql/statements/set-variable.md)
* [Character Set Support](/reference/sql/character-set.md)
* [Character Set and Collation Supports](/reference/sql/characterset-and-collation.md)
Contributor

Suggested change
* [Character Set and Collation Supports](/reference/sql/characterset-and-collation.md)
* [Character Set and Collation Support](/reference/sql/characterset-and-collation.md)

Comment thread reference/tools/syncer.md Outdated
6. Check the Character Set.

TiDB differs from MySQL in [Character Set](/reference/sql/character-set.md).
TiDB differs from MySQL in [Character Set](/reference/sql/characterset-and-collation.md).
Contributor

Suggested change
TiDB differs from MySQL in [Character Set](/reference/sql/characterset-and-collation.md).
TiDB differs from MySQL in [character sets](/reference/sql/characterset-and-collation.md).

@@ -0,0 +1,374 @@
---
title: Character Set and Collation
summary: Learn about the supported character sets in TiDB.
Contributor

Suggested change
summary: Learn about the supported character sets in TiDB.
summary: Learn about the supported character sets and collations in TiDB.

>
> Each character set might correspond to multiple collations, but by default each character set corresponds to only one collation.

You can use the following statement to view the collation (under the [new framework for collations](#new-framework-for-collations)) that corresponds to the character set.
Contributor

Suggested change
You can use the following statement to view the collation (under the [new framework for collations](#new-framework-for-collations)) that corresponds to the character set.
You can use the following statement to view the collations (under the [new framework for collations](#new-framework-for-collations)) that correspond to the character set.

Comment thread reference/sql/characterset-and-collation.md
```
Query OK, 0 rows affected (0.08 sec)
```

The database character set and collation are used as the default values for table definitions if the table character set and collation are not specified in individual column definitions.
Contributor

Suggested change
The database character set and collation are used as the default values for table definitions if the table character set and collation are not specified in individual column definitions.
If the table character set and collation are not specified, the database character set and collation are used as their default values.


## Column character set and collation

See the following table for the character set and collation syntax for columns:
Contributor

Suggested change
See the following table for the character set and collation syntax for columns:
You can use the following statement to specify the character set and collation for columns:

```
[COLLATE collation_name]
```

The table character set and collation are used as the default values for column definitions if the column character set and collation are not specified in individual column definitions.
Contributor

Suggested change
The table character set and collation are used as the default values for column definitions if the column character set and collation are not specified in individual column definitions.
If the column character set and collation are not specified, the table character set and collation are used as their default values.
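
The fallback order in the two suggestions above (column, then table, then database) can be sketched in Python. This is only an illustrative model, not TiDB code: the `Scope` class and the `effective_settings` function are hypothetical names, and the sketch simplifies by treating the character set and collation as a pair that is either fully specified or not specified at all.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Scope:
    """Hypothetical holder for an explicitly declared charset/collation pair."""
    charset: Optional[str] = None
    collation: Optional[str] = None


def effective_settings(column: Scope, table: Scope, database: Scope) -> Tuple[str, str]:
    """Return the first explicitly set pair, checking the narrowest scope first."""
    for scope in (column, table, database):
        if scope.charset is not None and scope.collation is not None:
            return scope.charset, scope.collation
    raise ValueError("no character set and collation specified at any level")


database = Scope("utf8mb4", "utf8mb4_bin")
table = Scope()  # nothing specified, so the database values apply
column = Scope("utf8mb4", "utf8mb4_general_ci")

print(effective_settings(column, table, database))   # column values win
print(effective_settings(Scope(), table, database))  # falls back to the database
```

The loop order encodes the precedence, so adding a further level (for example, a server-wide default) would only mean appending one more `Scope` to the tuple.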


## String character sets and collation

Each character literal in a string has a character set and a collation. When you use a string, this option is available:
Contributor

Suggested change
Each character literal in a string has a character set and a collation. When you use a string, this option is available:
Each string corresponds to a character set and a collation. When you use a string, this option is available:


Comment thread reference/sql/characterset-and-collation.md

Rules:

+ Rule 1: If you specify `CHARACTER SET charset_name` and `COLLATE collation_name`, then `CHARACTER SET charset_name` and `COLLATE collation_name` are used directly.
Contributor

Suggested change
+ Rule 1: If you specify `CHARACTER SET charset_name` and `COLLATE collation_name`, then `CHARACTER SET charset_name` and `COLLATE collation_name` are used directly.
+ Rule 1: If you specify `CHARACTER SET charset_name` and `COLLATE collation_name`, then `charset_name` and `collation_name` are used directly.

I changed the zh doc in pingcap/docs-cn#3056

Rules:

+ Rule 1: If you specify `CHARACTER SET charset_name` and `COLLATE collation_name`, then `CHARACTER SET charset_name` and `COLLATE collation_name` are used directly.
+ Rule 2: If you specify `CHARACTER SET charset_name` but do not specify `COLLATE collation_name`, `CHARACTER SET charset_name` and the default collation of `CHARACTER SET charset_name` are used.
Contributor

Suggested change
+ Rule 2: If you specify `CHARACTER SET charset_name` but do not specify `COLLATE collation_name`, `CHARACTER SET charset_name` and the default collation of `CHARACTER SET charset_name` are used.
+ Rule 2: If you specify `CHARACTER SET charset_name` but do not specify `COLLATE collation_name`, `charset_name` and the default collation of `charset_name` are used.


## General rules on selecting character sets and collation

+ Rule 1: If you specify `CHARACTER SET charset_name` and `COLLATE collation_name`, then `CHARACTER SET charset_name` and `COLLATE collation_name` are used directly.
Contributor

Suggested change
+ Rule 1: If you specify `CHARACTER SET charset_name` and `COLLATE collation_name`, then `CHARACTER SET charset_name` and `COLLATE collation_name` are used directly.
+ Rule 1: If you specify `CHARACTER SET charset_name` and `COLLATE collation_name`, then `charset_name` and `collation_name` are used directly.

## General rules on selecting character sets and collation

+ Rule 1: If you specify `CHARACTER SET charset_name` and `COLLATE collation_name`, then `CHARACTER SET charset_name` and `COLLATE collation_name` are used directly.
+ Rule 2: If you specify `CHARACTER SET charset_name` and do not specify `COLLATE collation_name`, then `CHARACTER SET charset_name` and the default comparison collation of `CHARACTER SET charset_name` are used.
Contributor

Suggested change
+ Rule 2: If you specify `CHARACTER SET charset_name` and do not specify `COLLATE collation_name`, then `CHARACTER SET charset_name` and the default comparison collation of `CHARACTER SET charset_name` are used.
+ Rule 2: If you specify `CHARACTER SET charset_name` and do not specify `COLLATE collation_name`, then `charset_name` and the default collation of `charset_name` are used.

+ Rule 2: If you specify `CHARACTER SET charset_name` but do not specify `COLLATE collation_name`, `CHARACTER SET charset_name` and the default collation of `CHARACTER SET charset_name` are used.
+ Rule 3: If you specify neither `CHARACTER SET charset_name` nor `COLLATE collation_name`, the character set and collation given by the system variables `character_set_connection` and `collation_connection` are used.

## Connection character sets and collations
Contributor

Suggested change
## Connection character sets and collations
## Client connection character set and collation
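
Rules 1 to 3 quoted in the hunks above can be sketched as a small Python function. This is a hedged illustration rather than TiDB's implementation: `resolve` and `DEFAULT_COLLATION` are hypothetical names, and the default collations listed are assumptions for the example (the authoritative values come from the server). The case where only a collation is given is not covered by the quoted rules, so the sketch refuses it instead of guessing.

```python
from typing import Optional, Tuple

# Assumed default collations, for illustration only.
DEFAULT_COLLATION = {
    "utf8": "utf8_bin",
    "utf8mb4": "utf8mb4_bin",
}


def resolve(
    charset: Optional[str],
    collation: Optional[str],
    character_set_connection: str,
    collation_connection: str,
) -> Tuple[str, str]:
    """Pick the (charset, collation) pair according to the three quoted rules."""
    if charset is not None and collation is not None:
        # Rule 1: both specified, use them directly.
        return charset, collation
    if charset is not None:
        # Rule 2: only the charset specified, use its default collation.
        return charset, DEFAULT_COLLATION[charset]
    if collation is None:
        # Rule 3: neither specified, use the connection system variables.
        return character_set_connection, collation_connection
    raise NotImplementedError("collation without charset is not covered by the quoted rules")


print(resolve("utf8mb4", "utf8mb4_general_ci", "utf8", "utf8_bin"))  # Rule 1
print(resolve("utf8mb4", None, "utf8", "utf8_bin"))                  # Rule 2
print(resolve(None, None, "utf8", "utf8_bin"))                       # Rule 3
```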


+ The server character set and collation are the values of the `character_set_server` and `collation_server` system variables.

+ The character set and collation of the default database are the values of the `character_set_database` and `collation_database` system variables. You can use `character_set_connection` and `collation_connection` to specify the character set and collation for each connection. The `character_set_client` variable is to set the client character set. Before returning the result, the `character_set_results` system variable indicates the character set in which the server returns query results to the client, including the metadata of the result.
Contributor

Suggested change
+ The character set and collation of the default database are the values of the `character_set_database` and `collation_database` system variables. You can use `character_set_connection` and `collation_connection` to specify the character set and collation for each connection. The `character_set_client` variable is to set the client character set. Before returning the result, the `character_set_results` system variable indicates the character set in which the server returns query results to the client, including the metadata of the result.
+ The character set and collation of the default database are the values of the `character_set_database` and `collation_database` system variables.
You can use `character_set_connection` and `collation_connection` to specify the character set and collation for each client connection.
The `character_set_client` variable sets the client character set. Before returning the result, the server converts the query result, including its metadata, to the character set specified by the `character_set_results` variable.


+ The character set and collation of the default database are the values of the `character_set_database` and `collation_database` system variables. You can use `character_set_connection` and `collation_connection` to specify the character set and collation for each connection. The `character_set_client` variable is to set the client character set. Before returning the result, the `character_set_results` system variable indicates the character set in which the server returns query results to the client, including the metadata of the result.

You can use the following statement to specify a particular collation that is related to the client:
Contributor

Suggested change
You can use the following statement to specify a particular collation that is related to the client:
You can use the following statement to set the character set and collation that are related to the client:


## Validity check of characters

For the specified `utf8` or `utf8mb4` character set, TiDB only supports the valid `utf8` character, and reports the `incorrect utf8 value` error when the character is invalid. This validity check of characters in TiDB is compatible with MySQL 8.0 but incompatible with MySQL 5.7 or earlier versions.
Contributor

Suggested change
For the specified `utf8` or `utf8mb4` character set, TiDB only supports the valid `utf8` character, and reports the `incorrect utf8 value` error when the character is invalid. This validity check of characters in TiDB is compatible with MySQL 8.0 but incompatible with MySQL 5.7 or earlier versions.
If the specified character set is `utf8` or `utf8mb4`, TiDB only supports the valid `utf8` characters. For invalid characters, TiDB reports the `incorrect utf8 value` error. This validity check of characters in TiDB is compatible with MySQL 8.0 but incompatible with MySQL 5.7 or earlier versions.
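
To make "invalid character" concrete, the following Python sketch shows the kind of byte sequence such a check rejects. It uses Python's own UTF-8 codec as a stand-in, so it is an assumption-laden illustration and not TiDB's validation code; `is_valid_utf8` is a hypothetical helper, and the sketch does not model any difference between `utf8` and `utf8mb4`.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 under Python's strict codec."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A well-formed two-byte sequence.
print(is_valid_utf8("é".encode("utf-8")))  # True

# 0xC3 announces a two-byte sequence, but 0x28 is not a continuation byte.
print(is_valid_utf8(b"\xc3\x28"))  # False
```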

Comment on lines +270 to +271
For more information, see [Connection Character Sets and Collations in MySQL](https://dev.mysql.com/doc/refman/5.7/en/charset-connection.html).

Contributor

Suggested change
For more information, see [Connection Character Sets and Collations in MySQL](https://dev.mysql.com/doc/refman/5.7/en/charset-connection.html).


The syntax support and semantic support for the collation are influenced by the [`new_collations_enabled_on_first_bootstrap`](/reference/configuration/tidb-server/configuration-file.md#new_collations_enabled_on_first_bootstrap) configuration item. The syntax support and semantic support are different. The former indicates that TiDB can parse and set collations. The latter indicates that TiDB can correctly use collations when comparing strings.

Before v4.0, TiDB only supports syntactically parsing most of the MySQL collations but semantically takes all collations as binary collations, which is the [old framework for collations](#old-framework-for-collations).
Contributor

Suggested change
Before v4.0, TiDB only supports syntactically parsing most of the MySQL collations but semantically takes all collations as binary collations, which is the [old framework for collations](#old-framework-for-collations).
Before v4.0, TiDB provides only the [old framework for collations](#old-framework-for-collations). In this framework, TiDB supports syntactically parsing most of the MySQL collations but semantically takes all collations as binary collations.


Before v4.0, TiDB only supports syntactically parsing most of the MySQL collations but semantically takes all collations as binary collations, which is the [old framework for collations](#old-framework-for-collations).

Since v4.0, TiDB supports semantically parsing different collations and strictly following the collations when comparing strings, which is the [new framework for collations](#new-framework-for-collations).
Contributor

Suggested change
Since v4.0, TiDB supports semantically parsing different collations and strictly following the collations when comparing strings, which is the [new framework for collations](#new-framework-for-collations).
Since v4.0, TiDB supports a [new framework for collations](#new-framework-for-collations). In this framework, TiDB semantically parses different collations and strictly follows the collations when comparing strings.
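
The practical difference between the two frameworks is how two strings compare. The Python sketch below contrasts a byte-wise comparison (how the old framework is described as treating every collation) with a case-insensitive one. Both helpers are hypothetical, and `str.casefold` is only a rough stand-in for a case-insensitive collation, which in a real server is defined by its own comparison rules.

```python
def equal_binary(a: str, b: str) -> bool:
    """Compare byte by byte, as a binary collation does."""
    return a.encode("utf-8") == b.encode("utf-8")


def equal_case_insensitive(a: str, b: str) -> bool:
    """Approximate a case-insensitive collation with Unicode case folding."""
    return a.casefold() == b.casefold()


print(equal_binary("A", "a"))            # False: the bytes differ
print(equal_case_insensitive("A", "a"))  # True: equal once case is ignored
```

Under the first behavior, values that differ only in case are distinct; under the second, they are treated as equal when compared.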

@TomShawn
Contributor Author

@yikeke All comments are addressed, PTAL again, thanks!

Contributor

@yikeke yikeke left a comment

To align new commits in pingcap/docs-cn#3056

Comment thread reference/sql/characterset-and-collation.md Outdated
Comment thread reference/sql/characterset-and-collation.md Outdated
Comment thread reference/sql/characterset-and-collation.md Outdated
Comment thread reference/sql/characterset-and-collation.md Outdated
Co-authored-by: Keke Yi <40977455+yikeke@users.noreply.github.com>
@yikeke
Contributor

yikeke commented May 13, 2020

/merge

@sre-bot sre-bot added the status/can-merge label May 13, 2020
@sre-bot
Contributor

sre-bot commented May 13, 2020

/run-all-tests

@sre-bot sre-bot merged commit 096bdc3 into pingcap:master May 13, 2020
sre-bot pushed a commit to sre-bot/docs that referenced this pull request May 13, 2020
@sre-bot
Contributor

sre-bot commented May 13, 2020

cherry pick to release-4.0 in PR #2541

@TomShawn TomShawn deleted the collations branch May 13, 2020 08:50
yikeke pushed a commit that referenced this pull request May 13, 2020
* reference: update documents for new collation (#2350)

* Update reference/sql/characterset-and-collation.md

Co-authored-by: TomShawn <41534398+TomShawn@users.noreply.github.com>
Labels

  • size/large: Changes of a large size.
  • status/can-merge: Indicates a PR has been approved by a committer.
  • translation/from-docs-cn: This PR is translated from a PR in pingcap/docs-cn.
  • v4.0: This PR/issue applies to TiDB v4.0.

5 participants