Unicode escapes: support u{N...} #2823

Merged
andrewrk merged 3 commits into ziglang:master from hryx:unicode-escape on Jul 6, 2019

hryx (Contributor) commented Jul 5, 2019

Closes #2129

TODO

Notes

  • Any number of hex digits (one or more) is allowed inside the braces. The stage1 tokenizer retains its upper limit of 0x10ffff on the character value.
  • The old \uNNNN and \UNNNNNN syntaxes were removed.
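As a rough illustration of the rule described in the notes above (one or more hex digits in braces, value capped at 0x10ffff), here is a minimal C sketch. The function name `parse_unicode_escape` and its return convention are hypothetical and are not taken from the PR; the real tokenizers are state machines and are structured differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not code from the PR.
 * Parses an escape of the form \u{N...} at the start of `s`.
 * On success, stores the code point in *out and returns the number of
 * bytes consumed. Returns 0 if the escape is malformed, has no digits,
 * or encodes a value above 0x10ffff. */
size_t parse_unicode_escape(const char *s, uint32_t *out) {
    if (s[0] != '\\' || s[1] != 'u' || s[2] != '{') return 0;

    size_t i = 3;
    size_t digits = 0;
    uint32_t value = 0;

    for (;; i++) {
        char c = s[i];
        uint32_t d;
        if (c >= '0' && c <= '9') d = (uint32_t)(c - '0');
        else if (c >= 'a' && c <= 'f') d = (uint32_t)(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') d = (uint32_t)(c - 'A' + 10);
        else break; /* first non-hex character ends the digit run */

        value = value * 16 + d;
        /* Checking after every digit also keeps `value` from overflowing,
         * so any number of leading zeros is accepted safely. */
        if (value > 0x10ffff) return 0;
        digits++;
    }

    if (digits == 0 || s[i] != '}') return 0;
    *out = value;
    return i + 1; /* include the closing brace */
}
```

With this sketch, `\u{41}` and `\u{000041}` both yield 0x41, while `\u{}`, `\u{110000}`, and the removed `\u0041` form are rejected.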

<tr>
<td><code>\UNNNNNN</code></td>
<td>hexadecimal 24-bit Unicode character code UTF-8 encoded (6 digits)</td>
<td><code>\u{NNNNNN}</code></td>
Contributor Author:

Not sure what the clearest way to write this is. Could also be something like:

\u{N...}

Member:

I think the "1 or more digits" you have below is sufficient

},

State.CharLiteralHexEscape => switch (c) {
'0'...'9', 'a'...'z', 'A'...'F' => {
Contributor Author:

I assume this was a bug (found when new tests were added)

Member:

yep. thanks!
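For readers following the exchange above: a range like `'a'...'z'` accepts letters such as `g` that are not hexadecimal digits. The C sketch below contrasts the two ranges. Both function names are hypothetical, and this is not the Zig code from the PR; it only assumes the intended lowercase range is `a` through `f`.

```c
#include <stdbool.h>

/* Mirrors the over-wide range quoted in the diff: 0-9, a-z, A-F.
 * Hypothetical name, for illustration only. */
bool is_hex_digit_too_wide(char c) {
    return (c >= '0' && c <= '9') ||
           (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'F');
}

/* The conventional definition of a hex digit: 0-9, a-f, A-F. */
bool is_hex_digit(char c) {
    return (c >= '0' && c <= '9') ||
           (c >= 'a' && c <= 'f') ||
           (c >= 'A' && c <= 'F');
}
```

The two functions agree on every valid hex digit and differ only for lowercase `g` through `z`, which is the kind of discrepancy a test with a malformed escape would surface.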

},
},

State.CharLiteralUnicodeInvalid => switch (c) {
Contributor Author:

I got a little creative here because I thought this behavior might prevent some confusing error output. If it doesn't actually help, I'd be totally fine removing this special state.

Member:

Let's run with this and see what happens.

break;
}
if (t.char_code > 0x10ffff) {
tokenize_error(&t, "unicode value out of range: %x", t.char_code);
Contributor:

Move this down to the else below?

andrewrk (Member) left a comment:

Looks great, easy merge

andrewrk merged commit 21c6092 into ziglang:master on Jul 6, 2019
shawnl (Contributor) commented Jul 20, 2019

On neither stage1 nor stage2 did you reject UTF-16 surrogate code points, 0xd800 - 0xdfff.

hryx deleted the unicode-escape branch on July 20, 2019 19:59

hryx (Contributor Author) commented Jul 20, 2019

@shawnl The purpose of this PR was to change the grammar, not introduce new validation logic.
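As background to the exchange above, the validation being asked about would be a check separate from the grammar change. A minimal C sketch of such a check follows. It is not part of this PR, and the function name `is_unicode_scalar_value` is hypothetical. It combines the 0x10ffff upper bound from the stage1 excerpt with the surrogate range 0xd800 - 0xdfff mentioned in the comment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check, not code from the PR.
 * Returns true if `cp` is a Unicode scalar value: at most 0x10ffff
 * and outside the UTF-16 surrogate range 0xd800..0xdfff. */
bool is_unicode_scalar_value(uint32_t cp) {
    if (cp > 0x10ffff) return false;                /* beyond the Unicode range */
    if (cp >= 0xd800 && cp <= 0xdfff) return false; /* surrogate code point */
    return true;
}
```

A tokenizer that wanted this behavior could call such a check once the closing brace of the escape has been read.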

Linked issue: change \uXXXX \UXXXXXX string literal escape syntax to \u{}

4 participants