
Cannot match Unicode characters outside the BMP#1

Open
nylen wants to merge 2 commits into master from fix/chars-outside-bmp

nylen commented Jul 10, 2017

This is a bit of a mess...

PEG.js does not allow specifying characters outside the BMP (code points above U+FFFF, which do not fit in two bytes) in its grammars. JavaScript handles these as surrogate pairs, two 16-bit code units such as \ud83d\udca9. Presumably this works fine with vanilla PEG.js.
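To make the JavaScript side concrete, here is a small sketch using only standard string methods. The variable names are just for illustration and nothing here is taken from PEG.js itself.

```javascript
// U+1F4A9 written as a surrogate pair in a JavaScript string literal.
const poo = '\ud83d\udca9';

// .length counts UTF-16 code units, so the pair counts as two.
const codeUnits = poo.length;

// Iterating a string walks code points, so spreading yields one element.
const codePoints = [...poo].length;

// codePointAt(0) combines the pair back into the actual code point.
const hex = poo.codePointAt(0).toString(16);

console.log(codeUnits, codePoints, hex); // 2 1 1f4a9
```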

In phpegjs, we use PCRE to split the input into individual characters. This handles an emoji as a single character, for example \u{1f4a9}.

It would be very difficult to bridge this gap. We'd have to write logic (in JavaScript) to accept string literals containing surrogate pairs like \ud83d\udca9, calculate what their length would be in PHP, then adjust the generated code to account for the difference between the string length in PHP and in JavaScript.
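As a rough sketch of the length calculation described above: assuming the PHP side ends up counting one unit per code point (which is what splitting with PCRE suggests), the mismatch for a given literal could be computed like this. `lengthDelta` is a hypothetical helper, not something that exists in phpegjs or PEG.js.

```javascript
// Hypothetical helper: how many more units JavaScript counts for a string
// than a code-point-based count would (assumed here to match the PHP side).
function lengthDelta(str) {
  const jsLength = str.length;                    // UTF-16 code units
  const codePointLength = Array.from(str).length; // one per code point
  return jsLength - codePointLength;
}

// 'a', U+1F4A9 as a surrogate pair, 'b'
const literal = 'a\ud83d\udca9b';

console.log(literal.length, Array.from(literal).length, lengthDelta(literal)); // 4 3 1
```

This only covers the counting; the hard part remains adjusting the generated code wherever such a length is used.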

That still wouldn't handle matching character classes, because JavaScript and PEG.js would have to use two separate character classes like [\ud83d][\udca9], but PHP would have to use [\x{1f4a9}] instead.
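For illustration only, turning one surrogate pair into the PCRE-style escape is the easy part; the sketch below does the standard UTF-16 arithmetic. `surrogatePairToPcreEscape` is a hypothetical name, and this does nothing about rewriting a whole character class or its ranges.

```javascript
// Hypothetical helper: combine a high and low surrogate into one code point
// and format it as a PCRE-style \x{...} escape.
function surrogatePairToPcreEscape(high, low) {
  if (high < 0xd800 || high > 0xdbff || low < 0xdc00 || low > 0xdfff) {
    throw new RangeError('not a valid surrogate pair');
  }
  // Standard UTF-16 decoding: 10 bits from each half, offset by 0x10000.
  const codePoint = (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
  return '\\x{' + codePoint.toString(16) + '}';
}

const escaped = surrogatePairToPcreEscape(0xd83d, 0xdca9);
console.log(escaped); // \x{1f4a9}
```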

Fortunately this is only going to be a problem if your grammar itself needs to match these characters. Hopefully you don't need to do this.
