Unicode codepoint flags for custom regexs #7245
Looks like the tokenizer tests are failing on Windows for some reason:
I can't debug this locally. Is it possible to skip all but the failing test? I have reviewed the previous logs, but that test was not executed, so I think I'm going to start from a clean point and redo all commits until I see the failure. Also, I found that compiling tests with
The problem is the stack size limit on Windows. According to the MSVC /STACK documentation:
afcbcb5 to 6ca6c46
I think I'm done here. Now I have the base to fix tokenizers.
* Replace CODEPOINT_TYPE_* with codepoint_flags
* Update and bugfix brute force random test
* Deterministic brute force random test
* Unicode normalization NFD
* Get rid of BOM

Use flags for each Unicode category (`\p{N}`, `\p{L}`, `\p{Z}`, ...) instead of the `CODEPOINT_TYPE_*` definitions.
It also includes helper flags for common regex params like `\s` (only this for now), `\d`, `\w`...
This simplifies writing custom regexs.
All flags are precomputed in `unicode-data.cpp`, which is generated by `gen-unicode-data.py`.
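
As a rough illustration of the idea, and not the PR's actual code, a per-codepoint flag set might look like the sketch below. The names `cpt_flags_sketch` and `lookup_flags_sketch` are hypothetical, the bit layout is an assumption, and the lookup only handles a few ASCII codepoints where the real data would come from the precomputed table.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: one bit per Unicode general category,
// plus a helper bit for a common regex class. Not the PR's real layout.
struct cpt_flags_sketch {
    enum : uint16_t {
        NUMBER     = 1 << 0, // \p{N}
        LETTER     = 1 << 1, // \p{L}
        SEPARATOR  = 1 << 2, // \p{Z}
        WHITESPACE = 1 << 8, // helper flag for \s
    };

    uint16_t bits = 0;

    // True if any of the bits in `mask` is set.
    bool has(uint16_t mask) const { return (bits & mask) != 0; }
};

// Toy lookup covering a handful of ASCII codepoints only (assumption:
// a real implementation would read a precomputed table instead).
inline cpt_flags_sketch lookup_flags_sketch(uint32_t cpt) {
    cpt_flags_sketch f;
    if (cpt >= U'0' && cpt <= U'9') {
        f.bits |= cpt_flags_sketch::NUMBER;
    }
    if ((cpt >= U'A' && cpt <= U'Z') || (cpt >= U'a' && cpt <= U'z')) {
        f.bits |= cpt_flags_sketch::LETTER;
    }
    if (cpt == U' ') {
        // U+0020 is both a separator (Zs) and whitespace.
        f.bits |= cpt_flags_sketch::SEPARATOR | cpt_flags_sketch::WHITESPACE;
    }
    if (cpt == U'\t' || cpt == U'\n' || cpt == U'\r' || cpt == U'\v' || cpt == U'\f') {
        // Control characters that \s matches but \p{Z} does not.
        f.bits |= cpt_flags_sketch::WHITESPACE;
    }
    return f;
}
```

With flags like these, a custom regex class such as "letter or number" becomes a single mask test, e.g. `lookup_flags_sketch(cpt).has(cpt_flags_sketch::LETTER | cpt_flags_sketch::NUMBER)`, instead of a comparison against several separate type constants.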