Introduce UTF matchers into std.uni. #1685
|
Interesting. This will probably be of great use in I'd be interested in seeing its speed with / without this. |
|
The problem is the poor speed of this stuff with DMD. It's not that bad, but it doesn't improve beyond decode + Trie lookup. I'm not sure how we are going to solve this problem, but DMD really needs to grow a better inliner. |
Will try to get to it sometime at the weekend. |
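For reference, the "decode + lookup" baseline mentioned above might look roughly like the sketch below. It is only an illustration: `countWhite` is a hypothetical helper, and the set's own `opIndex` lookup stands in for a real Trie.

```d
import std.uni : unicode;
import std.utf : decode;

// Hypothetical helper: counts white-space code points by decoding each
// code point first and only then classifying it.
size_t countWhite(string s)
{
    auto set = unicode.WhiteSpace; // assumption: set lookup stands in for a Trie
    size_t idx = 0;
    size_t n = 0;
    while (idx < s.length)
    {
        dchar ch = decode(s, idx); // step 1: decode, advances idx
        if (set[ch])               // step 2: classify the decoded code point
            ++n;
    }
    return n;
}

void main()
{
    // space, no-break space (U+00A0) and newline are white space
    assert(countWhite("a b\u00A0c\n") == 3);
}
```

A UTF matcher folds the two steps into one lookup over the raw code units, which is where the hoped-for speedup comes from.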
|
Very preliminary ball-park tests on Bad news: the new version is not @safe, nor is it pure. Anyhow, the benchmark is here, feel free to test it: |
Nice.
:/
What about CTFE? Almost everything in std.string is |
|
Great stuff. |
I didn't test it directly yet, but my WIP on std.regex uses the same kinds of Tries for ctRegex and it compiles just fine. |
We should really find a public space for this in phobos.
We should really find a public space for this in phobos.
I had submitted it in:
#1440
However, we realized there was an even more generic thing that could make a compile time tuple out of any range, eg: toTypeTuple!(iota(0, 3)).
#1472
It's now kind of stuck on "do we want to introduce something new with the name tuple? TypeTuple or ExpressionTuple?".
In the meantime, Phobos devs can use staticIota.
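A minimal sketch of the "compile-time tuple out of a range" idea, assuming a Phobos recent enough to ship `std.meta.aliasSeqOf` (which postdates this discussion). `sum3` is a hypothetical helper used only for illustration.

```d
import std.meta : aliasSeqOf;
import std.range : iota;

// Hypothetical helper: sums a static array with a loop that is
// unrolled at compile time.
int sum3(int[3] a)
{
    int total = 0;
    foreach (i; aliasSeqOf!(iota(0, 3))) // i is a compile-time constant
        total += a[i];
    return total;
}

void main()
{
    assert(sum3([1, 2, 3]) == 6);
}
```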
|
Closing until updated to support const/pure/@safe |
|
Got back to this. In short: |
std/uni.d
Outdated
For what it's worth, since Prefix is a vararg, you shouldn't need staticIota. You can just:
foreach (i, _; Prefix)
    // Use i here
Furthermore (I think), since Prefix is a type tuple, each _ will simply alias the arg (not create a copy like in a range), so there should be strictly no overhead.
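A small self-contained sketch of that pattern, assuming Prefix is a tuple of types. `totalSize` is a hypothetical helper, not anything from the pull request.

```d
// Hypothetical helper: adds up the sizes of all types in Prefix.
size_t totalSize(Prefix...)()
{
    size_t total = 0;
    foreach (i, _; Prefix)         // unrolled; i is known at compile time
        total += Prefix[i].sizeof; // index the tuple directly, no staticIota
    return total;
}

void main()
{
    // 1 + 2 + 4 bytes
    assert(totalSize!(ubyte, ushort, uint)() == 7);
}
```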
Nice idea, will do.
|
Fixed the shameful auto mWhite8 = utfMatcher!char(unicode.WhiteSpace); at global scope. |
Nice. I wanted to ask: " |
|
Turns out this is a simple failure of compiler inference. It seems it doesn't infer attributes for free functions inside of a stand-alone template. Turns out this is impure: If I put This is what I need to distill as a bug report. |
|
For purity I actually just need the following to work at global scope:
immutable mWhite8 = utfMatcher!char(unicode.WhiteSpace);
No need to mess with construction not being pure. UPDATE: Anyhow, I've found a workaround that actually looks better than the original code :)
void main() pure
{
    import std.uni;
    auto s = CodepointSet(0x0, 0x7f);
} |
std/uni.d
Outdated
This could also be TypeTuple!(bool, char[1], clampIdx!(0, 7)), right?
It could be, however I handle ASCII separately down the road and plain char is simpler there.
|
It's pretty difficult to review this in depth (or rather, to find enough time to do so). I'm inclined to merge this so you can move on. |
|
@MartinNowak Actually, I want to try your suggestion with tuple and see how it goes performance-wise on LDC. That, and I'll have a go at simplifying sub-matchers. I'll ping back once I'm done with it. |
Thanks. |
It's a step zero to get decode-less std.regex. UTF matchers are efficient functors around a set of specific tries. They enable processing Unicode characters without decoding at speeds on par with decoding itself. Along the way, make staticIota 'package' protected and reuse it. Fix a shameful typo in setSearcher.
Granularity is horribly high. Auto-inference for templates has the downside that it leaves no explanations or reasons for failure.
std/uni.d
Outdated
s/combining decode and classify steps/combining the decoding and classification steps/
Spelling, style etc.
|
@andralex Hopefully all things covered. |
std/uni.d
Outdated
"Avoiding" cannot be a "building block". How about "Another useful approach to efficient Unicode-aware parsers is to avoid unnecessary..."
Also don't forget the typo: unnecesSary
"Avoiding" cannot be a "building block". How about "Another useful approach to efficient Unicode-aware parsers is to avoid unnecessary..."
Technique?
|
Thanks, fixed. |
Introduce UTF matchers into std.uni.
|
@blackwhale something went wrong so #2012 |
Revert "Merge pull request #1685 from blackwhale/utf8-matcher"
|
@blackwhale pliz reopen |
std.utf is not visible from here.
Thanks for pointing this out.
|
thx! |
I really like the idea of having a single matcher function and 3 UFCS functions for |
|
Revert-revert pull is #2020. |
It's a step zero to get decode-less std.regex.
UTF matchers are efficient functors around a set of
specific tries. They enable processing Unicode characters
without decoding at speeds on par with decoding itself.
Along the way make staticIota 'package' protected and reuse it.
Fixes a couple of shameful typos.
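
A rough usage sketch of a UTF matcher, assuming the `utfMatcher` and `match` names as they ended up in std.uni's public API. `stripLeftWhite` is a hypothetical helper for illustration only.

```d
import std.uni : unicode, utfMatcher;

// Hypothetical helper: drops leading white space without decoding,
// by matching directly on UTF-8 code units.
string stripLeftWhite(string s)
{
    auto white = utfMatcher!char(unicode.WhiteSpace);
    // match() consumes one code point on success and leaves s untouched
    // otherwise; the length check keeps it away from empty input.
    while (s.length && white.match(s)) {}
    return s;
}

void main()
{
    auto m = utfMatcher!char(unicode.Number);
    string truth = "2² = 4";
    assert(m.match(truth));   // '2' is a number
    assert(truth == "² = 4"); // input advanced past the match
    assert(m.match(truth));   // so is the superscript '2' (two code units)
    assert(!m.match(truth));  // space is not a number
    assert(truth == " = 4");  // unaffected when there is no match

    assert(stripLeftWhite(" \t\u00A0abc") == "abc");
}
```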
Some ballpark results with LDC.
DMD results suck badly, but it never optimized new Tries to begin with so nothing new there.
Tested:
Legend: name [file], count, time (s), throughput (Mb/s)