Skip to content

Introduce UTF matchers into std.uni.#1685

Merged
andralex merged 11 commits intodlang:masterfrom
DmitryOlshansky:utf8-matcher
Mar 15, 2014
Merged

Introduce UTF matchers into std.uni.#1685
andralex merged 11 commits intodlang:masterfrom
DmitryOlshansky:utf8-matcher

Conversation

@DmitryOlshansky
Copy link
Member

It's a step zero to get decode-less std.regex.

UTF matchers are efficient functors around a set of
specific tries. Processing Unicode characters
without decoding at speeds on par with decoding itself.

Along the way make staticIota 'package' protected and reuse it.
Fixes a couple of shameful typos.

Some ballpark results with LDC.
DMD results suck badly, but it never optimized new Tries to begin with so nothing new there.

Tested:

  1. new UTF matchers
  2. decode + tried and true internal trie std.regex
  3. decode + new std.uni trie
  4. decode-only
  5. count bytes > 0x20

Legend: name [file], count, time (s), throughput (Mb/s)

        m8-Alpha [arwiki],    273256 hits,   1.35852, 459.92 Mb/s
      trie-Alpha [arwiki],    273256 hits,   2.35152, 265.70 Mb/s
  new-trie-alpha [arwiki],    273256 hits,   2.27336, 274.84 Mb/s
     decode-only [arwiki],    342832 hits,   1.63872, 381.28 Mb/s
            noop [arwiki],    599328 hits,   0.30504, 2048.28 Mb/s

        m8-Alpha [dewiki],   1435456 hits,   3.84244, 442.65 Mb/s
        trie-Alpha [dewiki],   1435456 hits,   7.92868, 214.52 Mb/s
  new-trie-alpha [dewiki],   1435456 hits,    6.7128, 253.38 Mb/s
     decode-only [dewiki],   1680050 hits,    3.4514, 492.81 Mb/s
            noop [dewiki],   1607454 hits,   0.88392, 1924.23 Mb/s

        m8-Alpha [enwiki],   1152212 hits,   3.90008, 459.83 Mb/s
      trie-Alpha [enwiki],   1152212 hits,   7.90484, 226.87 Mb/s
  new-trie-alpha [enwiki],   1152212 hits,   6.95048, 258.02 Mb/s
     decode-only [enwiki],   1786517 hits,   3.53968, 506.65 Mb/s
            noop [enwiki],   1699856 hits,   0.97012, 1848.61 Mb/s

        m8-Alpha [ruwiki],    414460 hits,    2.0152, 452.02 Mb/s
      trie-Alpha [ruwiki],    414460 hits,   3.34324, 272.46 Mb/s
  new-trie-alpha [ruwiki],    414460 hits,   3.29996, 276.03 Mb/s
     decode-only [ruwiki],    496887 hits,   2.33088, 390.80 Mb/s
            noop [ruwiki],    884374 hits,   0.45652, 1995.31 Mb/s

@monarchdodra
Copy link
Collaborator

Interesting. This will probably be of great use in std.string.split (found in std.array), to split a string into non-white tokens. It used to be unicode in-correct, but the decode + isWhite loop gave it a real performance penalty.

I'd be interested in seeing it's speed with / without this.

@DmitryOlshansky
Copy link
Member Author

The problem is poor speed of the stuff with DMD. It's not that bad but doesn't improve beyond decode + Trie lookup. I'm not sure how we are going to solve this problem but DMD really needs to grow better inliner.

@DmitryOlshansky
Copy link
Member Author

... I'd be interested in seeing it's speed with / without this.

Will try to get to it sometime at the weekend.

@DmitryOlshansky
Copy link
Member Author

Very preliminary ball-park tests on split are quite optimistic with about x2 speed.
(And that's overall, i.e. appending included!)
Again DMD sucks and gets a speed of about 5-10% slower with new/new2 versions.

Bad news: new version it's not @safe, nor is it pure.
Purity can be obtained if I manage to make the UTF matcher usable as immutable/const.
@safety if I mark things as @trusted... Hm...

Anyhow benchmark is here, feel free to test it:
https://github.com/blackwhale/gsoc-bench-2012/blob/master/split.d
(I compile LDC from source with fresh uni.d replacing whatever version of it in LDC's Phobos)

dmitry@dmitry-VirtualBox ~ $ ldc2 -O3 -release split.d -ofsplit

dmitry@dmitry-VirtualBox ~ $ for i in 1 2 3 4 5 ; do ./split std ruwiki-latest-all-titles-in-ns0 ; done
Done 265270 pieces in 69448 us
Done 265270 pieces in 60479 us
Done 265270 pieces in 60794 us
Done 265270 pieces in 60064 us
Done 265270 pieces in 58714 us
dmitry@dmitry-VirtualBox ~ $ for i in 1 2 3 4 5 ; do ./split new ruwiki-latest-all-titles-in-ns0 ; done
Done 265270 pieces in 33896 us
Done 265270 pieces in 27098 us
Done 265270 pieces in 27965 us
Done 265270 pieces in 26815 us
Done 265270 pieces in 26059 us
dmitry@dmitry-VirtualBox ~ $ for i in 1 2 3 4 5 ; do ./split new2 ruwiki-latest-all-titles-in-ns0 ; done
Done 265270 pieces in 32793 us
Done 265270 pieces in 27294 us
Done 265270 pieces in 29296 us
Done 265270 pieces in 28838 us
Done 265270 pieces in 28315 us

@monarchdodra
Copy link
Collaborator

Very preliminary ball-park tests on split are quite optimistic with about x2 speed. (And that's overall, i.e. appending included!)

Nice.

Again DMD sucks and gets a speed of about 5-10% slower with new/new2 versions.

:/

Bad news: new version it's not @safe, nor is it pure.
Purity can be obtained if I manage to make the UTF matcher usable as immutable/const.
@safety if I mark things as @trusted... Hm...

What about CTFE? Almost everything in std.string is @safe, pure and CTFE.

@MartinNowak
Copy link
Member

Great stuff.

@DmitryOlshansky
Copy link
Member Author

What about CTFE? Almost everything in std.string is @safe, pure and CTFE

I did't test it directly yet but my WIP on std.regex uses the same kinds of Tries for ctRegex and it compiles just fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should really find a public space for this in phobos.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should really find a public space for this in phobos.

I had submitted it in:
#1440

However, we realized there was an even more generic thing that could make a compile time tuple out of any range, eg: toTypeTuple!(iota(0, 3)).
#1472

It's now kind of stuck on "do we want to introduce something new with the name tuple? TypeTuple or ExpressionTuple?".

In the mean time, Phobos devs can use staticIota.

@DmitryOlshansky
Copy link
Member Author

Closing until updated to support const/pure/@safe

@MartinNowak
Copy link
Member

Be careful with @safe, it enables bounds checking by default unless you explicitly use -noboundscheck.
So in very performance sensitive areas you might want to use @trusted and manual bounds checking.

@DmitryOlshansky
Copy link
Member Author

Got back to this. In short:
pure is hard to reach so postponed for another pull.
Got @safety by wrapping entry points with @trusted blocks.

std/uni.d Outdated
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it's worth, since Prefix is a vararg, you shoudn't need staticIota. You can just:

foreach (i, _; Prefix)
    //Use i here

Furthermore (I think), since Prefix is a type tuple, then each _ will simply alias to the arg (not create a copy like in a range), so there should be strictly no overhead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice idea, will do.

@DmitryOlshansky
Copy link
Member Author

Fixed the shameful trancate and dropped staticIota.
BTW turns out building utf matcher is CTFE-able, so this now works:

auto mWhite8 = utfMatcher!char(unicode.WhiteSpace);

at global scope.

@monarchdodra
Copy link
Collaborator

so this now works

Nice. I wanted to ask: "@@@BUG@@@ sort is not pure" is this filed? Seems like a pretty serious restriction. What exactly is sort doing that is making it impure?

@DmitryOlshansky
Copy link
Member Author

Turns out this is simple failure of compiler inference. It doesn't infer free functions inside of a stand-alone template it seems.

Turns out this is impure:
https://github.com/D-Programming-Language/phobos/blob/master/std/algorithm.d#L9254
Because it doesn't infer anything for functions inside of this template:
https://github.com/D-Programming-Language/phobos/blob/master/std/algorithm.d#L9727

If I put pure: here it indeed magically makes sort pure, at least for integers. It however would break for any non-pure type.

This is what I need to distill as a bug report.

@DmitryOlshansky
Copy link
Member Author

For purity I actually just need the following to work at global scope:

immutable mWhite8 = utfMatcher!char(unicode.WhiteSpace);

no need to mess with construction being not pure.

UPDATE:
https://d.puremagic.com/issues/show_bug.cgi?id=12265

Anyhow I've found a workaround that actually looks better then original code:)
This now works:

void main() pure
{
    import std.uni;
    auto s = CodepointSet(0x0, 0x7f);
}

std/uni.d Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could also be TypeTuple!(bool, char[1], clampIdx!(0, 7)), right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be, however I handle ASCII separately down the road and plain char is simpler there.

@MartinNowak
Copy link
Member

It's pretty difficult to review this in depth (or rather finding enough time to do so). I'm inclined to merge this so you can move on.

@DmitryOlshansky
Copy link
Member Author

@MartinNowak Actually I want to try your suggestion with tuple and see how it goes performance wise on LDC. That and I'll have a go at simplifying sub-matchers. I'll ping back once I'm done with it.

@MartinNowak
Copy link
Member

@MartinNowak Actually I want to try your suggestion with tuple and see how it goes performance wise on LDC. That and I'll have a go at simplifying sub-matchers. I'll ping back once I'm done with it.

Thanks.

It's a step zero to get decode-less std.regex.
UTF matchers are efficient functors around a set of
specific tries. Enables processing Unicode characters
without decoding at speeds on par with decoding itself.

Along the way make staticIota at 'package' protected and reuse it.

Fix a shameful typo in setSearcher.
Granularity is horribly high. Auto-inference for templates has the
downside that it, leaves no explanations or reasons for failure.
std/uni.d Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/combining decode and classify steps/combining the decoding and classification steps/

Spelling, style etc.
@DmitryOlshansky
Copy link
Member Author

@andralex Hopefully all things covered.

std/uni.d Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Avoiding" cannot be a "building block". How about "Another useful approach to efficient Unicode-aware parsers is to avoid unnecessary..."

Also don't forget the typo: unnecesSary

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Avoiding" cannot be a "building block". How about "Another useful approach to efficient Unicode-aware parsers is to avoid unnecessary..."

Technique?

@DmitryOlshansky
Copy link
Member Author

Thanks, fixed.

andralex added a commit that referenced this pull request Mar 15, 2014
Introduce UTF matchers into std.uni.
@andralex andralex merged commit 216ca01 into dlang:master Mar 15, 2014
andralex added a commit that referenced this pull request Mar 16, 2014
This reverts commit 216ca01, reversing
changes made to d56c1db.
@andralex
Copy link
Member

@blackwhale something went wrong so #2012

WalterBright added a commit that referenced this pull request Mar 16, 2014
Revert "Merge pull request #1685 from blackwhale/utf8-matcher"
@9rnsr
Copy link
Contributor

9rnsr commented Mar 16, 2014

The ICE was caused by the regression 12376, and it was exposed by the merge of #2256.

@andralex
Copy link
Member

@blackwhale pliz reopen

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std.utf is not visible from here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out.

@DmitryOlshansky
Copy link
Member Author

@andralex I can't reopen merged pull, see #2020

@andralex
Copy link
Member

thx!

@MartinNowak
Copy link
Member

@MartinNowak I was able to test the new implementation with tuples (or rather POD structs since tuples are unusable in std.uni). Sadly I'm seeing awful performance with both LDC and DMD.

The fork is here:
https://github.com/blackwhale/phobos/tree/utf-matcher-2

I really like the idea of having a single matcher function and 3 UFCS functions for match, skip and test. It definitely avoids any unnecessary code duplication.
I've commented on the utf-matcher-2 branch with two more ideas to improve the performance, hopefully it will suffice.

@MartinNowak
Copy link
Member

Revert-revert pull is #2020.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants