Introduce UTF matchers into std.uni. by DmitryOlshansky · Pull Request #1685 · dlang/phobos

DmitryOlshansky · 2013-11-10T20:11:37Z

This is step zero toward a decode-less std.regex.

UTF matchers are efficient functors around a set of
specific tries. They process Unicode characters
without decoding, at speeds on par with decoding itself.

Along the way, make staticIota 'package' protected and reuse it.
Also fixes a couple of shameful typos.

Some ballpark results with LDC follow.
DMD results suck badly, but it never optimized the new Tries to begin with, so nothing new there.

Tested, in the order the rows appear in each block below:

new UTF matchers
decode + the tried-and-true internal trie of std.regex
decode + new std.uni trie
decode-only
count bytes > 0x20

Legend: name [file], count, time (s), throughput (Mb/s)

        m8-Alpha [arwiki],    273256 hits,   1.35852, 459.92 Mb/s
      trie-Alpha [arwiki],    273256 hits,   2.35152, 265.70 Mb/s
  new-trie-alpha [arwiki],    273256 hits,   2.27336, 274.84 Mb/s
     decode-only [arwiki],    342832 hits,   1.63872, 381.28 Mb/s
            noop [arwiki],    599328 hits,   0.30504, 2048.28 Mb/s

        m8-Alpha [dewiki],   1435456 hits,   3.84244, 442.65 Mb/s
      trie-Alpha [dewiki],   1435456 hits,   7.92868, 214.52 Mb/s
  new-trie-alpha [dewiki],   1435456 hits,    6.7128, 253.38 Mb/s
     decode-only [dewiki],   1680050 hits,    3.4514, 492.81 Mb/s
            noop [dewiki],   1607454 hits,   0.88392, 1924.23 Mb/s

        m8-Alpha [enwiki],   1152212 hits,   3.90008, 459.83 Mb/s
      trie-Alpha [enwiki],   1152212 hits,   7.90484, 226.87 Mb/s
  new-trie-alpha [enwiki],   1152212 hits,   6.95048, 258.02 Mb/s
     decode-only [enwiki],   1786517 hits,   3.53968, 506.65 Mb/s
            noop [enwiki],   1699856 hits,   0.97012, 1848.61 Mb/s

        m8-Alpha [ruwiki],    414460 hits,    2.0152, 452.02 Mb/s
      trie-Alpha [ruwiki],    414460 hits,   3.34324, 272.46 Mb/s
  new-trie-alpha [ruwiki],    414460 hits,   3.29996, 276.03 Mb/s
     decode-only [ruwiki],    496887 hits,   2.33088, 390.80 Mb/s
            noop [ruwiki],    884374 hits,   0.45652, 1995.31 Mb/s

monarchdodra · 2013-11-10T20:39:11Z

Interesting. This will probably be of great use in std.string.split (found in std.array), which splits a string into non-whitespace tokens. It used to be Unicode-incorrect, and fixing that with a decode + isWhite loop gave it a real performance penalty.

I'd be interested in seeing its speed with / without this.

DmitryOlshansky · 2013-11-10T20:44:53Z

The problem is the poor speed of this stuff with DMD. It's not that bad, but it doesn't improve beyond decode + Trie lookup. I'm not sure how we are going to solve this problem, but DMD really needs to grow a better inliner.

DmitryOlshansky · 2013-11-28T20:53:42Z

> ... I'd be interested in seeing its speed with / without this.

Will try to get to it sometime over the weekend.

DmitryOlshansky · 2013-12-03T18:49:45Z

Very preliminary ballpark tests on split are quite optimistic, with about a 2x speedup.
(And that's overall, i.e. appending included!)
Again DMD sucks and ends up about 5-10% slower with the new/new2 versions.

Bad news: the new version is not @safe, nor is it pure.
Purity can be obtained if I manage to make the UTF matcher usable as immutable/const.
@safety if I mark things as @trusted... Hm...

Anyhow benchmark is here, feel free to test it:
https://github.com/blackwhale/gsoc-bench-2012/blob/master/split.d
(I compile LDC from source with a fresh uni.d replacing whatever version of it is in LDC's Phobos)

dmitry@dmitry-VirtualBox ~ $ ldc2 -O3 -release split.d -ofsplit

dmitry@dmitry-VirtualBox ~ $ for i in 1 2 3 4 5 ; do ./split std ruwiki-latest-all-titles-in-ns0 ; done
Done 265270 pieces in 69448 us
Done 265270 pieces in 60479 us
Done 265270 pieces in 60794 us
Done 265270 pieces in 60064 us
Done 265270 pieces in 58714 us
dmitry@dmitry-VirtualBox ~ $ for i in 1 2 3 4 5 ; do ./split new ruwiki-latest-all-titles-in-ns0 ; done
Done 265270 pieces in 33896 us
Done 265270 pieces in 27098 us
Done 265270 pieces in 27965 us
Done 265270 pieces in 26815 us
Done 265270 pieces in 26059 us
dmitry@dmitry-VirtualBox ~ $ for i in 1 2 3 4 5 ; do ./split new2 ruwiki-latest-all-titles-in-ns0 ; done
Done 265270 pieces in 32793 us
Done 265270 pieces in 27294 us
Done 265270 pieces in 29296 us
Done 265270 pieces in 28838 us
Done 265270 pieces in 28315 us

monarchdodra · 2013-12-19T13:27:16Z

> Very preliminary ballpark tests on split are quite optimistic, with about a 2x speedup. (And that's overall, i.e. appending included!)

Nice.

> Again DMD sucks and ends up about 5-10% slower with the new/new2 versions.

:/

> Bad news: the new version is not @safe, nor is it pure.
> Purity can be obtained if I manage to make the UTF matcher usable as immutable/const.
> @safety if I mark things as @trusted... Hm...

What about CTFE? Almost everything in std.string is @safe, pure and CTFE.

MartinNowak · 2013-12-19T20:32:02Z

Great stuff.

DmitryOlshansky · 2013-12-19T20:53:28Z

> What about CTFE? Almost everything in std.string is @safe, pure and CTFE

I didn't test it directly yet, but my WIP on std.regex uses the same kinds of Tries for ctRegex and it compiles just fine.

MartinNowak · 2013-12-19T23:20:03Z

std/typecons.d

We should really find a public space for this in phobos.

> We should really find a public space for this in phobos.

I had submitted it in:
#1440

However, we realized there was an even more generic thing that could make a compile-time tuple out of any range, e.g. toTypeTuple!(iota(0, 3)).
#1472

It's now kind of stuck on "do we want to introduce something new with the name tuple? TypeTuple or ExpressionTuple?".

In the meantime, Phobos devs can use staticIota.

DmitryOlshansky · 2014-01-19T21:53:03Z

Closing until updated to support const/pure/@safe

MartinNowak · 2014-01-19T21:59:16Z

Be careful with @safe, it enables bounds checking by default unless you explicitly use -noboundscheck.
So in very performance sensitive areas you might want to use @trusted and manual bounds checking.

DmitryOlshansky · 2014-02-23T18:19:41Z

Got back to this. In short:
pure is hard to reach so postponed for another pull.
Got @safety by wrapping entry points with @trusted blocks.
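
As an illustration of the two comments above, here is a minimal sketch of an @safe entry point that does its own bounds check and confines the unchecked access to a small @trusted lambda. The function name `lookup`, the flat `ubyte` table, and the choice to return 0 for an out-of-range index are all hypothetical; none of this is taken from the actual std.uni code.

```d
// Hypothetical helper, not part of std.uni: an @safe entry point that
// performs a manual bounds check and keeps the unchecked read inside a
// narrow @trusted lambda.
ubyte lookup(const(ubyte)[] table, size_t i) @safe
{
    // Manual bounds check (assumption: out-of-range indices map to 0).
    if (i >= table.length)
        return 0;
    // Indexing through .ptr is not bounds-checked; the check above is
    // what justifies marking this small expression as @trusted.
    return (() @trusted => table.ptr[i])();
}

void main() @safe
{
    immutable ubyte[] table = [10, 20, 30];
    assert(lookup(table, 1) == 20);
    assert(lookup(table, 3) == 0); // out of range, handled by the manual check
}
```

Keeping the @trusted part this small means only one expression has to be audited by hand, while callers still see an @safe interface.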

monarchdodra · 2014-02-24T20:42:18Z

std/uni.d

For what it's worth, since Prefix is a vararg, you shouldn't need staticIota. You can just:

foreach (i, _; Prefix) //Use i here

Furthermore (I think), since Prefix is a type tuple, each _ will simply alias to the arg (not create a copy like in a range), so there should be strictly no overhead.

Nice idea, will do.
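
A minimal sketch of the suggested pattern, using a made-up `totalSize` template rather than the real Prefix code in std.uni: iterating over a variadic type list with an index is unrolled at compile time, so no staticIota is needed to get the index.

```d
// Hypothetical example: sums the sizes of all types in a variadic
// template parameter list, using the (index, element) form of foreach.
size_t totalSize(Prefix...)()
{
    size_t total = 0;
    foreach (i, T; Prefix) // unrolled at compile time, one body per type
    {
        // i is a compile-time constant here, so it can index the list.
        static assert(is(Prefix[i] == T));
        total += T.sizeof;
    }
    return total;
}

void main()
{
    // int (4) + char (1) + double (8)
    assert(totalSize!(int, char, double)() == 13);
}
```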

DmitryOlshansky · 2014-02-25T21:18:18Z

Fixed the shameful trancate and dropped staticIota.
BTW, it turns out building a UTF matcher is CTFE-able, so this now works:

auto mWhite8 = utfMatcher!char(unicode.WhiteSpace);

at global scope.
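
For readers unfamiliar with the matcher interface, here is a small sketch of how such a matcher can be used, based on the match/skip/test operations documented for std.uni matchers. The helper `countWhite` is hypothetical, the matcher is built inside a function rather than at global scope to keep the sketch independent of CTFE, and the input is assumed to be valid UTF-8.

```d
import std.uni : unicode, utfMatcher;

// Hypothetical helper: counts White_Space code points in valid UTF-8
// input without decoding to dchar first.
size_t countWhite(string s)
{
    auto mWhite8 = utfMatcher!char(unicode.White_Space);
    size_t n = 0;
    while (s.length != 0)
    {
        // skip always advances past one code point and reports
        // whether that code point belonged to the set.
        if (mWhite8.skip(s))
            ++n;
    }
    return n;
}

void main()
{
    auto m = utfMatcher!char(unicode.White_Space);
    string s = " \u00A0x";   // space, NO-BREAK SPACE (two UTF-8 bytes), 'x'
    assert(m.match(s));      // match consumes the code point on success
    assert(s == "\u00A0x");
    assert(m.match(s));      // the two-byte NO-BREAK SPACE matches as well
    assert(s == "x");
    assert(!m.test(s));      // test never advances the input
    assert(s == "x");
    assert(countWhite("a b\u00A0c") == 2);
}
```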

monarchdodra · 2014-02-25T22:05:27Z

> so this now works

Nice. I wanted to ask: is "@@@BUG@@@ sort is not pure" filed? Seems like a pretty serious restriction. What exactly is sort doing that is making it impure?

DmitryOlshansky · 2014-02-26T08:50:25Z

It turns out this is a simple failure of compiler inference. It seems the compiler doesn't infer attributes for free functions inside a stand-alone template.

This turns out to be impure:
https://github.com/D-Programming-Language/phobos/blob/master/std/algorithm.d#L9254
because nothing is inferred for the functions inside this template:
https://github.com/D-Programming-Language/phobos/blob/master/std/algorithm.d#L9727

If I put pure: there, it indeed magically makes sort pure, at least for integers. However, it would break for any non-pure type.

This is what I need to distill into a bug report.

DmitryOlshansky · 2014-02-26T09:19:20Z

For purity I actually just need the following to work at global scope:

immutable mWhite8 = utfMatcher!char(unicode.WhiteSpace);

no need to mess with construction being not pure.

UPDATE:
https://d.puremagic.com/issues/show_bug.cgi?id=12265

Anyhow, I've found a workaround that actually looks better than the original code :)
This now works:

void main() pure
{
    import std.uni;
    auto s = CodepointSet(0x0, 0x7f);
}

MartinNowak · 2014-02-27T19:40:08Z

std/uni.d

This could also be TypeTuple!(bool, char[1], clampIdx!(0, 7)), right?

It could be, however I handle ASCII separately down the road and plain char is simpler there.

MartinNowak · 2014-02-28T01:58:12Z

It's pretty difficult to review this in depth (or rather, to find enough time to do so). I'm inclined to merge this so you can move on.

DmitryOlshansky · 2014-02-28T09:08:43Z

@MartinNowak Actually I want to try your suggestion with tuple and see how it goes performance wise on LDC. That and I'll have a go at simplifying sub-matchers. I'll ping back once I'm done with it.

MartinNowak · 2014-02-28T19:19:44Z

> @MartinNowak Actually I want to try your suggestion with tuple and see how it goes performance wise on LDC. That and I'll have a go at simplifying sub-matchers. I'll ping back once I'm done with it.

Thanks.

It's a step zero to get decode-less std.regex. UTF matchers are efficient functors around a set of specific tries. Enables processing Unicode characters without decoding at speeds on par with decoding itself. Along the way make staticIota 'package' protected and reuse it. Fix a shameful typo in setSearcher.

Granularity is horribly high. Auto-inference for templates has the downside that it leaves no explanations or reasons for failure.

andralex · 2014-03-15T17:26:35Z

std/uni.d

s/combining decode and classify steps/combining the decoding and classification steps/

DmitryOlshansky · 2014-03-15T18:43:54Z

@andralex Hopefully all things covered.

andralex · 2014-03-15T19:34:26Z

std/uni.d

"Avoiding" cannot be a "building block". How about "Another useful approach to efficient Unicode-aware parsers is to avoid unnecessary..."

Also don't forget the typo: unnecesSary

> "Avoiding" cannot be a "building block". How about "Another useful approach to efficient Unicode-aware parsers is to avoid unnecessary..."

Technique?

DmitryOlshansky · 2014-03-15T21:35:36Z

Thanks, fixed.

andralex · 2014-03-16T00:24:01Z

@blackwhale something went wrong so #2012

Revert "Merge pull request #1685 from blackwhale/utf8-matcher"

9rnsr · 2014-03-16T04:28:23Z

The ICE was caused by the regression 12376, and it was exposed by the merge of #2256.

andralex · 2014-03-16T04:29:11Z

@blackwhale pliz reopen

9rnsr · 2014-03-16T04:29:20Z

std/uni.d

std.utf is not visible from here.

Thanks for pointing this out.

DmitryOlshansky · 2014-03-18T19:51:23Z

@andralex I can't reopen merged pull, see #2020

andralex · 2014-03-18T20:53:17Z

thx!

MartinNowak · 2014-04-06T14:47:23Z

> @MartinNowak I was able to test the new implementation with tuples (or rather POD structs since tuples are unusable in std.uni). Sadly I'm seeing awful performance with both LDC and DMD.
>
> The fork is here:
> https://github.com/blackwhale/phobos/tree/utf-matcher-2

I really like the idea of having a single matcher function and 3 UFCS functions for match, skip and test. It definitely avoids any unnecessary code duplication.
I've commented on the utf-matcher-2 branch with two more ideas to improve the performance; hopefully that will suffice.

MartinNowak · 2014-04-06T14:48:01Z

Revert-revert pull is #2020.

MartinNowak reviewed Dec 19, 2013

DmitryOlshansky closed this Jan 19, 2014

DmitryOlshansky reopened this Feb 23, 2014

monarchdodra reviewed Feb 24, 2014

MartinNowak reviewed Feb 27, 2014

DmitryOlshansky added 6 commits March 8, 2014 13:55

workaround stable sort (std.move) not CTFE-able

d0e408d

const/pure annotations

cbd9bd3

purify std.uni constructs, blocked by std.algortihm.sort

c264bd6

@safety bags for utfMatchers

1e771c0

Granularity is horribly high. Auto-inference for templates has the downside that it leaves no explanations or reasons for failure.

fold in review comments

528099a

andralex reviewed Mar 15, 2014

address review issues

0d60e46

Spelling, style etc.

andralex reviewed Mar 15, 2014

another DDoc tweak

1b03e40

andralex added a commit that referenced this pull request Mar 15, 2014

Merge pull request #1685 from blackwhale/utf8-matcher

216ca01

Introduce UTF matchers into std.uni.

andralex merged commit 216ca01 into dlang:master Mar 15, 2014

andralex added a commit that referenced this pull request Mar 16, 2014

Revert "Merge pull request #1685 from blackwhale/utf8-matcher"

e289a7c

This reverts commit 216ca01, reversing changes made to d56c1db.

WalterBright mentioned this pull request Mar 16, 2014

Revert "Merge pull request #1685 from blackwhale/utf8-matcher" #2012

Merged

WalterBright added a commit that referenced this pull request Mar 16, 2014

Merge pull request #2012 from D-Programming-Language/fixmaster

7f550f9

Revert "Merge pull request #1685 from blackwhale/utf8-matcher"

9rnsr reviewed Mar 16, 2014

DmitryOlshansky mentioned this pull request Mar 18, 2014

UTF matchers in std.uni #2020

Merged
