UTF matchers in std.uni #2020

Merged
MartinNowak merged 14 commits into dlang:master from DmitryOlshansky:utf-matcher
Apr 23, 2014

Conversation

@DmitryOlshansky
Member

Second try. See also pull #1685

std/uni.d Outdated
Member

public?

@andralex
Member

Will leave API review to UTF experts. For my money I wonder where the generated tries come from? Should we include them in the code, or use ctfe? Also since you have style corrections maybe time comes near to bring this in line with phobos e.g. a+1 => a + 1 etc.

@DmitryOlshansky
Member Author

> For my money I wonder where the generated tries come from?

As with the rest of std.uni, the user is in the front seat and picks any CodepointSet he/she happens to like. Then one way is to build a Trie with, say, toTrie!2 (or 3/4 levels) that handles dchars, or to build the (new) UTF matcher for ranges of char/wchar.

> Should we include them in the code, or use ctfe?

Anything is possible, including generating a matcher based on run-time input or doing the whole thing via CTFE. As noted before, e.g. `auto m = utfMatcher!char(unicode.WhiteSpace);` works at module scope.
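
A minimal sketch of the two routes described above, assuming the std.uni API as it stands after this pull (`toTrie`, `utfMatcher`, and the `test` method). The helper names `isAlphaViaTrie` and `startsWithSpace` are hypothetical and exist only for illustration. Both objects are built at run time here; the module-scope (CTFE) form mentioned above is not exercised by this sketch.

```d
import std.uni;

// Hypothetical helper: per-dchar lookup through a 2-level trie.
bool isAlphaViaTrie(dchar ch)
{
    // Built on every call to keep the sketch short; real code would cache it.
    auto trie = toTrie!2(unicode.Alphabetic);
    return trie[ch];
}

// Hypothetical helper: checks the first code point of UTF-8 input
// directly on the code units, without decoding to dchar first.
bool startsWithSpace(string s)
{
    // White_Space is the Unicode property name; the comment above
    // spells the same set as WhiteSpace.
    auto ws = utfMatcher!char(unicode.White_Space);
    return s.length != 0 && ws.test(s);
}
```

The trie answers questions about already decoded `dchar` values, while the matcher works on the `char` range itself, which is the point of the pull.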

@andralex
Member

Great. Anyhow more line breaks in the generated tries would be nice.

@DmitryOlshansky
Member Author

> Anyhow more line breaks in the generated tries would be nice.

Sure, I have it on my list.

@CyberShadow
Member

This pull request conflicts with dlang/dmd#3399:

std\uni.d(4604): Warning: The comma operator will be deprecated
std\uni.d(4667): Warning: The comma operator will be deprecated
std\uni.d(4682): Warning: The comma operator will be deprecated
std\uni.d(4743): Warning: The comma operator will be deprecated
std\uni.d(4784): Warning: The comma operator will be deprecated
std\uni.d(4865): Warning: The comma operator will be deprecated
std\uni.d(4880): Warning: The comma operator will be deprecated
std\uni.d(4961): Warning: The comma operator will be deprecated
std\uni.d(4982): Warning: The comma operator will be deprecated

@MartinNowak
Member

Let's give the simplified interface of utf-matcher-2 another try.

@DmitryOlshansky
Member Author

@MartinNowak

Still sucks ...
My hacked copy of LDC 0.13 alpha2 source:
http://1drv.ms/1ewOWoL
Numbers do not inspire confidence, and with DMD we don't even reach 95Mb/s.

====================
        m8-Alpha [enwiki],   1152212 hits,   12.1373, 147.76 Mb/s
         m8-Mark [enwiki],        67 hits,   12.0029, 149.41 Mb/s
       m8-Symbol [enwiki],      1196 hits,   12.1377, 147.75 Mb/s
       m8-Number [enwiki],    289430 hits,    11.829, 151.61 Mb/s
      trie-Alpha [enwiki],   1152212 hits,   6.76636, 265.04 Mb/s
  new-trie-alpha [enwiki],   1152212 hits,   7.03228, 255.02 Mb/s
     decode-only [enwiki],   1786517 hits,   2.95148, 607.62 Mb/s
            noop [enwiki],   1699856 hits,   0.90752, 1976.13 Mb/s
====================
        m8-Alpha [ruwiki],    414460 hits,   4.78436, 190.39 Mb/s
         m8-Mark [ruwiki],         1 hits,   4.77492, 190.77 Mb/s
       m8-Symbol [ruwiki],        18 hits,   4.79576, 189.94 Mb/s
       m8-Number [ruwiki],      2403 hits,   4.76172, 191.30 Mb/s
      trie-Alpha [ruwiki],    414460 hits,     3.218, 283.06 Mb/s
  new-trie-alpha [ruwiki],    414460 hits,   3.36564, 270.65 Mb/s
     decode-only [ruwiki],    496887 hits,   2.26156, 402.78 Mb/s
            noop [ruwiki],    884374 hits,    0.5076, 1794.53 Mb/s
====================
        m8-Alpha [dewiki],   1435456 hits,   11.4893, 148.04 Mb/s
         m8-Mark [dewiki],         1 hits,   11.4691, 148.30 Mb/s
       m8-Symbol [dewiki],        36 hits,   11.4435, 148.63 Mb/s
       m8-Number [dewiki],      9461 hits,   11.2448, 151.26 Mb/s
      trie-Alpha [dewiki],   1435456 hits,   6.48324, 262.35 Mb/s
  new-trie-alpha [dewiki],   1435456 hits,   6.70656, 253.61 Mb/s
     decode-only [dewiki],   1680050 hits,   2.97852, 571.04 Mb/s
            noop [dewiki],   1607454 hits,   0.90392, 1881.66 Mb/s
====================
        m8-Alpha [arwiki],    273256 hits,   3.25012, 192.24 Mb/s
         m8-Mark [arwiki],       496 hits,   3.34056, 187.04 Mb/s
       m8-Symbol [arwiki],       152 hits,   3.32096, 188.14 Mb/s
       m8-Number [arwiki],      7216 hits,   3.27048, 191.04 Mb/s
      trie-Alpha [arwiki],    273256 hits,   2.21164, 282.51 Mb/s
  new-trie-alpha [arwiki],    273256 hits,   2.29456, 272.30 Mb/s
     decode-only [arwiki],    342832 hits,   1.60236, 389.93 Mb/s
            noop [arwiki],    599328 hits,   0.34816, 1794.60 Mb/s

@DmitryOlshansky
Member Author

With this std.uni hacked into LDC 0.13-alpha:

====================
        m8-Alpha [enwiki],   1152212 hits,   4.37764, 409.67 Mb/s
         m8-Mark [enwiki],        67 hits,   4.00568, 447.71 Mb/s
       m8-Symbol [enwiki],      1196 hits,   4.07752, 439.82 Mb/s
       m8-Number [enwiki],    289430 hits,    3.9842, 450.12 Mb/s
      trie-Alpha [enwiki],   1152212 hits,   6.74988, 265.69 Mb/s
  new-trie-alpha [enwiki],   1152212 hits,   6.97196, 257.23 Mb/s
     decode-only [enwiki],   1786517 hits,   2.95336, 607.23 Mb/s
            noop [enwiki],   1699856 hits,   0.98224, 1825.80 Mb/s
====================
        m8-Alpha [ruwiki],    414460 hits,     2.098, 434.18 Mb/s
         m8-Mark [ruwiki],         1 hits,   2.08212, 437.49 Mb/s
       m8-Symbol [ruwiki],        18 hits,    2.0894, 435.96 Mb/s
       m8-Number [ruwiki],      2403 hits,   2.07996, 437.94 Mb/s
      trie-Alpha [ruwiki],    414460 hits,   3.27576, 278.07 Mb/s
  new-trie-alpha [ruwiki],    414460 hits,   3.21828, 283.04 Mb/s
     decode-only [ruwiki],    496887 hits,    2.3026, 395.60 Mb/s
            noop [ruwiki],    884374 hits,   0.50044, 1820.20 Mb/s
====================
        m8-Alpha [dewiki],   1435456 hits,   4.19328, 405.62 Mb/s
         m8-Mark [dewiki],         1 hits,   3.93076, 432.71 Mb/s
       m8-Symbol [dewiki],        36 hits,   3.84788, 442.03 Mb/s
       m8-Number [dewiki],      9461 hits,   3.82956, 444.14 Mb/s
      trie-Alpha [dewiki],   1435456 hits,   6.60008, 257.70 Mb/s
  new-trie-alpha [dewiki],   1435456 hits,    6.7368, 252.47 Mb/s
     decode-only [dewiki],   1680050 hits,    2.9934, 568.21 Mb/s
            noop [dewiki],   1607454 hits,   0.89596, 1898.37 Mb/s
====================
        m8-Alpha [arwiki],    273256 hits,   1.49356, 418.33 Mb/s
         m8-Mark [arwiki],       496 hits,   1.43812, 434.46 Mb/s
       m8-Symbol [arwiki],       152 hits,   1.44284, 433.04 Mb/s
       m8-Number [arwiki],      7216 hits,   1.46328, 426.99 Mb/s
      trie-Alpha [arwiki],    273256 hits,   2.26984, 275.27 Mb/s
  new-trie-alpha [arwiki],    273256 hits,   2.21084, 282.61 Mb/s
     decode-only [arwiki],    342832 hits,   1.54952, 403.23 Mb/s
            noop [arwiki],    599328 hits,   0.32548, 1919.65 Mb/s

It's a step zero to get decode-less std.regex.
UTF matchers are efficient functors around a set of
specific tries. Enables processing Unicode characters
without decoding at speeds on par with decoding itself.

Along the way, make staticIota 'package'-protected and reuse it.

Fix a shameful typo in setSearcher.
Granularity is horribly high. Auto-inference for templates has the
downside that it leaves no explanations or reasons for failure.
Overlong sequences, wrong continuation for UTF-8.
Lone high surrogate for UTF-16.
Spelling, style etc.
Drop public for documented unittests
@MartinNowak
Member

How to run that benchmark?

@DmitryOlshansky
Member Author

@MartinNowak
Member

It's really a nuisance that we're developing performance sensitive code with such a dull backend.
I recall the Sentinel InputRange.
Currently trying the ldc build, the generated code looks much better.

The benchmark only runs matcher.skip. Wouldn't a realistic use mix calls to skip, match and test?

@MartinNowak
Member

I'm running out of ideas to speed up the single function variant, so I'm OK with the skip, match and test API.

@DmitryOlshansky
Member Author

@MartinNowak regarding skip, test and match.

I have a vision that most code will do one of:
a) Use only skip -- for instance, this is what splitters do: they skip code points until the result toggles from true to false (or vice versa) and split at the prior position. Only a single matcher is ever used here.
b) Use only match -- string tokenization & pattern matching usually do this, trying multiple sets of code points in if-else round-robin fashion.
c) test was added for symmetry. The main case I have in mind is ThompsonMatcher in std.regex, which executes multiple 'threads' (=matchers) on a single code point. Even there it's a transitional step; matchers enable a better algorithm that Fawzi and I devised back in 2011.

Another possible use case for test is to reduce the number of states to try based on a lookahead of 1 code point in the current backtracking engine (i.e. do a quick test: if it can't possibly match, then do not save this state on the stack).
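
To make cases a) and b) concrete, here is a rough sketch, assuming the matcher semantics of this pull: skip always advances by one code point, match advances only on success, test never advances. The helpers `countWords` and `eatNumbers` are hypothetical names, not part of std.uni.

```d
import std.uni;

// Case a), hypothetical helper: splitter-style counting of alphabetic runs.
// Uses only skip and watches the result toggle between true and false.
size_t countWords(string text)
{
    auto alpha = utfMatcher!char(unicode.Alphabetic);
    size_t words = 0;
    bool inWord = false;
    while (text.length != 0)
    {
        // skip consumes one code point whether or not it is in the set
        immutable isAlpha = alpha.skip(text);
        if (isAlpha && !inWord)
            ++words; // false -> true toggle: a new run starts
        inWord = isAlpha;
    }
    return words;
}

// Case b), hypothetical helper: tokenizer-style consumption with match.
// Eats leading Number code points and leaves the rest of s untouched.
size_t eatNumbers(ref string s)
{
    auto num = utfMatcher!char(unicode.Number);
    size_t n = 0;
    while (s.length != 0 && num.match(s)) // advances only when it matches
        ++n;
    return n;
}
```

For example, `countWords("hello, мир 42")` yields 2, and calling `eatNumbers` on a variable holding `"42abc"` returns 2 and leaves `"abc"` in it. Both helpers check for empty input before calling the matcher, since the sketch makes no assumption about how an empty range is handled.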

@MartinNowak
Member

So are we ready to merge this?

@DmitryOlshansky
Member Author

@MartinNowak I think it should be good to go, the only problem left is that it's useless on DMD.

@MartinNowak
Member

Auto-merge toggled on

MartinNowak added a commit that referenced this pull request Apr 23, 2014
@MartinNowak MartinNowak merged commit 3e06a3b into dlang:master Apr 23, 2014