@clipperhouse clipperhouse commented Jun 15, 2022

In this PR, as an experiment, I’ve swapped out the Unicode segmenter in Bleve for the one in the UAX29 package.

Benefits

  • ~2x throughput improvement
Before
BenchmarkTokenizeEnglishText-8   	   17751	     65341 ns/op	   49528 B/op	      11 allocs/op

After
BenchmarkTokenizeEnglishText-8   	   34965	     32749 ns/op	   49528 B/op	      11 allocs/op

(Run on my M2 MacBook.)
  • Updates Unicode 8 → Unicode 15

Testing & compatibility

  • Both segmenters pass the official Unicode test suites
  • Bleve tests pass
  • Added adaptors to the underlying UAX29 package to determine Bleve token types
  • Added fuzzing to the UAX29 package
  • Known differences:
    • The original segmenter splits runs of spaces into separate tokens; UAX29 concatenates a run of spaces into a single token. This should be irrelevant, since Bleve filters out whitespace in any case.
    • The original segmenter doesn’t handle emoji skin tone modifiers; the new one does. I also believe the original segmenter doesn’t handle Thai script, or at least numbers in that script.

@abhinavdangeti
Member

Thank you for submitting this, @clipperhouse. We'll review it soon.

@clipperhouse
Author

clipperhouse commented Jun 15, 2022

Thanks @abhinavdangeti. I did some more testing of differences between segmenters here: https://github.com/clipperhouse/segmenter-repro

@clipperhouse
Author

Updated benchmark, using new sample text (~110 KB in size, multilingual). Note the allocations: 610 allocs/op before, 2 after.

Previous
BenchmarkTokenizeEnglishText-4   	     285	   4167394 ns/op	  26.15 MB/s	     16042 tokens	 1348515 B/op	     610 allocs/op

New
BenchmarkTokenizeEnglishText-4   	     614	   1986500 ns/op	  54.86 MB/s	     16042 tokens	 1310730 B/op	       2 allocs/op

@clipperhouse
Author

@abhinavdangeti I think this is in good shape for your review. Happy to discuss.

@clipperhouse
Author

I see that the //go:build syntax isn’t compatible with older Go versions; fix incoming.
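
For context, `//go:build` lines are only understood by Go 1.17 and later; older toolchains treat them as ordinary comments and look for `// +build` lines instead. A common remedy is to keep both forms at the top of the file. The sketch below assumes a `go1.18` constraint purely for illustration; the actual file and constraint involved in this PR are not shown in this thread, and `understandsGoBuild` is a hypothetical helper.

```go
//go:build go1.18
// +build go1.18

// Both lines above express the same constraint. Go 1.17+ reads the first,
// older toolchains read the second. A blank line must separate them from
// the package clause.
package main

import (
	"fmt"
	"runtime"
)

// understandsGoBuild reports whether a Go 1.x toolchain with the given
// minor version parses //go:build lines (true from Go 1.17 onwards).
func understandsGoBuild(minor int) bool {
	return minor >= 17
}

func main() {
	fmt.Println("built with", runtime.Version())
	fmt.Println(understandsGoBuild(16), understandsGoBuild(17))
}
```

On Go 1.17 and later, `gofmt` keeps the two lines in sync, so carrying both is low maintenance while old Go versions remain supported.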

@clipperhouse
Author

@abhinavdangeti OK, try those workflows again? Thanks.

@clipperhouse clipperhouse force-pushed the unicode-segmenter-perf branch from 89f74ba to cf2065d on June 30, 2022 13:39
@clipperhouse
Author

I rebased a bit for a cleaner merge.

@clipperhouse clipperhouse force-pushed the unicode-segmenter-perf branch from cf2065d to 93cc997 on July 14, 2022 21:35
@clipperhouse
Author

(Rebased)

@clipperhouse
Author

@abhinavdangeti Friendly ping, let me know if you'd like to pursue this.

@clipperhouse clipperhouse force-pushed the unicode-segmenter-perf branch 2 times, most recently from a8db884 to c8e26eb on July 23, 2022 03:17
@clipperhouse clipperhouse force-pushed the unicode-segmenter-perf branch 2 times, most recently from d4de8e6 to 2782f9f on August 18, 2022 21:45
@clipperhouse clipperhouse changed the title from "Unicode segmenter perf experiment" to "Unicode segmenter performance" on Sep 10, 2022
@clipperhouse clipperhouse force-pushed the unicode-segmenter-perf branch from f944431 to e7dd25c on October 8, 2022 16:10
@clipperhouse clipperhouse force-pushed the unicode-segmenter-perf branch from e7dd25c to b6bca0a on October 16, 2022 20:34
@clipperhouse
Author

Hiya @abhinavdangeti, the above pushes are just rebases, no updates here in a while. Would you like to run checks?

@abhinavdangeti
Member

Thanks, @clipperhouse. Re-running the checks.

@clipperhouse
Author

Looks good, thanks. Ready for review at your convenience.

@clipperhouse clipperhouse force-pushed the unicode-segmenter-perf branch from b6bca0a to 9b41d64 on October 19, 2022 21:37
@clipperhouse clipperhouse force-pushed the unicode-segmenter-perf branch 2 times, most recently from 226cf98 to cccea71 on November 12, 2022 16:17
@clipperhouse clipperhouse force-pushed the unicode-segmenter-perf branch from cccea71 to 8fa8ed9 on May 26, 2023 19:18
@clipperhouse clipperhouse force-pushed the unicode-segmenter-perf branch from 8fa8ed9 to 4bfba33 on November 6, 2023 18:41
@clipperhouse clipperhouse force-pushed the unicode-segmenter-perf branch from 4bfba33 to 042b2d8 on August 12, 2024 14:27
@clipperhouse clipperhouse force-pushed the unicode-segmenter-perf branch from 042b2d8 to 7014345 on April 13, 2025 02:27
@clipperhouse clipperhouse force-pushed the unicode-segmenter-perf branch from 7014345 to 05ac589 on July 29, 2025 04:44
@clipperhouse clipperhouse force-pushed the unicode-segmenter-perf branch from 05ac589 to 97df007 on July 29, 2025 16:01
@clipperhouse clipperhouse force-pushed the unicode-segmenter-perf branch 2 times, most recently from 60ea1cf to 48bcf3b on September 14, 2025 19:01
Replacing blevesearch/segment. ~2x perf improvement.
@clipperhouse
Author

Hi @abhinavdangeti, I check on this every few months just to keep it fresh. If this is in your “someday” file, great. If not, feel free to close it.

Most recently, I removed my refactoring of the allocations, and just focused on a like-for-like swap of the segmenter. Simpler for review.
