improve text encoder encode performance #5448
Open
anonrig wants to merge 29 commits into main from yagiz/experiment-value-view
Commits (29):
cc39cf3 experiment with value view and simdutf (anonrig)
d93fcf7 address pr reviews (anonrig)
28b102d address pr reviews (anonrig)
d6691fe get rid of multiple valueviews (anonrig)
572ffa3 apply optimization to improve invalid utf16 (anonrig)
6ce652b add missing simdutf dependency (anonrig)
d980b42 apply review recommendations (anonrig)
c439565 optimize encodeInto (anonrig)
abef75c optimize ASCII paths (anonrig)
de0de38 add fast path that avoids length calculation (anonrig)
06b0349 make the code reviewable (anonrig)
9e89282 address pr reviews (anonrig)
2bfb85a more optimizations (anonrig)
022e1a2 make the code reviewable (anonrig)
6982921 use simdutf trim_partial_utf16 (anonrig)
525cbac avoid repetitive simdutf_length calls (anonrig)
538ed74 get rid of string flattening (anonrig)
575373e add more comments (anonrig)
621e3ce simplify things (anonrig)
4558093 address pr reviews (anonrig)
95fe642 simplify implementation (anonrig)
411a055 An attempt to simplify the encodeInto change. (#5565) (erikcorry)
62fb056 Add some tests of encodeinto for short output buffers. (#5570) (erikcorry)
6956cb5 put changes behind an autogate (anonrig)
166e9fd Attempt to eliminate last regression (#5579) (erikcorry)
d462ca1 make changes due to simdutf (anonrig)
e4e393d leverage simdutf more (#5797) (anonrig)
a03390f fix build warning (anonrig)
1a2eab7 address pr reviews (anonrig)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,90 @@ | ||
| // Copyright (c) 2025 Cloudflare, Inc. | ||
| // Licensed under the Apache 2.0 license found in the LICENSE file or at: | ||
| // https://opensource.org/licenses/Apache-2.0 | ||
|
|
||
| #include "encoding.h" | ||
|
|
||
| #include <kj/test.h> | ||
|
|
||
| namespace workerd::api { | ||
| namespace test { | ||
|
|
||
| // These tests verify the bestFit() function used by TextEncoder.encodeInto(). | ||
| // | ||
| // bestFit(input, bufferSize) returns the number of input code units that can be | ||
| // fully converted to UTF-8 and fit within the given output buffer size in bytes. | ||
| // | ||
| // The key insight is that different characters expand to different UTF-8 byte lengths: | ||
| // - ASCII (U+0000-U+007F): 1 byte per code unit | ||
| // - U+0080-U+07FF (includes Latin-1, e.g. å): 2 bytes per code unit | ||
| // - U+0800-U+FFFF (rest of the BMP, e.g. €): 3 bytes per code unit | ||
| // - Supplementary characters (U+10000+): 4 bytes, encoded as surrogate pairs in UTF-16 | ||
| // | ||
| // The function must never split a surrogate pair, so if there's only room for part of | ||
| // a multi-byte character, it stops before that character. | ||
| KJ_TEST("BestFitASCII") { | ||
| // If there's zero input or output space, the answer is zero. | ||
| KJ_ASSERT(bestFit("", 0) == 0); | ||
| KJ_ASSERT(bestFit("a", 0) == 0); | ||
| KJ_ASSERT(bestFit("aa", 0) == 0); | ||
| KJ_ASSERT(bestFit("aaa", 0) == 0); | ||
| KJ_ASSERT(bestFit("aaaa", 0) == 0); | ||
| KJ_ASSERT(bestFit("aaaaa", 0) == 0); | ||
| KJ_ASSERT(bestFit("", 0) == 0); | ||
| KJ_ASSERT(bestFit("", 1) == 0); | ||
| KJ_ASSERT(bestFit("", 2) == 0); | ||
| KJ_ASSERT(bestFit("", 3) == 0); | ||
| KJ_ASSERT(bestFit("", 4) == 0); | ||
| KJ_ASSERT(bestFit("", 5) == 0); | ||
| // Zero cases with two-byte strings. | ||
| KJ_ASSERT(bestFit(u"", 0) == 0); | ||
| KJ_ASSERT(bestFit(u"€", 0) == 0); | ||
| KJ_ASSERT(bestFit(u"€€", 0) == 0); | ||
| KJ_ASSERT(bestFit(u"€€€", 0) == 0); | ||
| KJ_ASSERT(bestFit(u"€€€€", 0) == 0); | ||
| KJ_ASSERT(bestFit(u"€€€€€", 0) == 0); | ||
| KJ_ASSERT(bestFit(u"", 0) == 0); | ||
| KJ_ASSERT(bestFit(u"", 1) == 0); | ||
| KJ_ASSERT(bestFit(u"", 2) == 0); | ||
| KJ_ASSERT(bestFit(u"", 3) == 0); | ||
| KJ_ASSERT(bestFit(u"", 4) == 0); | ||
| KJ_ASSERT(bestFit(u"", 5) == 0); | ||
| // Small buffers that only just fit. | ||
| KJ_ASSERT(bestFit(u"a", 1) == 1); | ||
| KJ_ASSERT(bestFit(u"å", 2) == 1); | ||
| KJ_ASSERT(bestFit(u"€", 3) == 1); | ||
| KJ_ASSERT(bestFit(u"😹", 4) == 2); | ||
| // Small buffers that don't fit. | ||
| KJ_ASSERT(bestFit(u"å", 1) == 0); | ||
| KJ_ASSERT(bestFit(u"€", 2) == 0); | ||
| KJ_ASSERT(bestFit(u"😹", 3) == 0); | ||
| // Don't chop a surrogate pair. | ||
| KJ_ASSERT(bestFit(u"1😹", 4) == 1); | ||
| KJ_ASSERT(bestFit(u"12😹", 5) == 2); | ||
| KJ_ASSERT(bestFit(u"123😹", 6) == 3); | ||
| KJ_ASSERT(bestFit(u"1234😹", 7) == 4); | ||
| KJ_ASSERT(bestFit(u"12345😹", 8) == 5); | ||
| // Some bigger ones just for fun. | ||
| KJ_ASSERT(bestFit(u"😹😹😹😹😹😹", 0) == 0); | ||
| KJ_ASSERT(bestFit(u"😹😹😹😹😹😹", 1) == 0); | ||
| KJ_ASSERT(bestFit(u"😹😹😹😹😹😹", 2) == 0); | ||
| KJ_ASSERT(bestFit(u"😹😹😹😹😹😹", 3) == 0); | ||
| KJ_ASSERT(bestFit(u"😹😹😹😹😹😹", 4) == 2); | ||
| KJ_ASSERT(bestFit(u"😹😹😹😹😹😹", 5) == 2); | ||
| KJ_ASSERT(bestFit(u"😹😹😹😹😹😹", 6) == 2); | ||
| KJ_ASSERT(bestFit(u"😹😹😹😹😹😹", 7) == 2); | ||
| KJ_ASSERT(bestFit(u"😹😹😹😹😹😹", 8) == 4); | ||
| KJ_ASSERT(bestFit(u"😹😹😹😹😹😹", 9) == 4); | ||
| KJ_ASSERT(bestFit(u"0😹😹😹😹😹😹", 9) == 5); // 0😹😹 is 5 and takes 9. | ||
| KJ_ASSERT(bestFit(u"01😹😹😹😹😹😹", 9) == 4); // 01😹 is 4 and takes 6. | ||
| KJ_ASSERT(bestFit(u"012😹😹😹😹😹😹", 9) == 5); // 012😹 is 5 and takes 7. | ||
| KJ_ASSERT(bestFit(u"0123😹😹😹😹😹😹", 9) == 6); // 0123😹 is 6 and takes 8. | ||
| KJ_ASSERT(bestFit(u"01234😹😹😹😹😹😹", 9) == 7); // 01234😹 is 7 and takes 9. | ||
| KJ_ASSERT(bestFit(u"012345😹😹😹😹😹😹", 9) == 6); // 012345 is 6 and takes 6. | ||
| KJ_ASSERT(bestFit(u"0123456😹😹😹😹😹😹", 9) == 7); // 0123456 is 7 and takes 7. | ||
| KJ_ASSERT(bestFit(u"01234567😹😹😹😹😹😹", 9) == 8); // 01234567 is 8 and takes 8. | ||
| KJ_ASSERT(bestFit(u"012345678😹😹😹😹😹😹", 9) == 9); // 012345678 is 9 and takes 9. | ||
| } | ||
|
|
||
| } // namespace test | ||
| } // namespace workerd::api | ||
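For readers who want to check the expected values in these tests by hand, the contract described in the file comment can be sketched as a naive scalar loop. This is not the implementation in this PR (which relies on simdutf); `bestFitReference` is a hypothetical name, and treating a lone surrogate as a 3-byte U+FFFD replacement is an assumption, not something the tests above cover. The emoji is written as `\U0001F639` so the snippet does not depend on source-file encoding.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical reference sketch, NOT the code from this PR.
// Returns how many UTF-16 code units of `input` can be converted to UTF-8
// within `bufferSize` bytes, never splitting a surrogate pair.
constexpr std::size_t bestFitReference(std::u16string_view input, std::size_t bufferSize) {
  std::size_t bytes = 0;  // UTF-8 bytes accounted for so far
  std::size_t i = 0;      // UTF-16 code units accepted so far
  while (i < input.size()) {
    char16_t c = input[i];
    std::size_t units = 1;  // code units consumed by this character
    std::size_t need = 3;   // default: U+0800-U+FFFF, or a lone surrogate
                            // (assumed to be replaced by U+FFFD, 3 bytes)
    if (c < 0x80) {
      need = 1;  // ASCII
    } else if (c < 0x800) {
      need = 2;  // U+0080-U+07FF
    } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < input.size() &&
               input[i + 1] >= 0xDC00 && input[i + 1] <= 0xDFFF) {
      need = 4;   // valid surrogate pair: one supplementary character
      units = 2;  // accepted or rejected as a whole
    }
    if (bytes + need > bufferSize) break;  // character does not fit: stop before it
    bytes += need;
    i += units;
  }
  return i;
}

// A few of the cases from the test above, checked at compile time.
static_assert(bestFitReference(u"a", 1) == 1);
static_assert(bestFitReference(u"\u20AC", 2) == 0);       // € needs 3 bytes
static_assert(bestFitReference(u"\U0001F639", 4) == 2);   // one pair, 4 bytes
static_assert(bestFitReference(u"1\U0001F639", 4) == 1);  // pair is not split
```

A loop like this is linear and branchy, so it is only useful as an oracle for comparing against the optimized path, not as a replacement for it.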