-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-42247: [C++] Support casting to and from utf8_view/binary_view #43302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a TODO for this PR? Otherwise, perhaps create a GH issue for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixing it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mapleFU this one needs to be fixed with the same fix I added in line 477 // Check against offset overflow. I forgot that there were two places with this TODO.
mapleFU
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! memset to 0 really handle some tricky problem in protocol layer, thanks for your effort!
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, just a few nits
7d5a377 to
7d1f438
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Unrelated to this pr: What reminds me is the utf8 checking in arrow-rs, maybe we can use same algorithm? apache/arrow-rs#6009 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds like a good [Parquet] issue to open
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I think I see what you mean: we could similarly assemble larger contiguous byte ranges on which we run a single Utf8 validation pass.
For the common case of views whose out-of-line data directly follows the previous out-of-line bytes, this would yield one long byte range for Utf8 validation.
Inline strings would also always be valid Utf8 since their size would consist of 3 zero bytes and one small byte plus the inline data and padding zero bytes, so we could validate on runs of inline views too.
02d7467 to
2416d19
Compare
|
@github-actions crossbow submit -g cpp |
|
Revision: 2416d19 Submitted crossbow builds: ursacomputing/crossbow @ actions-aaf4a45dfc |
|
CI failures are unrelated, I'll merge |
|
Thanks! |
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 85fc3eb. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 97 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…ew (apache#43302) ### Rationale for this change We need casts between string (binary) and string-view (binary-view) types since they are semantically equivalent. ### What changes are included in this PR? - Add `is_binary_view_like()` type predicate - Add `BinaryViewTypes()` list including `STRING_VIEW/BINARY_VIEW` - New cast kernels ### Are these changes tested? Yes, but test coverage might be improved. ### Are there any user-facing changes? More casts are available. * GitHub Issue: apache#42247 Lead-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com> Co-authored-by: mwish <maplewish117@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
…tring and binary types (#44822) ### Rationale for this change Use `CopyBitmap` to optimize performance in string casting from string-view to offset string. ### What changes are included in this PR? Originally, the way we create the bitmap is by appending one bit at a time, which is slow. Since casting should not change the values in bitmap, this feature takes advantage of `CopyBitmap` to create the entire bitmap at once. Then, to create offsets and buffer array, I use `TypedBufferBuilder` as suggested in the original comment #43302 (comment). ### Are these changes tested? The original unit tests have passed. ### Are there any user-facing changes? No, the casting behavior should remain unchanged. closes [ #43573 ](#43573) * GitHub Issue: #43573 Lead-authored-by: Crystal Zhou <crystal.zhouxiaoyue@hotmail.com> Co-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Crystal Zhou <45134936+CrystalZhou0529@users.noreply.github.com> Co-authored-by: Crystal <45134936+CrystalZhou0529@users.noreply.github.com> Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com> Signed-off-by: mwish <maplewish117@gmail.com>
…fset string and binary types (apache#44822) ### Rationale for this change Use `CopyBitmap` to optimize performance in string casting from string-view to offset string. ### What changes are included in this PR? Originally, the way we create the bitmap is by appending one bit at a time, which is slow. Since casting should not change the values in bitmap, this feature takes advantage of `CopyBitmap` to create the entire bitmap at once. Then, to create offsets and buffer array, I use `TypedBufferBuilder` as suggested in the original comment apache#43302 (comment). ### Are these changes tested? The original unit tests have passed. ### Are there any user-facing changes? No, the casting behavior should remain unchanged. closes [ apache#43573 ](apache#43573) * GitHub Issue: apache#43573 Lead-authored-by: Crystal Zhou <crystal.zhouxiaoyue@hotmail.com> Co-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Crystal Zhou <45134936+CrystalZhou0529@users.noreply.github.com> Co-authored-by: Crystal <45134936+CrystalZhou0529@users.noreply.github.com> Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com> Signed-off-by: mwish <maplewish117@gmail.com>
…tring and binary types (#44822) ### Rationale for this change Use `CopyBitmap` to optimize performance in string casting from string-view to offset string. ### What changes are included in this PR? Originally, the way we create the bitmap is by appending one bit at a time, which is slow. Since casting should not change the values in bitmap, this feature takes advantage of `CopyBitmap` to create the entire bitmap at once. Then, to create offsets and buffer array, I use `TypedBufferBuilder` as suggested in the original comment apache/arrow#43302 (comment). ### Are these changes tested? The original unit tests have passed. ### Are there any user-facing changes? No, the casting behavior should remain unchanged. closes [ #43573 ](apache/arrow#43573) * GitHub Issue: #43573 Lead-authored-by: Crystal Zhou <crystal.zhouxiaoyue@hotmail.com> Co-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Crystal Zhou <45134936+CrystalZhou0529@users.noreply.github.com> Co-authored-by: Crystal <45134936+CrystalZhou0529@users.noreply.github.com> Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com> Signed-off-by: mwish <maplewish117@gmail.com>
Rationale for this change
We need casts between string (binary) and string-view (binary-view) types since they are semantically equivalent.
What changes are included in this PR?
is_binary_view_like()type predicateBinaryViewTypes()list includingSTRING_VIEW/BINARY_VIEWAre these changes tested?
Yes, but test coverage might be improved.
Are there any user-facing changes?
More casts are available.