Contributor

@felipecrv felipecrv commented Jul 17, 2024

Rationale for this change

We need casts between string (binary) and string-view (binary-view) types since they are semantically equivalent.

What changes are included in this PR?

  • Add is_binary_view_like() type predicate
  • Add BinaryViewTypes() list including STRING_VIEW/BINARY_VIEW
  • New cast kernels
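
For readers unfamiliar with the view layout, here is a minimal C++ sketch of what a string to string-view cast has to produce. It assumes the 16-byte view layout from the Arrow columnar format: a 4-byte size, then either up to 12 inline bytes, or a 4-byte prefix followed by a buffer index and an offset. The names `RawView`, `MakeView`, and `OffsetsToViews` are hypothetical, this is not the kernel code added by this PR, and the sketch assumes a little-endian host.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for one 16-byte view slot.
using RawView = std::array<uint8_t, 16>;
constexpr int32_t kInlineSize = 12;

// Build one view for `value`, which lives at `offset` in data buffer `buffer_index`.
RawView MakeView(std::string_view value, int32_t buffer_index, int32_t offset) {
  RawView view{};  // zero-initialized, so unused inline bytes stay 0
  const int32_t size = static_cast<int32_t>(value.size());
  std::memcpy(view.data(), &size, 4);
  if (size == 0) return view;
  if (size <= kInlineSize) {
    // Short value: the bytes are stored inside the view itself.
    std::memcpy(view.data() + 4, value.data(), value.size());
  } else {
    // Long value: 4-byte prefix, then a reference into the data buffer.
    std::memcpy(view.data() + 4, value.data(), 4);
    std::memcpy(view.data() + 8, &buffer_index, 4);
    std::memcpy(view.data() + 12, &offset, 4);
  }
  return view;
}

// Turn an offsets-plus-data string column into views over the same data.
std::vector<RawView> OffsetsToViews(const std::vector<int32_t>& offsets,
                                    const std::string& data) {
  assert(!offsets.empty());
  std::vector<RawView> views;
  views.reserve(offsets.size() - 1);
  const std::string_view all(data);
  for (size_t i = 0; i + 1 < offsets.size(); ++i) {
    const int32_t begin = offsets[i];
    const int32_t length = offsets[i + 1] - begin;
    views.push_back(MakeView(all.substr(begin, length), /*buffer_index=*/0, begin));
  }
  return views;
}
```

In this sketch, values longer than 12 bytes only reference the existing data buffer, so none of their bytes beyond the 4-byte prefix need to be copied.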

Are these changes tested?

Yes, but test coverage could be improved.

Are there any user-facing changes?

More casts are available.

@felipecrv felipecrv changed the title from "GH-43010: [C++] Support casting to and from utf8_view/binary_view" to "GH-42247: [C++] Support casting to and from utf8_view/binary_view" Jul 17, 2024
@felipecrv felipecrv marked this pull request as ready for review July 21, 2024 17:48
@felipecrv felipecrv requested a review from bkietz July 21, 2024 18:47
Member

Is this a TODO for this PR? Otherwise, perhaps create a GH issue for it.

Contributor Author

Fixing it now.

Contributor Author

@mapleFU this one needs the same fix I added at line 477 (`// Check against offset overflow`). I forgot that there were two places with this TODO.
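
As an illustration of the kind of check being discussed, here is a minimal C++ sketch of guarding against offset overflow when view sizes are accumulated into 32-bit offsets. The function name `BuildOffsets` and its signature are hypothetical; this is not the code at the line referenced above, only one way such a check could look.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Accumulate value sizes into int32 offsets.
// Returns std::nullopt if the running total no longer fits in an int32 offset.
std::optional<std::vector<int32_t>> BuildOffsets(const std::vector<int32_t>& sizes) {
  std::vector<int32_t> offsets;
  offsets.reserve(sizes.size() + 1);
  offsets.push_back(0);
  int64_t total = 0;  // 64-bit accumulator, so the sum itself cannot wrap
  for (const int32_t size : sizes) {
    assert(size >= 0);
    total += size;
    // Check against offset overflow before narrowing to 32 bits.
    if (total > std::numeric_limits<int32_t>::max()) {
      return std::nullopt;
    }
    offsets.push_back(static_cast<int32_t>(total));
  }
  return offsets;
}
```

Summing in a wider type and comparing before the narrowing cast avoids relying on signed overflow, which is undefined behavior in C++.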

Member

@mapleFU mapleFU left a comment

Great work! The memset to 0 really handles some tricky problems in the protocol layer. Thanks for your effort!

@github-actions github-actions bot added the awaiting changes label and removed the awaiting committer review label Jul 23, 2024
Member

@bkietz bkietz left a comment

Looking good, just a few nits

@github-actions github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels Aug 5, 2024
@felipecrv felipecrv requested a review from pitrou August 6, 2024 00:10
@github-actions github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels Aug 6, 2024
Member

(Unrelated to this PR: this reminds me of the UTF-8 checking in arrow-rs. Maybe we can use the same algorithm? apache/arrow-rs#6009)

Member

That sounds like a good [Parquet] issue to open

Member

Oh, I think I see what you mean: we could similarly assemble larger contiguous byte ranges on which we run a single Utf8 validation pass.

For the common case of views whose out-of-line data directly follows the previous out-of-line bytes, this would yield one long byte range for Utf8 validation.

Inline views could also be validated in bulk: the size field of an inline view consists of one small byte and 3 zero bytes, followed by the inline data and zero padding bytes, so the non-data bytes are always valid Utf8 and we could validate on runs of inline views too.
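
The range-assembly idea in this comment can be sketched in C++ as follows. This is an illustration only, not code from this PR or from arrow-rs: the `Range` struct and the `CoalesceRanges` and `IsContinuationByte` helpers are hypothetical names. One caveat worth stating: a merged range passing validation does not by itself prove that every individual value is valid, because a multi-byte character could straddle two adjacent values. A check that each value starts on a non-continuation byte would be needed in addition.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of the out-of-line bytes referenced by one view.
struct Range {
  int32_t buffer_index;
  int32_t offset;
  int64_t size;
};

// Merge ranges that directly follow the previous range in the same buffer,
// so a validator can run once per merged range instead of once per value.
std::vector<Range> CoalesceRanges(const std::vector<Range>& refs) {
  std::vector<Range> merged;
  for (const Range& r : refs) {
    assert(r.size >= 0);
    if (!merged.empty() && merged.back().buffer_index == r.buffer_index &&
        merged.back().offset + merged.back().size == r.offset) {
      merged.back().size += r.size;  // extend the current contiguous run
    } else {
      merged.push_back(r);  // start a new run
    }
  }
  return merged;
}

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx). A value whose
// first byte is a continuation byte cannot be valid UTF-8 on its own.
bool IsContinuationByte(uint8_t b) { return (b & 0xC0) == 0x80; }
```

When most views point at consecutive bytes, the number of validation calls drops from one per value to one per run.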

Member

pitrou commented Sep 12, 2024

@github-actions crossbow submit -g cpp

@github-actions

Revision: 2416d19

Submitted crossbow builds: ursacomputing/crossbow @ actions-aaf4a45dfc

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

Member

pitrou commented Sep 12, 2024

CI failures are unrelated, I'll merge

@pitrou pitrou merged commit 85fc3eb into apache:main Sep 12, 2024
@pitrou pitrou removed the awaiting change review label Sep 12, 2024
@felipecrv felipecrv deleted the str2str_casts branch September 12, 2024 18:24
Member

mapleFU commented Sep 13, 2024

Thanks!

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 85fc3eb.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 97 possible false positives for unstable benchmarks that are known to sometimes produce them.

khwilson pushed a commit to khwilson/arrow that referenced this pull request Sep 14, 2024
…ew (apache#43302)

* GitHub Issue: apache#42247

Lead-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Co-authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
mapleFU added a commit that referenced this pull request Mar 27, 2025
…tring and binary types (#44822)

### Rationale for this change
Use `CopyBitmap` to optimize the performance of casting from string-view to offset string.

### What changes are included in this PR?
Originally, the bitmap was created by appending one bit at a time, which is slow. Since casting should not change the values in the bitmap, this change uses `CopyBitmap` to create the entire bitmap at once.

Then, to create the offsets and data buffers, I use `TypedBufferBuilder` as suggested in the original comment #43302 (comment).
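
To illustrate why a bulk copy beats appending one bit at a time, here is a minimal C++ sketch of copying a validity bitmap that starts at an arbitrary bit offset. It is a simplified stand-in and not Arrow's `CopyBitmap` implementation; the function name `CopyBits` is hypothetical, and bits are assumed to be numbered least-significant first within each byte, as in the Arrow format.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Copy `length` bits from `src`, starting at bit `offset`, into a new bitmap
// whose first bit is at position 0. Works a byte at a time, not a bit at a time.
std::vector<uint8_t> CopyBits(const uint8_t* src, int64_t offset, int64_t length) {
  assert(offset >= 0 && length >= 0);
  std::vector<uint8_t> out(static_cast<size_t>((length + 7) / 8), 0);
  if (length == 0) return out;
  const uint8_t* p = src + offset / 8;
  const int shift = static_cast<int>(offset % 8);
  if (shift == 0) {
    // Byte-aligned source: a plain memcpy is enough.
    std::memcpy(out.data(), p, out.size());
  } else {
    // Unaligned source: each output byte combines two adjacent input bytes.
    const int64_t src_bytes = (shift + length + 7) / 8;
    for (int64_t i = 0; i < static_cast<int64_t>(out.size()); ++i) {
      const uint8_t low = static_cast<uint8_t>(p[i] >> shift);
      const uint8_t high =
          (i + 1 < src_bytes) ? static_cast<uint8_t>(p[i + 1] << (8 - shift)) : 0;
      out[static_cast<size_t>(i)] = static_cast<uint8_t>(low | high);
    }
  }
  // Clear any bits past `length` in the last byte.
  const int trailing = static_cast<int>(length % 8);
  if (trailing != 0) {
    out.back() &= static_cast<uint8_t>((1u << trailing) - 1);
  }
  return out;
}
```

The aligned path touches each byte once with `memcpy`, and even the unaligned path handles 8 bits per loop iteration, which is where the speedup over a per-bit append comes from.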

### Are these changes tested?

The original unit tests have passed.

### Are there any user-facing changes?
No, the casting behavior should remain unchanged.

closes [ #43573 ](#43573)
* GitHub Issue: #43573

Lead-authored-by: Crystal Zhou <crystal.zhouxiaoyue@hotmail.com>
Co-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Crystal Zhou <45134936+CrystalZhou0529@users.noreply.github.com>
Co-authored-by: Crystal <45134936+CrystalZhou0529@users.noreply.github.com>
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Apr 15, 2025
…fset string and binary types (apache#44822)
QuietCraftsmanship pushed a commit to QuietCraftsmanship/arrow that referenced this pull request Jul 7, 2025
…tring and binary types (#44822)